Posted On: Apr 28, 2024
Amazon CloudWatch Container Insights with Enhanced Observability for EKS now auto-discovers critical health metrics from Amazon Web Services Trainium and Inferentia accelerators and Elastic Fabric Adapters (the Amazon Web Services high-performance network adapters), as well as NVIDIA GPUs. You can visualize these out-of-the-box metrics in curated dashboards to help you monitor accelerated infrastructure and optimize your AI workloads for operational excellence.
Using Enhanced Container Insights, you can now correlate compute and memory metrics with your internode network metrics to understand how traffic affects tasks running on your EKS clusters, for example when monitoring latency-sensitive training jobs. Enhanced Container Insights lets you monitor how efficiently your distributed deep learning and inference algorithms consume resources, so that you can optimize resource allocation and minimize long disruptions in your applications. It delivers accelerated compute observability with automatic visualizations and removes the need to create dashboards and set up alarms manually.
Getting started with accelerated compute observability is easy. You can onboard to Enhanced Container Insights either by installing the CloudWatch Observability add-on in your clusters or by manually installing the CloudWatch agent to enable enhanced observability. Once configured, you can navigate to the Container Insights console and view your accelerated compute telemetry out of the box.
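As an illustration of the add-on route, the following is a minimal Python sketch that uses the boto3 EKS client to request the add-on. It is not taken from the announcement and rests on several assumptions: the add-on is identified by the name amazon-cloudwatch-observability, the cluster name demo-cluster is hypothetical, the default region cn-north-1 is the region code for China (Beijing), and the cluster's nodes already have the IAM permissions the CloudWatch agent needs. The helper names build_addon_request and install_addon are illustrative only.

```python
from typing import Any, Dict

# Assumed EKS add-on name for the CloudWatch Observability add-on.
ADDON_NAME = "amazon-cloudwatch-observability"


def build_addon_request(cluster_name: str) -> Dict[str, Any]:
    """Build the keyword arguments for the EKS CreateAddon call."""
    if not cluster_name:
        raise ValueError("cluster_name must be a non-empty string")
    return {"clusterName": cluster_name, "addonName": ADDON_NAME}


def install_addon(cluster_name: str, region: str = "cn-north-1") -> str:
    """Request the add-on on an existing cluster and return its reported status.

    Requires boto3 and AWS credentials allowed to call eks:CreateAddon.
    """
    import boto3  # imported lazily so build_addon_request works without the SDK

    eks = boto3.client("eks", region_name=region)
    response = eks.create_addon(**build_addon_request(cluster_name))
    return response["addon"]["status"]


if __name__ == "__main__":
    # "demo-cluster" is a hypothetical cluster name used for illustration.
    print(build_addon_request("demo-cluster"))
```

Calling install_addon("demo-cluster") would submit the request against a real cluster. The request-building step is kept separate so that the parameters can be inspected without credentials or network access.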
Accelerated compute observability is now available in Enhanced Container Insights for EKS in the Amazon Web Services China (Beijing) Region, operated by Sinnet, and the Amazon Web Services China (Ningxia) Region, operated by NWCD. Accelerated compute metrics follow observation-based pricing; see the Container Insights pricing page for details. For further information, see the Container Insights user guide.