Posted On: Mar 29, 2021
Amazon SageMaker Debugger’s new capabilities are now available in the Amazon Web Services China (Beijing) Region, operated by Sinnet, and the Amazon Web Services China (Ningxia) Region, operated by NWCD. The new capabilities provide real-time monitoring of system resources so you can use them efficiently. You can also get automatic recommendations for re-allocating resources to your training jobs, helping you train more efficiently while reducing training time and costs.
Amazon SageMaker Debugger is a capability of Amazon SageMaker that makes it easy to train ML models faster by capturing real-time metrics such as gradients and weights. This provides transparency into the training process, so you can correct anomalies such as abnormal loss values, overfitting, and overtraining. SageMaker Debugger provides built-in analysis techniques, called rules, that make it easy to analyze emitted data, including the tensors that are critical to the success of a training job. For example, rules can help you identify why your ML model predicts a right traffic signal as left even though it reached over 90% accuracy during training.
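To illustrate the idea of a rule, the sketch below checks a recorded loss history and "fires" when the loss has stopped improving. It is a minimal, hypothetical example in plain Python, not SageMaker Debugger's built-in rule implementation; the function name `loss_not_decreasing` and its `window` and `min_rel_decrease` parameters are assumptions chosen for illustration.

```python
from typing import Sequence


def loss_not_decreasing(
    losses: Sequence[float], window: int = 5, min_rel_decrease: float = 0.01
) -> bool:
    """Hypothetical rule: return True (the rule fires) if the loss failed to
    drop by at least `min_rel_decrease` (a fraction) over the last `window` steps."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(losses) <= window:
        # Not enough history yet to evaluate the rule.
        return False
    previous = losses[-window - 1]  # loss value `window` steps ago
    latest = losses[-1]             # most recent loss value
    if previous == 0:
        # Cannot compute a relative change; fire only if loss did not go down.
        return latest >= previous
    relative_decrease = (previous - latest) / abs(previous)
    return relative_decrease < min_rel_decrease
```

A rule like this would be evaluated periodically as new loss values are emitted during training; a steadily falling loss such as `[1.0, 0.9, 0.8, 0.7, 0.6, 0.5]` does not trigger it, while a flat loss does.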
With new profiling capabilities, SageMaker Debugger now automatically monitors system resources such as CPU, GPU, network, I/O, and memory, providing a complete view of resource utilization for training jobs. You can also profile your entire training job, or portions of it, to emit detailed framework metrics during different phases of the training job. Framework metrics are metrics captured from within the training script, such as step duration, data loading, preprocessing, and operator execution time on CPUs and GPUs. SageMaker Debugger correlates system and framework metrics, which helps you identify possible root causes of issues such as GPU utilization dropping to zero, so you can inspect your training scripts and troubleshoot accordingly. You can reallocate resources based on recommendations from the profiling report to shorten training time and reduce costs. You can capture and monitor metrics and insights programmatically using the SageMaker Python SDK or visually through Amazon SageMaker Studio.
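The following sketch shows, under simplifying assumptions, what correlating system and framework metrics can look like: it counts how many low-GPU-utilization samples fall inside each framework phase. It is a hypothetical illustration in plain Python, not SageMaker Debugger's actual analysis; the `PhaseEvent` type, the `phases_during_gpu_idle` function, the timestamps in seconds, and the 5% idle threshold are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PhaseEvent:
    """Hypothetical framework-level event, e.g. 'data_loading' or 'forward'."""
    name: str
    start: float  # seconds
    end: float    # seconds (exclusive)


def phases_during_gpu_idle(
    gpu_samples: List[Tuple[float, float]],
    events: List[PhaseEvent],
    idle_threshold: float = 5.0,
) -> Dict[str, int]:
    """Count low-utilization GPU samples that fall inside each phase.

    gpu_samples: (timestamp_seconds, utilization_percent) pairs.
    """
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, utilization in gpu_samples:
        if utilization >= idle_threshold:
            continue  # GPU is busy; nothing to explain
        for event in events:
            if event.start <= timestamp < event.end:
                counts[event.name] += 1
    return dict(counts)


samples = [(0.0, 90.0), (0.5, 2.0), (1.0, 1.0), (1.5, 85.0), (2.0, 0.0)]
events = [
    PhaseEvent("data_loading", 0.4, 1.2),
    PhaseEvent("forward", 1.2, 1.9),
    PhaseEvent("data_loading", 1.9, 2.3),
]
print(phases_during_gpu_idle(samples, events))  # {'data_loading': 3}
```

In this toy trace every idle GPU sample overlaps data loading, which would point toward the input pipeline rather than model computation as the place to investigate.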
Amazon SageMaker Debugger is also generally available in all Amazon Web Services Regions in the Americas and Europe and in some Regions in Asia Pacific, with additional Regions coming soon. Read the documentation for more information. To learn how to use the new profiling functionality in SageMaker Debugger, visit the blog post.