Amazon SageMaker Debugger helps you debug machine learning models during training by detecting problems with the models in near real time.
Amazon SageMaker Debugger lets you go beyond scalars such as loss and accuracy and gives you full visibility into all tensors ‘flowing through the graph’ during training. It also monitors your training in near real time using rules and alerts you as soon as a rule detects an inconsistency in the training flow.
With these concepts in mind, let's look at the overall flow that Amazon SageMaker Debugger uses to orchestrate debugging.
The tensors captured by the debug hook are stored in the S3 location that you specify.
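Once the tensors are in S3, they can be read back for inspection. The sketch below is one possible way to do that, assuming the open-source smdebug library is installed and using its create_trial entry point. The helper names last_saved_values and inspect_s3_output are hypothetical and exist only for this illustration.

```python
def last_saved_values(trial, limit=5):
    """Map up to `limit` tensor names to the value saved at their last step.

    `trial` is any object exposing the smdebug Trial methods used here:
    tensor_names(), and tensor(name) with steps() and value(step).
    """
    values = {}
    for name in sorted(trial.tensor_names())[:limit]:
        tensor = trial.tensor(name)
        steps = tensor.steps()
        if steps:  # skip tensors that have no saved steps yet
            values[name] = tensor.value(steps[-1])
    return values


def inspect_s3_output(s3_path):
    """Open the Debugger output at `s3_path` and print a short summary."""
    # Imported lazily so the helper above works without smdebug installed.
    from smdebug.trials import create_trial

    trial = create_trial(s3_path)
    for name, value in last_saved_values(trial).items():
        # Saved values are typically NumPy arrays; print their shape if any.
        print(name, getattr(value, "shape", None))
```

To try it, call inspect_s3_output with the S3 path of your own Debugger output. In the SageMaker Python SDK, that path can be obtained from a trained estimator with latest_job_debugger_artifacts_path().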
As we are using the TensorFlow container provided by Amazon SageMaker, we don't need to make any changes to our training script for the tensors to be stored. Amazon SageMaker Debugger uses the configuration you provide through the Amazon SageMaker SDK's TensorFlow estimator when creating your job to save the tensors in the fashion you specify.
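As a rough illustration of that configuration, the sketch below builds a TensorFlow estimator with a Debugger hook and two built-in rules using the SageMaker Python SDK (v2). It is not the lab notebook's code. The entry point train.py, the instance type, the framework and Python versions, the debug-output prefix, the chosen collections and rules, and the helper names debugger_settings and build_estimator are all assumptions made for this example.

```python
def debugger_settings(bucket, prefix, save_interval=100):
    """Collect the Debugger options as plain values (no SDK needed)."""
    return {
        # Hypothetical layout: tensors go under <prefix>/debug-output.
        "s3_output_path": f"s3://{bucket}/{prefix}/debug-output",
        # The SDK examples pass hook parameter values as strings.
        "hook_parameters": {"save_interval": str(save_interval)},
        # Built-in collection names; adjust to what you want saved.
        "collections": ["losses", "gradients"],
    }


def build_estimator(role, settings, entry_point="train.py"):
    """Create a TensorFlow estimator with a Debugger hook and built-in rules."""
    # Imported lazily: requires the SageMaker Python SDK (v2).
    from sagemaker.debugger import (
        CollectionConfig,
        DebuggerHookConfig,
        Rule,
        rule_configs,
    )
    from sagemaker.tensorflow import TensorFlow

    hook_config = DebuggerHookConfig(
        s3_output_path=settings["s3_output_path"],
        hook_parameters=settings["hook_parameters"],
        collection_configs=[
            CollectionConfig(name=name) for name in settings["collections"]
        ],
    )
    # Rules are evaluated against the saved tensors while training runs.
    rules = [
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ]
    return TensorFlow(
        entry_point=entry_point,        # hypothetical script name
        role=role,
        instance_count=1,
        instance_type="ml.p3.2xlarge",  # assumed instance type
        framework_version="2.3.1",      # assumed version; check Debugger support
        py_version="py37",
        debugger_hook_config=hook_config,
        rules=rules,
    )
```

Calling build_estimator(role, debugger_settings("<your-bucket>", "<your-prefix>")) and then fit() on the result would start a training job with these settings. Keeping the plain settings separate from the SDK objects makes it easy to see what is being saved, and where, before anything is launched.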
To run this lab:
In the MLAI/Script-mode/keras_cifar10_debugger folder, double-click TF_keras_CIFAR10_debugger.ipynb to open it.