Detect Stalled Training Job And Actions
Detect Stalled Training and Invoke Actions Using SageMaker Debugger Rule
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
This notebook shows you how to use the StalledTrainingRule built-in rule. When the rule detects that your training job has been inactive for a specified period, it can stop the training job or send you an email or SMS notification. This helps you monitor the training job status and avoid paying for resources that are doing no useful work.
How the StalledTrainingRule Built-in Rule Works
Amazon SageMaker Debugger captures tensors that you want to watch from training jobs on AWS Deep Learning Containers or your local machine. If you use one of the Debugger-integrated Deep Learning Containers, you don't need to make any changes to your training script to use the built-in rules. For information about Debugger-supported SageMaker frameworks and versions, see Debugger-supported framework versions for zero script change.
If you want to run a training script that uses a framework Debugger only partially supports, or your own custom container, you need to manually register the Debugger hook in your training script. The smdebug library provides tools to help with hook registration, and the sample script provided in the src folder includes the hook registration code as commented-out lines. For more information about how to manually register the Debugger hook in this case, see the training script at ./src/simple_stalled_training.py, and the documentation at smdebug TensorFlow hook, smdebug PyTorch hook, smdebug MXNet hook, and smdebug XGBoost hook.
The Debugger StalledTrainingRule watches tensor updates from your training job. If the rule doesn't find any new tensors written to the default S3 URI within a threshold period of time, it triggers the StopTrainingJob API operation. The following code cells set up a SageMaker TensorFlow estimator with the Debugger StalledTrainingRule to watch the losses pre-built tensor collection.
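The inactivity check described above can be pictured with a small, self-contained sketch. This is a conceptual illustration only, not Debugger's actual implementation: the function name `is_stalled`, its parameters, and the choice to fall back to the job start time when nothing has been written yet are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable


def is_stalled(
    write_times: Iterable[datetime],
    job_start: datetime,
    now: datetime,
    threshold_seconds: int,
) -> bool:
    """Return True if no tensor was written during the last `threshold_seconds`.

    Hypothetical helper: `write_times` stands in for the timestamps at which
    new tensors appeared in the output location.
    """
    # If nothing has been written yet, measure inactivity from the job start
    # (an assumption of this sketch).
    last_activity = max(write_times, default=job_start)
    return (now - last_activity).total_seconds() > threshold_seconds


start = datetime(2024, 1, 1, 12, 0, 0)
writes = [start + timedelta(seconds=s) for s in (30, 60, 90)]

# 60 seconds after the last write: still within a 120-second threshold.
print(is_stalled(writes, start, start + timedelta(seconds=150), 120))
# 150 seconds after the last write: the threshold has been exceeded.
print(is_stalled(writes, start, start + timedelta(seconds=240), 120))
```

The 120-second threshold here is an illustrative value; the real rule takes its threshold from the rule configuration.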
Install custom packages
These packages were built manually with the changes needed to run rules with actions, because those changes have not been released yet. Remember to restart the kernel after installing these packages.
Import SageMaker Python SDK
Import SageMaker Debugger classes for rule configuration
Create the actions to be used in the rules
The following code cells include:
- a code line to create the action objects
- a stalled training job rule configuration object that uses these actions
- a SageMaker TensorFlow estimator configuration with the Debugger rules parameter to run the built-in rule
Valid action objects are individual actions (StopTraining, Email, SMS) or an ActionList with a combination of these.
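As a rough sketch of how these pieces might fit together, the following builds an ActionList and attaches it to the stalled training rule. It assumes a SageMaker Python SDK version whose `sagemaker.debugger.rule_configs` exposes `ActionList`, `StopTraining`, `Email`, `SMS`, and `stalled_training_rule`; the helper names `stalled_rule_parameters` and `build_stalled_training_rule`, and the idea of wrapping the configuration in functions, are hypothetical and introduced only for illustration. Check the parameter names against the SDK version you have installed.

```python
def stalled_rule_parameters(threshold_seconds, job_name_prefix):
    """Build the rule_parameters dict (hypothetical helper).

    Rule parameter values are passed as strings.
    """
    if threshold_seconds <= 0:
        raise ValueError("threshold_seconds must be positive")
    return {
        "threshold": str(threshold_seconds),
        "training_job_name_prefix": job_name_prefix,
    }


def build_stalled_training_rule(threshold_seconds, job_name_prefix, email, phone):
    """Create a StalledTrainingRule with stop, email, and SMS actions."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.debugger import Rule, rule_configs

    # Individual actions combined into one ActionList.
    actions = rule_configs.ActionList(
        rule_configs.StopTraining(),
        rule_configs.Email(email),
        rule_configs.SMS(phone),
    )
    return Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters=stalled_rule_parameters(threshold_seconds, job_name_prefix),
        actions=actions,
    )


# Illustrative values only: a 120-second threshold and a made-up job prefix.
print(stalled_rule_parameters(120, "stalled-training-test"))
```

The rule object returned by `build_stalled_training_rule` would then be passed to the TensorFlow estimator in a list through its rules parameter. The email address and phone number arguments are placeholders you supply.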
Note: Debugger collects loss tensors by default every 500 steps.
Monitoring Training and Rule Evaluation Status
Once you execute the estimator.fit() API, SageMaker initiates a training job in the background, and Debugger initiates a StalledTrainingRule rule evaluation job in parallel.
Because the training script has a few lines of code at the end that force it to sleep for 10 minutes, the RuleEvaluationStatus for StalledTrainingRule changes to IssuesFound about 2 minutes after the sleep begins, which triggers the StopTrainingJob API.
Print the training job name
The following cell outputs the name of the training job running in the background and its current training status.
Output the current job status and the rule evaluation status
The following cell tracks the status of the training job until the SecondaryStatus changes to Stopped or Completed. While training, Debugger collects output tensors from the training job and monitors the training job with the rules.
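A polling loop of this kind might look like the sketch below. The function `wait_for_training_job` is hypothetical: it takes two callables so that it can be exercised without an AWS connection. The dictionary keys (`SecondaryStatus`, `RuleConfigurationName`, `RuleEvaluationStatus`) follow the shape of the DescribeTrainingJob response and the rule job summary; treating `Failed` as a terminal state is an addition beyond the two states named above, made so the loop cannot spin forever.

```python
import time


def wait_for_training_job(describe_job, rule_summary, poll_seconds=15, sleep=time.sleep):
    """Poll until the job reaches a terminal SecondaryStatus.

    describe_job: callable returning a dict with a "SecondaryStatus" key.
    rule_summary: callable returning a list of rule status dicts.
    Returns the list of (secondary_status, rule_statuses) pairs observed.
    """
    history = []
    while True:
        secondary = describe_job()["SecondaryStatus"]
        rule_statuses = {
            r["RuleConfigurationName"]: r["RuleEvaluationStatus"]
            for r in rule_summary()
        }
        history.append((secondary, rule_statuses))
        print(f"Training job: {secondary}, rules: {rule_statuses}")
        # "Failed" is included here as an assumption of this sketch.
        if secondary in ("Stopped", "Completed", "Failed"):
            return history
        sleep(poll_seconds)


# Assumed wiring with a fitted estimator (not run here):
# job = estimator.latest_training_job
# client = estimator.sagemaker_session.sagemaker_client
# wait_for_training_job(
#     lambda: client.describe_training_job(TrainingJobName=job.name),
#     job.rule_job_summary,
# )
```

Passing the describe and summary calls in as arguments keeps the loop independent of any particular client object, which also makes it easy to try out with stand-in functions.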
Conclusion
This notebook showed how you can use the Debugger StalledTrainingRule built-in rule for your training job to take action on rule evaluation status changes. To find more information about Debugger, see Amazon SageMaker Debugger Developer Guide and the smdebug GitHub documentation.
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.