Manage Runs
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
Introduction
When you're building enterprise-grade machine learning models, it is important to track, organize, monitor and reproduce your training runs. For example, you might want to trace the lineage behind a model deployed to production, and re-run the training experiment to troubleshoot issues.
This notebook shows examples of how to use the Azure Machine Learning service to manage your training runs.
Setup
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration Notebook first if you haven't already to establish your connection to the AzureML Workspace. Also, if you're new to Azure ML, we recommend that you go through the tutorial first to learn the basic concepts.
Let's first import the required packages, check the Azure ML SDK version, connect to your workspace, and create an Experiment to hold the runs.
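A minimal sketch of that setup, assuming the azureml-core package is installed and a workspace config.json is available locally. The experiment name and the connect_to_experiment helper are placeholders chosen for illustration.

```python
def connect_to_experiment(experiment_name="manage-runs-example"):
    """Connect to the workspace and return an Experiment to hold the runs."""
    # Imported inside the function so the sketch loads without the SDK;
    # in a notebook these imports would normally sit at the top of a cell.
    import azureml.core
    from azureml.core import Experiment, Workspace

    print("Azure ML SDK version:", azureml.core.VERSION)

    # Workspace.from_config() reads the config.json describing the workspace.
    ws = Workspace.from_config()
    return Experiment(workspace=ws, name=experiment_name)
```

In a notebook you would typically call this once, for example exp = connect_to_experiment(), and reuse exp in later cells.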
Start, monitor and complete a run
A run is a unit of execution, typically used to train a model, but also for other purposes such as loading or transforming data. Runs are tracked by the Azure ML service and can be instrumented with metrics and artifact logging.
The simplest way to start a run in your interactive Python session is to call the Experiment.start_logging method. You can then log metrics from within the run.
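As a rough sketch, starting an interactive run and logging a value could look like this. The experiment argument is assumed to be an existing Experiment object; the metric name and the start_interactive_run helper are illustrative.

```python
def start_interactive_run(experiment):
    """Start an interactive run and log one example value to it."""
    run = experiment.start_logging()

    # Each call to log() records a named value against the run.
    run.log(name="message", value="Hello from run!")
    return run
```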
Use the get_status method to get the status of the run.
You can also evaluate the run object on its own in a notebook cell to display a link to the run's details in the Azure Portal.
The get_details method returns more details about the run.
Use the complete method to end the run.
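The three calls above might be combined as follows. This is a sketch that assumes run is an interactive run that is still in progress; the inspect_and_complete helper is a name made up for this example.

```python
def inspect_and_complete(run):
    """Report a run's status and details, then mark it as completed."""
    print("Status before completion:", run.get_status())

    # get_details() returns a dictionary describing the run.
    details = run.get_details()
    print("Run id:", details.get("runId"))

    run.complete()
    return run.get_status()
```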
You can also use Python's with...as pattern. The run will automatically complete when moving out of scope. This way you don't need to manually complete the run.
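A short sketch of the context-manager form, again assuming experiment is an existing Experiment object and using an illustrative helper name:

```python
def log_with_context_manager(experiment):
    """Start a run in a with block so it completes when the block exits."""
    with experiment.start_logging() as run:
        run.log(name="message", value="Hello from run!")

    # No explicit run.complete() call is needed at this point.
    return run.get_status()
```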
Next, let's look at submitting a run as a separate Python process. To keep the example simple, we submit the run on the local computer. Other targets could include remote VMs and Machine Learning Compute clusters in your Azure ML Workspace.
We use the hello.py script as an example. To perform logging, we need to get a reference to the Run instance from within the scope of the script. We do this using the Run.get_context method.
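A minimal sketch of what such a script could contain. It is not the notebook's original hello.py: the metric name and the get_current_run and main helpers are placeholders, and the SDK import is deferred so the logging logic can be exercised without Azure ML installed.

```python
def get_current_run():
    """Return the Run that is executing this script."""
    # Deferred import: only needed when the script runs under Azure ML.
    from azureml.core import Run

    return Run.get_context()


def main(run=None):
    """Log a greeting to the current run."""
    if run is None:
        run = get_current_run()
    run.log(name="message", value="Hello from run!")
    print("Hello from run!")
```

Saved as hello.py, the script would end with a call to main().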
Submitted runs take a snapshot of the source_directory to use when executing. You can control which files are available to the run by using an .amlignore file.
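As an illustration, the sketch below writes an .amlignore file into a source directory. The file uses the same pattern syntax as .gitignore; the patterns shown in the usage note and the write_amlignore helper are made-up examples.

```python
from pathlib import Path


def write_amlignore(source_directory, patterns):
    """Write one ignore pattern per line to <source_directory>/.amlignore."""
    ignore_file = Path(source_directory) / ".amlignore"
    ignore_file.write_text("\n".join(patterns) + "\n")
    return ignore_file
```

For example, write_amlignore(".", ["data/", "*.log"]) would keep a local data folder and log files out of the snapshot.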
Let's submit the run on the local computer. A standard pattern in the Azure ML SDK is to create a run configuration and then pass it to the Experiment.submit method.
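A sketch of that pattern using ScriptRunConfig. It assumes that a configuration with no compute target runs on the local machine, and that hello.py sits in the current directory; the submit_local_script helper is illustrative.

```python
def submit_local_script(experiment, source_directory=".", script="hello.py"):
    """Submit a script from source_directory as a run of the experiment."""
    # Deferred import so the sketch loads without the SDK installed.
    from azureml.core import ScriptRunConfig

    # Assumption: no compute target is set, so the run executes locally.
    config = ScriptRunConfig(source_directory=source_directory, script=script)
    return experiment.submit(config)
```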
You can view the status of the run as before.
Submitted runs have additional log files you can inspect using get_details_with_logs.
Use the wait_for_completion method to block local execution until the submitted run is complete.
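Put together, waiting for a submitted run and then reading its logs might look like the sketch below. The run argument is assumed to be the object returned by Experiment.submit; wait_and_fetch_logs is a hypothetical helper name.

```python
def wait_and_fetch_logs(run):
    """Block until the run finishes, then return its status and log details."""
    # show_output=True streams the run's output while waiting.
    run.wait_for_completion(show_output=True)

    details = run.get_details_with_logs()
    return run.get_status(), details
```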
Add properties and tags
Properties and tags help you organize your runs. You can use them to describe, for example, who authored the run, what the results were, and what machine learning approach was used. And as you'll later learn, properties and tags can be used to query the history of your runs to find the important ones.
For example, let's add an "author" property to the run.
Properties are immutable. Once you assign a value it cannot be changed, making them useful as a permanent record for auditing purposes.
Tags, on the other hand, can be changed.
You can also add a simple string tag. It appears in the tag dictionary with a value of None.
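The following sketch walks through those steps on a single run. The property and tag values are illustrative, annotate_run is a hypothetical helper, and the sketch assumes the SDK raises an exception when an existing property is reassigned.

```python
def annotate_run(run):
    """Add an immutable property and some mutable tags to a run."""
    run.add_properties({"author": "azureml-user"})

    # Assumption: reassigning an existing property raises an exception.
    try:
        run.add_properties({"author": "different-user"})
    except Exception as error:
        print("Property was not changed:", error)

    # Tags can be overwritten by tagging again with the same key.
    run.tag("quality", "great run")
    run.tag("quality", "fantastic run")

    # A tag given without a value is stored with the value None.
    run.tag("worth another look")

    return run.get_properties(), run.get_tags()
```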
Query properties and tags
You can query runs within an experiment that match specific properties and tags.
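A sketch of two such queries using Experiment.get_runs, assuming runs were annotated with an "author" property, a "quality" tag, and a "worth another look" string tag. Those names and the find_runs helper are illustrative.

```python
def find_runs(experiment):
    """Query an experiment's runs by property and tag filters."""
    # Runs matching both a property value and a tag value.
    by_property_and_tag = list(
        experiment.get_runs(
            properties={"author": "azureml-user"},
            tags={"quality": "fantastic run"},
        )
    )

    # Runs carrying a string tag, whatever its value.
    by_tag_name = list(experiment.get_runs(tags="worth another look"))

    return by_property_and_tag, by_tag_name
```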
Start and query child runs
You can use child runs to group together related runs, for example different hyperparameter tuning iterations.
Let's use the hello_with_children script to create a batch of 5 child runs from within a submitted run.
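The core of such a script could be sketched as below. Inside the submitted script, parent_run would be obtained with Run.get_context(); the metric name and the log_from_children helper are illustrative.

```python
def log_from_children(parent_run, count=5):
    """Create a batch of child runs, log one value in each, and complete them."""
    # create_children() creates the whole batch in one call.
    child_runs = parent_run.create_children(count=count)

    for index, child in enumerate(child_runs):
        child.log(name="Hello from child run", value=index)
        child.complete()

    return child_runs
```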
You can start child runs one by one. Note that this is less efficient than submitting a batch of runs, because each creation results in a network call.
Child runs also complete automatically when they move out of scope.
To query the child runs belonging to a specific parent, use the get_children method.
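A sketch of creating child runs one at a time and then listing them. It assumes experiment is an existing Experiment object; run_children_one_by_one is a hypothetical helper name.

```python
def run_children_one_by_one(experiment, count=5):
    """Start child runs individually under a parent run, then list them."""
    with experiment.start_logging() as parent_run:
        for index in range(count):
            # Each child_run() call creates one child; the with block
            # completes it on exit.
            with parent_run.child_run() as child:
                child.log(name="Hello from child run", value=index)

    return parent_run, list(parent_run.get_children())
```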
Cancel or fail runs
Sometimes, you realize that the run is not performing as intended, and you want to cancel it instead of waiting for it to complete.
As an example, let's create a Python script with a delay in the middle.
You can use the cancel method to cancel a run.
You can also mark an unsuccessful run as failed.
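Both operations are single calls on the run object. In the sketch below, run is assumed to be a submitted run that is still in progress, and the helper names are illustrative.

```python
def cancel_run(run):
    """Cancel a run that is still in progress and return its new status."""
    run.cancel()
    return run.get_status()


def fail_run(run):
    """Mark a run as failed and return its new status."""
    run.fail()
    return run.get_status()
```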
Reproduce a run
When updating or troubleshooting a model deployed to production, you sometimes need to revisit the original training run that produced the model. To help you with this, the Azure ML service by default creates snapshots of your scripts at the time of run submission.
You can use restore_snapshot to obtain a zip package of the latest snapshot of the script folder.
You can then extract the zip package, examine the code, and submit your run again.
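A sketch of the restore-and-extract step, assuming restore_snapshot returns the path of the downloaded zip file. The restore_and_extract helper and the target directory are placeholders for this example.

```python
import zipfile


def restore_and_extract(run, target_directory):
    """Download a run's snapshot zip and unpack it into target_directory."""
    print("Snapshot id:", run.get_snapshot_id())

    # Assumption: restore_snapshot() returns the path to the zip file.
    zip_path = run.restore_snapshot()

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_directory)
        return archive.namelist()
```

The returned list of file names gives a quick view of what was captured before you resubmit the extracted script.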
Next steps
- To learn more about logging APIs, see the logging API notebook
- To learn more about remote runs, see the train on AML compute notebook