Huggingface Sagemaker-sdk - Distributed Training Demo

Model Parallelism using SageMakerTrainer

Introduction

Welcome to our end-to-end distributed text-classification example. In this demo, we use the Hugging Face transformers and datasets libraries together with an Amazon sagemaker-sdk extension to run the GLUE mnli benchmark on a multi-node, multi-GPU cluster using the SageMaker Model Parallelism Library. The demo uses the new smdistributed library to run training on multiple GPUs. We extended the Trainer API into the SageMakerTrainer, which uses the model parallelism library, so you only have to change the imports in your train.py:

from transformers.sagemaker import SageMakerTrainingArguments as TrainingArguments
from transformers.sagemaker import SageMakerTrainer as Trainer

NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.

Development Environment and Permissions

Installation

Note: We only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't already have one of them installed.

Development environment

Permissions

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.

Fine-tuning & starting a SageMaker Training Job

To create a SageMaker training job, we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script should be used as entry_point, which instance_type should be used, which hyperparameters are passed in, and so on.

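from sagemaker.huggingface import HuggingFace

# `role` is the IAM role with SageMaker permissions (see Permissions above)
huggingface_estimator = HuggingFace(entry_point='train.py',        # fine-tuning script
                            source_dir='./scripts',                # directory containing train.py
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            transformers_version='4.4',
                            pytorch_version='1.6',
                            py_version='py36',
                            role=role,
                            # passed to train.py as named command-line arguments
                            hyperparameters={'epochs': 1,
                                             'train_batch_size': 32,
                                             'model_name': 'distilbert-base-uncased'
                                             })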

When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the huggingface container. It uploads the provided fine-tuning script train.py and downloads the data from our sagemaker_session_bucket into the container at /opt/ml/input/data. Then, it starts the training job by running:

/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32

The hyperparameters you define in the HuggingFace estimator are passed in as named arguments.
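As a rough illustration of how the hyperparameters reach the script, the sketch below renders the dictionary from the estimator as command-line flags and parses them again with argparse, the way a train.py might. The helper names build_command and parse_args and the default values are assumptions made for this example. The flag rendering only mimics the command shown above and is not SageMaker's actual implementation.

```python
import argparse


def build_command(hyperparameters, entry_point="train.py"):
    """Render a hyperparameter dict as `--name value` flags (illustration only)."""
    flags = []
    for name in sorted(hyperparameters):
        flags += [f"--{name}", str(hyperparameters[name])]
    return ["/opt/conda/bin/python", entry_point] + flags


def parse_args(argv):
    """Parse the named arguments inside a training script (hypothetical defaults)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str)
    # parse_known_args ignores any extra flags the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
}
command = build_command(hyperparameters)
args = parse_args(command[2:])  # skip the interpreter and the script name

print(" ".join(command))
print(args.epochs, args.train_batch_size, args.model_name)
```

Because every value arrives as a string on the command line, the script declares the type of each argument so that epochs and train_batch_size are converted back to integers.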

SageMaker provides useful properties about the training environment through various environment variables, including the following:

  • SM_MODEL_DIR: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

  • SM_NUM_GPUS: An integer representing the number of GPUs available to the host.

  • SM_CHANNEL_XXXX: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named train and test, the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.
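A training script could read the variables listed above roughly as follows. This is a minimal sketch: the function name read_sagemaker_env, the fallback values used when a variable is missing (for example when running outside SageMaker), and the example paths are assumptions for illustration.

```python
import os


def read_sagemaker_env(environ=None):
    """Collect the SageMaker training variables into a plain dict."""
    environ = os.environ if environ is None else environ
    prefix = "SM_CHANNEL_"
    # SM_CHANNEL_TRAIN -> channels["train"], SM_CHANNEL_TEST -> channels["test"], ...
    channels = {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }
    return {
        "model_dir": environ.get("SM_MODEL_DIR", "./model"),  # assumed local fallback
        "num_gpus": int(environ.get("SM_NUM_GPUS", "0")),     # assumed fallback: no GPUs
        "channels": channels,
    }


# Example values only; inside a training job SageMaker sets these for you.
example_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_NUM_GPUS": "8",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_TEST": "/opt/ml/input/data/test",
}
settings = read_sagemaker_env(example_env)
print(settings)
```

Calling read_sagemaker_env() without arguments reads from the real process environment instead of the example dictionary.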

To run your training job locally, you can define instance_type='local' or instance_type='local_gpu' for GPU usage. Note: this does not work within SageMaker Studio.

Creating an Estimator and starting a training job

In this example we are going to use the run_glue.py from the transformers example scripts. We modified it to use SageMakerTrainer instead of the Trainer, which enables model parallelism. You can find the code here.

from transformers.sagemaker import SageMakerTrainingArguments as TrainingArguments, SageMakerTrainer as Trainer
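On the estimator side, model parallelism is switched on through the distribution argument of the estimator. The sketch below builds such a configuration as a plain dictionary. The helper name build_distribution and all parameter values (number of partitions, microbatches, processes per host) are assumptions chosen for illustration and need to be tuned to your model and instance type; processes_per_host=8 assumes an instance with 8 GPUs.

```python
def build_distribution(partitions=4, microbatches=4, processes_per_host=8):
    """Build a `distribution` dict enabling the SageMaker model parallelism library."""
    smp_options = {
        "enabled": True,
        "parameters": {
            "partitions": partitions,        # number of model partitions
            "microbatches": microbatches,    # each batch is split into this many microbatches
            "placement_strategy": "spread",
            "pipeline": "interleaved",
            "optimize": "speed",
            "ddp": True,
        },
    }
    # The model parallelism library is launched through MPI
    mpi_options = {"enabled": True, "processes_per_host": processes_per_host}
    return {"smdistributed": {"modelparallel": smp_options}, "mpi": mpi_options}


distribution = build_distribution()
print(distribution)
```

The resulting dictionary would be passed as distribution=distribution when constructing the HuggingFace estimator, together with a multi-GPU instance_type.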

Deploying the endpoint

To deploy our endpoint, we call deploy() on our HuggingFace estimator object, passing in our desired number of instances and instance type.

Then, we use the returned predictor object to call the endpoint.

Finally, we delete the endpoint again.
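The three steps above (deploy, predict, delete) can be sketched as one helper that always cleans up the endpoint, even if the prediction fails. The helper name deploy_predict_cleanup, the instance type, and the payload are assumptions for illustration. FakeEstimator and FakePredictor are hypothetical stand-ins so the snippet runs without an AWS account; in the notebook you would pass the trained huggingface_estimator instead.

```python
def deploy_predict_cleanup(estimator, payload, instance_count=1,
                           instance_type="ml.g4dn.xlarge"):
    """Deploy an endpoint, send one request, and delete the endpoint again."""
    predictor = estimator.deploy(initial_instance_count=instance_count,
                                 instance_type=instance_type)
    try:
        return predictor.predict(payload)
    finally:
        # Runs even if predict() raises, so no endpoint is left running
        predictor.delete_endpoint()


class FakePredictor:
    """Hypothetical stand-in for the predictor returned by deploy()."""

    def __init__(self, log):
        self.log = log

    def predict(self, data):
        self.log.append("predict")
        return [{"label": "LABEL_0", "score": 1.0}]  # made-up response

    def delete_endpoint(self):
        self.log.append("delete_endpoint")


class FakeEstimator:
    """Hypothetical stand-in for a trained HuggingFace estimator."""

    def __init__(self):
        self.log = []

    def deploy(self, initial_instance_count, instance_type):
        self.log.append(f"deploy:{initial_instance_count}:{instance_type}")
        return FakePredictor(self.log)


fake = FakeEstimator()
result = deploy_predict_cleanup(fake, {"inputs": "I love using SageMaker."})
print(fake.log)
print(result)
```

Wrapping the request in try/finally is a design choice for short demos: a running endpoint is billed until it is deleted, so cleanup should not depend on the prediction succeeding.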
