Djl Accelerate Deploy G5 12x GPT NeoX

Serve GPT-NeoX-20b on SageMaker with Accelerate using DJL container.


This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.



In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages some of the most popular open source libraries for model parallel inference like DeepSpeed and Hugging Face Accelerate. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-bloom-176b-and-opt-30b-on-amazon-sagemaker-with-large-model-inference-deep-learning-containers-and-deepspeed/).

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x with models such as OpenAI's 175 billion parameter GPT-3 and similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

SageMaker has rolled out Deep Learning Containers that provide users with the ability to leverage managed serving capabilities and take care of the undifferentiated heavy lifting.

In this notebook, we deploy the open source GPT-NeoX-20B model across GPUs on a ml.g5.12xlarge instance. The model is loaded using layer-wise model partitioning through Hugging Face Accelerate. You can also quantize the model weights to int8, thereby greatly reducing the memory footprint of the model relative to the initial FP32 weights. See this blog post from Hugging Face for additional information.

License agreement

Import the relevant libraries and configure several global variables using boto3

[ ]
[ ]
[ ]

Create a SageMaker-compatible model artifact, upload the model to S3, and bring your own inference script.

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or post-processing of the model's predictions. We used that approach in Lab 1, where we leveraged the built-in containers to host the models.

In this notebook, we demonstrate how to bring your own inference script which leverages Accelerate to shard the model.

SageMaker needs the model artifacts to be in a Tarball format. In this example, we provide the following files - serving.properties and model.py.

The tarball has the following structure:

code
├── serving.properties
└── model.py

  • serving.properties is the configuration file that can be used to configure the model server.
  • model.py is the file that handles any requests for serving.
[ ]

Create serving.properties

This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.

Here is a list of settings that we use in this configuration file -

  • engine: The engine for DJL to use. In this case, we intend to use Accelerate and hence set it to Python.
  • option.entryPoint: The entrypoint python file or module. This should align with the engine that is being used.
  • option.s3url: Set this to the URI of the Amazon S3 bucket that contains the model. When this is set, the container leverages s5cmd to download the model from s3. This is extremely fast and useful when downloading large models like this one.

If you want to download the model from huggingface.co instead, you can set option.model_id to the ID of a pretrained model hosted in a model repository on huggingface.co (https://huggingface.co/models). The container then uses this model ID to download the corresponding model repository from huggingface.co.

  • option.tensor_parallel_degree: Set to the number of GPU devices across which Accelerate needs to partition the model. This parameter also controls the number of workers per model that are started up when DJL Serving runs. For example, on an 8-GPU machine with 8 partitions, there will be one worker per model to serve the requests.

For more details on the configuration options and an exhaustive list, you can refer the documentation - https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.
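Putting those settings together, a serving.properties for this deployment might look like the following sketch. The S3 URI is a placeholder, and the tensor_parallel_degree of 4 matches the four GPUs on an ml.g5.12xlarge:

```
engine=Python
option.entryPoint=model.py
option.s3url=s3://YOUR-BUCKET/gpt-neox-20b/
option.tensor_parallel_degree=4
```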

The approach here is to utilize the built-in functionality within Hugging Face Transformers to enable large language model hosting. The sharding approach taken here is layer-wise: individual model layers are placed onto different GPU devices, and data flows sequentially from the input to the final output layer, as illustrated below.

[ ]

In the below cell, we leverage Jinja to create a template for serving.properties. Specifically, we parameterize option.s3url so that it can be changed based on the pretrained model location.

[ ]
[ ]
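As a rough sketch of that templating step (the property values and S3 bucket here are placeholders), the Jinja rendering could look like:

```python
import jinja2

# Template for serving.properties with a placeholder for the model location.
template = jinja2.Template("""\
engine=Python
option.entryPoint=model.py
option.s3url={{ s3url }}
option.tensor_parallel_degree=4
""")

# Render with the actual S3 location (placeholder bucket shown here); the
# result would then be written out as serving.properties.
rendered = template.render(s3url="s3://YOUR-BUCKET/gpt-neox-20b/")
print(rendered)
```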

Create a model.py with custom inference code

In this script, we load the model and generate predictions using the transformers library. Note the use of the following parameters while loading the model -

  • device_map: Using one of the supported versions lets Accelerate handle the device_map computation. With balanced_low_0, the model is split evenly across all GPUs except the first one. For other supported options, you can refer to designing a device map. You can also create one yourself.
  • load_in_8bit: Setting this to True quantizes the model weights to int8 thereby greatly reducing the memory footprint of the model from the initial FP32. See this blog post from Hugging Face for additional information.

The container also makes a warmup call without a payload to the handler.

[ ]
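A hypothetical sketch of the request-parsing side of model.py follows; the helper name is illustrative rather than the exact handler code, and the actual model loading with device_map and load_in_8bit is shown only in comments since it requires the GPUs and model weights:

```python
import json

def parse_request(body: bytes):
    """Extract prompts and generation kwargs from a JSON payload.
    The container makes a warmup call with an empty payload, which the
    handler must tolerate; we signal that case by returning None."""
    if not body:
        return None
    data = json.loads(body)
    prompts = data["inputs"]
    if isinstance(prompts, str):
        prompts = [prompts]  # normalize a single prompt to a batch of one
    return prompts, data.get("parameters", {})

# The model itself would be loaded once, lazily, roughly like:
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   model = AutoModelForCausalLM.from_pretrained(
#       model_location, device_map="balanced_low_0", load_in_8bit=True)
#   tokenizer = AutoTokenizer.from_pretrained(model_location)
```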

Retrieve the image URI for the DJL container being used here.

[ ]

Create the Tarball and then upload to S3 location

[ ]
[ ]
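The packaging step can be sketched in shell as follows; the directory layout mirrors the tarball structure above, and the file contents here are placeholders:

```shell
# Assemble the code directory and package it as model.tar.gz,
# the tarball layout SageMaker expects.
mkdir -p code
touch code/serving.properties code/model.py   # placeholders for this sketch
tar czf model.tar.gz code/
# The tarball would then be uploaded to S3, e.g. with:
#   aws s3 cp model.tar.gz s3://YOUR-BUCKET/gpt-neox-20b/code/
```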

To create the endpoint, the steps are:

  1. Create the Model using the image container and the model tarball uploaded earlier

  2. Create the endpoint config using the following key parameters

    a) Instance Type is ml.g5.12xlarge

    b) ContainerStartupHealthCheckTimeoutInSeconds is 2400, giving the container enough time to load the model before health checks begin

  3. Create the endpoint using the endpoint config created
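Those three steps can be sketched with boto3 roughly as follows. The resource names are illustrative, and the actual API calls are shown only in comments since they require AWS credentials and the artifacts created earlier:

```python
model_name = "gpt-neox-20b-accelerate"        # illustrative name
endpoint_config_name = f"{model_name}-config"

# Key parameters for the endpoint config (step 2).
production_variant = {
    "VariantName": "AllTraffic",
    "ModelName": model_name,
    "InstanceType": "ml.g5.12xlarge",
    "InitialInstanceCount": 1,
    "VolumeSizeInGB": 512,  # EBS volume backing /tmp for the model download
    "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
}

# With a SageMaker client the three calls would be roughly:
#   sm = boto3.client("sagemaker")
#   sm.create_model(ModelName=model_name, ExecutionRoleArn=role,
#                   PrimaryContainer={"Image": image_uri,
#                                     "ModelDataUrl": code_artifact})
#   sm.create_endpoint_config(EndpointConfigName=endpoint_config_name,
#                             ProductionVariants=[production_variant])
#   sm.create_endpoint(EndpointName=model_name,
#                      EndpointConfigName=endpoint_config_name)
```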

Create the Model

Use the image URI for the DJL container and the s3 location to which the tarball was uploaded.

The container downloads the model into the /tmp space on the container because SageMaker maps /tmp to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB. It leverages s5cmd (https://github.com/peak/s5cmd), which offers very fast download speeds and is hence extremely useful when downloading large models.

For instances like p4d, which come with local NVMe instance storage, we can continue to leverage /tmp on the container. The size of this mount is large enough to hold the model.

[ ]
[ ]
[ ]

This step can take ~10 minutes or longer, so please be patient.

[ ]

Leverage Boto3 to invoke the endpoint.

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the results.

You can pass a batch of prompts as input to the model. This is done by setting inputs to the list of prompts. The model then returns a result for each prompt. Text generation can be configured using appropriate parameters, which need to be passed to the endpoint as a dictionary of kwargs. Refer to this documentation (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig) for more details.

The below code sample illustrates the invocation of the endpoint using a batch of prompts and also sets some parameters.

[ ]
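A sketch of that invocation follows. The prompts and generation parameters are illustrative, and the actual invoke_endpoint call is shown only in comments since it requires a live endpoint:

```python
import json

# A batch of two prompts plus generation kwargs for the model.
payload = {
    "inputs": ["Amazon.com is the best", "Large language models are"],
    "parameters": {"max_length": 50, "temperature": 0.5, "do_sample": True},
}
body = json.dumps(payload)

# With a live endpoint, the call would be roughly:
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(EndpointName=endpoint_name,
#                                      ContentType="application/json",
#                                      Body=body)
#   results = json.loads(response["Body"].read())
print(body)
```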

Conclusion

In this post, we demonstrated how to use SageMaker large model inference containers to host GPT-NeoX. We used Hugging Face Accelerate's model parallel techniques to host the model on multiple GPUs on a single SageMaker machine learning instance. For more details about Amazon SageMaker and its large model inference capabilities, refer to the Amazon SageMaker documentation.

Clean Up

[ ]
[ ]

Notebook CI Test Results

This notebook was also tested in multiple regions other than us-west-2, whose result is shown at the top of the notebook.
