
Introduction to Large Language Model Hosting on SageMaker with DeepSpeed Container


This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.



In this notebook, we explore how to host a large language model on SageMaker using the Large Model Inference container that is optimized for hosting large models using DJLServing. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post.

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI's 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access to models from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

In this notebook, we deploy the open source GPT-J model, which has 6B parameters, on a single GPU. Along the way we will explore approaches that allow us to scale to larger models with practically no code changes.

This notebook was tested on a ml.t3.medium instance using the Python 3 (Data Science) kernel on SageMaker Studio.

Create a SageMaker Model for Deployment

As a first step, we'll import the relevant libraries and configure several global variables, such as the hosting image that will be used and the S3 location of our model artifacts.

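A rough sketch of that setup is below. The account ID, image tag, bucket name, and prefixes are placeholders/assumptions for illustration, not values from the notebook; in practice the region, role, and bucket would come from the sagemaker SDK session.

```python
# Region and execution role would normally come from sagemaker.Session() and
# sagemaker.get_execution_role(); hard-coded placeholders are used here.
region = "us-west-2"
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# DJL Large Model Inference (LMI) container image. The tag is an assumption;
# check the published deep-learning-containers image list for a current one.
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/"
    "djl-inference:0.21.0-deepspeed0.8.0-cu117"
)

# S3 locations for the uncompressed model weights and the code artifact.
bucket = "my-sagemaker-bucket"       # placeholder bucket name
s3_model_prefix = "gpt-j-6b/model"   # uncompressed model artifacts
s3_code_prefix = "gpt-j-6b/code"     # serving.properties tarball
pretrained_model_location = f"s3://{bucket}/{s3_model_prefix}/"
```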

Deploying a Large Language Model using Hugging Face Accelerate

The DJL Inference Image which we will be utilizing ships with a number of built-in inference handlers for a wide variety of tasks including:

  • text-generation
  • question-answering
  • text-classification
  • token-classification

You can refer to this GitHub repo for a list of additional handlers and available NLP tasks.
These handlers can be used as-is, without having to write any custom inference code. We simply need to create a serving.properties text file with our desired hosting options and package it up into a tar.gz artifact.

Let's take a look at the serving.properties file that we'll use for our first example.

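Written out, the file for this first example would look roughly like the following. The s3url value is a placeholder; the remaining options mirror those discussed in this section.

```python
# serving.properties for the Hugging Face Accelerate example.
# The S3 URL is a placeholder for your own model artifact location.
serving_properties = """engine=Python
option.entryPoint=djl_python.huggingface
option.s3url=s3://my-sagemaker-bucket/gpt-j-6b/model/
option.task=text-generation
option.device_map=auto
option.load_in_8bit=true
"""

with open("serving.properties", "w") as f:
    f.write(serving_properties)
```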

There are a few options specified here. Let's go through them in turn:

  1. engine - specifies the engine that will be used for this workload. In this case we'll host the model using the DJL Python engine.
  2. option.entryPoint - specifies the entry point code that will be used to host the model. djl_python.huggingface refers to the huggingface.py module from the djl_python repo.
  3. option.s3url - specifies the location of the model files. Alternatively, an option.model_id option can be used instead to specify a model from the Hugging Face Hub (e.g. EleutherAI/gpt-j-6B), and the model will be downloaded automatically from the Hub. The s3url approach is recommended, as it allows you to host the model artifact within your own environment and enables faster deployments by utilizing the optimized approach within the DJL inference container to transfer the model from S3 to the hosting instance.
  4. option.task - This is specific to the huggingface.py inference handler and specifies the task for which this model will be used.
  5. option.device_map - Enables layer-wise model partitioning through Hugging Face Accelerate. With option.device_map=auto, Accelerate determines where to put each layer to maximize the use of your fastest devices (GPUs) and offloads the rest to the CPU, or even the hard drive if you don't have enough GPU RAM (or CPU RAM). Even if the model is split across several devices, it will run as you would normally expect.
  6. option.load_in_8bit - Quantizes the model weights to int8, greatly reducing the memory footprint of the model from the initial FP32. See this blog post from Hugging Face for additional information.

For more information on the available options, please refer to the SageMaker Large Model Inference Documentation

Our initial approach here is to utilize the built-in functionality within Hugging Face Transformers to enable large language model hosting. This functionality is exposed through the device_map and load_in_8bit parameters, which enable sharding and shrinking of the model. The sharding approach taken here is layer-wise: individual model layers are placed onto different GPU devices, and data flows sequentially from the input to the final output layer, as illustrated below.

Even though in this example the model will be running on a single GPU and will not be sharded, this parameter would automatically apply sharding as we scale to larger models on multi-GPU instances.

We place the serving.properties file into a tarball and upload it to S3.

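The packaging step can be sketched as follows. The snippet (re)creates a minimal serving.properties so it is self-contained; the actual S3 upload requires an AWS session, so it is indicated only in a comment.

```python
import tarfile

# (Re)create a minimal serving.properties so this snippet is self-contained.
with open("serving.properties", "w") as f:
    f.write("engine=Python\noption.entryPoint=djl_python.huggingface\n")

# Package it into the model.tar.gz artifact the endpoint will download.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("serving.properties")

# Uploading would then use the sagemaker SDK or boto3, e.g.:
#   s3_code_artifact = sagemaker.Session().upload_data(
#       "model.tar.gz", bucket, s3_code_prefix)

with tarfile.open("model.tar.gz", "r:gz") as tar:
    members = tar.getnames()
```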

Deploy Model to a SageMaker Endpoint

With a helper function we can now deploy our endpoint and invoke it with some sample inputs.

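A minimal sketch of such a helper, written against the low-level boto3 create_model/create_endpoint_config/create_endpoint calls. The names, instance type, and timeout are assumptions, and the client is passed in as a parameter so the flow can be followed without an AWS session.

```python
def deploy_llm_endpoint(sm_client, name, image_uri, model_data_url, role_arn,
                        instance_type="ml.g5.12xlarge"):
    """Create a SageMaker model, endpoint config, and endpoint."""
    sm_client.create_model(
        ModelName=name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
    )
    sm_client.create_endpoint_config(
        EndpointConfigName=f"{name}-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            # Large models take a while to load; extend the health-check window.
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        }],
    )
    sm_client.create_endpoint(
        EndpointName=name, EndpointConfigName=f"{name}-config"
    )
    return name
```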

Let's run an example with a basic text generation prompt: "Large model inference is".

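Invocation would look roughly like this. The payload shape (an "inputs" string plus optional generation "parameters") follows the DJL Hugging Face handler; the runtime client is passed in so the sketch stays self-contained.

```python
import json

def generate(smr_client, endpoint_name, prompt, max_new_tokens=64):
    # The huggingface handler accepts an "inputs" string plus optional
    # generation "parameters".
    payload = {"inputs": prompt,
               "parameters": {"max_new_tokens": max_new_tokens}}
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
    )
    return response["Body"].read().decode("utf-8")
```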

Now let's try another example where we provide a few samples of text and sentiment pairs and ask the model to classify a new example.

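The few-shot prompt might be assembled like this. The example texts are illustrative, not the notebook's exact samples.

```python
# Build a few-shot sentiment prompt: labeled examples followed by an
# unlabeled one for the model to complete.
examples = [
    ("The food was delicious", "Positive"),
    ("The service was painfully slow", "Negative"),
]
query = "The staff went out of their way to help us"

prompt = "\n".join(f"Text: {t}\nSentiment: {s}" for t, s in examples)
prompt += f"\nText: {query}\nSentiment:"
```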

You can see that the model filled in a Sentiment value for the last example. You can take a look at the blog post here for more examples of prompts. Finally, let's do a quick benchmark to see what kind of latency we can expect from this model.

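A simple latency benchmark could time repeated invocations and report summary statistics. The invocation is passed in as a callable so the timing logic is self-contained; the repeat count is an arbitrary choice.

```python
import statistics
import time

def benchmark(invoke_fn, n=10):
    """Time n invocations and return mean and p90 latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        invoke_fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {
        "mean_ms": statistics.mean(latencies),
        "p90_ms": latencies[max(0, int(0.9 * n) - 1)],
    }
```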

Bonus: Deploying a Large Language Model Using DeepSpeed

Now we will explore another approach for deploying large language models: DeepSpeed. DeepSpeed provides various inference optimizations for compatible transformer-based models, including model sharding, optimized inference kernels, and quantization. To leverage DeepSpeed, we simply need to modify our serving.properties file.

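The modified file would look roughly like this. The s3url is a placeholder, and the tensor_parallel_degree of 4 is an example value chosen for a four-GPU instance.

```python
# serving.properties for the DeepSpeed example. S3 URL is a placeholder;
# tensor_parallel_degree=4 assumes a four-GPU hosting instance.
deepspeed_properties = """engine=DeepSpeed
option.entryPoint=djl_python.deepspeed
option.s3url=s3://my-sagemaker-bucket/gpt-j-6b/model/
option.task=text-generation
option.tensor_parallel_degree=4
"""

with open("serving.properties", "w") as f:
    f.write(deepspeed_properties)
```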

Notice that the engine parameter is now set to DeepSpeed and the option.entryPoint has been modified to use the deepspeed.py module. Python scripts that use DeepSpeed cannot be launched as ordinary Python scripts (i.e., python deepspeed.py would not work). Setting engine=DeepSpeed automatically configures the environment and launches the inference script appropriately.
The only other new parameter here is option.tensor_parallel_degree, where we specify the number of GPU devices across which the model will be sharded.

Unlike Accelerate, where the model was partitioned along its layers, DeepSpeed uses tensor parallelism, in which individual layers (tensors) are sharded across devices; for example, each GPU can hold a slice of each layer. The diagram below provides a high-level illustration of how this works.

Whereas with the layer-wise approach the data flowed through each GPU device sequentially, here data is sent to all GPU devices, a partial result is computed on each GPU, and the partial results are then collected through an All-Gather operation to compute the final result. Tensor parallelism generally provides higher GPU utilization and better performance.
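As a toy illustration of the idea (pure Python, two simulated "devices"): each device holds a row shard of a weight matrix, computes a partial matrix-vector product on its shard, and an all-gather concatenates the partial results into the full output.

```python
def matvec(W, x):
    """Plain matrix-vector product for reference."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# Full weight matrix and input vector.
W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]

# Shard rows across two simulated devices (tensor parallelism).
shards = [W[:2], W[2:]]

# Each device computes a partial result on its own shard...
partials = [matvec(shard, x) for shard in shards]

# ...and an all-gather concatenates the partials into the final output.
gathered = [y for part in partials for y in part]

assert gathered == matvec(W, x)
```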

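The closing cells typically clean up the deployed resources to stop billing. A sketch, again with the client injected so the flow is visible without an AWS session (the `-config` naming convention matches the deploy helper assumption above):

```python
def cleanup(sm_client, endpoint_name):
    """Delete the endpoint, its config, and the model to stop billing."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(
        EndpointConfigName=f"{endpoint_name}-config"
    )
    sm_client.delete_model(ModelName=endpoint_name)
```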

Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.
