
Serve Falcon 40B model with Amazon SageMaker Hosting


This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.



In this example we walk through how to deploy and run inference on the Falcon 40B model using the Large Model Inference (LMI) container provided by AWS, which is built on DJL Serving and DeepSpeed. Falcon 40B is a causal decoder-only model that was trained on Amazon SageMaker using 384 A100 40GB GPUs in P4d instances. Because this large language model (LLM) does not fit on a single GPU, we will use an ml.g5.12xlarge instance, which has 4 GPUs, to deploy this model.

Setup

Installs the dependencies required to package the model and run inferences using Amazon SageMaker, and updates SageMaker, boto3, and related packages.


Imports and variables


1. Create SageMaker compatible model artifacts

To prepare our model for deployment to a SageMaker endpoint for hosting, we need to prepare a few things for SageMaker and our container. We will use a local folder as the location of these files, including a serving.properties file that defines parameters for the LMI container.


In the serving.properties file, define the engine to use and the model to host. Note the tensor_parallel_degree parameter, which is also required in this scenario. We will use tensor parallelism to divide the model across multiple GPUs because no single GPU has enough memory for the entire model. In this case we will use an ml.g5.12xlarge instance, which provides 4 GPUs. Be careful not to specify a value larger than the number of GPUs the instance provides, or your deployment will fail.

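A serving.properties along these lines configures the Python (Accelerate) engine for DJL Serving. The model id and option values shown here are typical settings, not necessarily the exact ones from the original cell:

```
engine=Python
option.model_id=tiiuae/falcon-40b
option.tensor_parallel_degree=4
option.dtype=fp16
```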

2. Create a model.py with custom inference code

SageMaker allows you to bring your own script for inference. Here we create our model.py file with the appropriate code for the Falcon 40B model.

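A sketch of what model.py can look like for the DJL Python engine. The handle(inputs) entry point is the DJL Serving convention; the loading and generation details below are illustrative assumptions, not the notebook's verbatim code:

```python
# model.py -- loaded by DJL Serving inside the LMI container.
import torch
from djl_python import Input, Output
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

predictor = None


def get_pipeline(properties):
    # Load tokenizer and model once, sharded across the visible GPUs.
    model_id = properties.get("model_id", "tiiuae/falcon-40b")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",          # let Accelerate place shards on GPUs
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    return pipeline("text-generation", model=model, tokenizer=tokenizer)


def handle(inputs: Input) -> Output:
    global predictor
    if predictor is None:
        predictor = get_pipeline(inputs.get_properties())
    if inputs.is_empty():
        return None  # warm-up request from the model server
    data = inputs.get_as_json()
    result = predictor(data["text"], **data.get("properties", {}))
    return Output().add_as_json(result)
```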

3. Create the tarball and upload it to an S3 location

Next, we package our artifacts as a *.tar.gz file and upload it to S3 for SageMaker to use for deployment.

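Packaging can be done with the standard tarfile module; the helper name and paths below are illustrative. Write the tarball outside the code folder so it is not archived into itself:

```python
import os
import tarfile


def make_tarball(code_dir: str, out_path: str) -> str:
    """Package the model artifacts as a gzipped tarball for SageMaker."""
    with tarfile.open(out_path, "w:gz") as tar:
        for name in os.listdir(code_dir):
            # arcname keeps paths relative so SageMaker unpacks them flat.
            tar.add(os.path.join(code_dir, name), arcname=name)
    return out_path
```

The resulting file can then be uploaded with the boto3 S3 client's upload_file, or with the SageMaker session's upload_data helper.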

4. Define a serving container, SageMaker Model and SageMaker endpoint

Now that we have uploaded the model artifacts to S3, we can create a SageMaker endpoint.

Define the serving container

Here we define the container to use for inference: SageMaker's Large Model Inference (LMI) container with Accelerate.


Create SageMaker model, endpoint configuration and endpoint.


Run Inference

Large models such as Falcon have a very high accelerator memory footprint, so a very large input payload or a long generated output can cause out-of-memory errors. The inference examples below are calibrated so that they work on the ml.g5.12xlarge instance within the SageMaker response time limit of 60 seconds. If you find that increasing the input length or generation length leads to CUDA out-of-memory errors, we recommend that you try one of the following solutions:

  • Use 8-bit quantization. In the model.py script, you can enable load_in_8bit=True in the call to AutoModelForCausalLM.from_pretrained. This reduces the memory footprint of the model on the GPUs, allowing for larger input and generation sizes.
    • Using 8-bit quantization may result in lower quality generated output.
  • Deploy to an instance with more GPUs, and/or GPUs with more memory. The ml.g5.48xlarge, ml.p4d.24xlarge, and ml.p4de.24xlarge instances are all good options here.

When attempting to generate more tokens, you might run into issues with the SageMaker runtime client timing out after 60 seconds. To get around this issue, we recommend that you check out our example for Large Language Models with streaming via pagination. You can find that example here. When using streaming, you still have to be conscious of memory constraints.

In the following example inference requests, we limit the sequence length such that we return a response within 60 seconds and don't exceed the GPU memory available.

You can pass additional generation arguments as part of the properties dictionary in the request (e.g. temperature, top_k etc.).


Clean Up


Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

CI badges for us-east-1, us-east-2, us-west-1, ca-central-1, sa-east-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, and ap-south-1 are rendered here in the notebook.