2c. Meta Llama2-7b LMI Autoscaling


Llama2-7b

In this notebook we create and deploy a Llama2-7b model using inference components on the endpoint you created in the first notebook. For this model we use the SageMaker Large Model Inference (LMI) container, with one GPU for each model copy of the inference component we create. After creating the inference component, we show you how to set auto scaling policies to manage the number of copies of your inference component. We also use managed instance scaling, which scales the number of instances in your endpoint in relation to your inference components. This is the fourth notebook in a series of five used to deploy a model against the endpoint you created in the first notebook. The last notebook will show you the other APIs available and clean up the artifacts created.


This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.



Tested using the Python 3 (Data Science) kernel on SageMaker Studio and conda_python3 kernel on SageMaker Notebook Instance.

License agreement

Install dependencies

Upgrade the SageMaker Python SDK.

[ ]

Import libraries

[ ]

Set configurations

REPLACE the endpoint_name value with the endpoint created in the first notebook.

[ ]

We begin by creating the objects we will need for our notebook: in particular, the boto3 clients we will use to interact with SageMaker, and other variables that will be referenced later in our notebook.

[ ]
[ ]
[ ]

Create a SageMaker-compatible model artifact and upload it to S3.

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or postprocessing of the model's predictions.

SageMaker needs the model artifacts to be in tarball format. In this example, we provide a single file: serving.properties.

Create serving.properties

This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.

Here is a list of settings that we use in this configuration file -

engine: The engine for DJL to use. In this case, we have set it to MPI.
option.model_id: The model id of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models), or an S3 path to the model artifacts.
option.tensor_parallel_degree: Set to the number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the number of workers per model that will be started when DJL Serving runs. For example, if we have a 4-GPU machine and are creating 4 partitions, then we will have 1 worker per model to serve the requests.

For more details and an exhaustive list of configuration options, refer to the documentation - https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.

[ ]
[ ]
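As an illustration, a serving.properties for this configuration might be generated as follows (the model id and tensor parallel degree are example values; the directory name is hypothetical):

```python
import tempfile
from pathlib import Path

# Example values: option.model_id can be a Hugging Face model id or an
# S3 path to the model artifacts; tensor_parallel_degree=1 matches one
# GPU per model copy.
serving_properties = "\n".join(
    [
        "engine=MPI",
        "option.model_id=meta-llama/Llama-2-7b-hf",
        "option.tensor_parallel_degree=1",
    ]
)

code_dir = Path(tempfile.mkdtemp()) / "code_llama2_7b"
code_dir.mkdir()
(code_dir / "serving.properties").write_text(serving_properties + "\n")
```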

Here we use the image URI for the DJL container.

[ ]
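A sketch of what a DJL LMI image URI looks like. The account id 763104351884 is the public registry for AWS Deep Learning Containers, but the tag below is only an example version; check the released DJL LMI images for your region, and note that notebooks typically retrieve this with the SageMaker SDK rather than hard-coding it:

```python
region = "us-west-2"

# Example tag -- consult the DJL LMI release notes for current versions.
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/"
    "djl-inference:0.26.0-deepspeed0.12.6-cu121"
)
```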

Create the Tarball and then upload to S3 location

[ ]
[ ]
[ ]
[ ]
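The packaging step can be sketched as follows (paths are placeholders; the upload call is shown as a comment because it requires AWS credentials):

```python
import tarfile
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
(workdir / "serving.properties").write_text("engine=MPI\n")

# SageMaker expects the model artifacts as a gzipped tarball.
tar_path = workdir / "model.tar.gz"
with tarfile.open(tar_path, "w:gz") as tar:
    tar.add(workdir / "serving.properties", arcname="serving.properties")

# Then upload to S3, e.g.:
# s3_code_artifact = sagemaker.Session().upload_data(str(tar_path), bucket, prefix)
```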

Create Inference Component

[ ]


[ ]
[ ]
[ ]
[ ]
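A sketch of a create_inference_component request. All names are hypothetical, and the compute requirements reflect one GPU per model copy as described earlier; the API call itself is commented out because it requires AWS credentials:

```python
# Hypothetical names -- substitute your own model and endpoint names.
inference_component_name = "llama2-7b-ic"

create_ic_request = {
    "InferenceComponentName": inference_component_name,
    "EndpointName": "<endpoint-from-first-notebook>",
    "VariantName": "AllTraffic",
    "Specification": {
        "ModelName": "<sagemaker-model-name>",
        "ComputeResourceRequirements": {
            # One GPU per model copy.
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 1024,  # example value
        },
    },
    "RuntimeConfig": {"CopyCount": 1},
}

# sagemaker_client.create_inference_component(**create_ic_request)
```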

Use Boto3 to invoke the endpoint.

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the results.

You can pass a prompt as input to the model by setting inputs to the prompt text. The model then returns a result for each prompt. The text generation can be configured using appropriate parameters, passed to the endpoint as a dictionary of kwargs. Refer to this documentation for more details - https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.

The code sample below illustrates invoking the endpoint with a text prompt and also sets some generation parameters.

Note that we also pass an InferenceComponentName parameter to determine which inference component the request should be directed to.

[ ]
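A sketch of the invocation payload (the prompt and generation parameters are examples; the runtime call is commented out because it requires a live endpoint):

```python
import json

# Example prompt and generation parameters -- see the GenerationConfig
# documentation for the full list of supported parameters.
payload = {
    "inputs": "Amazon SageMaker is",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7, "do_sample": True},
}
body = json.dumps(payload)

# response = sagemaker_runtime_client.invoke_endpoint(
#     EndpointName=endpoint_name,
#     InferenceComponentName=inference_component_name,
#     ContentType="application/json",
#     Body=body,
# )
# print(json.loads(response["Body"].read()))
```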

Scalable Target

Application Auto Scaling (AAS) creates two CloudWatch alarms for each autoscaling target:

  • one to trigger scale-out: 3 minutes (3 one-minute data points)
  • another one to trigger scale-in: 15 minutes (15 one-minute data points)

The time to trigger is usually 1 to 2 minutes longer than these because it takes time for the endpoint to publish metrics to CloudWatch, and it also takes time for AAS to react.

Application Auto Scaling

In the following cells we will go through how to use Application Auto Scaling to scale your inference component copies. In addition, please note that in our first notebook we enabled ManagedInstanceScaling. By doing this, SageMaker will automatically scale the instances behind your endpoint based on the needs of your inference components.

We can first start by setting the number of desired initial and max copies for an inference component. We will also specify a folder for our test results for our scaling test.

[ ]
[ ]

We can now set the values we will need to register a scalable target (in this case an inference component) with Application Auto Scaling.

[ ]
[ ]
[ ]
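The registration call can be sketched as follows. For inference components, the resource id takes the form inference-component/&lt;name&gt; and the scalable dimension is DesiredCopyCount; the capacity values and component name below are examples, and the API call is commented out because it requires AWS credentials:

```python
inference_component_name = "llama2-7b-ic"  # hypothetical

register_target_params = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": f"inference-component/{inference_component_name}",
    "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
    "MinCapacity": 1,  # example values -- match your initial/max copy counts
    "MaxCapacity": 4,
}

# aas_client.register_scalable_target(**register_target_params)
```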

Scaling Policy

Now that we have registered our scalable targets, we can specify a scaling policy for our target. NOTE: If the scale-out cooldown is shorter than the endpoint update time, it has no effect, as it is not possible to update a SageMaker endpoint that is already in "Updating" status.

[ ]
[ ]
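A target-tracking policy sketch using the predefined per-copy invocations metric. The policy name, target value, and cooldowns are example values to tune for your workload, and the API call is commented out because it requires AWS credentials:

```python
scaling_policy_params = {
    "PolicyName": "llama2-7b-ic-target-tracking",  # hypothetical
    "ServiceNamespace": "sagemaker",
    "ResourceId": "inference-component/llama2-7b-ic",
    "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "TargetValue": 4.0,      # example: target invocations per copy
        "ScaleInCooldown": 300,  # seconds; example values
        "ScaleOutCooldown": 300,
    },
}

# aas_client.put_scaling_policy(**scaling_policy_params)
```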

Run The Test

We can now run a test to observe the behavior of inference component scaling and managed instance scaling on SageMaker endpoints.

[ ]
[ ]
[ ]
[ ]

Cleanup

We can delete and deregister our scaling policy and targets with Application Auto Scaling.

[ ]
[ ]
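The cleanup calls can be sketched as follows (the names mirror the hypothetical values used earlier; the calls are commented out because they require AWS credentials):

```python
resource_id = "inference-component/llama2-7b-ic"  # hypothetical
scalable_dimension = "sagemaker:inference-component:DesiredCopyCount"

# Delete the scaling policy first, then deregister the scalable target:
# aas_client.delete_scaling_policy(
#     PolicyName="llama2-7b-ic-target-tracking",
#     ServiceNamespace="sagemaker",
#     ResourceId=resource_id,
#     ScalableDimension=scalable_dimension,
# )
# aas_client.deregister_scalable_target(
#     ServiceNamespace="sagemaker",
#     ResourceId=resource_id,
#     ScalableDimension=scalable_dimension,
# )
```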

That's it! You can now proceed to the last notebook, where we will show you some miscellaneous functions and clean up our resources.

Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

CI badges for the following regions were included here but did not render: us-east-1, us-east-2, us-west-1, ca-central-1, sa-east-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1.

[ ]