Open Llama 7b


OpenLLaMA 7B implementation using the LMI container on SageMaker

Model source: https://github.com/openlm-research/open_llama

Model download hub: https://huggingface.co/openlm-research/open_llama_7b

License: Apache-2.0

In this tutorial, you will bring your own container from Docker Hub to SageMaker and run inference with it. Please make sure the following permissions are granted before running the notebook:

  • ECR Push/Pull access
  • S3 bucket push access
  • SageMaker access

Attribution: this notebook is based on the content of https://github.com/deepjavalibrary/djl-demo/tree/master and was debugged with the help of lanking520.

Step 1: Let's upgrade SageMaker and import dependencies

[1]
Note: you may need to restart the kernel to use updated packages.
[ ]
[3]
[4]
arn:aws:iam::328296961357:role/service-role/AmazonSageMaker-ExecutionRole-20191125T182032 us-west-2 328296961357
[5]
'2.161.0'
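The original setup cells are not preserved in this export, but the output printed above (role ARN, region, account id) can be reproduced with a sketch like the following. The boto3/SageMaker SDK calls are shown as comments because they require AWS credentials; the ARN parsing is plain Python:

```python
# Sketch of the Step 1 setup (assumed, since the original cells are not shown).
# In a SageMaker notebook, the role and region typically come from:
#   import sagemaker, boto3
#   role = sagemaker.get_execution_role()
#   region = boto3.session.Session().region_name
# The account id can also be parsed out of the role ARN:

def account_id_from_arn(role_arn: str) -> str:
    """Extract the 12-digit account id from an IAM role ARN.

    ARN layout: arn:aws:iam::<account-id>:role/...
    """
    return role_arn.split(":")[4]

role = "arn:aws:iam::328296961357:role/service-role/AmazonSageMaker-ExecutionRole-20191125T182032"
print(account_id_from_arn(role))  # -> 328296961357
```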

Step 2: Pull the Docker image from Docker Hub and push it to an ECR repository (optional)

Note: you can either use a prebuilt container or run the cell below (change the cell type from 'raw' to 'code').

Note: please make sure your AWS credentials have permission to push to the ECR repository.

This process may take a while, depending on the container size and your network bandwidth.

Note: you only need to build this container once. Once you have pushed it to ECR, you can pull the image via

image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest"
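The pull/tag/push sequence itself lives in the raw cell, but the URI construction can be sketched as plain Python. The repository name `djl-serving` below is a placeholder, not necessarily the one used in this notebook:

```python
def ecr_image_uri(account_id: str, region: str, repo_name: str, tag: str = "latest") -> str:
    """Build a private ECR image URI.

    Standard layout: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"

# Roughly, the raw cell would then run (shell, not Python):
#   docker pull <source image on Docker Hub>
#   docker tag <source image> <image_uri>
#   aws ecr get-login-password | docker login ... && docker push <image_uri>
print(ecr_image_uri("328296961357", "us-west-2", "djl-serving"))  # repo name is hypothetical
```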

Step 3: Start preparing model artifacts

The LMI container expects a few artifacts to help set up the model:

  • serving.properties (required): defines the model server settings
  • model.py (optional): a Python file defining the core inference logic
  • requirements.txt (optional): any additional pip packages that need to be installed
[7]
Writing serving.properties
[8]
Writing model.py
[9]
Writing requirements.txt
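The three `%%writefile` cells above are not preserved in this export, so here is a hedged sketch of what the artifacts might contain. The property keys follow common LMI examples and the model id comes from the Hugging Face link above, but the actual values used in this notebook are unknown (model.py is omitted for brevity):

```python
import pathlib

# Illustrative artifact contents -- assumptions, not the notebook's actual cells.
SERVING_PROPERTIES = """\
engine=Python
option.model_id=openlm-research/open_llama_7b
option.tensor_parallel_degree=1
option.dtype=fp16
"""

REQUIREMENTS = "transformers\naccelerate\n"  # hypothetical extra packages

def write_artifacts(model_dir: str) -> list:
    """Write the LMI artifact files into model_dir and return the file names."""
    d = pathlib.Path(model_dir)
    d.mkdir(parents=True, exist_ok=True)
    (d / "serving.properties").write_text(SERVING_PROPERTIES)
    (d / "requirements.txt").write_text(REQUIREMENTS)
    # model.py (the custom inference handler) would be written the same way.
    return sorted(p.name for p in d.iterdir())

print(write_artifacts("mymodel"))
```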
[10]
mymodel/
mymodel/requirements.txt
mymodel/model.py
mymodel/serving.properties
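The tar listing above comes from packaging the artifact directory; a minimal sketch of that step (the S3 upload itself would use the SageMaker SDK, e.g. `sagemaker.Session().upload_data(...)`, which needs AWS credentials):

```python
import tarfile

def package_model(model_dir: str, out_path: str = "mymodel.tar.gz") -> str:
    """Create the tar.gz that SageMaker expects, mirroring the listing above.

    tar.add(model_dir) recursively adds the directory entry plus every
    file under it, producing member names like 'mymodel/serving.properties'.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(model_dir)
    return out_path
```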

Step 4: Start building SageMaker endpoint

In this step, we will build the SageMaker endpoint from scratch.

4.1 Upload artifacts to S3 and create the SageMaker model

[12]
S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-328296961357/large-model-lmi/code/mymodel.tar.gz
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118

4.2 Create SageMaker endpoint

You need to specify the instance type to use and the endpoint name.

[13]
--------------!
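The deployment cell above is not shown in this export; a sketch of the corresponding boto3 `create_endpoint_config` request body might look like this. The instance type and variant name are placeholders, not necessarily what the notebook used:

```python
def build_endpoint_config(endpoint_config_name: str, model_name: str,
                          instance_type: str = "ml.g5.2xlarge") -> dict:
    """Request body for sagemaker_client.create_endpoint_config (boto3).

    instance_type is a placeholder; pick one with enough GPU memory
    for a 7B model.
    """
    return {
        "EndpointConfigName": endpoint_config_name,
        "ProductionVariants": [{
            "VariantName": "variant1",  # name is an assumption
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            # Large-model downloads are slow; a generous startup health-check
            # grace period is commonly set for LMI deployments.
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        }],
    }
```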

Step 5a: Test and benchmark inference latency

Latency depends heavily on the 'max_new_tokens' parameter.

[14]
2.2340340614318848
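The timing cell can be sketched as follows. The request schema (`inputs` / `parameters.max_new_tokens`) follows the usual LMI text-generation contract; the real call would be the sagemaker-runtime client's `invoke_endpoint`, abstracted here behind a callable so the timing logic stands alone:

```python
import json
import time

def build_payload(prompt: str, max_new_tokens: int = 64) -> bytes:
    """Typical LMI text-generation request body (field names assumed)."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode("utf-8")

def timed_invoke(invoke_fn, payload: bytes) -> float:
    """Return seconds taken by one invocation.

    In the real notebook, invoke_fn would wrap
    smr_client.invoke_endpoint(EndpointName=..., Body=payload, ...).
    """
    start = time.time()
    invoke_fn(payload)
    return time.time() - start
```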

Let us define a helper function to plot a histogram of the invocation latency distribution.

[15]
Matplotlib is building the font cache; this may take a moment.
[16]
100%|██████████| 10/10 [01:53<00:00, 11.35s/it]
Output
114.2704861164093
CPU times: user 258 ms, sys: 39.5 ms, total: 298 ms
Wall time: 1min 54s
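Alongside the histogram, it can be useful to summarize a benchmark run with percentiles. A small pure-Python helper (nearest-rank method; this is an illustration, not necessarily what the original cell computed):

```python
def latency_summary(latencies: list) -> dict:
    """p50/p90/p99 of a list of per-call latencies in seconds.

    Uses the nearest-rank method, which is adequate for small
    benchmark samples like the 10-iteration run above.
    """
    xs = sorted(latencies)
    def pct(p: float):
        k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
        return xs[k]
    return {"p50": pct(50), "p90": pct(90), "p99": pct(99)}

# The histogram itself would then be plt.hist(latencies) with matplotlib,
# as in the helper cell above.
```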
[17]
open-llama-lmi-model-2023-06-02-00-16-24-723
us-west-2

Step 5b: Analyze Inference Latency via CloudWatch

[18]
[19]
[20]
2023-06-02 00:26:07.841647
2023-06-02 00:23:13.571161
[21]
[22]
Output
[23]
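The CloudWatch cells above likely query endpoint metrics via `get_metric_statistics`. Here is a sketch of the request kwargs: the `AWS/SageMaker` namespace and `ModelLatency` metric are documented SageMaker endpoint metrics, while the `AllTraffic` variant name is an assumption about this deployment:

```python
def model_latency_query(endpoint_name: str, start, end, period: int = 60) -> dict:
    """Request kwargs for cloudwatch_client.get_metric_statistics (boto3).

    start/end are datetime.datetime objects bracketing the benchmark run.
    """
    return dict(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",  # reported in microseconds
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},  # assumed variant name
        ],
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=["Average", "Maximum"],
    )
```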

Clean up the environment

[ ]
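The empty clean-up cell would typically delete the endpoint, its endpoint config, and the model. A sketch with the boto3 SageMaker client passed in explicitly (the delete calls are standard SageMaker client methods):

```python
def cleanup(sm_client, endpoint_name: str, endpoint_config_name: str, model_name: str) -> None:
    """Tear down the resources created in Step 4 to stop incurring charges.

    sm_client is a boto3 SageMaker client, e.g. boto3.client("sagemaker").
    """
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```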