Lab 6: Token Streaming with EleutherAI GPT-J-6B using LMI


Serve and stream tokens from EleutherAI's gpt-j-6b model hosted on Amazon SageMaker using the LMI (Large Model Inference) DJL-based container.


This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

(CI test badge: us-west-2)


Recommended kernel(s): This notebook can be run with any Amazon SageMaker Studio kernel.

This notebook focuses on deploying the EleutherAI/gpt-j-6b HuggingFace model to an Amazon SageMaker endpoint for a text generation task. In this example, you will use the SageMaker-managed LMI (Large Model Inference) Docker image (an Amazon SageMaker Deep Learning Container, DLC) as the inference image. The LMI image features a DJL Serving stack powered by the Deep Java Library.

Once the model has been deployed, you will submit text generation requests and get a streamed response in return using Amazon SageMaker Runtime's native response streaming capability.

Notice: The model artifacts (checkpoints, configuration, etc.) are not downloaded from the HuggingFace Hub but from an Amazon S3 bucket managed by AWS.

In this notebook, we make extensive use of the higher-level abstractions provided by the sagemaker Python SDK, delegating to it the management of as many resources and configuration details as possible. This demonstrates that LLMs can be deployed to Amazon SageMaker simply and with a minimal amount of code.

You will successively deploy the EleutherAI/gpt-j-6b model twice using the HuggingFace Accelerate engine on a ml.g5.2xlarge GPU instance (1 device with 24 GiB of device memory):

  • Once without writing any custom server-side Python handler script and therefore leveraging the fact that the default Python handlers of the LMI DLC natively support streaming for the HuggingFace Accelerate engine (among others).
  • Once with a custom server-side Python handler script.

Notice that when using the default handlers, streaming cannot be disabled once the endpoint has been deployed with streaming_enabled set to True, i.e. the endpoint can only be invoked using sagemaker:InvokeEndpointWithResponseStream (and not sagemaker:InvokeEndpoint). On the other hand, when implementing a custom handler script, we will be able to choose between streaming our responses or not on a per-request basis.

Notices:

  • Make sure that the ml.g5.2xlarge instance type is available in your AWS Region.
  • Make sure that the value of your "ml.g5.2xlarge for endpoint usage" Amazon SageMaker service quota allows you to deploy one Endpoint using this instance type.

Additional resources

License agreement

  • This model and the dataset it has been trained on are both under the Apache 2.0 license.
  • This notebook is a sample notebook and not intended for production use.

1. Execution environment setup

1.1. Dependencies installation

This notebook requires the following third-party Python dependencies:

  • AWS boto3. Since the distribution must support the Amazon SageMaker Runtime streaming feature, the minimum boto3 version is 1.28.39 (botocore: 1.31.39).
  • AWS sagemaker. Since we use version 0.23.0 of the DJL LMI DLC, the minimum SDK version is 2.173.0.

Let's install or upgrade these dependencies.
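As a sketch, the install step and a minimal programmatic check of the minimum versions quoted above could look like the following (the version-comparison helper is stdlib-only and is an illustration, not the notebook's original cell):

```python
# In the notebook, a cell along these lines installs/upgrades the packages:
#   %pip install --upgrade "boto3>=1.28.39" "botocore>=1.31.39" "sagemaker>=2.173.0"

def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if a dotted version string satisfies a minimum version."""
    def parse(v):
        return tuple(int(part) for part in v.split("."))
    return parse(version) >= parse(minimum)

# Minimum versions required for SageMaker Runtime response streaming
# and for the 0.23.0 DJL LMI DLC (from the requirements above):
REQUIREMENTS = {"boto3": "1.28.39", "botocore": "1.31.39", "sagemaker": "2.173.0"}
```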

1.2. Imports & global variables assignment


1.3. Utilities


2. Deployment to a SageMaker Endpoint using a SageMaker LMI Docker image and the HuggingFace Accelerate engine

Start-up of LLM inference containers can take longer than for smaller models, mainly because of longer model download and loading times. Timeout values need to be increased accordingly from their defaults. Each endpoint deployment takes a few minutes.
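As an illustrative sketch of these timeout overrides (the parameter names follow the sagemaker SDK's Model.deploy signature; the values are assumptions, not tuned recommendations):

```python
# Illustrative deploy-time timeout overrides for a large model.
# Parameter names follow sagemaker's Model.deploy; the values are assumptions.
deploy_kwargs = {
    "instance_type": "ml.g5.2xlarge",
    "initial_instance_count": 1,
    # Give the container extra time to download and load the fp16 checkpoint:
    "model_data_download_timeout": 900,              # seconds
    "container_startup_health_check_timeout": 900,   # seconds
}
```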


2.1. Inference using the default HuggingFace Accelerate handler

In this section, you deploy the EleutherAI/gpt-j-6b model to a SageMaker endpoint consisting of a single ml.g5.2xlarge instance, using the HuggingFace Accelerate handler as the inference engine (referred to as the Python engine in the DJL Serving general settings). The chosen precision is FP16 (the model's native precision).

Each engine has a corresponding sagemaker.model.Model class. In the present case, you will use the sagemaker.djl_inference.HuggingFaceAccelerateModel class. The model server configuration is generated by the HuggingFaceAccelerateModel class from the arguments passed to its constructor and from an optional, already-existing serving.properties file.

Since the HuggingFace Accelerate default handler script natively supports response streaming, we do not implement any custom server-side handler script.


Notice that the HuggingFace Accelerate (and DeepSpeed) default DJL handlers support two streaming modes:

  • A legacy streaming mode (option.enable_streaming=true)
  • A recommended alternative based on HuggingFace's streamers (option.enable_streaming=huggingface)
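A hypothetical serving.properties illustrating these settings (the field names come from the DJL Serving configuration; the exact values, including the model location, are placeholders):

```properties
# Hypothetical serving.properties for the Python (HuggingFace Accelerate) engine
engine=Python
option.model_id=EleutherAI/gpt-j-6b
option.dtype=fp16
# Streaming mode: "true" (legacy) or "huggingface" (streamer-based, recommended)
option.enable_streaming=huggingface
```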

Notices:

  • Requests with response streaming currently do not support multiple input prompts.
  • The Predictor object returned by the deploy method is currently not capable of invoking the endpoint it is tied to with response streaming. We therefore use the lower-level boto3 client to invoke the endpoint.
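To sketch the lower-level boto3 path (the endpoint name and payload fields below are assumptions; the event shape follows the sagemaker-runtime invoke_endpoint_with_response_stream response, whose Body yields events carrying PayloadPart bytes):

```python
def iter_stream_chunks(event_stream):
    """Decode text chunks from a SageMaker response event stream.

    Each event is a dict such as {"PayloadPart": {"Bytes": b"..."}},
    as yielded by invoke_endpoint_with_response_stream(...)["Body"].
    """
    for event in event_stream:
        payload = event.get("PayloadPart")
        if payload:
            yield payload["Bytes"].decode("utf-8")

# Hypothetical invocation (requires a deployed endpoint and AWS credentials):
# import boto3, json
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint_with_response_stream(
#     EndpointName="my-gpt-j-6b-endpoint",          # assumed name
#     ContentType="application/json",
#     Body=json.dumps({"inputs": "Once upon a time"}),
# )
# for chunk in iter_stream_chunks(response["Body"]):
#     print(chunk, end="", flush=True)
```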

Now let's delete the endpoint to redeploy the model with a custom server-side handler script.


2.2. Inference using a custom server-side handler script

In this section, you will redeploy the same model but first, you will add a custom Python server-side handler script to the code artifacts that are to be deployed to the container (gathered in the source_dir / SOURCE_DIR_ACCELERATE directory).

The custom handler script below allows streaming to be enabled or disabled on a per-request basis. The default behavior is set by the option.enable_streaming field (here set to true) in the model server's configuration file, serving.properties.

The custom handler script showcases the main differences between streaming a response from the LMI container and sending the full generated sequence at once:

  • We use a streamer object that implements the interface defined by transformers.TextStreamer, such as djl_python.streaming_utils.HFStreamer or transformers.generation.streamers.TextIteratorStreamer. The streamer object uses the model's tokenizer to decode the generated token IDs before pushing them to the streamer's internal queue.
  • Instead of adding the result to the djl_python.Output object, we attach the streamer object using its add_stream_content method.
  • Generation is executed in a background thread. The streamer object is passed to the generate method together with the GenerationConfig, and the generation logic uses it to post tokens to the streamer's queue. On the consumer side, since the Output object has access to the streamer object, it can retrieve the tokens from the queue and dispatch them to the client.
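The producer/consumer pattern described above can be sketched with the standard library alone. This is a simplified stand-in for the transformers-style streamer interface, not the actual DJL handler code:

```python
import threading
import queue

class MiniStreamer:
    """Simplified stand-in for a transformers-style token streamer:
    the generation thread puts decoded tokens; the consumer iterates."""
    _SENTINEL = object()

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):           # called by the generation loop
        self._queue.put(text)

    def end(self):                 # called when generation finishes
        self._queue.put(self._SENTINEL)

    def __iter__(self):            # consumed by the response-dispatch side
        while True:
            item = self._queue.get()
            if item is self._SENTINEL:
                return
            yield item

def fake_generate(prompt, streamer):
    # Stands in for model.generate(..., streamer=streamer) run in a thread.
    for word in prompt.split():
        streamer.put(word + " ")
    streamer.end()

streamer = MiniStreamer()
worker = threading.Thread(target=fake_generate, args=("streaming token demo", streamer))
worker.start()
generated = "".join(streamer)  # tokens arrive while the worker is still running
worker.join()
```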

The custom handler script allows us either to stream the response tokens (the default behavior), i.e. invoke the endpoint using sagemaker:InvokeEndpointWithResponseStream, or to disable streaming at the request level, i.e. invoke the endpoint using sagemaker:InvokeEndpoint. Let's first invoke the endpoint with the streaming feature enabled.


Now let's add a stream_response: False entry to our request parameters so that the endpoint can be invoked using the sagemaker:InvokeEndpoint API call, and let's use the Predictor object returned by Model.deploy to perform this call.
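For illustration, the two request bodies differ only in that flag. The stream_response field comes from the custom handler described above; the other field names and values here are assumptions:

```python
import json

# Baseline streaming request (streaming is the custom handler's default).
streaming_request = {
    "inputs": "Tell me a short story.",
    "parameters": {"max_new_tokens": 64},   # illustrative generation settings
}

# Same request with per-request streaming disabled, suitable for a plain
# sagemaker:InvokeEndpoint call made through the Predictor object.
non_streaming_request = {**streaming_request, "stream_response": False}

body = json.dumps(non_streaming_request)
```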


Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

(CI test badges: us-east-1, us-east-2, us-west-1, ca-central-1, sa-east-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1)