Lab 6: Token streaming with EleutherAI GPT-J-6B using LMI
Serve and stream tokens from EleutherAI's gpt-j-6b hosted on Amazon SageMaker using the LMI (Large Model Inference) DJL-based container.
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Recommended kernel(s): This notebook can be run with any Amazon SageMaker Studio kernel.
This notebook focuses on deploying the EleutherAI/gpt-j-6b HuggingFace model to an Amazon SageMaker endpoint for a text generation task. In this example, you will use the SageMaker-managed LMI (Large Model Inference) Docker image (an Amazon SageMaker Deep Learning Container - DLC) as the inference image. LMI images feature a DJL serving stack powered by the Deep Java Library.
Once the model has been deployed, you will submit text generation requests and get a streamed response in return using Amazon SageMaker Runtime's native response streaming capability.
Notice: The model artifacts (checkpoints, configuration, etc.) are not downloaded from the HuggingFace Hub but from an Amazon S3 bucket managed by AWS.
In this notebook, we make extensive use of the higher-level abstractions provided by the sagemaker Python SDK, to which we delegate the management of as many resources and configurations as possible, demonstrating that LLMs can be deployed to Amazon SageMaker with great simplicity and a minimal amount of code.
You will deploy the EleutherAI/gpt-j-6b model twice, each time using the HuggingFace Accelerate engine on an ml.g5.2xlarge GPU instance (1 device with 24 GiB of device memory):
- Once without writing any custom server-side Python handler script and therefore leveraging the fact that the default Python handlers of the LMI DLC natively support streaming for the HuggingFace Accelerate engine (among others).
- Once with a custom server-side Python handler script.
Notice that when using the default handlers, streaming cannot be disabled once the endpoint has been deployed with streaming_enabled set to True, i.e. the endpoint can only be invoked using sagemaker:InvokeEndpointWithResponseStream (and not sagemaker:InvokeEndpoint). On the other hand, when implementing a custom handler script, we will be able to choose between streaming our responses or not on a per-request basis.
Notices:
- Make sure that the `ml.g5.2xlarge` instance type is available in your AWS Region.
- Make sure that the value of your "ml.g5.2xlarge for endpoint usage" Amazon SageMaker service quota allows you to deploy one endpoint using this instance type.
Additional resources
- AWS Machine Learning blog - Elevating the generative AI experience: Introducing streaming support in Amazon SageMaker hosting
- Amazon SageMaker docs - Invoke real-time endpoints
License agreement
- This model and the dataset it has been trained on are both under the Apache 2.0 license.
- This notebook is a sample notebook and not intended for production use.
1. Execution environment setup
1.1. Dependencies installation
This notebook requires the following third-party Python dependencies:
- AWS `boto3`. Since the distribution must support the Amazon SageMaker Runtime streaming feature, the minimal `boto3` (resp. `botocore`) version is `1.28.39` (resp. `1.31.39`).
- AWS `sagemaker`. Since we use version 0.23.0 of the DJL LMI DLC, the minimal SDK version is `2.173.0`.
Let's install or upgrade these dependencies using the following commands:
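A minimal sketch of such an installation cell, assuming a Jupyter environment where the `%pip` magic is available, could look like this (version pins taken from the requirements above):

```python
# Install or upgrade the third-party dependencies listed above.
# The minimum versions come from the requirements stated in this section.
%pip install --upgrade --quiet "boto3>=1.28.39" "botocore>=1.31.39" "sagemaker>=2.173.0"
```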
1.2. Imports & global variables assignment
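As an illustration, the imports and global variables used throughout this notebook could look like the following sketch; the variable names are placeholders and not necessarily those of the original notebook:

```python
import json

import boto3
import sagemaker

# SageMaker session, execution role, Region and default S3 bucket
session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name
bucket = session.default_bucket()

# Low-level SageMaker Runtime client, used later for streaming invocations
smr_client = boto3.client("sagemaker-runtime")

# Instance type used for both deployments in this notebook
INSTANCE_TYPE = "ml.g5.2xlarge"
```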
1.3. Utilities
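One utility typically needed here is a small helper that iterates over the `PayloadPart` events returned by `invoke_endpoint_with_response_stream` and yields the streamed chunks as text. The sketch below is one possible implementation (not the exact helper from the original lab); a production-grade helper would also buffer partial lines and multi-byte characters split across payload parts.

```python
class TokenStreamIterator:
    """Iterate over the chunks of a SageMaker Runtime response stream.

    Each event in the stream carries a 'PayloadPart' whose 'Bytes' field
    contains a piece of the generated output. The exact chunk format
    (raw text or JSON lines) depends on the server-side handler.
    """

    def __init__(self, event_stream):
        self._events = iter(event_stream)

    def __iter__(self):
        return self

    def __next__(self):
        # Skip events that do not carry a payload part; stream errors are
        # raised by botocore itself while iterating.
        for event in self._events:
            part = event.get("PayloadPart")
            if part is not None:
                return part["Bytes"].decode("utf-8")
        raise StopIteration
```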
2. Deployment to a SageMaker Endpoint using a SageMaker LMI Docker image and the HuggingFace Accelerate engine
Start-up of LLM inference containers can take longer than for smaller models, mainly because of longer model download and loading times. Timeout values therefore need to be increased from their defaults. Each endpoint deployment takes a few minutes.
2.1. Inference using the default HuggingFace Accelerate handler
In this section, you deploy the EleutherAI/gpt-j-6b model to a SageMaker endpoint consisting of a single ml.g5.2xlarge instance. The inference engine used by the DJL Serving stack is HuggingFace Accelerate (referred to as the Python engine in the DJL Serving general settings). The chosen precision is FP16 (native precision).
Each engine has a dedicated `sagemaker.model.Model` subclass. In the present case, you will use the `sagemaker.djl_inference.HuggingFaceAccelerateModel` class. The model server configuration is generated by the `HuggingFaceAccelerateModel` class from the arguments passed to its constructor and from an optional, already-existing `serving.properties` file.
Since the HuggingFace Accelerate default handler script natively supports response streaming, we do not implement any custom server-side handler script.
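As a sketch of what this deployment could look like (the exact constructor arguments depend on your sagemaker SDK version and on where the model artifacts live; the values below are illustrative, and `role` comes from the setup section):

```python
from sagemaker.djl_inference import HuggingFaceAccelerateModel

# model_id can be a HuggingFace Hub id or an S3 URI pointing to the
# (AWS-managed) model artifacts; the Hub id is used here as a placeholder.
accelerate_model = HuggingFaceAccelerateModel(
    model_id="EleutherAI/gpt-j-6b",
    role=role,
    dtype="fp16",                  # native precision
    task="text-generation",
    number_of_partitions=1,        # single GPU on ml.g5.2xlarge
)

# Streaming itself is enabled through the generated serving.properties
# (option.enable_streaming), not through a deploy() argument.
predictor = accelerate_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    # LLM containers take longer to download and load the model: raise the timeout.
    container_startup_health_check_timeout=900,
)
```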
Notice that the HuggingFace Accelerate (and DeepSpeed) default DJL handlers support two streaming modes:
- A legacy streaming mode (`option.enable_streaming=true`)
- A recommended alternative based on HuggingFace's streamers (`option.enable_streaming=huggingface`)
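For reference, a minimal `serving.properties` selecting the recommended streamer-based mode could be written from the notebook as in the sketch below; the directory name and the property values other than `option.enable_streaming` are illustrative:

```python
# Write an illustrative serving.properties for the Python (HuggingFace Accelerate) engine.
serving_properties = """engine=Python
option.task=text-generation
option.dtype=fp16
# option.model_id=<HF Hub id or S3 URI of the model artifacts>
option.enable_streaming=huggingface
"""

with open("serving.properties", "w") as f:
    f.write(serving_properties)
```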
Notices:
- Requests with response streaming currently do not support multiple input prompts.
- The `Predictor` object returned by the `deploy` method is currently not capable of invoking the endpoint it is tied to with response streaming. We therefore use the lower-level `boto3` client to invoke the endpoint, as shown in the sketch below.
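Assuming the helper from section 1.3 and the `smr_client` and `predictor` objects defined earlier, a streaming invocation could look like the following sketch (the payload schema depends on the handler; the field names below are illustrative):

```python
payload = {
    "inputs": "The diamondback terrapin was the first reptile to",
    "parameters": {"max_new_tokens": 128, "do_sample": True, "temperature": 0.7},
}

response = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Print the generated text chunk by chunk as the tokens are streamed back.
for chunk in TokenStreamIterator(response["Body"]):
    print(chunk, end="", flush=True)
```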
Now let's delete the endpoint to redeploy the model with a custom server-side handler script.
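A minimal clean-up cell could look like this:

```python
# Delete the model and the endpoint (and its endpoint configuration) before redeploying.
predictor.delete_model()
predictor.delete_endpoint()
```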
2.2. Inference using a custom server-side handler script
In this section, you will redeploy the same model but first, you will add a custom Python server-side handler script to the code artifacts that are to be deployed to the container (gathered in the source_dir / SOURCE_DIR_ACCELERATE directory).
The custom handler script below allows you to enable or disable streaming on a per-request basis. The default behavior is set by setting the `option.enable_streaming` field to `true` in the model server's configuration file `serving.properties`.
The custom handler script showcases the main differences between streaming the response from the LMI container and sending the full generated sequences at once (see the sketch after the list below):
- We use a streamer object which implements the interface defined by `transformers.TextStreamer`, such as `djl_python.streaming_utils.HFStreamer`. The streamer object uses the model's tokenizer to decode the generated token IDs before pushing them to the streamer's internal queue. The `transformers.generation.streamers.TextIteratorStreamer` streamer can be used as an alternative.
- Instead of adding the result to the `djl_python.Output` object, we attach the streamer object using its `add_stream_content` method.
- Generation is executed in a background thread. The streamer object is passed to the `generate` method together with the `GenerationConfig`. The generation logic then uses the streamer to post tokens to the streamer's queue. On the other side, since the `Output` object has access to the streamer object, it can retrieve the tokens from its queue and dispatch them to the client.
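The following is a condensed sketch of what such a handler (`model.py`) could look like. It is not the exact script from the lab: the `stream_response` request field name matches the one used later in this notebook, but the helper names, the payload schema and the model-loading details are assumptions.

```python
# model.py -- illustrative sketch of a DJL Python handler with optional streaming
from threading import Thread

import torch
from djl_python import Input, Output
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          GenerationConfig, TextIteratorStreamer)

model = None
tokenizer = None


def load_model(properties):
    """Load the model and tokenizer from the location given in the serving properties."""
    global model, tokenizer
    model_location = properties.get("model_id", properties.get("model_dir"))
    tokenizer = AutoTokenizer.from_pretrained(model_location)
    model = AutoModelForCausalLM.from_pretrained(
        model_location, torch_dtype=torch.float16, device_map="auto"
    )


def handle(inputs: Input) -> Output:
    if model is None:
        load_model(inputs.get_properties())
    if inputs.is_empty():
        # Warm-up / ping request
        return None

    data = inputs.get_as_json()
    prompt = data["inputs"]
    params = data.get("parameters", {})
    # Per-request streaming toggle (field name assumed for this sketch)
    stream = data.get("stream_response", True)

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    generation_config = GenerationConfig(**params)
    output = Output()

    if stream:
        # Streaming path: the streamer decodes token ids into text chunks and
        # queues them; generation runs in a background thread; the Output object
        # drains the streamer's queue to dispatch chunks to the client.
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
        Thread(
            target=model.generate,
            kwargs={
                "inputs": input_ids,
                "generation_config": generation_config,
                "streamer": streamer,
            },
        ).start()
        output.add_stream_content(streamer)
    else:
        # Non-streaming path: return the full generated sequence at once.
        generated = model.generate(inputs=input_ids, generation_config=generation_config)
        text = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
        output.add_as_json({"generated_text": text})

    return output
```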
The custom handler script allows you to either stream the response tokens (the default behavior), i.e. invoke the endpoint using `sagemaker:InvokeEndpointWithResponseStream`, or to disable streaming at the request level, i.e. invoke the endpoint using `sagemaker:InvokeEndpoint`. Let's first invoke the endpoint with the streaming feature enabled.
Now let's add a `stream_response: False` entry to our request parameters so that our endpoint can be invoked using the `sagemaker:InvokeEndpoint` API call, and let's use the `Predictor` object returned by `Model.deploy` to perform this call.
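For instance, such a non-streaming request through the `Predictor` could look like the sketch below (payload field names other than `stream_response` are illustrative; the DJL predictor serializes and deserializes JSON by default):

```python
payload = {
    "inputs": "The diamondback terrapin was the first reptile to",
    "parameters": {"max_new_tokens": 128},
    # Ask the custom handler to return the full sequence instead of a stream.
    "stream_response": False,
}

# Single InvokeEndpoint call returning the complete generated text.
result = predictor.predict(payload)
print(result)
```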
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.