NoCode SD21 INF2
Deploy Stable Diffusion on an AWS Inferentia2 custom chip with SageMaker and LMI containers using the Neuron compiler
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
In this notebook we will host Stable Diffusion on SageMaker using LMI containers
In this notebook, we explore how to host a large model on SageMaker using the Large Model Inference (LMI) container that is optimized for hosting large models using DJLServing. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post.
The world of artificial intelligence (AI) and machine learning (ML) has been witnessing a paradigm shift with the rise of generative AI models that can create human-like text, images, code, and audio. Compared to classical ML models, generative AI models are significantly bigger and more complex. However, their increasing complexity also comes with high costs for inference and a growing need for powerful compute resources. The high cost of inference for generative AI models can be a barrier to entry for businesses and researchers with limited resources, creating a need for more efficient and cost-effective solutions. Furthermore, the majority of generative AI use cases involve human interaction or real-world scenarios, necessitating hardware that can deliver low-latency performance. AWS has been innovating with purpose-built chips to address the growing need for powerful, efficient, and cost-effective compute hardware.
This notebook was tested on an inf2.8xlarge instance.
Model license: by using this model, please review and agree to the license at https://huggingface.co/stabilityai/stable-diffusion-2/blob/main/LICENSE-MODEL
Overview of ml.trn1 and ml.inf2 instances
ml.trn1 instances are powered by the AWS Trainium accelerator, which is purpose-built for high-performance deep learning training of generative AI models, including LLMs. However, these instances also support inference workloads for models that are even larger than what fits into Inf2. The largest instance size, trn1.32xlarge, features 16 Trainium accelerators with 512 GB of accelerator memory in a single instance, delivering up to 3.4 petaflops of FP16/BF16 compute power. The 16 Trainium accelerators are connected with ultra-high-speed NeuronLink-v2 for streamlined collective communications.
ml.inf2 instances are powered by the AWS Inferentia2 accelerator, a purpose-built accelerator for inference. It delivers three times higher compute performance, up to four times higher throughput, and up to 10 times lower latency compared to first-generation AWS Inferentia. The largest instance size, inf2.48xlarge, features 12 AWS Inferentia2 accelerators with 384 GB of accelerator memory in a single instance for a combined compute power of 2.3 petaflops of BF16/FP16, and it enables you to deploy up to a 175-billion-parameter model in a single instance. For ultra-large models that don't fit into a single accelerator, data flows directly between accelerators with NeuronLink, bypassing the CPU completely. Inf2 is the only inference-optimized instance to offer this interconnect, a feature that is otherwise only available in more expensive training instances. With NeuronLink, Inf2 supports faster distributed inference and improves throughput and latency.
Both AWS Inferentia2 and Trainium accelerators have two NeuronCores-v2, 32 GB HBM memory stacks, and dedicated collective-compute engines, which automatically optimize runtime by overlapping computation and communication when doing multi-accelerator inference. For more details on the architecture, refer to Trainium and Inferentia devices.
For more details, refer to the Neuron Docs.

Create a SageMaker Model for Deployment
As a first step, we'll import the relevant libraries and configure several global variables, such as the hosting image that will be used and the S3 location of our model artifacts.
If you are running this notebook from outside of AWS
Please configure the appropriate access keys and the role you would like to assume, and ensure that you have access to that role.
Define the S3 prefix for the BF16 weights and the code prefix where the code artifacts will go in S3.
Use boto3
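As a minimal sketch of this setup cell (the bucket and prefix names below are placeholders, and the role lookup assumes the notebook runs with SageMaker permissions; adjust if running outside of AWS):

```python
import boto3
import sagemaker

# Create the boto3 session and the SageMaker clients used throughout the notebook
boto_session = boto3.session.Session()
sm_client = boto_session.client("sagemaker")
smr_client = boto_session.client("sagemaker-runtime")
region = boto_session.region_name

# Execution role used by the SageMaker endpoint
role = sagemaker.get_execution_role()

# Placeholder S3 locations for the code artifact and the pre-compiled BF16 weights
s3_bucket = sagemaker.session.Session(boto_session).default_bucket()
s3_code_prefix = "stable-diffusion-inf2/code"            # where model.tar.gz will be uploaded
s3_model_prefix = "stable-diffusion-inf2/compiled-bf16"  # where the compiled weights live
```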
Part 2 - Create the model.tar.gz
This file is the custom inference script for generating images. The model weights have been compiled for specific hardware, as explained below. For convenience, the compiled weights are available in a public S3 location for easy reference. However, it is important to note that DJL comes with pre-built handlers, which can be found here: Default handlers.
Model weights
This notebook leverages the compiled weights for the Hugging Face Stable Diffusion 2.1 (512x512) model for accelerated inference on Neuron. For Stable Diffusion 768x768, please see the notebook named hf_pretrained_sd2_<image_size>_inference.
Some important points for compiling the model
Please refer to the model compilation notebooks on GitHub; the notebooks are organized by image dimensions and compile parts of the Stable Diffusion pipeline for execution on Neuron. Note that this only needs to be done once: after you have compiled and saved the model by running the following section of code, you can reuse it any number of times without having to recompile. We will compile each part into an optimized TorchScript module and save it. In particular, we will compile:
- The CLIP text encoder;
- The VAE decoder;
- The UNet, and
- The VAE_post_quant_conv

These blocks are chosen because they represent the bulk of the compute in the pipeline, and performance benchmarking has shown that running them on Neuron yields a significant performance benefit.
Several points worth noting are:
- In order to save RAM (these compiles need lots of RAM!), before tracing each model we make a deepcopy of the part of the pipeline (i.e. the UNet or the VAE decoder) that is to be traced, and then delete the pipeline object from memory with `del pipe`. This trick allows the compile to succeed on instance types with a smaller amount of RAM.
- When compiling each part of the pipeline, we need to pass `torch_neuronx.trace` sample input(s). When there are multiple inputs, they are passed together as a tuple. For details on how to use `torch_neuronx.trace`, please refer to our documentation here: Trace Neuron Compilers
- While compiling the UNet, we make use of the double-wrapper structure defined above. In addition, we also use the optimized `get_attention_scores` function to replace the original `get_attention_scores` function in the `CrossAttention` class.
- The following section defines some utility classes and functions. In particular, we define a double-wrapper for the UNet and another wrapper for the text encoder. These wrappers enable `torch_neuronx.trace` to trace the wrapped models for compilation with the Neuron compiler.
- In addition, the `get_attention_scores` utility function performs an optimized attention score calculation and is used to replace the original `get_attention_scores` function in the `diffusers` package via a monkey patch. A minimal sketch of the wrap-then-trace pattern is shown after this list.
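To make the wrap-then-trace pattern concrete, here is a minimal sketch using the text encoder as an example; the wrapper class, model id, and sample input shape are illustrative assumptions rather than the exact code from the compilation notebooks:

```python
import copy
import os

import torch
import torch_neuronx
from diffusers import StableDiffusionPipeline


class TextEncoderWrapper(torch.nn.Module):
    """Illustrative wrapper that returns only the hidden-state tensor so the module can be traced."""

    def __init__(self, text_encoder):
        super().__init__()
        self.text_encoder = text_encoder

    def forward(self, input_ids):
        return [self.text_encoder(input_ids)["last_hidden_state"]]


pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float32
)

# Deepcopy the part to trace, then free the rest of the pipeline to save RAM
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe

# Sample input: token ids padded to the CLIP maximum sequence length (77)
sample_input_ids = torch.zeros((1, 77), dtype=torch.int64)

# Compile to an optimized TorchScript module with the Neuron compiler and save it
os.makedirs("compiled_model", exist_ok=True)
traced_text_encoder = torch_neuronx.trace(TextEncoderWrapper(text_encoder), sample_input_ids)
torch.jit.save(traced_text_encoder, "compiled_model/text_encoder.pt")
```

The UNet and VAE parts follow the same pattern with their own wrappers and sample inputs.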
Environment for compilations
Following the Inf2 setup, you will find a virtual environment pre-created with the following pip packages installed:
- torch-neuronx
- neuronx-cc
- diffusers==0.14.0
- transformers==4.26.1
- accelerate==0.16.0
torch-neuronx and neuronx-cc will be installed when you configure your environment following the Inf2 setup guide. The remaining dependencies can be installed as:
diffusers==0.14.0
transformers==4.26.1
accelerate==0.16.0
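For example, the pinned versions could be installed from a notebook cell in the compilation environment (a minimal sketch):

```python
import subprocess
import sys

# Install the pinned dependencies into the active (Neuron) virtual environment
subprocess.check_call(
    [sys.executable, "-m", "pip", "install",
     "diffusers==0.14.0", "transformers==4.26.1", "accelerate==0.16.0"]
)
```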
There are a few options specified here. Let's go through them in turn.
- `engine` - specifies the engine that will be used for this workload. In this case we'll be hosting the model using the DJL Python engine option.
- `option.entryPoint` - specifies the entrypoint code that will be used to host the model. `djl_python.transformers-neuronx` refers to the `transformers-neuronx.py` module from the djl_python repo.
- `option.s3url` - specifies the location of the model files. Alternatively, an `option.model_id` option can be used instead to specify a model from the Hugging Face Hub (e.g. Stable Diffusion), and the model will be automatically downloaded from the Hub. The s3url approach is recommended, as it allows you to host the model artifact within your own environment and enables faster deployments by utilizing an optimized approach within the DJL inference container to transfer the model from S3 onto the hosting instance.
- `option.tensor_parallel_degree` - helps determine the number of worker processes: the number of accelerators divided by this value gives the number of workers.
For more information on the available options, please refer to the SageMaker Large Model Inference Documentation
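For illustration, a serving.properties along these lines could be written from the notebook; the entryPoint, s3url, and tensor_parallel_degree values are assumptions to adapt to your own artifacts (the bucket/prefix variables are the placeholders defined earlier):

```python
# Write a minimal serving.properties; option.s3url points at the compiled model artifacts,
# and tensor_parallel_degree=2 assumes the two NeuronCores on a single Inferentia2 accelerator.
serving_properties = "\n".join([
    "engine=Python",
    "option.entryPoint=model.py",
    f"option.s3url=s3://{s3_bucket}/{s3_model_prefix}/",
    "option.tensor_parallel_degree=2",
])
with open("serving.properties", "w") as f:
    f.write(serving_properties + "\n")
```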
Leverage In-Built Containers
So now our model.tar.gz will consist of only the serving.properties file.
We will load and replace the following parts of the base model with their compiled weights. In particular, we will load the compiled weights for:
- The CLIP text encoder;
- The VAE decoder;
- The UNet, and
- The VAE_post_quant_conv
These blocks are chosen because they represent the bulk of the compute in the pipeline. Further, we will also replace the original cross-attention module with a custom cross-attention module for better performance: `CrossAttention.get_attention_scores = get_attention_scores`.
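Inside the custom model.py, the loading step might look roughly like the sketch below; the file names follow the compiled_model/ layout referenced above, and the wrapper classes and monkey patch are only indicated in comments rather than reproduced:

```python
import os

import torch
from diffusers import StableDiffusionPipeline


def load_compiled_pipeline(model_dir: str):
    """Load the base pipeline and swap in the Neuron-compiled TorchScript parts (sketch)."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float32
    )

    compiled = os.path.join(model_dir, "compiled_model")

    # The VAE decoder and post_quant_conv can be replaced directly with the compiled modules
    pipe.vae.decoder = torch.jit.load(os.path.join(compiled, "vae_decoder.pt"))
    pipe.vae.post_quant_conv = torch.jit.load(os.path.join(compiled, "vae_post_quant_conv.pt"))

    # The text encoder and UNet go through the wrapper classes described earlier (omitted here);
    # the compiled TorchScript is attached to those wrappers, e.g.:
    #   pipe.text_encoder.neuron_text_encoder = torch.jit.load(os.path.join(compiled, "text_encoder.pt"))
    #   pipe.unet.unetwrap = torch.jit.load(os.path.join(compiled, "unet.pt"))

    # The optimized attention scores are patched in via the monkey patch mentioned above, e.g.:
    #   CrossAttention.get_attention_scores = get_attention_scores
    return pipe
```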
Upload the tar file to S3 for endpoint creation
To create the endpoint, the steps are:

- Create the tar ball with just the serving.properties and the model.py files and upload it to S3 (see the packaging sketch after this list)
- Create the model using the image container and the model tarball uploaded earlier
- Create the endpoint config using the following key parameters:
  a) InstanceType is ml.inf2.xlarge
  b) ContainerStartupHealthCheckTimeoutInSeconds is 240 to ensure the health check starts after the model is ready
  c) VolumeSizeInGB is set to a larger value so it can be used for loading the model weights, which are 32 GB in size
- Create the endpoint using the endpoint config created above
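A minimal packaging-and-upload sketch (file names follow the structure above; the bucket and prefix variables are the placeholders from the setup sketch):

```python
import tarfile

import sagemaker

# Package serving.properties (and, when using a custom handler, model.py) into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("serving.properties")
    tar.add("model.py")

# Upload the code artifact to S3; the returned URI is used when creating the SageMaker Model
sess = sagemaker.session.Session()
s3_code_artifact = sess.upload_data("model.tar.gz", bucket=s3_bucket, key_prefix=s3_code_prefix)
print(f"Code artifact uploaded to: {s3_code_artifact}")
```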
Create the Model
Use the image URI for the DJL container and the S3 location to which the tarball was uploaded.
The container downloads the model into the /tmp space on the container because SageMaker maps /tmp to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB. It leverages s5cmd (https://github.com/peak/s5cmd), which offers very fast download speeds and is hence extremely useful when downloading large models.
For instances like p4dn, which come with local instance storage, we can continue to leverage the /tmp space on the container. The size of this mount is large enough to hold the model.
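A hedged sketch of the create_model call with boto3; `inference_image_uri` is retrieved in the next section and `s3_code_artifact` is the tarball location from the upload step, both placeholders here:

```python
from datetime import datetime

model_name = "sd21-inf2-" + datetime.now().strftime("%Y%m%d-%H%M%S")

# Register the SageMaker Model: the DJL container image plus the model.tar.gz code artifact
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
print("Model ARN:", create_model_response["ModelArn"])
```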
Getting the container image URI
Available frameworks are:
- djl-deepspeed (0.20.0, 0.21.0, 0.22.1, 0.23.0)
- djl-fastertransformer (0.21.0, 0.22.1, 0.23.0)
- fastertransformer (5.3.0)
- StableDiffusion 2.1
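One way to look up an image URI is through the SageMaker Python SDK's image_uris helper; the framework and version below are illustrative picks from the list above, and for an Inferentia2 deployment you would substitute the Neuron variant of the DJL LMI container image:

```python
from sagemaker import image_uris

# Retrieve a DJL LMI container image URI for this region (framework/version are illustrative;
# use the Neuron-enabled DJL image for Inf2 deployments)
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(inference_image_uri)
```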
Creating the endpoint in SageMaker
Create a SageMaker endpoint configuration.
Create the endpoint, and wait for it to be up and running.
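Putting the key parameters together, the endpoint configuration and endpoint creation might look like this sketch; the instance type, health check timeout, and volume size follow the values discussed above (the 100 GB volume is an assumed value large enough for the 32 GB weights):

```python
endpoint_config_name = model_name + "-config"
endpoint_name = model_name + "-endpoint"

# Endpoint config: Inf2 instance, extended startup health check, and an EBS volume
# large enough to hold the model weights
sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.inf2.xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 240,
            "VolumeSizeInGB": 100,
        }
    ],
)

sm_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)

# Block until the endpoint is InService before invoking it
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
```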
While you wait for the endpoint to be created, you can read more about:
- StableDiffusion
- Deep Learning containers for large model inference
- DeepSpeed
- Quantization in HuggingFace Accelerate
- Handling big models for inference using Accelerate
Leverage Boto3 to invoke the endpoint.
This is a generative model, so we pass in text as a prompt and the model generates an image and returns it to us, using 50 denoising steps.
Invoke model
Let's run an example with a basic text prompt: "Mountains Landscape".
This will create a 512 x 512 resolution picture.
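A sketch of the invocation using boto3; the payload and response format depend on the custom model.py, so the JSON fields used here (such as "prompt", "steps", and "generated_image") are assumptions:

```python
import base64
import json
from io import BytesIO

from PIL import Image

prompt = "Mountains Landscape"

# Invoke the SageMaker endpoint with the text prompt and 50 denoising steps
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({"prompt": prompt, "steps": 50}),
)

# Assuming the handler returns the generated image as base64-encoded bytes
payload = json.loads(response["Body"].read())
image_bytes = base64.b64decode(payload["generated_image"])
image = Image.open(BytesIO(image_bytes))
image.save("mountains_landscape.png")
```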
P95 numbers
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.