Build Chatbot Applications Using RAG on SageMaker
Deploy open-source Large Language Models on Amazon SageMaker
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
In this notebook, we will show you how to deploy open-source LLMs from Hugging Face on Amazon SageMaker. The notebook contains two sections:
- Section 1: Deploy Falcon model and embedding model to Amazon SageMaker
- Section 2: Use a RAG-based approach with LangChain and SageMaker endpoints to build a simplified question-answering application.
This notebook is designed to run on Python 3 Data Science 3.0 kernel in Amazon SageMaker Studio
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy Falcon to Amazon SageMaker. Make sure you have an AWS account configured and the sagemaker Python SDK installed. You can safely ignore any errors from the pip install.
Section 1: Deploy Falcon model and embedding model to Amazon SageMaker
In this section, we will deploy the open-source Falcon 7b instruct model on SageMaker for real-time inference.
To deploy Falcon-7B-Instruct to Amazon SageMaker, we define our endpoint configuration, including the model ID and the instance type; we will use an ml.g5.2xlarge instance.
This is an example of how to deploy an open-source LLM to Amazon SageMaker for inference using the Large Model Inference (LMI) container from the SageMaker Deep Learning Containers (DLC). We will deploy Falcon-7B-Instruct, an open-source chat LLM trained by TII.
Start preparing model artifacts
The LMI container expects the following artifacts to set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A python file to define the core inference logic
- requirements.txt (optional): Any additional pip packages to install
The serving.properties file defines the engine to use and the model to host. Note the tensor_parallel_degree parameter, which is also required in this scenario. Tensor parallelism divides the model across multiple GPUs when no single GPU has enough memory for the entire model. In this case we will use an ml.g5.2xlarge instance, which provides 1 GPU. Be careful not to specify a value larger than the number of GPUs the instance provides, or your deployment will fail.
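As a minimal sketch, the serving.properties file can be generated programmatically. The property names follow the LMI container convention, but the exact values (engine, model ID, dtype) are assumptions here; verify them against the LMI container documentation for your version.

```python
from pathlib import Path

# Hypothetical serving.properties for a single-GPU (tensor_parallel_degree=1)
# deployment; property names follow the LMI container convention.
properties = {
    "engine": "Python",
    "option.model_id": "tiiuae/falcon-7b-instruct",
    "option.tensor_parallel_degree": "1",
    "option.dtype": "fp16",
}

model_dir = Path("falcon_model")
model_dir.mkdir(exist_ok=True)
content = "\n".join(f"{k}={v}" for k, v in properties.items()) + "\n"
(model_dir / "serving.properties").write_text(content)
print(content)
```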
Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.
Getting the container image URI
Then we upload the artifacts to S3 and create the SageMaker model.
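The packaging step can be sketched with the standard library; the S3 upload at the end is left as a comment because it requires a live SageMaker session, and the bucket name and prefix there are assumptions.

```python
import tarfile
from pathlib import Path

# Package the LMI artifacts (serving.properties, optional model.py /
# requirements.txt) into the model.tar.gz layout that SageMaker expects.
artifact_dir = Path("lmi_artifacts")
artifact_dir.mkdir(exist_ok=True)
(artifact_dir / "serving.properties").write_text("engine=Python\n")

with tarfile.open("model.tar.gz", "w:gz") as tar:
    for f in artifact_dir.iterdir():
        tar.add(f, arcname=f.name)

# With a live session (bucket/prefix are hypothetical):
# code_artifact = sagemaker.Session().upload_data("model.tar.gz", bucket, "falcon-lmi")
with tarfile.open("model.tar.gz") as tar:
    print("packaged:", tar.getnames())
```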
Create SageMaker endpoint
We can now call the deploy function to create the LLM endpoint. You need to specify the instance type to use and the endpoint name.
SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes. During this time, please continue the following section to deploy the embedding model for the RAG solution. We will invoke the deployed endpoint when all the models are deployed successfully.
For more model deployment examples, see the example notebooks in the SageMaker examples GitHub repo.
Deploy the GPT-J 6B embedding on SageMaker using SageMaker Jumpstart
In this section, we host the pre-trained GPT-J-6B Hugging Face sentence transformer model on SageMaker to generate a 4096-dimensional embedding vector for an input text string. In this lab, we will use the GPT-J 6B Embedding FP16 model provided by SageMaker JumpStart, which loads a half-precision version of the original model by specifying the dtype torch.float16. By using half precision, the model consumes less GPU memory and performs faster inference than the full-precision version. For more information, see the Hugging Face documentation on FP16 optimization.
There are different ways you can choose to deploy the GPT-J-6B model. Here we show you two options:
- deploy the GPT-J-6B embedding model from the Jumpstart UI
- deploy the GPT-J-6B embedding model using SageMaker python SDK
Please choose only one of the two options below to deploy the embedding model.
Option 1: Deploy the GPT-J-6B embedding model from the Jumpstart UI
On the left-hand-side navigation pane, go to Home; under SageMaker JumpStart, choose Models, notebooks, solutions. You're presented with a range of solutions, foundation models, and other artifacts that can help you get started with a specific model or a specific business problem or use case. If you want to experiment in a particular area, you can use the search function, or you can simply browse the artifacts to find the relevant model or business solution for your needs. To find the GPT-J 6B Embedding FP16 model, complete the following steps:
- Go to the Foundation Models section. In the search bar, search for the embedding model and select the GPT-J 6B Embedding FP16.

- A new tab opens with options to train, deploy, and view model details as shown below. In the Deploy Model section, expand Deployment Configuration. For SageMaker hosting instance, choose the hosting instance (for this lab, we use ml.g5.4xlarge). You can also change the Endpoint name as needed. Then choose Deploy.

- The deploy action opens a new tab showing the model creation status and the model deployment status.
While the endpoint is deploying, update the embedding endpoint name in the following cell.
Now you can directly go to section Wait until all the endpoints are up and running.
Option 2: deploy the GPT-J-6B embedding model using SageMaker python SDK
Now we will show you how to use code to deploy the pretrained models from SageMaker Jumpstart using the SageMaker Python SDK.
We can directly load the pretrained model artifacts from SageMaker JumpStart. The SageMaker Python SDK uses model IDs and model versions to access the necessary utilities for pre-trained models. The table in the readme doc provides the core information, plus extra details useful for selecting the correct model ID and corresponding parameters.
Wait until all the endpoints are up and running
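A generic polling helper can wait for every endpoint to reach InService. This is a sketch: the real status would come from boto3's describe_endpoint call (shown as a comment), and here the status source is stubbed so the helper runs standalone.

```python
import time

def wait_for_endpoints(get_status, names, poll_seconds=30, max_polls=40):
    """Poll until every endpoint reports InService; raise if any fails.

    get_status is a callable name -> status string. With a live account it
    would wrap boto3, e.g.:
        sm = boto3.client("sagemaker")
        get_status = lambda n: sm.describe_endpoint(EndpointName=n)["EndpointStatus"]
    """
    statuses = {}
    for _ in range(max_polls):
        statuses = {n: get_status(n) for n in names}
        if any(s == "Failed" for s in statuses.values()):
            raise RuntimeError(f"endpoint failed: {statuses}")
        if all(s == "InService" for s in statuses.values()):
            return statuses
        time.sleep(poll_seconds)
    raise TimeoutError(f"endpoints not ready: {statuses}")

# Example with a stubbed status source:
fake = iter(["Creating", "InService"])
print(wait_for_endpoints(lambda n: next(fake), ["falcon-ep"], poll_seconds=0))
```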
Test endpoint outputs
Now we can invoke each endpoint to test the endpoint outputs. First, let's check the text-to-text endpoint using Falcon model.
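The text-to-text invocation can be sketched as follows. The payload keys (inputs, parameters) follow the common Hugging Face text-generation schema, and the endpoint name is an assumption; the invoke_endpoint call is left as a comment because it needs a live endpoint.

```python
import json

def build_falcon_request(prompt, max_new_tokens=256, temperature=0.6):
    # Payload shape follows the Hugging Face text-generation convention;
    # verify it against the schema your container expects.
    return json.dumps({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "do_sample": True,
        },
    })

body = build_falcon_request("What is Amazon SageMaker?")
# With a live endpoint (endpoint name is hypothetical):
# rt = boto3.client("sagemaker-runtime")
# resp = rt.invoke_endpoint(EndpointName="falcon-7b-instruct-ep",
#                           ContentType="application/json", Body=body)
# print(json.loads(resp["Body"].read())[0]["generated_text"])
print(body)
```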
Then run the following code to generate embeddings of the input using the embedding model.
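A sketch of the embedding request/response handling. The request key text_inputs and the response shape (a JSON object with an embedding list) match the JumpStart GPT-J embedding example but should be verified against your endpoint; the round trip below is stubbed so it runs without a live endpoint.

```python
import json

def build_embedding_request(texts):
    # JumpStart GPT-J embedding endpoints accept a list of input strings.
    return json.dumps({"text_inputs": texts})

def parse_embedding_response(body):
    # Assumed response shape: {"embedding": [[...], [...]]}
    return json.loads(body)["embedding"]

# Stubbed round trip (a real call would go through sagemaker-runtime invoke_endpoint):
fake_response = json.dumps({"embedding": [[0.1, 0.2, 0.3]]})
vectors = parse_embedding_response(fake_response)
print(len(vectors[0]))  # embedding dimension of the stubbed vector
```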
Section 2: Use a RAG-based approach with LangChain and SageMaker endpoints to build a simplified question-answering application
We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to the LLM.
To achieve that, we will do the following:
- Generate embeddings for each document in the knowledge library with the embedding model deployed earlier.
- Identify top K most relevant documents based on user query.
- 2.1 For a query of your interest, generate the embedding of the query using the same embedding model.
- 2.2 Search the indexes of top K most relevant documents in the embedding space using in-memory Faiss search.
- 2.3 Use the indexes to retrieve the corresponding documents.
- Combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.
Note: The retrieved document/text should be long enough to contain the information needed to answer a question, but short enough to fit into the LLM prompt (maximum sequence length).
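The retrieval steps (2.1-2.3) can be sketched without FAISS: given query and document embeddings, rank by cosine similarity and take the top K. FAISS performs the same nearest-neighbor search far more efficiently at scale; this stand-in only illustrates the logic.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k documents most similar to the query."""
    scores = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], docs, k=2))  # → [0, 1]
```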
To build a simplified QA application with LangChain, we need to:
- Wrap our SageMaker endpoints for the embedding model and the LLM into langchain.embeddings.SagemakerEndpointEmbeddings and langchain.llms.sagemaker_endpoint.SagemakerEndpoint. This requires a small override of the SagemakerEndpointEmbeddings class to make it compatible with the SageMaker embedding model.
- Prepare the dataset to build the knowledge base.
First, we wrap our SageMaker endpoint for the embedding model into langchain.embeddings.SagemakerEndpointEmbeddings. This requires a small override of the SagemakerEndpointEmbeddings class to make it compatible with the SageMaker embedding model.
Next, we wrap our SageMaker endpoint for the LLM into langchain.llms.sagemaker_endpoint.SagemakerEndpoint.
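The key piece of the wrapper is a content handler that serializes the prompt and deserializes the endpoint response. Below is a standalone sketch: in LangChain this would subclass LLMContentHandler, and the response field generated_text follows the Hugging Face convention, so both are assumptions to verify against your endpoint.

```python
import json

class FalconContentHandler:
    """Sketch of a LangChain-style content handler for the Falcon endpoint."""
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt, model_kwargs):
        # Serialize the prompt and generation parameters for the endpoint.
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output_bytes):
        # Assumed response shape: [{"generated_text": "..."}]
        return json.loads(output_bytes)[0]["generated_text"]

handler = FalconContentHandler()
body = handler.transform_input("Hello", {"max_new_tokens": 32})
text = handler.transform_output(b'[{"generated_text": "Hi there"}]')
print(text)
```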
Use LangChain to read the txt data. LangChain has multiple built-in functions to read different file formats such as CSV, HTML, and PDF. For details, see LangChain document loaders.
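Before embedding, documents are typically split into chunks that fit the model's context window. The following is a minimal stand-in for LangChain's CharacterTextSplitter; the chunk size and overlap values are arbitrary illustrations, not recommendations.

```python
def split_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping character chunks, a simple stand-in
    for LangChain's CharacterTextSplitter."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "word " * 300  # 1500 characters
chunks = split_text(doc)
print(len(chunks), len(chunks[0]))
```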
We generate embeddings for each document in the knowledge library with the embedding model.
Based on the question above, we then identify the top K most relevant documents for the user query, where K = 3 in this setup.
Finally, we combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.
We define a customized prompt as follows.
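A customized RAG prompt can be sketched with plain string formatting. The wording below is illustrative; in the notebook this would be a LangChain PromptTemplate with context and question as input variables.

```python
PROMPT_TEMPLATE = """Answer the question based only on the context below.
If the answer is not in the context, say "I don't know".

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context_docs, question):
    # Join the retrieved documents into a single context block.
    context = "\n\n".join(context_docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)

print(build_prompt(["SageMaker hosts ML models."], "What does SageMaker do?"))
```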
Run the Question and Answering chatbot application
Once all the endpoints are deployed successfully, you can open a terminal in SageMaker Studio and use the commands below to run the chatbot Streamlit application. Note that you need to install the required Python packages specified in the "requirements.txt" file, and update the environment variables with the endpoint names deployed in your account. When you execute the chatbot-streamlit.py file, it will automatically pick up the endpoint names from the environment variables.
$ pip install -r requirements.txt
$ export nlp_ep_name=<the falcon endpoint name deployed in your account>
$ export embed_ep_name=<the embedding endpoint name deployed in your account>
$ streamlit run chatbot-streamlit.py --server.port 6006 --server.maxUploadSize 6
To access the Streamlit UI, copy your SageMaker Studio URL and replace lab? with proxy/[PORT NUMBER]/. Because we set the server port to 6006, the URL should look like:
https://<domain ID>.studio.<region>.sagemaker.aws/jupyter/default/proxy/6006/
Replace the domain ID and region with the correct values for your account to access the UI as below:

You can find some suggested prompts on the left-hand-side sidebar. When you upload the sample files (you can find them in the test folder), the chatbot will automatically provide prompt suggestions based on the input data type.
Congratulations on finishing lab 1 !!!
Clean up
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.