2b. Flan-T5-XXL (TGI)
Flan-T5-XXL
In this notebook we will create and deploy a Flan-T5-XXL model using inference components on the endpoint you created in the first notebook. For this model we will be using Hugging Face's Text Generation Inference (TGI) container, and we will reserve two GPUs for each model copy of the inference component we create. This is the third notebook in a series of five used to deploy models against the endpoint you created in the first notebook. The last notebook will show you other available APIs and clean up the artifacts created.
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Tested using the Python 3 (Data Science) kernel on SageMaker Studio and conda_python3 kernel on SageMaker Notebook Instance.
Install dependencies
Upgrade the SageMaker Python SDK.
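A typical upgrade cell looks like the following (upgrading `boto3` alongside the SageMaker SDK is an optional extra here, since inference components require recent client versions):

```shell
pip install --upgrade --quiet sagemaker boto3
```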
Import libraries
Set configurations
REPLACE the endpoint_name value with the name of the endpoint created in the first notebook.
We begin by creating the objects we will need for this notebook: in particular, boto3 clients to interact with SageMaker, and other variables that will be referenced later in the notebook.
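A minimal setup cell might look like the sketch below. The client and resource names are illustrative (not from the original notebook), and the boto3 client creation is commented out because it requires AWS credentials and a configured region:

```python
# Uncomment once you are running with AWS credentials configured
# (e.g. inside SageMaker Studio or a Notebook Instance):
#
#   import boto3
#   sagemaker_client = boto3.client("sagemaker")
#   sagemaker_runtime_client = boto3.client("sagemaker-runtime")

# REPLACE with the name of the endpoint created in the first notebook
endpoint_name = "<your-endpoint-name>"

# Illustrative names reused in the cells below
model_name = "flan-t5-xxl-tgi"
inference_component_name = "flan-t5-xxl-ic"
```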
Create Model Artifact
We will be deploying the Flan-T5-XXL model using the TGI container. To do so, you need to specify the image you would like to use along with the proper configuration. You then create a SageMaker model that will be referenced when you create your inference component.
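As a sketch, the model definition could look like the following. The image URI shown is an illustrative TGI container tag (look up the current URI for your region and version, e.g. via `sagemaker.image_uris`), the role ARN is a placeholder, and the `create_model` call is commented out because it requires AWS credentials:

```python
# Illustrative Hugging Face TGI image URI for us-west-2 -- verify the
# current tag for your region before using it
tgi_image_uri = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04"
)

create_model_args = {
    "ModelName": "flan-t5-xxl-tgi",                       # illustrative name
    "ExecutionRoleArn": "<your-sagemaker-execution-role-arn>",
    "PrimaryContainer": {
        "Image": tgi_image_uri,
        "Environment": {
            "HF_MODEL_ID": "google/flan-t5-xxl",  # pull weights from the HF Hub
            "SM_NUM_GPUS": "2",                   # shard the model across 2 GPUs
        },
    },
}
# sagemaker_client.create_model(**create_model_args)  # requires AWS credentials
```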
We can now create the inference component, which will be deployed on the endpoint you specify. Note that you can provide either a SageMaker model or a container in the specification; if you provide a container, you must supply an image and an ArtifactUrl as parameters. In this example we reference the model we prepared in the cells above. You can also set `ComputeResourceRequirements` to tell SageMaker what should be reserved for each copy of the inference component, and a copy count for the number of copies you would like to deploy. These copies can be managed and scaled as the capabilities become available.
Note that in this example we set `NumberOfAcceleratorDevicesRequired` to a value of 2. By doing so we reserve 2 accelerators for each copy of this inference component, so that we can use tensor parallelism.
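A sketch of the request, with illustrative names, a placeholder endpoint, and the actual `create_inference_component` call commented out since it needs AWS credentials (the `VariantName` shown assumes the endpoint was created with a variant named `AllTraffic`):

```python
endpoint_name = "<your-endpoint-name>"  # endpoint from the first notebook

create_ic_args = {
    "InferenceComponentName": "flan-t5-xxl-ic",  # illustrative name
    "EndpointName": endpoint_name,
    "VariantName": "AllTraffic",                 # variant name used at endpoint creation
    "Specification": {
        "ModelName": "flan-t5-xxl-tgi",          # SageMaker model created above
        "ComputeResourceRequirements": {
            # Reserve 2 accelerators per copy so TGI can shard the model
            "NumberOfAcceleratorDevicesRequired": 2,
            "NumberOfCpuCoresRequired": 2,       # illustrative values
            "MinMemoryRequiredInMb": 1024,
        },
    },
    "RuntimeConfig": {"CopyCount": 1},           # number of copies to deploy
}
# sagemaker_client.create_inference_component(**create_ic_args)  # requires AWS credentials
```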
Wait until the inference component is InService
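Deployment can be polled with `describe_inference_component` until the status leaves `Creating`. A small helper like the following is one way to do it (the client and component name in the commented example are the illustrative ones from above):

```python
import time

def wait_for_inference_component(sm_client, name, poll_seconds=30):
    """Poll DescribeInferenceComponent until the component leaves 'Creating'."""
    while True:
        desc = sm_client.describe_inference_component(InferenceComponentName=name)
        status = desc["InferenceComponentStatus"]
        print(f"Status: {status}")
        if status != "Creating":
            return status  # e.g. "InService" or "Failed"
        time.sleep(poll_seconds)

# Example (requires AWS credentials and the component created above):
# status = wait_for_inference_component(sagemaker_client, "flan-t5-xxl-ic")
# assert status == "InService"
```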
Now that the inference component is 'InService', it is available to serve requests. Here we invoke the endpoint, but notice that we add an additional parameter called 'InferenceComponentName'. This allows SageMaker to direct your request to the proper inference component.
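An invocation sketch is below. The endpoint name is a placeholder, the component name is the illustrative one from above, and the `invoke_endpoint` call is commented out because it requires AWS credentials and the deployed component. The body follows TGI's JSON format (`inputs` plus optional generation `parameters`):

```python
import json

# TGI expects a JSON body with "inputs" and optional generation "parameters"
payload = {
    "inputs": "Translate to German: How old are you?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

invoke_args = {
    "EndpointName": "<your-endpoint-name>",      # endpoint from the first notebook
    "InferenceComponentName": "flan-t5-xxl-ic",  # routes the request to this component
    "ContentType": "application/json",
    "Body": json.dumps(payload),
}
# response = sagemaker_runtime_client.invoke_endpoint(**invoke_args)  # requires AWS credentials
# print(json.loads(response["Body"].read())[0]["generated_text"])
```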
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.