Xgboost Multi Model Endpoint Home Value
Amazon SageMaker Multi-Model Endpoints using XGBoost
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
With Amazon SageMaker multi-model endpoints, customers can create an endpoint that seamlessly hosts up to thousands of models. These endpoints are well suited to use cases where any one of a large number of models, which can be served from a common inference container to save inference costs, needs to be invokable on-demand and where it is acceptable for infrequently invoked models to incur some additional latency. For applications which require consistently low inference latency, an endpoint deploying a single model is still the best choice.
At a high level, Amazon SageMaker manages the loading and unloading of models for a multi-model endpoint, as they are needed. When an invocation request is made for a particular model, Amazon SageMaker routes the request to an instance assigned to that model, downloads the model artifacts from S3 onto that instance, and initiates loading of the model into the memory of the container. As soon as the loading is complete, Amazon SageMaker performs the requested invocation and returns the result. If the model is already loaded in memory on the selected instance, the downloading and loading steps are skipped and the invocation is performed immediately.
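The load-on-demand behavior described above can be pictured with a toy in-memory cache. This is only an illustrative model of the idea, not how SageMaker is implemented; the class name ToyModelHost and its fetch callback are hypothetical.

```python
class ToyModelHost:
    """Toy stand-in for one endpoint instance that loads models lazily."""

    def __init__(self, fetch):
        # fetch: callable mapping a model name to a loaded model (a callable).
        # It stands in for "download from S3 and load into memory".
        self._fetch = fetch
        self._loaded = {}    # models currently held in memory
        self.cold_loads = 0  # number of times a model had to be fetched

    def invoke(self, model_name, payload):
        # Cold path: the model is not in memory yet, so fetch it first.
        if model_name not in self._loaded:
            self._loaded[model_name] = self._fetch(model_name)
            self.cold_loads += 1
        # Warm path: the model is already loaded, so invoke it directly.
        return self._loaded[model_name](payload)
```

The first call for a given model name pays the fetch cost; later calls for the same name skip it, which is the latency pattern shown later in this notebook.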
To demonstrate how multi-model endpoints are created and used, this notebook provides an example using a set of XGBoost models that each predict housing prices for a single location. This domain is used as a simple example to easily experiment with multi-model endpoints.
The Amazon SageMaker multi-model endpoint capability is designed to work with the MXNet, PyTorch, and Scikit-Learn machine learning frameworks (TensorFlow coming soon), as well as the SageMaker XGBoost, KNN, and Linear Learner algorithms.
In addition, Amazon SageMaker multi-model endpoints are also designed to work with cases where you bring your own container that integrates with the multi-model server library. An example of this can be found here and documentation here.
Generate synthetic data
The code below contains helper functions to generate synthetic data in the form of a 1x7 numpy array representing the features of a house.
The first entry in the array is the randomly generated price of a house. The remaining entries are the features (i.e. number of bedrooms, square feet, number of bathrooms, etc.).
These functions will be used to generate synthetic data for training, validation, and testing. They will also allow us to submit synthetic payloads for inference to test our multi-model endpoint.
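A minimal sketch of such helpers follows. It uses only the Python standard library and returns plain lists (numpy.array(...) would convert them to arrays). The feature order, value ranges, and pricing rule are assumptions made for illustration, not the notebook's original code.

```python
import random


def gen_random_house(rng=random):
    """Return [price, year_built, square_feet, bedrooms, bathrooms, lot_acres, garage_spaces]."""
    square_feet = int(rng.gauss(3000, 750))
    bedrooms = rng.randint(2, 6)
    bathrooms = rng.randint(2, 6) / 2
    lot_acres = round(rng.gauss(1.0, 0.25), 2)
    garage_spaces = rng.randint(0, 3)
    year_built = min(2019, int(rng.gauss(1995, 10)))
    # Hypothetical pricing rule: a base price per square foot plus simple bonuses.
    price = (
        square_feet * 150
        + bedrooms * 10_000
        + int(bathrooms * 8_000)
        + garage_spaces * 5_000
    )
    return [price, year_built, square_feet, bedrooms, bathrooms, lot_acres, garage_spaces]


def gen_houses(num_houses, seed=None):
    """Generate num_houses rows; pass a seed for reproducible data."""
    rng = random.Random(seed)
    return [gen_random_house(rng) for _ in range(num_houses)]
```

Dropping the first entry of a row gives a payload suitable for inference, since the price is the label rather than a feature.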
2.42.1
Train multiple house value prediction models
In the following section, we set up the code to train a house price prediction model for each of 4 different cities.
As such, we will launch multiple training jobs asynchronously, using the XGBoost algorithm.
In this notebook, we will be using the AWS Managed XGBoost Image for both training and inference - this image provides native support for launching multi-model endpoints.
Split a given dataset into train, validation, and test
The code below will generate 3 sets of data: one for training, one for validation, and one for testing.
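One way to produce the three sets is a shuffled split, sketched here with the standard library. The 60/30/10 proportions and the function name split_data are assumptions; the notebook does not state the ratios it uses.

```python
import random


def split_data(rows, train_frac=0.6, val_frac=0.3, seed=0):
    """Shuffle rows and split them into (train, validation, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # whatever remains after train and validation
    return train, val, test
```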
Launch a single training job for a given housing location
There is nothing specific to multi-model endpoints about the models they host. The models are trained in the same way as all other SageMaker models. Here we are using the XGBoost estimator and not waiting for the job to complete.
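A sketch of launching one such job with version 2 of the SageMaker Python SDK is shown below. The instance type, container version, hyperparameters, and the train/val and model_artifacts sub-paths are assumptions for illustration; the model_prep path and the xgb- job name prefix follow the output shown in this notebook. The SDK is imported inside the function so that the helper can be defined without it installed.

```python
def job_name_prefix(location):
    """Build a job name prefix such as 'xgb-NewYork-NY' (underscores are not allowed)."""
    return "xgb-" + location.replace("_", "-")


def launch_training_job(location, role, bucket, prefix, region, session=None):
    """Start an XGBoost training job for one location without blocking."""
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    data_uri = f"s3://{bucket}/{prefix}/model_prep/{location}"
    estimator = Estimator(
        image_uri=image_uris.retrieve("xgboost", region, version="1.0-1"),
        role=role,
        instance_count=1,
        instance_type="ml.m4.xlarge",  # assumed instance type
        output_path=f"s3://{bucket}/{prefix}/model_artifacts/{location}",
        base_job_name=job_name_prefix(location),
        sagemaker_session=session,
    )
    # Assumed hyperparameters for a regression task.
    estimator.set_hyperparameters(objective="reg:squarederror", num_round=25)
    estimator.fit(
        {
            "train": TrainingInput(f"{data_uri}/train", content_type="text/csv"),
            "validation": TrainingInput(f"{data_uri}/val", content_type="text/csv"),
        },
        wait=False,  # return immediately instead of blocking until completion
    )
    return estimator
```

Because wait=False returns as soon as the job is submitted, calling this helper once per location starts all jobs in parallel.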
Kick off a model training job for each housing location
Training data uploaded: s3://sagemaker-us-west-2-688520471316/XGBOOST_BOSTON_HOUSING/model_prep/NewYork_NY
Training data uploaded: s3://sagemaker-us-west-2-688520471316/XGBOOST_BOSTON_HOUSING/model_prep/LosAngeles_CA
Training data uploaded: s3://sagemaker-us-west-2-688520471316/XGBOOST_BOSTON_HOUSING/model_prep/Chicago_IL
Training data uploaded: s3://sagemaker-us-west-2-688520471316/XGBOOST_BOSTON_HOUSING/model_prep/Houston_TX
4 training jobs launched: ['xgb-NewYork-NY-2021-05-28-20-27-47-850', 'xgb-LosAngeles-CA-2021-05-28-20-27-48-350', 'xgb-Chicago-IL-2021-05-28-20-27-51-370', 'xgb-Houston-TX-2021-05-28-20-27-53-744']
Wait for all model training to finish
Waiting for job: xgb-NewYork-NY-2021-05-28-20-27-47-850
xgb-NewYork-NY-2021-05-28-20-27-47-850 job status: InProgress
xgb-NewYork-NY-2021-05-28-20-27-47-850 job status: InProgress
xgb-NewYork-NY-2021-05-28-20-27-47-850 job status: InProgress
xgb-NewYork-NY-2021-05-28-20-27-47-850 job status: InProgress
DONE. Status for xgb-NewYork-NY-2021-05-28-20-27-47-850 is Completed
Waiting for job: xgb-LosAngeles-CA-2021-05-28-20-27-48-350
DONE. Status for xgb-LosAngeles-CA-2021-05-28-20-27-48-350 is Completed
Waiting for job: xgb-Chicago-IL-2021-05-28-20-27-51-370
DONE. Status for xgb-Chicago-IL-2021-05-28-20-27-51-370 is Completed
Waiting for job: xgb-Houston-TX-2021-05-28-20-27-53-744
DONE. Status for xgb-Houston-TX-2021-05-28-20-27-53-744 is Completed
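Status lines like those above could come from a simple polling loop around the describe_training_job API. The sketch below takes the SageMaker client as a parameter (for example boto3.client("sagemaker")); the function name and polling interval are assumptions.

```python
import time


def wait_for_training_job(sm_client, job_name, poll_seconds=30):
    """Poll a training job until it leaves the in-progress states; return its final status."""
    print(f"Waiting for job: {job_name}")
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status not in ("InProgress", "Stopping"):
            break
        print(f"{job_name} job status: {status}")
        time.sleep(poll_seconds)
    print(f"DONE. Status for {job_name} is {status}")
    return status
```

Calling this once per launched job, in order, waits for all of them; jobs that finished while an earlier one was being polled return immediately.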
Create the multi-model endpoint with the SageMaker SDK
Create a SageMaker Model from one of the Estimators
Create the Amazon SageMaker MultiDataModel entity
We create the multi-model endpoint using the MultiDataModel class.
You can create a MultiDataModel by directly passing in a sagemaker.model.Model object. In that case, once the MultiDataModel is deployed, the endpoint inherits the image to use as well as any environment variables, network isolation settings, etc. from that model.
A MultiDataModel can also be created without explicitly passing a sagemaker.model.Model object. Please refer to the documentation for additional details.
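The two steps under this heading might look like the sketch below: create a Model from one estimator, then wrap it in a MultiDataModel. The helper name and the multi_data_model_cls parameter are hypothetical; the latter exists only so the logic can be exercised without the SDK, and it defaults to the real sagemaker.multidatamodel.MultiDataModel.

```python
def build_multi_data_model(estimator, role, image_uri, name, model_data_prefix,
                           session=None, multi_data_model_cls=None):
    """Create a Model from an estimator and wrap it in a MultiDataModel."""
    if multi_data_model_cls is None:
        # Imported lazily so this helper can be defined without the SDK installed.
        from sagemaker.multidatamodel import MultiDataModel as multi_data_model_cls

    # The endpoint inherits the image and settings from this model.
    model = estimator.create_model(role=role, image_uri=image_uri)
    return multi_data_model_cls(
        name=name,
        model_data_prefix=model_data_prefix,  # S3 prefix the endpoint reads artifacts from
        model=model,
        sagemaker_session=session,
    )
```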
Deploy the Multi Model Endpoint
You need to consider the appropriate instance type and number of instances for the projected prediction workload across all the models you plan to host behind your multi-model endpoint. The number and size of the individual models will also drive memory requirements.
-------------------!
Our endpoint has launched! Let's look at what models are available to the endpoint!
By 'available', we mean the model artifacts currently stored under the S3 prefix we defined when setting up the MultiDataModel above, i.e. model_data_prefix.
Since we currently have no artifacts (i.e. tar.gz files) stored under that S3 prefix, our endpoint has no models 'available' to serve inference requests.
We will demonstrate how to make models 'available' to our endpoint below.
[]
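Deploying and then listing the available models might be sketched as follows, assuming mme is the MultiDataModel created earlier. The helper name, instance type, and instance count are assumptions.

```python
def deploy_and_list(mme, instance_type="ml.m4.xlarge", instance_count=1):
    """Deploy the multi-model endpoint and list artifacts under its S3 prefix."""
    predictor = mme.deploy(
        initial_instance_count=instance_count,
        instance_type=instance_type,
    )
    # list_models() yields paths relative to model_data_prefix.
    available = list(mme.list_models())
    return predictor, available
```

With no artifacts under the prefix yet, the returned list is empty, matching the [] output above.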
Let's deploy model artifacts to be found by the endpoint
We are now using the .add_model() method of the MultiDataModel to copy over our model artifacts from where they were initially stored, during training, to where our endpoint will source model artifacts for inference requests.
model_data_source refers to the location of our model artifact (i.e. where it was deposited on S3 after training completed)
model_data_path is the relative path to the S3 prefix we specified above (i.e. model_data_prefix) where our endpoint will source models for inference requests.
Since this is a relative path, we can simply pass the name of what we wish to call the model artifact at inference time (i.e. Chicago_IL.tar.gz)
Dynamically deploying additional models
Note that we can call the .add_model() method at any time, as shown below, to dynamically deploy more models to the endpoint and serve inference requests for them as needed.
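A sketch of copying artifacts with .add_model() follows. The helper name and the shape of the artifacts mapping are assumptions; for a trained estimator, the source URI is typically available as estimator.model_data.

```python
def add_trained_models(mme, artifacts):
    """Copy model artifacts under the endpoint's S3 prefix.

    artifacts maps the name to use at inference time (e.g. 'Chicago_IL.tar.gz')
    to the S3 URI where training deposited the model archive.
    """
    for target_name, source_uri in artifacts.items():
        mme.add_model(model_data_source=source_uri, model_data_path=target_name)
    # Return what the endpoint can now serve, sorted for stable output.
    return sorted(mme.list_models())
```

The same helper can be called again later with further artifacts; the endpoint does not need to be redeployed.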
We have added the 4 model artifacts from our training jobs!
We can see that the S3 prefix we specified when setting up MultiDataModel now has 4 model artifacts. As such, the endpoint can now serve up inference requests for these models.
['Chicago_IL.tar.gz', 'Houston_TX.tar.gz', 'LosAngeles_CA.tar.gz', 'NewYork_NY.tar.gz']
Get predictions from the endpoint
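Recall that mme.deploy() returns a RealTimePredictor that we saved in a variable called predictor. (RealTimePredictor is the class name from version 1 of the SageMaker Python SDK; version 2 names it Predictor.)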
We will use predictor to submit requests to the endpoint.
XGBoost supports text/csv for the content type and accept type. For more information on XGBoost Input/Output Interface, please see here.
Since the default RealTimePredictor does not have a serializer or deserializer set for requests, we will also set these.
This will allow us to submit a python list for inference, and get back a float response.
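To make the text/csv round trip concrete, the sketch below shows what a serializer and deserializer do for a single-row request. The function names are hypothetical; with the SageMaker Python SDK, the sagemaker.serializers and sagemaker.deserializers modules provide classes for this that can be assigned to predictor.serializer and predictor.deserializer.

```python
def to_csv_payload(features):
    """Serialize one row of features into a text/csv request body."""
    return ",".join(str(value) for value in features)


def parse_prediction(body):
    """Turn a single-value text/csv response body into a float."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return float(body.strip())
```

This assumes one prediction per request; a multi-row response would need to be split before conversion.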
Invoking models on a multi-model endpoint
Notice the higher latencies on the first invocation of any given model. This is due to the time it takes SageMaker to download the model to the Endpoint instance and then load the model into the inference container. Subsequent invocations of the same model take advantage of the model already being loaded into the inference container.
$395,909.25, took 1,469 ms
$372,641.38, took 23 ms
$344,676.03, took 1,145 ms
$479,065.41, took 19 ms
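Timings like those above can be collected with a small wrapper around predictor.predict(), using its target_model argument to select the model. The helper name is an assumption, and the sketch assumes the predictor's deserializer returns a value that float() accepts.

```python
import time


def predict_one_house_value(predictor, features, model_name):
    """Invoke one model on the endpoint and report the prediction and latency."""
    start = time.perf_counter()
    # target_model selects which artifact under the S3 prefix handles the request.
    prediction = float(predictor.predict(features, target_model=model_name))
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"${prediction:,.2f}, took {elapsed_ms:,.0f} ms")
    return prediction, elapsed_ms
```

Calling it twice in a row with the same model_name should show the cold and warm latencies described above.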
Updating a model
To update a model, you would follow the same approach as above and add it as a new model. For example, if you have retrained the NewYork_NY.tar.gz model and wanted to start invoking it, you would upload the updated model artifacts behind the S3 prefix with a new name such as NewYork_NY_v2.tar.gz, and then change the target_model field to invoke NewYork_NY_v2.tar.gz instead of NewYork_NY.tar.gz. You do not want to overwrite the model artifacts in Amazon S3, because the old version of the model might still be loaded in the containers or on the storage volume of the instances on the endpoint. Invocations to the new model could then invoke the old version of the model.
Alternatively, you could stop the endpoint and re-deploy a fresh set of models.
Using Boto APIs to invoke the endpoint
While developing interactively within a Jupyter notebook, it is more seamless to invoke your endpoint through the SageMaker SDK, since .deploy() returns a RealTimePredictor. You also have fine-grained control over the serialization and deserialization protocols that shape your request and response payloads to and from the endpoint.
This is great for iterative experimentation within a notebook. Furthermore, should you have an application that has access to the SageMaker SDK, you can always import RealTimePredictor and attach it to an existing endpoint - this allows you to stick to using the high level SDK if preferable.
Additional documentation on RealTimePredictor can be found here.
The lower level Boto3 SDK may be preferable if you are attempting to invoke the endpoint as a part of a broader architecture.
Imagine an API Gateway front end that uses a Lambda proxy to transform request payloads before they reach a SageMaker endpoint. In this example, Lambda does not have access to the SageMaker Python SDK, but Boto3 still allows you to interact with your endpoint and serve inference requests.
Boto3 allows for quick injection of ML intelligence via SageMaker Endpoints into existing applications with minimal/no refactoring to existing code.
Boto3 will submit your requests as a binary payload, while still allowing you to supply your desired Content-Type and Accept headers with serialization being handled by the inference container in the SageMaker Endpoint.
Additional documentation on .invoke_endpoint() can be found here.
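A minimal sketch of invoking a specific model with Boto3 follows. The runtime client is passed in (for example boto3.client("sagemaker-runtime")); the helper name is an assumption, and the response is assumed to hold a single prediction.

```python
def invoke_with_boto3(runtime_client, endpoint_name, target_model, features,
                      accept="text/csv"):
    """Send one CSV row to a model on a multi-model endpoint and return the prediction."""
    body = ",".join(str(value) for value in features)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept=accept,
        TargetModel=target_model,  # e.g. "Chicago_IL.tar.gz"
        Body=body,
    )
    # The response body is a stream of bytes; decode it and convert to a float.
    return float(response["Body"].read().decode("utf-8"))
```

The same function works inside a Lambda handler, since it depends only on Boto3.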
Clean up
Here, to be sure we are not billed for endpoints we are no longer using, we clean up.
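Cleanup might be sketched as below, assuming predictor is the object returned by mme.deploy(). The helper name is hypothetical, and deleting the model in addition to the endpoint is a choice made here rather than something the notebook states.

```python
def clean_up(predictor):
    """Delete the model and endpoint behind a predictor so they stop incurring charges."""
    # Delete the model first, while the endpoint configuration still references it.
    predictor.delete_model()
    # Then delete the endpoint, which is the billed resource.
    predictor.delete_endpoint()
```

Model artifacts stored in S3 are not removed by these calls and would need to be deleted separately if no longer needed.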
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.