ViLT-B32 Fine-tuned VQA
Hugging Face Multimodal Inference (Visual Question Answering) with vilt-b32-finetuned-vqa
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Overview
This notebook demonstrates how to deploy the Hugging Face multimodal model vilt-b32-finetuned-vqa for visual question answering on Amazon SageMaker and run inference against it.
Visual Question Answering (VQA) is a task where a model answers questions about an image. The input consists of an image and a textual question about the image. The output is the model's answer to the question, bridging the gap between computer vision and natural language understanding.
vilt-b32-finetuned-vqa is a Vision-and-Language Transformer (ViLT) model fine-tuned on the VQAv2 dataset. Please visit the model card (dandelin/vilt-b32-finetuned-vqa) on the Hugging Face Hub for more information.
Setup
Install or update the SageMaker Python SDK
First, we need to make sure the latest version of the SageMaker Python SDK is installed.
Set up Python modules and roles
Then, we import the SageMaker Python SDK and instantiate a sagemaker_session, which we use to determine the current region and execution role.
Create the Hugging Face model
Next we configure the HuggingFaceModel object by specifying a unique model name, transformers_version, pytorch_version, py_version, and the execution role for the endpoint. Additionally, we specify some environment variables including the HF_MODEL_ID which corresponds to the model in the HuggingFace Hub, and the HF_TASK which configures the inference task to be performed.
Create a SageMaker endpoint
Next, we deploy the model by invoking the deploy() function. Here we use an ml.m5.xlarge instance with 4 vCPUs and 16 GiB of memory.
Run Inference
To run inference with the visual question answering model, we first need to prepare the input. The input consists of an image and a question (a text string). The image can be stored in S3 and supplied through an S3 presigned URL.
Please replace BUCKET_NAME, IMAGE_NAME, and QUESTION_INPUT with your S3 bucket, image object key, and question.
Next, we can call the SageMaker endpoint we created in this notebook, providing the image URL and question for inference.
Cleanup
After you've finished testing the endpoint, it's important to delete the model and endpoint resources to avoid incurring charges.
Conclusion
In this tutorial, we deployed the Hugging Face multimodal model vilt-b32-finetuned-vqa to an Amazon SageMaker real-time endpoint.
With SageMaker Hosting, you can easily host multimodal models and run inference against them.
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.