Amazon SageMaker JumpStart - Text Embedding & Sentence Similarity

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

[CI badge: us-west-2]



Welcome to Amazon SageMaker JumpStart! You can use SageMaker JumpStart to solve many machine learning tasks with one click in SageMaker Studio, or through the SageMaker Python SDK.

In this demo notebook, we demonstrate how to use the SageMaker Python SDK for text embedding and sentence similarity. Sentence similarity involves assessing the likeness between two pieces of text: models designed for this task transform input texts into vectors (embeddings) that capture semantic details, and then compute the proximity between those vectors. We demonstrate the following here:

  • How to run inference on a text embedding model.
  • How to find the nearest neighbors for an input sentence within your own dataset.
  • How to run batch transform to get embeddings on large datasets.

The following text embedding models are currently available in SageMaker JumpStart:

Model Name            | JumpStart Model ID
----------------------|-----------------------------------------------------
bge-large-en          | huggingface-sentencesimilarity-bge-large-en
bge-base-en           | huggingface-sentencesimilarity-bge-base-en
gte-large             | huggingface-sentencesimilarity-gte-large
gte-base              | huggingface-sentencesimilarity-gte-base
e5-large-v2           | huggingface-sentencesimilarity-e5-large-v2
bge-small-en          | huggingface-sentencesimilarity-bge-small-en
e5-base-v2            | huggingface-sentencesimilarity-e5-base-v2
multilingual-e5-large | huggingface-sentencesimilarity-multilingual-e5-large
e5-large              | huggingface-sentencesimilarity-e5-large
gte-small             | huggingface-sentencesimilarity-gte-small
e5-base               | huggingface-sentencesimilarity-e5-base
e5-small-v2           | huggingface-sentencesimilarity-e5-small-v2
multilingual-e5-base  | huggingface-sentencesimilarity-multilingual-e5-base
all-MiniLM-L6-v2      | huggingface-sentencesimilarity-all-MiniLM-L6-v2

1. Set Up


Before executing the notebook, a few initial setup steps are required.


[ ]

To train and host on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. It has the necessary permissions, including access to your data in S3.
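A minimal sketch of this setup step, assuming the `sagemaker` SDK is installed. The `SAGEMAKER_ROLE_ARN` environment-variable fallback is an assumption for running outside a SageMaker notebook, not part of this notebook:

```python
# Sketch of the setup step: create a SageMaker session and resolve the
# execution role. Outside a SageMaker environment this falls back to an
# environment variable (SAGEMAKER_ROLE_ARN is an assumed name).
import os

def get_sagemaker_setup():
    """Return (session, role); both may be None outside AWS."""
    try:
        import sagemaker  # requires the sagemaker SDK and AWS credentials
        session = sagemaker.Session()
        role = sagemaker.get_execution_role()
    except Exception:
        session = None
        role = os.environ.get("SAGEMAKER_ROLE_ARN")
    return session, role

session, role = get_sagemaker_setup()
```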


[ ]

2. Select a pre-trained model

[ ]
[ ]

3. Deploy an Endpoint & Query Endpoint


Using SageMaker, we can perform inference on the pre-trained model.
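As a sketch, selecting one of the model IDs from the table above and deploying it with the SDK's `JumpStartModel` might look like the following; the instance type is an assumption, and the deploy call only runs with AWS credentials:

```python
# Pick one of the JumpStart model IDs listed in the table above.
model_id = "huggingface-sentencesimilarity-bge-large-en"

def deploy_endpoint(model_id, instance_type="ml.g5.2xlarge"):
    """Deploy a JumpStart model and return a Predictor (None outside AWS).
    The instance type is an illustrative assumption."""
    try:
        from sagemaker.jumpstart.model import JumpStartModel
        model = JumpStartModel(model_id=model_id)
        return model.deploy(initial_instance_count=1,
                            instance_type=instance_type)
    except Exception:
        return None  # no AWS credentials / SDK available

# predictor = deploy_endpoint(model_id)  # uncomment inside SageMaker
```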


[ ]

3.1 Query Endpoint to Get Embeddings

You can query the endpoint with a batch of input texts within a JSON payload. Here, we send a single request to the endpoint, and the parsed response is a list of embedding vectors.
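As a sketch of the request and response handling: the `text_inputs` key matches the batch-transform examples later in this notebook, and the `embedding` response key matches the batch-transform output shown in section 6 (both are otherwise assumptions about the endpoint contract):

```python
import json

def build_embedding_payload(sentences):
    """Serialize a batch of input texts into a JSON request body."""
    return json.dumps({"text_inputs": sentences}).encode("utf-8")

def parse_embeddings(response_body):
    """Extract the list of embedding vectors from the endpoint response."""
    return json.loads(response_body)["embedding"]

payload = build_embedding_payload(
    ["How cute your dog is!", "Your dog is so cute."]
)
```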

[ ]

3.2 Query Endpoint to Get Nearest Neighbors

The deployed model can identify the nearest neighbors to input queries within a corpus. Given queries and a corpus, the model produces a list: for each query, the output provides both the corpus_id, which denotes the position of the matching entry in the input corpus list, and a score indicating its proximity to the query. Please keep in mind that payloads sent to the SageMaker invoke endpoint are restricted to approximately 5 MB, and the request timeout is 1 minute. If your corpus exceeds these limits, please use the approach outlined in section "4. Getting Nearest Neighbor On Your Own Dataset".

  • corpus: The list of inputs from which to find the nearest neighbors
  • queries: The list of inputs for which to find the nearest neighbors from the corpus
  • top_k: The number of nearest neighbors to find from the corpus
  • mode: Set to "nn_corpus" to get the nearest neighbors to the input queries within the corpus
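The four parameters above can be assembled into a request body like this (toy corpus and query; the commented response shape follows the corpus_id/score description above and is illustrative):

```python
import json

# Request body for nearest-neighbor search within a corpus.
payload = {
    "corpus": [
        "Amazon SageMaker is a fully managed service.",
        "The mitochondria is the powerhouse of the cell.",
    ],
    "queries": ["What is SageMaker?"],
    "top_k": 1,
    "mode": "nn_corpus",
}
body = json.dumps(payload).encode("utf-8")
# For each query the model returns entries such as
# [{"corpus_id": 0, "score": ...}] (shape illustrative).
```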
[ ]

Clean up the endpoint

[ ]

4. Getting Nearest Neighbor On Your Own Dataset


To find the nearest neighbor within your own dataset, you must provide the dataset in the specified format during the training process. The training job then generates embeddings for your dataset and saves them along with the model; these embeddings are used during inference to find the nearest neighbors for an input sentence. Once the embeddings are available, the nearest neighbors are found using the Sentence Transformers util functions, based on the cosine similarity between the input sentence embedding and the sentence embeddings computed during the training job.
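The nearest-neighbor step itself reduces to cosine similarity between embeddings. A self-contained sketch of that computation (toy 3-dimensional vectors stand in for real model embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_neighbors(query_emb, corpus_embs, top_k=1):
    """Return (corpus_id, score) pairs sorted by descending similarity."""
    scores = [(i, cosine_similarity(query_emb, e))
              for i, e in enumerate(corpus_embs)]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:top_k]

corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(nearest_neighbors([1.0, 0.05, 0.0], corpus, top_k=2))
```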


Required Data Format for the training job

  • Input: A directory containing a 'data.csv' file.
    • Each row of the first column of 'data.csv' should contain a unique id.
    • Each row of the second column should contain the corresponding text.
  • Output: A model prepackaged with the input data embeddings, which can be deployed for inference to get the nearest-neighbor embedding id for an input sentence.

Below is an example of a 'data.csv' file showing values in its first two columns. Note that the file should not have a header.

1,"Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows."
2,"For a list of the supported Amazon SageMaker AWS Regions, please visit the AWS Regional Services page. Also, for more information, see Regional endpoints in the AWS general reference guide."
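A quick sketch of producing such a file with Python's csv module, which quotes the text fields as needed and writes no header (an in-memory buffer stands in for the real file):

```python
import csv
import io

# The required format: no header, column 1 = unique id, column 2 = text.
rows = [
    (1, "Amazon SageMaker is a fully managed service to prepare data and "
        "build, train, and deploy machine learning (ML) models."),
    (2, "For a list of the supported Amazon SageMaker AWS Regions, please "
        "visit the AWS Regional Services page."),
]

buf = io.StringIO()  # swap for open("data.csv", "w", newline="") on disk
writer = csv.writer(buf)
writer.writerows(rows)
data_csv = buf.getvalue()
```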

4.1. Getting Dataset


In this section, we'll fetch and prepare the Amazon_SageMaker_FAQs dataset and use it to find the nearest neighbor to an input question.


[ ]
[ ]
[ ]

4.2. Set Training Parameters


There are two kinds of parameters that need to be set for training.

The first are the parameters for the training job: (i) the training data path, the S3 folder in which the input data is stored; (ii) the output path, the S3 folder in which the training output is stored; and (iii) the training instance type, the type of machine on which to run the training job. Typically, we use GPU instances for training.


The second set of parameters are the algorithm-specific training hyperparameters.
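A sketch of the first kind of parameters as a plain dict; the S3 paths and instance type are placeholders, not values from this notebook. (Default hyperparameters for a JumpStart model can also be fetched with `sagemaker.hyperparameters.retrieve_default(model_id=...)`.)

```python
# Training-job parameters (placeholder values; replace with your own).
training_job_params = {
    # S3 folder in which the input data is stored
    "training_data_path": "s3://<your-bucket>/sentence-similarity/input/",
    # S3 folder in which the training output is stored
    "output_path": "s3://<your-bucket>/sentence-similarity/output/",
    # a GPU instance, as noted above
    "training_instance_type": "ml.g5.2xlarge",
}
```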


[ ]

4.3. Getting the Embeddings for the Input Data


We start by creating the estimator object with all the required assets and then launch the training job.
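A guarded sketch of that step using the SDK's `JumpStartEstimator`; the "training" channel name and the parameter-dict keys are assumptions, and the call only runs with AWS credentials:

```python
def launch_training(model_id, params, role):
    """Create a JumpStartEstimator and launch the training job.
    `params` carries training_instance_type, output_path, and
    training_data_path (assumed keys); returns None outside AWS."""
    try:
        from sagemaker.jumpstart.estimator import JumpStartEstimator
        estimator = JumpStartEstimator(
            model_id=model_id,
            role=role,
            instance_type=params["training_instance_type"],
            output_path=params["output_path"],
        )
        # "training" is an assumed channel name for the data.csv directory.
        estimator.fit({"training": params["training_data_path"]})
        return estimator
    except Exception:
        return None  # no AWS credentials / SDK available
```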


[ ]
[ ]

4.4. Deploy & run Inference on the model


The deployed model can be used for running inference. We support two types of inference methods on the model and follow the same steps as in section 3. Deploy an Endpoint & Query Endpoint.


[ ]

4.5 Query endpoint

Query Endpoint to Get Embeddings

You can query the endpoint with a batch of input texts within a JSON payload. Here, we send a single request to the endpoint, and the parsed response is a list of embedding vectors.

[ ]

Query Endpoint to Get Nearest Neighbors

You can also query the endpoint with a JSON payload containing a batch of input texts to find the nearest neighbors of each input text within the dataset provided during the training job.

  • queries: The list of inputs for which to find the closest matches from the training data
  • top_k: The number of closest matches to find from the training data
  • mode: Set to "nn_train_data" to get the nearest neighbors to the input queries within the dataset provided
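The parameters above as a request body (toy query; the comment about response ids follows the data format described in section 4 and is otherwise an assumption):

```python
import json

# Request body for nearest-neighbor search within the training dataset.
payload = {
    "queries": ["Is SageMaker available in my region?"],
    "top_k": 2,
    "mode": "nn_train_data",
}
body = json.dumps(payload).encode("utf-8")
# Response entries are expected to reference the unique ids from the
# first column of data.csv.
```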
[ ]

5. Getting the Accuracy of deployed model on the Amazon_SageMaker_FAQs dataset


We will query the endpoint with the questions in our Amazon_SageMaker_FAQs dataset and check whether the sentence similarity model returns the correct corresponding answer.
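The accuracy check itself can be sketched independently of the endpoint. Here `get_neighbors` stands in for a call to the deployed model, and the assumption is that each returned record carries the matching id from data.csv's first column (the exact response shape may differ):

```python
def top1_accuracy(questions, answer_ids, get_neighbors):
    """Fraction of questions whose top-1 neighbor id equals the
    expected answer id. `get_neighbors(question)` stands in for a
    query to the deployed endpoint."""
    correct = sum(
        1 for q, expected in zip(questions, answer_ids)
        if get_neighbors(q)[0]["id"] == expected
    )
    return correct / len(questions)

# Toy stand-in for the endpoint: always returns id 1 first.
fake = lambda q: [{"id": 1, "score": 0.9}]
print(top1_accuracy(["q1", "q2"], [1, 2], fake))  # → 0.5
```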


[ ]
[ ]

6. Run Batch Transform to Get Embeddings On Large Datasets


Using SageMaker, we can perform batch inference on the model for large datasets; in this example, that means producing an embedding for each input sentence. When you start a batch transform job, Amazon SageMaker automatically provisions and manages the compute resources required to process the data, including CPU or GPU instances (depending on the selected instance type), storage, and networking. Once the batch transform job has completed, SageMaker automatically cleans up these resources: the instances and storage used during the job are terminated and removed, freeing up resources and minimizing costs.

Batch Transform is useful in the following scenarios:
    • Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.
    • Get inferences from large datasets.
    • Run inference when you don't need a persistent endpoint.
    • Associate input records with inferences to assist the interpretation of results.

The input format for the batch transform job is a JSONL file with entries such as:

  • {"id":1,"text_inputs":"How cute your dog is!"}
  • {"id":2,"text_inputs":"The mitochondria is the powerhouse of the cell."}

The output format is:

  • {"id":1, "embedding":[0.025507507845759392, 0.009654928930103779, -0.01139055471867323, .........]}
  • {"id":2, "embedding":[-0.018594933673739433, -0.011756304651498795, -0.006888044998049736,.....]}
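A sketch of preparing the input file and parsing the output, using exactly the record shapes shown above (the embedding values in the test are toy numbers):

```python
import json

sentences = [
    "How cute your dog is!",
    "The mitochondria is the powerhouse of the cell.",
]

# Build the batch-transform input: one JSON object per line (JSONL).
input_jsonl = "\n".join(
    json.dumps({"id": i + 1, "text_inputs": s})
    for i, s in enumerate(sentences)
)

def parse_output(output_jsonl):
    """Map each record id to its embedding from the batch-transform
    output (same JSONL record shape as shown above)."""
    return {rec["id"]: rec["embedding"]
            for rec in map(json.loads, output_jsonl.splitlines())}
```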

6.1. Prepare data for Batch Transform

[ ]
[ ]

6.2. Run Batch Transform

[ ]
[ ]
[ ]

Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.

[CI badges: us-east-1, us-east-2, us-west-1, ca-central-1, sa-east-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1]