OpenAI Customizing Embeddings

Customizing embeddings

This notebook demonstrates one way to customize OpenAI embeddings to a particular task.

The input is training data in the form of [text_1, text_2, label] where label is +1 if the pairs are similar and -1 if the pairs are dissimilar.

The output is a matrix that you can use to multiply your embeddings. The product of this multiplication is a 'custom embedding' that will better emphasize aspects of the text relevant to your use case. In binary classification use cases, we've seen error rates drop by as much as 50%.
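As a rough illustration of what "multiplying your embeddings" means, here is a minimal sketch. The random arrays stand in for real embeddings and for the learned matrix, and the helper names `customize` and `cosine_similarity` are my own, not taken from the notebook's code.

```python
import numpy as np

def customize(embeddings: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Multiply row-wise embeddings of shape (n, d) by a matrix of shape (d, k)."""
    return embeddings @ matrix

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2, 4))  # stand-in for two real embeddings
matrix = rng.normal(size=(4, 3))      # stand-in for the learned matrix

custom = customize(embeddings, matrix)
print(custom.shape)
print(cosine_similarity(custom[0], custom[1]))
```

Note that the custom embeddings are generally no longer unit length, so compare them with cosine similarity rather than a raw dot product.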

In the following example, I use 1,000 sentence pairs picked from the SNLI corpus. Each pair of sentences is logically entailed (i.e., one implies the other). These pairs are our positives (label = 1). We generate synthetic negatives by combining sentences from different pairs, which are presumed not to be logically entailed (label = -1).

For a clustering use case, you can generate positives by creating pairs from texts in the same clusters and generate negatives by creating pairs from sentences in different clusters.
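The clustering recipe above could be sketched as follows. This is only an illustration: the function name `pairs_from_clusters` and the toy clusters are hypothetical, and it enumerates every pair, which grows quadratically with the number of texts.

```python
from itertools import combinations

def pairs_from_clusters(texts_by_cluster):
    """Label every pair of texts 1 if they share a cluster, otherwise -1."""
    items = [
        (text, cluster)
        for cluster, texts in texts_by_cluster.items()
        for text in texts
    ]
    labeled = []
    for (text_a, cluster_a), (text_b, cluster_b) in combinations(items, 2):
        label = 1 if cluster_a == cluster_b else -1
        labeled.append((text_a, text_b, label))
    return labeled

clusters = {
    "sports": ["The team won.", "A late goal decided it."],
    "weather": ["Rain is expected."],
}
pairs = pairs_from_clusters(clusters)
for pair in pairs:
    print(pair)
```

With many texts you would probably sample from these pairs rather than keep all of them.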

With other data sets, we have seen decent improvement with as few as ~100 training examples. Of course, performance will be better with more examples.

0. Imports

[1]

1. Inputs

Most inputs are here. The key things to change are where to load your dataset from, where to save a cache of embeddings to, and which embedding engine you want to use.

Depending on how your data is formatted, you'll want to rewrite the process_input_data function.

[2]

2. Load and process input data

[3]

/var/folders/r4/x3kdvs816995fnnph2gdpwp40000gn/T/ipykernel_17509/1977422881.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["label"] = df["label"].apply(lambda x: {"entailment": 1, "contradiction": -1}[x])
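The warning above is what pandas emits when a column is assigned on a DataFrame that was produced by filtering another one. A minimal sketch of a `process_input_data` that avoids it by taking an explicit copy is shown here. The input column names `sentence1`, `sentence2`, and `gold_label` are assumptions about how the SNLI file is laid out, so adjust them to your data.

```python
import pandas as pd

def process_input_data(df: pd.DataFrame) -> pd.DataFrame:
    """Keep entailment pairs and return columns text_1, text_2, label."""
    # .copy() makes the filtered frame independent of the original,
    # so the column assignment below does not raise SettingWithCopyWarning.
    df = df[df["gold_label"].isin(["entailment"])].copy()
    df["label"] = df["gold_label"].map({"entailment": 1, "contradiction": -1})
    df = df.rename(columns={"sentence1": "text_1", "sentence2": "text_2"})
    return df[["text_1", "text_2", "label"]]

# Toy rows standing in for the real dataset.
raw = pd.DataFrame(
    {
        "sentence1": ["A man sleeps.", "A dog runs."],
        "sentence2": ["A person rests.", "A cat sits."],
        "gold_label": ["entailment", "contradiction"],
    }
)
processed = process_input_data(raw)
print(processed)
```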

3. Split data into training and test sets

Note that it's important to split data into training and test sets before generating synthetic negatives or positives. You don't want any text strings in the training data to show up in the test data. If there's contamination, the test metrics will look better than they'll actually be in production.
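One way to do the split and then check for contamination is sketched below. The helper names `split_pairs` and `shared_texts` and the toy pairs are hypothetical. Splitting pairs does not by itself guarantee that no string appears on both sides, which is why the check is there.

```python
import random

def split_pairs(pairs, test_fraction=0.5, seed=123):
    """Shuffle positive pairs and split them into (train, test)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def shared_texts(train, test):
    """Return the set of text strings that appear in both splits."""
    train_texts = {text for pair in train for text in pair}
    test_texts = {text for pair in test for text in pair}
    return train_texts & test_texts

positives = [
    ("A man sleeps.", "A person rests."),
    ("A dog runs.", "An animal moves."),
    ("Kids play soccer.", "Children are outside."),
    ("A woman sings.", "Someone makes music."),
]
train, test = split_pairs(positives)
print(len(train), len(test), shared_texts(train, test))
```

If `shared_texts` returns anything, move the offending pairs to one side before generating negatives.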

[4]

4. Generate synthetic negatives

This is another piece of the code that you will need to modify to match your use case.

If you have data with positives and negatives, you can skip this section.

If you have data with only positives, you can mostly keep this section as is; it generates negatives only.

If you have multiclass data, you will want to generate both positives and negatives. The positives can be pairs of text that share labels, and the negatives can be pairs of text that do not share labels.

The final output should be a dataframe with text pairs, where each pair is labeled -1 or 1.
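For the positives-only case, generating negatives by recombining sentences from different pairs could look roughly like this. The function name `generate_negatives` and its parameters are illustrative, not the notebook's own. It builds every cross combination in memory, so it is only suitable for small inputs as written.

```python
import random
from itertools import product

def generate_negatives(positive_pairs, negatives_per_positive=1, seed=123):
    """Sample (text_1, text_2, -1) triples that are not known positives."""
    positives = set(positive_pairs)
    texts_1 = sorted({a for a, _ in positive_pairs})
    texts_2 = sorted({b for _, b in positive_pairs})
    # Every cross combination that is not already a positive is a candidate.
    candidates = [p for p in product(texts_1, texts_2) if p not in positives]
    n = min(len(candidates), negatives_per_positive * len(positive_pairs))
    sampled = random.Random(seed).sample(candidates, n)
    return [(a, b, -1) for a, b in sampled]

train_positives = [
    ("A man sleeps.", "A person rests."),
    ("A dog runs.", "An animal moves."),
    ("Kids play soccer.", "Children are outside."),
]
train_negatives = generate_negatives(train_positives)
train_pairs = [(a, b, 1) for a, b in train_positives] + train_negatives
print(train_pairs)
```

Call it separately on the training and test splits so that no string crosses between them.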

[5]

[6]

5. Calculate embeddings and cosine similarities

Here, I create a cache to save the embeddings. This is handy because you don't have to pay for the same embeddings again when you rerun the code.
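A minimal sketch of such a cache, assuming a pickle file on disk and a caller-supplied `embed_fn(text, engine)` that does the actual API request. Both `embed_fn` and the helper names are hypothetical placeholders, not a real client call.

```python
import os
import pickle

def load_cache(path):
    """Load the embedding cache from disk, or start with an empty one."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

def cached_embedding(text, engine, cache, cache_path, embed_fn):
    """Return the embedding for (text, engine), calling embed_fn only on a miss."""
    key = (text, engine)
    if key not in cache:
        cache[key] = embed_fn(text, engine)  # the only place that costs money
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)            # persist so later runs can reuse it
    return cache[key]
```

Keying on both the text and the engine keeps embeddings from different models from being mixed up. Rewriting the whole file on every miss is simple but slow for large caches, so you may prefer to save once per batch.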

[ ]

6. Plot distribution of cosine similarity

Here we measure similarity of text using cosine similarity. In our experience, most distance functions (L1, L2, cosine similarity) all work about the same. Note that our embeddings are already normalized to length 1, so cosine similarity is equivalent to dot product.
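The claim about unit-length vectors is easy to check numerically. The sketch below uses small hand-made vectors rather than real embeddings. It also shows that, for unit vectors, squared L2 distance equals 2 minus twice the cosine similarity, which is one reason these measures tend to rank pairs the same way.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def normalize(u):
    n = norm(u)
    return [x / n for x in u]

def cosine_similarity(u, v):
    return dot(u, v) / (norm(u) * norm(v))

u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

cos = cosine_similarity(u, v)
squared_l2 = sum((x - y) ** 2 for x, y in zip(u, v))

print(math.isclose(dot(u, v), cos))           # dot product matches cosine
print(math.isclose(squared_l2, 2 - 2 * cos))  # L2 is a function of cosine
```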

The graphs show how much overlap there is between the distributions of cosine similarities for similar and dissimilar pairs. If there is a high amount of overlap, that means there are some dissimilar pairs with greater cosine similarity than some similar pairs.

The accuracy I compute is the accuracy of a simple rule that predicts 'similar (1)' if the cosine similarity is above some threshold X and otherwise predicts 'dissimilar (-1)'.
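A sketch of that thresholding rule, using -1 for dissimilar pairs as in the training data format. The sweep from -1 to 1 in steps of 0.01 and the binomial standard error are my assumptions about a reasonable way to do this; a ± figure could then be reported as roughly 1.96 standard errors.

```python
import math

def accuracy_at_threshold(similarities, labels, threshold):
    """Accuracy of predicting 1 when similarity > threshold, else -1."""
    predictions = [1 if s > threshold else -1 for s in similarities]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def best_threshold_accuracy(similarities, labels):
    """Sweep thresholds and return (threshold, accuracy, standard_error)."""
    best_accuracy, best_threshold = 0.0, -1.0
    for step in range(-100, 101):
        threshold = step / 100
        accuracy = accuracy_at_threshold(similarities, labels, threshold)
        if accuracy > best_accuracy:
            best_accuracy, best_threshold = accuracy, threshold
    # Binomial standard error of the accuracy estimate.
    standard_error = math.sqrt(best_accuracy * (1 - best_accuracy) / len(labels))
    return best_threshold, best_accuracy, standard_error

similarities = [0.9, 0.8, 0.75, 0.6, 0.7, 0.4]
labels = [1, 1, 1, -1, -1, -1]
threshold, accuracy, se = best_threshold_accuracy(similarities, labels)
print(f"threshold={threshold}, accuracy={accuracy:.1%} ± {1.96 * se:.1%}")
```

Because the threshold is chosen on the same data it is scored on, the resulting accuracy is slightly optimistic for that dataset.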

[8]

train accuracy: 89.1% ± 2.4%
test accuracy: 88.8% ± 2.4%

7. Optimize the matrix using the training data provided
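To give a feel for what this optimization involves, here is a small self-contained sketch. It is not the notebook's training code: it assumes a mean squared error loss between the cosine similarity of the transformed embeddings and the ±1 labels, a randomly initialized matrix, and plain full-batch gradient descent with a hand-derived gradient. All names and the toy data are illustrative.

```python
import numpy as np

def loss_and_grad(matrix, e1, e2, labels):
    """MSE between cosine similarity of (e1 @ M, e2 @ M) and labels, plus dLoss/dM."""
    a, b = e1 @ matrix, e2 @ matrix
    na = np.linalg.norm(a, axis=1, keepdims=True)
    nb = np.linalg.norm(b, axis=1, keepdims=True)
    s = np.sum(a * b, axis=1, keepdims=True) / (na * nb)  # cosine similarity
    err = s - labels[:, None]
    loss = float(np.mean(err ** 2))
    # Derivative of cosine similarity with respect to each transformed vector.
    ds_da = b / (na * nb) - s * a / na ** 2
    ds_db = a / (na * nb) - s * b / nb ** 2
    g = 2 * err / len(labels)
    grad = e1.T @ (g * ds_da) + e2.T @ (g * ds_db)
    return loss, grad

def train_matrix(e1, e2, labels, out_dim, epochs=30, learning_rate=0.1, seed=0):
    """Gradient descent on the matrix; returns the best matrix seen and the loss history."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(size=(e1.shape[1], out_dim))
    best_loss, best_matrix, history = np.inf, matrix.copy(), []
    for _ in range(epochs + 1):  # the first pass scores the initial matrix
        loss, grad = loss_and_grad(matrix, e1, e2, labels)
        history.append(loss)
        if loss < best_loss:
            best_loss, best_matrix = loss, matrix.copy()
        matrix = matrix - learning_rate * grad
    return best_matrix, best_loss, history

# Toy data with random labels, only to exercise the code.
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=(40, 6)), rng.normal(size=(40, 6))
labels = np.where(rng.random(40) < 0.5, 1.0, -1.0)
best_matrix, best_loss, history = train_matrix(e1, e2, labels, out_dim=6)
print(f"initial loss {history[0]:.3f}, best loss {best_loss:.3f}")
```

Keeping the best matrix seen, rather than the last one, mirrors the idea of reporting the best matrix found during training. In practice you would pick it using held-out data rather than the training loss.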

[9]

[10]

[11]

Epoch 1/30: train accuracy: 89.1% ± 2.4%
Epoch 1/30: test accuracy: 88.4% ± 2.4%
Epoch 2/30: train accuracy: 89.5% ± 2.3%
Epoch 2/30: test accuracy: 88.8% ± 2.4%
Epoch 3/30: train accuracy: 90.6% ± 2.2%
Epoch 3/30: test accuracy: 89.3% ± 2.3%
Epoch 4/30: train accuracy: 91.2% ± 2.2%
Epoch 4/30: test accuracy: 89.7% ± 2.3%
Epoch 5/30: train accuracy: 91.5% ± 2.1%
Epoch 5/30: test accuracy: 90.0% ± 2.3%
Epoch 6/30: train accuracy: 91.9% ± 2.1%
Epoch 6/30: test accuracy: 90.4% ± 2.2%
Epoch 7/30: train accuracy: 92.2% ± 2.0%
Epoch 7/30: test accuracy: 90.7% ± 2.2%
Epoch 8/30: train accuracy: 92.7% ± 2.0%
Epoch 8/30: test accuracy: 90.9% ± 2.2%
Epoch 9/30: train accuracy: 92.7% ± 2.0%
Epoch 9/30: test accuracy: 91.0% ± 2.2%
Epoch 10/30: train accuracy: 93.0% ± 1.9%
Epoch 10/30: test accuracy: 91.6% ± 2.1%
Epoch 11/30: train accuracy: 93.1% ± 1.9%
Epoch 11/30: test accuracy: 91.8% ± 2.1%
Epoch 12/30: train accuracy: 93.4% ± 1.9%
Epoch 12/30: test accuracy: 92.1% ± 2.0%
Epoch 13/30: train accuracy: 93.6% ± 1.9%
Epoch 13/30: test accuracy: 92.4% ± 2.0%
Epoch 14/30: train accuracy: 93.7% ± 1.8%
Epoch 14/30: test accuracy: 92.7% ± 2.0%
Epoch 15/30: train accuracy: 93.7% ± 1.8%
Epoch 15/30: test accuracy: 92.7% ± 2.0%
Epoch 16/30: train accuracy: 94.0% ± 1.8%
Epoch 16/30: test accuracy: 93.0% ± 1.9%
Epoch 17/30: train accuracy: 94.0% ± 1.8%
Epoch 17/30: test accuracy: 93.0% ± 1.9%
Epoch 18/30: train accuracy: 94.2% ± 1.8%
Epoch 18/30: test accuracy: 93.1% ± 1.9%
Epoch 19/30: train accuracy: 94.2% ± 1.8%
Epoch 19/30: test accuracy: 93.1% ± 1.9%
Epoch 20/30: train accuracy: 94.3% ± 1.8%
Epoch 20/30: test accuracy: 93.0% ± 1.9%
Epoch 21/30: train accuracy: 94.5% ± 1.7%
Epoch 21/30: test accuracy: 93.1% ± 1.9%
Epoch 22/30: train accuracy: 94.5% ± 1.7%
Epoch 22/30: test accuracy: 93.3% ± 1.9%
Epoch 23/30: train accuracy: 94.6% ± 1.7%
Epoch 23/30: test accuracy: 93.3% ± 1.9%
Epoch 24/30: train accuracy: 94.6% ± 1.7%
Epoch 24/30: test accuracy: 93.3% ± 1.9%
Epoch 25/30: train accuracy: 94.8% ± 1.7%
Epoch 25/30: test accuracy: 93.3% ± 1.9%
Epoch 26/30: train accuracy: 94.8% ± 1.7%
Epoch 26/30: test accuracy: 93.4% ± 1.9%
Epoch 27/30: train accuracy: 94.8% ± 1.7%
Epoch 27/30: test accuracy: 93.4% ± 1.9%
Epoch 28/30: train accuracy: 94.9% ± 1.7%
Epoch 28/30: test accuracy: 93.4% ± 1.9%
Epoch 29/30: train accuracy: 94.9% ± 1.7%
Epoch 29/30: test accuracy: 93.4% ± 1.9%
Epoch 30/30: train accuracy: 94.9% ± 1.7%
Epoch 30/30: test accuracy: 93.3% ± 1.9%
Epoch 1/30: train accuracy: 89.7% ± 2.3%
Epoch 1/30: test accuracy: 89.1% ± 2.4%
Epoch 2/30: train accuracy: 89.8% ± 2.3%
Epoch 2/30: test accuracy: 89.9% ± 2.3%
Epoch 3/30: train accuracy: 90.3% ± 2.2%
Epoch 3/30: test accuracy: 90.0% ± 2.3%
Epoch 4/30: train accuracy: 91.0% ± 2.2%
Epoch 4/30: test accuracy: 90.3% ± 2.2%
Epoch 5/30: train accuracy: 91.3% ± 2.1%
Epoch 5/30: test accuracy: 90.3% ± 2.2%
Epoch 6/30: train accuracy: 91.8% ± 2.1%
Epoch 6/30: test accuracy: 90.4% ± 2.2%
Epoch 7/30: train accuracy: 92.4% ± 2.0%
Epoch 7/30: test accuracy: 91.0% ± 2.2%
Epoch 8/30: train accuracy: 92.8% ± 2.0%
Epoch 8/30: test accuracy: 91.3% ± 2.1%
Epoch 9/30: train accuracy: 93.1% ± 1.9%
Epoch 9/30: test accuracy: 91.6% ± 2.1%
Epoch 10/30: train accuracy: 93.4% ± 1.9%
Epoch 10/30: test accuracy: 91.9% ± 2.1%
Epoch 11/30: train accuracy: 93.4% ± 1.9%
Epoch 11/30: test accuracy: 91.8% ± 2.1%
Epoch 12/30: train accuracy: 93.6% ± 1.9%
Epoch 12/30: test accuracy: 92.1% ± 2.0%
Epoch 13/30: train accuracy: 93.7% ± 1.8%
Epoch 13/30: test accuracy: 92.4% ± 2.0%
Epoch 14/30: train accuracy: 93.7% ± 1.8%
Epoch 14/30: test accuracy: 92.5% ± 2.0%
Epoch 15/30: train accuracy: 93.9% ± 1.8%
Epoch 15/30: test accuracy: 92.8% ± 2.0%
Epoch 16/30: train accuracy: 94.0% ± 1.8%
Epoch 16/30: test accuracy: 92.8% ± 2.0%
Epoch 17/30: train accuracy: 94.0% ± 1.8%
Epoch 17/30: test accuracy: 92.8% ± 2.0%
Epoch 18/30: train accuracy: 94.2% ± 1.8%
Epoch 18/30: test accuracy: 92.8% ± 2.0%
Epoch 19/30: train accuracy: 94.2% ± 1.8%
Epoch 19/30: test accuracy: 92.8% ± 2.0%
Epoch 20/30: train accuracy: 94.2% ± 1.8%
Epoch 20/30: test accuracy: 93.1% ± 1.9%
Epoch 21/30: train accuracy: 94.3% ± 1.8%
Epoch 21/30: test accuracy: 93.3% ± 1.9%
Epoch 22/30: train accuracy: 94.3% ± 1.8%
Epoch 22/30: test accuracy: 93.3% ± 1.9%
Epoch 23/30: train accuracy: 94.5% ± 1.7%
Epoch 23/30: test accuracy: 93.3% ± 1.9%
Epoch 24/30: train accuracy: 94.5% ± 1.7%
Epoch 24/30: test accuracy: 93.3% ± 1.9%
Epoch 25/30: train accuracy: 94.6% ± 1.7%
Epoch 25/30: test accuracy: 93.4% ± 1.9%
Epoch 26/30: train accuracy: 94.6% ± 1.7%
Epoch 26/30: test accuracy: 93.3% ± 1.9%
Epoch 27/30: train accuracy: 94.6% ± 1.7%
Epoch 27/30: test accuracy: 93.4% ± 1.9%
Epoch 28/30: train accuracy: 94.8% ± 1.7%
Epoch 28/30: test accuracy: 93.4% ± 1.9%
Epoch 29/30: train accuracy: 94.8% ± 1.7%
Epoch 29/30: test accuracy: 93.3% ± 1.9%
Epoch 30/30: train accuracy: 94.8% ± 1.7%
Epoch 30/30: test accuracy: 93.4% ± 1.9%
Epoch 1/30: train accuracy: 90.7% ± 2.2%
Epoch 1/30: test accuracy: 89.9% ± 2.3%
Epoch 2/30: train accuracy: 90.9% ± 2.2%
Epoch 2/30: test accuracy: 90.3% ± 2.2%
Epoch 3/30: train accuracy: 91.6% ± 2.1%
Epoch 3/30: test accuracy: 90.3% ± 2.2%
Epoch 4/30: train accuracy: 92.2% ± 2.0%
Epoch 4/30: test accuracy: 90.7% ± 2.2%
Epoch 5/30: train accuracy: 92.4% ± 2.0%
Epoch 5/30: test accuracy: 91.3% ± 2.1%
Epoch 6/30: train accuracy: 92.5% ± 2.0%
Epoch 6/30: test accuracy: 91.8% ± 2.1%
Epoch 7/30: train accuracy: 93.0% ± 1.9%
Epoch 7/30: test accuracy: 92.2% ± 2.0%
Epoch 8/30: train accuracy: 93.1% ± 1.9%
Epoch 8/30: test accuracy: 92.7% ± 2.0%
Epoch 9/30: train accuracy: 93.3% ± 1.9%
Epoch 9/30: test accuracy: 92.5% ± 2.0%
Epoch 10/30: train accuracy: 93.4% ± 1.9%
Epoch 10/30: test accuracy: 92.7% ± 2.0%
Epoch 11/30: train accuracy: 93.6% ± 1.9%
Epoch 11/30: test accuracy: 92.8% ± 2.0%
Epoch 12/30: train accuracy: 93.7% ± 1.8%
Epoch 12/30: test accuracy: 92.8% ± 2.0%
Epoch 13/30: train accuracy: 94.0% ± 1.8%
Epoch 13/30: test accuracy: 93.0% ± 1.9%
Epoch 14/30: train accuracy: 93.9% ± 1.8%
Epoch 14/30: test accuracy: 93.0% ± 1.9%
Epoch 15/30: train accuracy: 94.2% ± 1.8%
Epoch 15/30: test accuracy: 93.0% ± 1.9%
Epoch 16/30: train accuracy: 94.2% ± 1.8%
Epoch 16/30: test accuracy: 93.0% ± 1.9%
Epoch 17/30: train accuracy: 94.3% ± 1.8%
Epoch 17/30: test accuracy: 93.0% ± 1.9%
Epoch 18/30: train accuracy: 94.5% ± 1.7%
Epoch 18/30: test accuracy: 93.1% ± 1.9%
Epoch 19/30: train accuracy: 94.5% ± 1.7%
Epoch 19/30: test accuracy: 93.1% ± 1.9%
Epoch 20/30: train accuracy: 94.6% ± 1.7%
Epoch 20/30: test accuracy: 93.3% ± 1.9%
Epoch 21/30: train accuracy: 94.8% ± 1.7%
Epoch 21/30: test accuracy: 93.3% ± 1.9%
Epoch 22/30: train accuracy: 94.8% ± 1.7%
Epoch 22/30: test accuracy: 93.4% ± 1.9%
Epoch 23/30: train accuracy: 94.8% ± 1.7%
Epoch 23/30: test accuracy: 93.4% ± 1.9%
Epoch 24/30: train accuracy: 94.8% ± 1.7%
Epoch 24/30: test accuracy: 93.4% ± 1.9%
Epoch 25/30: train accuracy: 94.8% ± 1.7%
Epoch 25/30: test accuracy: 93.4% ± 1.9%
Epoch 26/30: train accuracy: 94.9% ± 1.7%
Epoch 26/30: test accuracy: 93.6% ± 1.9%
Epoch 27/30: train accuracy: 94.9% ± 1.7%
Epoch 27/30: test accuracy: 93.6% ± 1.9%
Epoch 28/30: train accuracy: 94.9% ± 1.7%
Epoch 28/30: test accuracy: 93.6% ± 1.9%
Epoch 29/30: train accuracy: 95.1% ± 1.6%
Epoch 29/30: test accuracy: 93.6% ± 1.9%
Epoch 30/30: train accuracy: 95.1% ± 1.6%
Epoch 30/30: test accuracy: 93.6% ± 1.9%

[12]

8. Plot the before & after, showing the results of the best matrix found during training

The better the matrix is, the more cleanly it will separate the similar and dissimilar pairs.

[13]

[14]

Test accuracy: 88.8% ± 2.4%

Test accuracy after customization: 93.6% ± 1.9%
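To relate these two numbers to an error rate, the arithmetic can be spelled out as below. The helper name is my own; the inputs are the two test accuracies printed above.

```python
def relative_error_reduction(accuracy_before, accuracy_after):
    """Fraction by which the error rate (1 - accuracy) shrinks."""
    error_before = 1 - accuracy_before
    error_after = 1 - accuracy_after
    return (error_before - error_after) / error_before

reduction = relative_error_reduction(0.888, 0.936)
print(f"{reduction:.1%}")  # 42.9%
```

In other words, the test error falls from 11.2% to 6.4%, a relative reduction of about 43% on this dataset.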

[15]

array([[-1.2566795e+00, -1.5297449e+00, -1.3271648e-01, ...,
        -1.2859761e+00, -5.3254390e-01,  4.8364732e-01],
       [-1.4826347e+00,  9.2656955e-02, -4.2437232e-01, ...,
         1.1872858e+00, -1.0831847e+00, -1.0683593e+00],
       [-2.2029283e+00, -1.9703420e+00,  3.1125939e-01, ...,
         2.2947595e+00,  5.5780332e-03, -6.0171342e-01],
       ...,
       [-1.1019799e-01,  1.3599515e+00, -4.7677776e-01, ...,
         6.5626711e-01,  7.2359240e-01,  3.0733588e+00],
       [ 1.6624762e-03,  4.2648423e-01, -1.1380885e+00, ...,
         8.7202555e-01,  9.3173909e-01, -1.6760436e+00],
       [ 7.7449006e-01,  4.9213606e-01,  3.5407653e-01, ...,
         1.3460466e+00, -1.9509128e-01,  7.7514690e-01]], dtype=float32)
