Gif Search

NLP Powered GIF search

We will use the Tumblr GIF Description Dataset, which contains over 100k animated GIFs and 120k sentences describing their visual content. Using this data with a vector database and a retriever, we can create an NLP-powered GIF search tool.

There are a few packages that must be installed for this notebook to run:

[ ]

We must also set the following notebook parameters to display the GIF images we will be working with.

[1]

Download and Extract Dataset

First, let's download and extract the dataset, which is available on GitHub. We can use the link below to download it directly, or open the link in a browser to download the files manually.

[ ]
[ ]
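
The download-and-extract step can be sketched with the standard library alone. This is a minimal sketch that assumes the dataset is distributed as a zip archive; the function name `download_and_extract`, the local file name `dataset.zip`, and the `url` argument are hypothetical, and the actual dataset link is the one referenced in the text above.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url: str, dest_dir: str = ".") -> list[str]:
    """Download a zip archive from `url` and unpack it into `dest_dir`.

    Returns the names of the archive members that were extracted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / "dataset.zip"  # hypothetical local file name

    # Fetch the archive to disk (works for http(s):// and file:// URLs).
    urllib.request.urlretrieve(url, archive_path)

    # Unpack everything next to the archive.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Calling `download_and_extract(dataset_url)` with the GitHub archive link should leave the extracted files, including the data folder, in the current directory.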

Explore the Dataset

Now let's explore the downloaded files. The data we want is in the tgif-v1.0.tsv file in the data folder. We can use the pandas library to open the file. We need to set the delimiter to \t because the file contains tab-separated values.

[2]
[3]
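
Loading the file might look like the sketch below. The column name `url` appears in the notebook output; the name `description`, the assumption that the file has no header row, and the helper name `load_descriptions` are illustrative choices, not taken from the original code.

```python
import pandas as pd


def load_descriptions(tsv_path: str) -> pd.DataFrame:
    """Read the tab-separated file into a two-column DataFrame."""
    # Assumption: the file has no header row, with the GIF URL in the
    # first column and its description in the second.
    return pd.read_csv(tsv_path, delimiter="\t", names=["url", "description"])
```

With the dataset extracted, `df = load_descriptions("data/tgif-v1.0.tsv")` followed by `df.head()` gives a quick look at the first rows (the exact path depends on where the archive was unpacked).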

Note that the dataset does not contain the actual GIF files, only URLs we can use to access them. This is convenient because we do not need to download or store all the GIFs. We can load the required GIFs directly from their URLs when displaying the search results.

There are some duplicate descriptions in the dataset.

[4]
125782
[5]
102068
[6]
https://38.media.tumblr.com/ddbfe51aff57fd8446f49546bc027bd7/tumblr_nowv0v6oWj1uwbrato1_500.gif    4
https://33.media.tumblr.com/46c873a60bb8bd97bdc253b826d1d7a1/tumblr_nh7vnlXEvL1u6fg3no1_500.gif    4
https://38.media.tumblr.com/b544f3c87cbf26462dc267740bb1c842/tumblr_n98uooxl0K1thiyb6o1_250.gif    4
https://33.media.tumblr.com/88235b43b48e9823eeb3e7890f3d46ef/tumblr_nkg5leY4e21sof15vo1_500.gif    4
https://31.media.tumblr.com/69bca8520e1f03b4148dde2ac78469ec/tumblr_npvi0kW4OD1urqm0mo1_400.gif    4
Name: url, dtype: int64

Let's take a look at one of these duplicated URLs and its descriptions.

[7]
two girls are singing music pop in a concert
a woman sings sang girl on a stage singing
two girls on a stage sing into microphones.
two girls dressed in black are singing.
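
The duplicate checks above can be reproduced with a few pandas calls. The sketch below runs on a tiny made-up DataFrame so it is self-contained; the column names `url` and `description` are the same assumed names as before, and the values are invented for illustration.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame, with the same two assumed columns.
df = pd.DataFrame(
    {
        "url": ["a.gif", "a.gif", "b.gif", "a.gif"],
        "description": [
            "two girls sing",
            "a woman sings",
            "a cat plays",
            "girls on a stage",
        ],
    }
)

total_descriptions = len(df)         # one row per description
unique_gifs = df["url"].nunique()    # number of distinct GIF URLs
per_gif = df["url"].value_counts()   # descriptions per URL, largest first

# All descriptions attached to the most-described GIF.
top_url = per_gif.index[0]
top_descriptions = df.loc[df["url"] == top_url, "description"].tolist()

print(total_descriptions, unique_gifs)
print(top_descriptions)
```

On the real data, the first two values correspond to the description count and the unique GIF count reported above.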

There is no reason to remove these duplicates because, as shown here, every description is accurate. You can spot-check a few of the other URLs, but they all appear to follow the same pattern: several accurate descriptions for a single GIF.

That leaves us with 125,782 descriptions for 102,068 GIFs. We will use these descriptions to create context vectors that will be indexed in a vector database to create our GIF search tool. Let's take a look at a few more examples of GIFs and their descriptions.

[55]
a man is glaring, and someone with sunglasses appears.
a cat tries to catch a mouse on a tablet
a man dressed in red is dancing.
an animal comes close to another in the jungle
a man in a hat adjusts his tie and makes a weird face.

We can see that each description accurately captures what is happening in its GIF, so we can use these descriptions to search through our GIFs.

Using this data, we can build the GIF search tool with just two components:

  • a retriever to embed GIF descriptions
  • a vector database to store GIF description embeddings and retrieve relevant GIFs

Initialize Pinecone Index

The vector database stores vector representations of our GIF descriptions, which we can retrieve using another vector (the query vector). We will use the Pinecone vector database, a fully managed vector database that can store and search through billions of records in milliseconds. You could build this tool with another vector search tool such as FAISS, but you may need to manage it yourself.

To initialize the database, we sign up for a free Pinecone API key and pip install pinecone-client. Once ready, we initialize our index with:

[ ]

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all available providers and regions here.

[ ]

Create the index:

[ ]
[ ]

Here we specify the name of the index where we will store our GIF descriptions and their URLs, the similarity metric, and the embedding dimension of the vectors. The similarity metric and embedding dimension depend on the embedding model used. Many retrievers use "cosine"; the model used in this notebook produces 384-dimensional embeddings, so the index dimension must be 384.
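
The initialization, specification, and creation steps could be combined roughly as follows. This is a sketch written against the `Pinecone` and `ServerlessSpec` classes of recent Pinecone Python client versions; the index name `gif-search`, the `aws` / `us-east-1` defaults, and the helper name `create_gif_index` are assumptions for illustration, not values taken from the original cells.

```python
import time

INDEX_NAME = "gif-search"   # hypothetical index name
DIMENSION = 384             # must match the retriever's embedding size
METRIC = "cosine"


def create_gif_index(api_key: str, cloud: str = "aws", region: str = "us-east-1"):
    """Create the index if it is missing and return a handle to it."""
    # Imported here so the constants above can be used without the package.
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key=api_key)
    if INDEX_NAME not in pc.list_indexes().names():
        pc.create_index(
            name=INDEX_NAME,
            dimension=DIMENSION,
            metric=METRIC,
            spec=ServerlessSpec(cloud=cloud, region=region),
        )
        # Wait until the index reports that it is ready to accept requests.
        while not pc.describe_index(INDEX_NAME).status["ready"]:
            time.sleep(1)
    return pc.Index(INDEX_NAME)
```

A typical call would read the key from the environment, for example `index = create_gif_index(os.environ["PINECONE_API_KEY"])`, so the key never appears in the notebook itself.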

Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

  1. Generate embeddings for all the GIF descriptions (context vectors/embeddings)
  2. Generate embeddings for the query (query vector/embedding)

The retriever generates embeddings such that queries and GIF descriptions with similar meanings lie close together in the same vector space. We can then use cosine similarity between the query and context embeddings to find the GIFs most relevant to our query.
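
To make the similarity step concrete, here is a small worked example in plain Python. The three-dimensional vectors are made up for illustration; real embeddings come from the retriever and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only.
query = [1.0, 0.0, 1.0]
contexts = {
    "a dog dancing": [1.0, 0.0, 1.0],
    "a man adjusts his tie": [0.0, 1.0, 0.0],
    "a puppy jumping": [1.0, 1.0, 1.0],
}

# Rank descriptions from most to least similar to the query.
ranked = sorted(
    contexts,
    key=lambda d: cosine_similarity(query, contexts[d]),
    reverse=True,
)
print(ranked)
```

The vector that points in the same direction as the query ranks first, and the orthogonal one ranks last. A vector database performs the same kind of ranking, but over the whole index and without comparing every pair by hand.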

We will use a SentenceTransformer model as our retriever. As the model summary below shows, it has a BERT-style backbone and outputs normalized 384-dimensional embeddings. This model performs well out of the box when searching based on generic semantic similarity.

[10]
[11]
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
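
Loading the retriever might look like the sketch below. The model name is an assumption: the printed summary (BertModel backbone, 256-token limit, mean pooling, 384 dimensions) is consistent with `all-MiniLM-L6-v2`, but the original cell is not shown. The helper names are hypothetical.

```python
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model name


def load_retriever(model_name: str = MODEL_NAME, device: str = "cpu"):
    """Load a SentenceTransformer model to use as the retriever."""
    # Imported lazily: the package is large and downloads model weights.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name, device=device)


def embedding_dimension(retriever) -> int:
    """Size of the vectors the model produces; the index must match it."""
    return retriever.get_sentence_embedding_dimension()
```

Checking `embedding_dimension(retriever)` against the index dimension before upserting is a cheap way to catch a model and index mismatch early.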

Generate Embeddings and Upsert

Now our retriever and the Pinecone index are initialized. Next, we need to generate embeddings for the GIF descriptions. We will do this in batches of 64: the retriever embeds 64 descriptions at once instead of one at a time, and we send a single API call for each batch. Both are much faster than processing descriptions individually.

When passing the documents to Pinecone, each document (one GIF description) needs an id (a unique value), an embedding (the vector generated for the description), and metadata. The metadata is a dictionary containing data relevant to our embeddings. For the GIF search tool, we only need the URL and description.

[18]
  0%|          | 0/1966 [00:00<?, ?it/s]
{'dimension': 384,
 'index_fullness': 0.05,
 'namespaces': {'': {'vector_count': 125782}}}
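
The batching loop described above can be sketched as follows. The function and argument names are hypothetical; `retriever` is assumed to expose `encode(list_of_texts)` as SentenceTransformer models do, and `index` is assumed to expose `upsert(vectors=...)` accepting `(id, values, metadata)` tuples as the Pinecone client does. Using the row position as the id is also an assumption.

```python
from typing import Iterator, Sequence


def batched(n_items: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(n_items)."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


def upsert_descriptions(urls: Sequence[str], descriptions: Sequence[str],
                        retriever, index, batch_size: int = 64) -> int:
    """Embed descriptions in batches and upsert them; returns the batch count."""
    n_batches = 0
    for start, end in batched(len(descriptions), batch_size):
        batch = list(descriptions[start:end])
        # One model call per batch; convert rows to plain lists of floats.
        embeddings = [[float(x) for x in row] for row in retriever.encode(batch)]
        ids = [str(i) for i in range(start, end)]  # row position as a unique id
        metadata = [{"url": u, "description": d}
                    for u, d in zip(urls[start:end], batch)]
        # One API call per batch: (id, vector, metadata) tuples.
        index.upsert(vectors=list(zip(ids, embeddings, metadata)))
        n_batches += 1
    return n_batches
```

With 125,782 descriptions and a batch size of 64, this loop runs 1,966 times, which matches the total shown in the progress bar above. Passing plain lists (for example `df["url"].tolist()`) keeps the slicing positional.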

We can see that all our documents are now in the Pinecone index. Let's run some queries to test our GIF search tool.

Querying

We have two functions: search_gif, which handles our search query, and display_gif, which displays the search results.

The search_gif function generates a vector embedding for the search query using the retriever model and then runs the query on the Pinecone index. index.query computes the cosine similarity between the query embedding and the GIF description embeddings, because we set the metric to "cosine" when we created the index. The function returns the URLs of the 10 GIFs most relevant to our search query.

[33]
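
A possible shape for search_gif is sketched below. The signature is an assumption (the original cell is not shown): here the retriever and index are passed in explicitly rather than read from notebook globals, and the metadata is assumed to contain a `url` field as described earlier.

```python
def search_gif(query: str, retriever, index, top_k: int = 10) -> list[str]:
    """Return the URLs of the `top_k` GIFs whose descriptions best match `query`."""
    # Embed the query with the same model used for the descriptions.
    query_vector = [float(x) for x in retriever.encode(query)]
    # Ask the index for the nearest description vectors and their metadata.
    result = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    return [match["metadata"]["url"] for match in result["matches"]]
```

Passing the retriever and index as arguments makes the function easy to try with stand-in objects before pointing it at the real index.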

The display_gif function takes a list of URLs and displays the GIFs in a grid in the Jupyter notebook. We use it to display the top 10 GIFs returned by search_gif.

[34]
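
One way to render such a grid is to build a small HTML snippet and hand it to IPython. The layout below (flexbox, 200-pixel images, five per row) and the helper name `gif_grid_html` are illustrative choices, not the original implementation.

```python
from html import escape


def gif_grid_html(urls: list[str], columns: int = 5, width: int = 200) -> str:
    """Build an HTML snippet that lays the GIFs out in a wrapping grid."""
    figures = "".join(
        f'<img src="{escape(url, quote=True)}" style="width: {width}px; margin: 4px;">'
        for url in urls
    )
    # Cap the container width so that `columns` images fit on each row
    # (each image takes its width plus a 4px margin on both sides).
    max_width = columns * (width + 8)
    return (
        f'<div style="display: flex; flex-wrap: wrap; max-width: {max_width}px;">'
        f"{figures}</div>"
    )


def display_gif(urls: list[str]):
    """Render the grid in a Jupyter notebook and return the HTML object."""
    from IPython.display import HTML  # available inside Jupyter
    return HTML(gif_grid_html(urls))
```

Because the images are loaded straight from their URLs, nothing has to be downloaded or stored locally to show the results.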

Let's begin testing some queries.

[52]
[40]
[41]
[42]
[43]
[48]

Let's describe the third GIF, which shows a ginger dog dancing on its hind legs.

[49]

These look like pretty cool results.


Delete the Index

Once you're done with the index, delete it to save resources.

[ ]
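
Cleanup can be as short as the sketch below, assuming `pc` is a client object from a recent Pinecone Python client version (one that provides `list_indexes().names()` and `delete_index`). The helper name and the existence check are illustrative additions.

```python
def delete_gif_index(pc, index_name: str) -> bool:
    """Delete `index_name` if it exists; returns True when something was deleted."""
    # `pc` is assumed to be a pinecone.Pinecone client instance.
    if index_name in pc.list_indexes().names():
        pc.delete_index(index_name)
        return True
    return False
```

Deleting an index removes all vectors stored in it, so the upsert step has to be repeated if the index is recreated later.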