White House Speeches


Search White House Speeches from 2021 to 2022 Based On Content

This example builds a semantic search over White House speeches from 2021 to 2022, many of which were delivered after GPT-3.5 was trained. The dataset, The White House (Speeches and Remarks) 12/10/2022, is available on Kaggle, and we have also made a copy available on Google Drive. We search the speeches with a vector database and the sentence-transformers library, using Milvus Lite to run the vector database locally.

We begin by installing the necessary libraries:
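
A minimal sketch of the install step, written in Python so it can run from any notebook kernel. The package list is an assumption based on the libraries named in this walkthrough: `pymilvus` for the client, `milvus` for the older Milvus Lite distribution that ships an embedded server, `sentence-transformers`, `pandas`, and `gdown` for the Google Drive download. Versions are not pinned here, so you may need to pin a compatible `pymilvus`/`milvus` pair.

```python
import subprocess
import sys

# Assumed package list: the libraries named in this walkthrough, plus gdown
# for fetching the dataset from Google Drive. Pin versions as needed.
PACKAGES = ["pymilvus", "milvus", "sentence-transformers", "pandas", "gdown"]


def install(packages=PACKAGES):
    """Install the packages into the interpreter that runs this notebook."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", *packages])


# Call install() once before running the remaining cells.
```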


Download Dataset

Next, we download and extract our dataset:
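
One way to sketch this step, assuming the Google Drive copy is a zip archive. `DATASET_URL`, `ARCHIVE_PATH`, and `DATA_DIR` are hypothetical placeholders, not the real link or file names; `gdown` is one common tool for Google Drive downloads and is imported lazily so that the extraction helper works on its own.

```python
import zipfile
from pathlib import Path

# Hypothetical placeholders: substitute the real share link and file names.
DATASET_URL = "<google-drive-share-link>"
ARCHIVE_PATH = "white_house_speeches.zip"
DATA_DIR = "white_house_speeches"


def download_dataset(url=DATASET_URL, output=ARCHIVE_PATH):
    """Fetch the archive from Google Drive and return the local path."""
    import gdown  # third-party; only needed when actually downloading

    return gdown.download(url, output, quiet=False)


def extract_dataset(archive_path=ARCHIVE_PATH, dest_dir=DATA_DIR):
    """Unpack the archive and return the names of the extracted files."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Calling `download_dataset()` followed by `extract_dataset()` leaves the CSV file in `DATA_DIR`, ready to be read with pandas.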


Clean the Data

This dataset is not pre-cleaned, so we need to clean it before we can work with it. First, we drop every row that contains Null or NaN values using .dropna(). Next, we keep only speeches longer than 50 characters so that we don't pick up partial speeches. We also strip all carriage-return and newline characters from the speeches. Finally, we convert the dates to datetime values.
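
The four cleaning steps can be sketched as a single pandas function. The column names `Speech` and `Date` are assumptions about the CSV header, and replacing line breaks with a space (rather than deleting them outright) is a choice made here so that words on adjacent lines do not run together.

```python
import pandas as pd


def clean_speeches(df: pd.DataFrame, min_chars: int = 50) -> pd.DataFrame:
    """Apply the four cleaning steps described above and return a new frame."""
    # 1. Drop every row that has a missing value in any column.
    df = df.dropna()
    # 2. Keep only speeches longer than min_chars characters.
    df = df[df["Speech"].str.len() > min_chars].copy()
    # 3. Replace runs of carriage returns and newlines with a single space.
    df["Speech"] = df["Speech"].str.replace(r"[\r\n]+", " ", regex=True)
    # 4. Parse the date strings into pandas datetime values.
    df["Date"] = pd.to_datetime(df["Date"])
    return df.reset_index(drop=True)
```

After reading the CSV with `pd.read_csv`, a call such as `df = clean_speeches(df)` yields a frame with a fresh index and only complete, full-length speeches.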


Establish a Vector Database and Schema

With all of our data cleaning done, it's time to set up our vector database, Milvus Lite. We start by declaring some constants, then start a server and establish a connection to it.
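
A sketch of the constants and the server start-up, assuming the `milvus` package's embedded `default_server`. `COLLECTION_NAME` and `BATCH_SIZE` are hypothetical values chosen for illustration; `DIMENSION` is 384 because that is the output size of MiniLM L6 v2 embeddings, and `TOPK` is 3 to match the search performed at the end.

```python
COLLECTION_NAME = "white_house_2021_2022"  # hypothetical collection name
DIMENSION = 384    # embedding size produced by MiniLM L6 v2
BATCH_SIZE = 128   # assumed number of speeches embedded per insert call
TOPK = 3           # number of results returned per query


def start_milvus_lite():
    """Start the embedded Milvus Lite server and connect pymilvus to it."""
    # Imported lazily so the constants above can be used without the packages.
    from milvus import default_server
    from pymilvus import connections

    default_server.start()
    # The embedded server picks its own port, exposed as listen_port.
    connections.connect(host="127.0.0.1", port=default_server.listen_port)
    return default_server
```

Keeping the returned server object around makes it easy to stop the server once the notebook is finished.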


To make sure we start from a blank slate, we check whether a collection with our chosen name already exists and drop it if it does.
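
This check can be sketched with pymilvus's `utility` helpers. The function name `reset_collection` is hypothetical; it returns whether anything was dropped so the caller can log it.

```python
def reset_collection(name, alias="default"):
    """Drop the named collection if it exists; return True if it was dropped."""
    from pymilvus import utility  # requires an open connection under `alias`

    if utility.has_collection(name, using=alias):
        utility.drop_collection(name, using=alias)
        return True
    return False
```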


Now we establish our schema. This dataset has four attributes to work with: the title of the speech, the date it was given, the location where it was given, and the speech itself. Because we want to run a semantic search over the content of the speech, the schema contains the title, the date, the location, and a vector embedding of the speech text.

For each VARCHAR field (string format) we specify a maximum length. No value in this dataset reaches these limits; they serve as rough upper bounds.
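
A schema along these lines could look like the sketch below. The field names, the specific maximum lengths, and the auto-generated `id` primary key are assumptions: Milvus requires a primary key, so one is added here alongside the four fields described above.

```python
# Assumed upper bounds for the string fields; adjust to your data.
VARCHAR_MAX_LENGTHS = {"title": 500, "date": 100, "location": 200}


def build_collection(name, dim, alias="default"):
    """Create a collection with title, date, location, and embedding fields."""
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

    fields = [
        # Auto-generated primary key, so inserts only supply the other fields.
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    ]
    for field_name, max_length in VARCHAR_MAX_LENGTHS.items():
        fields.append(
            FieldSchema(name=field_name, dtype=DataType.VARCHAR, max_length=max_length)
        )
    # The speech text itself is stored only as its vector embedding.
    fields.append(FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim))

    schema = CollectionSchema(fields=fields)
    return Collection(name=name, schema=schema, using=alias)
```

Because the date is stored in a VARCHAR field in this sketch, datetime values need to be converted to strings before insertion.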


With the vector database server running and the collection and schema in place, the last step before inserting vectors is to define our vector index. For this example, we use an IVF_FLAT index with the L2 distance metric and 128 clusters (nlist = 128).
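
The index described above can be expressed as a parameter dictionary passed to `Collection.create_index`. The field name `embedding` is an assumption about the schema, and the `load()` call is included because a collection must be loaded into memory before it can be searched.

```python
INDEX_PARAMS = {
    "index_type": "IVF_FLAT",   # inverted file index with uncompressed vectors
    "metric_type": "L2",        # Euclidean distance
    "params": {"nlist": 128},   # number of clusters the vectors are split into
}


def build_index(collection, field_name="embedding"):
    """Create the vector index on the embedding field and load the collection."""
    collection.create_index(field_name=field_name, index_params=INDEX_PARAMS)
    collection.load()
```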


Get Vector Embeddings and Populate the Database

Here we use the sentence-transformers library to generate vector embeddings for the speeches, then populate our Milvus instance with them. For this example, we use the MiniLM L6 v2 model to produce each embedding.
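
Loading the model might look like the following sketch. The checkpoint name `all-MiniLM-L6-v2` is an assumption about which MiniLM L6 v2 variant is meant; the dimension check guards against a mismatch between the model's output size and the vector field in the schema.

```python
MODEL_NAME = "all-MiniLM-L6-v2"  # assumed checkpoint for "MiniLM L6 v2"
EXPECTED_DIMENSION = 384         # must match the embedding field's dim


def load_model(name=MODEL_NAME):
    """Load the sentence-transformers model and verify its embedding size."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(name)
    dimension = model.get_sentence_embedding_dimension()
    if dimension != EXPECTED_DIMENSION:
        raise ValueError(f"model produces {dimension}-d vectors, expected {EXPECTED_DIMENSION}")
    return model
```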


We create a function, embed_insert, that gets the embeddings for a batch of speeches, and then inserts that batch into our Milvus instance.
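
A minimal sketch of such a helper, assuming each batch is a list of four parallel lists in the order title, date, location, speech, and that the collection's primary key is auto-generated. Here the model and collection are passed in explicitly rather than read from notebook globals, which is a design choice of this sketch.

```python
def embed_insert(batch, model, collection):
    """Embed the speeches in one batch and insert the batch into Milvus.

    batch is [titles, dates, locations, speeches], four lists of equal length.
    """
    titles, dates, locations, speeches = batch
    # encode() returns one embedding per speech, in the same order.
    embeddings = model.encode(speeches)
    # Column-oriented insert: metadata columns plus the embeddings, no raw text.
    collection.insert([titles, dates, locations, list(embeddings)])
```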


With our helper function written, we are ready to embed and insert the text. First, we convert our pandas dataframe into the format needed for insertion: a list of four lists, holding the titles, dates, locations, and speeches respectively. We split these lists into batches and call the embed_insert function we wrote above on each batch. Finally, once all of the data has been inserted, we flush the collection so that every pending row is sealed and persisted and can be covered by the index.
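
The conversion, batching, and flush can be sketched as below. The column names `Title`, `Date`, `Location`, and `Speech` are assumptions about the dataframe, the default batch size is illustrative, and `insert_batch` stands for any callable that takes one batch, such as a small wrapper around embed_insert.

```python
def frame_to_columns(df):
    """Turn the cleaned dataframe into [titles, dates, locations, speeches]."""
    return [
        df["Title"].tolist(),
        df["Date"].astype(str).tolist(),  # stored as text in a VARCHAR field
        df["Location"].tolist(),
        df["Speech"].tolist(),
    ]


def batched_columns(columns, batch_size=128):
    """Yield column-wise slices of at most batch_size rows each."""
    total = len(columns[0])
    for start in range(0, total, batch_size):
        yield [column[start:start + batch_size] for column in columns]


def populate(columns, insert_batch, collection, batch_size=128):
    """Insert every batch, then flush so all rows are sealed and persisted."""
    for batch in batched_columns(columns, batch_size):
        insert_batch(batch)
    collection.flush()
```

A call such as `populate(frame_to_columns(df), lambda b: embed_insert(b, model, collection), collection)` ties this together with the helper above.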


Run a Semantic Search

With the database populated, it's now possible to search all of the speeches based on their content. In this example, we search for a speech where the President speaks about renewable energy at NREL, and a speech where the Vice President and the Prime Minister of Canada both speak. We get the embeddings for these descriptions, and then search our vector database for the 3 speeches with the closest embeddings.

We expect the first description to have the speech titled "Remarks by President Biden During a Tour of the National Renewable Energy Laboratory" in its results and the second description to have the speech titled "REMARKS BY VICE PRESIDENT HARRIS AND PRIME MINISTER TRUDEAU OF CANADA BEFORE BILATERAL MEETING" in its results.
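
The search step can be sketched as follows. The field names `embedding` and `title` and the `nprobe` value are assumptions; `nprobe` sets how many of the IVF clusters are scanned per query, trading speed for recall.

```python
SEARCH_PARAMS = {"metric_type": "L2", "params": {"nprobe": 10}}  # nprobe is assumed


def search_speeches(descriptions, model, collection, top_k=3):
    """Return, for each description, the titles of the top_k closest speeches."""
    # Embed the query descriptions with the same model used for the speeches.
    query_vectors = model.encode(descriptions)
    results = collection.search(
        data=list(query_vectors),
        anns_field="embedding",
        param=SEARCH_PARAMS,
        limit=top_k,
        output_fields=["title"],
    )
    # One list of hits per query, ordered from closest to farthest.
    return [[hit.entity.get("title") for hit in hits] for hits in results]
```

Passing the two descriptions above as a list of strings returns two lists of three titles each, which can be checked against the expected speeches.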


Finally, we clean up the server.
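
A possible teardown, assuming the embedded server object from the `milvus` package is still available. Dropping the collection and disconnecting before stopping are choices made in this sketch; `cleanup()` removes the data the embedded server stored on disk.

```python
def shut_down(collection_name, server, alias="default"):
    """Drop the collection, close the connection, and stop the embedded server."""
    from pymilvus import connections, utility

    utility.drop_collection(collection_name, using=alias)
    connections.disconnect(alias)
    server.stop()
    server.cleanup()  # delete the embedded server's on-disk data
```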
