TwelveLabs Weaviate RAG Colab
Twelve Labs Video RAG with Weaviate
Set Up Our Environment
Install Dependencies
Set Up Twelve Labs and Weaviate SDKs
Setting Up Our Video Data
Some of our videos are too low resolution to use in the embedding engine, so we will double their resolution with upscale_video.
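As a rough illustration of the upscaling step, the sketch below builds an ffmpeg command that doubles a video's width and height. It is not the notebook's upscale_video implementation: the helper names build_upscale_command and upscaled_path, the "_upscaled" suffix, and the choice of ffmpeg with the lanczos scaler are all assumptions made for this example.

```python
from pathlib import Path


def build_upscale_command(src: str, dst: str, factor: int = 2) -> list[str]:
    """Return an ffmpeg argument list that scales width and height by `factor`.

    Hypothetical helper: the notebook's own upscale_video may work differently.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    # iw and ih are ffmpeg's names for the input width and height.
    scale_filter = f"scale=iw*{factor}:ih*{factor}:flags=lanczos"
    # -y overwrites an existing output file; the audio stream is copied unchanged.
    return ["ffmpeg", "-y", "-i", src, "-vf", scale_filter, "-c:a", "copy", dst]


def upscaled_path(src: str, suffix: str = "_upscaled") -> str:
    """Derive an output path next to the source, e.g. clip.mp4 -> clip_upscaled.mp4."""
    p = Path(src)
    return str(p.with_name(p.stem + suffix + p.suffix))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg is installed.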
read_video_pyav comes directly from the LLaVa-NeXT-Video Colab notebook, and it formats videos in the correct numpy representation for inference.
Here we upscale all of our videos.
Compare Pegasus and LLaVA-NeXT-Video on a Single Video
We will start by comparing Pegasus and LLaVA-NeXT-Video on generating insights from a single video.
Using Pegasus to Chat with our Video
To chat with our video, we first need to have Pegasus index it.
We will create an index named sports_videos and then upload our video to this index to be indexed before chatting with it. We only need to do this once per video.
In more complex workflows with multiple videos, all of the uploads can be done well ahead of time to reduce overhead and speed up the end-to-end workflow.
First we create the index.
Then we create a function to upload our video to be indexed.
Next, we'll upload our video.
Finally we'll query it.
Here is Pegasus' response:
The video showcases a pivotal moment in a football game between the New York Giants and the New England Patriots. Eli Manning, the Giants' quarterback, throws a pass that David Tyree catches spectacularly by pinning the ball against his helmet as he falls out of bounds. Multiple angles replay the catch, emphasizing its difficulty and precision. Tyree briefly celebrates after the play, and the video ends with him and other players walking off the field.
From the above response, we can see that Pegasus 1.2 responds coherently to the question. Now, let's see whether we can get a similar response from the open-source model.
Using LLaVa-NeXT-Video to Chat with our Video
For the open-source model, we need to set up video sampling for the model to consume, load the model from the Hugging Face Hub, format the input for inference, and then run the model on our inputs. We will modify the LLaVa-NeXT-Video sampling code to take a uniform sample of 40 frames from each video.
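The uniform sampling described above can be sketched as picking evenly spaced frame indices. This is a minimal illustration in plain Python, not the notebook's modified sampling code; the function name uniform_frame_indices is an assumption, and the resulting indices would then be passed to a frame reader such as read_video_pyav.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 40) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a video.

    Hypothetical helper. If the video has fewer frames than requested,
    every frame index is returned instead.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if total_frames <= num_samples:
        return list(range(total_frames))
    # A fractional step keeps the samples spread over the whole video.
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]
```

For a 400-frame video this selects every tenth frame, starting at frame 0.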
Here we'll set up our LLaVa-NeXT-Video model.
Next, we'll create a function to query our model.
Output:
Here is LLaVa-NeXT-Video's output:
What is happening in this video? Be concise ASSISTANT: The video shows a football game in progress, with various players on the field. It appears to be the Super Bowl III between the New York Giants and the New England Patriots, judging by the jersey numbers and the old-fashioned helmets worn by some players. One player is in mid-action, grabbing the ball and getting tackled by another player, while a referee is signaling a first down. There are also coaches and other game
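While this model does recognize that there is a football game happening between the Giants and the Patriots, it tends to hallucinate other facts. For example, it labels the game Super Bowl III, but the Giants and Patriots met in Super Bowl XLII.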
RAG for Segment-Level Queries on a Single Video
We see that Pegasus is the clear winner on time and accuracy for this query when querying the entire video.
The open-source model would likely perform better if we could restrict the video in question to a smaller segment. We can do this by writing queries that only need a subset of the video and using RAG to retrieve the relevant subset.
This is where the Marengo model will come in. We can use it to create embeddings for each segment of the video, and then use RAG to get the most relevant segment based on our queries.
We will start by creating embeddings for each segment of the video.
Using Marengo to Create Full Video and Video Clip Embeddings
Marengo allows us to retrieve embeddings for the entire video and for clips at a set clip length all in one call.
We'll save the task ID for use later when uploading our embeddings to Weaviate.
Prepare Video Segments for RAG
Next, we will split this video up into segments that mirror the timestamps for each embedding. This lets us later submit only the relevant video chunk to our model for a RAG use case.
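To make the segmenting step concrete, here is a small sketch that computes (start, end) offsets in seconds for fixed-length clips. The function name segment_bounds is an assumption, and in practice the offsets returned with each embedding are the safer source of truth, since the segments must line up with them exactly.

```python
def segment_bounds(duration_sec: float, clip_length_sec: float) -> list[tuple[float, float]]:
    """Split [0, duration_sec] into consecutive clips of clip_length_sec.

    Hypothetical helper. The final clip is shortened so it ends at the
    video's duration.
    """
    if clip_length_sec <= 0:
        raise ValueError("clip_length_sec must be positive")
    duration = float(duration_sec)
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + clip_length_sec, duration)
        bounds.append((start, end))
        start = end
    return bounds
```

Each pair can then drive whatever tool cuts the video, so that one clip file exists per embedding.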
Next, we'll upload the video segments to Pegasus to get their video IDs. We will upload these to Weaviate along with the embeddings, so we can easily chat with the returned video. This is a great way to speed up results when you have videos that users will chat with.
Here we'll create and populate a dictionary mapping file names to Pegasus video IDs.
We'll also add the video ID for the full video that we retrieved earlier.
We'll also sample all of our videos for use with the LLaVa-NeXT-Video model.
Uploading Embeddings to Weaviate
Now we'll create a function to prepare our data to be uploaded to Weaviate.
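One possible shape for such a preparation function is sketched below: it pairs each clip embedding with its metadata and the matching Pegasus video ID. Everything here is an assumption for illustration, including the function name prepare_objects, the property names (video_name, start_time, end_time, pegasus_video_id), the clip file naming pattern, and the input dictionary keys. The notebook's real schema may differ.

```python
def prepare_objects(video_name: str, embeddings: list[dict], video_ids: dict) -> list[dict]:
    """Build one record per clip embedding, ready to hand to a vector database.

    Hypothetical helper. `embeddings` holds dicts with 'start', 'end', and
    'vector' keys; `video_ids` maps clip file names to Pegasus video IDs.
    """
    objects = []
    for emb in embeddings:
        # Assumed naming pattern for the clip files created when segmenting.
        clip_file = f"{video_name}_{int(emb['start'])}_{int(emb['end'])}.mp4"
        objects.append(
            {
                "properties": {
                    "video_name": video_name,
                    "start_time": emb["start"],
                    "end_time": emb["end"],
                    # None when the clip has not been uploaded to Pegasus.
                    "pegasus_video_id": video_ids.get(clip_file),
                },
                "vector": emb["vector"],
            }
        )
    return objects
```

Keeping the properties and the vector side by side in each record makes the later upload a simple loop over the list.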
Now, we'll upload the data to our collection.
Testing the Vector Search
Now that we have everything in the collection, we can test and see that it properly returns the correct sample 'video_name':5.0
Querying our Vector Database with Text Embeddings
To query the database, we first embed our text query with Marengo's text embedding feature. We then query the Weaviate database for the clip embedding that best matches our question embedding, and use the Pegasus video ID stored with that clip to ask our question about it.
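Conceptually, the retrieval step finds the stored clip vector closest to the query vector. The sketch below shows that idea as a brute-force cosine similarity search in plain Python. It is only an illustration: Weaviate performs this search with its own vector index, and the names cosine_similarity and best_match, as well as the record layout, are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query_vector: list[float], records: list[dict]) -> dict:
    """Return the record whose 'vector' is most similar to the query vector.

    Hypothetical helper: each record is a dict with a 'vector' key plus metadata.
    """
    return max(records, key=lambda r: cosine_similarity(query_vector, r["vector"]))
```

The returned record would carry the Pegasus video ID used for the follow-up question.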
Chatting with our Video Segment: Pegasus vs LLaVa-NeXT-Video
Pegasus:
LLaVa-NeXT-Video
Multi Video RAG with Marengo, Weaviate, and Pegasus
Now that we know how Marengo embeddings perform on individual clips from a single video, we will show how to use embeddings across multiple videos for a more realistic RAG use case.
Get Marengo Embeddings for All Videos
Split our Remaining Videos into Segments
Get Pegasus Video IDs for All Videos and their Segments
Finally, we will upload the full videos and their segments to Pegasus so we can chat with them. We will parallelize this task to speed it up.
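Because each upload mostly waits on the network, a thread pool is one reasonable way to parallelize it. The sketch below is a generic pattern rather than the notebook's code: upload_all is a hypothetical name, and upload_fn stands in for whatever function uploads a single file and returns its video ID.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(paths: list[str], upload_fn, max_workers: int = 4) -> dict:
    """Run upload_fn over all paths concurrently and map each path to its result.

    Hypothetical helper: upload_fn(path) is assumed to return a video ID.
    """
    video_ids = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which path each future belongs to.
        futures = {pool.submit(upload_fn, path): path for path in paths}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside upload_fn.
            video_ids[futures[future]] = future.result()
    return video_ids
```

The max_workers value is worth tuning against any rate limits the upload service enforces.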
Upload Data to Weaviate
First we'll prepare our data to be uploaded.
Then, we will upload it to our collection.
RAG Questions
We now have Marengo embeddings and Pegasus video IDs uploaded to Weaviate.
We can assess the performance of running queries on the clips and the full video in terms of answer accuracy and speed.
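For the speed half of that comparison, a small timing wrapper is enough. This is a minimal sketch; the name timed is an assumption, and the function being timed can be any query function.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds).

    Hypothetical helper for comparing clip-level and full-video queries.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Running the same question through both the clip and the full video with this wrapper gives directly comparable wall-clock times, while answer accuracy still needs to be judged by reading the responses.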
Multi Video RAG with Pegasus
Multi Video RAG with LLaVa-NeXT-Video
Now we can run our model on the full video, which outputs some more interesting answers.
First we'll sample the rest of our video segments.