Multimodal RAG With Nvidia Investor Slide Deck


MultiModal Document RAG with ColQwen2 and Llama 3.2 90B Vision

Open In Colab

Hardware Requirements

To make the notebook run faster, change the runtime type to a T4 GPU: Runtime -> Change runtime type -> T4 GPU

Introduction

In this notebook we will see how to use multimodal RAG to chat with Nvidia's investor slide deck from last year. The slide deck is 39 pages long, with a combination of text, visuals, tables, charts and annotations. The document structure and templates vary from page to page, which makes it quite difficult to RAG over using traditional methods.

We will be using a new multimodal approach!

MultiModal RAG Workflow

ColPali is a new multimodal retrieval system that enables seamless retrieval over document images.

By directly encoding image patches, it eliminates the need for optical character recognition (OCR) or image captioning to extract text from PDFs.

We will use byaldi, a library from AnswerAI that makes it easy to work with ColQwen2, an upgraded version of ColPali, to embed and retrieve images of our PDF pages.

Retrieved pages will then be passed to the Llama 3.2 90B Vision model, served via a Together AI inference endpoint, to answer questions.

For a deeper explanation of how ColPali and the new Llama 3.2 Vision models work, check out the blog post connected to this notebook.
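The end-to-end workflow can be sketched with byaldi's `RAGMultiModalModel` interface. This is a minimal sketch based on byaldi's documented API, not code from this notebook; the `vidore/colqwen2-v0.1` checkpoint name is an assumption, and the block is guarded so it only runs when byaldi is installed.

```python
# Sketch of the full retrieval workflow (assumes byaldi is installed and the
# "vidore/colqwen2-v0.1" checkpoint is available -- both are assumptions).
try:
    from byaldi import RAGMultiModalModel
except ImportError:
    RAGMultiModalModel = None  # byaldi not installed; skip the demo
    results = []

if RAGMultiModalModel is not None:
    # Load ColQwen2 through byaldi's unified interface
    model = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v0.1")

    # Embed every page of the PDF as an image and build a searchable index
    model.index(
        input_path="ndr_presentation_oct_2023_final.pdf",
        index_name="nvidia_index",
        overwrite=True,
    )

    # Retrieve the top-5 most relevant pages for a query
    results = model.search("What is Nvidia's data center revenue?", k=5)
    for r in results:
        print(r.doc_id, r.page_num, r.score)
```

The sections below walk through these steps one cell at a time.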

Install relevant libraries

[ ]
[ ]
[ ]

Initialize the ColPali Model

[ ]
Verbosity is set to 1 (active). Pass verbose=0 to make quieter.
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46

The document we will be retrieving from is a 39-page Nvidia investor presentation from 2023: Investor Presentation October 2023

[ ]
--2024-10-04 14:34:23--  https://s201.q4cdn.com/141608511/files/doc_presentations/2023/Oct/01/ndr_presentation_oct_2023_final.pdf
Resolving s201.q4cdn.com (s201.q4cdn.com)... 68.70.205.3, 68.70.205.4, 68.70.205.1, ...
Connecting to s201.q4cdn.com (s201.q4cdn.com)|68.70.205.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8609256 (8.2M) [application/pdf]
Saving to: ‘ndr_presentation_oct_2023_final.pdf’

ndr_presentation_oc 100%[===================>]   8.21M  24.4MB/s    in 0.3s    

2024-10-04 14:34:24 (24.4 MB/s) - ‘ndr_presentation_oct_2023_final.pdf’ saved [8609256/8609256]

Let's create the index that will store the embeddings of the page images.

Caution: This cell below takes ~5 mins to index the whole PDF!

[ ]
Added page 1 of document 0 to index.
Added page 2 of document 0 to index.
Added page 3 of document 0 to index.
Added page 4 of document 0 to index.
Added page 5 of document 0 to index.
Added page 6 of document 0 to index.
Added page 7 of document 0 to index.
Added page 8 of document 0 to index.
Added page 9 of document 0 to index.
Added page 10 of document 0 to index.
Added page 11 of document 0 to index.
Added page 12 of document 0 to index.
Added page 13 of document 0 to index.
Added page 14 of document 0 to index.
Added page 15 of document 0 to index.
Added page 16 of document 0 to index.
Added page 17 of document 0 to index.
Added page 18 of document 0 to index.
Added page 19 of document 0 to index.
Added page 20 of document 0 to index.
Added page 21 of document 0 to index.
Added page 22 of document 0 to index.
Added page 23 of document 0 to index.
Added page 24 of document 0 to index.
Added page 25 of document 0 to index.
Added page 26 of document 0 to index.
Added page 27 of document 0 to index.
Added page 28 of document 0 to index.
Added page 29 of document 0 to index.
Added page 30 of document 0 to index.
Added page 31 of document 0 to index.
Added page 32 of document 0 to index.
Added page 33 of document 0 to index.
Added page 34 of document 0 to index.
Added page 35 of document 0 to index.
Added page 36 of document 0 to index.
Added page 37 of document 0 to index.
Added page 38 of document 0 to index.
Added page 39 of document 0 to index.
Index exported to .byaldi/nvidia_index
{0: '/content/nvidia_presentation.pdf'}

This concludes the PDF indexing phase - everything below happens at query time.

Let's query our indexed document.

The important thing to note here is that the query asks for details found on page 25 of the PDF!

[ ]
Search results for 'What are the half year data centre renevue results and the 5 year CAGR for Nvidia data centre revenue?':
Doc ID: 0, Page: 25, Score: 25.875
Doc ID: 0, Page: 24, Score: 25.0
Doc ID: 0, Page: 28, Score: 23.75
Doc ID: 0, Page: 32, Score: 23.75
Doc ID: 0, Page: 31, Score: 23.75
Test completed successfully!

Notice that ColQwen2 retrieves the correct page with the highest similarity score!

How does this work? What happens under the hood between the page images and the query tokens?

The late-interaction scoring between page image patch representations and query text token representations is what enables this strong retrieval performance.

Typically, each image is resized and cut into patches of 16x16 pixels. Each patch is then embedded into a 128-dimensional vector; these vectors are stored and used to perform the MaxSim late-interaction operation between image patches and query tokens. ColPali is a multi-vector approach because it produces multiple vectors for each image or query - one vector per token - instead of a single vector for all tokens.
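The MaxSim late-interaction score can be illustrated with random embeddings. This is a toy sketch: real embeddings come from ColQwen2, and the patch and token counts below are made-up numbers that simply mirror the 128-dimensional multi-vector setup described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-vector embeddings: one 128-dim vector per image patch / query token
n_patches, n_query_tokens, dim = 1024, 12, 128
page_patches = rng.standard_normal((n_patches, dim))
query_tokens = rng.standard_normal((n_query_tokens, dim))

# Normalize so dot products become cosine similarities
page_patches /= np.linalg.norm(page_patches, axis=1, keepdims=True)
query_tokens /= np.linalg.norm(query_tokens, axis=1, keepdims=True)

# Late interaction: similarity of every query token to every image patch...
sim = query_tokens @ page_patches.T          # (n_query_tokens, n_patches)

# ...MaxSim: take each token's best-matching patch, then sum over tokens
score = sim.max(axis=1).sum()
print(f"page score: {score:.3f}")
```

Computing this score for every page in the index and ranking by it is what produces the result list shown above.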

The retrieval step takes about 185 ms.

[ ]
182 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let's now pass the retrieved page to the Llama 3.2 90B Vision model.

This model will read the question: "What are the half year data centre revenue results and the 5 year CAGR for Nvidia data centre revenue?"

And take in the retrieved page and produce an answer!

You can pass in a URL to the image of the retrieved page or a base64 encoded version of the image.
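Base64-encoding a page image needs only the standard library. A small sketch (the `page.png` filename and the helper names are illustrative; a real run would encode the actual retrieved page image):

```python
import base64

def image_to_base64(path: str) -> str:
    """Read an image file and return its base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def to_data_url(b64: str, mime: str = "image/png") -> str:
    """Wrap a base64 string in a data URL, as many vision APIs expect."""
    return f"data:{mime};base64,{b64}"

# Demo with a tiny placeholder file (not a valid image, just bytes to encode)
with open("page.png", "wb") as f:
    f.write(b"\x89PNG\r\n")
url = to_data_url(image_to_base64("page.png"))
print(url)
```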

[ ]
[ ]

We'll use a Together AI inference endpoint to access the Llama 3.2 90B Vision model.
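A sketch of the inference call, using the OpenAI-compatible chat-completions payload shape that Together's endpoint accepts. The model slug and message format are assumptions about Together's API, the base64 image payload is a placeholder, and the request is only sent when a `TOGETHER_API_KEY` environment variable is set:

```python
import json
import os
import urllib.request

# Build an OpenAI-compatible chat payload combining the question with the
# retrieved page image (the model slug below is an assumption).
payload = {
    "model": "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What are the half year data centre revenue results "
                         "and the 5 year CAGR for Nvidia data centre revenue?"},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,..."}},  # retrieved page
            ],
        }
    ],
    "max_tokens": 300,
}

api_key = os.environ.get("TOGETHER_API_KEY")
if api_key:  # only call the endpoint when a key is configured
    req = urllib.request.Request(
        "https://api.together.xyz/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["choices"][0]["message"]["content"]
        print(answer)
```

Together's Python SDK wraps the same request, so the payload structure carries over if you prefer the client library.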

[ ]
The half-year data center revenue results for Nvidia are $14,607 million. The 5-year CAGR for Nvidia's data center revenue is 51%.

Here we can see that the combination of ColQwen2 as an image retriever and Llama 3.2 90B Vision as the answer generator is a powerful duo for multimodal RAG applications, especially with PDFs.

Not only was ColQwen2 able to retrieve the correct page containing the right answer, but Llama 3.2 90B Vision was also able to find exactly where on the page the answer was, ignoring all the irrelevant details!

Voila!🎉🎉

Learn more about Llama 3.2 Vision in the docs here!