Accelerating Text-to-SQL Inference on Vanna with NVIDIA NIM

This notebook demonstrates how to optimize Vanna’s open-source text-to-SQL pipeline using NVIDIA NIM and NeMo Retriever for faster and more efficient analytics.

1. Prerequisites

Make sure to install the following:

  • Python 3.8+
  • vanna, langchain, langchain-nvidia-ai-endpoints, openai, pymilvus, pandas, kagglehub, numpy
  • Access to NVIDIA NIM endpoints
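A minimal install cell, assuming a fresh notebook environment (pin versions for reproducibility):

```python
# Install the packages listed above
%pip install vanna langchain langchain-nvidia-ai-endpoints openai pymilvus pandas kagglehub numpy
```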

2. Data Preparation

We use two Steam datasets from Kaggle: the Steam Games Dataset and Games on Steam. We follow the preprocessing steps described in the excellent Kaggle notebook. After preprocessing, three CSVs will be generated:

  • tableau_games.csv
  • tableau_categories.csv
  • tableau_tags.csv

You can follow the steps outlined below, or directly use the processed data in the processed_dataset folder and jump to Step 3.

2.1 Download Steam Datasets
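A sketch using kagglehub. The first slug is the public fronkongames/steam-games-dataset; the second slug is a placeholder, so substitute the actual "Games on Steam" dataset you are using:

```python
import kagglehub

# Download both datasets to the local kagglehub cache and keep their paths.
games_path = kagglehub.dataset_download("fronkongames/steam-games-dataset")
duration_path = kagglehub.dataset_download("<owner>/<games-on-steam>")  # placeholder slug
print(games_path, duration_path)
```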


2.2 Preprocess first dataset
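A sketch that loads the main table of the first dataset; the file name games.csv and the lowercase column names are assumptions about this dataset's layout:

```python
import os
import pandas as pd

# Load the raw games table and check how many rows we are starting from.
df = pd.read_csv(os.path.join(games_path, "games.csv"))
print(df.shape)
df.head()
```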


Great! We've got the data we need, but it seems we have more rows than there are games on Steam.

Some games just seem to be developer tests; let's remove them. We'll also remove games with no reviews or no categories.
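A sketch of these filters, assuming columns named name, positive, negative, and categories (adjust to your preprocessed column names):

```python
# Drop unnamed entries (often developer tests), games with zero reviews,
# and games with no categories.
df = df[df["name"].notna() & (df["name"].str.strip() != "")]
df = df[(df["positive"] + df["negative"]) > 0]
df = df[df["categories"].notna()]
print(df.shape)
```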


To keep things simple, we will also remove games released before 2013, since there are very few games more than 10 years old on Steam.
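A sketch, assuming a release_date column pandas can parse:

```python
# Parse release dates (unparseable values become NaT and are dropped)
# and keep only titles released in 2013 or later.
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
df = df[df["release_date"].dt.year >= 2013]
```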


Next, split the 'estimated_owners' column into two numeric columns holding the lower and upper bound of the estimate.
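In this dataset the owner estimate is stored as a range string such as "0 - 20000" (an assumption worth verifying); a sketch that splits it into numeric bounds:

```python
# Split the "min - max" range string into two integer columns.
owners = df["estimated_owners"].str.split(" - ", expand=True)
df["owners_min"] = owners[0].astype(int)
df["owners_max"] = owners[1].astype(int)
df = df.drop(columns=["estimated_owners"])
```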


Delete the outlier game priced above $800.
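A one-line filter, assuming a numeric price column:

```python
# Remove the single extreme price outlier.
df = df[df["price"] <= 800]
```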


2.3 Merge the two datasets

The second dataset we'll use contains information about game duration. Let's read it in.
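A sketch; the CSV file name is a placeholder for whichever file ships with the second dataset:

```python
# Load the second dataset, which includes HowLongToBeat-style durations.
df2 = pd.read_csv(os.path.join(duration_path, "<games_on_steam>.csv"))  # placeholder file name
df2.head()
```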


We're only interested in the 'hltb_single' column, which contains the game-length information we need. Each Steam game has a unique identifier we can use to join the two datasets: it appears in the 'app_id' column of the first dataset and in the 'sid' column of the second. Since 'app_id' is currently stored as an object, we first convert it to integer. Let's join the data and see the result.
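A sketch of the cast and the join:

```python
# 'app_id' is stored as an object; cast it so it matches the integer 'sid'.
df["app_id"] = df["app_id"].astype(int)

# Left-join the single-player duration onto the main table.
df = df.merge(
    df2[["sid", "hltb_single"]],
    left_on="app_id",
    right_on="sid",
    how="left",
).drop(columns=["sid"])
df.head()
```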


Some games are extreme outliers in terms of duration. This is not because these games are extremely long, but because some games can be played indefinitely and very few users have reported a length for them. Since this might distort our analysis, we'll cap game duration at 100 hours, a reasonable upper limit for most games.
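A sketch using pandas' clip:

```python
# Cap the reported single-player duration at 100 hours.
df["hltb_single"] = df["hltb_single"].clip(upper=100)
```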


2.4 Normalizing data

The DataFrame contains fields such as 'categories' and 'tags' that consist of lists of values. To normalize the data for storage in a SQL database, we need to break these fields into separate tables. Each table will maintain a relationship with the main table through the 'app_id' foreign key, following standard database normalization practices.
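A sketch, assuming the list columns are stored as comma-separated strings (adjust the split if they are already Python lists):

```python
def to_long(frame, col, new_col):
    """Explode a comma-separated column into a long-format table keyed by app_id."""
    out = frame[["app_id", col]].copy()
    out[col] = out[col].str.split(",")
    out = out.explode(col).rename(columns={col: new_col})
    out[new_col] = out[new_col].str.strip()
    return out.dropna()

df_categories = to_long(df, "categories", "category")
df_tags = to_long(df, "tags", "tag")

# The list columns now live in their own tables, so drop them from the main one.
df = df.drop(columns=["categories", "tags"])
```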


The main DataFrame is ready. We will remove any categories and tags with fewer than 50 games, since they are not relevant enough.
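A sketch of the frequency filter:

```python
# Keep only categories and tags attached to at least 50 games.
cat_counts = df_categories["category"].value_counts()
df_categories = df_categories[df_categories["category"].isin(cat_counts[cat_counts >= 50].index)]

tag_counts = df_tags["tag"].value_counts()
df_tags = df_tags[df_tags["tag"].isin(tag_counts[tag_counts >= 50].index)]
```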


2.5 Save preprocessing results as CSV

Finally, we'll save the results as CSV files that we'll ingest into the SQL Database.
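A sketch that writes the three tables named at the top of this section:

```python
import os

# Write the normalized tables that Step 4 ingests into SQL.
os.makedirs("processed_dataset", exist_ok=True)
df.to_csv("processed_dataset/tableau_games.csv", index=False)
df_categories.to_csv("processed_dataset/tableau_categories.csv", index=False)
df_tags.to_csv("processed_dataset/tableau_tags.csv", index=False)
```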


3. Setting Up Vanna with NVIDIA NIM and NeMo Retriever

First, set your NVIDIA API key. If you don't have one, you can generate it from the NVIDIA API Catalog at build.nvidia.com.
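A sketch that reads the key from the environment and prompts for it otherwise:

```python
import os
import getpass

# Use NVIDIA_API_KEY from the environment, prompting if it is not set.
if not os.environ.get("NVIDIA_API_KEY"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("NVIDIA API key: ")
```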


Define the Vanna class using Milvus and OpenAI
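Vanna composes a vector store and an LLM interface through multiple inheritance. A sketch following that pattern with Vanna's Milvus store and its OpenAI-compatible chat class (the llm_client argument is our addition, passed through to OpenAI_Chat):

```python
from vanna.milvus import Milvus_VectorStore
from vanna.openai import OpenAI_Chat

class VannaMilvus(Milvus_VectorStore, OpenAI_Chat):
    """Vanna backed by Milvus for retrieval and an OpenAI-compatible LLM."""
    def __init__(self, llm_client, config=None):
        Milvus_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, client=llm_client, config=config)
```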


Create the NIM client based on the OpenAI wrapper
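NIM exposes an OpenAI-compatible API, so the standard openai client can simply be pointed at the NVIDIA endpoint. The model name below is one example of a NIM-hosted model, not prescribed by this notebook:

```python
from openai import OpenAI

LLM_MODEL = "meta/llama-3.1-70b-instruct"  # example NIM-hosted model

# Point the stock OpenAI client at the NVIDIA NIM endpoint.
llm_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)
```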


Create the NVIDIA embedder (LangChain)
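A sketch using the langchain-nvidia-ai-endpoints integration; the embedding model name is an assumption, so substitute whichever NeMo Retriever embedding NIM you have access to:

```python
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# NeMo Retriever text embedding NIM, via the LangChain integration.
embedder = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5", truncate="END")
```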


Define the Vector DB Client

To keep things simple, we will use a local Milvus vector DB.
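A sketch using Milvus Lite, which runs in-process and persists to a local file:

```python
from pymilvus import MilvusClient

# A file-backed Milvus Lite instance; no separate server needed.
milvus_client = MilvusClient(uri="./milvus_vanna.db")
```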


Create a Vanna instance using the LLM, embedder, and vector DB defined above
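A sketch wiring everything together. The adapter below, and the config keys milvus_client, embedding_function, and n_results, reflect how Vanna's Milvus store consumes a pymilvus-style embedding function as we understand it; verify against your Vanna version:

```python
class VannaEmbeddingWrapper:
    """Adapt LangChain's embed_documents/embed_query to the
    encode_documents/encode_queries interface Vanna's Milvus store expects."""
    def __init__(self, embedder):
        self.embedder = embedder

    def encode_documents(self, texts):
        return self.embedder.embed_documents(list(texts))

    def encode_queries(self, texts):
        return [self.embedder.embed_query(t) for t in texts]

vn = VannaMilvus(
    llm_client,
    config={
        "model": LLM_MODEL,
        "milvus_client": milvus_client,
        "embedding_function": VannaEmbeddingWrapper(embedder),
        "n_results": 2,  # how many similar items to retrieve per query
    },
)
```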


4. Ingest processed Steam data into a SQL DB
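A sketch that loads the three processed CSVs into a local SQLite database (the file name steam_games.db is our choice):

```python
import sqlite3
import pandas as pd

# Create the SQLite DB and load one table per CSV.
conn = sqlite3.connect("steam_games.db")
for table in ["tableau_games", "tableau_categories", "tableau_tags"]:
    pd.read_csv(f"processed_dataset/{table}.csv").to_sql(
        table, conn, if_exists="replace", index=False
    )
conn.close()
```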


Great, the data is ingested into the SQL DB. Now, let's connect the SQL DB to Vanna and start giving it more context on our data.
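Vanna ships a SQLite connector we can point at that file:

```python
# Let Vanna run SQL against the database we just built.
vn.connect_to_sqlite("steam_games.db")
```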


5. Let's train Vanna on our dataset
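A sketch of a minimal training pass: feed Vanna the DDL of every table, plus a bit of free-form documentation it can retrieve at question time (the documentation string is illustrative):

```python
# Train on the schema: every table's CREATE statement.
ddl_df = vn.run_sql("SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL")
for ddl in ddl_df["sql"]:
    vn.train(ddl=ddl)

# Add business context the LLM would not know from the schema alone.
vn.train(documentation=(
    "hltb_single is the estimated single-player completion time in hours, "
    "capped at 100. owners_min and owners_max bound the estimated owner count."
))
```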


Let's take a look at the training data to verify it.
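vn.get_training_data() returns everything stored so far as a DataFrame:

```python
# Inspect the DDL, documentation, and question/SQL pairs Vanna has stored.
training_data = vn.get_training_data()
training_data
```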


6. Let's ask questions
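The questions below are illustrative; any natural-language question over these tables works. vn.ask retrieves the most relevant training items, asks the NIM-hosted LLM for SQL, executes it, and returns the result:

```python
# Vanna generates the SQL, runs it, and returns the result.
vn.ask("What are the 10 most positively reviewed games released after 2015?")
vn.ask("Which categories have the highest average single-player duration?")
```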
