Vanna With NVIDIA
Accelerating Text-to-SQL Inference on Vanna with NVIDIA NIM
This notebook demonstrates how to optimize Vanna’s open-source text-to-SQL pipeline using NVIDIA NIM and NeMo Retriever for faster and more efficient analytics.
1. Prerequisites
Make sure to install the following:
- Python 3.8+
- Python packages: `vanna`, `langchain`, `milvus`, `openai`, `nvidia-langchain`, `pandas`, `kagglehub`, `numpy`
- Access to NVIDIA NIM endpoints
2. Data Preparation
We use two Steam datasets from Kaggle: the Steam Games Dataset and Games on Steam. We follow the preprocessing steps described in the excellent Kaggle notebook. After preprocessing, three CSVs will be generated:
- `tableau_games.csv`
- `tableau_categories.csv`
- `tableau_tags.csv`
You can follow the steps outlined below, or directly use the processed data in the `processed_dataset` folder and jump to Step 3.
2.1 Download Steam Datasets
2.2 Preprocess first dataset
Great! We've got the data we need, but it seems we have more rows than there are games on Steam.
Some games just seem to be developer tests. Let's remove them. We'll also remove games with no reviews or no categories.
To keep things simple, we will also remove games released before 2013, since there are very few games more than 10 years old on Steam:
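The filtering above can be sketched as follows. This is a minimal pandas example on a toy DataFrame; the real dataset's column names (`total_reviews`, `categories`, `release_date`) are assumptions and may differ from the Kaggle schema:

```python
import pandas as pd

# Toy stand-in for the Steam games DataFrame; real column names may differ.
games = pd.DataFrame({
    "app_id": [1, 2, 3, 4],
    "name": ["Game A", "test app", "Game C", "Old Game"],
    "total_reviews": [120, 0, 45, 300],
    "categories": [["Single-player"], [], ["Multi-player"], ["Single-player"]],
    "release_date": pd.to_datetime(
        ["2018-05-01", "2020-01-01", "2015-03-10", "2009-07-20"]
    ),
})

# Drop rows with no reviews or no categories.
games = games[(games["total_reviews"] > 0) & (games["categories"].str.len() > 0)]

# Keep only games released in 2013 or later.
games = games[games["release_date"].dt.year >= 2013]
```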
Next, split the 'estimated_owners' column into two separate columns.
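A sketch of this split, assuming 'estimated_owners' holds range strings like `"0 - 20000"` (the exact string format is an assumption based on the Kaggle dataset):

```python
import pandas as pd

# 'estimated_owners' comes as a range string, e.g. "0 - 20000" (format assumed).
df = pd.DataFrame({"estimated_owners": ["0 - 20000", "20000 - 50000"]})

# Split on the separator and cast both halves to integers.
owners = df["estimated_owners"].str.split(" - ", expand=True).astype(int)
df["owners_min"] = owners[0]
df["owners_max"] = owners[1]
```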
Delete the outlier game priced above $800:
2.3 Merge the two datasets
The second dataset we'll use contains information about game duration. Let's read it in:
We're only interested in the column 'hltb_single', that contains the information on game length we need. Each Steam game has a unique identifier we can use to join the data from both datasets. This unique identifier is found in the column 'app_id' of the first dataset, and in the column 'sid' of the second dataset. First we'll have to convert 'app_id' to integer since it is currently an object. Let's join the data and see the result:
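The dtype fix and join described above can be sketched like this (toy data; in the notebook the left table is the games DataFrame and the right one the duration dataset):

```python
import pandas as pd

# Left table: first dataset, where 'app_id' is stored as a string/object.
games = pd.DataFrame({"app_id": ["10", "20"], "name": ["Game A", "Game B"]})
# Right table: second dataset, keyed by 'sid'.
lengths = pd.DataFrame({"sid": [10, 30], "hltb_single": [12.5, 40.0]})

# Convert 'app_id' to integer so the join keys have matching dtypes.
games["app_id"] = games["app_id"].astype(int)

# Left join keeps every game, attaching 'hltb_single' where a match exists.
merged = games.merge(
    lengths[["sid", "hltb_single"]],
    left_on="app_id", right_on="sid", how="left",
)
```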
Some games are extreme outliers in terms of duration. This is not because these games are extremely long, but because some games can be played indefinitely and very few users have reported a game length for them. Since this might distort our analysis, we'll cap the maximum duration at 100 hours, a reasonable upper limit for most games:
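Capping is a one-liner with pandas' `clip`, sketched here on a toy series (the column name in the real data is 'hltb_single'):

```python
import pandas as pd

# Hours to complete; 3500 is an "endless game" outlier.
durations = pd.Series([5.0, 60.0, 3500.0])

# clip(upper=100) caps values at 100 while leaving smaller ones untouched.
capped = durations.clip(upper=100)
```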
2.4 Normalizing data
The DataFrame contains fields such as 'categories' and 'tags' that consist of lists of values. To normalize the data for storage in a SQL database, we need to break these fields into separate tables. Each table will maintain a relationship with the main table through the 'app_id' foreign key, following standard database normalization practices.
The main DataFrame is ready. We will remove any categories and tags with fewer than 50 games, since they are not relevant enough.
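Both steps, breaking a list column out into its own table keyed by 'app_id' and dropping rare values, can be sketched with pandas' `explode`. Toy data below; the 50-game threshold from the notebook is lowered to 2 so the effect is visible:

```python
import pandas as pd

games = pd.DataFrame({
    "app_id": [1, 2, 3],
    "tags": [["Indie", "Action"], ["Indie"], ["Action", "Rare"]],
})

# Explode the list column into one row per (app_id, tag): a separate,
# normalized table related to the main table via the 'app_id' foreign key.
tags = games[["app_id", "tags"]].explode("tags").rename(columns={"tags": "tag"})

# Keep only tags attached to at least MIN_GAMES games (50 in the notebook).
MIN_GAMES = 2
counts = tags["tag"].value_counts()
tags = tags[tags["tag"].isin(counts[counts >= MIN_GAMES].index)]
```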
2.5 Save preprocessing results as CSV
Finally, we'll save the results as CSV files that we'll ingest into the SQL Database.
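Saving is straightforward with `to_csv`; a sketch (written to a temporary directory here, while the notebook writes `tableau_games.csv`, `tableau_categories.csv`, and `tableau_tags.csv` alongside each other):

```python
import os
import tempfile
import pandas as pd

games = pd.DataFrame({"app_id": [1], "name": ["Game A"]})
out_dir = tempfile.mkdtemp()
games_path = os.path.join(out_dir, "tableau_games.csv")

# index=False keeps the CSV free of the pandas index column for SQL ingestion;
# repeat for the categories and tags tables.
games.to_csv(games_path, index=False)
```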
3. Setting Up Vanna with NVIDIA NIM and NeMo Retriever
First, set the NVIDIA API key. If you don't have one, you can generate it from the NVIDIA API catalog (build.nvidia.com).
Define Vanna Class using Milvus and OpenAI
Create NIM Client based on OpenAI Wrapper
Create NVIDIA Embedder (LangChain)
Define the Vector DB Client
To keep things simple, we will use a local Milvus vector DB.
Create a Vanna instance using the LLM, embedder, and vector DB defined above.
4. Ingest processed Steam data into a SQL DB
Great, the data is ingested into the SQL DB. Now, let's connect the SQL DB to Vanna and start providing more context about our data.
5. Let's train Vanna on our dataset
Let's verify the training data: