Text2sql Experiment
Experiments: Text2SQL
Let's work through a Text2SQL use case where we are starting from scratch without a nice and clean dataset of questions, SQL queries, or expected responses.
ℹ️This notebook requires:
- An OpenAI API key
- An Arize Space ID & Developer Key (explained below)
Setup Config
Copy the Arize developer API Key and Space ID from the Datasets page (shown below) to the variables in the cell below.

Let's make sure we can run async code in the notebook.
Lastly, let's make sure we have our openai API key set up.
Download Data
We are going to use the NBA dataset that information from 2014 - 2018. We will use DuckDB as our database.
Implement Text2SQL
Let's start by implementing a simple text2sql logic.
Awesome, looks like the LLM is producing SQL! let's try running the query and see if we get the expected results.
Evaluation
Evaluation consists of three parts — data, task, and scores. We'll start with data.
Let's store the data above as a versioned dataset in Arize.
Let's now pull down the dataset from Arize in this environment.
Next, we'll define the task. The task is to generate SQL queries from natural language questions.
Finally, we'll define the scores. We'll use the following simple scoring functions to see if the generated SQL queries are correct.
Run Experiment
Run experiment and log results to Arize
Ok! It looks like 3/5 of our queries are valid.
#Interpreting the results
Now that we ran the initial evaluation, it looks like two of the results are valid, two produce SQL errors, and one is incorrect.
-
The incorrect query didn't seem to get the date format correct. That would probably be improved by showing a sample of the data to the model (e.g. few shot example).
-
There are is a binder error, which may also have to do with not understanding the data format.
Let's try to improve the prompt with few-shot examples and see if we can get better results.
Looking much better! Finally, let's add a scoring function that compares the results, if they exist, with the expected results.
Amazing. It looks like we removed one of the errors, and got a result for the incorrect query. Let's try out using LLM as a judge to see how well it can assess the results.