Notebooks
A
Arize AI
Text2sql Experiment

Text2sql Experiment

arize-tutorialsLLMPythonexperiments

arize logo
Docs | GitHub | Slack Community

Experiments: Text2SQL


Let's work through a Text2SQL use case where we are starting from scratch without a nice and clean dataset of questions, SQL queries, or expected responses.

ℹ️This notebook requires:

  • An OpenAI API key
  • An Arize Space ID & Developer Key (explained below)
[ ]

Setup Config

Copy the Arize developer API Key and Space ID from the Datasets page (shown below) to the variables in the cell below.

[ ]
[ ]

Let's make sure we can run async code in the notebook.

[ ]

Lastly, let's make sure we have our openai API key set up.

[ ]

Download Data

We are going to use the NBA dataset that information from 2014 - 2018. We will use DuckDB as our database.

[ ]

Implement Text2SQL

Let's start by implementing a simple text2sql logic.

[ ]
[ ]

Awesome, looks like the LLM is producing SQL! let's try running the query and see if we get the expected results.

[ ]

Evaluation

Evaluation consists of three parts — data, task, and scores. We'll start with data.

[ ]

Let's store the data above as a versioned dataset in Arize.

[ ]

Let's now pull down the dataset from Arize in this environment.

[ ]

Next, we'll define the task. The task is to generate SQL queries from natural language questions.

[ ]

Finally, we'll define the scores. We'll use the following simple scoring functions to see if the generated SQL queries are correct.

[ ]

Run Experiment

Run experiment and log results to Arize

[ ]

Ok! It looks like 3/5 of our queries are valid.

#Interpreting the results

Now that we ran the initial evaluation, it looks like two of the results are valid, two produce SQL errors, and one is incorrect.

  • The incorrect query didn't seem to get the date format correct. That would probably be improved by showing a sample of the data to the model (e.g. few shot example).

  • There are is a binder error, which may also have to do with not understanding the data format.

Let's try to improve the prompt with few-shot examples and see if we can get better results.

[ ]

Looking much better! Finally, let's add a scoring function that compares the results, if they exist, with the expected results.

[ ]

Amazing. It looks like we removed one of the errors, and got a result for the incorrect query. Let's try out using LLM as a judge to see how well it can assess the results.