Using Arize with RAG
This guide shows you how to create a retrieval-augmented generation (RAG) chatbot and evaluate its performance with Arize. RAG is typically used to answer queries from a specified set of documents rather than from the LLM's training data alone, which reduces hallucinations and incorrect generations.
We'll go through the following steps:
- Create a RAG chatbot using LlamaIndex
- Trace the retrieval and LLM calls using Arize
- Create a dataset to benchmark performance
- Evaluate performance using LLM as a judge
Create a RAG chatbot using LlamaIndex
Let's start with all of our boilerplate setup:
- Install packages for tracing and retrieval
- Set up our API keys
- Set up Arize for tracing
- Create our LlamaIndex query engine
- See your results in the Arize UI
Install packages for tracing and retrieval
Set up our API keys
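A minimal sketch of one way to load keys, assuming they are stored in environment variables. The helper name `load_key` and any variable names you pass to it (for example `OPENAI_API_KEY`) are illustrative assumptions, not part of Arize or LlamaIndex.

```python
import os
from getpass import getpass


def load_key(name: str, prompt: bool = False) -> str:
    """Return the value of environment variable `name`.

    If the variable is unset and `prompt` is True, ask for it interactively
    and keep it in the environment for the rest of the session.
    """
    value = os.environ.get(name, "").strip()
    if not value and prompt:
        value = getpass(f"Enter {name}: ").strip()
        if value:
            os.environ[name] = value
    if not value:
        # Fail early rather than at the first API call.
        raise RuntimeError(f"{name} is not set")
    return value
```

In a notebook you might call `load_key("OPENAI_API_KEY", prompt=True)` once at the top, so later cells can rely on the variable being present.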
Set up Arize for tracing
To follow along with this tutorial, you'll need to sign up for Arize and get your API key. You can see the guide here.
Create our LlamaIndex query engine
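The actual query engine depends on your documents and LlamaIndex version. As a library-free illustration of what a query engine does before it calls the LLM, the sketch below ranks documents against a query and assembles a prompt from the top k. It is not the LlamaIndex API: `embed` is a toy word-count stand-in for a real embedding model, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the prompt that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"
```

Each of these stages (embedding the query, retrieving documents, calling the LLM with the assembled prompt) corresponds to a step you would expect to see in a trace.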
See your results in the Arize UI
Once you've run a single query, you can see the trace in the Arize UI, including each step: the retrieval, the embedding, and the LLM call.
Click through the queries to better understand how the query engine is performing. Arize can be used to understand and troubleshoot your RAG pipeline by surfacing:
- Application latency
- Token usage
- Runtime exceptions
- Retrieved documents
- Embeddings
- LLM parameters
- Prompt templates
- Tool descriptions
- LLM function calls
- And more!

Create a synthetic dataset of questions
Using the template below, we're going to generate a dataframe of 25 questions we can use to test our RAG chatbot.
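As a rough sketch of this step, the code below formats a question-generation prompt and parses the model's reply into rows. `QUESTION_TEMPLATE`, `build_question_prompt`, and `parse_questions` are hypothetical names, and the template wording is an assumption rather than the one used in this guide; the LLM call itself is left out.

```python
import re

# Hypothetical template; adjust the wording to your own documents.
QUESTION_TEMPLATE = (
    "You are helping test a chatbot that answers questions about the "
    "documents described below.\n"
    "Documents: {description}\n"
    "Write {n} distinct questions a user might ask, one per line, "
    "with no numbering."
)


def build_question_prompt(description: str, n: int = 25) -> str:
    """Fill in the template for a given document description."""
    return QUESTION_TEMPLATE.format(description=description, n=n)


def parse_questions(response: str) -> list[dict]:
    """Turn a newline-separated LLM reply into rows.

    Blank lines, stray bullets or numbering, and duplicates are dropped.
    """
    rows, seen = [], set()
    for line in response.splitlines():
        question = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if question and question.lower() not in seen:
            seen.add(question.lower())
            rows.append({"question": question})
    return rows
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)` to get a dataframe with a `question` column.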
Now let's run it and manually inspect the traces!
Evaluating your RAG app
Now that we have a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.
We will create an LLM-as-a-judge evaluator: we take the recorded spans and label them using evaluation prompt templates and the llm_classify function from Phoenix Evals. This function uses an LLM to evaluate your LLM calls and returns a label and an explanation for each. You can read more detail here.
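To make the idea concrete, here is a simplified, library-free sketch of an LLM judge that constrains its output to a fixed set of labels. It is not the llm_classify implementation: `JUDGE_TEMPLATE`, `snap_to_rails`, `judge`, and the `"UNPARSABLE"` fallback are hypothetical, and `call_llm` is any function you supply that sends a prompt to a model and returns its text.

```python
import re
from typing import Callable

# Hypothetical judge prompt; the placeholder names are assumptions.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Retrieved context: {context}\n"
    "Answer: {answer}\n"
    "Is the answer supported by the context? "
    "Reply with exactly one word: {rails}."
)


def snap_to_rails(raw_output: str, rails: list[str], fallback: str = "UNPARSABLE") -> str:
    """Map free-form judge output onto exactly one allowed label.

    Whole-word matching keeps "incorrect" from also matching "correct".
    Output that matches zero or several labels returns the fallback.
    """
    text = raw_output.lower()
    matches = [r for r in rails if re.search(rf"\b{re.escape(r.lower())}\b", text)]
    return matches[0] if len(matches) == 1 else fallback


def judge(
    question: str,
    context: str,
    answer: str,
    call_llm: Callable[[str], str],
    rails: list[str],
) -> str:
    """Ask the judge model for a verdict and snap it to the allowed labels."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, context=context, answer=answer, rails=" or ".join(rails)
    )
    return snap_to_rails(call_llm(prompt), rails)
```

Restricting the judge to a closed label set is what makes the results easy to aggregate later, since every row ends up with one of a few known values.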
Let's inspect the results of our evaluation!
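One simple way to inspect the results is to compute the share of each label and then read the explanations for the failing rows. The sketch below assumes each result is a dict with `label` and `explanation` keys and that `"correct"` is the passing label; those names are assumptions for illustration.

```python
from collections import Counter


def summarize_labels(rows: list[dict], label_key: str = "label") -> dict:
    """Return the fraction of rows carrying each label."""
    counts = Counter(row[label_key] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()} if total else {}


def failing_rows(rows: list[dict], good_label: str = "correct", label_key: str = "label") -> list[dict]:
    """Return the rows whose label is anything other than the passing label."""
    return [row for row in rows if row[label_key] != good_label]
```

Reading the explanations on the failing rows usually tells you whether the problem lies in retrieval (the right document was never fetched) or in generation (the answer ignored the context).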
Experiment with different k-values
We can also experiment with different k-values for the retriever, where k is the number of documents retrieved from the vector store. Chunk sizes, chunk overlaps, and rerankers are worth varying too. We'll be using the ColbertReranker from LlamaIndex. You can read more about it here.
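To show what these knobs control, the sketch below splits a token list into overlapping chunks and enumerates the combinations of k, chunk size, and overlap you might try. It is a simplified illustration under stated assumptions: `chunk_tokens` and `experiment_grid` are hypothetical helpers, and real splitters typically work on model tokens or sentences rather than a plain list.

```python
from itertools import product


def chunk_tokens(tokens: list, chunk_size: int, overlap: int) -> list[list]:
    """Split tokens into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def experiment_grid(k_values, chunk_sizes, overlaps) -> list[dict]:
    """Enumerate every valid (k, chunk_size, chunk_overlap) combination."""
    return [
        {"k": k, "chunk_size": size, "chunk_overlap": overlap}
        for k, size, overlap in product(k_values, chunk_sizes, overlaps)
        if overlap < size
    ]
```

Larger chunks and higher k put more context in front of the LLM at the cost of more tokens and latency, which is why it helps to compare configurations side by side rather than tune one setting in isolation.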
Let's set up our evaluators to see how the performance changes.
Let's log these results to Arize and see how they compare.
First we'll create a dataset to store our questions.
Next we'll define which columns of our dataframe will be mapped to outputs and which will be mapped to evaluation labels and explanations.
Now let's run it for each of our experiments.
Experiment with HyDE
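We can also experiment with HyDE (Hypothetical Document Embeddings), a retrieval technique in which an LLM first writes a hypothetical answer to the query, and the embedding of that answer, rather than of the query itself, is used to retrieve more relevant documents. You can read more about it here.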
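The core idea can be sketched without any libraries: generate a hypothetical answer, embed it, and rank documents against that embedding. In the sketch below, `generate` is any function you supply that calls an LLM, `embed` is a toy word-count stand-in for a real embedding model, and `hyde_retrieve` is a hypothetical name; none of this is the LlamaIndex HyDE API.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str,
    documents: list[str],
    generate: Callable[[str], str],
    k: int = 2,
) -> list[str]:
    """Retrieve using the embedding of a generated hypothetical answer."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    h = embed(hypothetical)  # embed the hypothetical answer, not the query
    ranked = sorted(documents, key=lambda d: cosine(h, embed(d)), reverse=True)
    return ranked[:k]
```

The hypothetical answer may be factually wrong, but it tends to share vocabulary and phrasing with the real documents, which is what helps when the user's question is worded very differently from the source text.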