RAG Evaluation
Authored by: Aymeric Roucher
This notebook demonstrates how you can evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using LLM-as-a-judge to compute its accuracy.
For an introduction to RAG, you can check this other cookbook!
RAG systems are complex: here is a RAG diagram, where we noted in blue all the possibilities for system enhancement:
Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system's performance! So let's see how to evaluate our RAG system.
Evaluating RAG performance
Since there are so many moving parts to tune with a big impact on performance, benchmarking the RAG system is crucial.
For our evaluation pipeline, we will need:
- An evaluation dataset with question - answer couples (QA couples)
- An evaluator to compute the accuracy of our system on the above evaluation dataset.
β‘οΈ It turns out, we can use LLMs to help us all along the way!
- The evaluation dataset will be synthetically generated by an LLM π€, and questions will be filtered out by other LLMs π€
- An LLM-as-a-judge agent π€ will then perform the evaluation on this synthetic dataset.
Let's dig into it and start building our evaluation pipeline! First, we install the required model dependencies.
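As a starting point, an install cell could look like the following; the exact package list is an assumption and should be adapted to your environment (e.g. `faiss-gpu` instead of `faiss-cpu` if you have a GPU):

```python
# Indicative dependency list for this notebook (adapt packages/versions to your setup)
%pip install -q torch transformers langchain langchain-community sentence-transformers faiss-cpu openai pandas datasets tqdm
```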
Load your knowledge base
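For example, assuming a knowledge base hosted as a Hugging Face dataset with `text` and `source` columns (the dataset name below is the Hugging Face documentation corpus also used in the companion RAG cookbook; swap in your own):

```python
from datasets import load_dataset
from langchain.docstore.document import Document

# Load the knowledge base (here: a corpus of Hugging Face documentation pages)
ds = load_dataset("m-ric/huggingface_doc", split="train")

# Wrap each row in a LangChain Document, keeping the source URL as metadata
langchain_docs = [
    Document(page_content=row["text"], metadata={"source": row["source"]})
    for row in ds
]
```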
1. Build a synthetic dataset for evaluation
We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base, and ask an LLM to generate questions based on these documents.
Then we set up other LLM agents to act as quality filters for the generated QA couples: each of them will act as the filter for a specific flaw.
1.1. Prepare source documents
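A minimal sketch of this step, assuming the `langchain_docs` list built above; the chunk size and the number of sampled contexts are arbitrary choices for illustration:

```python
import random
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Cut documents into medium-sized chunks that will serve as contexts for question generation
context_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
docs_processed = [
    chunk for doc in langchain_docs for chunk in context_splitter.split_documents([doc])
]

# Randomly sample a handful of contexts to generate questions from
sampled_contexts = random.sample(docs_processed, 10)
```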
1.2. Setup agents for question generation
We use Mixtral for QA couple generation because it has excellent performance on leaderboards such as Chatbot Arena.
Now let's generate our QA couples. For this example, we generate only 10 QA couples and will load the rest from the Hub.
But for your own knowledge base, given that you want at least ~100 test samples, and accounting for the fact that our critique agents will later filter out around half of them, you should generate many more: upwards of 200 samples.
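Here is one way the generation loop could look; the prompt wording and the `call_llm` helper (a stand-in for however you query Mixtral or another capable chat model) are illustrative assumptions:

```python
QA_GENERATION_PROMPT = """Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.

Now here is the context.

Context: {context}

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)
Output:::"""

outputs = []
for context_doc in sampled_contexts:
    # call_llm is a placeholder for however you query your generator model (e.g. Mixtral)
    generated = call_llm(QA_GENERATION_PROMPT.format(context=context_doc.page_content))
    question = generated.split("Factoid question: ")[-1].split("Answer: ")[0].strip()
    answer = generated.split("Answer: ")[-1].strip()
    outputs.append(
        {
            "context": context_doc.page_content,
            "question": question,
            "answer": answer,
            "source_doc": context_doc.metadata["source"],
        }
    )
```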
1.3. Setup critique agents
The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.
We thus build critique agents that will rate each question on several criteria, given in this paper:
- Groundedness: can the question be answered from the given context?
- Relevance: is the question relevant to users? For instance, "What is the date when transformers 4.29.1 was released?" is not relevant for ML practitioners.
One last failure case we've noticed is when a question is tailored to the particular setting where it was generated, but is undecipherable on its own, like "What is the name of the function used in this guide?".
We also build a critique agent for this criterion:
- Stand-alone: is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be "What is the function used in this article?" for a question generated from a specific blog article.
We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.
π‘ When asking the agents to output a score, we first ask them to produce a rationale. This helps us verify the scores, but most importantly, asking for the rationale first gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.
We now build and run these critique agents.
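As an illustration, here is how the groundedness critique could be implemented; the prompt wording is an assumption, and the relevance and stand-alone critiques follow the same pattern with their own prompts:

```python
GROUNDEDNESS_CRITIQUE_PROMPT = """You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well the question can be answered unambiguously using only the given context.
Give your answer on a scale of 1 to 5, where 1 means "not answerable at all" and 5 means "clearly and unambiguously answerable".

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating)
Total rating: (your rating, as a number between 1 and 5)

Now here are the question and context.

Question: {question}
Context: {context}

Answer:::"""


def score_groundedness(qa: dict) -> tuple[int, str]:
    # Note: the prompt asks for the rationale before the score (see the tip above)
    response = call_llm(
        GROUNDEDNESS_CRITIQUE_PROMPT.format(question=qa["question"], context=qa["context"])
    )
    rationale = response.split("Evaluation:")[-1].split("Total rating:")[0].strip()
    score = int(response.split("Total rating:")[-1].strip()[0])
    return score, rationale


for qa in outputs:
    qa["groundedness_score"], qa["groundedness_rationale"] = score_groundedness(qa)
    # ...compute relevance_score and standalone_score the same way with their own prompts
```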
Now let us filter out bad questions based on our critique agent scores:
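A minimal sketch of the filter, assuming each QA couple now carries the three critique scores and using a threshold of 4 out of 5 on every criterion (the threshold is a judgment call):

```python
import pandas as pd

generated_questions = pd.DataFrame(outputs)

# Keep only questions that score at least 4/5 on every critique
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
```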
Evaluation dataset before filtering:
Final evaluation dataset:
Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.
We have generated only a few QA couples here to reduce time and cost. But let's kickstart the next part by loading a pre-generated dataset:
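For instance (the dataset name below is assumed to be the pre-generated evaluation set published for this cookbook; replace it with your own if needed):

```python
import datasets

# Load a pre-generated synthetic evaluation dataset from the Hub
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")
```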
2. Build our RAG System
2.1. Preprocessing documents to build our vector database
- In this part, we split the documents from our knowledge base into smaller chunks: these will be the snippets that are picked by the Retriever, to then be ingested by the Reader LLM as supporting elements for its answer.
- The goal is to build semantically relevant snippets: not so small that they cannot support an answer, and not so large that they dilute individual ideas.
Many options exist for text splitting:
- split every `n` words / characters, but this risks cutting paragraphs or even sentences in half
- split after `n` words / characters, but only on sentence boundaries
- recursive split: tries to preserve even more of the document structure by processing it in a tree-like way, splitting first on the largest units (chapters), then recursively on smaller units (paragraphs, sentences)
To learn more about chunking, I recommend you read this great notebook by Greg Kamradt.
This space lets you visualize how different splitting options affect the chunks you get.
In the following, we use LangChain's `RecursiveCharacterTextSplitter`.
π‘ To measure chunk length in our text splitter, our length function will not count characters but tokens in the tokenized text: since the downstream embedder processes tokens, measuring length in tokens is more relevant and empirically performs better.
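Concretely, this can be done with `RecursiveCharacterTextSplitter.from_huggingface_tokenizer`, so that `chunk_size` is counted in tokens of the embedding model's tokenizer; the model name, chunk size, and separators below are illustrative choices:

```python
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

EMBEDDING_MODEL_NAME = "thenlper/gte-small"  # embedding model whose tokenizer defines chunk length

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME),
    chunk_size=200,  # measured in tokens, not characters
    chunk_overlap=20,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_for_index = text_splitter.split_documents(langchain_docs)
```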
2.2. Retriever - embeddings ποΈ
The retriever acts like an internal search engine: given the user query, it returns the most relevant documents from your knowledge base.
For the knowledge base, we use LangChain's vector stores, since they offer a convenient FAISS index and allow us to keep document metadata throughout the processing.
π οΈ Options included:
- Tune the chunking method:
- Size of the chunks
- Method: split on different separators, use semantic chunking...
- Change the embedding model
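A sketch of how the vector database could be built with LangChain and FAISS, reusing the splitter output and embedding model name from above (any of the options listed can be swapped in):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    encode_kwargs={"normalize_embeddings": True},  # normalized vectors for cosine similarity
)

# Build the FAISS index over the chunked documents; document metadata (e.g. source) is preserved
knowledge_index = FAISS.from_documents(
    docs_for_index, embedding_model, distance_strategy=DistanceStrategy.COSINE
)

# Retrieve the top-k most relevant chunks for a query
retrieved_docs = knowledge_index.similarity_search("How to create a pipeline object?", k=5)
```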
2.3. Reader - LLM π¬
In this part, the LLM Reader reads the retrieved documents to formulate its answer.
π οΈ Here we tried the following options to improve results:
- Switch reranking on/off
- Change the reader model
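As an illustration, the reader step could look like this; the prompt wording and the `call_reader_llm` helper are assumptions, and the reranking step is only marked as a comment:

```python
RAG_PROMPT_TEMPLATE = """Using the information contained in the context, give a comprehensive answer to the question.
Respond only to the question asked; the response should be concise and relevant to the question.
If the answer cannot be deduced from the context, do not give an answer.

Context:
{context}

Question: {question}"""


def answer_with_rag(question: str, num_retrieved_docs: int = 5) -> tuple[str, list[str]]:
    # 1. Retrieve the supporting chunks from the vector database
    relevant_docs = knowledge_index.similarity_search(question, k=num_retrieved_docs)
    # 2. (Optional) rerank relevant_docs here before building the context
    context = "\n\n".join(
        f"Document {i}:::\n{doc.page_content}" for i, doc in enumerate(relevant_docs)
    )
    # 3. Ask the reader LLM to answer based on the retrieved context
    answer = call_reader_llm(RAG_PROMPT_TEMPLATE.format(context=context, question=question))
    return answer, [doc.page_content for doc in relevant_docs]
```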
3. Benchmarking the RAG system
The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system's output on this evaluation dataset.
To this end, we set up a judge agent. βοΈπ€
Out of the different RAG evaluation metrics, we choose to focus only on Answer Correctness since it is the best end-to-end metric of our system's performance.
We use GPT-4 as a judge for its empirically good performance, but you could try other models such as kaist-ai/prometheus-13b-v1.0 or BAAI/JudgeLM-33B-v1.0.
π‘ In the evaluation prompt, we give a detailed description of each metric on the 1-5 scale, as is done in Prometheus's prompt template: this helps the model ground its ratings precisely. If instead you give the judge LLM a vague scale to work with, its outputs will not be consistent enough across different examples.
π‘ Again, prompting the LLM to output rationale before giving its final score gives it more tokens to help it formalize and elaborate a judgement.
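Putting both tips together, a judge could be sketched as follows; the rubric wording and the `call_gpt4` helper are illustrative assumptions:

```python
EVALUATION_PROMPT = """You will be given a question, a reference answer, and a generated answer.
Your task is to rate the generated answer on 'answer correctness' using the following 1-5 scale:
1: completely incorrect or off-topic
2: mostly incorrect, with only a small fraction of correct information
3: partially correct, but missing or contradicting important parts of the reference answer
4: mostly correct, with only minor omissions or inaccuracies
5: fully correct and complete with respect to the reference answer

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating)
Total rating: (your rating, as a number between 1 and 5)

Now here are the question and answers.

Question: {question}
Reference answer: {reference_answer}
Generated answer: {generated_answer}

Feedback:::"""


def judge_answer(question: str, reference_answer: str, generated_answer: str) -> tuple[int, str]:
    # call_gpt4 is a placeholder for however you query the judge model
    response = call_gpt4(
        EVALUATION_PROMPT.format(
            question=question,
            reference_answer=reference_answer,
            generated_answer=generated_answer,
        )
    )
    rationale = response.split("Evaluation:")[-1].split("Total rating:")[0].strip()
    score = int(response.split("Total rating:")[-1].strip()[0])
    return score, rationale
```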
π Let's run the tests and evaluate answers!π
Inspect results
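One way to aggregate the per-question judge scores from the saved run files into a comparison like the table below; the file layout, column name, and the 1-5 to [0, 1] rescaling are assumptions about this notebook's output convention:

```python
import glob
import pandas as pd

runs = []
for path in glob.glob("./output/*.json"):
    run = pd.read_json(path)
    run["settings"] = path
    runs.append(run)
result = pd.concat(runs)

# Rescale the 1-5 judge scores to [0, 1] and average per configuration
result["eval_score_GPT4"] = (result["eval_score_GPT4"] - 1) / 4
average_scores = result.groupby("settings")["eval_score_GPT4"].mean().sort_values()
print(average_scores)
```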
| settings | eval_score_GPT4 |
|---|---|
| ./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta.json | 0.884328 |
| ./output/rag_chunk:200_embeddings:BAAI~bge-base-en-v1.5_rerank:False_reader-model:zephyr-7b-beta.json | 0.906716 |
| ./output/rag_chunk:200_embeddings:BAAI~bge-base-en-v1.5_rerank:True_reader-model:zephyr-7b-beta.json | 0.906716 |
| ./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:mixtral.json | 0.906716 |
| ./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta.json | 0.921642 |
| ./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:mixtral0.json | 0.947761 |
Example results
Let us load the results that I obtained by tweaking the different options available in this notebook. For more detail on why these options could work or not, see the notebook on advanced_RAG.
As you can see in the graph below, some tweaks bring no improvement at all, while others give huge performance boosts.
β‘οΈ There is no single good recipe: you should try several different directions when tuning your RAG systems.
As you can see, these had varying impact on performance. In particular, tuning the chunk size is both easy and very impactful.
But this is just our case: your results could be very different. Now that you have a robust evaluation pipeline, you can set out to explore other options! πΊοΈ