Synthetic Data Generation for RAG Evaluation
This notebook demonstrates how to use LLMs to generate question-answer pairs from a knowledge dataset. It uses a dataset of PDF files that contain NVIDIA blog posts.

Step 1: Load the PDF Data
The LangChain library provides document loaders that handle several data formats (HTML, PDF, code) from different sources and locations (private S3 buckets, public websites, etc.).
LangChain document loaders provide a load method that outputs a piece of text (page_content) and associated metadata. Learn more about LangChain document loaders from their documentation.
This notebook uses a LangChain UnstructuredFileLoader instance to load a PDF of an NVIDIA blog post.
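To make the shape of a loader's output concrete, here is a minimal standard-library sketch. It is not the LangChain loader itself: the Document class and the load_text_file helper are hypothetical stand-ins that only mirror the page_content and metadata fields described above, and the "source" metadata key is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in with the two fields a loader returns."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Read one plain-text file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding="utf-8")
    # Assumption: the file path is recorded under a "source" metadata key.
    return [Document(page_content=text, metadata={"source": str(path)})]
```

A real PDF loader additionally has to extract the text from the file; the sketch skips that step and reads plain text only.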
Step 2: Transform the Data
The goal of this step is to break large documents into smaller chunks.
The LangChain library provides a variety of document transformers, such as text splitters.
This example uses the generic RecursiveCharacterTextSplitter with the chunk size set to 3K characters and the overlap set to 100.
Let's check the number of chunks of the document.
Let's check the first chunk of the document.
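The effect of the chunk size and overlap settings can be illustrated with a simplified sketch. This is plain fixed-size character windowing with overlap, not the recursive separator-based splitting that RecursiveCharacterTextSplitter performs; the function name split_with_overlap and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size=3000, chunk_overlap=100):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "NVIDIA blog text. " * 500  # placeholder text, not real blog content
chunks = split_with_overlap(sample)
print(len(chunks))      # number of chunks
print(chunks[0][:80])   # beginning of the first chunk
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks.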
Step 3: Generate Question-Answer Pairs
Instruction prompt:
Given the previous paragraph, create one very good question answer pair.
Your output should be in a json format of individual question answer pairs.
Restrict the question to the context information provided.
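One way to combine a document chunk with the instruction prompt above is sketched below. The layout (chunk first, then the instruction, so that "the previous paragraph" refers to the chunk) is an assumption, and build_prompt is a hypothetical helper rather than code from the notebook.

```python
INSTRUCTION_PROMPT = (
    "Given the previous paragraph, create one very good question answer pair.\n"
    "Your output should be in a json format of individual question answer pairs.\n"
    "Restrict the question to the context information provided."
)


def build_prompt(chunk, instruction=INSTRUCTION_PROMPT):
    """Place the chunk before the instruction, separated by a blank line."""
    return f"{chunk.strip()}\n\n{instruction}"


print(build_prompt("Example paragraph taken from one document chunk."))
```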
The NVIDIA API Catalog on NGC enables developers to experience state-of-the-art LLMs accelerated on NVIDIA DGX Cloud with NVIDIA TensorRT and Triton Inference Server. Developers get free credits for 10K requests to any of the models. Sign up by going to https://build.ngc.nvidia.com/explore/discover?signin=true.
After you sign in, go to the Llama 3 70B Instruct page. Click Get API Key and save the generated API key.
This notebook uses the LLM to generate the question-answer pairs.
Use the LangChain connector to generate the question-answer pair from the previous context prompt, document chunk and instruction prompt. Populate your API key in the following cell.
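Model replies often wrap the JSON in extra text, so it helps to parse them defensively. The following is a sketch under assumptions: the reply is assumed to contain one JSON object with "question" and "answer" keys (the actual key names depend on what the model returns), and llm is any callable that maps a prompt string to a reply string. A stub replaces the hosted model so the example runs without an API key.

```python
import json


def parse_qa_reply(reply):
    """Parse the text between the first '{' and the last '}' as JSON."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    # Assumption: the model used these two key names.
    return data["question"], data["answer"]


def generate_qa(prompt, llm):
    """Send the prompt to the model and return a (question, answer) tuple."""
    return parse_qa_reply(llm(prompt))


def fake_llm(prompt):
    """Stub standing in for the hosted model; returns a canned reply."""
    return 'Here is the pair:\n{"question": "What is X?", "answer": "X is Y."}'


question, answer = generate_qa("chunk text\n\ninstruction", fake_llm)
print(question)
print(answer)
```

In the notebook, the stub would be replaced by a call through the LangChain connector with your API key.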
End-to-End Synthetic Data Generation
We ran the above steps on a dataset of 600 PDFs of NVIDIA blogs and saved the data in the JSON format below, where gt_context is the ground truth context and gt_answer is the ground truth answer.
{
'gt_context': chunk,
'document': filename,
'question': "xxxx",
'gt_answer': "xxxx"
}
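A minimal sketch of assembling records in the format above and writing them to disk. The helper names make_record and save_records are hypothetical, and so is any output file name you choose. Note that json.dump writes double-quoted strings, as JSON requires, even though the format above is shown with Python-style single quotes.

```python
import json


def make_record(chunk, filename, question, gt_answer):
    """Build one record with the four fields of the format above."""
    return {
        "gt_context": chunk,
        "document": filename,
        "question": question,
        "gt_answer": gt_answer,
    }


def save_records(records, path):
    """Write the list of records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
```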
Synthetic Data Post-processing
So far, each record in the generated JSON file contains gt_context, document, question, and gt_answer.
To evaluate retrieval augmented generation (RAG) systems, we need to add the RAG results fields (populated in the next notebook):
contexts: Retrieved documents by the retriever
answer: Generated answer
The new dataset JSON format should be:
{
'gt_context': chunk,
'document': filename,
'question': "xxxxx",
'gt_answer': "xxx xxx xxxx",
'contexts':
'answer':
}
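The post-processing step can be sketched as follows. The placeholder values are assumptions: an empty list for contexts and an empty string for answer, to be filled in once the RAG system has been run. The function name add_rag_fields is hypothetical.

```python
def add_rag_fields(records):
    """Return copies of the records with empty 'contexts' and 'answer' fields."""
    # Assumption: [] and "" serve as placeholders until the RAG results exist.
    return [{**record, "contexts": [], "answer": ""} for record in records]


records = [
    {"gt_context": "chunk", "document": "blog.pdf",
     "question": "xxxxx", "gt_answer": "xxx xxx xxxx"},
]
extended = add_rag_fields(records)
print(list(extended[0]))
```

The original records are left unchanged because each one is copied into a new dictionary.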