Retriever Customization - Synthetic Data Generation (Part 1/2)
Authors - Aditya Malte, Dora Li, Vinay Raman
Introduction
Text retrievers and embedding models play a crucial role in modern information retrieval systems by converting both queries and documents into dense numerical vectors (embeddings) that capture their semantic meaning. This allows the system to find relevant documents by measuring the similarity between a query's embedding and document embeddings in the database.
The accuracy of these models directly impacts their usefulness. When a retriever has been trained primarily on one type of content (like general web text or news articles) but is asked to retrieve documents from a specialized domain (such as medical literature), its performance can degrade significantly.
This is why many organizations fine-tune domain-specific retrievers for their particular use cases, ensuring more accurate and relevant document retrieval. As with all fine-tuning, high-quality domain-specific data is required. It can be generated with LLMs such as NVIDIA's Nemotron-4-340B-Instruct, which is specially trained and licensed for synthetic data generation. Other models, such as Llama-3.1-405B or Mixtral-8x22B-Instruct, can also produce good results.
Overview
This two-part tutorial demonstrates how to improve retrieval performance by fine-tuning embedding models using synthetic training data. The process is split across two notebooks:
- synthetic_data_generation_nemo.ipynb (this notebook):
  - Use an LLM from build.nvidia.com (or deploy your own using NIM!) to create training examples containing generated queries and positive chunks. By default the notebook uses nfcorpus, but you can easily swap in your own data.
  - Save the results to a .csv file.
- retriever_customization.ipynb:
  - Implement hard negative mining to find challenging negative examples.
  - Use the generated training data in the .csv file to fine-tune a retriever model using the NeMo Framework.
  - Evaluate your fine-tuned embedding model against the original using the BeIR benchmark.
NOTE: This tutorial is only meant as a demo, so only a small subset of the corpus is used for training data generation, allowing the notebook to complete in a reasonable time. A GPU is required to run notebook 2, but not notebook 1 if an LLM endpoint is used.
Setup Instructions
NeMo Framework Docker Container
This notebook runs in a Docker environment built from the NeMo Framework repo. Refer to https://github.com/NVIDIA/NeMo/tree/main for instructions on how to build and run the Docker containers. Ensure that the Docker container you run this notebook in is built from the main branch of the NeMo repository. These notebooks were tested on NeMo Framework 24.07 on a single-GPU machine (L40S).
Run the Docker container from inside the synthetic-data-retriever-customization directory using this command:
docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.07
NVIDIA AI Endpoints
You'll need access to an LLM for generating queries. By default, this notebook uses the Nemotron-4-340B-Instruct API endpoint from build.nvidia.com.
An API Key is required. Get your API Key by following the link above to the model and clicking on "Build with this NIM". All new users will get a number of tokens upon registering. Set the environment variable NVIDIA_API_KEY with your API key value.
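For example, on Linux or macOS you can set the variable in your shell before launching the container (the key below is a placeholder; substitute your actual key from build.nvidia.com):

```shell
# Placeholder value -- replace with your real API key.
export NVIDIA_API_KEY="nvapi-xxxxxxxxxxxxxxxx"
```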
Optionally, you can self-host a model using NIM (NVIDIA Inference Microservices) and pass in the local URL when creating your LLM client later on. Follow the instructions in the link. Note that system GPU requirements will depend on the model you choose to deploy.
Import Libraries
Specify the directory where the final .csv with generated QA pairs will be saved
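A minimal setup sketch for this step (the directory and file names below are assumptions; point them wherever you like):

```python
import os

# Hypothetical output location for the generated QA pairs.
SAVE_DIR = "./synthetic_data"
os.makedirs(SAVE_DIR, exist_ok=True)

# The final dataset of generated queries + positive chunks lands here.
OUTPUT_CSV = os.path.join(SAVE_DIR, "qa_pairs.csv")
```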
Download the nfcorpus Dataset
As an example, we use the public nfcorpus text dataset as the source for synthetic data generation. You can choose any other existing dataset or, ideally, provide your own proprietary documents to generate data from.
Synthetic Data Generation from Knowledge Base
In this section we will:
- Break each text sample in the nfcorpus dataset that we downloaded into smaller chunks.
- Compose an LLM prompt that provides detailed instructions on how to generate queries based on each chunk.
- Send the prompts to our LLM as an asynchronous batch job.
- Parse the queries and populate our synthetic dataset with query + positive chunks.
1. Chunk Knowledge Base
Chunking is required to break large documents into smaller pieces that an LLM can take as input. In this case we chunk the texts into samples of roughly 300 words, ensuring that sentences are not broken.
Notes:
- We sample only 100 of the roughly 5,000 documents in the corpus so that the notebook completes in a reasonable time for this tutorial. Feel free to increase this, especially if you are running with your own data.
- Most of the nfcorpus documents are already very short passages, so they will contain only a single chunk.
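A sentence-aware chunker can be sketched as follows (a minimal illustration, not the notebook's exact implementation; the naive regex sentence splitter could be swapped for nltk or spaCy):

```python
import re

def chunk_text(text: str, max_words: int = 300) -> list[str]:
    """Split text into chunks of at most ~max_words words without breaking sentences."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because a chunk is only flushed at sentence boundaries, a single sentence longer than `max_words` still becomes its own (oversized) chunk rather than being cut mid-sentence.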
2. Prompt Generation
A prompt provides context to the LLM for generation. You should modify this prompt as appropriate for your specific domain. Prompt engineering is incredibly important and can greatly impact the quality of the generated queries.
The default prompt in this example is taken from the NVIDIA documentation/help page. It gives the model detailed instructions and examples of the types of queries it should generate. In this prompt we ask the LLM to generate three unique queries for each chunk.
With NVIDIA AI Endpoints, you can request that the queries be returned as JSON by specifying a JSON schema as follows.
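The schema itself might look like the sketch below (the exact mechanism for attaching it to a request, e.g. a guided-JSON extension field, depends on the endpoint and API version):

```python
import json

# JSON schema constraining the response to an object with exactly three query strings.
query_schema = {
    "type": "object",
    "properties": {
        "queries": {
            "type": "array",
            "items": {"type": "string"},
            "minItems": 3,
            "maxItems": 3,
        }
    },
    "required": ["queries"],
}

# Serialized form, in case the API expects the schema as a string.
schema_str = json.dumps(query_schema)
```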
3. Synthetic Data Generation
Now we'll use Nemotron-4-340B-Instruct from NVIDIA AI Endpoints (build.nvidia.com) to generate synthetic data. Make sure you have a valid API key stored in the environment variable NVIDIA_API_KEY; you can generate one by following the link above.
The NVIDIA AI endpoint follows the OpenAI API schema, so we'll use the AsyncOpenAI() client to send many requests to the server asynchronously.
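The fan-out pattern can be sketched with a small, self-contained helper; the endpoint URL and model name in the comments are assumptions to illustrate where an `AsyncOpenAI` coroutine would plug in:

```python
import asyncio

async def run_batch(coro_fn, items, max_concurrency: int = 8) -> list:
    """Apply an async function to every item concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with sem:
            return await coro_fn(item)

    # gather() preserves input order, so results line up with the chunks.
    return await asyncio.gather(*(bounded(i) for i in items))

# With the OpenAI client, the per-chunk coroutine might look like this
# (untested sketch; URL, model name, and make_prompt are assumptions):
#
#   from openai import AsyncOpenAI
#   client = AsyncOpenAI(base_url="https://integrate.api.nvidia.com/v1",
#                        api_key=os.environ["NVIDIA_API_KEY"])
#
#   async def generate(chunk):
#       resp = await client.chat.completions.create(
#           model="nvidia/nemotron-4-340b-instruct",
#           messages=[{"role": "user", "content": make_prompt(chunk)}],
#       )
#       return resp.choices[0].message.content
#
#   results = asyncio.run(run_batch(generate, chunks))
```

The semaphore bounds in-flight requests, which keeps a large corpus from overwhelming the endpoint's rate limits.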
The output should look like this:
{
"queries": [
"What is the total number of reported cases of cardiac tamponade resulting from acupuncture, as identified in the systematic review?",
"How many of the reported cases of cardiac tamponade caused by acupuncture had fatal outcomes, according to the literature review?",
"What measure does the systematic review suggest to reduce the risk of cardiac tamponade in acupuncture practice?"
]
}
4. Parsing Generations
We'll do some simple text parsing to extract the generated queries, then store them as individual entries in the dataset.
Example output:
['What is the range of BMAA concentrations found in cyanobacterial blooms in South Florida?', 'Which neurodegenerative diseases have been linked to BMAA exposure?', 'What is the highest BMAA concentration found in resident animals used as human food in South Florida?']
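Assuming the model returns JSON shaped like the example above, a defensive parser might look like this (a sketch; the function name is ours):

```python
import json

def parse_queries(raw: str) -> list[str]:
    """Extract the list of generated queries from an LLM response string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed generations are dropped rather than crashing the batch.
        return []
    if isinstance(data, dict):
        return data.get("queries", [])
    return []
```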
Save QA Pair Data
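A minimal save step using the standard library (the column names `query` and `pos_doc` are assumptions; match whatever schema notebook 2 expects):

```python
import csv

# Hypothetical rows: each generated query paired with its source (positive) chunk.
rows = [
    {"query": "What is BMAA?",
     "pos_doc": "BMAA is a neurotoxin produced by cyanobacteria."},
]

with open("qa_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "pos_doc"])
    writer.writeheader()
    writer.writerows(rows)
```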
Congratulations, you've successfully generated a synthetic dataset for fine-tuning a text embedding model! In the next notebook you'll use the .csv file you just generated to fine-tune NV-EmbedQA-V4 using the NeMo Framework.