Retriever Customization - Fine-Tuning & Evaluation (2/2)

Authors - Aditya Malte, Vinay Raman, Ali Taghibakhshi, Dora Li

Overview

This is part two of a two-part series.

  1. synthetic_data_generation_nemo.ipynb:

    • Use an LLM from build.nvidia.com (or deploy your own using NIM!) to create training examples containing generated queries and positive chunks. By default the notebook will use nfcorpus, but you can easily swap in your own data.
    • Save results to a .jsonl file
  2. retriever_customization.ipynb (this notebook):

    • Implement hard negative mining to find challenging negative examples
    • Use the generated training data in the .jsonl file to fine-tune a retriever model using NeMo Framework
    • Evaluate your fine-tuned embedding model against the original model using the BeIR benchmark

A GPU is required to run this notebook.

Setup Instructions

NeMo Framework Docker container

This notebook requires the NeMo Framework Docker container. Pull the appropriate Docker image and launch the container from inside the synthetic-data-retriever-customization directory using this command:

docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.07

This notebook was tested on a single L40S GPU with CUDA installed.

NVIDIA AI Endpoints

As in Notebook 1, this notebook uses an API endpoint from www.build.nvidia.com, this time to generate embeddings with the NV-EmbedQA-E5-V5 text embedding model. You can reuse the same API key as before, or generate a new one by clicking the link to the model.

Download NV-Embed-QA-4 model weights from NGC

Use the command ngc registry model download-version "ohlfw0olaadg/ea-participants/nv-embed-qa:4" to download the NeMo Retriever model. It must be downloaded to the files/models directory. If you do not have NVAIE access, you may instead download and convert a Hugging Face embedding model such as intfloat/e5-large-unsupervised as follows:

python /NeMo/scripts/nlp_language_modeling/convert_bert_hf_to_nemo.py \
       --input_name_or_path "intfloat/e5-large-unsupervised" \
       --output_path /workspace/files/models/my_model.nemo

For the purposes of this notebook, we use the NeMo Retriever model. If you use another model or convert a Hugging Face model, ensure that the model path is updated accordingly.

[ ]

Import libraries and set configuration

[ ]
[ ]

Parameters for Fine-Tuning

[ ]
[ ]
[ ]

Mining Hard Negatives

Hard negative mining creates negative examples that are "hard": rather than sampling negatives at random, which tends to produce easy negatives, we mine for passages that score highly against the query but are not actually relevant.

The advantage is that these negatives are not obvious to the model during training, so they provide a more useful learning signal.

However, hard negative mining has a higher probability of producing false negatives (passages that are actually relevant). To mitigate this, we set a safety margin. The margin is a hyperparameter and you may change it depending on whether too many false negatives are being generated. For instance, a larger corpus is more likely to produce false negatives than a smaller one, since the chance of finding another genuinely relevant passage increases; in such cases a lower margin value may be more helpful.
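
To illustrate the idea before we get to the notebook's own mining cells, here is a minimal sketch assuming you already have L2-normalized query and passage embeddings as NumPy arrays; the names query_embs, passage_embs, pos_idx, and margin are placeholders, not variables defined by this notebook:

import numpy as np

def mine_hard_negatives(query_embs, passage_embs, pos_idx, margin=0.95, num_negatives=5):
    # query_embs:   (num_queries, dim) L2-normalized query embeddings
    # passage_embs: (num_passages, dim) L2-normalized passage embeddings
    # pos_idx:      pos_idx[i] is the index of query i's positive passage
    # margin:       candidates scoring above margin * positive score are
    #               treated as potential false negatives and skipped
    sims = query_embs @ passage_embs.T  # cosine similarity of every query vs every passage

    hard_negatives = []
    for i, pos in enumerate(pos_idx):
        pos_score = sims[i, pos]
        ranked = np.argsort(-sims[i])  # most to least similar
        negs = []
        for j in ranked:
            if j == pos:
                continue  # never use the positive itself
            if sims[i, j] >= margin * pos_score:
                continue  # too close to the positive -> likely false negative
            negs.append(int(j))
            if len(negs) == num_negatives:
                break
        hard_negatives.append(negs)
    return hard_negatives

In this sketch a lower margin skips more of the top-scoring candidates, which reduces false negatives at the cost of slightly easier negatives, matching the guidance above.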

NV-EmbedQA-E5-V5

To do hard negative mining, we'll need to create embeddings for all of our text chunks using the NV-EmbedQA-E5-V5 model from www.build.nvidia.com. You can reuse the same NVIDIA_API_KEY as before.

Since the NV-EmbedQA-E5-V5 model is quite small, you can also easily host it as a self-deployed NIM Docker container by following the instructions here. If you already have the model weights for a .nemo-format embedding model downloaded in preparation for fine-tuning, you can also restore that model using NeMo Framework: simply copy the encode_text() function from the evaluation section of this notebook and use it here.
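
For reference, a minimal sketch of calling the hosted embedding endpoint with the OpenAI-compatible client; the base URL, model identifier, and extra_body fields below reflect the build.nvidia.com catalog as we understand it and may need adjusting for your deployment:

import os
from openai import OpenAI

# The hosted endpoints on build.nvidia.com expose an OpenAI-compatible API.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

def embed(texts, input_type="passage"):
    # input_type is "query" for queries and "passage" for corpus chunks
    response = client.embeddings.create(
        model="nvidia/nv-embedqa-e5-v5",
        input=texts,
        extra_body={"input_type": input_type, "truncate": "END"},
    )
    return [d.embedding for d in response.data]

query_embs = embed(["How do vaccines work?"], input_type="query")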

BeIR

BeIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common, easy-to-use framework for evaluating retrieval models against the datasets in the benchmark. First we'll do some basic processing so that our synthetic dataset matches the BeIR format.
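
As a rough sketch of what that processing looks like (the notebook's own cells below do the actual conversion), BeIR's GenericDataLoader expects a corpus.jsonl, a queries.jsonl, and a tab-separated qrels file. The records list and output directory here are placeholders:

import csv
import json
import os

def to_beir_format(records, out_dir):
    # records: list of {"query": str, "pos_doc": str} pairs from the synthetic dataset
    os.makedirs(os.path.join(out_dir, "qrels"), exist_ok=True)

    with open(os.path.join(out_dir, "corpus.jsonl"), "w") as corpus_f, \
         open(os.path.join(out_dir, "queries.jsonl"), "w") as queries_f, \
         open(os.path.join(out_dir, "qrels", "test.tsv"), "w", newline="") as qrels_f:

        qrels = csv.writer(qrels_f, delimiter="\t")
        qrels.writerow(["query-id", "corpus-id", "score"])

        for i, rec in enumerate(records):
            doc_id, query_id = f"doc{i}", f"q{i}"
            corpus_f.write(json.dumps({"_id": doc_id, "title": "", "text": rec["pos_doc"]}) + "\n")
            queries_f.write(json.dumps({"_id": query_id, "text": rec["query"]}) + "\n")
            qrels.writerow([query_id, doc_id, 1])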

[ ]
[ ]

Generate Embeddings for all Queries and Positive Passages

[ ]
[ ]
[ ]

NV-Embed-QA-4 uses the keys "query" and "passage", but this may differ between models. Ensure you are using the correct keys for your model, otherwise you'll hit an error during fine-tuning.

Find Hard Negatives Using Similarity Score

[ ]
[ ]

Use the similarity scores together with the margin variable to generate hard negatives. For this example we generate 5 hard negatives per query, but you can change this number. Ultimately the data will be stored in the following format (a sketch of writing it out follows the example):

[
    {
        "query": "Query",
        "pos_doc": ["Positive"],
        "neg_doc": ["Negative_1", "Negative_2", ..., "Negative_n"]
    },
    {
        // Next data instance
    },
    ...,
    {
        // Subsequent data instance
    }
]
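
A minimal sketch of writing these records out, assuming one JSON object per line to match the output_data.jsonl file used later; the example inputs here are placeholders for the query, positive, and mined negative lists produced above:

import json

# Illustrative inputs; in the notebook these come from the mining step above.
queries = ["How do vaccines work?"]
positives = ["Vaccines train the immune system to recognize a pathogen ..."]
hard_negatives = [["Antibiotics target bacterial cell walls ...", "The immune system includes ..."]]

with open("output_data.jsonl", "w") as f:
    for query, pos, negs in zip(queries, positives, hard_negatives):
        record = {"query": query, "pos_doc": [pos], "neg_doc": negs}
        f.write(json.dumps(record) + "\n")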
[ ]
[ ]
[ ]

Training

Run the megatron_bert_embedding_finetuning.py script. This script sets up and trains a Megatron-BERT model using NVIDIA NeMo Framework, with configurations managed by Hydra. It loads the pre-trained .nemo model from a checkpoint, adjusts settings like batch size, and sets up parallel processing for multi-GPU training. Finally, it initializes the trainer and starts the training process with the NeMo Framework Megatron Trainer.

Note that model.global_batch_size = model.micro_batch_size * trainer.devices (i.e., the number of GPUs). Keep micro_batch_size=4 and set the other parameters accordingly.

model.data.hard_negatives_to_train should be set to the number of neg_docs corresponding to each query in your synthetic dataset.

[ ]
[ ]

If training completed successfully, you should see a megatron_bert.nemo file in your SAVE_DIR directory.

If training fails with memmap-related errors, delete any output_data.jsonl.idx* (index) files that were generated in the OUTPUT_DATA_PATH directory where output_data.jsonl is located. NeMo Framework does not rebuild index files if they already exist, so if you've changed the data itself or any data-related parameters, the stale index files will cause errors.
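
A small helper for clearing those stale index files before re-running training; OUTPUT_DATA_PATH below is a placeholder for the directory variable used earlier in the notebook:

import glob
import os

OUTPUT_DATA_PATH = "/workspace/files/data"  # placeholder; use the notebook's own variable

# Remove stale memmap index files so NeMo rebuilds them from the current data.
for idx_file in glob.glob(os.path.join(OUTPUT_DATA_PATH, "output_data.jsonl.idx*")):
    os.remove(idx_file)
    print(f"Removed {idx_file}")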

Model Evaluation

For this tutorial, we'll use the scifact dataset from BeIR to compare retrieval accuracy between the original model and the fine-tuned model. For a true apples-to-apples comparison, you should create your own domain-specific evaluation dataset that matches the domain of the synthetic fine-tuning dataset. This evaluation dataset should comprise a corpus, queries, and qrels (query relevance scores).

We will use NeMo Framework to restore both the original and fine-tuned models from their respective checkpoints, and the BeIR library to evaluate retrieval accuracy.

Finally, we'll evaluate the models with NDCG@k, MAP@k, Recall@k, and Precision@k scores. These metrics assess different aspects of retrieval performance: NDCG and MAP focus on ranking quality, with higher values indicating that relevant documents are ranked higher; Recall measures how many relevant documents are retrieved at different ranks, improving as k increases; and Precision evaluates the accuracy of the top k documents, with higher precision indicating more relevant results at the top.
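
For orientation, a minimal sketch of the BeIR evaluation flow, assuming a model wrapper that exposes encode_queries() and encode_corpus() (as created in the next cell); the wrapper class and dataset download path here are placeholders, and the scifact URL follows BeIR's standard dataset pattern:

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

class RetrieverWrapper:
    # Placeholder showing the interface BeIR expects; the notebook's NeMo
    # wrapper implements real encoding with the restored .nemo model.
    def encode_queries(self, queries, batch_size=16, **kwargs):
        raise NotImplementedError

    def encode_corpus(self, corpus, batch_size=16, **kwargs):
        # corpus is a list of {"title": ..., "text": ...} dicts
        raise NotImplementedError

# Download and load the scifact test split.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Exact (brute-force) dense search over the corpus with cosine similarity.
retriever = EvaluateRetrieval(DRES(RetrieverWrapper(), batch_size=16), score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, _map, recall, precision)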

[ ]

Create a wrapper NeMo model for retrieval evaluation on this dataset

[ ]

Evaluate the Fine-tuned model:

NOTE: there may be a bug in NeMo 24.07 where certain global variables are set by default and must match the passed-in config variables. One example is global_batch_size=8. So even though we set global_batch_size=4 during fine-tuning, we need to manually override it here to successfully restore the model. This does not impact model performance.

[ ]

The output should look like this:

{'NDCG@1': 0.43808, 'NDCG@3': 0.4094, 'NDCG@5': 0.39159, 'NDCG@10': 0.35777, 'NDCG@100': 0.33154, 'NDCG@1000': 0.41858} {'MAP@1': 0.05692, 'MAP@3': 0.09939, 'MAP@5': 0.11412, 'MAP@10': 0.13414, 'MAP@100': 0.17271, 'MAP@1000': 0.18817} {'Recall@1': 0.05692, 'Recall@3': 0.11421, 'Recall@5': 0.13637, 'Recall@10': 0.17648, 'Recall@100': 0.33741, 'Recall@1000': 0.64782} {'P@1': 0.45511, 'P@3': 0.38803, 'P@5': 0.34365, 'P@10': 0.26656, 'P@100': 0.08508, 'P@1000': 0.02163}

Evaluate the original model:

[ ]

As you can see, there is some improvement in the evaluation results. Fine-tuning on a larger amount of data, especially proprietary, domain-specific data, is likely to make the improvement much more significant. From some initial testing with proprietary corporate data, we've seen around a 5-10% accuracy improvement. Your results may vary depending on your configuration.

Congratulations! You've officially created synthetic data and fine-tuned a text embedding model using NeMo Framework!

[ ]