Embedding Fine-tuning with NeMo Microservices
Fine-tune an embedding model and improve retrieval by 6-10% in ~1 hour.
Prerequisites
Hardware: 2 NVIDIA GPUs. See Developer Setup Requirements for details.
Setup:
- Deploy NeMo Microservices 25.8.0+: Follow the Minikube setup guide.
- Register base model (~2-3 minutes):

```bash
helm upgrade nemo nmp/nemo-microservices-helm-chart --namespace default --reuse-values \
  --set customizer.customizationTargets.overrideExistingTargets=false \
  --set 'customizer.customizationTargets.targets.nvidia/llama-3\.2-nv-embedqa-1b@v2.enabled=true' && \
kubectl delete pod -n default -l app.kubernetes.io/name=nemo-customizer && \
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=nemo-customizer -n default --timeout=5m
```

- HuggingFace token: https://huggingface.co/settings/tokens (read access).
- Service URLs: Run `cat /etc/hosts` to verify the hostnames http://nemo.test, http://data-store.test, and http://nim.test.
Overview
Use case: Adapt a general embedding model to find related scientific papers.
Fine-tuning an embedding model on your domain data improves retrieval accuracy. In a Retrieval-Augmented Generation (RAG) pipeline, this means the LLM receives more relevant context, producing better answers. For search applications, users find what they need more often.
This notebook walks through the complete workflow: fine-tune a base embedding model on scientific paper data, deploy it as a production NVIDIA Inference Microservice (NIM), and measure the improvement.
Objectives
By the end of this notebook, you will:
- Test the baseline model on a retrieval task
- Fine-tune nvidia/llama-3.2-nv-embedqa-1b-v2 on 65K scientific paper triplets from the SPECTER dataset
- Deploy the fine-tuned model as a production-ready NIM inference service
- Compare before/after retrieval rankings on your original task
- Measure aggregate improvement on SciDocs benchmark: Recall@5 improves from 0.159 to ~0.17 (+6-10%)
Recall@5 measures the fraction of relevant documents that appear in the top 5 search results.
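To make that concrete, here is a minimal sketch of the metric for a single query (the document IDs are hypothetical):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant documents that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical example: 2 of the 4 relevant papers appear in the top 5.
retrieved = ["p7", "p2", "p9", "p4", "p1", "p3"]
relevant = ["p2", "p4", "p3", "p8"]
print(recall_at_k(retrieved, relevant))  # 0.5
```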
About the baseline: The 0.159 baseline was measured by running the same SciDocs evaluation on the pretrained model. In Step 6, you can set `EVALUATE_BASELINE = True` to reproduce this evaluation yourself, provided the base model from Step 0 is still deployed.
Note: Time estimates are approximate and depend on cluster configuration and GPU type(s).
Step 0: Identify the Opportunity
Let's start with a real-world scenario: searching scientific papers by meaning, not keywords.
We'll deploy the base embedding model, run a test query, and see where it struggles. Then we'll fine-tune on scientific paper data and measure the improvement.
Step 1: Prepare Data
Download a 10% slice of the SPECTER dataset (~684K scientific paper triplets in total), yielding roughly 65K (query, positive, negative) triplets, and format them for embedding fine-tuning.
Dataset format: Each triplet teaches the model via contrastive learning to maximize similarity between query and positive document while minimizing similarity between query and negative document.
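As a rough sketch, the formatted training file might look like the following, assuming a JSONL layout with query, pos_doc, and neg_doc fields (the field names are illustrative; match them to what NeMo Customizer expects for your target model):

```python
import json

# Illustrative triplets for contrastive fine-tuning. Field names are an
# assumption, not the definitive schema.
triplets = [
    {
        "query": "Conditional Random Fields for sequence labeling",
        "pos_doc": "A paper cited by the query paper (genuinely related work).",
        "neg_doc": "An unrelated paper that merely shares surface keywords.",
    },
]

with open("training.jsonl", "w") as f:
    for t in triplets:
        f.write(json.dumps(t) + "\n")
```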
Step 2: Upload to NeMo Data Store
NeMo Data Store holds datasets for training and evaluation. It exposes a HuggingFace-compatible API, so you can use familiar huggingface_hub methods - just pointed at a different endpoint.
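For example, a hedged sketch of an upload with huggingface_hub pointed at the Data Store; the /v1/hf path suffix and the default/specter-triplets repo name are assumptions for illustration:

```python
from huggingface_hub import HfApi

# Point the standard HF client at NeMo Data Store instead of huggingface.co.
api = HfApi(endpoint="http://data-store.test/v1/hf", token="<HF_TOKEN>")

# Create a dataset repo (namespace/name is illustrative) and upload the file.
api.create_repo(repo_id="default/specter-triplets", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="training.jsonl",
    path_in_repo="training/training.jsonl",
    repo_id="default/specter-triplets",
    repo_type="dataset",
)
```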
Step 3: Train Model
Fine-tune using supervised contrastive learning (model learns to pull query-positive pairs closer while pushing query-negative pairs apart).
Config vs Job: A config defines the training template (base model, GPU settings). A job runs training with that config + dataset + hyperparameters.
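A hedged sketch of submitting a job through the Customizer REST API follows; the endpoint path, payload fields, and hyperparameter values are assumptions that may differ across NeMo Microservices releases, so check the Customizer API reference:

```python
import requests

NEMO_URL = "http://nemo.test"

# Launch a customization job: config (training template) + dataset +
# hyperparameters. Payload schema is an assumption for illustration.
job = requests.post(
    f"{NEMO_URL}/v1/customization/jobs",
    json={
        "config": "nvidia/llama-3.2-nv-embedqa-1b@v2",
        "dataset": {"name": "specter-triplets", "namespace": "default"},
        "hyperparameters": {
            "training_type": "sft",
            "finetuning_type": "all_weights",
            "epochs": 1,
            "batch_size": 8,
            "learning_rate": 5e-6,
        },
    },
).json()
print(job.get("id"), job.get("status"))
```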
Step 4: Deploy Model
NeMo Deployment serves your fine-tuned model as a NIM (NVIDIA Inference Microservice). Once deployed, you can query it via the standard OpenAI-compatible embeddings API.
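As a rough sketch (the endpoint path, field names, and image coordinates are assumptions, not the definitive schema), creating a deployment might look like:

```python
import requests

NEMO_URL = "http://nemo.test"

# Ask the Deployment Management API to serve the customized model as a NIM.
resp = requests.post(
    f"{NEMO_URL}/v1/deployment/model-deployments",
    json={
        "name": "embedqa-finetuned",
        "namespace": "default",
        "config": {
            # Customized model name is illustrative.
            "model": "default/llama-3.2-nv-embedqa-1b-v2-ft",
            "nim_deployment": {
                "image_name": "nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2",
                "image_tag": "latest",
                "gpu": 1,
            },
        },
    },
)
print(resp.status_code, resp.json())
```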
Health Check
Verify the deployed model responds to requests.
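A minimal sketch, assuming the NIM exposes the usual /v1/health/ready probe and an OpenAI-style embeddings endpoint; the model name is illustrative:

```python
import requests

NIM_URL = "http://nim.test"

# Readiness probe: expect HTTP 200 once the NIM is up.
print(requests.get(f"{NIM_URL}/v1/health/ready").status_code)

# One-off embedding request to confirm the model answers.
resp = requests.post(
    f"{NIM_URL}/v1/embeddings",
    json={
        "model": "default/llama-3.2-nv-embedqa-1b-v2-ft",  # illustrative name
        "input": ["hello world"],
        "input_type": "query",
    },
)
print(len(resp.json()["data"][0]["embedding"]))  # embedding dimensionality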
Step 5: See the Improvement
Now let's run the same query against your fine-tuned model and compare to the baseline we saw earlier.
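One way to do that is to embed the query and candidate documents, then rank by cosine similarity. This sketch assumes the NIM accepts an input_type field (an NVIDIA extension for asymmetric retrieval); the model name and documents are illustrative:

```python
import numpy as np
import requests

NIM_URL = "http://nim.test"
MODEL = "default/llama-3.2-nv-embedqa-1b-v2-ft"  # illustrative deployed name

def embed(texts, input_type):
    # input_type distinguishes queries from passages for asymmetric retrieval.
    resp = requests.post(
        f"{NIM_URL}/v1/embeddings",
        json={"model": MODEL, "input": texts, "input_type": input_type},
    )
    return np.array([d["embedding"] for d in resp.json()["data"]])

query = "Conditional Random Fields for sequence labeling"
docs = [
    "An introduction to probabilistic graphical models",
    "Random Forests",
    "Structured prediction with linear-chain CRFs",
]

q = embed([query], "query")[0]
D = embed(docs, "passage")

# Cosine similarity between the query and each document, highest first.
scores = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
for doc, s in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{s:.3f}  {doc}")
```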
Step 6: Evaluate Performance
NeMo Evaluator runs standardized benchmarks against your deployed model. Here we use SciDocs, a retrieval benchmark for scientific papers.
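A hedged sketch of launching an evaluation job; the config and target names are placeholders you would create beforehand, and the payload schema may differ by release (see the Evaluator API reference):

```python
import requests

NEMO_URL = "http://nemo.test"

# Submit an evaluation job pairing a benchmark config with a model target.
# Both identifiers are illustrative placeholders.
job = requests.post(
    f"{NEMO_URL}/v1/evaluation/jobs",
    json={
        "config": "default/scidocs-retrieval-config",
        "target": "default/embedqa-finetuned-target",
    },
).json()
print(job.get("id"), job.get("status"))
```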
Step 7: Results
Compare your fine-tuned model against the pretrained baseline.
Summary
You fine-tuned NVIDIA's llama-3.2-nv-embedqa-1b-v2 embedding model on 65K scientific paper triplets from SPECTER - a dataset where papers that cite each other are marked as "related."
The base model matched documents by keyword overlap. After fine-tuning, it learned scientific paper neighborhoods: which papers actually cite each other, regardless of surface-level word matches. The demo showed this - "Random Forests" dropped in ranking because it's unrelated to "Conditional Random Fields," despite sharing the word "random."
SciDocs tests retrieval across thousands of scientific queries. Recall@5 asks: "Of all relevant papers, how many appear in the top 5 results?" Your model improved from 0.159 to ~0.17, meaning 6-10% more relevant papers now surface in the top 5.
In a RAG pipeline, better retrieval means better context for the LLM and more accurate answers. Your model is deployed and ready to use.
Next Steps
Scale Up:
- Train on full SPECTER dataset for additional improvement
- Increase to 3 epochs for better convergence
Apply to Your Domain:
- Format your data as query-positive-negative triplets
- Replace SPECTER dataset with your domain data (legal, medical, product catalogs, etc.)
- Evaluate on your own retrieval tasks
Cleanup
Uncomment cleanup cells as needed to delete resources.