Live Evaluation
Live Evaluations with NeMo Evaluator
In the following notebook, we'll walk through an example of how you can leverage Live Evaluations through the NeMo Evaluator Microservice.
Full documentation is available here!
In our example, we'll be looking at the following scenarios:
- Simple String Checking
- Custom LLM-as-a-Judge on Synthetically Created Medical Summaries
NOTE: Currently, live evaluation is only supported with the `custom` evaluation type!
Necessary Configurations
You will need to spin up the NeMo Evaluator Microservice through the `docker_compose.yaml` file provided in this directory.
You can do so with the following commands:
- Log in to the NVIDIA NGC Container Registry:
docker login -u '$oauthtoken' -p YOUR_NGC_KEY_HERE nvcr.io
- Set up the initial environment variables (make sure you've set up Docker correctly so that it can be run from your user group)
export EVALUATOR_IMAGE=nvcr.io/nvidia/nemo-microservices/evaluator:25.07
export DATA_STORE_IMAGE=nvcr.io/nvidia/nemo-microservices/datastore:25.07
export USER_ID=$(id -u)
export GROUP_ID=$(id -g)
- Spin up the NeMo Evaluator Microservice through `docker compose`!
docker compose -f docker_compose.yaml up evaluator -d
Installing Dependencies with uv
Before moving forward in the notebook, please ensure you're using the virtual environment created by running `uv sync` in the root directory of this notebook.
This will install all the necessary dependencies for the remainder of the notebook.
NeMo Microservices Client
Next, let's initialize our NeMo Microservices client through the Python SDK!
NOTE: By default, the NeMo Evaluator API will be available at `http://localhost:7331`.
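Below is a minimal sketch of that initialization, assuming the `nemo_microservices` package installed by `uv sync` and the default port from the docker compose setup above:

```python
from nemo_microservices import NeMoMicroservices

# Point the client at the locally running NeMo Evaluator Microservice.
# Adjust the base URL if you mapped the evaluator to a different port.
client = NeMoMicroservices(base_url="http://localhost:7331")
```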
Using NeMo Evaluator Microservice for Live Simple String Checking
We can kick off an evaluation job for simple string checking right away using the `custom` evaluation type with the `data` subtype!
Let's look at how we'd do this with the SDK.
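As a rough sketch, a live string-check evaluation could look like the snippet below. The `client.evaluation.live(...)` call and the exact config/target schema (the inline `rows` target, the `check` params, and the `answer`/`expected_answer` field names) are assumptions for illustration, not the authoritative shapes; consult the documentation linked above for the canonical request body.

```python
# Hedged sketch of a live string-check evaluation -- the method name and the
# config/target schema are assumptions; see the live evaluation docs.
result = client.evaluation.live(
    config={
        "type": "custom",
        "tasks": {
            "qa": {  # the "qa" task that appears in the results below
                "type": "data",
                "metrics": {
                    "accuracy": {
                        "type": "string-check",
                        # Hypothetical check: the row's answer must contain the expected string.
                        "params": {
                            "check": [
                                "{{item.answer}}",
                                "contains",
                                "{{item.expected_answer}}",
                            ]
                        },
                    }
                },
            }
        },
    },
    target={
        # Assumption: live evaluations accept the rows to score inline.
        "type": "rows",
        "rows": [
            {"answer": "Paris is the capital of France.", "expected_answer": "Paris"}
        ],
    },
)
print(result)
```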
Status: completed
Results: EvaluationResult(job='eval-Akk2TPTzp96YCQjyvaJsMt', id='evaluation_result-2iKms1yr9GNjVWGSJVV7ZP', created_at=datetime.datetime(2025, 8, 16, 0, 53, 21, 87425), custom_fields={}, description=None, files_url=None, groups={}, namespace='default', ownership=None, project=None, tasks={'qa': TaskResult(metrics={'accuracy': MetricResult(scores={'string-check': Score(value=1.0, stats=ScoreStats(count=1, max=None, mean=1.0, min=None, stddev=None, stderr=None, sum=1.0, sum_squared=None, variance=None))})})}, updated_at=datetime.datetime(2025, 8, 16, 0, 53, 21, 89177))
Using NeMo Evaluator Microservice for Live Custom LLM-as-a-Judge
We can also extend this to Custom LLM-as-a-Judge using a dataset that we have in our local environment!
We're going to use llama-3.3-nemotron-super-49b-v1 as our judge model today.
NOTE: You can find the API key on build.nvidia.com by clicking the green "Get API Key" button!
To keep things organized, we'll initialize our model object in a separate code cell - but this is going to be provided alongside the rest of our evaluation config when we create it through the SDK!
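As a sketch, a judge model object pointed at the NVIDIA API Catalog might look like the cell below; the exact field names (`api_endpoint`, `url`, `model_id`, `api_key`) are assumptions, so double-check them against the evaluator config reference.

```python
import os

# Hedged sketch of the judge model object -- field names are assumptions.
# NVIDIA_API_KEY should hold the key you grabbed from build.nvidia.com.
judge_model = {
    "api_endpoint": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "model_id": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "api_key": os.environ["NVIDIA_API_KEY"],
    }
}
```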
We'll do the same for our prompt. Notice that we're able to key into the appropriate fields using the {{}} templating.
NOTE: Since we're using regex to parse the output scores, ensure your output format template is well defined.
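For illustration, a judge prompt and score parser could be sketched as below. The `{{item.question}}` and `{{item.summary}}` fields are hypothetical and should match your own dataset's columns, and the `scores`/`parser` structure is an assumption to verify against the LLM-as-a-Judge docs; the key point is that the regex must match the output format the prompt asks for.

```python
# Hedged sketch of the judge prompt and score parsing -- the scores/parser
# structure is an assumption, and the {{item.*}} keys are hypothetical
# placeholders that must match the columns of your dataset.
judge_prompt = {
    "messages": [
        {"role": "system", "content": "You are an expert grader of medical summaries."},
        {
            "role": "user",
            "content": (
                "Question: {{item.question}}\n"
                "Summary: {{item.summary}}\n\n"
                "Rate the correctness of the summary from 1-5 and reply ONLY in the form: "
                "CORRECTNESS: <score>"
            ),
        },
    ]
}

judge_scores = {
    "correct": {
        "type": "int",
        # This regex must line up with the output format requested above.
        "parser": {"type": "regex", "pattern": r"CORRECTNESS: (\d+)"},
    }
}
```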
Finally, as usual, we can create our custom LLM-as-a-Judge config and target below!
NOTE: The Live feature currently requires you to create the config and target at call time; this ensures low-latency responses.
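Putting the pieces together, the live LLM-as-a-Judge call might be sketched as below, reusing `judge_model`, `judge_prompt`, and `judge_scores` from the cells above. The task name, the `llm-judge` metric layout, and the inline `rows` target are assumptions for illustration; the live-evaluation docs have the authoritative schema.

```python
# Hedged sketch: live LLM-as-a-Judge evaluation built from the objects above.
# The metric/params layout and the inline rows target are assumptions.
judge_config = {
    "type": "custom",
    "tasks": {
        "medical-summaries": {
            "type": "data",
            "metrics": {
                "correct": {
                    "type": "llm-judge",
                    "params": {
                        "model": judge_model,
                        "template": judge_prompt,
                        "scores": judge_scores,
                    },
                }
            },
        }
    },
}

judge_target = {
    "type": "rows",
    # Hypothetical synthetic medical summaries -- replace with your own dataset rows.
    "rows": [
        {
            "question": "What medication was the patient prescribed?",
            "summary": "The patient was prescribed 10 mg of lisinopril daily for hypertension.",
        }
    ],
}

result = client.evaluation.live(config=judge_config, target=judge_target)
print(result)
```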
Status: completed
Results: {'correct': Score(value=2.6, stats=ScoreStats(count=5, max=None, mean=2.6, min=None, stddev=None, stderr=None, sum=13.0, sum_squared=None, variance=None))}
To learn more about the live evaluation feature, please check out this documentation!