BizNorm 100 Evaluator Optimization
Evaluator Prompt Optimization
In this notebook we'll be using the Prompt Learning SDK to optimize an LLM-as-Judge Eval Prompt. LLM-as-Judge evaluators use an LLM to evaluate LLM outputs, and are effective and versatile in testing/evaluating your LLM applications. You can learn more here.
Since your evals use LLMs, the prompts you provide to those LLMs dictate what your eval does. In practice, the goal is to ALIGN your eval with your goals. You want to bring your eval to a level of competence that you would expect from a human who manually evaluates outputs.
This notebook shows you how to build an evaluator that checks whether outputs are normalized/sanitized, and then how to optimize its prompt so that the evaluator aligns with your expectations for normalization/sanitization and can be trusted in production.
BizNorm-100 Benchmark
BizNorm-100 is a synthetically created dataset containing 100 queries. The goal is to normalize these queries according to a fixed ruleset.
For example, the query
My card 3333-4444-5555-6666 was charged $1200 on 1/12/2025. The record still shows my old phone, 646-555-2201, and the system emailed the receipt to anthony.rogers@company.org. Can you fix this ASAP?
should be normalized to
[PII ALERT] My card [CARD] was charged usd 1200.00 on 2025-01-12. The record still shows my old phone, [PHONE], and the system emailed the receipt to [EMAIL]. Can you fix this as soon as possible? -- Company Confidential --
See the normalization ruleset in BizNorm-ruleset.md.
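To make the shape of the ground truth concrete, here is a minimal regex sketch that reproduces the single example above. It is an illustration only: the rules are inferred from that one example (the authoritative list is in BizNorm-ruleset.md and may differ), the notebook itself uses an LLM rather than regexes to generate outputs, and the function name normalize plus the choices "add the alert prefix only when PII was masked" and "always append the footer" are assumptions.

```python
import re

# Patterns inferred from the one example in this section (assumed, not the official ruleset).
CARD = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
MONEY = re.compile(r"\$(\d+(?:\.\d{1,2})?)")
DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # assumed month/day/year


def normalize(query):
    """Hypothetical rule-based normalizer covering only the example's rules."""
    text, n_card = CARD.subn("[CARD]", query)
    text, n_phone = PHONE.subn("[PHONE]", text)
    text, n_email = EMAIL.subn("[EMAIL]", text)
    # "$1200" becomes "usd 1200.00"
    text = MONEY.sub(lambda m: f"usd {float(m.group(1)):.2f}", text)
    # "1/12/2025" becomes "2025-01-12"
    text = DATE.sub(
        lambda m: f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}", text
    )
    text = re.sub(r"\bASAP\b", "as soon as possible", text)
    if n_card + n_phone + n_email:  # assumption: prefix only when PII was masked
        text = "[PII ALERT] " + text
    return text + " -- Company Confidential --"  # assumption: footer always added


query = (
    "My card 3333-4444-5555-6666 was charged $1200 on 1/12/2025. "
    "The record still shows my old phone, 646-555-2201, and the system emailed "
    "the receipt to anthony.rogers@company.org. Can you fix this ASAP?"
)
expected = (
    "[PII ALERT] My card [CARD] was charged usd 1200.00 on 2025-01-12. "
    "The record still shows my old phone, [PHONE], and the system emailed "
    "the receipt to [EMAIL]. Can you fix this as soon as possible? "
    "-- Company Confidential --"
)
print(normalize(query) == expected)  # True
```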
Train/Test Split
We will be using the training set to train our evaluator with Prompt Learning. We will be using the test set to test our evaluator's accuracy on data it has not been trained on.
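A split like this can be sketched with the standard library alone. The 80/20 ratio, the fixed seed, and the helper name train_test_split below are assumptions for illustration; the notebook's actual split may use different proportions or a library helper.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the 100 BizNorm queries.
rows = [{"id": i, "query": f"query {i}"} for i in range(100)]
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Fixing the seed keeps the test set stable across optimization rounds, so metric changes reflect the prompt rather than a reshuffled split.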
Application System Prompt
This is the application system prompt, or the prompt to the LLM used to generate outputs.
This is NOT the prompt we are optimizing! This simply generates outputs.
We are optimizing the evaluator prompt, or the prompt for the LLM-as-judge eval which EVALUATES the generated outputs.
Output Generator
Uses the application system prompt to generate outputs.
Sanitization Helpers
clean and clean_series are used to normalize text before comparing generated outputs with ground truths. This prevents false mismatches caused by superficial formatting differences.
For example, the string:
"today's year is 1/1/2025"
might be normalized to:
"today's year is 2025-01-01"
If we compare it against a ground truth like:
"today’s year is 2025-01-01"
a raw string comparison would incorrectly flag them as different because of the straight vs. curly apostrophe. Normalization ensures both strings are treated as equivalent, so the comparison is judged correctly.
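The apostrophe case above can be sketched as follows. This is one plausible shape for clean, not the notebook's actual implementation: the quote mapping and whitespace collapsing are assumptions, and clean_all is a hypothetical list-based stand-in for clean_series (which presumably applies the same function across a pandas Series).

```python
import re
import unicodedata

# Curly quotes mapped to straight equivalents (an assumed, partial list).
_QUOTES = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
)


def clean(text):
    """Normalize Unicode form, quote style, and whitespace before comparison."""
    text = unicodedata.normalize("NFKC", str(text))
    text = text.translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip()


def clean_all(values):
    """Apply clean to every value (hypothetical stand-in for clean_series)."""
    return [clean(v) for v in values]


generated = "today's year is 2025-01-01"
truth = "today\u2019s year is 2025-01-01"  # curly apostrophe
print(generated == truth)                # False
print(clean(generated) == clean(truth))  # True
```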
Accuracy Computation
Computes accuracy, f1, precision, recall.
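For reference, the four metrics can be computed from label lists as in the sketch below. The function name, the choice of "correct" as the positive class, and the convention of returning 0.0 when a denominator is zero are assumptions; the notebook may rely on a library such as scikit-learn instead.

```python
def classification_metrics(y_true, y_pred, positive="correct"):
    """Return accuracy, precision, recall, and F1 for binary string labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = ["correct", "correct", "incorrect", "incorrect", "correct"]
y_pred = ["correct", "incorrect", "incorrect", "correct", "correct"]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.6, 'precision': 0.667, 'recall': 0.667, 'f1': 0.667}
```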
Evaluator
This is the code for our LLM-as-Judge evaluator.
It checks whether outputs are normalized properly.
You can see the prompt below. THIS IS THE PROMPT WE ARE OPTIMIZING.
We want to build evals that align with how we expect them to perform. Good evals are very important: they let you filter and classify the information you feed to your users. Because LLM outputs are not deterministic, you need something to check those outputs. It's too time-consuming to do this manually, so employing an LLM to evaluate these outputs is a common and essential practice.
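As a rough illustration of the mechanics around such a prompt, the sketch below fills a template and parses the judge's reply. Everything here is hypothetical: the template wording, the {query}/{output} placeholders, the JSON field names label and explanation, and both helper names are assumptions, not the notebook's actual prompt or code. It does show one practical detail: when a template is filled with str.format, literal JSON braces must be doubled.

```python
import json

# Hypothetical evaluator prompt; the notebook's real prompt and fields may differ.
EVAL_TEMPLATE = (
    "You are checking whether a response follows the normalization rules.\n"
    "Query: {query}\n"
    "Response: {output}\n"
    'Reply with JSON only, e.g. {{"label": "correct", "explanation": "..."}}'
)


def build_eval_prompt(query, output):
    """Fill the template; doubled braces come out as literal JSON braces."""
    return EVAL_TEMPLATE.format(query=query, output=output)


def parse_verdict(raw_reply):
    """Extract a 'correct'/'incorrect' label from a judge reply, else None."""
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(raw_reply[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = str(payload.get("label", "")).strip().lower()
    return label if label in {"correct", "incorrect"} else None


prompt = build_eval_prompt("charged $5 on 1/2/2025", "charged usd 5.00 on 2025-01-02")
reply = 'Sure: {"label": "Incorrect", "explanation": "missing footer"}'
print(parse_verdict(reply))  # incorrect
```

Returning None for unparseable replies, rather than guessing a label, keeps malformed judge outputs visible instead of silently counting them as a verdict.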
Generate Output and Evaluate
This combines our output generator and our evaluator into one function, and computes accuracy both for the generated outputs and for the evaluator itself.
Evaluator accuracy is computed by comparing what the eval thinks ("correct" or "incorrect") versus whether the output is equal to the ground truth or not (actual "correct" or "incorrect").
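The distinction between the two accuracies can be sketched as below. The column names (output, ground_truth, eval_label) and the helper actual_label are hypothetical, and the plain strip-and-compare stands in for the notebook's clean-based comparison.

```python
def actual_label(output, ground_truth):
    """Ground-truth verdict: does the output match the expected normalization?"""
    return "correct" if output.strip() == ground_truth.strip() else "incorrect"


# Hypothetical rows: what the app produced, the truth, and what the judge said.
rows = [
    {"output": "usd 5.00", "ground_truth": "usd 5.00", "eval_label": "correct"},
    {"output": "$5", "ground_truth": "usd 5.00", "eval_label": "correct"},  # judge missed an error
    {"output": "[EMAIL]", "ground_truth": "[EMAIL]", "eval_label": "correct"},
    {"output": "a@b.org", "ground_truth": "[EMAIL]", "eval_label": "incorrect"},
]

actual = [actual_label(r["output"], r["ground_truth"]) for r in rows]

# Output accuracy: how often the application itself was right.
output_accuracy = sum(a == "correct" for a in actual) / len(rows)

# Evaluator accuracy: how often the judge's verdict agreed with the actual verdict.
evaluator_accuracy = sum(a == r["eval_label"] for a, r in zip(actual, rows)) / len(rows)

print(output_accuracy, evaluator_accuracy)  # 0.5 0.75
```

Only the second number is the optimization target in this notebook: the generator can be wrong half the time while the evaluator still scores well, as long as it flags those failures.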
Run the cell below!
Helper Function - calling the Prompt Learning SDK
The optimize_iteration helper function initializes the optimizer with feedback and produces a new, optimized prompt.
The next step is figuring out what feedback to provide to the optimizer in order for it to generate optimized prompts.
🔄 Optimization Workflow using Prompt Learning
This notebook implements an interactive optimization loop where we:
- Collect Feedback: Display examples from the dataset, and let the user label correctness/explanations.
- Optimize Prompt: Use the feedback to generate an updated evaluator prompt.
- Review & Confirm: Show the optimized prompt, allow manual edits for formatting, and confirm it.
- Evaluate: Re-run the evaluator with the new prompt on train/test sets, log metrics, and save results.
- Loop: Repeat the cycle for N rounds, carrying forward the updated evaluator prompt and re-evaluated outputs.
The feedback we provide to the Prompt Learning optimizer is HUMAN ANNOTATED FEEDBACK. Annotating just 5 examples per loop is enough to see optimization boosts, which shows the data efficiency of Prompt Learning. RL or a programmatic optimizer needs lots of data to make effective accuracy gains; here, hand-annotating 5 outputs per loop and giving that feedback to Prompt Learning allows for huge boosts in accuracy.
The workflow is composed of modular helper functions:
- collect_feedback_ui: interactive widget interface for gathering manual feedback.
- review_and_confirm_prompt: UI for reviewing and editing the optimized prompt before saving.
- run_one_round: runs a single loop round (feedback → optimize → confirm → evaluate).
- interactive_optimization_loop: orchestrates the full multi-round optimization process.
📝 collect_feedback_ui
This function creates an interactive feedback form using ipywidgets:
- Displays a sample of query, ground_truth, output, and evaluator outputs.
- Provides dropdowns / textareas for feedback fields (evaluator_correctness, evaluator_explanation).
- Saves the annotated feedback set (feedback_set) to CSV.
- Calls on_save(feedback_set) after the user clicks Save Feedback, triggering the next step in the workflow.
🔍 review_and_confirm_prompt
This function displays the auto-optimized evaluator prompt:
- Shows the generated prompt in a styled block.
- Provides a large text area for manual edits (to fix formatting, braces, JSON requirements, etc.).
- Only after the user clicks Confirm Prompt does it call on_confirm(edited_prompt).
- Ensures the downstream evaluation always uses a user-validated prompt.
🔁 run_one_round
Runs a single optimization cycle:
- Samples a batch of examples from the dataset for feedback.
- Calls collect_feedback_ui to gather manual corrections.
- Optimizes the evaluator prompt using optimize_iteration.
- Calls review_and_confirm_prompt to display and edit the new prompt.
- After confirmation:
  - Saves the new prompt to file.
  - Re-evaluates train/test with the updated prompt.
  - Logs metrics and appends results.
  - Starts the next round (if any).
🚀 interactive_optimization_loop
The master orchestrator of the workflow:
- Computes baseline evaluator performance with the initial prompt.
- Logs round 0 metrics.
- Iteratively calls run_one_round for the specified number of loops.
- Maintains a record of:
  - All prompts across rounds (results["prompts"])
  - Evaluation metrics (results["metrics"])
- Saves metrics history to all_metrics.csv.
- Stops after N confirmed rounds of optimization.
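The control flow described above can be sketched as a plain synchronous loop. This is a simplified stand-in, not the notebook's implementation: the real workflow is callback-driven through ipywidgets (on_save, on_confirm), while here the four steps are passed in as ordinary callables. The function name optimization_loop, its parameters, and the stub lambdas are all hypothetical; only the results["prompts"] and results["metrics"] keys come from the description above.

```python
def optimization_loop(
    initial_prompt, n_rounds, collect_feedback, optimize, confirm, evaluate
):
    """Run n_rounds of feedback -> optimize -> confirm -> evaluate with history."""
    prompt = initial_prompt
    results = {
        "prompts": [prompt],
        "metrics": [{"round": 0, **evaluate(prompt)}],  # round 0 is the baseline
    }
    for round_idx in range(1, n_rounds + 1):
        feedback = collect_feedback(prompt)     # human annotations on a small sample
        candidate = optimize(prompt, feedback)  # optimizer proposes a new prompt
        prompt = confirm(candidate)             # human may edit before accepting
        results["prompts"].append(prompt)
        results["metrics"].append({"round": round_idx, **evaluate(prompt)})
    return results


# Stub callables so the sketch runs without widgets or an LLM.
results = optimization_loop(
    initial_prompt="Judge the output.",
    n_rounds=2,
    collect_feedback=lambda prompt: ["dates must be ISO formatted"],
    optimize=lambda prompt, feedback: prompt + " Rule: " + feedback[0] + ". ",
    confirm=lambda candidate: candidate.strip(),
    evaluate=lambda prompt: {"n_rules": prompt.count("Rule:")},
)
print([m["n_rules"] for m in results["metrics"]])  # [0, 1, 2]
```

Passing the steps in as callables keeps the orchestration testable on its own; in the notebook, the same sequence is chained through widget callbacks instead.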