Notebooks
A
Arize AI
Building A Custom Evaluator

Building A Custom Evaluator

arize-tutorialsevaluationLLMPython

Using a Benchmark Dataset to Build a Custom LLM as a Judge Evaluator

In this tutorial, you’ll learn how to build a custom LLM-as-a-Judge Evaluator tailored to your specific use case. While Arize provides several pre-built evaluators that have been tested against benchmark datasets, these may not always cover the nuances of your application.

So how can you achieve the same level of rigor when your use case falls outside the scope of standard evaluators?

We’ll walk through how to create your own benchmark dataset using a small set of annotated examples. This dataset will allow you to build and refine a custom evaluator by revealing failure cases and guiding iteration. The use case we will be exploring is data extraction from an image of a receipt.

To follow along, you’ll need:

  • A free Arize AX account
  • An OpenAI API Key

Set up Keys and Dependencies

[ ]
[ ]
[ ]

Configure Tracing

[ ]

Generate Image Classification Traces

[ ]

In this tutorial, we’ll ask an LLM to generate expense reports from receipt images provided as public URLs. Running the cells below will generate traces, which you can explore directly in Phoenix for annotation. We’ll use GPT-4, which supports image inputs.

Dataset Information: Jakob (2024). Receipt or Invoice Dataset. Roboflow Universe. CC BY 4.0. Available at: https://universe.roboflow.com/jakob-awn1e/receipt-or-invoice (accessed on 2025‑07‑29)

[ ]
[ ]
[ ]

Create Benchmarked Dataset

After generating traces, open Arize to begin annotating your dataset. In this example, we’ll annotate based on "accuracy", but you can choose any evaluation criterion that fits your use case. Just be sure to update the query below to match the annotation key you’re using—this ensures the annotated examples are included in your benchmark dataset.

Run the cell below to see annotations in action:

[ ]
[ ]
[ ]
[ ]
[ ]

Dataset

Create evaluation template

Next, we’ll create a baseline evaluation template and define both the task and the evaluation function. Once these are set up, we’ll run an experiment to compare the evaluator’s performance against our ground truth annotations.

[ ]
[ ]
[ ]

You will see your experiment result in the experiments tab of your dataset:

Initial Experiment

Iteration 1 to improve evaluator prompt template

Next, we’ll refine our evaluation prompt template by adding more specific instructions to classification rules. We can add these rules based on gaps we saw in the previous iteration. This additional guidance helps improve accuracy and ensures the evaluator's judgments better align with human expectations.

[ ]
[ ]

Iteration 2 to improve evaluator prompt template

To further improve our evaluator, we’ll introduce few-shot examples into the evaluation prompt. These examples help highlight common failure cases and guide the evaluator toward more consistent and generalized judgments.

[ ]
[ ]

Final Results

Once your evaluator reaches a performance level you're satisfied with, it's ready for use. The target score will depend on your benchmark dataset and specific use case. That said, you can continue applying the techniques from this tutorial to refine and iterate until the evaluator meets your desired level of quality.

You can also compare your experiment outcomes to baseline results or previous versions to evaluate progress.

Final Results