Evaluation using Pydantic Evals

Pydantic offers an evaluation library, Pydantic Evals, that can run preset direct evaluations, such as checking whether an output matches a Pydantic model, as well as LLM judge evaluations. These evals can be run directly over datasets of cases defined with Pydantic. However, you may want to run evaluations over real traces rather than pre-saved cases.

This notebook shows how to use Pydantic Evals alongside Arize Phoenix to run evals on traces captured from your running application.

Note: Phoenix includes its own evals package, but it is also designed to work with other eval packages such as Pydantic Evals.

Note: This notebook was last updated on Oct 7, 2025.

Install dependencies

[ ]

Set up API keys and imports

[ ]

Enable Phoenix Tracing

Sign up for a free instance of Phoenix Cloud to get your API key. If you'd prefer, you can instead self-host Phoenix.

[ ]
[ ]

Create Example Traces to Evaluate

Next, we'll run some example inputs through an LLM call to generate traces that we can evaluate. In practice, you'd likely already have an application you're tracing that you'd want to evaluate instead.

[ ]

You should now see three traces captured in your Phoenix instance. If you don't see them right away, make sure you've selected the pydantic-evals-tutorial project.

Export Traces from Phoenix

Next, export those traces from Phoenix so that you can evaluate them using Pydantic Evals.

[ ]

Define the Evaluation Dataset

Create a dataset of test cases using Pydantic Evals for a question-answering task.

  1. Each Case represents a single test with an input (question) and an expected output (answer).
  2. The Dataset aggregates these cases for evaluation.
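
As a rough illustration of the two points above, the sketch below builds a small dataset. The three question/answer pairs are hypothetical placeholders, not the notebook's actual inputs, and the `pydantic_evals` import is guarded so the plain case specs remain usable if the package is not installed.

```python
# Hypothetical question/answer pairs; substitute the inputs your application actually received.
CASE_SPECS = [
    {
        "name": "capital_of_france",
        "inputs": "What is the capital of France?",
        "expected_output": "Paris",
    },
    {
        "name": "author_of_romeo_and_juliet",
        "inputs": "Who wrote Romeo and Juliet?",
        "expected_output": "William Shakespeare",
    },
    {
        "name": "largest_planet",
        "inputs": "What is the largest planet in our solar system?",
        "expected_output": "Jupiter",
    },
]

try:
    from pydantic_evals import Case, Dataset
except ImportError:
    # pydantic-evals is not installed; only the plain specs are available.
    dataset = None
else:
    # Each Case is one test; the Dataset groups the cases for evaluation.
    dataset = Dataset(cases=[Case(**spec) for spec in CASE_SPECS])
```

Keeping the specs as plain dictionaries is only a convenience for this sketch; constructing `Case` objects directly works just as well.
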
[ ]

Set up LLM task, Evaluator, and Dataset for Pydantic Evals

Pydantic Evals requires a task to run each case through. Since you've already run this task for each input (represented by the traces you captured above), the task here simply retrieves the corresponding output from your dataframe of exported traces.
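
A minimal sketch of such a lookup task follows. It assumes the exported traces have been reduced to simple records with hypothetical `span_id`, `input`, and `output` keys; the real export is a dataframe whose column names will differ, so treat the names and sample rows here as illustrative only.

```python
import asyncio

# Hypothetical stand-in for the exported traces: one record per traced LLM call.
EXPORTED_TRACES = [
    {"span_id": "span-1", "input": "What is the capital of France?", "output": "Paris"},
    {"span_id": "span-2", "input": "Who wrote Romeo and Juliet?", "output": "Shakespeare"},
    {"span_id": "span-3", "input": "What is the largest planet in our solar system?", "output": "Jupiter"},
]

# Index outputs by input text once so each lookup is a single dict access.
OUTPUT_BY_INPUT = {row["input"]: row["output"] for row in EXPORTED_TRACES}


async def task(question: str) -> str:
    """Return the output already recorded for this input instead of calling the LLM again."""
    try:
        return OUTPUT_BY_INPUT[question]
    except KeyError:
        raise LookupError(f"No exported trace found for input: {question!r}") from None


if __name__ == "__main__":
    print(asyncio.run(task("What is the capital of France?")))
```

Raising on a missing input, rather than returning an empty string, keeps a case with no matching trace from being silently scored as a wrong answer.
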

[ ]

Then create a basic evaluator that checks whether the output matches the expected value exactly.
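
One way such an evaluator might look is sketched below. The class name `MatchesExpected` and the helper `exact_match` are hypothetical; the comparison is kept in a plain function, and the Pydantic Evals wrapper is defined only if the package can be imported.

```python
from dataclasses import dataclass


def exact_match(output, expected) -> bool:
    """True only when the output equals the expected value exactly."""
    return output == expected


try:
    from pydantic_evals.evaluators import Evaluator, EvaluatorContext
except ImportError:
    # pydantic-evals is not installed; only the plain helper is available.
    MatchesExpected = None
else:

    @dataclass
    class MatchesExpected(Evaluator):
        """Custom evaluator that compares the task output with the case's expected output."""

        def evaluate(self, ctx: EvaluatorContext) -> bool:
            return exact_match(ctx.output, ctx.expected_output)
```
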

[ ]
[ ]

Run your experiment and evaluation

With everything connected, you can now run your evaluation using Pydantic Evals:

[ ]

Redefine Eval to be LLM-powered or Semantic

That evaluation works, but an exact match is too strict for a real-world setting. Try adding two other kinds of evaluators: a fuzzy match eval and an LLM judge eval.
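
To make the fuzzy match idea concrete, here is a sketch built on the standard library's `difflib`. The notebook's own fuzzy evaluator may use a different similarity measure; the function names and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_score(output: str, expected: str) -> float:
    """Similarity in [0, 1] after trimming whitespace and ignoring case."""
    a = output.strip().lower()
    b = expected.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> bool:
    """Pass when the similarity reaches the (assumed) threshold."""
    return fuzzy_score(output, expected) >= threshold


if __name__ == "__main__":
    print(fuzzy_score("Shakespeare", "William Shakespeare"))
    print(fuzzy_match(" paris ", "Paris"))
```

A character-level score like this tolerates casing and small typos, but an answer that omits or adds a whole word can still fall below the threshold, which is where an LLM judge tends to help.
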

[ ]
[ ]
[ ]

You should now see that the LLM Judge at least catches that "Shakespeare" and "William Shakespeare" represent the same answer.

Upload Labels to Phoenix

As a final step, you can now upload your eval results to Phoenix to capture them in the UI.
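
Before uploading, the results generally need to be keyed by span ID so Phoenix can attach each one to the right trace. The sketch below shows one possible shaping step under assumptions: the input records, the `to_annotation_rows` helper, and the "correct"/"incorrect" labels are all hypothetical, and the actual upload call is left out because it depends on your Phoenix client version.

```python
def to_annotation_rows(results):
    """Turn per-case pass/fail results into rows with span_id, label, and score fields."""
    rows = []
    for result in results:
        passed = bool(result["passed"])
        rows.append(
            {
                "span_id": result["span_id"],  # ties the eval back to its trace
                "label": "correct" if passed else "incorrect",
                "score": 1.0 if passed else 0.0,
            }
        )
    return rows


if __name__ == "__main__":
    # Hypothetical results keyed by the span each output came from.
    example = [
        {"span_id": "span-1", "passed": True},
        {"span_id": "span-2", "passed": False},
    ]
    for row in to_annotation_rows(example):
        print(row)
```
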

[ ]
[ ]
[ ]
[ ]

[Image: eval results shown in Phoenix]

For more on LLM Evaluation, check out our Arize Master Guide to LLM Evaluation!