Pydantic Evals
Evaluation using Pydantic Evals
Pydantic offers an evaluation library, Pydantic Evals, that can run preset deterministic checks, such as whether an output matches a Pydantic model, as well as LLM Judge evaluations. These evals normally run over datasets of cases that you define in code. However, you may want to evaluate real traces from your application rather than presaved cases.
This notebook shows you how you can use Pydantic Evals alongside Arize Phoenix to run evals on traces captured from your running application.
Note: Phoenix includes its own evals package, but it is also designed to work with other eval packages such as Pydantic Evals.
Note: This notebook was last updated on Oct 7, 2025.
Install dependencies
Set up API keys and imports
Enable Phoenix Tracing
Sign up for a free instance of Phoenix Cloud to get your API key. If you'd prefer, you can instead self-host Phoenix.
Create Example Traces to Evaluate
Next, we'll run some example inputs through an LLM call to generate traces that we can evaluate. In practice, you'd likely already have an application you're tracing that you'd want to evaluate instead.
You should now see three traces captured in your Phoenix instance. If you don't see them right away, make sure you've selected the pydantic-evals-tutorial project.
Export Traces from Phoenix
Next, export those traces from Phoenix so that you can evaluate them with Pydantic Evals.
Define the Evaluation Dataset
Create a dataset of test cases using Pydantic Evals for a question-answering task.
- Each Case represents a single test with an input (question) and an expected output (answer).
- The Dataset aggregates these cases for evaluation.
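The shape of a case and a dataset can be sketched without the library. The block below is a library-free stand-in, not the Pydantic Evals API: `QACase` and `QADataset` are hypothetical names that mirror the roles described above, and the questions are illustrative sample data rather than the notebook's own.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QACase:
    """Stand-in for a single test case: one question and its expected answer."""

    name: str
    inputs: str
    expected_output: str


@dataclass
class QADataset:
    """Stand-in for a dataset: an ordered collection of cases."""

    cases: list[QACase] = field(default_factory=list)


# Illustrative sample data (an assumption, not taken from the notebook).
dataset = QADataset(
    cases=[
        QACase(
            name="capital",
            inputs="What is the capital of France?",
            expected_output="Paris",
        ),
        QACase(
            name="author",
            inputs="Who wrote Hamlet?",
            expected_output="William Shakespeare",
        ),
    ]
)

for case in dataset.cases:
    print(f"{case.name}: {case.inputs!r} -> {case.expected_output!r}")
```

In the notebook itself, the library's own case and dataset classes play these roles; the stand-in only shows what information each case carries.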
Set up the LLM task, Evaluator, and Dataset for Pydantic Evals
Pydantic Evals requires a task to run each case through. Since you've already run this task for each input (the results are in the traces you captured above), the task here simply retrieves the corresponding output from your dataframe of exported traces instead of calling the LLM again.
Then create a basic evaluator that checks whether the output matches the expected value exactly.
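As a rough sketch of that idea, the task can be a plain lookup and the evaluator a strict equality check. Everything named here is an assumption for illustration: `exported_outputs` is a hypothetical dictionary standing in for the dataframe of exported traces, and `task` and `exact_match` are plain functions rather than the library's task and evaluator interfaces.

```python
# Hypothetical stand-in for the exported traces: question -> recorded model output.
# In the notebook this information lives in a dataframe of exported spans.
exported_outputs = {
    "What is the capital of France?": "Paris",
    "Who wrote Hamlet?": "Shakespeare",
}


def task(question: str) -> str:
    """Return the output already recorded for this question; no new LLM call is made."""
    try:
        return exported_outputs[question]
    except KeyError:
        raise LookupError(f"No exported trace found for input: {question!r}") from None


def exact_match(output: str, expected_output: str) -> float:
    """Score 1.0 only when the output equals the expected answer character for character."""
    return 1.0 if output == expected_output else 0.0


print(exact_match(task("What is the capital of France?"), "Paris"))
print(exact_match(task("Who wrote Hamlet?"), "William Shakespeare"))
```

Raising on a missing input, rather than returning an empty string, keeps a case with no matching trace from being silently scored as a wrong answer.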
Run your experiment and evaluation
Now with everything connected up, you can run your evaluation using Pydantic:
Redefine Eval to be LLM-powered or Semantic
That evaluation works, but exact matching is too strict for real-world use. Try adding two other kinds of evaluators: a fuzzy match eval and an LLM judge eval.
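One possible fuzzy match, sketched with only the standard library: score the character-level similarity of the two strings with `difflib.SequenceMatcher`. This is an assumed technique, not necessarily the one the notebook uses, and `fuzzy_score` is a hypothetical helper name. The LLM judge is not sketched here because it needs a live model call.

```python
from difflib import SequenceMatcher


def fuzzy_score(output: str, expected_output: str) -> float:
    """Return a similarity ratio in [0.0, 1.0], ignoring case and surrounding whitespace."""
    a = output.strip().lower()
    b = expected_output.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


# Compare a partial answer and an unrelated answer against the expected value.
partial = fuzzy_score("Shakespeare", "William Shakespeare")
unrelated = fuzzy_score("Jupiter", "Paris")
print(f"partial answer: {partial:.2f}, unrelated answer: {unrelated:.2f}")
```

Whether a given score counts as a match depends on the cutoff you pick, which is a judgement call. A character-based ratio also has no notion of meaning, which is the gap an LLM judge is meant to cover.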
You should now see that the LLM Judge at least catches that "Shakespeare" and "William Shakespeare" represent the same answer.
Upload Labels to Phoenix
As a final step, upload your eval results to Phoenix so that they appear in the UI.
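Before uploading, the results need to be keyed by the span they refer to. The sketch below shapes evaluator scores into one row per span. It assumes Phoenix expects a span identifier plus `label`, `score`, and `explanation` fields; those column names, the `EvalResult` type, the `to_label_rows` helper, and the span IDs are all assumptions for illustration, so check them against the Phoenix documentation before relying on them.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One evaluator verdict tied to the span it judged (hypothetical helper type)."""

    span_id: str
    score: float
    explanation: str = ""


def to_label_rows(results: list[EvalResult], pass_threshold: float = 0.5) -> list[dict]:
    """Turn scores into rows carrying a span id, score, label, and explanation."""
    rows = []
    for result in results:
        rows.append(
            {
                # Assumed column name for the span identifier.
                "context.span_id": result.span_id,
                "score": result.score,
                "label": "correct" if result.score >= pass_threshold else "incorrect",
                "explanation": result.explanation,
            }
        )
    return rows


# Hypothetical span IDs; real ones come from the exported traces.
results = [
    EvalResult(span_id="span-001", score=1.0, explanation="Exact match."),
    EvalResult(span_id="span-002", score=0.0, explanation="Output omits the first name."),
]
rows = to_label_rows(results)
for row in rows:
    print(row)
```

A list of dictionaries like this converts directly into a dataframe, which can then be handed to whichever Phoenix upload call you use.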
