Datasets Experiments Quickstart Python
This tutorial demonstrates how to use AX Datasets & Experiments to systematically evaluate and improve AI agents. You'll learn how to create datasets, define task functions that run your agent on each example, and use both code-based and LLM-as-a-Judge evaluators to measure performance. By the end, you'll be able to run experiments that compare different agent versions and track improvements over time, enabling data-driven development and deployment decisions.
The notebook covers four main sections. Follow the documentation for the complete tutorial.
- Define Agent: Set up a customer support agent with tools for ticket classification and policy retrieval, using the agno framework
- Create a Dataset: Build a dataset of support ticket queries with ground truth labels, then upload it to Phoenix
- Define an Experiment: Create task functions and evaluators (code-based and LLM judges), then run experiments to measure agent performance and compare different versions
- Iterations with Experiments: Compare different agent versions using experiments to validate improvements before deployment
Define Support Agent
This agent is a customer support assistant that helps users resolve their issues by classifying tickets and retrieving relevant policies. The agent has two tools: classify_ticket, which categorizes support tickets into billing, technical, account, or other categories, and retrieve_policy, which fetches the appropriate internal support policy based on the ticket category.
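To make the two tools concrete, here is a minimal sketch of what they might look like as plain Python functions. The keyword lists, policy text, and function bodies are illustrative assumptions, not the notebook's actual implementation, and the registration of these functions with the agno agent is left out.

```python
# Hypothetical keyword lists; the real tool may classify tickets differently.
_KEYWORDS = {
    "billing": ("charge", "refund", "invoice", "payment"),
    "technical": ("error", "crash", "bug", "not working"),
    "account": ("password", "login", "email", "profile"),
}

# Hypothetical policy text, one entry per category.
_POLICIES = {
    "billing": "Verify the charge and escalate refund requests to finance.",
    "technical": "Collect reproduction steps and open an engineering ticket.",
    "account": "Confirm the user's identity before changing account details.",
    "other": "Route the ticket to a human support representative.",
}


def classify_ticket(ticket_text: str) -> str:
    """Return one of: billing, technical, account, other."""
    text = ticket_text.lower()
    for category, keywords in _KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


def retrieve_policy(category: str) -> str:
    """Return the support policy for a category, falling back to 'other'."""
    return _POLICIES.get(category, _POLICIES["other"])
```

In this sketch the output of classify_ticket feeds directly into retrieve_policy, which mirrors the order in which the agent is expected to call its tools.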
Section 1: Create a Dataset
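As a rough sketch, a dataset of support ticket queries with ground truth labels can be represented as a list of records before it is uploaded. The field names `query` and `expected_category` and the example rows are assumptions made for illustration. The upload step is omitted because it depends on the client SDK.

```python
# Categories the agent is expected to produce (taken from the agent description).
VALID_CATEGORIES = {"billing", "technical", "account", "other"}

# Illustrative rows only; field names are hypothetical.
support_tickets = [
    {"query": "I was charged twice for my subscription.", "expected_category": "billing"},
    {"query": "The app crashes when I open settings.", "expected_category": "technical"},
    {"query": "How do I change the email on my profile?", "expected_category": "account"},
    {"query": "Do you have an office in Berlin?", "expected_category": "other"},
]


def validate_dataset(rows: list[dict]) -> None:
    """Raise ValueError if a row is missing a field or has an unknown label."""
    for index, row in enumerate(rows):
        if not row.get("query"):
            raise ValueError(f"row {index} has no query")
        if row.get("expected_category") not in VALID_CATEGORIES:
            raise ValueError(f"row {index} has an invalid label")


validate_dataset(support_tickets)
```

Validating labels before upload helps keep the ground truth usable by code-based evaluators that rely on exact matches.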
Section 2: Define an Experiment
Run an Experiment to Check Tool Call Accuracy (Code-Based Evaluator)
This is our tool function from above:
Since our "baseline" examples have a ground truth field, we can use a code-based evaluator to check whether the task output matches what we expect.
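The idea can be sketched in plain Python as follows. This is not the experiments API itself: `stub_task`, `matches_ground_truth`, `run_experiment`, and the field names are hypothetical stand-ins for the task function, evaluator, and experiment runner used in the notebook.

```python
from typing import Callable


def stub_task(query: str) -> str:
    """Hypothetical stand-in for the task that runs the agent on one example."""
    return "billing" if "charge" in query.lower() else "other"


def matches_ground_truth(output: str, expected: str) -> float:
    """Code-based evaluator: 1.0 on an exact (case-insensitive) match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    examples: list[dict],
    task: Callable[[str], str],
    evaluator: Callable[[str, str], float],
) -> float:
    """Run the task on every example and return the mean evaluator score."""
    scores = [evaluator(task(ex["query"]), ex["expected_category"]) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0


examples = [
    {"query": "I was charged twice.", "expected_category": "billing"},
    {"query": "My login fails.", "expected_category": "account"},
]
accuracy = run_experiment(examples, stub_task, matches_ground_truth)
```

Because the evaluator is deterministic, the same score can be recomputed for each agent version, which makes version-to-version comparisons straightforward.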