Arize AI
Openai Agents Cookbook


Tracing and Evaluating OpenAI Agents

This guide shows you how to create and evaluate agents with Arize to improve performance. We'll go through the following steps:

  • Create an agent using the OpenAI Agents SDK

  • Trace the agent activity

  • Create a dataset to benchmark performance

  • Run an experiment to evaluate agent performance using an LLM as a judge

Initial setup

Install Libraries

[ ]
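The contents of the install cell are not preserved here. As a rough sketch, this tutorial likely needs the OpenAI Agents SDK, the Arize tracing helper, the OpenInference instrumentor for the Agents SDK, and Phoenix evals. The module and package names below are assumptions, so check them against the current Arize docs before installing. The helper only reports which top-level modules cannot be found in the current environment.

```python
from importlib.util import find_spec

# Assumed mapping of top-level import name -> pip package name.
# These names are assumptions, not taken from the original notebook.
REQUIRED = {
    "agents": "openai-agents",
    "arize": "arize",
    "openinference": "openinference-instrumentation-openai-agents",
    "phoenix": "arize-phoenix-evals",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names whose top-level module cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
```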

Setup Keys

Copy the Arize API_KEY and SPACE_ID from your Space Settings page to the variables in the cell below.

[ ]
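One way to fill in the keys without hard-coding them is to read them from the environment and prompt only when a value is missing. This is a minimal sketch: the environment variable names ARIZE_SPACE_ID and ARIZE_API_KEY and the helpers get_key and load_keys are hypothetical. OPENAI_API_KEY is the variable the OpenAI client reads by default.

```python
import os
from getpass import getpass


def get_key(name: str, env=os.environ, prompt=getpass) -> str:
    """Read a secret from the environment, prompting only if it is unset."""
    value = env.get(name) or prompt(f"Enter {name}: ")
    if not value:
        raise ValueError(f"{name} is required")
    return value


def load_keys(env=os.environ, prompt=getpass) -> dict[str, str]:
    """Collect the three keys this tutorial uses (variable names are assumptions)."""
    names = ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY")
    return {name: get_key(name, env, prompt) for name in names}
```

In the notebook you could then write keys = load_keys() and assign SPACE_ID = keys["ARIZE_SPACE_ID"] and API_KEY = keys["ARIZE_API_KEY"].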

Setup Tracing

[ ]
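The tracing cell is not preserved here. A sketch of what it might look like, assuming the arize-otel and openinference-instrumentation-openai-agents packages are installed: register a tracer provider that exports to Arize, then instrument the Agents SDK so each agent run is captured as a trace. The function name setup_tracing and the project name are hypothetical. The imports are deferred into the function so that defining it does not require the packages.

```python
def setup_tracing(space_id: str, api_key: str, project_name: str = "agents-cookbook"):
    """Register an Arize tracer provider and instrument the OpenAI Agents SDK."""
    # Imported lazily: these packages are only needed when the function is called.
    from arize.otel import register
    from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor

    tracer_provider = register(
        space_id=space_id,
        api_key=api_key,
        project_name=project_name,  # hypothetical project name
    )
    # Route spans emitted by the Agents SDK to the provider registered above.
    OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
    return tracer_provider
```

Call setup_tracing(SPACE_ID, API_KEY) once, before creating or running any agents.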

Create your first agent with the OpenAI Agents SDK

Here we set up a basic agent that can solve math problems.

We have a function tool that can solve math equations, and an agent that can use this tool.

We'll use the Runner class to run the agent and get the final output.

[ ]
[ ]
[ ]
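The agent cells are not preserved here, so the following is a sketch rather than the notebook's own code. It assumes the openai-agents package, whose Agent, Runner, and function_tool are imported lazily. The tool solve_equation, the agent name, and the instructions are illustrative. The tool evaluates plain arithmetic by walking the parsed expression instead of calling eval, so input such as a function call is rejected.

```python
import ast
import operator

# Arithmetic operators the tool is willing to evaluate.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}


def _evaluate(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def solve_equation(equation: str) -> str:
    """Evaluate an arithmetic expression such as '2 * (3 + 4)' and return the result."""
    return str(_evaluate(ast.parse(equation, mode="eval").body))


def build_agent():
    """Create an agent that can call solve_equation (names are illustrative)."""
    from agents import Agent, function_tool

    return Agent(
        name="Math Solver",
        instructions="Solve math problems. Use the tool to evaluate expressions.",
        tools=[function_tool(solve_equation)],
    )


async def ask(agent, question: str):
    """Run the agent on one question and return its final output."""
    from agents import Runner

    result = await Runner.run(agent, question)
    return result.final_output
```

In a notebook cell you could then run agent = build_agent() followed by print(await ask(agent, "What is 15 + 28?")).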

Now that we have a basic agent, let's evaluate whether it responded correctly!

Evaluating our agent

Agents can go awry for a variety of reasons, so it helps to evaluate them along several dimensions:

  1. Tool call accuracy - did our agent choose the right tool with the right arguments?
  2. Tool call results - did the tool respond with the right results?
  3. Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?
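To make the three checks above concrete, here is a small sketch that compares an observed tool call against an expected one. The ToolCall record and the three helper functions are hypothetical and exist only for illustration. Real evaluations often replace the exact-match comparisons with an LLM judge.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A hypothetical record of one tool invocation."""
    name: str
    arguments: dict = field(default_factory=dict)
    result: str = ""


def tool_call_accurate(actual: ToolCall, expected: ToolCall) -> bool:
    """1. Did the agent pick the right tool with the right arguments?"""
    return actual.name == expected.name and actual.arguments == expected.arguments


def tool_result_correct(actual: ToolCall, expected: ToolCall) -> bool:
    """2. Did the tool return the expected result?"""
    return actual.result == expected.result


def goal_reached(final_answer: str, expected_answer: str) -> bool:
    """3. Does the agent's final answer match the expected outcome?"""
    return final_answer.strip() == expected_answer.strip()
```

The checks are independent: an agent can choose the right tool and arguments yet still report a wrong final answer, which is why goal accuracy is measured separately.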

We'll set up a simple evaluator that checks whether the agent's response is correct. You can read about different types of agent evals in the Arize documentation.

Let's set up our evaluation by defining our task function, our evaluator, and our dataset.

[ ]
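The task cell is not preserved here. A task function takes one dataset row, runs the agent on it, and returns the output to be evaluated. The sketch below assumes that the experiment runner passes each row as a dict named dataset_row, that it accepts async tasks, and that the question lives in a "question" column. All three are assumptions, and make_task is a hypothetical helper.

```python
def make_task(run_agent):
    """Build a task from an async callable that answers a single question."""

    async def task(dataset_row: dict) -> str:
        # "question" is the assumed column name in the dataset.
        answer = await run_agent(dataset_row["question"])
        return str(answer)

    return task
```

With the agent from earlier, run_agent could be an async function that calls Runner.run on the question and returns result.final_output.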

Let's create our evaluator.

[ ]
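The evaluator cell is not preserved here. As a sketch of the LLM-as-a-judge pattern: format the question and the agent's response into a grading prompt, send it to a judge model, and snap the reply to a fixed set of labels. The template wording, the label names, and the judge callable are all assumptions. In practice the judge call could be made with a library such as Phoenix evals.

```python
JUDGE_TEMPLATE = """You are grading a math assistant.
Question: {question}
Response: {response}
Reply with a single word: correct or incorrect."""

# The only labels the evaluator accepts from the judge.
RAILS = ("correct", "incorrect")


def build_judge_prompt(question: str, response: str) -> str:
    """Fill the grading template for one question/response pair."""
    return JUDGE_TEMPLATE.format(question=question, response=response)


def parse_label(raw: str) -> str:
    """Snap the judge's reply to an allowed label, or flag it as unparseable."""
    text = raw.strip().lower().rstrip(".")
    return text if text in RAILS else "unparseable"


def correctness_eval(question: str, response: str, judge) -> dict:
    """Score one response; judge is any callable mapping a prompt to text."""
    label = parse_label(judge(build_judge_prompt(question, response)))
    return {"label": label, "score": 1.0 if label == "correct" else 0.0}
```

Keeping the judge as an injected callable makes the evaluator easy to check with a stub before spending tokens on a real model.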

Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our math problem solving agent.

[ ]
[ ]
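The template and generation cells are not preserved here. Assuming the model is asked to return one question per line, a sketch of turning that raw text into dataset rows might look like this. The parse_questions helper and the "question" column name are hypothetical.

```python
import re


def parse_questions(raw: str, limit: int = 25) -> list[dict]:
    """Turn one-question-per-line model output into dataset rows."""
    rows = []
    for line in raw.splitlines():
        # Drop list markers such as "1.", "2)", "-" or "*" followed by a space.
        question = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if question:
            rows.append({"question": question})
    return rows[:limit]
```

The resulting list of dicts can be passed to pandas.DataFrame to get the dataframe used in the rest of the tutorial.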

Now let's use this dataset and run it with the agent!

Create an experiment

With the dataset of questions we generated above, we can use Arize experiments to track changes across models, prompts, and parameters for our agent.

Let's create this dataset and upload it into the platform.

[ ]
[ ]
[ ]
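The upload and experiment cells are not preserved here. The sketch below assumes the Arize datasets client from the arize package, imported lazily: upload the dataframe as a dataset, then run the task and evaluators against it. The client constructor and method signatures shown are assumptions based on the Arize SDK at the time of writing and may differ between versions, so verify them against the current docs. The function name and the dataset and experiment names are hypothetical.

```python
def run_agent_experiment(
    api_key,
    space_id,
    dataframe,
    task,
    evaluators,
    dataset_name="openai-agents-math",       # hypothetical name
    experiment_name="math-solver-baseline",  # hypothetical name
):
    """Upload the questions as a dataset, then run the agent task against it."""
    # Imported lazily; signatures below are assumptions to verify in the Arize docs.
    from arize.experimental.datasets import ArizeDatasetsClient
    from arize.experimental.datasets.utils.constants import GENERATIVE

    client = ArizeDatasetsClient(api_key=api_key)
    dataset_id = client.create_dataset(
        space_id=space_id,
        dataset_name=dataset_name,
        dataset_type=GENERATIVE,
        data=dataframe,
    )
    return client.run_experiment(
        space_id=space_id,
        dataset_id=dataset_id,
        task=task,
        evaluators=evaluators,
        experiment_name=experiment_name,
    )
```

Rerunning with a different experiment_name after changing the model, prompt, or parameters lets you compare runs side by side in the platform.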