Arize AI
Ragas Agents Cookbook

Tracing and Evaluating AI agents

This guide will walk you through the process of creating and evaluating agents using Ragas and Arize. We'll cover the following steps:

  • Build a customer support agent with the OpenAI Agents SDK

  • Trace agent activity to monitor interactions

  • Generate a benchmark dataset for performance analysis

  • Evaluate agent performance using Ragas

Initial setup

We'll set up our libraries, keys, and OpenAI tracing using Phoenix.

Install Libraries

[ ]

Set Up Keys

Next, connect to Arize and enter the relevant keys.
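
As a minimal sketch of this step, the snippet below checks that the expected keys are present before anything else runs. The variable names OPENAI_API_KEY, ARIZE_SPACE_ID, and ARIZE_API_KEY and the helper missing_keys are assumptions for illustration; use whatever names your own environment defines.

```python
import os
from typing import Mapping, Sequence

# Assumed key names; adjust these to match your own setup.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ARIZE_SPACE_ID", "ARIZE_API_KEY")


def missing_keys(
    required: Sequence[str] = REQUIRED_KEYS,
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the required keys that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing keys:", ", ".join(absent))
    else:
        print("All keys found.")
```

Checking up front means a missing key shows up as a clear message rather than as an authentication error halfway through a run.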

[ ]

Set Up Tracing

[ ]

Create your first agent with the OpenAI SDK

Here we set up a basic agent that can solve math problems.

We have a function tool that can solve math equations, and an agent that can use this tool.

We'll use the Runner class to run the agent and get the final output.
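
The function behind such a tool can be sketched in plain Python. This is an assumption about its shape, not the notebook's own cell: solve_equation and its helpers are illustrative names, and the sketch parses the expression with the standard ast module rather than calling eval, so only arithmetic is accepted. In the OpenAI Agents SDK a function like this would typically be registered as a function tool and handed to the agent; that wiring is left out here so the sketch runs without an API key.

```python
import ast
import operator

# Arithmetic operators the evaluator accepts; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _is_number(node: ast.AST) -> bool:
    """True for int/float literals (booleans are excluded)."""
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    )


def _evaluate(node: ast.AST):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if _is_number(node):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def solve_equation(equation: str) -> str:
    """Evaluate an arithmetic expression such as "2 + 3 * 4"; return the result as text."""
    return str(_evaluate(ast.parse(equation, mode="eval")))
```

For example, solve_equation("2 + 3 * 4") returns "14", while an input containing a function call or a name raises ValueError instead of being executed.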

[ ]
[ ]
[ ]

Now that we have a basic agent, let's evaluate whether it responded correctly.

Evaluating our agent

Agents can go awry for a variety of reasons. We can use Ragas to evaluate whether the agent responded correctly. Two Ragas metrics help with this:

  1. Tool call accuracy - did our agent choose the right tool with the right arguments?
  2. Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?
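
To make the two questions concrete, here is a deliberately simplified sketch of what each one checks. These functions are stand-ins written for illustration, not the Ragas implementations, which are more involved; the dictionary shape used for a tool call is also an assumption.

```python
from typing import Any

# Assumed shape: {"name": "solve_equation", "args": {"equation": "2 + 2"}}
ToolCall = dict[str, Any]


def simple_tool_call_accuracy(predicted: list[ToolCall], reference: list[ToolCall]) -> float:
    """Fraction of calls that match position by position on both name and args."""
    if not reference:
        # Nothing was expected: perfect only if nothing was called.
        return 1.0 if not predicted else 0.0
    matches = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred.get("name") == ref.get("name") and pred.get("args") == ref.get("args")
    )
    # Dividing by the longer list penalizes both extra and missing calls.
    return matches / max(len(predicted), len(reference))


def simple_goal_accuracy(final_output: str, reference: str) -> float:
    """1.0 if the reference answer appears in the final output, else 0.0."""
    return 1.0 if reference.strip() in final_output else 0.0
```

The substring check in simple_goal_accuracy is crude (a reference of "14" also matches "140"), which is one reason a dedicated metric is preferable for real evaluations.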

Let's set up our evaluation by defining our task function, our evaluator, and our dataset.

[ ]

This helper code converts the agent's messages into a format that Ragas can use.
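
A conversion of this kind might look like the sketch below. Both sides are assumptions: the input is taken to be a list of OpenAI-style conversation items (user and assistant messages plus function_call and function_call_output entries), and the Message dataclass is a local stand-in for Ragas's message types rather than the Ragas classes themselves.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Message:
    """Local stand-in for a Ragas-style message; not the Ragas class itself."""
    role: str  # "human", "ai", or "tool"
    content: str = ""
    tool_calls: list[dict[str, Any]] = field(default_factory=list)


def _text_of(content: Any) -> str:
    """Flatten content that is a string, empty, or a list of {"text": ...} parts."""
    if not content:
        return ""
    if isinstance(content, str):
        return content
    return "".join(part.get("text", "") for part in content if isinstance(part, dict))


def convert_messages(items: list[dict[str, Any]]) -> list[Message]:
    """Convert assumed OpenAI-style conversation items into stand-in messages."""
    converted: list[Message] = []
    for item in items:
        if item.get("type") == "function_call":
            # Tool arguments arrive as a JSON string; decode them into a dict.
            args = json.loads(item.get("arguments") or "{}")
            call = {"name": item["name"], "args": args}
            converted.append(Message(role="ai", tool_calls=[call]))
        elif item.get("type") == "function_call_output":
            converted.append(Message(role="tool", content=str(item.get("output", ""))))
        elif item.get("role") == "user":
            converted.append(Message(role="human", content=_text_of(item.get("content"))))
        elif item.get("role") == "assistant":
            converted.append(Message(role="ai", content=_text_of(item.get("content"))))
    return converted
```

Items of any other type are skipped silently, which keeps the sketch short; a production helper would probably log or raise on unknown items.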

[ ]
[ ]

Now let's set up our evaluator. We'll import both metrics from Ragas and call multi_turn_ascore(sample) on each to get the results.

The AgentGoalAccuracyWithReference metric compares the final output to the reference to see if the goal was accomplished.

The ToolCallAccuracy metric compares the agent's tool calls to the reference tool calls to check that the right tool was called with the right arguments.
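
Because multi_turn_ascore is awaited, both metrics can be scored concurrently for one sample. The sketch below shows that pattern only: ExactGoalMetric and ToolNameMetric are hypothetical stand-ins that expose the same method name, and the dictionary sample is an assumed shape, so the code runs without Ragas or an LLM.

```python
import asyncio
from typing import Any, Protocol


class MultiTurnMetric(Protocol):
    """Anything exposing an async multi_turn_ascore(sample) method."""

    async def multi_turn_ascore(self, sample: Any) -> float: ...


class ExactGoalMetric:
    """Hypothetical stand-in: 1.0 when the final output equals the reference."""

    async def multi_turn_ascore(self, sample: dict[str, Any]) -> float:
        return 1.0 if sample["final_output"] == sample["reference"] else 0.0


class ToolNameMetric:
    """Hypothetical stand-in: 1.0 when the called tool names match the reference."""

    async def multi_turn_ascore(self, sample: dict[str, Any]) -> float:
        return 1.0 if sample["tool_names"] == sample["reference_tool_names"] else 0.0


async def score_sample(sample: Any, metrics: dict[str, MultiTurnMetric]) -> dict[str, float]:
    """Await every metric on one sample concurrently and key the scores by name."""
    names = list(metrics)
    scores = await asyncio.gather(*(metrics[n].multi_turn_ascore(sample) for n in names))
    return dict(zip(names, scores))
```

In a script this would be driven with asyncio.run(score_sample(sample, metrics)); inside a notebook, which already runs an event loop, await score_sample(sample, metrics) is the usual form.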

[ ]

Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 10 questions we can use to test our math problem-solving agent.
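
The notebook's own template is not reproduced here. As a stand-in, the sketch below swaps in a seeded random generator that fills a hypothetical question template and records the reference answer alongside each question. QUESTION_TEMPLATE, OPERATIONS, and generate_questions are all illustrative names.

```python
import random

# Hypothetical template; the notebook's actual template is not shown.
QUESTION_TEMPLATE = "What is {a} {op} {b}?"
OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def generate_questions(n: int = 10, seed: int = 0) -> list[dict[str, str]]:
    """Build n arithmetic questions, each with its equation and reference answer."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for _ in range(n):
        a, b = rng.randint(1, 50), rng.randint(1, 50)
        op = rng.choice(sorted(OPERATIONS))
        rows.append(
            {
                "question": QUESTION_TEMPLATE.format(a=a, op=op, b=b),
                "equation": f"{a} {op} {b}",
                "answer": str(OPERATIONS[op](a, b)),
            }
        )
    return rows
```

The list of dictionaries can be passed straight to pandas.DataFrame to get a dataframe with question, equation, and answer columns.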

[ ]
[ ]

Now let's run the agent on this dataset.

[ ]

Create an experiment

With the dataset of questions generated above, we can use the experiments feature to track changes across models, prompts, and parameters for our agent.

Let's create this dataset and upload it into the platform.

[ ]

Finally, we run our experiment and view the results.
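
Conceptually, an experiment applies a task to every dataset row and scores each output with one or more evaluators. The sketch below is a local stand-in for that loop, not the platform's client API; run_experiment, summarize, and the row layout are assumptions made for illustration.

```python
from statistics import mean
from typing import Any, Callable

Row = dict[str, Any]
Task = Callable[[Row], Any]
Evaluator = Callable[[Any, Row], float]


def run_experiment(
    dataset: list[Row], task: Task, evaluators: dict[str, Evaluator]
) -> list[Row]:
    """Run the task on each row and attach the output plus one score per evaluator."""
    results = []
    for row in dataset:
        output = task(row)
        scores = {name: evaluate(output, row) for name, evaluate in evaluators.items()}
        results.append({**row, "output": output, **scores})
    return results


def summarize(results: list[Row], evaluator_names: list[str]) -> dict[str, float]:
    """Average each evaluator's score across all rows (0.0 when there are no rows)."""
    if not results:
        return {name: 0.0 for name in evaluator_names}
    return {name: mean(r[name] for r in results) for name in evaluator_names}
```

Running the same dataset through two task variants (for instance, two prompts) and comparing their summaries is the local equivalent of comparing two experiment runs.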

[ ]

Results