Ragas Agents Cookbook
Tracing and Evaluating AI agents
This guide will walk you through the process of creating and evaluating agents using Ragas and Arize. We'll cover the following steps:
- Build a customer support agent with the OpenAI Agents SDK
- Trace agent activity to monitor interactions
- Generate a benchmark dataset for performance analysis
- Evaluate agent performance using Ragas
Initial setup
We'll set up our libraries, keys, and OpenAI tracing using Phoenix.
Install Libraries
Setup Keys
Next, connect to Arize and enter the relevant keys.
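As a minimal sketch of this step, the helper below reads each key from the environment and prompts for it only when it is missing. `OPENAI_API_KEY` is the variable the OpenAI client libraries read; the names `ARIZE_SPACE_ID` and `ARIZE_API_KEY`, and the helpers `ensure_env` and `configure_keys`, are assumptions made for illustration.

```python
import os
from getpass import getpass


def ensure_env(name: str, prompt: str) -> str:
    """Return an environment variable, prompting for it if it is unset."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so the key is not echoed in the notebook.
        value = getpass(prompt)
        os.environ[name] = value
    return value


def configure_keys() -> None:
    """Make sure every key used in this guide is available."""
    ensure_env("OPENAI_API_KEY", "OpenAI API key: ")
    # Hypothetical variable names for the Arize credentials.
    ensure_env("ARIZE_SPACE_ID", "Arize space ID: ")
    ensure_env("ARIZE_API_KEY", "Arize API key: ")
```

Calling `configure_keys()` once at the top of the notebook keeps secrets out of the code cells themselves.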
Setup Tracing
Create your first agent with the OpenAI SDK
Here we've set up a basic agent that can solve math problems.
We have a function tool that can solve math equations, and an agent that can use this tool.
We'll use the Runner class to run the agent and get the final output.
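A minimal sketch of such an agent is shown here, assuming the `openai-agents` package is installed (it provides `Agent`, `Runner`, and `function_tool`). The tool name `solve_equation`, the agent name, the instructions, and the restriction to basic arithmetic are assumptions for illustration. The SDK imports are kept inside the functions so the tool itself can be tried without the package.

```python
import ast
import operator

# Only basic arithmetic is supported in this sketch.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _evaluate(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def solve_equation(expression: str) -> str:
    """Evaluate an arithmetic expression such as '2 + 3 * 4' and return the result as text."""
    tree = ast.parse(expression, mode="eval")
    return str(_evaluate(tree.body))


def build_agent():
    """Create an agent that can call solve_equation as a tool."""
    from agents import Agent, function_tool  # requires the openai-agents package

    return Agent(
        name="Math Solver",
        instructions="Solve math problems. Use the solve_equation tool for any arithmetic.",
        tools=[function_tool(solve_equation)],
    )


async def ask(question: str) -> str:
    """Run the agent on one question and return its final output."""
    from agents import Runner

    result = await Runner.run(build_agent(), question)
    return result.final_output
```

In a notebook, `await ask("What is 15 + 28?")` would run the agent once; this requires a valid `OPENAI_API_KEY`.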
Now that we have a basic agent, let's evaluate whether it responded correctly.
Evaluating our agent
Agents can go awry for a variety of reasons. We can use Ragas to evaluate whether the agent responded correctly. Two Ragas measurements help with this:
- Tool call accuracy - did our agent choose the right tool with the right arguments?
- Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?
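To make the first measurement concrete, here is a deliberately simplified illustration of what "right tool with the right arguments" can mean. It is not Ragas' implementation: the function `tool_call_match_rate` and the dictionary layout of a tool call are assumptions made for this sketch, and it only does a strict, position-by-position comparison.

```python
def tool_call_match_rate(predicted, reference):
    """Fraction of reference tool calls matched, in order, by name and arguments.

    Each tool call is a dict with a "name" and an "args" dict. Predicted calls
    beyond the length of the reference are ignored by this simplified check.
    """
    if not reference:
        # With nothing expected, only an agent that called no tools is correct.
        return 1.0 if not predicted else 0.0
    matched = 0
    for pred, ref in zip(predicted, reference):
        if pred["name"] == ref["name"] and pred["args"] == ref["args"]:
            matched += 1
    return matched / len(reference)


expected = [{"name": "solve_equation", "args": {"expression": "2 + 2"}}]
actual = [{"name": "solve_equation", "args": {"expression": "2 + 2"}}]
score = tool_call_match_rate(actual, expected)
```

Agent goal accuracy, by contrast, judges the final outcome against a reference answer, which generally needs an LLM as the judge rather than an exact comparison.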
Let's set up our evaluation by defining our task function, our evaluator, and our dataset.
This is helper code which converts the agent messages into a format that Ragas can use.
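One way such a helper could look is sketched below. It assumes the conversation arrives as a list of dictionaries in the shape the OpenAI Agents SDK returns from `result.to_input_list()` (user and assistant messages with a `role`, plus `function_call` and `function_call_output` items), and that the agent makes one tool call at a time. The function names `normalize_items` and `to_ragas_messages` are hypothetical; the message classes come from `ragas.messages`.

```python
import json


def _text(content):
    """Flatten message content that is either a string or a list of text parts."""
    if isinstance(content, str):
        return content
    return "".join(part.get("text", "") for part in content)


def normalize_items(items):
    """Turn raw agent items into simple records with a role, content, and tool calls."""
    records = []
    for item in items:
        item_type = item.get("type")
        if item_type == "function_call":
            # Tool arguments arrive as a JSON string.
            call = {"name": item["name"], "args": json.loads(item["arguments"])}
            records.append({"role": "ai", "content": "", "tool_calls": [call]})
        elif item_type == "function_call_output":
            records.append({"role": "tool", "content": str(item["output"]), "tool_calls": []})
        elif item.get("role") == "user":
            records.append({"role": "human", "content": _text(item["content"]), "tool_calls": []})
        elif item.get("role") == "assistant":
            records.append({"role": "ai", "content": _text(item["content"]), "tool_calls": []})
    return records


def to_ragas_messages(records):
    """Map normalized records onto Ragas message objects (requires the ragas package)."""
    from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

    messages = []
    for record in records:
        if record["role"] == "human":
            messages.append(HumanMessage(content=record["content"]))
        elif record["role"] == "tool":
            messages.append(ToolMessage(content=record["content"]))
        else:
            calls = [ToolCall(name=c["name"], args=c["args"]) for c in record["tool_calls"]]
            messages.append(AIMessage(content=record["content"], tool_calls=calls or None))
    return messages
```

Splitting the work in two keeps the parsing logic testable on its own, while the Ragas-specific mapping stays in one small function.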
Now let's set up our evaluator. We'll import both metrics from Ragas and call multi_turn_ascore(sample) on each one to get the results.
The AgentGoalAccuracyWithReference metric compares the final output to the reference to see if the goal was accomplished.
The ToolCallAccuracy metric compares the tool call to the reference tool call to see if the tool call was made correctly.
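A sketch of how the two metrics might be scored together follows. It assumes the `ragas` package is installed, that `messages` is a list of Ragas message objects, and that `evaluator_llm` is a Ragas-compatible LLM wrapper you have already created. The function name `score_conversation` and the keys of the returned dictionary are assumptions for illustration.

```python
async def score_conversation(messages, reference_goal, reference_tool_calls, evaluator_llm):
    """Score one conversation with both Ragas agent metrics.

    messages: list of ragas.messages objects describing the conversation.
    reference_goal: text describing the outcome the agent should reach.
    reference_tool_calls: list of ragas.messages.ToolCall objects that were expected.
    evaluator_llm: LLM wrapper used as the judge for goal accuracy.
    """
    from ragas.dataset_schema import MultiTurnSample
    from ragas.metrics import AgentGoalAccuracyWithReference, ToolCallAccuracy

    # One sample carries both kinds of reference; each metric reads the field it needs.
    sample = MultiTurnSample(
        user_input=messages,
        reference=reference_goal,
        reference_tool_calls=reference_tool_calls,
    )

    goal_metric = AgentGoalAccuracyWithReference(llm=evaluator_llm)
    tool_metric = ToolCallAccuracy()

    return {
        "agent_goal_accuracy": await goal_metric.multi_turn_ascore(sample),
        "tool_call_accuracy": await tool_metric.multi_turn_ascore(sample),
    }
```

Because `multi_turn_ascore` is a coroutine, the function is awaited directly in a notebook cell or run with `asyncio.run(...)` in a script.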
Create synthetic dataset of questions
Using the template below, we're going to generate a dataframe of 10 questions we can use to test our math problem solving agent.
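The original template is not reproduced here, so the following is only a hypothetical stand-in: a prompt template that asks an LLM for a fixed number of questions, plus a parser that turns the response text into rows. The wording of `QUESTION_TEMPLATE`, the function names, and the `question` column name are all assumptions; the LLM call itself is left out.

```python
import re

# Hypothetical prompt; adjust the wording to match the agent being tested.
QUESTION_TEMPLATE = (
    "Generate {n} math questions that can be answered with basic arithmetic. "
    "Return one question per line with no extra commentary."
)

# Matches a leading bullet ("-", "*", "•") or list number ("1.", "2)") followed by whitespace.
_LIST_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def build_prompt(n: int = 10) -> str:
    """Fill the template with the number of questions to request."""
    return QUESTION_TEMPLATE.format(n=n)


def parse_questions(llm_output: str) -> list[dict]:
    """Split the LLM response into one row per non-empty line."""
    rows = []
    for line in llm_output.splitlines():
        question = _LIST_PREFIX.sub("", line).strip()
        if question:
            rows.append({"question": question})
    return rows
```

The resulting list of rows can be passed to `pandas.DataFrame(rows)` to obtain the dataframe used in the next step.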
Now let's use this dataset and run it with the agent.
Create an experiment
With the dataset of questions we generated above, we can use the experiments feature to track changes across models, prompts, and parameters for our agent.
Let's create this dataset and upload it into the platform.
Finally, we run our experiment and view the results.
