Arize AI
Tool Calling Eval Dataset

Tool Calling Evaluation — Dataset Preparation

This notebook is a companion resource to the How to Evaluate Tool-Calling Agents with Phoenix tutorial.

It uploads the travel-assistant-tool-calling dataset and travel-assistant prompt to your Phoenix instance — the starting point for the full evaluation workflow covered in the tutorial.

Install Dependencies

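The code cells are not rendered in this export. A minimal sketch of what the install cell likely contains (the exact package list is an assumption based on the tutorial stack):

```python
# Notebook cell: install the Phoenix client, OpenAI SDK, and pandas.
# Package names and the unpinned versions here are assumptions.
%pip install -q arize-phoenix openai pandas
```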

Section 1: Define the Tool Set

Six tools define the capabilities of the travel planning assistant used in the tutorial.

| Tool | Description |
| --- | --- |
| search_flights | Search available flights between two cities on a given date |
| get_weather | Get current weather or forecast for a location |
| search_hotels | Find hotels in a city for given dates and guest count |
| get_directions | Get travel directions and estimated time between two locations |
| convert_currency | Convert an amount from one currency to another |
| search_restaurants | Find restaurants in a location by cuisine or criteria |
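Each tool is defined as an OpenAI-style function schema. A hedged sketch of what one entry might look like (the parameter names are assumptions, not the notebook's exact schema):

```python
# One illustrative tool schema in OpenAI function-calling format.
# Parameter names ("origin", "destination", "date") are assumptions.
search_flights_tool = {
    "type": "function",
    "function": {
        "name": "search_flights",
        "description": "Search available flights between two cities on a given date",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string", "description": "Departure city"},
                "destination": {"type": "string", "description": "Arrival city"},
                "date": {"type": "string", "description": "Travel date, YYYY-MM-DD"},
            },
            "required": ["origin", "destination", "date"],
        },
    },
}
```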

Section 2: Load the Evaluation Dataset

The evaluation dataset contains 30 travel assistant queries with ground truth tool calls, covering three scenarios:

| Pattern | Count | Description |
| --- | --- | --- |
| Single-tool | 18 | One tool needed; tests parameter extraction, implicit dates, ambiguous phrasing |
| Parallel (2 tools) | 10 | Two tools needed simultaneously; all 10 two-tool combinations represented |
| No tool needed | 2 | General travel questions the assistant should answer directly |

Each query has an expected_tool_calls label with the full tool name and arguments.
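As an illustration of that labeling format, a parallel two-tool example might look like the following (this particular query and its arguments are invented, not taken from the dataset):

```python
import json

# Hypothetical example row: a parallel (2-tool) query and its ground-truth label.
example = {
    "query": "What's the weather in Paris this weekend, and find me a hotel there for 2 guests?",
    "expected_tool_calls": json.dumps([
        {"name": "get_weather", "arguments": {"location": "Paris"}},
        {"name": "search_hotels", "arguments": {"city": "Paris", "guests": 2}},
    ]),
}

parsed = json.loads(example["expected_tool_calls"])
print([call["name"] for call in parsed])  # ['get_weather', 'search_hotels']
```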


Section 3: Build the Dataset DataFrame

The dataset has two columns:

| Column | Type | Purpose |
| --- | --- | --- |
| query | string | User's travel query — mapped to {{query}} in the experiment prompt |
| expected_tool_calls | JSON string | Full name + arguments for each call — used for invocation alignment evaluation |
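A minimal sketch of the DataFrame construction, assuming the examples live in a list of dicts (the two rows here are invented stand-ins for the dataset's 30):

```python
import json
import pandas as pd

# Hypothetical rows; the real notebook builds 30 of these from its example list.
examples = [
    {
        "query": "Find flights from NYC to London on 2025-03-10",
        "expected_tool_calls": json.dumps(
            [{"name": "search_flights",
              "arguments": {"origin": "NYC", "destination": "London", "date": "2025-03-10"}}]
        ),
    },
    {
        "query": "What's a good carry-on size for international flights?",
        "expected_tool_calls": json.dumps([]),  # the "no tool needed" pattern
    },
]

df = pd.DataFrame(examples, columns=["query", "expected_tool_calls"])
print(df.shape)  # (2, 2)
```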

Section 4: Upload to Phoenix

The cell below launches an in-process Phoenix server — no additional setup required. Run it and open the printed URL to access the UI.

If you'd prefer to connect to an existing instance, skip that cell and set your connection details before running the upload cells:

  • Phoenix Cloud: Set PHOENIX_COLLECTOR_ENDPOINT to your workspace URL and PHOENIX_API_KEY to your API key (both available at phoenix.arize.com under Settings → API Keys).
  • Existing local server: If Phoenix is already running (e.g. python -m phoenix.server.main serve), Client() connects to http://localhost:6006 automatically — no env vars needed.
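A sketch of the launch-and-upload cells. `launch_app`, `Client`, and `upload_dataset` are standard Phoenix client calls, but check your installed version's signatures; the one-row DataFrame is an invented stand-in for the Section 3 DataFrame:

```python
import json
import pandas as pd

# Minimal stand-in for the Section 3 DataFrame (one invented row).
df = pd.DataFrame([{
    "query": "Find flights from NYC to London on 2025-03-10",
    "expected_tool_calls": json.dumps(
        [{"name": "search_flights",
          "arguments": {"origin": "NYC", "destination": "London", "date": "2025-03-10"}}]
    ),
}])

# Guarded import so the sketch degrades gracefully where phoenix isn't installed.
try:
    import phoenix as px

    px.launch_app()   # starts the in-process server and prints the UI URL
    client = px.Client()  # honors PHOENIX_COLLECTOR_ENDPOINT / PHOENIX_API_KEY if set
    client.upload_dataset(
        dataset_name="travel-assistant-tool-calling",
        dataframe=df,
        input_keys=["query"],
        output_keys=["expected_tool_calls"],
    )
except ImportError:
    print("arize-phoenix is not installed; run the install cell first")
```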

Section 5: Create a Phoenix Prompt with the Tool Set

This creates a versioned travel-assistant prompt in Phoenix with all six tool schemas attached. Once pushed, it will be available in Phoenix UI → Prompts → travel-assistant and can be selected directly when creating an experiment.
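A hedged sketch of the prompt-push cell. The system message wording is assumed, and the `PromptVersion` call shape (including how the six tool schemas attach to the version) varies by client release, so treat it as an approximation and check the phoenix.client docs:

```python
# Messages for the versioned prompt; wording is an assumption, and {{query}}
# is the template variable the dataset's `query` column maps to.
prompt_messages = [
    {"role": "system",
     "content": "You are a travel planning assistant. Use the provided tools "
                "to answer the user's query; answer directly when none apply."},
    {"role": "user", "content": "{{query}}"},
]

try:
    from phoenix.client import Client
    from phoenix.client.types import PromptVersion

    # Call shape is an assumption; your client version may differ.
    Client().prompts.create(
        name="travel-assistant",
        version=PromptVersion(prompt_messages, model_name="gpt-4o"),
    )
except ImportError:
    print("arize-phoenix client is not installed; run the install cell first")
```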


Next Steps

The travel-assistant-tool-calling dataset and travel-assistant prompt are now in Phoenix. The full walkthrough is covered in the tutorial — here's a quick reference for the steps that happen in the UI.

1. Run an experiment

Open Phoenix → Datasets → travel-assistant-tool-calling → New Experiment.

Select the travel-assistant prompt in the playground and run the experiment.

2. Add evaluators

After the experiment completes, click Add Evaluator:

  • Tool Selection — from the built-in template; map input to your dataset's input column
  • Tool Invocation — same input mapping
  • Matches Expected (optional) — create a custom LLM evaluator to compare output tool calls against the labeled expected_tool_calls column
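For the optional Matches Expected evaluator, the custom LLM-judge template might look something like this. The wording is an illustration, not a Phoenix built-in; `{output}` and `{expected_tool_calls}` are placeholder names you would map to the experiment output and dataset column in the UI:

```python
# Hypothetical judge template for a custom "Matches Expected" LLM evaluator.
MATCHES_EXPECTED_TEMPLATE = """\
You are grading a tool-calling assistant.

Actual tool calls made by the assistant:
{output}

Expected tool calls (ground truth):
{expected_tool_calls}

Do the actual tool calls match the expected ones in tool name and arguments?
Minor formatting differences are acceptable. Respond with a single word:
"correct" or "incorrect".
"""
```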

3. Inspect and iterate

Review per-example explanations to identify failure patterns. Look for:

  • Systematic issues (like date assumptions) → fix the system prompt
  • Evaluator over-strictness → adjust the evaluator prompt
  • Missing capabilities (like "current date") → extend the tool set

Rerun the experiment and compare versions side by side.