Arize AI
Tool Calling Eval Dataset

Tool Calling Evaluation — Dataset Preparation

This notebook is a companion resource to the How to Evaluate Tool-Calling Agents with Phoenix tutorial.

It uploads the travel-assistant-tool-calling dataset and travel-assistant prompt to your Phoenix instance — the starting point for the full evaluation workflow covered in the tutorial.

Install Dependencies

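The code cells are not rendered in this export. A minimal sketch of what the install cell likely contains (the exact package list is an assumption based on the tutorial stack):

```python
# Notebook cell: install the Phoenix client, OpenAI SDK, and pandas.
# Package names and the unpinned versions here are assumptions.
%pip install -q arize-phoenix openai pandas
```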

Section 1: Define the Tool Set

Six tools define the capabilities of the travel planning assistant used in the tutorial.

| Tool | Description |
| --- | --- |
| search_flights | Search available flights between two cities on a given date |
| get_weather | Get current weather or forecast for a location |
| search_hotels | Find hotels in a city for given dates and guest count |
| get_directions | Get travel directions and estimated time between two locations |
| convert_currency | Convert an amount from one currency to another |
| search_restaurants | Find restaurants in a location by cuisine or criteria |
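Each tool is defined as an OpenAI-style function schema. A hedged sketch of what one entry might look like (the parameter names are assumptions, not the notebook's exact schema):

```python
# One illustrative tool schema in OpenAI function-calling format.
# Parameter names ("origin", "destination", "date") are assumptions.
search_flights_tool = {
    "type": "function",
    "function": {
        "name": "search_flights",
        "description": "Search available flights between two cities on a given date",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string", "description": "Departure city"},
                "destination": {"type": "string", "description": "Arrival city"},
                "date": {"type": "string", "description": "Travel date, YYYY-MM-DD"},
            },
            "required": ["origin", "destination", "date"],
        },
    },
}
```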

Section 2: Load the Evaluation Dataset

The evaluation dataset contains 30 travel assistant queries with ground truth tool calls, covering three scenarios:

| Pattern | Count | Description |
| --- | --- | --- |
| Single-tool | 18 | One tool needed; tests parameter extraction, implicit dates, ambiguous phrasing |
| Parallel (2 tools) | 10 | Two tools needed simultaneously; all 10 two-tool combinations represented |
| No tool needed | 2 | General travel questions the assistant should answer directly |

Each query has an expected_tool_calls label with the full tool name and arguments.
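As an illustration of that labeling format, a parallel two-tool example might look like the following (this particular query and its arguments are invented, not taken from the dataset):

```python
import json

# Hypothetical example row: a parallel (2-tool) query and its ground-truth label.
example = {
    "query": "What's the weather in Paris this weekend, and find me a hotel there for 2 guests?",
    "expected_tool_calls": json.dumps([
        {"name": "get_weather", "arguments": {"location": "Paris"}},
        {"name": "search_hotels", "arguments": {"city": "Paris", "guests": 2}},
    ]),
}

parsed = json.loads(example["expected_tool_calls"])
print([call["name"] for call in parsed])  # ['get_weather', 'search_hotels']
```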


Section 3: Build the Dataset DataFrame

The dataset has two columns:

| Column | Type | Purpose |
| --- | --- | --- |
| query | string | User's travel query — mapped to {{query}} in the experiment prompt |
| expected_tool_calls | JSON string | Full name + arguments for each call — used for invocation alignment evaluation |
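A minimal sketch of the DataFrame construction, assuming the examples live in a list of dicts (the two rows here are invented stand-ins for the dataset's 30):

```python
import json
import pandas as pd

# Hypothetical rows; the real notebook builds 30 of these from its example list.
examples = [
    {
        "query": "Find flights from NYC to London on 2025-03-10",
        "expected_tool_calls": json.dumps(
            [{"name": "search_flights",
              "arguments": {"origin": "NYC", "destination": "London", "date": "2025-03-10"}}]
        ),
    },
    {
        "query": "What's a good carry-on size for international flights?",
        "expected_tool_calls": json.dumps([]),  # the "no tool needed" pattern
    },
]

df = pd.DataFrame(examples, columns=["query", "expected_tool_calls"])
print(df.shape)  # (2, 2)
```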

Section 4: Upload to Phoenix

The cell below launches an in-process Phoenix server — no additional setup required. Run it and open the printed URL to access the UI.

If you'd prefer to connect to an existing instance, skip that cell and set your connection details before running the upload cells:

  • Phoenix Cloud: Set PHOENIX_COLLECTOR_ENDPOINT to your workspace URL and PHOENIX_API_KEY to your API key (both available at phoenix.arize.com under Settings → API Keys).
  • Existing local server: If Phoenix is already running (e.g. python -m phoenix.server.main serve), Client() connects to http://localhost:6006 automatically — no env vars needed.
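A sketch of the launch-and-upload cells. `launch_app`, `Client`, and `upload_dataset` are standard Phoenix client calls, but check your installed version's signatures; the one-row DataFrame is an invented stand-in for the Section 3 DataFrame:

```python
import json
import pandas as pd

# Minimal stand-in for the Section 3 DataFrame (one invented row).
df = pd.DataFrame([{
    "query": "Find flights from NYC to London on 2025-03-10",
    "expected_tool_calls": json.dumps(
        [{"name": "search_flights",
          "arguments": {"origin": "NYC", "destination": "London", "date": "2025-03-10"}}]
    ),
}])

# Guarded import so the sketch degrades gracefully where phoenix isn't installed.
try:
    import phoenix as px

    px.launch_app()   # starts the in-process server and prints the UI URL
    client = px.Client()  # honors PHOENIX_COLLECTOR_ENDPOINT / PHOENIX_API_KEY if set
    client.upload_dataset(
        dataset_name="travel-assistant-tool-calling",
        dataframe=df,
        input_keys=["query"],
        output_keys=["expected_tool_calls"],
    )
except ImportError:
    print("arize-phoenix is not installed; run the install cell first")
```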

Section 5: Create a Phoenix Prompt with the Tool Set

This creates a versioned travel-assistant prompt in Phoenix with all six tool schemas attached. Once pushed, it will be available in Phoenix UI → Prompts → travel-assistant and can be selected directly when creating an experiment.
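A hedged sketch of the prompt-push cell. The system message wording is assumed, and the `PromptVersion` call shape (including how the six tool schemas attach to the version) varies by client release, so treat it as an approximation and check the phoenix.client docs:

```python
# Messages for the versioned prompt; wording is an assumption, and {{query}}
# is the template variable the dataset's `query` column maps to.
prompt_messages = [
    {"role": "system",
     "content": "You are a travel planning assistant. Use the provided tools "
                "to answer the user's query; answer directly when none apply."},
    {"role": "user", "content": "{{query}}"},
]

try:
    from phoenix.client import Client
    from phoenix.client.types import PromptVersion

    # Call shape is an assumption; your client version may differ.
    Client().prompts.create(
        name="travel-assistant",
        version=PromptVersion(prompt_messages, model_name="gpt-4o"),
    )
except ImportError:
    print("arize-phoenix client is not installed; run the install cell first")
```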


Next Steps

The travel-assistant-tool-calling dataset and travel-assistant prompt are now in Phoenix. The full walkthrough is covered in the tutorial — here's a quick reference for the steps that happen in the UI.

1. Run an experiment

Open Phoenix → Datasets → travel-assistant-tool-calling → New Experiment.

Select the travel-assistant prompt in the playground and run the experiment.

2. Add evaluators

After the experiment completes, click Add Evaluator:

  • Tool Selection — from the built-in template; map input to your dataset's input column
  • Tool Invocation — same input mapping
  • Matches Expected (optional) — create a custom LLM evaluator to compare output tool calls against the labeled expected_tool_calls column
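For the optional Matches Expected evaluator, the custom LLM-judge template might look something like this. The wording is an illustration, not a Phoenix built-in; `{output}` and `{expected_tool_calls}` are placeholder names you would map to the experiment output and dataset column in the UI:

```python
# Hypothetical judge template for a custom "Matches Expected" LLM evaluator.
MATCHES_EXPECTED_TEMPLATE = """\
You are grading a tool-calling assistant.

Actual tool calls made by the assistant:
{output}

Expected tool calls (ground truth):
{expected_tool_calls}

Do the actual tool calls match the expected ones in tool name and arguments?
Minor formatting differences are acceptable. Respond with a single word:
"correct" or "incorrect".
"""
```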

3. Inspect and iterate

Review per-example explanations to identify failure patterns. Look for:

  • Systematic issues (like date assumptions) → fix the system prompt
  • Evaluator over-strictness → adjust the evaluator prompt
  • Missing capabilities (like "current date") → extend the tool set

Rerun the experiment and compare versions side by side.