Notebooks
A
Arize AI
Session Level Evals For Chatbot Cookbook

Session Level Evals For Chatbot Cookbook

arize-tutorialsevaluationLLMPython

Session Level Evals for an AI Tutor

This tutorial demonstrates how to run session-level evaluations on conversations with an AI tutor. You'll log the results back to Arize AX for further monitoring and analysis. Session-level evaluations are valuable because they provide a holistic view of the entire interaction, enabling you to assess broader patterns and answer high-level questions about user experience and system performance.

In this tutorial, you will:

  • Trace and aggregate multi-turn interactions into structured sessions
  • Evaluate sessions across multiple dimensions such as Correctness, Goal Completion, and Frustration
  • Format the evaluation outputs to match Arize's schema and log them to the platform

By the end, you’ll have a robust evaluation pipeline for analyzing and comparing session-level performance.

✅ You’ll need a free Arize AX account and an Anthropic API key to run this notebook.

Set up Dependencies & Keys

[ ]
[ ]

Configure Tracing

[ ]

Build and Run AI Tutor

In this example, we demonstrate how to evaluate AI tutor sessions. The tutor begins by receiving a user ID, topic, and question. It then explains the topic to the student and engages them with follow-up questions in a multi-turn conversation, continuing until the student ends the session. Our goal is to assess the overall quality of this interaction from start to finish.

[ ]
[ ]

Prepare Spans for Session-Level Evaluation

These following cells prepare the data for session-level evaluation. We start by loading all spans into a DataFrame, then sort them chronologically and group them by session ID. You can also group the spans by user ID.

Next, we separate user inputs from AI responses, and finally, store the structured results in a dataframe. We will use this dataframe to run our evaluations.

[ ]
[ ]

Here, we group our spans together to make a session dataframe. We also include logic to truncate part of the sesssion messages if token limits are exceeded. This prevents context window issues for longer sessions.

[ ]
[ ]

Session Correctness Eval

We are ready to begin running our evals. Let's start with an eval that ensures the AI tutor is giving the student factual information:

[ ]
[ ]

Session Frustration Prompt

This evaluation is used to make sure the student isn't getting frustrated with the tutor:

[ ]
[ ]

Session Goal Achievement Eval

Finally, we evaluate to ensure the tutor helped the student reach their learning goals:

[ ]
[ ]

Log Evaluations Back to Arize AX

Finally, we can log the evaluation results back to Arize AX. In the sessions, tab of your project, you will see the evaluation results populate for each session.

[ ]

Session Eval Results