Session Level Evals for an AI Tutor
This tutorial demonstrates how to run session-level evaluations on conversations with an AI tutor. You'll log the results back to Phoenix for further monitoring and analysis. Session-level evaluations are valuable because they provide a holistic view of the entire interaction, enabling you to assess broader patterns and answer high-level questions about user experience and system performance.
In this tutorial, you will:
- Trace and aggregate multi-turn interactions into structured sessions
- Evaluate sessions across multiple dimensions such as Correctness, Goal Completion, and Frustration
- Format the evaluation outputs to match the Phoenix schema and log them to the platform
By the end, you’ll have a robust evaluation pipeline for analyzing and comparing session-level performance.
✅ You’ll need a free Phoenix Cloud account and an Anthropic API key to run this notebook.
Set up Dependencies & Keys
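A minimal setup sketch. The package names (`arize-phoenix`, `anthropic`) and environment variable names are assumptions based on the Phoenix Cloud and Anthropic defaults; adjust them to your environment.

```python
import os

# Install (in a notebook cell): %pip install -q arize-phoenix anthropic

# Assumed environment variables; set these before running the notebook.
# PHOENIX_COLLECTOR_ENDPOINT points at your Phoenix Cloud space.
os.environ.setdefault("PHOENIX_COLLECTOR_ENDPOINT", "https://app.phoenix.arize.com")
os.environ.setdefault("PHOENIX_API_KEY", "YOUR_PHOENIX_API_KEY")
os.environ.setdefault("ANTHROPIC_API_KEY", "YOUR_ANTHROPIC_API_KEY")
```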
Configure Tracing
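With Phoenix installed, tracing is typically configured with `phoenix.otel.register`. This is a configuration sketch; the project name is an assumption, and `register()` reads the Phoenix endpoint and API key from the environment variables set above.

```python
from phoenix.otel import register

# Assumed project name; rename to match your Phoenix project.
tracer_provider = register(
    project_name="ai-tutor-sessions",
    auto_instrument=True,  # enables any installed OpenInference instrumentors
)
```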
Build and Run AI Tutor
In this example, we demonstrate how to evaluate AI tutor sessions. The tutor begins by receiving a user ID, topic, and question. It then explains the topic to the student and engages them with follow-up questions in a multi-turn conversation, continuing until the student ends the session. Our goal is to assess the overall quality of this interaction from start to finish.
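The multi-turn loop described above can be sketched as follows. This is a simplified stand-in, not the notebook's exact implementation: `llm` is any callable that maps the message history to the tutor's reply (in practice, a wrapper around the Anthropic messages API), and `student_turns` stands in for live student input, with `"end"` closing the session.

```python
from typing import Callable, Dict, List

def run_tutor_session(topic: str, question: str,
                      llm: Callable[[List[Dict[str, str]]], str],
                      student_turns: List[str]) -> List[Dict[str, str]]:
    """Run a multi-turn tutoring session and return the full message history."""
    # The session opens with the student's topic and question.
    messages = [{"role": "user", "content": f"Topic: {topic}. Question: {question}"}]
    messages.append({"role": "assistant", "content": llm(messages)})
    for student_msg in student_turns:
        if student_msg.strip().lower() == "end":  # student ends the session
            break
        messages.append({"role": "user", "content": student_msg})
        messages.append({"role": "assistant", "content": llm(messages)})
    return messages
```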
Prepare Spans for Session-Level Evaluation
The following cells prepare the data for session-level evaluation. We start by loading all spans into a DataFrame, sorting them chronologically, and grouping them by session ID. (You could also group the spans by user ID.)
Next, we separate user inputs from AI responses and store the structured results in a DataFrame, which we will use to run our evaluations.
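The grouping step can be sketched with plain pandas. The toy DataFrame below stands in for the output of `px.Client().get_spans_dataframe()`; the column names mirror Phoenix span attributes but are assumptions here.

```python
import pandas as pd

# Toy spans standing in for the real spans DataFrame pulled from Phoenix.
spans_df = pd.DataFrame({
    "context.span_id": ["s1", "s2", "s3", "s4"],
    "attributes.session.id": ["sess-a", "sess-a", "sess-b", "sess-b"],
    "start_time": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:01",
                                  "2024-01-01 11:00", "2024-01-01 11:02"]),
    "attributes.input.value": ["What is recursion?", "Can you give an example?",
                               "Explain gravity", "end"],
    "attributes.output.value": ["Recursion is...", "Sure: factorial...",
                                "Gravity is...", "Goodbye!"],
})

# Sort chronologically, then flatten each session into one conversation string.
rows = []
for session_id, g in spans_df.sort_values("start_time").groupby("attributes.session.id"):
    conversation = "\n".join(
        f"User: {u}\nAssistant: {a}"
        for u, a in zip(g["attributes.input.value"], g["attributes.output.value"])
    )
    rows.append({"session_id": session_id, "conversation": conversation})

sessions_df = pd.DataFrame(rows)
```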
Here, we group our spans together to make a session DataFrame. We also include logic to truncate part of the session messages if token limits are exceeded. This prevents context window issues for longer sessions.
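The truncation logic can take a form like the helper below. It uses character count as a rough proxy for tokens (an assumption of roughly 4 characters per token); for precise limits you could swap in a real tokenizer. It drops the oldest messages first, keeping the most recent turns intact.

```python
from typing import List

def truncate_session(messages: List[str], max_chars: int = 8000) -> List[str]:
    """Drop the oldest messages until the session fits the character budget.

    Character count is a crude stand-in for token count; replace with a
    tokenizer-based count if your model's limits are tight.
    """
    kept = list(messages)
    while kept and sum(len(m) for m in kept) > max_chars:
        kept.pop(0)  # truncate from the start, preserving recent context
    return kept
```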
Session Correctness Eval
We are ready to begin running our evals. Let's start with an eval that checks whether the AI tutor is giving the student factually accurate information:
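A correctness judge prompt might look like the template below. The wording and the `correct`/`incorrect` rails are illustrative, not the notebook's exact prompt; with Phoenix installed, a template like this could be passed to `phoenix.evals.llm_classify` along with an Anthropic judge model.

```python
SESSION_CORRECTNESS_TEMPLATE = """You are evaluating an AI tutor session.
Review the full conversation and decide whether the tutor's explanations
are factually accurate.

[BEGIN SESSION]
{conversation}
[END SESSION]

Respond with a single word: "correct" if the tutor's information is
factually accurate throughout, or "incorrect" if it contains errors."""

# Rails constrain the judge's output to a fixed label set.
CORRECTNESS_RAILS = ["correct", "incorrect"]

# Each session's conversation string is substituted into the template.
prompt = SESSION_CORRECTNESS_TEMPLATE.format(
    conversation="User: What is 2+2?\nAssistant: It is 4."
)
```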
Session Frustration Prompt
This evaluation checks whether the student is becoming frustrated with the tutor:
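A frustration prompt can follow the same pattern; the template and label names below are illustrative. Since "frustrated" is the negative outcome, it helps to map labels to scores so that higher is better, keeping the scoring direction consistent across evals.

```python
SESSION_FRUSTRATION_TEMPLATE = """You are evaluating an AI tutor session.
Read the conversation below and decide whether the student shows signs of
frustration (repeated questions, complaints, giving up, a negative tone).

[BEGIN SESSION]
{conversation}
[END SESSION]

Respond with a single word: "frustrated" or "not_frustrated"."""

FRUSTRATION_RAILS = ["frustrated", "not_frustrated"]

def frustration_score(label: str) -> float:
    # Score 1.0 when no frustration is detected, so higher is better.
    return 1.0 if label == "not_frustrated" else 0.0
```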
Session Goal Achievement Eval
Finally, we evaluate to ensure the tutor helped the student reach their learning goals:
Log Evaluations Back to Phoenix
Now we can log the evaluation results back to Phoenix. In the Sessions tab of your project, you will see the evaluation results populate for each session.
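Before logging, the results need to match the schema Phoenix expects: a DataFrame indexed by span ID with `label`, `score`, and `explanation` columns. The sketch below shows the reshaping step in plain pandas; the span IDs and results are placeholders, and the commented-out logging call at the end is an assumption about the Phoenix client API.

```python
import pandas as pd

# Hypothetical eval results keyed by each session's root span id.
results = pd.DataFrame({
    "context.span_id": ["root-a", "root-b"],
    "label": ["correct", "incorrect"],
    "score": [1.0, 0.0],
    "explanation": ["All facts check out.", "The gravity explanation was wrong."],
})

# Phoenix expects the evaluations DataFrame to be indexed by span id.
eval_df = results.set_index("context.span_id")

# With phoenix installed, logging looks roughly like this (assumption):
# import phoenix as px
# from phoenix.trace import SpanEvaluations
# px.Client().log_evaluations(
#     SpanEvaluations(eval_name="Session Correctness", dataframe=eval_df)
# )
```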
