Session Evals
Session-Level Evaluations
This notebook demonstrates how to evaluate the effectiveness of AI agent interactions at the session level, where a session consists of multiple traces (individual interactions) between a user and the system.
Conceptual Overview
Session-level evaluations assess:
- Coherence across multiple interactions
- Context retention between interactions
- Overall goal achievement across an entire conversation
- Appropriate progression through complex multi-step tasks
Setup
Configure your environment variables and import dependencies. You'll need to set up your Arize API key and import necessary libraries for data processing and evaluation.
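A minimal sketch of the environment check, assuming credentials are supplied through environment variables. The helper name `require_env` and the variable name `ARIZE_API_KEY` are illustrative assumptions, not names confirmed by this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return value


# Example usage (the variable name is an assumption; use what your deployment expects):
# arize_api_key = require_env("ARIZE_API_KEY")
```

Failing early here avoids a less obvious authentication error later in the data extraction step.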
Data Extraction
Pull trace data from Arize and prepare it for analysis.
Note: Modify the space_id, model_id, and date range to match your deployment.
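The extraction step might look like the sketch below. It assumes the `arize` package's export client (`ArizeExportClient.export_model_to_df` with `Environments.TRACING`), so check the call against the version you have installed. The helper names `export_window` and `export_traces` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def export_window(days_back: int, end: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return (start, end) covering the last `days_back` days, in UTC by default."""
    end = end or datetime.now(timezone.utc)
    return end - timedelta(days=days_back), end


def export_traces(api_key: str, space_id: str, model_id: str, days_back: int = 7):
    """Pull tracing data into a dataframe (assumed API; verify against your arize version)."""
    # Imported lazily so the helper above works without the arize package installed.
    from arize.exporter import ArizeExportClient
    from arize.utils.types import Environments

    start_time, end_time = export_window(days_back)
    client = ArizeExportClient(api_key=api_key)
    return client.export_model_to_df(
        space_id=space_id,
        model_id=model_id,
        environment=Environments.TRACING,
        start_time=start_time,
        end_time=end_time,
    )
```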
Evaluation Prompt Design
The evaluation uses a prompt template that tells the LLM how to judge session-level effectiveness and coherence. You can customize this template to fit your specific evaluation criteria.
The session evaluation prompt focuses on:
- Coherence assessment: Does the agent maintain a consistent understanding across interactions?
- Context utilization: Does the agent effectively use information from previous interactions?
- Goal progression: Does the conversation move logically toward resolving the user's needs?
- Response appropriateness: Are the agent's responses suitable given the conversation history?
The evaluation looks at overall conversation quality and effectiveness throughout the session.
Prompt Variables
| Variable | Description | Source |
|---|---|---|
| {session_user_inputs} | The user inputs across all traces in the session | Extracted from trace data |
| {session_output_messages} | The AI's responses across all traces in the session | Extracted from trace data |
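As an illustration of how the two variables fill a template, here is a hedged sketch. The template wording and the `render_prompt` helper are assumptions written for this example, not the notebook's actual prompt.

```python
# Illustrative template only; the wording is an assumption, not the original prompt.
SESSION_EVAL_TEMPLATE = """You are evaluating a multi-turn session between a user and an AI agent.
Judge coherence, context utilization, goal progression, and response appropriateness.

User inputs, in order:
{session_user_inputs}

Agent responses, in order:
{session_output_messages}

Reply with "correct" or "incorrect" on the first line, then explain your reasoning.
"""


def _numbered(items: list) -> str:
    """Format items as a numbered list so turn order is explicit to the judge."""
    return "\n".join(f"{i}. {text}" for i, text in enumerate(items, start=1))


def render_prompt(user_inputs: list, output_messages: list) -> str:
    """Fill the two prompt variables from ordered session messages."""
    return SESSION_EVAL_TEMPLATE.format(
        session_user_inputs=_numbered(user_inputs),
        session_output_messages=_numbered(output_messages),
    )
```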
Customizing the Prompt
You may want to adjust the evaluation criteria or output format based on your specific use case:
- Add domain-specific criteria relevant to your agent's purpose
- Modify success criteria based on your application's goals
- Include additional session metadata as context
Data Preparation
These functions filter and transform session data into the format needed for evaluation.
Core concepts:
- Session identification: Finding complete user sessions to evaluate
- Trace ordering: Arranging traces chronologically within sessions
- Message extraction: Gathering user inputs and system responses across the session
The filter_sessions_by_trace_criteria function is particularly important as it allows you to:
- Select relevant sessions that contain traces matching your criteria
- Retrieve the complete session context for evaluation
This approach ensures we evaluate the full conversation flow rather than isolated interactions.
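The idea of selecting whole sessions from matching traces can be sketched with pandas as follows. This is not the notebook's filter_sessions_by_trace_criteria. The function name and the simplified column names (`session_id`, `name`, `start_time`) are assumptions for illustration.

```python
import pandas as pd


def select_sessions_with_matching_traces(spans: pd.DataFrame, name_contains: str) -> pd.DataFrame:
    """Keep every row of any session that has at least one row whose name matches."""
    # Case-insensitive literal substring match; missing names never match.
    matches = spans["name"].str.contains(name_contains, case=False, regex=False, na=False)
    session_ids = spans.loc[matches, "session_id"].dropna().unique()
    # Return the complete session, not only the matching rows, in chronological order.
    selected = spans[spans["session_id"].isin(session_ids)]
    return selected.sort_values(["session_id", "start_time"]).reset_index(drop=True)
```

Matching on a single row but returning the whole session is what preserves the full conversation context for the evaluator.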
Evaluation Configuration
Filter Data
Customize these parameters to match your specific evaluation needs:
| Parameter | Description | Example |
|---|---|---|
| trace_filters | Criteria for selecting traces within sessions | {"name": {"contains": "searchrouter"}} |
| span_filters | Criteria for selecting spans within traces | {"parent_id": {"==": None}} |
Span filters determine which spans within the matched traces are used for the evaluation. For example, filtering on "parent_id": None keeps only the root spans (spans with no parent) of the selected sessions.
Note: Update the trace_filters and span_filters to match your specific evaluation criteria.
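One way to interpret filter dictionaries of the shape shown in the table is sketched below. The `build_mask` helper is hypothetical and supports only the two operators that appear in the examples ("contains" and "=="). The notebook's own filtering logic may differ.

```python
import pandas as pd


def build_mask(df: pd.DataFrame, filters: dict) -> pd.Series:
    """Combine {column: {operator: value}} conditions into one boolean mask (logical AND)."""
    mask = pd.Series(True, index=df.index)
    for column, conditions in filters.items():
        for op, expected in conditions.items():
            if op == "contains":
                # Literal, case-sensitive substring match; missing values never match.
                cond = (
                    df[column].astype("string")
                    .str.contains(str(expected), regex=False, na=False)
                    .astype(bool)
                )
            elif op == "==":
                # Comparing against None means "value is missing", e.g. root spans.
                cond = df[column].isna() if expected is None else df[column] == expected
            else:
                raise ValueError(f"Unsupported operator: {op}")
            mask = mask & cond
    return mask
```

For example, `df[build_mask(df, {"parent_id": {"==": None}})]` would keep only rows whose parent_id is missing.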
Prepare the data for the evaluation
This step groups the prompt variables by session_id, extracts the required columns, and appends any additional data to the dataframe.
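A minimal sketch of the grouping step, assuming one row per root span with simplified columns `session_id`, `trace_id`, `start_time`, `input`, and `output` (these names and `build_session_rows` are assumptions for illustration):

```python
import pandas as pd


def build_session_rows(spans: pd.DataFrame) -> pd.DataFrame:
    """Collapse spans into one row per session with ordered inputs and outputs."""
    # Sort first so the lists collected per session are in chronological order.
    ordered = spans.sort_values(["session_id", "start_time"])
    return (
        ordered.groupby("session_id", sort=True)
        .agg(
            session_user_inputs=("input", list),
            session_output_messages=("output", list),
            trace_count=("trace_id", "nunique"),  # example of additional data
        )
        .reset_index()
    )
```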
Running the Evaluation
After preparing your sessions and configuring the evaluation parameters, you can execute the LLM-based evaluation:
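The evaluation loop can be sketched independently of any particular LLM client by injecting a `judge` callable that maps a prompt to the model's reply. Everything here (`evaluate_sessions`, `RAILS`, and the expected reply format of a label on the first line followed by an explanation) is an assumption for illustration, not the notebook's actual evaluator.

```python
from typing import Callable

# Allowed labels; replies outside this set are recorded as unparseable (None).
RAILS = ("correct", "incorrect")


def evaluate_sessions(prompts: dict, judge: Callable[[str], str]) -> list:
    """Run the judge once per session and split each reply into label and explanation."""
    results = []
    for session_id, prompt in prompts.items():
        reply = judge(prompt).strip()
        first_line, _, rest = reply.partition("\n")
        label = first_line.strip().lower().rstrip(".")
        results.append(
            {
                "session_id": session_id,
                "label": label if label in RAILS else None,
                "explanation": rest.strip(),
            }
        )
    return results
```

Constraining labels to a fixed set keeps downstream aggregation simple and makes malformed model replies easy to spot.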
Analyzing Results
The evaluation results contain:
- label: Overall trajectory assessment (correct/incorrect)
- explanation: Detailed reasoning for the assessment
The evaluation results can then be merged with your original data for analysis or to log back to Arize:
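A hedged sketch of the merge, assuming both dataframes carry a `session_id` column with one row per session. The `session_eval_` column prefix and the `attach_eval_results` helper are illustrative choices, not a naming scheme required by Arize.

```python
import pandas as pd


def attach_eval_results(sessions: pd.DataFrame, results: pd.DataFrame) -> pd.DataFrame:
    """Left-join evaluation labels and explanations onto the per-session dataframe."""
    renamed = results.rename(
        columns={"label": "session_eval_label", "explanation": "session_eval_explanation"}
    )
    # validate guards against duplicate session rows silently multiplying the output.
    return sessions.merge(renamed, on="session_id", how="left", validate="one_to_one")
```

A left join keeps sessions that were not evaluated, with missing values in the evaluation columns.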
See your results in Arize
