Session Evals

Tags: arize-tutorials, evaluation, LLM, Python

Session-Level Evaluations

This notebook demonstrates how to evaluate the effectiveness of AI agent interactions at the session level, where a session consists of multiple traces (individual interactions) between a user and the system.

Conceptual Overview

Session-level evaluations assess:

  • Coherence across multiple interactions
  • Context retention between interactions
  • Overall goal achievement across an entire conversation
  • Appropriate progression through complex multi-step tasks

Setup

Set your environment variables and import the dependencies. You will need an Arize API key and the libraries used for data processing and evaluation.
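
As a minimal sketch of the configuration step, the check below reports which credentials are still missing before anything else runs. The variable names ARIZE_API_KEY and OPENAI_API_KEY are assumptions for illustration; use whatever names your deployment and LLM provider expect.

```python
import os

# Hypothetical variable names; adjust to match your own setup.
REQUIRED_VARS = ("ARIZE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these before running the notebook:", ", ".join(missing))
```

Failing early here is cheaper than discovering a missing key halfway through an export or an evaluation run.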

Data Extraction

Pull trace data from Arize and prepare it for analysis.

Note: Modify the space_id, model_id, and date range to match your deployment.
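
The date range can be computed rather than hard-coded. The sketch below builds a trailing window in UTC; the helper name export_window and the placeholder identifiers are illustrative and not part of the notebook.

```python
from datetime import datetime, timedelta, timezone


def export_window(days_back, now=None):
    """Return (start_time, end_time) covering the last `days_back` days in UTC."""
    if days_back <= 0:
        raise ValueError("days_back must be positive")
    end_time = now or datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days_back)
    return start_time, end_time


# Placeholder identifiers: replace with the values for your deployment.
export_params = {
    "space_id": "YOUR_SPACE_ID",
    "model_id": "YOUR_MODEL_ID",
}
export_params["start_time"], export_params["end_time"] = export_window(days_back=7)
```

These values would then be passed to the export call. In the Arize Python SDK that is typically ArizeExportClient.export_model_to_df with the tracing environment, but check the signature against the SDK version you have installed.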

Evaluation Prompt Design

The evaluation uses a prompt template that tells the LLM how to judge session-level effectiveness and coherence. You can customize this template to fit your own evaluation criteria.

The session evaluation prompt focuses on:

  • Coherence assessment: Does the agent maintain a consistent understanding across interactions?
  • Context utilization: Does the agent effectively use information from previous interactions?
  • Goal progression: Does the conversation move logically toward resolving the user's needs?
  • Response appropriateness: Are the agent's responses suitable given the conversation history?

The judgement applies to the session as a whole, not to any single response.

Prompt Variables

Variable                  | Description                                         | Source
{session_user_inputs}     | The user inputs across all traces in the session    | Extracted from trace data
{session_output_messages} | The AI's responses across all traces in the session | Extracted from trace data

Customizing the Prompt

You may want to adjust the evaluation criteria or output format based on your specific use case:

  • Add domain-specific criteria relevant to your agent's purpose
  • Modify success criteria based on your application's goals
  • Include additional session metadata as context
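
The template below is an illustrative sketch, not the notebook's actual prompt. It uses the two documented variables and adds a hypothetical {extra_criteria} slot to show where domain-specific criteria could go.

```python
# Illustrative template only; the notebook's own prompt is not reproduced here.
SESSION_EVAL_TEMPLATE = """You are evaluating a multi-turn session between a user and an AI agent.

[User inputs, in order]
{session_user_inputs}

[Agent responses, in order]
{session_output_messages}

Judge the session as a whole on coherence, context retention, goal progression,
and response appropriateness. {extra_criteria}
Answer with a single word, "correct" or "incorrect", followed by a short explanation.
"""


def render_prompt(user_inputs, output_messages, extra_criteria=""):
    """Fill the template with numbered turns so the judge can see their order."""

    def numbered(items):
        return "\n".join(f"{i}. {text}" for i, text in enumerate(items, start=1))

    return SESSION_EVAL_TEMPLATE.format(
        session_user_inputs=numbered(user_inputs),
        session_output_messages=numbered(output_messages),
        extra_criteria=extra_criteria,
    )
```

Numbering the turns is a design choice: it lets the judge refer to a specific exchange in its explanation.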

Data Preparation

These functions filter and transform session data into the format needed for evaluation.

Core concepts:

  • Session identification: Finding complete user sessions to evaluate
  • Trace ordering: Arranging traces chronologically within sessions
  • Message extraction: Gathering user inputs and system responses across the session

The filter_sessions_by_trace_criteria function is particularly important as it allows you to:

  1. Select relevant sessions that contain traces matching your criteria
  2. Retrieve the complete session context for evaluation

This approach ensures we evaluate the full conversation flow rather than isolated interactions.
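
The two steps can be sketched as follows. This is a hypothetical stand-in for the idea behind filter_sessions_by_trace_criteria, not its real implementation. Plain dictionaries stand in for DataFrame rows so the sketch runs without extra dependencies, and the key names (attributes.session.id, context.trace_id, start_time) are assumptions about the exported columns.

```python
def select_full_sessions(spans, trace_matches):
    """Return all spans of every session that has at least one matching trace.

    spans: list of dicts, one per span.
    trace_matches: callable(span) -> bool expressing the trace-level criterion.
    """
    # Step 1: find the traces that satisfy the criterion, then their sessions.
    matching_traces = {s["context.trace_id"] for s in spans if trace_matches(s)}
    session_ids = {
        s["attributes.session.id"]
        for s in spans
        if s["context.trace_id"] in matching_traces and s.get("attributes.session.id")
    }
    # Step 2: keep the complete session, not only the matching traces.
    selected = [s for s in spans if s.get("attributes.session.id") in session_ids]
    # Order chronologically within each session.
    return sorted(selected, key=lambda s: (s["attributes.session.id"], s["start_time"]))
```

Note that a single matching trace pulls in every trace of its session, which is what preserves the full conversation flow.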

Evaluation Configuration

Filter Data

Customize these parameters to match your specific evaluation needs:

Parameter     | Description                                    | Example
trace_filters | Criteria for selecting traces within sessions  | {"name": {"contains": "searchrouter"}}
span_filters  | Criteria for selecting spans within traces     | {"parent_id": {"==": None}}

Span filters determine which spans within the matched traces are used for the evaluation. For example, filtering on "parent_id": None keeps only root spans (spans with no parent) in the selected sessions.

Note: Update the trace_filters and span_filters to match your specific evaluation criteria.
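
To make the filter format concrete, here is a minimal interpreter for {column: {operator: expected}} dictionaries. It is a sketch that handles only the operators appearing in the examples above plus "!="; the set of operators the notebook actually supports is not shown and may differ.

```python
import operator

# Operators this sketch understands; the notebook may support a different set.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "contains": lambda value, needle: isinstance(value, str) and needle in value,
}


def matches(record, filters):
    """True if `record` satisfies every {column: {operator: expected}} condition."""
    for column, conditions in filters.items():
        value = record.get(column)
        for op_name, expected in conditions.items():
            if op_name not in _OPS:
                raise ValueError(f"unsupported operator: {op_name}")
            if not _OPS[op_name](value, expected):
                return False
    return True


trace_filters = {"name": {"contains": "searchrouter"}}
span_filters = {"parent_id": {"==": None}}
```

In a pandas DataFrame a missing parent_id is often stored as NaN rather than None, so a DataFrame-based version would likely test with isna() instead of an equality comparison.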

Prepare the data for the evaluation

This step groups the prompt variables by session_id, extracts the required columns, and appends any additional data to the dataframe.
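
A rough sketch of the grouping, again using dictionaries in place of DataFrame rows. The input and output key names (attributes.input.value, attributes.output.value) are assumed, and trace_count is a made-up example of additional data.

```python
from collections import defaultdict


def build_session_rows(spans):
    """Collapse root spans into one row per session, in chronological order."""
    by_session = defaultdict(list)
    for span in sorted(spans, key=lambda s: s["start_time"]):
        by_session[span["attributes.session.id"]].append(span)

    rows = []
    for session_id, session_spans in by_session.items():
        rows.append({
            "session_id": session_id,
            # The two prompt variables, one list entry per interaction.
            "session_user_inputs": [s["attributes.input.value"] for s in session_spans],
            "session_output_messages": [s["attributes.output.value"] for s in session_spans],
            # Example of additional data appended to each row.
            "trace_count": len({s["context.trace_id"] for s in session_spans}),
        })
    return rows
```

Sorting before grouping keeps each session's inputs and outputs aligned turn by turn.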

Running the Evaluation

After preparing your sessions and configuring the evaluation parameters, you can execute the LLM-based evaluation:
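
The loop below sketches the shape of such a run. The judge argument is a hypothetical callable that sends a prompt to an LLM and returns its text reply; it stands in for whichever evaluation library or client you use, which this sketch does not assume.

```python
RAILS = ("correct", "incorrect")  # allowed labels


def parse_judgement(raw):
    """Split a judge reply into (label, explanation); unknown labels become None."""
    text = raw.strip()
    first_word = text.split(maxsplit=1)[0].strip('".,:').lower() if text else ""
    label = first_word if first_word in RAILS else None
    return label, text


def evaluate_sessions(session_rows, template, judge):
    """Render the prompt for each session, call `judge`, and collect the results."""
    results = []
    for row in session_rows:
        prompt = template.format(
            session_user_inputs="\n".join(row["session_user_inputs"]),
            session_output_messages="\n".join(row["session_output_messages"]),
        )
        label, explanation = parse_judgement(judge(prompt))
        results.append({"session_id": row["session_id"], "label": label, "explanation": explanation})
    return results
```

Restricting the label to a fixed set and mapping anything else to None makes malformed replies easy to spot later, instead of letting them pass as a third category.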

Analyzing Results

The evaluation results contain:

  • label: Overall trajectory assessment (correct/incorrect)
  • explanation: Detailed reasoning for the assessment
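
A quick first look at these two fields might be a label count plus the explanations for failing sessions. This sketch assumes the results are a list of dictionaries with session_id, label, and explanation keys.

```python
from collections import Counter


def summarize(results):
    """Count labels and collect explanations for sessions judged incorrect."""
    counts = Counter(r["label"] for r in results)
    failures = {
        r["session_id"]: r["explanation"]
        for r in results
        if r["label"] == "incorrect"
    }
    return counts, failures
```

Reading the explanations for incorrect sessions is usually the fastest way to tell a real agent failure from a prompt that needs tightening.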

The evaluation results can then be merged with your original data for analysis or to log back to Arize:
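
The merge amounts to a left join on the session ID. In the sketch below the column prefix session_eval is a placeholder; the exact column names Arize expects when logging evaluations should be taken from its documentation.

```python
def attach_results(spans, results, eval_name="session_eval"):
    """Copy each session's label and explanation onto that session's spans."""
    by_session = {r["session_id"]: r for r in results}
    merged = []
    for span in spans:
        result = by_session.get(span.get("attributes.session.id"))
        row = dict(span)  # copy, so the input rows stay untouched
        row[f"{eval_name}.label"] = result["label"] if result else None
        row[f"{eval_name}.explanation"] = result["explanation"] if result else None
        merged.append(row)
    return merged
```

Spans from sessions that were not evaluated keep their original fields and receive None for both evaluation columns.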

See your results in Arize