Evaluate QA Classifications

Q&A Classification Evals

The purpose of this notebook is:

  • to evaluate the performance of an LLM-assisted approach to detecting issues with Q&A systems on retrieved context data;
  • to provide an experimental framework for users to iterate and improve on the default classification template.

Install Dependencies and Import Libraries

Note: This notebook was last updated on May 30, 2025.
[ ]
[ ]

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use nest_asyncio. nest_asyncio globally patches asyncio to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without nest_asyncio, eval submission can be much slower, depending on your organization's rate limits. With it enabled, speed increases of about 5x are typical.

[ ]
[ ]
[ ]
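
The speed-up from async submission comes from overlapping the wait time of many requests instead of sending them one by one. The following is a minimal sketch of that idea using only the standard library. `fake_eval_request` is a hypothetical stand-in for a network call to the LLM and `MAX_CONCURRENCY` is an assumed limit; neither comes from the notebook, and this is not how the eval library itself is implemented.

```python
import asyncio
import time

MAX_CONCURRENCY = 5  # assumed cap; tune to your organization's rate limits


async def fake_eval_request(i: int) -> int:
    """Hypothetical stand-in for one LLM request (it only sleeps)."""
    await asyncio.sleep(0.05)
    return i


async def run_concurrently(n: int) -> list:
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(i: int) -> int:
        async with semaphore:
            return await fake_eval_request(i)

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(i) for i in range(n)))


def timed_run(n: int):
    """Run n fake requests and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(run_concurrently(n))
    return list(results), time.perf_counter() - start
```

In a notebook an event loop is already running, so `asyncio.run` would raise an error there; that is the situation in which applying nest_asyncio first becomes useful.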

Download Benchmark Dataset

  • SQuAD 2.0: Version 2.0 of the Stanford Question Answering Dataset (SQuAD), a large-scale dataset that allows researchers to design AI models for reading comprehension tasks under challenging constraints. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15785042.pdf
  • Supplemental data to SQuAD 2.0: To test the detection of incorrect answers, we created wrong answers based on the context data. The wrong answers are intermixed with right answers.
  • sampled_answer is a column that holds, at random, either the original SQuAD 2.0 answer or an incorrect answer.
[ ]
  • question: The question posed to the Q&A system.
  • sampled_answer: Randomly sampled from either answers (the right answer from SQuAD 2.0) or wrong_answer (a made-up incorrect answer). This is the column we test against, as it contains both wrong and right answers.
  • correct_answer: True if sampled_answer is correct, False if not. This is the ground truth to test against.
  • answers: The right answer to the question.
  • wrong_answer: An incorrect answer generated from the context.
  • context: The context to be used to answer the question. This is what the Q&A Eval must use to check whether the answer is correct.
[ ]
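
To make the relationship between the columns concrete, here is a minimal sketch of how `sampled_answer` and `correct_answer` could be derived from `answers` and `wrong_answer`. The benchmark file already contains these columns, so this is illustrative only; the 50/50 mix, the `build_sampled_rows` helper, and the toy rows are assumptions, not taken from the dataset.

```python
import random


def build_sampled_rows(rows, seed=0):
    """Return copies of rows with sampled_answer and correct_answer added.

    Each input row is a dict with "answers" (the right answer) and
    "wrong_answer" (a made-up incorrect answer).
    """
    rng = random.Random(seed)
    out = []
    for row in rows:
        use_correct = rng.random() < 0.5  # assumed 50/50 mix of right and wrong
        sampled = row["answers"] if use_correct else row["wrong_answer"]
        out.append({**row, "sampled_answer": sampled, "correct_answer": use_correct})
    return out


# Toy rows for illustration only; they are not from SQuAD 2.0.
example_rows = [
    {"question": "What color is the sky?", "context": "The sky is blue.",
     "answers": "blue", "wrong_answer": "green"},
    {"question": "How many legs does a spider have?", "context": "Spiders have eight legs.",
     "answers": "eight", "wrong_answer": "six"},
]
```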

Display Binary Q&A Classification Template

View the default template used to classify answers as correct or incorrect. You can tweak this template and evaluate its performance relative to the default.

[ ]
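
For readers iterating on their own variant, the sketch below shows the general shape of a binary classification template: a prompt with placeholders plus a fixed set of allowed output labels ("rails"). The template wording, the placeholder names, the `render_prompt` and `snap_to_rails` helpers, and the `NOT_PARSABLE` fallback are all hypothetical and written for illustration; this is not the library's default template.

```python
# Hypothetical template for illustration; not the default template.
TEMPLATE = """You are given a question, a reference text and an answer.
Decide whether the answer correctly answers the question based only on
the reference text.

[Question]: {question}
[Reference]: {context}
[Answer]: {sampled_answer}

Respond with a single word, either "correct" or "incorrect"."""

# The only labels the classifier is allowed to return.
RAILS = ["correct", "incorrect"]


def render_prompt(row: dict) -> str:
    """Fill the template placeholders from one dataset row."""
    return TEMPLATE.format(
        question=row["question"],
        context=row["context"],
        sampled_answer=row["sampled_answer"],
    )


def snap_to_rails(raw_output: str, rails=RAILS, default="NOT_PARSABLE") -> str:
    """Normalize raw model text to one of the rails, else a fallback label."""
    text = raw_output.strip().lower().strip(".\"'")
    return text if text in rails else default
```

Constraining the output to a small set of labels keeps the results directly comparable with the boolean ground truth.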

Configure the API Key

Configure your OpenAI API key.

[ ]
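
One common way to do this is to read the key from the `OPENAI_API_KEY` environment variable and prompt for it only when it is missing. This is a minimal sketch; the `ensure_api_key` helper is a hypothetical name introduced here, not part of the notebook.

```python
import os
from getpass import getpass


def ensure_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key, prompting for it only if it is not already set."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so the key is not echoed in the output.
        key = getpass(f"Enter your {env_var}: ")
        os.environ[env_var] = key
    return key
```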

Benchmark Dataset Sample

Sample size determines run time. We recommend iterating on a small sample (100 rows) first, then increasing to a larger test set.

[ ]
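
A reproducible sample makes it easier to compare template variants on the same rows. The sketch below uses the standard library on a list of rows; the `sample_rows` helper and the seed value are assumptions for illustration. With a pandas DataFrame, `df.sample(n=100, random_state=...)` serves the same purpose.

```python
import random


def sample_rows(rows, n=100, seed=2023):
    """Return up to n rows chosen without replacement, reproducibly."""
    rng = random.Random(seed)  # fixed seed so reruns pick the same rows
    n = min(n, len(rows))      # avoid an error when the dataset is small
    return rng.sample(rows, n)
```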

LLM Evals: Q&A Classifications GPT-4

Run Q&A classifications against a subset of the data.

Instantiate the LLM and set parameters.

[ ]
[ ]

Run the LLM Eval using the template against the dataset. This is the main Eval function.

[ ]
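
Conceptually, this step sends each row to a judge and records one label per row. The sketch below shows that loop with a toy judge in place of the LLM so that it runs offline. `classify_rows`, `substring_judge`, and the `NOT_PARSABLE` fallback are hypothetical names for illustration and do not represent the eval library's API.

```python
RAILS = ["correct", "incorrect"]


def classify_rows(rows, judge, rails=RAILS, default="NOT_PARSABLE"):
    """Call judge(row) for every row and keep only labels that are in rails."""
    labels = []
    for row in rows:
        raw = judge(row)  # in the real setup this would be an LLM call
        label = raw.strip().lower()
        labels.append(label if label in rails else default)
    return labels


def substring_judge(row) -> str:
    """Toy stand-in for an LLM: 'correct' if the answer appears in the context."""
    found = row["sampled_answer"].lower() in row["context"].lower()
    return "correct" if found else "incorrect"
```

Keeping the judge as a plain callable makes it straightforward to swap in a different model or template while leaving the rest of the loop unchanged.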

Evaluate the predictions against human-labeled ground-truth Q&A labels.

[ ]
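
The comparison against ground truth reduces to counting agreements and disagreements between the predicted labels and the `correct_answer` column. This is a minimal standard-library sketch of precision, recall, and F1 for the "correct" class; the `binary_metrics` and `to_label` helpers are hypothetical names, and a metrics library could be used instead.

```python
def to_label(is_correct: bool) -> str:
    """Map the boolean correct_answer column to the classifier's labels."""
    return "correct" if is_correct else "incorrect"


def binary_metrics(y_true, y_pred, positive="correct"):
    """Confusion-matrix counts plus precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}
```

Any prediction that is not the positive label, including an unparsable output, counts against recall when the ground truth is positive.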

LLM Evals: Q&A Classifications GPT-3.5

Evaluate the predictions against human-labeled ground-truth Q&A labels.

[ ]
[ ]
[ ]

LLM Evals: Q&A Classifications GPT-4 Turbo

Evaluate the predictions against human-labeled ground-truth Q&A labels.

[ ]
[ ]
[ ]