Evaluate Human vs AI Classifications

Human/GroundTruth Versus AI Evals

Arize provides tooling to evaluate LLM applications, including tools to determine whether AI answers match human ground-truth answers. In many Q&A systems, it's important to compare the AI's answers against human answers prior to deployment. These comparisons help assess how often the AI system generates correct answers.

The purpose of this notebook is:

  • to evaluate the performance of LLM-assisted evals for AI vs human answers
  • to provide an experimental framework for users to iterate and improve on the default classification template.

Note: This notebook was last updated on May 30, 2025.

Install Dependencies and Import Libraries

[ ]
[ ]

Download the Dataset

We've crafted a dataset of common questions and answers about the Arize platform.

[ ]

Visualization of Prompt/Template Evals in Phoenix (Optional Section)

Visualizing evals is not required, but it can be helpful to see the actual calls made to the LLM. The cell below starts the Phoenix server and provides a link to the Phoenix UI running locally.

[ ]

Human vs AI Template

View the default template used to evaluate the AI answers.

[ ]

The template variables are:

  • question: the question asked by a user
  • correct_answer: human-labeled correct answer
  • ai_answer: AI-generated answer
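
To make the role of the three variables concrete, here is a minimal sketch of how a template with these placeholders gets filled from one dataset row. The template text, the `render_prompt` helper, and the example row are hypothetical stand-ins for illustration; they are not the default Phoenix template or data from the Arize dataset.

```python
# Hypothetical stand-in for the evaluation template (NOT the Phoenix default).
EXAMPLE_TEMPLATE = (
    "You are comparing an AI answer to a human ground-truth answer.\n"
    "[Question]: {question}\n"
    "[Human Ground Truth Answer]: {correct_answer}\n"
    "[AI Answer]: {ai_answer}\n"
    'Respond with a single word, "correct" or "incorrect".'
)


def render_prompt(row: dict) -> str:
    """Fill the three template variables from one dataset row."""
    return EXAMPLE_TEMPLATE.format(
        question=row["question"],
        correct_answer=row["correct_answer"],
        ai_answer=row["ai_answer"],
    )


# Made-up example row, used only to show the substitution.
example_row = {
    "question": "What is the capital of France?",
    "correct_answer": "Paris",
    "ai_answer": "Paris is the capital of France.",
}

prompt = render_prompt(example_row)
print(prompt)
```

Each row of the evaluation dataset needs columns named after the template variables so the substitution can find them.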

Configure the LLM

Configure your OpenAI API key.

[ ]
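
One common way to do this, sketched here under the assumption that the key is read from the `OPENAI_API_KEY` environment variable, is to prompt for the key only when the variable is not already set. The helper name `ensure_openai_api_key` and its `prompt` parameter are hypothetical.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_openai_api_key(prompt: Callable[[str], str] = getpass) -> str:
    """Return the OpenAI API key, prompting only if the env var is unset.

    `prompt` defaults to getpass so the key is not echoed to the screen;
    it is a parameter so the function can be exercised without user input.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        key = prompt("Enter your OpenAI API key: ")
        os.environ["OPENAI_API_KEY"] = key
    return key
```

Calling `ensure_openai_api_key()` once before instantiating the model keeps the key out of the notebook source.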

LLM Evals: Human Groundtruth vs AI Classifications GPT-4

Run Human vs AI Eval against a subset of the data. Instantiate the LLM and set parameters.

[ ]
[ ]
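
The overall flow of this step (draw a subset, then classify each row) can be sketched offline as follows. This is an illustration only: `sample_rows`, `run_eval`, and `exact_match_stub` are hypothetical names, the rows are made up, and the stub replaces the LLM call with a plain string comparison so the sketch runs without an API key. The label names "correct" and "incorrect" are an assumption.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, str]


def sample_rows(rows: List[Row], n: int, seed: int = 0) -> List[Row]:
    """Draw a reproducible random subset of at most n rows."""
    return random.Random(seed).sample(rows, min(n, len(rows)))


def run_eval(rows: List[Row], classify: Callable[[Row], str]) -> List[str]:
    """Apply a classifier (in the notebook, an LLM call) to each row."""
    return [classify(row) for row in rows]


def exact_match_stub(row: Row) -> str:
    """Offline stand-in for the LLM judge: compares normalized strings."""
    same = row["ai_answer"].strip().lower() == row["correct_answer"].strip().lower()
    return "correct" if same else "incorrect"


# Made-up rows, used only to exercise the functions.
rows = [
    {"question": "2 + 2?", "correct_answer": "4", "ai_answer": "4"},
    {"question": "Capital of France?", "correct_answer": "Paris", "ai_answer": "Lyon"},
    {"question": "Color of the sky?", "correct_answer": "Blue", "ai_answer": "blue"},
]

subset = sample_rows(rows, n=2, seed=0)
labels = run_eval(subset, exact_match_stub)
```

Fixing the seed keeps the subset stable across runs, which makes results from different models or templates comparable.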

Classifications with explanations

When evaluating a dataset for correctness, it can be useful to know why the LLM classified an AI answer as correct or incorrect. The following code block runs llm_classify with explanations turned on so that we can inspect why the LLM made the classification it did. There is a speed tradeoff, since more tokens are generated, but the explanations can be highly informative when troubleshooting.

[ ]
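
Explanations are most useful on the rows where the eval disagrees with the human label. A minimal sketch of pulling those rows out for review, assuming each result carries a `label` and an `explanation` field alongside a human label (called `true_label` here, a hypothetical name); the example results are made up.

```python
from typing import Dict, List


def find_disagreements(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the rows where the eval label differs from the human label."""
    return [r for r in results if r["label"] != r["true_label"]]


# Made-up results, used only to show the filtering.
results = [
    {
        "label": "correct",
        "true_label": "correct",
        "explanation": "The AI answer states the same fact as the human answer.",
    },
    {
        "label": "incorrect",
        "true_label": "correct",
        "explanation": "The AI answer omits a detail the human answer includes.",
    },
]

for row in find_disagreements(results):
    print(f'human={row["true_label"]} eval={row["label"]}: {row["explanation"]}')
```

Reading the explanations for these rows often shows whether the template, the model, or the ground-truth label is the source of the disagreement.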

Evaluate Classifications

Evaluate the predictions against the human-labeled ground-truth labels.

[ ]
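
The comparison comes down to counting agreements and disagreements between the two label lists. The sketch below computes a confusion count plus precision, recall, and F1 using only the standard library; it is a stand-in for whatever metrics library the notebook uses, and the function names and example labels are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def confusion_counts(true: List[str], pred: List[str]) -> Counter:
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(true, pred))


def precision_recall_f1(
    true: List[str], pred: List[str], positive: str
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one label treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(true, pred))
    fp = sum(t != positive and p == positive for t, p in zip(true, pred))
    fn = sum(t == positive and p != positive for t, p in zip(true, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Made-up labels, used only to exercise the functions.
true_labels = ["correct", "correct", "incorrect", "incorrect", "correct"]
pred_labels = ["correct", "incorrect", "incorrect", "correct", "correct"]

counts = confusion_counts(true_labels, pred_labels)
precision, recall, f1 = precision_recall_f1(true_labels, pred_labels, "correct")
print(counts)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting the metrics per label, rather than overall accuracy alone, shows whether the eval errs toward marking answers correct or incorrect.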

LLM Evals: Human Groundtruth vs AI Classifications GPT-3.5 Turbo

Run against a subset of the data using GPT-3.5. GPT-3.5 can significantly speed up the classification process. However, there are tradeoffs, as we will see below.

[ ]
[ ]
[ ]
[ ]
[ ]

Preview: Running with GPT-4 Turbo

[ ]
[ ]