Evaluate Summarization Classifications


Summarization Classification Evals

The purpose of this notebook is:

  • to evaluate the performance of an LLM-assisted approach to evaluating summarization quality,
  • to provide an experimental framework for users to iterate and improve on the default classification template.

Install Dependencies and Import Libraries

[1]
[2]

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use nest_asyncio. nest_asyncio globally patches asyncio to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without nest_asyncio, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
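A minimal sketch of the optional patch (assumes `nest-asyncio` has been pip-installed; skipping it is safe outside notebook environments):

```python
# Optional: make the already-running notebook event loop re-entrant so
# async eval requests can be submitted concurrently. Safe to skip
# outside Jupyter/Colab.
try:
    import nest_asyncio

    nest_asyncio.apply()
    patched = True
except ImportError:
    patched = False  # not installed; evals still run, just more slowly
```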

[3]
[4]

Download Benchmark Dataset

We'll evaluate the evaluation system (the LLM, its settings, and the evaluation prompt template) against a benchmark dataset of documents and summaries with ground-truth quality labels. We will use the CNN Daily Mail dataset, a common benchmark for text summarization models.
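A sketch of the download step, wrapped in a function so it only touches the network when called. The `download_benchmark_dataset` helper and its arguments follow the Phoenix evals tutorials and are assumptions; the exact names may differ across Phoenix versions.

```python
def load_summarization_benchmark():
    """Fetch the CNN Daily Mail benchmark as a pandas DataFrame with
    article text, candidate summaries, and human quality labels."""
    # Assumed Phoenix helper; imported lazily so this sketch loads
    # even when phoenix is not installed.
    from phoenix.evals import download_benchmark_dataset

    return download_benchmark_dataset(
        task="summarization-classification",
        dataset_name="summarization-test",
    )
```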

[5]

Display Binary Summarization Classification Template

View the default template used to classify summarizations. You can tweak this template and evaluate its performance relative to the default.

[6]

You are comparing the summary text and its original document and trying to determine
if the summary is good. Here is the data:
    [BEGIN DATA]
    ************
    [Summary]: {output}
    ************
    [Original Document]: {input}
    [END DATA]
Compare the Summary above to the Original Document and determine if the Summary is
comprehensive, concise, coherent, and independent relative to the Original Document.
Your response must be a single word, either "good" or "bad", and should not contain any text
or characters aside from that. "bad" means that the Summary is not comprehensive,
concise, coherent, and independent relative to the Original Document. "good" means the
Summary is comprehensive, concise, coherent, and independent relative to the Original Document.

Eval template variables:

  • input : The document text to summarize
  • output : The summary of the document
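To see how the two variables are substituted, here is a toy illustration using Python's `str.format` on an abbreviated copy of the template (the real run fills one prompt per dataframe row):

```python
# Abbreviated copy of the template above; {input} and {output} are the
# only placeholders the eval fills in.
template = (
    "You are comparing the summary text and its original document.\n"
    "[Summary]: {output}\n"
    "[Original Document]: {input}\n"
    'Your response must be a single word, either "good" or "bad".'
)

# Toy row: a short "document" and its candidate summary.
prompt = template.format(
    input="The quick brown fox jumps over the lazy dog near the river bank.",
    output="A fox jumps over a dog.",
)
```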

Configure the LLM

Configure your OpenAI API key.
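A common pattern for this step (a sketch; it prompts only when `OPENAI_API_KEY` is not already exported in the environment):

```python
import os


def configure_openai_key() -> None:
    """Ensure OPENAI_API_KEY is set, prompting interactively if needed."""
    if not os.environ.get("OPENAI_API_KEY"):
        from getpass import getpass  # stdlib; hides the typed key

        os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```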

[7]

Benchmark Dataset Sample

Sample size determines run time. We recommend iterating with a small sample (about 100 rows) first, then increasing to a larger test set.
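For example, with pandas (a toy dataframe stands in for the benchmark here; a fixed `random_state` keeps reruns comparable):

```python
import pandas as pd

# Toy stand-in for the benchmark dataframe (the real one carries the
# article text, the candidate summary, and the human label).
df = pd.DataFrame({"article": [f"document {i}" for i in range(1_000)]})

# Iterate on ~100 rows first; scale up once the template stabilizes.
df_sample = df.sample(n=100, random_state=2023).reset_index(drop=True)
```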

[8]

LLM Evals: Summarization Classifications with GPT-4

Run summarization classifications against a subset of the data.

Instantiate the LLM and set parameters.
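A sketch of the classification run, following the Phoenix evals API (`OpenAIModel`, `llm_classify`, and the built-in summarization template are assumed names that may vary across Phoenix versions). The `rails` argument snaps the LLM's raw output onto the two allowed labels:

```python
RAILS = ["good", "bad"]  # the only labels the eval may emit


def run_summarization_eval(dataframe):
    """Classify each (document, summary) row as "good" or "bad"."""
    # Assumed Phoenix names; imported lazily so this sketch loads
    # without phoenix installed.
    from phoenix.evals import (
        OpenAIModel,
        SUMMARIZATION_PROMPT_TEMPLATE,
        llm_classify,
    )

    model = OpenAIModel(model="gpt-4", temperature=0.0)
    return llm_classify(
        dataframe=dataframe,
        template=SUMMARIZATION_PROMPT_TEMPLATE,
        model=model,
        rails=RAILS,
    )
```

Temperature is pinned to 0.0 so repeated runs of the same sample produce stable labels.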

[9]
[10]
"Hello! I'm working perfectly. How can I assist you today?"
[11]
llm_classify |          | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s

Evaluate the predictions against human-labeled ground-truth summarization labels.
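The comparison below can be produced with scikit-learn's `classification_report` and `confusion_matrix`; toy labels stand in here for the dataframe's real ground-truth and predicted columns:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy stand-ins for the human labels and the LLM's predictions.
true_labels = ["good", "good", "bad", "bad", "good", "bad"]
pred_labels = ["good", "bad", "bad", "bad", "good", "good"]

report = classification_report(true_labels, pred_labels, labels=["good", "bad"])
matrix = confusion_matrix(true_labels, pred_labels, labels=["good", "bad"])
print(report)
print(matrix)
```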

[12]
              precision    recall  f1-score   support

        good       0.78      0.88      0.83        52
         bad       0.85      0.73      0.79        48

    accuracy                           0.81       100
   macro avg       0.82      0.81      0.81       100
weighted avg       0.82      0.81      0.81       100

<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>

LLM Evals: Summarization Classifications with GPT-3.5

Run summarization classifications against a subset of the data.
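Only the judge model changes between these sections; a small helper makes the swap explicit (`OpenAIModel` is the assumed Phoenix wrapper, as above):

```python
def make_judge(model_name: str):
    """Build an eval model for the given OpenAI model name,
    e.g. "gpt-3.5-turbo" for this section's run."""
    # Assumed Phoenix wrapper; lazy import keeps the sketch importable.
    from phoenix.evals import OpenAIModel

    return OpenAIModel(model=model_name, temperature=0.0)
```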

[13]
[14]
llm_classify |          | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s
[15]
              precision    recall  f1-score   support

        good       0.69      0.81      0.74        52
         bad       0.74      0.60      0.67        48

    accuracy                           0.71       100
   macro avg       0.72      0.71      0.71       100
weighted avg       0.71      0.71      0.71       100

<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>

LLM Evals: Summarization Classifications with GPT-4 Turbo

Run summarization classifications against a subset of the data.

[16]
[17]
llm_classify |          | 0/100 (0.0%) | ⏳ 00:00<? | ?it/s
[18]
              precision    recall  f1-score   support

        good       0.95      0.67      0.79        52
         bad       0.73      0.96      0.83        48

    accuracy                           0.81       100
   macro avg       0.84      0.82      0.81       100
weighted avg       0.84      0.81      0.81       100

<Axes: title={'center': 'Confusion Matrix (Normalized)'}, xlabel='Predicted Classes', ylabel='Actual Classes'>
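Collecting the accuracies reported above makes the three judges easy to compare side by side:

```python
# Accuracies from the classification reports above (100-row sample).
results = {
    "gpt-4": 0.81,
    "gpt-3.5-turbo": 0.71,
    "gpt-4-turbo": 0.81,
}

# GPT-4 and GPT-4 Turbo tie on accuracy, though their error profiles
# differ: GPT-4 Turbo trades recall on "good" for higher precision.
best_accuracy = max(results.values())
```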