Summarization Experiment

Arize Prompt Experimentation

This guide demonstrates how to use Arize to log and analyze prompt iteration experiments with your LLM. You will build a simple prompt experimentation pipeline that generates outputs from different variations of a base prompt and logs each output to an Arize dataset along with the prompt that produced it.

Arize makes it easy to track and compare results from prompt iteration experiments, so you can identify which prompt variations perform best. You can read more about experiment tracking in the Arize documentation.

In this tutorial, you will:

  • Set up an Arize dataset to log the prompts and generated outputs from our experiments

  • Create a base prompt and define a set of variations to experiment with

  • Implement a script that iterates through the prompt variations, generates outputs using an LLM, and logs each prompt-output pair to the Arize dataset

  • Analyze the logged data in Arize to compare results across prompt variations and identify the best performing prompts

By leveraging Arize for experiment tracking, you'll be able to systematically test different prompt variations at scale and use the logged data to inform your prompt engineering process. Let's get started!

ℹ️ This notebook requires:

  • An OpenAI API key
  • An Arize Space ID & Developer Key (explained below)

Step 1: Setup Config

Copy your Arize Developer Key and Space ID from the Datasets page (shown below) into the variables in the cell below.
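
As a minimal sketch of what such a configuration cell might look like, the snippet below reads the credentials from environment variables and falls back to visible placeholders. The variable names `ARIZE_SPACE_ID`, `ARIZE_DEVELOPER_KEY`, and `OPENAI_API_KEY`, and the helper `read_setting`, are illustrative assumptions rather than names the Arize SDK requires.

```python
import os


def read_setting(name: str, placeholder: str) -> str:
    """Return environment variable `name`, or `placeholder` if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    return value or placeholder


# Hypothetical variable names: replace the placeholders with the values copied
# from the Datasets page, or export the variables before starting the notebook.
SPACE_ID = read_setting("ARIZE_SPACE_ID", "YOUR_SPACE_ID")
DEVELOPER_KEY = read_setting("ARIZE_DEVELOPER_KEY", "YOUR_DEVELOPER_KEY")
OPENAI_API_KEY = read_setting("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
```

Reading from the environment keeps secrets out of the saved notebook; pasting the values directly into the variables works just as well for a quick run.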

[ ]

Install dependencies

[ ]

Let's make sure we can run async code in the notebook.

[ ]

Step 2: Download Data

Download your data from Hugging Face and inspect a random sample of ten rows. This dataset contains news articles and human-written summaries, which will serve as references against which to compare the LLM-generated summaries.

Upload the data as a dataset in Arize and inspect the individual examples of the dataset. Later in the notebook, you will run experiments over this dataset in order to iteratively improve your summarization application.

[ ]
[ ]
[ ]

Define Your Experiment Task

A task is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM. An experiment maps a task across all the examples in a dataset and optionally executes evaluators to grade the task outputs.
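
To make the relationship between tasks, evaluators, and experiments concrete, here is a small pure-Python sketch of the idea. It is not how Arize implements experiments: `run_local_experiment`, the toy task and evaluator, and the `article` and `summary` field names are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional

Example = Dict[str, Any]
Task = Callable[[Example], str]
Evaluator = Callable[[str, Example], float]


def run_local_experiment(
    dataset: List[Example],
    task: Task,
    evaluators: Optional[Dict[str, Evaluator]] = None,
) -> List[Dict[str, Any]]:
    """Apply `task` to every example, then score each output with every evaluator."""
    results = []
    for example in dataset:
        output = task(example)
        scores = {
            name: evaluate(output, example)
            for name, evaluate in (evaluators or {}).items()
        }
        results.append({"example": example, "output": output, "scores": scores})
    return results


def first_sentence(example: Example) -> str:
    """Toy task: 'summarize' by keeping the first sentence of the article."""
    return example["article"].split(". ")[0].rstrip(".") + "."


def word_count(output: str, example: Example) -> float:
    """Toy evaluator: length of the output in whitespace-separated words."""
    return float(len(output.split()))


toy_dataset = [
    {
        "article": "The council approved the budget. Debate lasted hours.",
        "summary": "Council approves budget.",
    },
]
results = run_local_experiment(toy_dataset, first_sentence, {"word_count": word_count})
print(results[0]["output"], results[0]["scores"])
```

The real experiment runner additionally records each run so that results can be compared across prompt variations; the loop above only shows the mapping itself.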

First, define a function to format a prompt template and invoke an OpenAI model on an example.

[ ]

From this function, you can use functools.partial to derive your first task, which is a callable that takes in an example and returns an output.
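
The pattern can be sketched as follows. To keep the example runnable offline, the model call is replaced by a stand-in function; `summarize_article`, `echo_model`, the `complete` parameter, and the `article` field name are hypothetical and not taken from the original notebook.

```python
from functools import partial
from typing import Any, Callable, Dict

Example = Dict[str, Any]


def summarize_article(
    example: Example,
    prompt_template: str,
    complete: Callable[[str], str],
) -> str:
    """Fill the template with the example's article and return the model's reply."""
    prompt = prompt_template.format(article=example["article"])
    return complete(prompt)


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM call: returns the last line of the prompt."""
    return prompt.strip().splitlines()[-1]


template = "Summarize the article in one sentence.\n\nARTICLE\n{article}"

# Bind everything except `example`; the result is a one-argument task.
task = partial(summarize_article, prompt_template=template, complete=echo_model)

print(task({"article": "Rain is expected across the region on Friday."}))
```

Because only the template is bound, trying a new prompt variation means creating another `partial` with a different `prompt_template`, leaving the underlying function untouched.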

[ ]

Define Your Evaluators

Evaluators take the output of a task (in this case, a string) and grade it, often with the help of an LLM. In your case, you will create ROUGE score evaluators to compare the LLM-generated summaries with the human reference summaries you uploaded as part of your dataset. There are several variants of ROUGE, but we'll use ROUGE-1 for simplicity:

  • ROUGE-1 precision is the proportion of tokens in the generated summary that also appear in the reference summary (number of overlapping tokens / number of tokens in the generated summary).
  • ROUGE-1 recall is the proportion of tokens in the reference summary that also appear in the generated summary (number of overlapping tokens / number of tokens in the reference summary).
  • ROUGE-1 F1 score is the harmonic mean of precision and recall, providing a single number that balances these two scores.

Higher ROUGE scores mean that a generated summary is more similar to the corresponding reference summary. Scores near 0.5 are considered excellent, and a model fine-tuned on this particular dataset achieved a ROUGE score of ~0.44.
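
The three ROUGE-1 quantities defined above can be computed by hand, as in the sketch below. It assumes a simple lowercase word tokenizer and clipped unigram counts; the `rouge` library may tokenize or normalize text differently, so its numbers will not necessarily match.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(generated: str, reference: str) -> Dict[str, float]:
    """Compute ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    generated_counts = Counter(tokenize(generated))
    reference_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((generated_counts & reference_counts).values())
    generated_total = sum(generated_counts.values())
    reference_total = sum(reference_counts.values())
    precision = overlap / generated_total if generated_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1(
    generated="The cat sat on the mat and looked out the window",
    reference="The cat sat on the mat",
)
print(scores)
```

Here the generated text contains the whole reference plus extra words, so recall is 1.0 while precision is 6/11 and F1 is 12/17 (about 0.71). A verbose summary is penalized through precision even when it covers everything in the reference.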

Because conciseness also matters, you'll define an evaluator that counts the number of tokens in each generated summary.

Note that you can use any third-party library you like while defining evaluators (in your case, rouge and tiktoken).

[ ]

Check your evaluators by running them on a single example.

[ ]

Run Experiments and Iterate on Your Prompt Template

Run your first experiment with the first prompt template.

[ ]

Our initial prompt template contained little guidance and resulted in a ROUGE-1 F1 score just above 0.3 (this will vary from run to run). Inspecting the task outputs of the experiment, you'll also notice that the generated summaries are far more verbose than the reference summaries, which results in high ROUGE-1 recall and low ROUGE-1 precision. Let's see if we can improve the prompt to make the summaries more concise and to balance recall and precision while maintaining or improving F1. We'll start by explicitly instructing the LLM to produce a concise summary.

[ ]

Inspecting the experiment results, you'll notice that the average num_tokens has indeed decreased, but the generated summaries are still far more verbose than the reference summaries.

Instead of just instructing the LLM to produce concise summaries, let's use a few-shot prompt to show it examples of articles and good summaries. The cell below includes a few articles and reference summaries in an updated prompt template.
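
One way to assemble such a few-shot template is sketched below. The example articles and summaries are invented placeholders (in the tutorial they would come from the dataset), and `build_few_shot_template` and the `article` and `summary` field names are hypothetical.

```python
from typing import Dict, List

# Placeholder demonstrations, not rows from the actual dataset.
few_shot_examples: List[Dict[str, str]] = [
    {
        "article": "The city council voted on Tuesday to fund two new bus routes.",
        "summary": "Council funds two new bus routes.",
    },
    {
        "article": "Researchers reported that the vaccine trial met its main goal.",
        "summary": "Vaccine trial meets main goal.",
    },
]


def build_few_shot_template(examples: List[Dict[str, str]]) -> str:
    """Return a prompt template with worked examples and an {article} placeholder."""
    parts = ["Summarize the article concisely, following the examples."]
    for number, example in enumerate(examples, start=1):
        # Escape braces so example text cannot be mistaken for format fields.
        article = example["article"].replace("{", "{{").replace("}", "}}")
        summary = example["summary"].replace("{", "{{").replace("}", "}}")
        parts.append(f"EXAMPLE {number}\nARTICLE: {article}\nSUMMARY: {summary}")
    # Plain string on purpose: {article} stays a placeholder for str.format.
    parts.append("ARTICLE: {article}\nSUMMARY:")
    return "\n\n".join(parts)


template = build_few_shot_template(few_shot_examples)
prompt = template.format(article="Heavy snow closed schools across the county.")
print(prompt)
```

Ending the prompt with an open `SUMMARY:` line mirrors the demonstrations, which nudges the model toward the same short style as the example summaries.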

[ ]

Now run the experiment.

[ ]

With examples included in the prompt, you'll notice a steep decline in the number of tokens per summary while F1 is maintained.