04 Code Graded Classification Evals

Code-graded eval: classification task

In this lesson, we'll implement a slightly more complex code-graded evaluation from scratch to test a customer complaint classification prompt. Our goal is to write a prompt that can reliably classify customer complaints into the following categories:

  • Software Bug
  • Hardware Malfunction
  • User Error
  • Feature Request
  • Service Outage

For example, the following complaint text:

The website is completely down, I can't access any pages

should be classified as Service Outage.

In some cases, we may want to allow up to two applicable classification categories, as in this example:

I think I installed something incorrectly, and now my computer won't start at all

which should be classified as both User Error and Hardware Malfunction.


The evaluation data set

We'll start by defining our evaluation data set of inputs and golden answers. Remember that we generally want an evaluation data set of around 100 inputs, but to keep these lessons simple (and quick and affordable to run), we're using a slimmed-down set.

This test set consists of a list of dictionaries, where each dictionary contains a complaint key and a golden_answer key:
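As a rough illustration of that shape, here is a short excerpt. The variable names eval_data and VALID_CATEGORIES are assumptions made for this sketch; the three entries are taken from the results shown later in this lesson, and the full set is larger.

```python
# Illustrative excerpt only. The names `eval_data` and `VALID_CATEGORIES`
# are assumptions; the full data set in the lesson has more entries.
VALID_CATEGORIES = {
    "Software Bug",
    "Hardware Malfunction",
    "User Error",
    "Feature Request",
    "Service Outage",
}

eval_data = [
    {
        "complaint": "The app crashes every time I try to upload a photo",
        "golden_answer": ["Software Bug"],
    },
    {
        "complaint": "The website is completely down, I can't access any pages",
        "golden_answer": ["Service Outage"],
    },
    {
        "complaint": "I think I installed something incorrectly, now my computer won't start at all",
        "golden_answer": ["User Error", "Hardware Malfunction"],
    },
]

# Sanity check: every golden answer uses one or two known categories.
for item in eval_data:
    assert 1 <= len(item["golden_answer"]) <= 2
    assert set(item["golden_answer"]) <= VALID_CATEGORIES
```

Storing each golden answer as a list, even when there is only one category, keeps single-label and two-label cases uniform for the grading code.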


An initial prompt

We'll start with a basic prompt and measure how it performs. The prompt-generating function below takes a complaint as an argument and returns a prompt string:
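A minimal sketch of what such a function might look like follows. The function name basic_prompt comes from this lesson, but the prompt wording here is an assumption and may differ from the original notebook cell.

```python
def basic_prompt(complaint: str) -> str:
    """Build a simple classification prompt for one complaint.

    The wording is an assumption made for illustration; only the function
    name and the five categories come from the lesson text.
    """
    return (
        "Classify the following customer complaint into one or more of these "
        "categories: Software Bug, Hardware Malfunction, User Error, "
        "Feature Request, or Service Outage.\n"
        "Only respond with the classification.\n\n"
        f"Complaint: {complaint}\n\n"
        "Classification:"
    )


example_prompt = basic_prompt("The website is completely down, I can't access any pages")
print(example_prompt)
```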


Collecting outputs

Next, we'll write the logic to evaluate the prompt. This logic is a bit more complex than our "leg-counting" example from the previous lesson:


The evaluate_prompt function does the following:

  1. It passes each input into our prompt-generating function and runs the resulting prompt through the model using the get_model_response function, collecting the responses as they're generated.
  2. It calculates the accuracy by comparing the model output answers to the golden answers in our data set. To do this it calls the calculate_accuracy function.
  3. The calculate_accuracy function checks to see if the appropriate classification categories are present in each of the model's outputs, using a set. Remember, this is not an exact-match eval like our previous "leg-counting" eval.
  4. calculate_accuracy returns an accuracy score.
  5. evaluate_prompt prints the final results.
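The steps above can be sketched roughly as follows. This is not the notebook's code: passing get_model_response in as a parameter, splitting responses on commas, and comparing lowercased sets for equality are all assumptions made so the sketch runs without an API key. The fake_model function is a hypothetical stand-in for a real model call.

```python
def calculate_accuracy(eval_data, model_responses):
    """Return the fraction of responses whose category set equals the golden set."""
    if not eval_data:
        raise ValueError("eval_data must not be empty")
    correct = 0
    for item, response in zip(eval_data, model_responses):
        golden = {c.strip().lower() for c in item["golden_answer"]}
        # Assumption: multiple categories are separated by commas.
        predicted = {c.strip().lower() for c in response.split(",")}
        if golden == predicted:
            correct += 1
    return correct / len(eval_data)


def evaluate_prompt(prompt_func, eval_data, get_model_response):
    """Run every complaint through the model, grade the outputs, print results."""
    # Step 1: build a prompt per input and collect the model responses.
    responses = [
        get_model_response(prompt_func(item["complaint"])) for item in eval_data
    ]
    # Steps 2-4: compare the responses with the golden answers.
    accuracy = calculate_accuracy(eval_data, responses)
    # Step 5: print the final results.
    print(f"Accuracy: {accuracy:.2%}")
    for item, response in zip(eval_data, responses):
        print(f"\nComplaint: {item['complaint']}")
        print(f"Golden Answer: {item['golden_answer']}")
        print(f"Model Response: {response}")
    return accuracy


# Hypothetical stand-in for a real model call, so the sketch runs offline.
def fake_model(prompt):
    return "Service Outage" if "down" in prompt else "Software Bug"


sample = [
    {"complaint": "The website is completely down, I can't access any pages",
     "golden_answer": ["Service Outage"]},
    {"complaint": "The app crashes every time I try to upload a photo",
     "golden_answer": ["Software Bug"]},
]
score = evaluate_prompt(lambda c: f"Complaint: {c}", sample, fake_model)
```

In a real run, fake_model would be replaced by a function that sends the prompt to the model and returns the text of its reply.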

Note that instead of grading via exact string match, as we did in the previous lesson, our grading logic uses a set to check for the presence of values in the model output.
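To make the difference concrete, here is a small comparison of exact string matching and set-based grading. The helper name to_category_set and the comma-splitting rule are assumptions made for this sketch, not necessarily what the notebook does.

```python
def to_category_set(text):
    """Split a comma-separated response into a set of normalized category names."""
    return {part.strip().lower() for part in text.split(",") if part.strip()}


golden = ["Service Outage", "Feature Request"]
golden_set = {name.lower() for name in golden}

# Same categories, different order: exact match fails, set comparison passes.
response = "Feature Request, Service Outage"
exact_match = response == ", ".join(golden)
set_match = to_category_set(response) == golden_set

# Only one of two expected categories: set equality still rejects it.
partial_match = to_category_set("Feature Request") == golden_set

print(exact_match, set_match, partial_match)
```

Set comparison ignores the order in which categories are listed, but it still requires every expected category to be present and no extra ones to appear.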

Let's test it out with our initial basic_prompt:

Evaluating with model: claude-3-haiku-20240307
Accuracy: 85.00%

Complaint: The app crashes every time I try to upload a photo
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My printer isn't recognized by my computer
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I can't figure out how to change my password
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The website is completely down, I can't access any pages
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: It would be great if the app had a dark mode option
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The software keeps freezing when I try to save large files
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My wireless mouse isn't working, even with new batteries
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I accidentally deleted some important files, can you help me recover them?
Golden Answer: ['User Error']
Model Response: User Error

Complaint: None of your servers are responding, is there an outage?
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: Could you add a feature to export data in CSV format?
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The app is crashing and my phone is overheating
Golden Answer: ['Software Bug', 'Hardware Malfunction']
Model Response: Hardware Malfunction
Software Bug

Complaint: I can't remember my password!
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The new update broke something and the app no longer works for me
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: I think I installed something incorrectly, now my computer won't start at all
Golden Answer: ['User Error', 'Hardware Malfunction']
Model Response: User Error, Hardware Malfunction

Complaint: Your service is down, and I urgently need a feature to batch process files
Golden Answer: ['Service Outage', 'Feature Request']
Model Response: Feature Request, Service Outage

Complaint: The graphics card is making weird noises
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: My keyboard just totally stopped working out of nowhere
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: Whenever I open your app, my phone gets really slow
Golden Answer: ['Software Bug']
Model Response: Hardware Malfunction

Complaint: Can you make the interface more user-friendly? I always get lost in the menus
Golden Answer: ['Feature Request', 'User Error']
Model Response: Feature Request

Complaint: The cloud storage isn't syncing and I can't access my files from other devices
Golden Answer: ['Software Bug', 'Service Outage']
Model Response: Software Bug, Service Outage
0.85
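When a score comes in below 100%, it can help to list only the cases that were graded as wrong. The sketch below does this for four of the results above. The find_misses helper is hypothetical, and the grading rule (split on commas, compare sets) is an assumption. Under that assumption, three results in the full run miss, which is consistent with the reported 85% (17 of 20).

```python
def to_category_set(text):
    """Assumed parsing rule: categories are separated by commas."""
    return {part.strip().lower() for part in text.split(",") if part.strip()}


def find_misses(results):
    """Return the results whose parsed response differs from the golden set."""
    misses = []
    for result in results:
        golden = {name.lower() for name in result["golden_answer"]}
        if to_category_set(result["response"]) != golden:
            misses.append(result)
    return misses


# Four results copied from the run above (three misses and one hit).
results = [
    {"complaint": "The app is crashing and my phone is overheating",
     "golden_answer": ["Software Bug", "Hardware Malfunction"],
     "response": "Hardware Malfunction\nSoftware Bug"},
    {"complaint": "Whenever I open your app, my phone gets really slow",
     "golden_answer": ["Software Bug"],
     "response": "Hardware Malfunction"},
    {"complaint": "Can you make the interface more user-friendly? I always get lost in the menus",
     "golden_answer": ["Feature Request", "User Error"],
     "response": "Feature Request"},
    {"complaint": "The graphics card is making weird noises",
     "golden_answer": ["Hardware Malfunction"],
     "response": "Hardware Malfunction"},
]

misses = find_misses(results)
for miss in misses:
    print(f"{miss['complaint']!r}: expected {miss['golden_answer']}, got {miss['response']!r}")
```

The first miss is a formatting problem rather than a classification problem: both categories are correct, but they are separated by a newline instead of a comma. The other two are genuine classification errors, so the prompt needs both clearer category guidance and a clearer output format.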

An improved prompt

Our initial prompt resulted in an 85% accuracy score. Let's make some changes to the prompt and rerun the evaluation, hopefully resulting in a better score.

The following prompt incorporates an expanded explanation of the categories, as well as 9 example input/output pairs:
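The general structure of such a prompt might look like the sketch below. It is not the lesson's prompt: the function name improved_prompt_sketch, the category descriptions, and the two examples are all invented for illustration, whereas the lesson's version includes nine examples.

```python
# Hypothetical category descriptions, written for this sketch only.
CATEGORY_GUIDE = {
    "Software Bug": "the software itself behaves incorrectly (crashes, freezes, errors)",
    "Hardware Malfunction": "a physical device or component is failing",
    "User Error": "the problem stems from how the customer is using the product",
    "Feature Request": "the customer asks for new or changed functionality",
    "Service Outage": "a service or site is unavailable",
}

# Hypothetical few-shot examples; the lesson's prompt uses nine.
EXAMPLES = [
    ("The export button does nothing when I click it", "Software Bug"),
    ("My laptop fan rattles, and I would love a quiet mode setting",
     "Hardware Malfunction, Feature Request"),
]


def improved_prompt_sketch(complaint):
    """Build a prompt with category explanations and example input/output pairs."""
    categories = "\n".join(
        f"- {name}: {description}" for name, description in CATEGORY_GUIDE.items()
    )
    examples = "\n\n".join(
        f"Complaint: {text}\nClassification: {label}" for text, label in EXAMPLES
    )
    return (
        "Classify the customer complaint into one or two of these categories:\n"
        f"{categories}\n\n"
        "Respond with only the category names, separated by commas.\n\n"
        f"Examples:\n{examples}\n\n"
        f"Complaint: {complaint}\nClassification:"
    )


print(improved_prompt_sketch("Whenever I open your app, my phone gets really slow"))
```

Stating the expected output format explicitly (category names separated by commas) matters as much as the category definitions, because the code grader can only parse responses that follow that format.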


Let's run the evaluation with our improved prompt:

Evaluating with model: claude-3-haiku-20240307
Accuracy: 100.00%

Complaint: The app crashes every time I try to upload a photo
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My printer isn't recognized by my computer
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I can't figure out how to change my password
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The website is completely down, I can't access any pages
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: It would be great if the app had a dark mode option
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The software keeps freezing when I try to save large files
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My wireless mouse isn't working, even with new batteries
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I accidentally deleted some important files, can you help me recover them?
Golden Answer: ['User Error']
Model Response: User Error

Complaint: None of your servers are responding, is there an outage?
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: Could you add a feature to export data in CSV format?
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The app is crashing and my phone is overheating
Golden Answer: ['Software Bug', 'Hardware Malfunction']
Model Response: Software Bug, Hardware Malfunction

Complaint: I can't remember my password!
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The new update broke something and the app no longer works for me
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: I think I installed something incorrectly, now my computer won't start at all
Golden Answer: ['User Error', 'Hardware Malfunction']
Model Response: Hardware Malfunction, User Error

Complaint: Your service is down, and I urgently need a feature to batch process files
Golden Answer: ['Service Outage', 'Feature Request']
Model Response: Service Outage, Feature Request

Complaint: The graphics card is making weird noises
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: My keyboard just totally stopped working out of nowhere
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: Whenever I open your app, my phone gets really slow
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: Can you make the interface more user-friendly? I always get lost in the menus
Golden Answer: ['Feature Request', 'User Error']
Model Response: User Error, Feature Request

Complaint: The cloud storage isn't syncing and I can't access my files from other devices
Golden Answer: ['Software Bug', 'Service Outage']
Model Response: Software Bug, Service Outage
1.0

We got 100% accuracy with the newer, improved prompt!

Again, we're following the standard prompt + eval loop outlined in this diagram:

process.png

Please keep in mind that this is a very simple evaluation, using a very small dataset. This lesson aims to illustrate the general process of code-graded evaluations, but it is not meant as a canonical example of a production-scale evaluation!

This approach works, but writing all the evaluation logic from scratch is laborious, and it's difficult to compare results side by side. What if we used a tool that generated nicely formatted results with charts and graphs and made it easy to run an evaluation across multiple models? In the next lesson, we'll look at an evaluation framework that does just that, making it easy to write repeatable, scalable evaluations for production use cases.