04 Code Graded Classification Evals
Code-graded eval: classification task
In this lesson, we'll implement a slightly more complex code-graded evaluation from scratch to test a customer complaint classification prompt. Our goal is to write a prompt that can reliably classify customer complaints into the following categories:
- Software Bug
- Hardware Malfunction
- User Error
- Feature Request
- Service Outage
For example, the following complaint text:
The website is completely down, I can't access any pages
Should be classified as Service Outage
In some cases, we may want allow up to two applicable classification categories, as in this example:
I think I installed something incorrectly, and now my computer won't start at all
which should be classified as both User Error and Hardware Malfunction
The Evaluation data set
We'll start by defining our evaluation data set of inputs and golden answers. Remember that generally we want an evaluation data set of around 100 inputs, but to keep these lessons simple (and quick and affordable to run), we're using a slimmed down set.
This test set consists of a list of dictionaries where each dictionary contains a complaint and golden_answer key:
An initial prompt
We'll start with a basic prompt and measure how it performs. The prompt-generating function below takes a complaint as an argument and returns a prompt string:
Collecting outputs
Next, we'll write the logic to evaluate the prompt. This logic is a bit more complex than our "leg-counting" example from the previous lesson:
The evaluate_prompt function does the following:
- It passes each input into our prompt-generating function and runs the resulting prompt through the model using the
get_model_responsefunction, collecting the responses as they're generated. - It calculates the accuracy by comparing the model output answers to the golden answers in our data set. To do this it calls the
calculate_accuracyfunction. - The
calculate_accuracyfunction checks to see if the appropriate classification categories are present in each of the model's outputs, using aset. Remember, this is not an exact-match eval like our previous "leg-counting" eval. calculate_accuracyreturns an accuracy scoreevaluate_promptprints the final results
Note that instead of grading via exact string match, as we did in the previous lesson, our grading logic uses a set to check for the presence of values in the model output.
Let's test it out with our initial basic_prompt
Evaluating with model: claude-3-haiku-20240307 Accuracy: 85.00% Complaint: The app crashes every time I try to upload a photo Golden Answer: ['Software Bug'] Model Response: Software Bug Complaint: My printer isn't recognized by my computer Golden Answer: ['Hardware Malfunction'] Model Response: Hardware Malfunction Complaint: I can't figure out how to change my password Golden Answer: ['User Error'] Model Response: User Error Complaint: The website is completely down, I can't access any pages Golden Answer: ['Service Outage'] Model Response: Service Outage Complaint: It would be great if the app had a dark mode option Golden Answer: ['Feature Request'] Model Response: Feature Request Complaint: The software keeps freezing when I try to save large files Golden Answer: ['Software Bug'] Model Response: Software Bug Complaint: My wireless mouse isn't working, even with new batteries Golden Answer: ['Hardware Malfunction'] Model Response: Hardware Malfunction Complaint: I accidentally deleted some important files, can you help me recover them? Golden Answer: ['User Error'] Model Response: User Error Complaint: None of your servers are responding, is there an outage? Golden Answer: ['Service Outage'] Model Response: Service Outage Complaint: Could you add a feature to export data in CSV format? Golden Answer: ['Feature Request'] Model Response: Feature Request Complaint: The app is crashing and my phone is overheating Golden Answer: ['Software Bug', 'Hardware Malfunction'] Model Response: Hardware Malfunction Software Bug Complaint: I can't remember my password! Golden Answer: ['User Error'] Model Response: User Error Complaint: The new update broke something and the app no longer works for me Golden Answer: ['Software Bug'] Model Response: Software Bug Complaint: I think I installed something incorrectly, now my computer won't start at all Golden Answer: ['User Error', 'Hardware Malfunction'] Model Response: User Error, Hardware Malfunction Complaint: Your service is down, and I urgently need a feature to batch process files Golden Answer: ['Service Outage', 'Feature Request'] Model Response: Feature Request, Service Outage Complaint: The graphics card is making weird noises Golden Answer: ['Hardware Malfunction'] Model Response: Hardware Malfunction Complaint: My keyboard just totally stopped working out of nowhere Golden Answer: ['Hardware Malfunction'] Model Response: Hardware Malfunction Complaint: Whenever I open your app, my phone gets really slow Golden Answer: ['Software Bug'] Model Response: Hardware Malfunction Complaint: Can you make the interface more user-friendly? I always get lost in the menus Golden Answer: ['Feature Request', 'User Error'] Model Response: Feature Request Complaint: The cloud storage isn't syncing and I can't access my files from other devices Golden Answer: ['Software Bug', 'Service Outage'] Model Response: Software Bug, Service Outage
0.85
An improved prompt
Our initial prompt resulted in an 85% accuracy score. Let's make some changes to the prompt and rerun the evaluation, hopefully resulting in a better score.
The following prompt incorporates an expanded explanation of the categories, as well as 9 example input/output pairs:
Let's run the evaluation with our improved prompt:
Evaluating with model: claude-3-haiku-20240307 Accuracy: 100.00% Complaint: The app crashes every time I try to upload a photo Golden Answer: ['Software Bug'] Model Response: Software Bug Complaint: My printer isn't recognized by my computer Golden Answer: ['Hardware Malfunction'] Model Response: Hardware Malfunction Complaint: I can't figure out how to change my password Golden Answer: ['User Error'] Model Response: User Error Complaint: The website is completely down, I can't access any pages Golden Answer: ['Service Outage'] Model Response: Service Outage Complaint: It would be great if the app had a dark mode option Golden Answer: ['Feature Request'] Model Response: Feature Request Complaint: The software keeps freezing when I try to save large files Golden Answer: ['Software Bug'] Model Response: Software Bug Complaint: My wireless mouse isn't working, even with new batteries Golden Answer: ['Hardware Malfunction'] Model Response: Hardware Malfunction Complaint: I accidentally deleted some important files, can you help me recover them? Golden Answer: ['User Error'] Model Response: User Error Complaint: None of your servers are responding, is there an outage? Golden Answer: ['Service Outage'] Model Response: Service Outage Complaint: Could you add a feature to export data in CSV format? Golden Answer: ['Feature Request'] Model Response: Feature Request Complaint: The app is crashing and my phone is overheating Golden Answer: ['Software Bug', 'Hardware Malfunction'] Model Response: Software Bug, Hardware Malfunction Complaint: I can't remember my password! Golden Answer: ['User Error'] Model Response: User Error Complaint: The new update broke something and the app no longer works for me Golden Answer: ['Software Bug'] Model Response: Software Bug Complaint: I think I installed something incorrectly, now my computer won't start at all Golden Answer: ['User Error', 'Hardware Malfunction'] Model Response: Hardware Malfunction, User Error Complaint: Your service is down, and I urgently need a feature to batch process files Golden Answer: ['Service Outage', 'Feature Request'] Model Response: Service Outage, Feature Request Complaint: The graphics card is making weird noises Golden Answer: ['Hardware Malfunction'] Model Response: Hardware Malfunction Complaint: My keyboard just totally stopped working out of nowhere Golden Answer: ['Hardware Malfunction'] Model Response: Hardware Malfunction Complaint: Whenever I open your app, my phone gets really slow Golden Answer: ['Software Bug'] Model Response: Software Bug Complaint: Can you make the interface more user-friendly? I always get lost in the menus Golden Answer: ['Feature Request', 'User Error'] Model Response: User Error, Feature Request Complaint: The cloud storage isn't syncing and I can't access my files from other devices Golden Answer: ['Software Bug', 'Service Outage'] Model Response: Software Bug, Service Outage
1.0
We got 100% accuracy with the newer, improved prompt!
Again, we're following the standard prompt + eval loop outlined in this diagram:
Please keep in mind that this is a very simple evaluation, using a very small dataset. This lesson aims to illustrate the general process of code-graded evaluations, but it is not meant as a canonical example of a production-scale evaluation!
This approach works, but it's a bit laborious to write all the evaluation logic from scratch, and it's difficult to compare results side-by-side. What if we used a tool that generated nicely-formatted results with charts and graphs and made it easy to run an evaluation across multiple models? In the next lesson, we'll see just that! Next up, we'll take a look at an evaluation framework that makes it easy to write repeatable, scalable evaluations for production use-cases.