02 Workbench Evals
Anthropic Workbench evaluations
This lesson will show you how to use the Anthropic Workbench to run your own human-graded evaluations. The Workbench is an easy-to-use, visual interface for quickly prototyping prompts and running human-graded evaluations. While we generally recommend a more scalable approach for production evaluations, the Workbench is a great place to start with human grading before moving on to more rigorous code-graded or model-graded evals.
In this lesson we'll see how to use the Workbench to test prompts, run simple evaluations, and compare prompt versions.
The Anthropic Workbench
Anthropic's Workbench lets us quickly prototype prompts and run human-graded evaluations. This is what the Workbench looks like when we first load it:
On the left side we can enter a prompt. Let's imagine we're working on a code-translation application and want to write the best possible prompt to use the Anthropic API to translate code from any coding language into Python. Here's an initial attempt at a prompt:
You are a skilled programmer tasked with translating code from one programming language to Python. Your goal is to produce an accurate and idiomatic Python translation of the provided source code.
Here is the source code to translate:
<source_code>
{{SOURCE_CODE}}
</source_code>
The source code is written in the following language:
<source_language>
{{SOURCE_LANGUAGE}}
</source_language>
Please translate this code to Python.
Notice the {{SOURCE_CODE}} and {{SOURCE_LANGUAGE}} variables, which we will later replace with dynamic values.
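The Workbench handles this substitution for us, but it can help to see what the equivalent call looks like in code. Below is a minimal sketch using the Anthropic Python SDK; the model alias, max_tokens value, and example inputs are illustrative assumptions rather than anything the Workbench requires:

```python
# A rough sketch of what the Workbench does with template variables:
# fill in the {{...}} placeholders, then send the prompt to the model.
import anthropic

PROMPT_TEMPLATE = """You are a skilled programmer tasked with translating code from one programming language to Python. Your goal is to produce an accurate and idiomatic Python translation of the provided source code.

Here is the source code to translate:
<source_code>
{{SOURCE_CODE}}
</source_code>

The source code is written in the following language:
<source_language>
{{SOURCE_LANGUAGE}}
</source_language>

Please translate this code to Python."""


def translate_to_python(source_code: str, source_language: str) -> str:
    # Substitute the placeholders, just as the Workbench does when we fill
    # in the variables dialog.
    prompt = (
        PROMPT_TEMPLATE
        .replace("{{SOURCE_CODE}}", source_code)
        .replace("{{SOURCE_LANGUAGE}}", source_language)
    )

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # example model alias; use whatever you're prototyping with
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


print(translate_to_python("console.log('hello');", "JavaScript"))
```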
We can put this prompt into the left side of the workbench:
Next, we can set test values for our variables by clicking on the variables ({ }) button:
This will open a dialog, asking us to input values for the {{SOURCE_CODE}} and {{SOURCE_LANGUAGE}} variables:
Next, we can hit run and see the resulting output from the model:
Workbench evaluations
Testing our prompt with one set of variables at a time is a good place to start, but the Workbench also comes with a built-in evaluation tool to help us run prompts against multiple inputs. To switch over to the evaluate view, click the "Evaluate" toggle button at the top:
This opens the evaluate view, with our initial result pre-populated:
Next, we can click the "Add Row" button to add some new test cases. Let's add in two new test cases: some Ruby code and some C# code:
Next, we can either click the individual "Run" buttons next to each test case, or we can click the orange "Run Remaining" button to run all remaining un-run test cases:
Let's click the Run Remaining button and take a look at our model responses:
These are the results we got:
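Behind the scenes, each row is just the same prompt run with a different set of variables. If we ever want to reproduce a batch run like this outside the Workbench, a simple loop does the job. Here's a minimal sketch that reuses the translate_to_python helper from earlier; the Ruby and C# snippets are illustrative test inputs:

```python
# Reproducing the "Run Remaining" step outside the Workbench: run the same
# prompt once per test case. Reuses the translate_to_python helper sketched
# above; the code snippets below are illustrative test inputs.
test_cases = [
    {"language": "JavaScript", "code": "const nums = [1, 2, 3];\nnums.forEach(n => console.log(n * 2));"},
    {"language": "Ruby", "code": "nums = [1, 2, 3]\nnums.each { |n| puts n * 2 }"},
    {"language": "C#", "code": "var nums = new[] { 1, 2, 3 };\nforeach (var n in nums) { Console.WriteLine(n * 2); }"},
]

for case in test_cases:
    print(f"--- {case['language']} ---")
    print(translate_to_python(case["code"], case["language"]))
```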
Human grading
Now it's time to take a close look at the model outputs and give them scores. In the right column, we have the option of assigning a score to each output:
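Once each output has a grade, a quick tally makes it easier to compare prompt versions. Here's a minimal sketch, assuming a 5-point scale and scores copied out of the Workbench by hand:

```python
# A quick tally of human-assigned grades, assuming a 5-point scale.
# The scores below are hypothetical; in practice, copy them out of the
# Workbench after grading each output.
from statistics import mean

scores = {"JavaScript": 4, "Ruby": 5, "C#": 3}

print(f"Average score: {mean(scores.values()):.2f} / 5")
for language, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{language}: {score}")
```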