
Anthropic workbench evaluations

This lesson will show you how to use the Anthropic Workbench to run your own human-graded evaluations. The Workbench is an easy-to-use visual interface for quickly prototyping prompts and running human-graded evaluations. While we generally recommend a more scalable approach for production evaluations, the Workbench is a great place to start with human-graded evaluations before moving to more rigorous code-graded or model-graded evals.

In this lesson we'll see how to use the Workbench to test prompts, run simple evaluations, and compare prompt versions.


The Anthropic Workbench

Anthropic's Workbench is a great place to quickly prototype prompts and run human-graded evaluations. This is what the Workbench looks like when we first load it:

empty_workbench.png

On the left side we can enter a prompt. Let's imagine we're working on a code-translation application and want to write the best possible prompt to use the Anthropic API to translate code from any coding language into Python. Here's an initial attempt at a prompt:

You are a skilled programmer tasked with translating code from one programming language to Python. Your goal is to produce an accurate and idiomatic Python translation of the provided source code.

Here is the source code to translate:

<source_code>
{{SOURCE_CODE}}
</source_code>

The source code is written in the following language:

<source_language>
{{SOURCE_LANGUAGE}}
</source_language>

Please translate this code to Python.

Notice the {{SOURCE_CODE}} and {{SOURCE_LANGUAGE}} variables, which we will later replace with dynamic values.
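
To make the substitution concrete, here is a minimal sketch of how those `{{VARIABLE}}` placeholders get filled in before the prompt is sent to the model. This is plain string replacement in Python; the `fill_template` helper and the Ruby test values are our own illustration, not part of the Workbench itself:

```python
# Illustrative sketch: filling {{VARIABLE}} placeholders with concrete values.
# The Workbench does this for you; this shows the equivalent in plain Python.

PROMPT_TEMPLATE = """Here is the source code to translate:

<source_code>
{{SOURCE_CODE}}
</source_code>

The source code is written in the following language:

<source_language>
{{SOURCE_LANGUAGE}}
</source_language>

Please translate this code to Python."""

def fill_template(template: str, variables: dict) -> str:
    """Replace each {{NAME}} placeholder with its corresponding value."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template

prompt = fill_template(PROMPT_TEMPLATE, {
    "SOURCE_CODE": 'puts "Hello, world!"',
    "SOURCE_LANGUAGE": "Ruby",
})
print(prompt)
```

In the Workbench, we supply those values through the UI instead.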

We can put this prompt into the left side of the workbench:

workbench_with_prompt.png

Next, we can set test values for our variables by clicking on the variables ({ }) button:

variables_button.png

This will open a dialog, asking us to input values for the {{SOURCE_CODE}} and {{SOURCE_LANGUAGE}} variables:

adding_variables.png

Next, we can hit run and see the resulting output from the model:

first_output.png


Workbench evaluations

Testing our prompt with one set of variables at a time is a good place to start, but the Workbench also comes with a built-in evaluation tool to help us run prompts against multiple inputs. To switch over to the evaluate view, click the "Evaluate" toggle button at the top:

evaluate_button.png

This opens the evaluate view, with our initial result pre-populated:

evaluate1.png

Next, we can click the "Add Row" button to add some new test cases. Let's add in two new test cases: some Ruby code and some C# code:

evaluate2.png

Next, we can either click the individual "Run" buttons next to each test case, or we can click the orange "Run Remaining" button to run all remaining un-run test cases:

run_remaining.png

Let's click the Run Remaining button and take a look at our model responses. These are the results we got:

evaluate3.png
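
What the Evaluate view does here is something you could also script yourself once you outgrow the UI: fill the template once per test case and send one Messages API request per row. Below is a rough sketch of building those requests. The request shape follows the Anthropic Messages API, but the test cases and model name are illustrative, and the actual API call is left out:

```python
# Sketch: building one Messages API request per evaluation row.
# The template, test cases, and model name are illustrative examples.

TEMPLATE = (
    "Here is the source code to translate:\n"
    "<source_code>\n{{SOURCE_CODE}}\n</source_code>\n"
    "The source code is written in the following language:\n"
    "<source_language>\n{{SOURCE_LANGUAGE}}\n</source_language>\n"
    "Please translate this code to Python."
)

test_cases = [
    {"SOURCE_CODE": "console.log('hi');", "SOURCE_LANGUAGE": "JavaScript"},
    {"SOURCE_CODE": 'puts "hi"', "SOURCE_LANGUAGE": "Ruby"},
    {"SOURCE_CODE": 'Console.WriteLine("hi");', "SOURCE_LANGUAGE": "C#"},
]

def build_request(case: dict) -> dict:
    """Fill the template for one test case and wrap it as a request payload."""
    prompt = TEMPLATE
    for name, value in case.items():
        prompt = prompt.replace("{{" + name + "}}", value)
    return {
        "model": "claude-3-5-sonnet-20240620",  # illustrative model choice
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": prompt}],
    }

requests = [build_request(case) for case in test_cases]
# Each payload could then be sent with client.messages.create(**payload)
print(f"Built {len(requests)} requests")
```

This is essentially the transition the lesson mentions: the same eval, moved from the Workbench into code you can scale.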


Human grading

Now it's time to take a close look at the model outputs and give them scores. In the right column, we have the option of assigning a score to each output: