Developing Hallucination Guardrails
A guardrail is a set of rules and checks designed to ensure that the outputs of an LLM are accurate, appropriate, and aligned with user expectations. For more information on developing guardrails, you can refer to our guide on developing guardrails.
In this notebook, we'll walk through the process of developing an output guardrail that specifically checks model outputs for hallucinations.
This notebook will focus on:
- Building out a strong eval set
- Identifying specific criteria to measure hallucinations
- Improving the accuracy of our guardrail with few-shot prompting
1. Building out an eval set
Imagine we are a customer support team building an automated support agent. We will feed the assistant information from our knowledge base about a specific set of policies for handling tickets such as returns, refunds, and feedback, and we expect the model to follow these policies when interacting with customers.
The first thing we will do is use GPT-4o to build out the set of policies that we want the assistant to follow.
If you want to dive deeper into generating synthetic data, you can review our Synthetic Data Generation Cookbook here.
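Below is a minimal sketch of how this policy generation could look with the OpenAI Python SDK. The system prompt, the `POLICY_TOPICS` list, and the `generate_policy` helper are illustrative assumptions for this walkthrough, not the exact prompts used to produce the policies in this notebook.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative ticket types -- the actual notebook may cover a different set of policies.
POLICY_TOPICS = ["returns", "refunds", "customer feedback"]


def generate_policy(topic: str) -> str:
    """Ask GPT-4o to draft a customer support policy for a single ticket type."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are drafting internal customer support policies. "
                    "Write a concise, numbered policy describing how an agent "
                    "should handle the given ticket type."
                ),
            },
            {"role": "user", "content": f"Ticket type: {topic}"},
        ],
    )
    return response.choices[0].message.content


policies = {topic: generate_policy(topic) for topic in POLICY_TOPICS}
```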
Next, we'll take these policies and generate sample customer interactions that either do or do not follow them.
Now let's iterate through the policies and generate some examples.
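As a rough sketch of that loop, we can ask GPT-4o to produce both a compliant and a non-compliant exchange for each policy. The `generate_example` helper, the JSON keys, and the `accurate`/`hallucination` labels are assumptions made for illustration; `client` and `policies` come from the sketch above.

```python
import json


def generate_example(policy: str, follows_policy: bool) -> dict:
    """Generate a synthetic support exchange that either follows or violates the policy."""
    behavior = (
        "follows the policy exactly"
        if follows_policy
        else "subtly violates the policy or invents details that are not in it"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Generate a short customer support exchange as a JSON object "
                    "with the keys 'customer_message' and 'assistant_response'. "
                    f"The assistant response {behavior}."
                ),
            },
            {"role": "user", "content": f"Policy:\n{policy}"},
        ],
        response_format={"type": "json_object"},
    )
    example = json.loads(response.choices[0].message.content)
    example["policy"] = policy
    example["label"] = "accurate" if follows_policy else "hallucination"
    return example


eval_set = []
for policy in policies.values():
    for follows_policy in (True, False):
        eval_set.append(generate_example(policy, follows_policy))
```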
2. Constructing our hallucination guardrail
When building out our hallucination guardrail, here are some guiding principles:
- Provide very descriptive metrics for evaluating whether a response is accurate
  - It is important to break the idea of "truth" down into easily identifiable metrics that we can measure
  - Metrics like truthfulness and relevance are difficult to measure directly; giving the model concrete ways to score a statement results in a more accurate guardrail
- Ensure consistency across key terminology
  - It is important to keep relevant terms such as knowledge base articles, assistants, and users consistent across the prompt
  - If we start mixing terms, such as using "assistant" and "agent" interchangeably, the model could get confused
- Start with the most advanced model
  - There is a cost vs. quality trade-off when using the most advanced models. Although GPT-4o may be more expensive, it is important to start with the most advanced model so we can ensure a high degree of accuracy
  - Once we have thoroughly tested the guardrail and are confident in its performance, we can look at reducing cost by moving down to gpt-3.5-turbo
- Evaluate each sentence independently and the entire response as a whole
  - If the agent returns a long response, it can be useful to break the response down into individual sentences and evaluate them independently
  - In addition, evaluating the message as a whole ensures that you don't lose important context
With all of this in mind, let's build out a guardrail system and measure its performance.
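Here is a minimal sketch of such a guardrail, continuing from the sketches above. The grading criteria in `GUARDRAIL_SYSTEM_PROMPT`, the `check_hallucination` helper, and the label comparison are illustrative assumptions; in practice you would also append few-shot examples of accurate and hallucinated responses to the prompt, as discussed earlier.

```python
# Grading criteria for the judge. Few-shot examples of accurate and
# hallucinated responses could be appended here to improve accuracy.
GUARDRAIL_SYSTEM_PROMPT = """You are grading whether an assistant response is fully supported by the knowledge base policy provided.

First, evaluate each sentence of the assistant response independently:
- SUPPORTED: the sentence is directly backed by the policy
- UNSUPPORTED: the sentence introduces facts, numbers, or promises not found in the policy

Then evaluate the response as a whole. If any sentence is UNSUPPORTED, or the overall intent of the response contradicts the policy, answer "hallucination". Otherwise answer "accurate". Respond with only that single word."""


def check_hallucination(policy: str, assistant_response: str) -> str:
    """Run the guardrail judge over a single assistant response."""
    response = client.chat.completions.create(
        model="gpt-4o",  # start with the most advanced model, tune down later
        temperature=0,
        messages=[
            {"role": "system", "content": GUARDRAIL_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Policy:\n{policy}\n\nAssistant response:\n{assistant_response}",
            },
        ],
    )
    return response.choices[0].message.content.strip().lower()


# Score the eval set, treating "hallucination" as the positive class.
tp = fp = fn = 0
for example in eval_set:
    predicted = check_hallucination(example["policy"], example["assistant_response"])
    if predicted == "hallucination" and example["label"] == "hallucination":
        tp += 1
    elif predicted == "hallucination" and example["label"] == "accurate":
        fp += 1
    elif predicted == "accurate" and example["label"] == "hallucination":
        fn += 1

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```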
Precision: 0.97, Recall: 1.00

(Precision measures the proportion of correctly identified true positives out of all instances predicted as positive; recall measures the proportion of correctly identified true positives out of all actual positive instances in the dataset.)
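As a reminder, with true positives (TP), false positives (FP), and false negatives (FN) counted for the hallucination class, these metrics are defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$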
From the results above we can see that the guardrail is performing well, with high precision and recall. This means it is able to accurately identify hallucinations in the model outputs.