Developing Hallucination Guardrails
A guardrail is a set of rules and checks designed to ensure that the outputs of an LLM are accurate, appropriate, and aligned with user expectations. For more information on developing guardrails, you can refer to our guide on developing guardrails.
In this notebook, we'll walk through the process of developing an output guardrail that specifically checks model outputs for hallucinations.
This notebook will focus on:
- Building out a strong eval set
- Identifying specific criteria to measure hallucinations
- Improving the accuracy of our guardrail with few-shot prompting
1. Building out an eval set
Imagine we are a customer support team building an automated support agent. We will feed the assistant information from our knowledge base about a specific set of policies for handling tickets such as returns, refunds, and feedback, and we expect the model to follow these policies when interacting with customers.
The first thing we will do is use GPT-4o to build out the set of policies that we want the assistant to follow.
If you want to dive deeper into generating synthetic data, you can review our Synthetic Data Generation Cookbook here.
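Below is a minimal sketch of how this policy generation could look with the OpenAI Python SDK. The system prompt, the `POLICY_TOPICS` list, and the `generate_policy` helper are illustrative assumptions for this walkthrough, not the exact prompts used to produce the policies in this notebook.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative ticket types -- the actual notebook may cover a different set of policies.
POLICY_TOPICS = ["returns", "refunds", "customer feedback"]


def generate_policy(topic: str) -> str:
    """Ask GPT-4o to draft a customer support policy for a single ticket type."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are drafting internal customer support policies. "
                    "Write a concise, numbered policy describing how an agent "
                    "should handle the given ticket type."
                ),
            },
            {"role": "user", "content": f"Ticket type: {topic}"},
        ],
    )
    return response.choices[0].message.content


policies = {topic: generate_policy(topic) for topic in POLICY_TOPICS}
```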
Next, we'll take these policies and generate sample customer interactions that either do or do not follow them.
Now let's iterate through the policies and generate some examples.
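As a rough sketch of that loop, we can ask GPT-4o to produce both a compliant and a non-compliant exchange for each policy. The `generate_example` helper, the JSON keys, and the `accurate`/`hallucination` labels are assumptions made for illustration; `client` and `policies` come from the sketch above.

```python
import json


def generate_example(policy: str, follows_policy: bool) -> dict:
    """Generate a synthetic support exchange that either follows or violates the policy."""
    behavior = (
        "follows the policy exactly"
        if follows_policy
        else "subtly violates the policy or invents details that are not in it"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Generate a short customer support exchange as a JSON object "
                    "with the keys 'customer_message' and 'assistant_response'. "
                    f"The assistant response {behavior}."
                ),
            },
            {"role": "user", "content": f"Policy:\n{policy}"},
        ],
        response_format={"type": "json_object"},
    )
    example = json.loads(response.choices[0].message.content)
    example["policy"] = policy
    example["label"] = "accurate" if follows_policy else "hallucination"
    return example


eval_set = []
for policy in policies.values():
    for follows_policy in (True, False):
        eval_set.append(generate_example(policy, follows_policy))
```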
2. Constructing our hallucination guardrail
When building out our hallucination guardrail, here are some guiding principles:
- Provide very descriptive metrics for evaluating whether a response is accurate
  - It is important to break the idea of "truth" down into easily identifiable metrics that we can measure
  - Metrics like truthfulness and relevance are difficult to measure directly; giving the model concrete ways to score a statement results in a more accurate guardrail
- Ensure consistency across key terminology
  - It is important to keep relevant terms such as knowledge base articles, assistants, and users consistent across the prompt
  - If we start mixing terms, such as using "assistant" and "agent" interchangeably, the model could get confused
- Start with the most advanced model
  - There is a cost vs. quality trade-off when using the most advanced models. Although GPT-4o may be more expensive, it is important to start with the most advanced model so we can ensure a high degree of accuracy
  - Once we have thoroughly tested the guardrail and are confident in its performance, we can look at reducing cost by moving down to gpt-3.5-turbo
- Evaluate each sentence independently and the entire response as a whole
  - If the agent returns a long response, it can be useful to break the response down into individual sentences and evaluate them independently
  - In addition, evaluating the message as a whole ensures that you don't lose important context
With all of this in mind, let's build out a guardrail system and measure its performance.
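Here is a minimal sketch of such a guardrail, continuing from the sketches above. The grading criteria in `GUARDRAIL_SYSTEM_PROMPT`, the `check_hallucination` helper, and the label comparison are illustrative assumptions; in practice you would also append few-shot examples of accurate and hallucinated responses to the prompt, as discussed earlier.

```python
# Grading criteria for the judge. Few-shot examples of accurate and
# hallucinated responses could be appended here to improve accuracy.
GUARDRAIL_SYSTEM_PROMPT = """You are grading whether an assistant response is fully supported by the knowledge base policy provided.

First, evaluate each sentence of the assistant response independently:
- SUPPORTED: the sentence is directly backed by the policy
- UNSUPPORTED: the sentence introduces facts, numbers, or promises not found in the policy

Then evaluate the response as a whole. If any sentence is UNSUPPORTED, or the overall intent of the response contradicts the policy, answer "hallucination". Otherwise answer "accurate". Respond with only that single word."""


def check_hallucination(policy: str, assistant_response: str) -> str:
    """Run the guardrail judge over a single assistant response."""
    response = client.chat.completions.create(
        model="gpt-4o",  # start with the most advanced model, tune down later
        temperature=0,
        messages=[
            {"role": "system", "content": GUARDRAIL_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Policy:\n{policy}\n\nAssistant response:\n{assistant_response}",
            },
        ],
    )
    return response.choices[0].message.content.strip().lower()


# Score the eval set, treating "hallucination" as the positive class.
tp = fp = fn = 0
for example in eval_set:
    predicted = check_hallucination(example["policy"], example["assistant_response"])
    if predicted == "hallucination" and example["label"] == "hallucination":
        tp += 1
    elif predicted == "hallucination" and example["label"] == "accurate":
        fp += 1
    elif predicted == "accurate" and example["label"] == "hallucination":
        fn += 1

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```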
Precision: 0.97, Recall: 1.00

(Precision measures the proportion of correctly identified true positives out of all instances predicted as positive; recall measures the proportion of correctly identified true positives out of all actual positive instances in the dataset.)
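As a reminder, with true positives (TP), false positives (FP), and false negatives (FN) counted for the hallucination class, these metrics are defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$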
From the results above we can see that the guardrail is performing well, with high precision and recall. This means it is able to accurately identify hallucinations in the model outputs.