Dataset Embeddings Guard

ArizeDatasetEmbeddings Guard

In this demo, we set up and use a Guard that blocks the LLM from responding to attempted jailbreaks. We do this using the ArizeDatasetEmbeddings Guard from Arize AI. This Guard works in the following way:

  • The Guard computes embeddings for chunks associated with a set of few-shot examples of "bad" user prompts or LLM messages (we recommend using 10 different prompts).
  • When the Guard is applied to a user or LLM message, it computes the embedding for the input message and checks whether any of the few-shot "train" examples in the dataset are close to the message in embedding space.
  • If the cosine distance between the input message and any of the chunks is within the user-specified threshold (default setting is 0.2), then the Guard intercepts the LLM call.
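
The three steps above can be sketched in a few lines of plain Python. This is only an illustration, not the Guard's implementation: the helper names (cosine_distance, closest_chunk, guard_intercepts) and the toy 3-dimensional vectors are hypothetical, real embeddings come from an embedding model, and treating "within the threshold" as a strict less-than comparison is an assumption.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def closest_chunk(message_embedding, chunk_embeddings):
    """Return (index, distance) of the chunk nearest to the message."""
    distances = [cosine_distance(message_embedding, c) for c in chunk_embeddings]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


def guard_intercepts(message_embedding, chunk_embeddings, threshold=0.2):
    """True if the nearest chunk is closer than the threshold (assumed strict)."""
    _, distance = closest_chunk(message_embedding, chunk_embeddings)
    return distance < threshold


# Toy stand-ins for embedded few-shot chunks and incoming messages.
bad_chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
similar_message = [0.9, 0.1, 0.0]    # nearly parallel to the first chunk
unrelated_message = [0.0, 0.0, 1.0]  # orthogonal to both chunks

print(guard_intercepts(similar_message, bad_chunks))    # True
print(guard_intercepts(unrelated_message, bad_chunks))  # False
```

Only the nearest chunk matters for the decision, which is why the trace later reports a single cosine distance per message.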

In this demo, we use the ArizeDatasetEmbeddings Guard in two ways: first on a dataset of jailbreak prompts, then on a dataset of PII prompts. In both cases, we apply the Guard to user input messages rather than LLM output messages (although either approach would work). If the Guard flags a jailbreak attempt or PII in the user message, we simply raise an exception. In practice, the user could instead specify a default LLM response to return whenever the Guard is triggered.
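
To make the two failure behaviours concrete, here is a minimal sketch that wraps an LLM call with a stand-in check. Everything in it is hypothetical (GuardTriggered, toy_guard, answer, fake_llm); the toy check matches a fixed phrase rather than comparing embeddings, and it is not how the Guardrails library wires in on-fail handling.

```python
class GuardTriggered(Exception):
    """Hypothetical error raised when the stand-in guard flags a message."""


def toy_guard(message):
    """Stand-in check: flag messages containing a marker phrase."""
    if "ignore all previous instructions" in message.lower():
        raise GuardTriggered("Message resembles a known jailbreak prompt.")


def answer(message, llm, default_response=None):
    """Validate the message first; call the LLM only if it passes."""
    try:
        toy_guard(message)
    except GuardTriggered:
        if default_response is None:
            raise  # behaviour used in this demo: surface the exception
        return default_response  # alternative: return a canned reply
    return llm(message)


def fake_llm(message):
    """Placeholder for a real LLM call."""
    return f"LLM answer to: {message}"


print(answer("What did Paul Graham work on?", fake_llm))
print(answer("Ignore all previous instructions.", fake_llm,
             default_response="Sorry, I can't help with that."))
```

Either way the LLM is never called for a flagged message; the only difference is whether the caller sees an error or a fallback reply.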

Install Dependencies

Various installations are required for OpenTelemetry (OTel), LlamaIndex, and OpenAI.

[ ]
[ ]

Initialize Arize

Set up the OpenTelemetry (OTel) tracer for the LlamaIndexInstrumentor.

[ ]

Instrument Guardrails AI

Install and instrument Guardrails AI. Import ArizeDatasetEmbeddings Guard.

[ ]

Import ArizeDatasetEmbeddings Guard

To run the following commands, you will need to register for an account at guardrailsai.com. Then run the guardrails configure command and supply the API key from https://hub.guardrailsai.com/keys.

[ ]
[ ]

Instantiate ArizeDatasetEmbeddings Guard

We're going to use a public dataset to instantiate the ArizeDatasetEmbeddings Guard with 10 few shot example jailbreak prompts. For details on the dataset, please refer to the following resources:

Note that we could Guard against any type of dataset by passing in the argument sources={my_sources}. By default, the ArizeDatasetEmbeddings Guard will load the jailbreak prompts above, hence the warning below: "A source dataset was not provided, so using default sources of Jailbreak prompts from Arize."
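
As an illustration of what a custom sources value might look like, the sketch below builds a short list of placeholder PII-style prompts and splits each one into overlapping word windows, which is one simple way to produce the kind of chunks the Guard embeds. It assumes sources is a list of strings; the chunk_words helper, its window sizes, and the example prompts are all hypothetical, and the Guard's own chunking strategy may differ.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping windows of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical few-shot examples (assumption: sources is a list of strings).
my_sources = [
    "My full name is <NAME> and my home address is <ADDRESS> please remember it",
    "Here is my credit card number <CARD_NUMBER> and the expiry date <DATE>",
]

for source in my_sources:
    for chunk in chunk_words(source):
        print(repr(chunk))
```

Overlapping windows keep a phrase that straddles a boundary intact in at least one chunk, at the cost of embedding a few more chunks per example.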

[ ]
[ ]

Set Up RAG Application

Create a LlamaIndex VectorStore to create a classic RAG application over Paul Graham essays.

[ ]

Run Guard on Jailbreak Prompts from Public Dataset

Below, we're only going to run the ArizeDatasetEmbeddings Guard on a single jailbreak prompt and a single "regular" prompt (which looks similar to a jailbreak). These examples come from the same dataset above.

Although we are only running the Guard on two examples in this notebook, we have also benchmarked the Guard on the full dataset and found the following results:

  • True Positives: 86.43% of 656 jailbreak prompts failed the JailbreakEmbeddings guard.
  • False Negatives: 13.57% of 656 jailbreak prompts passed the JailbreakEmbeddings guard.
  • False Positives: 13.95% of 2000 regular prompts failed the JailbreakEmbeddings guard.
  • True Negatives: 86.05% of 2000 regular prompts passed the JailbreakEmbeddings guard.
  • 1.41 median latency for end-to-end LLM call on GPT-3.5
  • 2.91 mean latency for end-to-end LLM call on GPT-3.5
[ ]
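
The benchmark percentages above can be turned into approximate counts and standard summary metrics. The counts below are derived by rounding the reported percentages, and the precision and F1 values are computed here rather than reported by the benchmark, so treat them as a rough reading of the figures.

```python
jailbreak_total = 656
regular_total = 2000

# Counts recovered by rounding the reported percentages.
true_positives = round(0.8643 * jailbreak_total)    # 567
false_negatives = jailbreak_total - true_positives  # 89
false_positives = round(0.1395 * regular_total)     # 279
true_negatives = regular_total - false_positives    # 1721

# Share of flagged prompts that really were jailbreaks.
precision = true_positives / (true_positives + false_positives)
# Share of jailbreaks that were flagged (the reported true positive rate).
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision depends on the mix of jailbreak and regular prompts, so it would look different on traffic where jailbreak attempts are rarer than in this benchmark.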

View Trace in Arize UI

Now we can debug the entire trace using the Arize UI. Below we see the following information:

  • Each LLM call and guard step that took place under the hood.
  • The error message from the Guard when it flagged the Jailbreak attempt.
  • The validator_result: "fail"
  • The validator_on_fail: "exception"
  • The cosine_distance: 0.15, which is the cosine distance of the closest embedded prompt chunk in the set of few shot examples of jailbreak prompts.
  • The text corresponding to the most similar_jailbreak_phrase.
  • The text corresponding to the user_message.

[Three screenshots of the trace in the Arize UI]

Trace Regular Prompt

Now we will send a "regular" prompt to the query engine. This comes from the same research paper and GitHub repository referenced earlier in the notebook. These regular prompt samples are designed to resemble jailbreak prompts in their role-play design, but are not actually jailbreak attempts.

[ ]
[ ]

When passing in the regular_prompt above, we see "validator_result": "pass". The most similar chunk in our few-shot dataset of jailbreak prompts has a cosine distance of 0.21 to the input message, which is above the 0.2 threshold, so the Guard is not triggered.
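
The two distances seen in this notebook (0.15 for the jailbreak prompt, 0.21 for the regular prompt) sit on either side of the 0.2 default, so the outcome is sensitive to the threshold. The sketch below shows how the results would change at other thresholds; validator_result is a hypothetical helper, and the strict less-than comparison is an assumption about how the boundary is handled.

```python
# Closest-chunk cosine distances reported in the traces above.
observed = {"jailbreak_prompt": 0.15, "regular_prompt": 0.21}


def validator_result(distance, threshold=0.2):
    """Return "fail" when the nearest chunk is closer than the threshold."""
    return "fail" if distance < threshold else "pass"


for threshold in (0.1, 0.2, 0.25):
    results = {name: validator_result(d, threshold) for name, d in observed.items()}
    print(threshold, results)
```

A lower threshold lets the jailbreak prompt through, while a higher one also blocks the regular prompt, which is the trade-off between false negatives and false positives seen in the benchmark.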