Anthropic Building Moderation Filter

Building Moderation Filter

Export

Run Notebooks

idle

Contents

No cells yet

Add cells to see them here

Building a moderation filter with Claude

This guide will show you how to use Claude to build a content moderation filter for user-generated text. The key idea is to define the moderation rules and categories directly in the prompt, allowing for easy customization and experimentation.

Basic Approach

The basic approach is to provide Claude with a prompt that describes the categories you want to filter for (e.g. "ALLOW" and "BLOCK"), along with detailed descriptions or examples of what kinds of content should fall into each category. Then, you insert the user-generated text to be classified as part of the prompt, and ask Claude to categorize it based on the provided guidelines.

Here's an example prompt structure:

	You are a content moderation expert tasked with categorizing user-generated text based on the following guidelines:

BLOCK CATEGORY:
- [Description or examples of content that should be blocked]

ALLOW CATEGORY:
- [Description or examples of content that is allowed]

Here is the user-generated text to categorize:
<user_text>{{USER_TEXT}}</user_text>

Based on the guidelines above, classify this text as either ALLOW or BLOCK. Return nothing else.

To use this, you would replace {{USER_TEXT}} with the actual user-generated text to be classified, and then send the prompt to Claude using the Claude API. Claude's response should be either "ALLOW" or "BLOCK", indicating how the text should be handled based on your provided guidelines.

Example usage

Here's some example Python code that demonstrates how to use this approach:

[ ]

[2]

And here's an example of how you could use this function to moderate an array of user comments:

[3]

Comment: This movie was great, I really enjoyed it. The main actor really killed it!
Classification: ALLOW

Comment: Delete this post now or you better hide. I am coming after you and your family.
Classification: BLOCK

Comment: Stay away from the 5G cellphones!! They are using 5G to control you.
Classification: BLOCK

Comment: Thanks for the helpful information!
Classification: ALLOW

Customization

One of the key benefits of this approach is that you can easily customize the moderation rules by modifying the descriptions or examples provided in the prompt for the "BLOCK" and "ALLOW" categories. This allows you to fine-tune the filtering to suit your specific needs or preferences.

For example, if you wanted to Claude to moderate a rollercoaster enthusiast forum and ensure posts stay on topic, you could update the "ALLOW" and "BLOCK" category descriptions accordingly:

[4]

Title: Top 10 Wildest Inversions on Steel Coasters
Classification: ALLOW

Title: My Review of the New RMC Raptor Coaster at Cedar Point
Classification: ALLOW

Title: Best Places to Buy Cheap Hiking Gear
Classification: BLOCK

Title: Rumor: Is Six Flags Planning a Giga Coaster for 2025?
Classification: ALLOW

Title: My Thoughts on the Latest Marvel Movie
Classification: BLOCK

Improving Performance with Chain of Thought (CoT)

One technique that can enhance Claude's content moderation capabilities is "chain-of-thought" (CoT) prompting. This approach encourages Claude to break down its reasoning process into a step-by-step chain of thoughts, rather than just providing the final output.

To leverage chain of thought for moderation, you can modify your prompt to explicitly instruct Claude to break down its process into clear steps inside <thinking> tags. Here's an example:

[8]

<thinking>
The post appears to be promoting a band rather than discussing rollercoasters, theme parks, or the amusement industry. This falls under the "spam, advertisements, or self-promotion" category, which is grounds for blocking the post.
</thinking>

<output>BLOCK</output>

Improving Performance with Examples

Another technique for improving performance is by adding a few examples to the prompt, you provide Claude with some initial training data or "few-shot learning" to better understand the desired categorization. This can be especially helpful for nuanced or ambiguous cases where the category boundaries may not be entirely clear from the text descriptions alone. Here's an example of how you could modify the prompt template to include examples:

[9]

ALLOW