Building Moderation Filter
Building a moderation filter with Claude
This guide will show you how to use Claude to build a content moderation filter for user-generated text. The key idea is to define the moderation rules and categories directly in the prompt, allowing for easy customization and experimentation.
Basic Approach
The basic approach is to provide Claude with a prompt that describes the categories you want to filter for (e.g. "ALLOW" and "BLOCK"), along with detailed descriptions or examples of what kinds of content should fall into each category. Then, you insert the user-generated text to be classified as part of the prompt, and ask Claude to categorize it based on the provided guidelines.
Here's an example prompt structure:
You are a content moderation expert tasked with categorizing user-generated text based on the following guidelines:
BLOCK CATEGORY:
- [Description or examples of content that should be blocked]
ALLOW CATEGORY:
- [Description or examples of content that is allowed]
Here is the user-generated text to categorize:
<user_text>{{USER_TEXT}}</user_text>
Based on the guidelines above, classify this text as either ALLOW or BLOCK. Return nothing else.
To use this, you would replace {{USER_TEXT}} with the actual user-generated text to be classified, and then send the prompt to Claude using the Claude API. Claude's response should be either "ALLOW" or "BLOCK", indicating how the text should be handled based on your provided guidelines.
Example usage
Here's some example Python code that demonstrates how to use this approach:
And here's an example of how you could use this function to moderate an array of user comments:
Comment: This movie was great, I really enjoyed it. The main actor really killed it! Classification: ALLOW Comment: Delete this post now or you better hide. I am coming after you and your family. Classification: BLOCK Comment: Stay away from the 5G cellphones!! They are using 5G to control you. Classification: BLOCK Comment: Thanks for the helpful information! Classification: ALLOW
Customization
One of the key benefits of this approach is that you can easily customize the moderation rules by modifying the descriptions or examples provided in the prompt for the "BLOCK" and "ALLOW" categories. This allows you to fine-tune the filtering to suit your specific needs or preferences.
For example, if you wanted to Claude to moderate a rollercoaster enthusiast forum and ensure posts stay on topic, you could update the "ALLOW" and "BLOCK" category descriptions accordingly:
Title: Top 10 Wildest Inversions on Steel Coasters Classification: ALLOW Title: My Review of the New RMC Raptor Coaster at Cedar Point Classification: ALLOW Title: Best Places to Buy Cheap Hiking Gear Classification: BLOCK Title: Rumor: Is Six Flags Planning a Giga Coaster for 2025? Classification: ALLOW Title: My Thoughts on the Latest Marvel Movie Classification: BLOCK
Improving Performance with Chain of Thought (CoT)
One technique that can enhance Claude's content moderation capabilities is "chain-of-thought" (CoT) prompting. This approach encourages Claude to break down its reasoning process into a step-by-step chain of thoughts, rather than just providing the final output.
To leverage chain of thought for moderation, you can modify your prompt to explicitly instruct Claude to break down its process into clear steps inside <thinking> tags. Here's an example:
<thinking> The post appears to be promoting a band rather than discussing rollercoasters, theme parks, or the amusement industry. This falls under the "spam, advertisements, or self-promotion" category, which is grounds for blocking the post. </thinking> <output>BLOCK</output>
Improving Performance with Examples
Another technique for improving performance is by adding a few examples to the prompt, you provide Claude with some initial training data or "few-shot learning" to better understand the desired categorization. This can be especially helpful for nuanced or ambiguous cases where the category boundaries may not be entirely clear from the text descriptions alone. Here's an example of how you could modify the prompt template to include examples:
ALLOW