AI spam detection has been tremendously successful and has helped so many of our communities succeed.
In this post, I would like to share details about our “in progress” experiment, in case it is helpful to other communities.
I intend to keep this post up to date as the experiment progresses and share some information on the class of problems it can detect.
Be mindful, though: this is an evolving system, not a final product yet.
Why AI moderation?
A key principle of AI integration on Discourse is that it should add value for human moderators, rather than replace them. The promise of AI moderation is that it can signal to moderators that “something is wrong” and recommend actions they might take. Agency should be left entirely to human moderators.
Why chat as the modality for this experiment?
When setting up my experiment, I opted to use chat as the modality for notifications. This gives the experiment a dedicated channel that does not interfere with general moderation on the forum.
Given that building and refining prompts is very much a work-in-progress, bugging the rest of the moderation team on meta did not feel like a good approach.
When you give people highly unfinished AI projects, you can very easily lose all trust and future buy-in.
What about batch testing?
One limitation of our current automation system is that you cannot batch test changes. This means that when you change an AI prompt, there is a long delay before you can tell how helpful the change was.
This is particularly problematic if you only see a handful of problems on the forum each day: the feedback loop is too slow, and it can take months to refine a prompt.
I am very aware of this limitation and hope to delete this section from the post over the next few weeks, once we have a system for batch testing.
How is this configured?
My current experiment builds on 3 features, which hand off to one another as sketched below:
- Automation - AI Persona responder
- Discourse AI - AI Persona
- Discourse AI - Custom tools
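To make the hand-off between these three pieces concrete, here is a rough sketch of the triage flow. This is illustrative pseudocode only: the persona object and the logging are hypothetical stand-ins, not real Discourse APIs.

// Illustrative sketch, not real Discourse APIs: the automation fires on
// every new post, the persona judges it, and the custom tool reports it.
const persona = {
  // 2. AI Persona: judges the post against the system prompt; the forced
  //    tool setting means its only possible output is a Judge Post call
  judge(post) {
    return { priority: "low", message: "Example judgement for " + post.username };
  }
};

function onPostCreated(post) {
  // 1. Automation (AI Persona responder): triggers silently on new posts
  const judgement = persona.judge(post);
  // 3. Custom tool (Judge Post): reports to chat unless priority is "ignore"
  if (judgement.priority !== "ignore") {
    console.log("notify #AI-Moderation:", judgement.message);
  }
}

onPostCreated({ username: "example_user" });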
Our responder automation
The most notable thing about the responder is that it is silent, meaning it will neither whisper nor post on the topic it triages.
Our moderation Persona
The most notable thing here is the forced tool: it means every post will be judged using the Judge Post custom tool.
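For illustration, a forced call means the model’s only possible output is a set of Judge Post arguments, shaped roughly like this (the parameter names come from the script further down; the values are made up):

// Hypothetical example of the arguments the LLM hands to Judge Post;
// priority is one of "ignore" | "low" | "medium" | "urgent"
{
  "priority": "medium",
  "message": "Criticizes the person rather than the idea; worth a human look."
}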
Our current system prompt is this: (will update as we go)
system prompt
You are an AI moderator for meta.discourse.org, the official Discourse discussion forum. Your role is to help maintain a “clean, well-lighted place for civilized public discourse” in alignment with our community guidelines.
MODERATION PHILOSOPHY:
- View this forum as a shared community resource, like a public park
- Use guidelines to aid human judgment, not as rigid rules
- Focus on improving discussions rather than just enforcing rules
- Balance between facilitation and moderation
CONTENT EVALUATION FRAMEWORK:
1. IMPROVE THE DISCUSSION
- Assess if posts add value to the conversation
- Recognize posts that show respect for topics and participants
- Support exploration of existing discussions before starting new ones
2. DISAGREEMENT STANDARDS
- Distinguish between criticizing ideas (acceptable) and criticizing people (unacceptable)
- Flag instances of: name-calling, ad hominem attacks, responses to a post’s tone rather than its content, and knee-jerk contradictions
- Evaluate whether counter-arguments are reasoned and improve the conversation
3. PARTICIPATION QUALITY
- Prioritize discussions that make the forum an interesting place
- Consider community signals (likes, flags, replies) in assessment
- Support content that leaves the community “better than we found it”
4. PROBLEM IDENTIFICATION
- Focus on flagging bad behavior rather than engaging with it
- Recognize when flags should trigger action (automatically or by human moderators)
- Remember that both moderators and users share responsibility for the forum
5. CIVILITY ENFORCEMENT
- Identify potentially offensive, abusive, or hate speech
- Flag obscene or sexually explicit content
- Watch for harassment, impersonation, or exposure of private information
- Prevent spam or forum vandalism
6. ORGANIZATION MAINTENANCE
- Note topics posted in wrong categories
- Identify cross-posting across multiple topics
- Flag no-content replies and topic diversions
- Discourage post signatures
7. CONTENT OWNERSHIP
- Flag unauthorized posting of others’ digital content
- Identify potential intellectual property violations
When evaluating content, consider context, user history, and forum norms. Your goal is to guide rather than punish, educate rather than enforce, but maintain consistent standards that preserve the quality of discussion.
Judge ALL posts; if a post requires no moderation, use the ignore priority.
Our Judge Post custom tool
The script powering it:
function invoke(params) {
  let post, topic;

  if (params.priority !== "ignore") {
    // Fall back to a known post_id so the tool can be exercised from the test button
    const post_id = context.post_id || 1735240;
    post = discourse.getPost(post_id);
    topic = post.topic;

    // Map the priority to an emoji that signals urgency in the chat channel
    let statusEmoji = "";
    if (params.priority === "urgent") {
      statusEmoji = ":police_car_light:"; // urgent
    } else if (params.priority === "medium") {
      statusEmoji = ":warning:"; // medium
    } else if (params.priority === "low") {
      statusEmoji = ":writing_hand:"; // low
    }

    // Link to the offending post and relay the model's explanation
    const message = `${statusEmoji} [${topic.title} - ${post.username}](${post.post_url}): ${params.message}`;
    discourse.createChatMessage({
      channel_name: "AI Moderation",
      username: "AI-moderation-bot",
      message: message
    });
  }

  // End the LLM chain here; the tool call is the final step
  chain.setCustomRaw("Post was classified");
  return "done";
}

function details() {
  return "Judge Post";
}
The script uses quite a few advanced techniques:
- chain.setCustomRaw: tells the persona to stop running the LLM chain and makes the tool call the final call, saving tokens
- discourse.createChatMessage: a new API that can be used from tools to create chat messages
- discourse.getPost: used to look up post information
Given this, I am able to test the tool using the test button and confirm it works well.
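Concretely, a test run amounts to invoking the tool with hand-written parameters (the values below are made up); since context.post_id is unset in a test, the hard-coded fallback post is used:

// Roughly what the test button does (illustrative values):
invoke({
  priority: "low",
  message: "No-content reply; a like would have been enough."
});
// => sends ":writing_hand: [<topic title> - <username>](<post url>): No-content reply; ..."
//    to the "AI Moderation" channel and returns "done"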
What model are you using?
At the moment, we are using Claude 3.7 Sonnet, which is a frontier model. However, we plan to shift to Gemini Flash once I make some improvements to Discourse Automation, particularly the ability to tell it to scan only public content and avoid secure categories.
I’m happy to field questions here and will keep updating as the experiment runs and we roll out more Discourse Automation features.