Experiments with AI-based moderation on Discourse Meta

AI spam detection has been tremendously successful and has helped many of our communities succeed.

In this post, I would like to share details about our “in progress” experiment, in case it is helpful to other communities.

I intend to keep this post up to date as the experiment progresses and share some information on the class of problems it can detect.

Be mindful, though: this is an evolving system, not a final product yet.

Why AI moderation?

A key principle of our AI integration on Discourse is that it should add value for human moderators rather than replace them. The promise of AI moderation is that it can signal to moderators that “something is wrong” and make recommendations about actions they should take. Agency should be left entirely to human moderators.

Why chat as a modality for this experiment

When setting up my experiment, I opted to use chat as the modality for notifications. This allows for a dedicated channel for the experiment that does not interfere with general moderation on the forum.

Given that building and refining prompts is very much a work-in-progress, bugging the rest of the moderation team on meta did not feel like a good approach.

When you give people highly unfinished AI projects, you can very easily lose all trust and future buy-in.

What about batch testing?

One limitation of our current automation system is that you cannot batch test changes. This means that when you change an AI prompt, there is a long delay before you can tell how helpful the change was.

This is particularly problematic if you only see a handful of problems on the forum in a day: the feedback loop is too slow, and it can take months to refine a prompt.

I am very aware of this limitation and hope to delete this section from the post over the next few weeks, once we have a system for batch testing.

How is this configured?

My current experiment builds on 3 features:

  1. Automation - AI Persona responder
  2. Discourse AI - AI Persona
  3. Discourse AI - Custom tools

Our responder automation

The most notable thing about the responder is that it is silent, meaning it will neither whisper nor post on the topic it triages.

Our moderating persona

The most notable thing here is the forced tool: it means every post will be judged using the Judge Post custom tool.

Our current system prompt is this: (will update as we go)

system prompt

You are an AI moderator for meta.discourse.org, the official Discourse discussion forum. Your role is to help maintain a “clean, well-lighted place for civilized public discourse” in alignment with our community guidelines.

MODERATION PHILOSOPHY:

  • View this forum as a shared community resource, like a public park
  • Use guidelines to aid human judgment, not as rigid rules
  • Focus on improving discussions rather than just enforcing rules
  • Balance between facilitation and moderation

CONTENT EVALUATION FRAMEWORK:

  1. IMPROVE THE DISCUSSION

    • Assess if posts add value to the conversation
    • Recognize posts that show respect for topics and participants
    • Support exploration of existing discussions before starting new ones
  2. DISAGREEMENT STANDARDS

    • Distinguish between criticizing ideas (acceptable) and criticizing people (unacceptable)
    • Flag instances of: name-calling, ad hominem attacks, tone responses, knee-jerk contradictions
    • Evaluate whether counter-arguments are reasoned and improve the conversation
  3. PARTICIPATION QUALITY

    • Prioritize discussions that make the forum an interesting place
    • Consider community signals (likes, flags, replies) in assessment
    • Support content that leaves the community “better than we found it”
  4. PROBLEM IDENTIFICATION

    • Focus on flagging bad behavior rather than engaging with it
    • Recognize when flags should trigger action (automatically or by human moderators)
    • Remember that both moderators and users share responsibility for the forum
  5. CIVILITY ENFORCEMENT

    • Identify potentially offensive, abusive, or hate speech
    • Flag obscene or sexually explicit content
    • Watch for harassment, impersonation, or exposure of private information
    • Prevent spam or forum vandalism
  6. ORGANIZATION MAINTENANCE

    • Note topics posted in wrong categories
    • Identify cross-posting across multiple topics
    • Flag no-content replies and topic diversions
    • Discourage post signatures
  7. CONTENT OWNERSHIP

    • Flag unauthorized posting of others’ digital content
    • Identify potential intellectual property violations

When evaluating content, consider context, user history, and forum norms. Your goal is to guide rather than punish, educate rather than enforce, but maintain consistent standards that preserve the quality of discussion.


Judge ALL posts, if a post requires no moderation use the ignore priority.

Our judge post custom tool

the script powering it
function invoke(params) {
  let post, topic;
  if (params.priority !== "ignore") {
    // post_id for testing
    const post_id = context.post_id || 1735240;
    post = discourse.getPost(post_id);
    topic = post.topic;
    let statusEmoji = "";

    if (params.priority === "urgent") {
      statusEmoji = ":police_car_light:"; // flashing light for urgent
    } else if (params.priority === "medium") {
      statusEmoji = ":warning:"; // warning sign for medium
    } else if (params.priority === "low") {
      statusEmoji = ":writing_hand:"; // writing hand for low
    }

    const message = `${statusEmoji} [${topic.title} - ${post.username}](${post.post_url}): ${params.message}`;
    discourse.createChatMessage({ channel_name: "AI Moderation", username: "AI-moderation-bot", message: message });
  }
  chain.setCustomRaw("Post was classified");
  return "done";
}
function details() {
  return "Judge Post";
}
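
For reference, the tool also declares the parameters the LLM has to supply. I have not pasted the exact definition here, but based on how the script reads them it is roughly equivalent to the sketch below (field names are illustrative, not copied from the tool editor):

// Illustrative sketch only: the real declaration lives in the custom tool editor.
// Reconstructed from the params the script reads above.
{
  name: "judge_post",
  description: "Judge a post and report a moderation priority",
  parameters: [
    {
      name: "priority",   // read as params.priority
      type: "string",
      required: true,
      enum: ["ignore", "low", "medium", "urgent"]
    },
    {
      name: "message",    // read as params.message; the justification relayed to moderators
      type: "string",
      required: true
    }
  ]
}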

The script uses quite a few advanced techniques:

  1. chain.setCustomRaw, which tells the persona to stop running the LLM chain and makes the tool call the final call, saving tokens
  2. discourse.createChatMessage, a new API that can be used from tools to create chat messages
  3. discourse.getPost, which is used to fetch post information

Given this, I am able to test the tool using the test button and confirm it works well:
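
If you want a feel for what the test run does, it is essentially equivalent to calling the script's invoke function with hand-picked parameters. A hypothetical example (the hard-coded post_id fallback in the script is what makes this work outside of a real triage run):

// Hypothetical test invocation: the parameters an LLM would normally supply.
invoke({
  priority: "low",                                        // ignore | low | medium | urgent
  message: "Post signature detected, worth a quick look"  // justification relayed to the chat channel
});
// With priority "ignore" the chat branch is skipped entirely; the tool only
// sets the custom raw ("Post was classified") and returns "done".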

What model are you using?

At the moment, we are using Sonnet 3.7, which is a frontier model. However, we plan to shift to Gemini Flash once I make some improvements to Discourse Automation, particularly the ability to tell it to scan only public content and avoid secure categories.

I’m happy to field questions here and will keep updating as the experiment runs and we roll out more Discourse Automation features.

18 Likes

How often do you get false positives or other misses? This is a relatively peaceful environment, though.

1 Like

It has been 100% silent today; so quiet that I am going to add features to automation to keep track of whether it is actually working :slight_smile:

2 Likes

I hope that maybe in 2 or 3 years AI could become useful locally to help my team moderate, but today I ask myself: is it really necessary now? So thank you for these regular topics explaining the progress.

Another question: will Discourse someday provide a multilingual API to self-hosted sites, so CDCK can keep our data safe while you fight against bad actors for us? I know I can use an LLM model, but I would gladly pay for your service as an alternative :smiley:

Let me give an example: Google Perspective is a freemium option for this, supporting many languages to fight toxicity. Why doesn't CDCK provide something similar?

1 Like

Thanks for the feedback. Yes, this is something we have thought about, but I do not think we will embark on an adventure like this in the upcoming 12 months.

1 Like

Update time

Over the past few days I made two rather large change sets to support this experiment better:

and

These changes enabled us to migrate to the far cheaper Gemini Flash 2.0 model, particularly this change:

It allowed us to have extreme confidence that only public posts on the forum are scanned.

At CDCK we have different data handling rules for different classes of data and at the moment we only approve usage of Gemini Flash on public data.

My original prompt in the OP was not triggering anything on meta. To be fair, meta is a kind and friendly place with very little need for hands-on moderation, so this is no surprise.

That said, I was simply not sure anything was working…

To resolve that, I added stats to automation (merged a few hours ago):

So we can tell this automation is working, given that it ran 20 minutes ago and has run 8 times this month.


When things were super quiet the day I deployed it, I decided to make the automation “cry wolf” because I wanted to get a better feel for the system. I amended the prompt to:

You are an AI moderator for meta.discourse.org, the official Discourse discussion forum. Your role is to help maintain a "clean, well-lighted place for civilized public discourse" in alignment with our community guidelines.

MODERATION PHILOSOPHY:
- View this forum as a shared community resource, like a public park
- Use guidelines to aid human judgment, not as rigid rules
- Focus on improving discussions rather than just enforcing rules
- Balance between facilitation and moderation
- Err on the side of flagging questionable content for human review

CONTENT EVALUATION FRAMEWORK:
1. IMPROVE THE DISCUSSION
   - Assess if posts add substantive value to the conversation
   - Flag posts with minimal substance, generic responses, or shallow engagement
   - Recognize posts that show respect for topics and participants
   - Support exploration of existing discussions before starting new ones
   - Be vigilant about "drive-by" comments that add little to the discussion

2. DISAGREEMENT STANDARDS
   - Distinguish between criticizing ideas (acceptable) and criticizing people (unacceptable)
   - Flag instances of: name-calling, ad hominem attacks, tone responses, knee-jerk contradictions
   - Evaluate whether counter-arguments are reasoned and improve the conversation
   - Be sensitive to subtle forms of dismissiveness or condescension

3. PARTICIPATION QUALITY
   - Prioritize discussions that make the forum an interesting place
   - Consider community signals (likes, flags, replies) in assessment
   - Flag content that seems generic, templated, or lacking personal insight
   - Watch for contributions that appear formulaic or don't engage meaningfully with specifics
   - Support content that leaves the community "better than we found it"

4. PROBLEM IDENTIFICATION
   - Focus on flagging bad behavior rather than engaging with it
   - Be proactive in identifying potentially problematic patterns before they escalate
   - Recognize when flags should trigger action (automatically or by human moderators)
   - Remember that both moderators and users share responsibility for the forum

5. CIVILITY ENFORCEMENT
   - Identify potentially offensive, abusive, or hate speech, including subtle forms
   - Flag obscene or sexually explicit content
   - Watch for harassment, impersonation, or exposure of private information
   - Prevent spam, forum vandalism, or marketing disguised as contribution

6. ORGANIZATION MAINTENANCE
   - Note topics posted in wrong categories
   - Identify cross-posting across multiple topics
   - Flag no-content replies, topic diversions, and threadjacking
   - Discourage post signatures and unnecessary formatting

7. CONTENT OWNERSHIP
   - Flag unauthorized posting of others' digital content
   - Identify potential intellectual property violations

8. AI-GENERATED CONTENT DETECTION
   - Watch for telltale signs of AI-generated content: overly formal language, generic phrasing, perfect grammar with little personality
   - Flag content that seems templated, lacks specificity, or doesn't engage with the particulars of the discussion
   - Be sensitive to responses that seem comprehensive but shallow in actual insight
   - Identify posts with unusual phrasing patterns, unnecessary verbosity, or repetitive structures

OUTPUT FORMAT:
Your moderation assessment must be extremely concise:
**[PRIORITY]**: 1-2 sentence justification with key issue identified
Use markdown formatting for readability but keep total response under 3 lines when possible.

When evaluating content, consider context, user history, and forum norms. Set a high bar for what passes without moderation - use "low" priority even for minor issues, reserving "ignore" only for clearly valuable contributions.

--- 

Judge ALL posts with a skeptical eye. Only use the "ignore" priority for contributions with clear, authentic value. When in doubt about a post's value or authenticity, assign at least a "low" priority for human review.

This prompt results in a far noisier chat channel:

Observations

This experiment is taking twists and turns, but I am seeing something very interesting forming.

Not all moderation needs to be flag based; sometimes simply having some ideas and awareness that something is going on is good enough.

This kind of tooling is very aligned with our vision for AI in communities: it is a “little AI sidekick” that gives moderators ideas about what to look at. Additionally, it is an opportunity to enforce common guidelines and rules.

Some small communities might want a “naggy” AI sidekick. Other, larger and busier ones may only be able to afford attention for extreme outlier behavior.

Future areas I am considering working on here are:

  1. It is kind of annoying that the moderator bot steps in and asks about the same topic twice. Collapsing old messages, threading, or something else may be an interesting approach for avoiding this.

  2. @hugh raised that once you see a chat channel like this, you want to just ask the bot to act on your behalf, e.g.:

    • Perform deep research on this and provide detailed guidance
    • Oh this really looks like a terrible user, help me ban this user for 3 days
    • Open a bug on our internal bug tracker to keep track of this issue
    • and so on.

To get to the state where a bot can act on our behalf, we need a new construct in Discourse AI that allows a tool to seek user approval. This is something I am thinking about.

  3. As raised in the OP, running batches would be nice; there is just too much lead time between when you edit a prompt and when you know whether the edit worked. I am thinking about how to add this to automation.

  4. Live tuning is an interesting concept… “Hey bot, this is too much, why are you bugging me about this stuff” … “Bot … X, Y, Z … would you like me to improve my instruction set?” … “Yes”

Hope you all find this helpful, let me know if you have any questions.

7 Likes

Just an idea: could you do something in your prompt so the moderation bot will at least once in a while post a ping response, to show it's working? Maybe, for example, with 1% probability when a post needs no action, post a note that the post needed no action. Or a lower probability for a busier forum.
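
One way to approximate this without touching the prompt at all might be to do the sampling inside the tool script itself. A rough, untested sketch reusing the invoke function and chat APIs from the OP (the emoji and wording are just placeholders):

// Rough, untested sketch: occasionally post a heartbeat even when no action is needed.
function invoke(params) {
  if (params.priority === "ignore") {
    // ~1% of "ignore" classifications produce a visible ping so moderators can
    // tell the automation is still alive on a quiet forum.
    if (Math.random() < 0.01) {
      discourse.createChatMessage({
        channel_name: "AI Moderation",
        username: "AI-moderation-bot",
        message: ":white_check_mark: Heartbeat: latest post reviewed, no action needed."
      });
    }
  } else {
    // ... existing urgent/medium/low handling from the OP goes here ...
  }
  chain.setCustomRaw("Post was classified");
  return "done";
}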

Looking at the difference between these prompts:

Judge ALL posts, if a post requires no moderation use the ignore priority.

Judge ALL posts with a skeptical eye. Only use the “ignore” priority for contributions with clear, authentic value. When in doubt about a post’s value or authenticity, assign at least a “low” priority for human review.

I think it’s important to remember the major recency bias in the models – perhaps all command words should be mentioned in prose near the end, in reverse order of desired frequency.

Alternatively, have it trigger on an innocent, common-but-not-too-common word. “Flag posts that mention pineapples”.

2 Likes