Experiments with AI-based moderation on Discourse Meta

AI spam detection has been tremendously successful and has helped many of our communities succeed.

In this post, I would like to share details about our “in progress” experiment, in case it is helpful to other communities.

I intend to keep this post up to date as the experiment progresses and share some information on the class of problems it can detect.

Be mindful, though: this is an evolving system, not a final product yet.

Why AI moderation?

A key principle of our AI integration on Discourse is that it should add value for human moderators rather than replace them. The promise of AI moderation is that it can signal to moderators that “something is wrong” and make recommendations about actions they should take. Agency should be left entirely to human moderators.

Why chat as a modality for this experiment

When setting up my experiment, I opted to use chat as the modality for notifications. This allows for a dedicated channel for the experiment that does not interfere with general moderation on the forum.

Given that building and refining prompts is very much a work-in-progress, bugging the rest of the moderation team on meta did not feel like a good approach.

When you give people highly unfinished AI projects, you can very easily lose all trust and future buy-in.

What about batch testing?

One limitation of our current automation system is that you cannot batch test changes. This means that when you change an AI prompt, there is a long delay before you can tell how helpful the change was.

This is particularly problematic if you only see a handful of problems on the forum in a day: the feedback loop is too slow, and it can take months to refine a prompt.

I am very aware of this limitation and hope to delete this section from the post over the next few weeks, once we have a system for batch testing.

How is this configured?

My current experiment builds on 3 features:

  1. Automation - AI Persona responder
  2. Discourse AI - AI Persona
  3. Discourse AI - Custom tools

Our responder automation

The most notable thing about the responder is that it is silent, meaning it will neither whisper nor post on the topic it triages.

Our moderating persona

The most notable thing here is the forced tool: it means every post will be judged using the Judge Post custom tool.

Our current system prompt is this: (will update as we go)

system prompt

You are an AI moderator for meta.discourse.org, the official Discourse discussion forum. Your role is to help maintain a “clean, well-lighted place for civilized public discourse” in alignment with our community guidelines.

MODERATION PHILOSOPHY:

  • View this forum as a shared community resource, like a public park
  • Use guidelines to aid human judgment, not as rigid rules
  • Focus on improving discussions rather than just enforcing rules
  • Balance between facilitation and moderation

CONTENT EVALUATION FRAMEWORK:

  1. IMPROVE THE DISCUSSION

    • Assess if posts add value to the conversation
    • Recognize posts that show respect for topics and participants
    • Support exploration of existing discussions before starting new ones
  2. DISAGREEMENT STANDARDS

    • Distinguish between criticizing ideas (acceptable) and criticizing people (unacceptable)
    • Flag instances of: name-calling, ad hominem attacks, tone responses, knee-jerk contradictions
    • Evaluate whether counter-arguments are reasoned and improve the conversation
  3. PARTICIPATION QUALITY

    • Prioritize discussions that make the forum an interesting place
    • Consider community signals (likes, flags, replies) in assessment
    • Support content that leaves the community “better than we found it”
  4. PROBLEM IDENTIFICATION

    • Focus on flagging bad behavior rather than engaging with it
    • Recognize when flags should trigger action (automatically or by human moderators)
    • Remember that both moderators and users share responsibility for the forum
  5. CIVILITY ENFORCEMENT

    • Identify potentially offensive, abusive, or hate speech
    • Flag obscene or sexually explicit content
    • Watch for harassment, impersonation, or exposure of private information
    • Prevent spam or forum vandalism
  6. ORGANIZATION MAINTENANCE

    • Note topics posted in wrong categories
    • Identify cross-posting across multiple topics
    • Flag no-content replies and topic diversions
    • Discourage post signatures
  7. CONTENT OWNERSHIP

    • Flag unauthorized posting of others’ digital content
    • Identify potential intellectual property violations

When evaluating content, consider context, user history, and forum norms. Your goal is to guide rather than punish, educate rather than enforce, but maintain consistent standards that preserve the quality of discussion.


Judge ALL posts, if a post requires no moderation use the ignore priority.

Our judge post custom tool

the script powering it
function invoke(params) {
  let post, topic;
  if (params.priority !== "ignore") {
    // post_id for testing
    const post_id = context.post_id || 1735240;
    post = discourse.getPost(post_id);
    topic = post.topic;
    let statusEmoji = "";

    if (params.priority === "urgent") {
      statusEmoji = ":police_car_light:"; // flashing light for urgent
    } else if (params.priority === "medium") {
      statusEmoji = ":warning:"; // warning sign for medium
    } else if (params.priority === "low") {
      statusEmoji = ":writing_hand:"; // writing hand for low
    }

    const message = `${statusEmoji} [${topic.title} - ${post.username}](${post.post_url}): ${params.message}`;
    discourse.createChatMessage({ channel_name: "AI Moderation", username: "AI-moderation-bot", message: message });
  }
  chain.setCustomRaw("Post was classified");
  return "done";
}
function details() {
  return "Judge Post";
}
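
For reference, the tool also declares the parameters the LLM has to supply. I have not pasted the exact definition here, but based on how the script reads them it is roughly equivalent to the sketch below (field names are illustrative, not copied from the tool editor):

// Illustrative sketch only: the real declaration lives in the custom tool editor.
// Reconstructed from the params the script reads above.
{
  name: "judge_post",
  description: "Judge a post and report a moderation priority",
  parameters: [
    {
      name: "priority",   // read as params.priority
      type: "string",
      required: true,
      enum: ["ignore", "low", "medium", "urgent"]
    },
    {
      name: "message",    // read as params.message; the justification relayed to moderators
      type: "string",
      required: true
    }
  ]
}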

The script uses quite a few advanced techniques:

  1. chain.setCustomRaw, which tells the persona to stop running the LLM chain and makes the tool call the final call, saving tokens
  2. discourse.createChatMessage, a new API that can be used from tools to create chat messages
  3. discourse.getPost, which is used to fetch post information

Given this, I am able to test the tool using the test button and confirm it works well:
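
If you want a feel for what the test run does, it is essentially equivalent to calling the script's invoke function with hand-picked parameters. A hypothetical example (the hard-coded post_id fallback in the script is what makes this work outside of a real triage run):

// Hypothetical test invocation: the parameters an LLM would normally supply.
invoke({
  priority: "low",                                        // ignore | low | medium | urgent
  message: "Post signature detected, worth a quick look"  // justification relayed to the chat channel
});
// With priority "ignore" the chat branch is skipped entirely; the tool only
// sets the custom raw ("Post was classified") and returns "done".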

What model are you using?

At the moment, we are using Sonnet 3.7, which is a frontier model. However, we plan to shift to Gemini Flash once I make some improvements to Discourse Automation, particularly the ability to tell it to scan only public content and avoid secure categories.

I’m happy to field questions here and will keep updating as the experiment runs and we roll out more Discourse Automation features.

18 Likes

How often do you get false positives or other misses? This is a relatively peaceful environment, though.

1 Like

It has been 100% silent today; so quiet that I am going to add features to automation to keep track of whether it is actually working :slight_smile:

2 Likes

I hope that maybe in 2 or 3 years AI could become useful locally to help my team moderate, but today I ask myself: is it really necessary now? So thank you for these regular topics explaining the progress.

Another question: will Discourse someday provide a multilingual API to self-hosted sites, so CDCK can keep our data safe while you fight against bad actors for us? I know I can use an LLM model, but I would gladly pay for your service as an alternative :smiley:

Let me give an example: Google Perspective is a freemium option for this, supporting many languages to fight toxicity. Why doesn't CDCK provide something similar?

1 Like

Thanks for the feedback. Yes, this is something we have thought about, but I do not think we will embark on an adventure like this in the upcoming 12 months.

1 Like

Update time

Over the past few days I made two rather large change sets to support this experiment better:

and

These changes enabled us to migrate to the far cheaper Gemini Flash 2.0 model, particularly this change:

It allowed us to have extreme confidence that only public posts on the forum are scanned.

At CDCK we have different data handling rules for different classes of data and at the moment we only approve usage of Gemini Flash on public data.

My original prompt in the OP was not triggering anything on meta. To be fair, meta is a kind and friendly place with very little need for hands-on moderation, so this is no surprise.

That said, I was simply not sure anything was working…

To resolve that, I added stats to automation (merged a few hours ago):

So we can tell this automation is working, given that it ran 20 minutes ago and has run 8 times this month.


When things were super quiet the day I deployed it, I decided to make the automation “cry wolf” because I wanted to get a better feel for the system. I amended the prompt to:

You are an AI moderator for meta.discourse.org, the official Discourse discussion forum. Your role is to help maintain a "clean, well-lighted place for civilized public discourse" in alignment with our community guidelines.

MODERATION PHILOSOPHY:
- View this forum as a shared community resource, like a public park
- Use guidelines to aid human judgment, not as rigid rules
- Focus on improving discussions rather than just enforcing rules
- Balance between facilitation and moderation
- Err on the side of flagging questionable content for human review

CONTENT EVALUATION FRAMEWORK:
1. IMPROVE THE DISCUSSION
   - Assess if posts add substantive value to the conversation
   - Flag posts with minimal substance, generic responses, or shallow engagement
   - Recognize posts that show respect for topics and participants
   - Support exploration of existing discussions before starting new ones
   - Be vigilant about "drive-by" comments that add little to the discussion

2. DISAGREEMENT STANDARDS
   - Distinguish between criticizing ideas (acceptable) and criticizing people (unacceptable)
   - Flag instances of: name-calling, ad hominem attacks, tone responses, knee-jerk contradictions
   - Evaluate whether counter-arguments are reasoned and improve the conversation
   - Be sensitive to subtle forms of dismissiveness or condescension

3. PARTICIPATION QUALITY
   - Prioritize discussions that make the forum an interesting place
   - Consider community signals (likes, flags, replies) in assessment
   - Flag content that seems generic, templated, or lacking personal insight
   - Watch for contributions that appear formulaic or don't engage meaningfully with specifics
   - Support content that leaves the community "better than we found it"

4. PROBLEM IDENTIFICATION
   - Focus on flagging bad behavior rather than engaging with it
   - Be proactive in identifying potentially problematic patterns before they escalate
   - Recognize when flags should trigger action (automatically or by human moderators)
   - Remember that both moderators and users share responsibility for the forum

5. CIVILITY ENFORCEMENT
   - Identify potentially offensive, abusive, or hate speech, including subtle forms
   - Flag obscene or sexually explicit content
   - Watch for harassment, impersonation, or exposure of private information
   - Prevent spam, forum vandalism, or marketing disguised as contribution

6. ORGANIZATION MAINTENANCE
   - Note topics posted in wrong categories
   - Identify cross-posting across multiple topics
   - Flag no-content replies, topic diversions, and threadjacking
   - Discourage post signatures and unnecessary formatting

7. CONTENT OWNERSHIP
   - Flag unauthorized posting of others' digital content
   - Identify potential intellectual property violations

8. AI-GENERATED CONTENT DETECTION
   - Watch for telltale signs of AI-generated content: overly formal language, generic phrasing, perfect grammar with little personality
   - Flag content that seems templated, lacks specificity, or doesn't engage with the particulars of the discussion
   - Be sensitive to responses that seem comprehensive but shallow in actual insight
   - Identify posts with unusual phrasing patterns, unnecessary verbosity, or repetitive structures

OUTPUT FORMAT:
Your moderation assessment must be extremely concise:
**[PRIORITY]**: 1-2 sentence justification with key issue identified
Use markdown formatting for readability but keep total response under 3 lines when possible.

When evaluating content, consider context, user history, and forum norms. Set a high bar for what passes without moderation - use "low" priority even for minor issues, reserving "ignore" only for clearly valuable contributions.

--- 

Judge ALL posts with a skeptical eye. Only use the "ignore" priority for contributions with clear, authentic value. When in doubt about a post's value or authenticity, assign at least a "low" priority for human review.

This prompt results in a far noisier chat channel:

Observations

This experiment is taking twists and turns, but I am seeing something very interesting forming.

Not all moderation needs to be flag based; sometimes simply having some ideas and awareness that something is going on is good enough.

This kind of tooling is very aligned with our vision for AI in communities: it is a “little AI sidekick” that gives moderators ideas about what to look at. Additionally, it is an opportunity to enforce common guidelines and rules.

Some small communities might want a “naggy” AI sidekick. Other, larger and busier ones may only be able to afford attention for extreme outlier behavior.

Future areas I am considering working on here are:

  1. It is kind of annoying that the moderator bot steps in and asks about the same topic twice. Collapsing old messages, threading, or something else may be an interesting approach for avoiding this.

  2. @hugh raised that once you see a chat channel like this, you want to just ask the bot to act on your behalf, e.g.:

    • Perform deep research on this and provide detailed guidance
    • Oh this really looks like a terrible user, help me ban this user for 3 days
    • Open a bug on our internal bug tracker to keep track of this issue
    • and so on.

To get to the state where a bot can act on our behalf, we need a new construct in Discourse AI that allows a tool to seek user approval. This is something I am thinking about.

  3. As raised in the OP, running batches would be nice; there is just too much lead time between when you edit a prompt and when you know whether the edit worked. I am thinking about how to add this to automation.

  4. Live tuning is an interesting concept… “Hey bot, this is too much, why are you bugging me about this stuff” … “Bot … X, Y, Z … would you like me to improve my instruction set?” … “Yes”

Hope you all find this helpful, let me know if you have any questions.

7 Likes

Just an idea: could you do something in your prompt so the moderation bot will at least once in a while post a ping response, to show it's working? Maybe, for example, with 1% probability when a post needs no action, post a note that the post needed no action. Or a lower probability for a busier forum.
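
One way to approximate this without touching the prompt at all might be to do the sampling inside the tool script itself. A rough, untested sketch reusing the invoke function and chat APIs from the OP (the emoji and wording are just placeholders):

// Rough, untested sketch: occasionally post a heartbeat even when no action is needed.
function invoke(params) {
  if (params.priority === "ignore") {
    // ~1% of "ignore" classifications produce a visible ping so moderators can
    // tell the automation is still alive on a quiet forum.
    if (Math.random() < 0.01) {
      discourse.createChatMessage({
        channel_name: "AI Moderation",
        username: "AI-moderation-bot",
        message: ":white_check_mark: Heartbeat: latest post reviewed, no action needed."
      });
    }
  } else {
    // ... existing urgent/medium/low handling from the OP goes here ...
  }
  chain.setCustomRaw("Post was classified");
  return "done";
}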

Looking at the difference between these prompts:

Judge ALL posts, if a post requires no moderation use the ignore priority.

Judge ALL posts with a skeptical eye. Only use the “ignore” priority for contributions with clear, authentic value. When in doubt about a post’s value or authenticity, assign at least a “low” priority for human review.

I think it’s important to remember the major recency bias in the models – perhaps all command words should be mentioned in prose near the end, in reverse order of desired frequency.

Alternatively, have it trigger on an innocent, common-but-not-too-common word. “Flag posts that mention pineapples”.

2 Likes