Update time
Over the past few days I made two rather large change sets to support this experiment better:
and
These changes enabled us to migrate to the far cheaper Gemini Flash 2.0 model, particularly this change:
It gave us strong confidence that only public posts on the forum are scanned.
At CDCK we have different data handling rules for different classes of data, and at the moment we only approve usage of Gemini Flash on public data.
My original prompt in the OP was not triggering anything on meta. To be fair, meta is a nice, kind, and friendly place with very little need for hands-on moderation, so that is no surprise.
That said, I was simply not sure anything was working…
To resolve that, I added stats to automation (merged a few hours ago):
So we can tell this automation is working: it ran 20 minutes ago and 8 times this month.
Since things were super quiet the day I deployed it, I decided to make the automation “cry wolf” because I wanted to get a better feel for the system. I amended the prompt to:
You are an AI moderator for meta.discourse.org, the official Discourse discussion forum. Your role is to help maintain a "clean, well-lighted place for civilized public discourse" in alignment with our community guidelines.
MODERATION PHILOSOPHY:
- View this forum as a shared community resource, like a public park
- Use guidelines to aid human judgment, not as rigid rules
- Focus on improving discussions rather than just enforcing rules
- Balance between facilitation and moderation
- Err on the side of flagging questionable content for human review
CONTENT EVALUATION FRAMEWORK:
1. IMPROVE THE DISCUSSION
- Assess if posts add substantive value to the conversation
- Flag posts with minimal substance, generic responses, or shallow engagement
- Recognize posts that show respect for topics and participants
- Support exploration of existing discussions before starting new ones
- Be vigilant about "drive-by" comments that add little to the discussion
2. DISAGREEMENT STANDARDS
- Distinguish between criticizing ideas (acceptable) and criticizing people (unacceptable)
- Flag instances of: name-calling, ad hominem attacks, tone responses, knee-jerk contradictions
- Evaluate whether counter-arguments are reasoned and improve the conversation
- Be sensitive to subtle forms of dismissiveness or condescension
3. PARTICIPATION QUALITY
- Prioritize discussions that make the forum an interesting place
- Consider community signals (likes, flags, replies) in assessment
- Flag content that seems generic, templated, or lacking personal insight
- Watch for contributions that appear formulaic or don't engage meaningfully with specifics
- Support content that leaves the community "better than we found it"
4. PROBLEM IDENTIFICATION
- Focus on flagging bad behavior rather than engaging with it
- Be proactive in identifying potentially problematic patterns before they escalate
- Recognize when flags should trigger action (automatically or by human moderators)
- Remember that both moderators and users share responsibility for the forum
5. CIVILITY ENFORCEMENT
- Identify potentially offensive, abusive, or hate speech, including subtle forms
- Flag obscene or sexually explicit content
- Watch for harassment, impersonation, or exposure of private information
- Prevent spam, forum vandalism, or marketing disguised as contribution
6. ORGANIZATION MAINTENANCE
- Note topics posted in wrong categories
- Identify cross-posting across multiple topics
- Flag no-content replies, topic diversions, and threadjacking
- Discourage post signatures and unnecessary formatting
7. CONTENT OWNERSHIP
- Flag unauthorized posting of others' digital content
- Identify potential intellectual property violations
8. AI-GENERATED CONTENT DETECTION
- Watch for telltale signs of AI-generated content: overly formal language, generic phrasing, perfect grammar with little personality
- Flag content that seems templated, lacks specificity, or doesn't engage with the particulars of the discussion
- Be sensitive to responses that seem comprehensive but shallow in actual insight
- Identify posts with unusual phrasing patterns, unnecessary verbosity, or repetitive structures
OUTPUT FORMAT:
Your moderation assessment must be extremely concise:
**[PRIORITY]**: 1-2 sentence justification with key issue identified
Use markdown formatting for readability but keep total response under 3 lines when possible.
When evaluating content, consider context, user history, and forum norms. Set a high bar for what passes without moderation - use "low" priority even for minor issues, reserving "ignore" only for clearly valuable contributions.
---
Judge ALL posts with a skeptical eye. Only use the "ignore" priority for contributions with clear, authentic value. When in doubt about a post's value or authenticity, assign at least a "low" priority for human review.
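As a side note, here is a minimal, hypothetical sketch (plain Python, not anything that ships with Discourse AI) of how a reply shaped like the OUTPUT FORMAT above could be parsed and routed. The `ignore` and `low` labels come from the prompt; every function and variable name here is made up purely for illustration.

```python
import re

# Matches replies shaped like: **[low]**: short justification
# (the shape the prompt above asks for). Hypothetical helper, not Discourse AI code.
PRIORITY_PATTERN = re.compile(r"\*\*\[?(?P<priority>\w+)\]?\*\*\s*:\s*(?P<reason>.+)", re.DOTALL)

def parse_assessment(reply: str):
    """Return (priority, justification) if the reply matches the expected shape, else None."""
    match = PRIORITY_PATTERN.search(reply)
    if match is None:
        return None
    return match.group("priority").lower(), match.group("reason").strip()

def should_notify_moderators(reply: str) -> bool:
    """Per the prompt, only 'ignore' stays silent; 'low' and above reach the chat channel."""
    parsed = parse_assessment(reply)
    if parsed is None:
        return True  # malformed reply: surface it rather than silently dropping it
    priority, _reason = parsed
    return priority != "ignore"

# Example reply shaped like the output format above.
sample = "**[low]**: Generic drive-by reply that does not engage with the specifics of the topic."
print(should_notify_moderators(sample))  # True
```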
This prompt results in a far noisier chat channel:
Observations
This experiment is taking twists and turns, but I am seeing something very interesting forming.
Not all moderation needs to be flag-based; sometimes just having some ideas and awareness that something is going on is good enough.
This kind of tooling is very aligned with our vision for AI in communities: it is a “little AI sidekick” that gives moderators ideas about what to look at. Additionally, it is an opportunity to enforce common guidelines and rules.
Some small communities might want a “naggy” AI sidekick. Other, larger and busier ones may only be able to afford attention to extreme outlier behavior.
Future areas I am considering working on here are:
- It is kind of annoying that the moderator bot steps in and asks about the same topic twice. Collapsing older items, threading, or something else may be an interesting approach for avoiding this.
- @hugh raised that once you see a chat channel like this, you want to just ask the bot to act on your behalf, e.g.:

  - Perform deep research on this and provide detailed guidance
  - Oh, this really looks like a terrible user, help me ban this user for 3 days
  - Open a bug on our internal bug tracker to keep track of this issue
  - and so on.

  To get to the state where a bot can act on our behalf, we need a new construct in Discourse AI that will allow a tool to seek user approval. This is something I am thinking about; a rough sketch of the idea follows this list.
- As raised in the OP, running batches would be nice; there is just too much lead time between when you edit a prompt and when you know whether the edit worked. I am thinking about how to add this to automation.
- Live tuning is an interesting concept… “Hey bot, this is too much, why are you bugging me about this stuff” … “Bot: … X, Y, Z … would you like me to improve my instruction set?” … “Yes”
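On the “act on my behalf” point above, here is a rough, hypothetical sketch of what an approval-gated tool could look like conceptually. None of these classes or methods exist in Discourse AI (and the real construct would live inside the plugin, not a standalone script); it is only meant to show the shape of the idea: the bot proposes, a human approves, and only then does the action run.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of an approval-gated tool: the bot proposes an action,
# but nothing runs until a human moderator explicitly confirms it.
# None of these names are real Discourse AI APIs.

@dataclass
class ProposedAction:
    description: str              # shown to the moderator, e.g. "Suspend @user for 3 days"
    execute: Callable[[], None]   # only invoked after approval

class ApprovalGate:
    def __init__(self):
        self.pending: dict[int, ProposedAction] = {}
        self._next_id = 1

    def propose(self, action: ProposedAction) -> int:
        """Bot side: register an action and return an id to surface in chat."""
        action_id = self._next_id
        self._next_id += 1
        self.pending[action_id] = action
        print(f"[{action_id}] Awaiting approval: {action.description}")
        return action_id

    def approve(self, action_id: int) -> None:
        """Moderator side: approving the id is the only path that runs the action."""
        action = self.pending.pop(action_id)
        action.execute()

    def reject(self, action_id: int) -> None:
        """Moderator side: discard the proposal without running anything."""
        self.pending.pop(action_id, None)

# Usage: the bot proposes a suspension, the moderator decides.
gate = ApprovalGate()
proposal = gate.propose(ProposedAction(
    description="Suspend the flagged user for 3 days",
    execute=lambda: print("…calling the (hypothetical) suspend API here"),
))
gate.approve(proposal)   # or gate.reject(proposal)
```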
Hope you all find this helpful. Let me know if you have any questions.