Setting up NSFW detection in your community

:bookmark: This is a guide for setting up NSFW content detection in your community using Discourse AI automation to identify and moderate inappropriate images and text.

:person_raising_hand: Required user level: Administrator


Automatically detect and moderate NSFW (Not Safe for Work) content in your Discourse community using AI-powered automation. This guide will help you configure automated detection for both inappropriate images and text content, allowing you to maintain community standards with minimal manual intervention.

Summary

This documentation covers configuring the Triage Posts Using AI automation to:

  • Detect NSFW images using vision-enabled AI models
  • Identify inappropriate text content and language
  • Automatically flag, categorize, and moderate problematic posts
  • Set up custom responses and moderation actions

The automation uses large language models (LLMs) to analyze post content and takes predefined actions when NSFW material is detected.
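Conceptually, the triage flow is simple: the post is sent to the LLM along with your system prompt, and if the model's reply contains the configured search text, the automation's actions fire. The following Python sketch is purely illustrative (it is not Discourse's implementation); `classify` stands in for whatever LLM call you configure:

```python
def triage_post(post_text, classify, search_text="NSFW"):
    """Send a post to an LLM classifier and decide whether to act.

    `classify` is any callable returning the model's raw output
    (e.g. "NSFW" or "SAFE" with the prompts in this guide).
    """
    verdict = classify(post_text).strip()
    if search_text in verdict:
        # In Discourse this would flag, hide, categorize, or reply;
        # here we just report that the automation would fire.
        return {"flagged": True, "verdict": verdict}
    return {"flagged": False, "verdict": verdict}

# Stub classifier standing in for the real LLM:
stub = lambda text: "NSFW" if "explicit" in text.lower() else "SAFE"
print(triage_post("An explicit image", stub))   # flagged
print(triage_post("A recipe for bread", stub))  # not flagged
```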

Prerequisites

Before setting up NSFW detection, ensure you have the following enabled:

  • Discourse AI plugin: the core AI functionality plugin
  • Discourse Automation plugin: required for creating automated rules
  • Agent: an AI agent with a system prompt that defines what constitutes NSFW content. Use distinct language for the positive and negative classifications to avoid ambiguity.
  • Vision-enabled LLM: required only for image detection; standard LLMs work for text-only detection. Make sure “Vision enabled” is turned on for both the LLM model and the Agent.
    • Discourse hosted customers can select our CDCK Hosted Small LLM when configuring Agents.
    • Self-hosted Discourse users will need to configure a third-party LLM.

Example prompts:

For image detection:

You are a bot specializing in image classification. Respond only with either NSFW or SAFE, and nothing else. NSFW is porn or gore, and SAFE is everything else. When in doubt reply with SAFE.
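For illustration, with an OpenAI-compatible chat API the image classification request might be assembled as below. This is a sketch only: the model name is a placeholder, and Discourse builds the equivalent call for you — you never write this yourself.

```python
SYSTEM_PROMPT = (
    "You are a bot specializing in image classification. Respond only with "
    "either NSFW or SAFE, and nothing else. NSFW is porn or gore, and SAFE "
    "is everything else. When in doubt reply with SAFE."
)

def build_image_request(image_url, model="gpt-4o"):
    """Assemble a vision request payload (illustrative; Discourse
    performs this step internally for vision-enabled LLMs)."""
    return {
        "model": model,  # placeholder: any vision-enabled model
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        "max_tokens": 5,  # the expected answer is a single word
    }

payload = build_image_request("https://example.com/upload.png")
```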

For text detection:

You are an advanced AI content moderation system designed to triage user-generated posts. Your task is to detect and flag any content that includes bad language, inappropriate terms, or NSFW (Not Safe for Work) content.

NSFW content includes explicit sexual content, violence, hate speech, graphic language, discrimination, self-harm references, or illegal activity.

Respond with exactly one word:
* "SAFE": if the post is appropriate and contains no bad or NSFW content
* "NSFW": if bad, inappropriate, or NSFW content is detected

Be context-aware and avoid false positives.

Configuration steps

Enable required plugins

  1. Navigate to your site’s admin panel
  2. Go to Plugins > Installed Plugins
  3. Enable both the Discourse AI and Automation plugins

Create automation rule

  1. In the admin panel, navigate to Plugins > Automation
  2. Click + Create to begin creating a new automation rule
  3. Select Triage Posts Using AI
  4. Set a descriptive name (e.g., “NSFW Content Detection”)

Configure triggers and restrictions

Set the trigger:

  • Choose Post created/edited as the trigger for scanning new or edited posts
  • Alternatively, choose Stalled topic to triage topics that have gone without replies for a specified duration
  • Optionally specify Action type, Categories, Tags, Groups, Trust Levels, or Post features to restrict automation scope
  • Leave fields blank to apply automation site-wide

Optional restrictions (Post created/edited trigger):
Configure additional settings to further limit automation scope:

  • First post only or Original post only to target only new topics
  • First topic only to target only a user’s first topic
  • Post features to restrict to posts with images, links, code, or uploads — useful for image-based NSFW detection
  • Restricted archetype to limit to regular topics, public topics, or personal messages

Configure AI classification

:spiral_notepad: The system prompt field has been deprecated in favor of Agents. If you had an AI automation prior to this change, a new Agent with the associated system prompt will be automatically created.

Agent:
Select the Agent defined for the NSFW detection automation.

Search text:
Enter the exact output from your prompt that triggers automation actions. Using the examples above, enter NSFW.

Advanced options:

  • Max Post Tokens: Limit how many tokens of the post are sent to the LLM
  • Max output tokens: Set an upper bound on the number of tokens the model can generate
  • Stop Sequences: Instruct the model to halt generation when it encounters specific values
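To see why Max Post Tokens matters, here is a rough sketch of truncating a post before classification. Real tokenizers are model-specific, so this whitespace-word approximation is illustrative only:

```python
def truncate_to_tokens(text, max_tokens):
    """Approximate a token limit by whitespace-separated words.

    Illustrative only: real LLM tokenizers split text differently,
    and Discourse applies its own truncation internally.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return text
    return " ".join(words[:max_tokens])

post = "word " * 5000  # a very long post
clipped = truncate_to_tokens(post, 500)  # only the first 500 "tokens" are sent
```

Capping the post length keeps per-call cost bounded, at the price of the model not seeing the full post.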

Set moderation actions

Categorization and tagging:

  • Define the category where flagged posts should be moved
  • Specify tags to be added to identified NSFW content

Flagging options:

  • Enable Flag post to activate flagging, then choose a flag type:
    • Add post to review queue — sends the post to the review queue for manual moderator review
    • Add post to review queue and hide post — review queue + immediately hides the post
    • Add post to review queue and delete post — review queue + soft-deletes the post
    • Add post to review queue, delete post and silence user — review queue + soft-deletes the post + silences the author
    • Flag as spam and hide post — flags the post as spam (auto-hides it)
    • Flag as spam, hide post and silence user — spam flag + silences the author
  • Enable Hide Topic to automatically hide the entire topic

Automated responses:

  • Set a Reply User and Reply (canned reply) to post a fixed message explaining why the post was flagged
  • Select a Reply Agent to use a separate AI agent for generating dynamic responses (this takes priority over a canned reply)
  • Enable Reply as Whisper to make the reply visible only to staff

Author notifications:

  • Enable Notify author via PM to send a personal message to the post author when their content is flagged
  • Set a PM sender user (defaults to system) and optionally provide a custom PM content

Other options:

  • Enable Include personal messages to also scan and triage personal messages

Caveats

  • LLM calls can be expensive. When applying a classifier, monitor costs carefully and consider running it only on small subsets of posts.
  • Better-performing models such as GPT-4o yield better results, but at a higher cost. That said, costs have tended to decrease over time as LLMs become more capable and cheaper.

Other uses

The prompt can be customized to perform all sorts of detection, such as PII exposure and spam detection. We’d love to hear how you are putting this automation to work in your community!

