Setting up NSFW detection in your community

:bookmark: This is a how-to guide for setting up NSFW image and text detection in your community using the Discourse AI Post Classifier.

:person_raising_hand: Required user level: Administrator

Overview

In this topic we will use Discourse AI - AI triage to detect NSFW images and text in your community. With this in place, admins will be made aware of such posts and can take action accordingly.

Note that the instructions here can be customized to your preference; what follows is one example setup.

Prerequisites

For this to work, you will need the following enabled:

  • Discourse AI
  • Discourse Automation
  • A vision-enabled LLM (Large Language Model)
    • Discourse hosted customers on our Business or Enterprise plans can opt into our hosted CDCK LLMs by enabling the experimental settings on your site’s Admin > What’s New page.

:warning: A vision-enabled LLM is only required if you want to detect images; a standard LLM will work fine for text-only detection.
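
If you want to confirm that a third-party model is vision-capable before wiring it into Discourse, you can send it a test image directly. Below is a minimal sketch against OpenAI’s Chat Completions API; the API key variable, model name, and image URL are placeholders, and other providers will differ:

```python
# Standalone check that a model can classify images, using OpenAI's
# Chat Completions API as an example provider. The API key, model name,
# and image URL are placeholders.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o",  # substitute any vision-enabled model
        "messages": [
            {"role": "system", "content": (
                "You are a bot specializing in image classification. "
                "Respond only with either NSFW or SAFE, and nothing else."
            )},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/test-image.jpg"}},
            ]},
        ],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])  # expect NSFW or SAFE
```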

Configuration

:information_source: Not every step is mandatory as automation rules can be customized as needed. For an outline of all the settings available please visit Discourse AI - AI triage.

  1. Enable the Discourse AI and Automation plugins:
  • Navigate to your site’s admin panel.
  • Navigate to Plugins, then Installed Plugins.
  • Enable the Discourse AI and Automation plugins.
  2. Create a new Automation rule:
  • Navigate to your site’s admin panel.
  • Navigate to Plugins and click Automation.
  • Click the + Create button to begin creating a new Automation rule.
  • Click Triage Posts Using AI.
  • Set the name (e.g., “Triage Posts using AI”).
  • Leave Triage Posts Using AI as the selected script.

What/When

  1. Set the Trigger:
  • Choose Post created/edited as the trigger.
  • Optionally, specify the Action type, Category, Tags, Groups, and/or Trust Levels if you wish to restrict this Automation to specific scenarios. Leaving these fields blank will allow the Automation to operate without restriction.
  • Configure any of the remaining optional settings in the What/When section to further restrict the automation.

Script Options

  1. System Prompt:

:warning: When authoring the prompt, make the two possible outputs clearly distinct; avoid similar language for the two results. In this example, we use NSFW and SAFE (for content that is not NSFW).

The classifier will not always perform perfectly, so watch for incorrect results and customize the prompt to the needs of your community. The narrower the focus, the better.

  • Enter the system prompt for the AI model. This is the most important part of the setup. In the following example, we used AI Bot to author the prompt. An example prompt might look like this:
Copyable LLM prompts for NSFW content detection

Prompt Example 1 - NSFW Image Detection:

You are a bot specializing in image classification. Respond only with either NSFW or SAFE, and nothing else. NSFW is porn or gore, and SAFE is everything else. When in doubt reply with SAFE.


Prompt Example 2 - NSFW Text Detection:

You are an advanced AI content moderation system designed to triage user-generated posts. Your task is to detect and flag any content that includes bad language, inappropriate terms, or NSFW (Not Safe for Work) content. NSFW content includes explicit sexual content, violence, hate speech, graphic language, discrimination, self-harm references, or illegal activity.

Follow these directives:

Flag NSFW or inappropriate content: If a post contains inappropriate or offensive words, profanity, or NSFW content (e.g., sexual references, hate speech, or graphic language), respond with exactly one word, either:

  • “SAFE”: The post is appropriate and doesn’t contain bad or NSFW content.
  • “NSFW”: If bad, inappropriate, or NSFW content is detected.

Be context-aware: Consider surrounding text when determining if a word or phrase is inappropriate (e.g., distinguish between harmless use of questionable terms and offensive usage).

Avoid false positives: Do not flag words that may have dual meanings unless the context confirms they are being used inappropriately.

Example Input/Expected Responses:
Input:
“This post is awesome, you guys rock!”
Response: “SAFE”

Input:
[Example of NSFW content]
Response: “NSFW”

Additional examples of words or content to respond with “NSFW”:

[Insert additional examples or list of inappropriate words, profanity, or NSFW content you wish to block]
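
Before enabling the automation, it is worth spot-checking your prompt against a few sample posts directly with your provider. The sketch below is a hypothetical test harness using OpenAI’s Chat Completions API via the requests library; the model name and sample posts are placeholders, so adapt them to your provider:

```python
# Hypothetical harness for spot-checking a text-classification prompt
# against a few sample posts before enabling the automation.
import os
import requests

SYSTEM_PROMPT = "..."  # paste your full classification prompt here

samples = [
    "This post is awesome, you guys rock!",   # expect SAFE
    "Check out the release notes for v2.1.",  # expect SAFE
]

for post in samples:
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o-mini",  # placeholder model
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": post},
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()
    verdict = resp.json()["choices"][0]["message"]["content"].strip()
    # The automation matches on exact text, so any reply other than the
    # two expected labels means the prompt needs tightening.
    print(f"{verdict}: {post[:50]}")
```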


  2. Search for Text:
  • Enter the output from your prompt that should trigger the automation (only the “positive” result). Using the example above, we would enter NSFW.
  3. Select the Model:
  • Choose your LLM.
    • Discourse hosted customers on our Enterprise and Business tiers can select the Discourse-hosted open-weights CDCK Hosted Vision LLM or a third-party provider.
    • Self-hosted Discourse users will need to select the third-party vision-enabled LLM configured as a prerequisite to using this Automation.
  4. Set Category and Tags:
  • Define the category where these posts should be moved and the tags to be added if the post is classified as NSFW.
  5. Flagging:
  • Select a flag type to determine the action taken: flag the post as spam or hold it for review.
  6. Additional Options:
  • Enable the “Hide Topic” option if you want the post to be hidden.
  • Set a “Reply” that will be posted in the topic when the post is deemed NSFW.
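
Taken together, the script options amount to roughly the following flow. This is an illustrative sketch only, not the plugin’s actual code; every function and setting name here is a hypothetical stand-in for something you configure in the UI:

```python
# Illustrative sketch of the flow the "Triage Posts Using AI" script follows.
# NOT the plugin's actual code; all names are hypothetical stand-ins.

def classify_with_llm(system_prompt: str, post_text: str) -> str:
    """Stand-in for the LLM call the automation makes."""
    return "NSFW"  # pretend the model flagged the post, for demonstration

def triage(post_text: str, settings: dict) -> list[str]:
    """Return the list of actions the automation would take on a post."""
    actions = []
    verdict = classify_with_llm(settings["system_prompt"], post_text)
    # "Search for Text" is a match on the model's reply.
    if settings["search_for_text"] not in verdict:
        return actions  # classified SAFE: nothing happens
    if settings.get("category"):
        actions.append(f"move topic to category {settings['category']}")
    if settings.get("tags"):
        actions.append(f"add tags {settings['tags']}")
    if settings.get("flag_type"):
        actions.append(f"flag post: {settings['flag_type']}")
    if settings.get("hide_topic"):
        actions.append("hide topic")
    if settings.get("reply"):
        actions.append("post the configured reply")
    return actions

print(triage("some new post", {
    "system_prompt": "...",  # your classification prompt
    "search_for_text": "NSFW",
    "category": "moderation-queue",
    "tags": ["nsfw"],
    "flag_type": "review",
    "hide_topic": True,
    "reply": "Your post was held for review.",
}))
```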

Caveats

  • Keep in mind that LLM calls can be expensive. When applying a classifier, monitor costs carefully and consider running it only on small subsets of posts; see the back-of-the-envelope estimate below.
  • While better-performing models such as GPT-4o will yield better results, they can come at a higher cost. That said, we have seen costs decrease over time as LLMs become better and cheaper.
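
As a back-of-the-envelope example (all numbers are hypothetical; substitute your site’s traffic and your provider’s current pricing):

```python
# Rough cost estimate for classifying every new post.
# All figures are hypothetical placeholders.
posts_per_day = 500
tokens_per_post = 1_200          # system prompt + post text + short reply
usd_per_million_tokens = 1.00    # varies widely by model and provider

daily_cost = posts_per_day * tokens_per_post * usd_per_million_tokens / 1_000_000
print(f"~${daily_cost:.2f}/day, ~${daily_cost * 30:.2f}/month")
```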

Other uses

The prompt can be customized to perform all sorts of detection, such as PII exposure and spam detection. We’d love to hear how you are putting this automation to work to benefit your community!
