Setting up NSFW detection in your community

:bookmark: This is a how-to guide for setting up NSFW image and text detection in your community using the Discourse AI Post Classifier.

:person_raising_hand: Required user level: Administrator

Overview

In this topic we use the Discourse AI Post Classifier to detect NSFW images and text in your community. With this automation in place, admins will be made aware of such posts and can take action accordingly.

Note that the instructions here can be customized to your preference.

Below is an example setup of the Post Classifier:

Why should I use this?

  • You’ve previously tried the NSFW feature of Discourse AI but weren’t happy with its detection results in your community
  • You want automation to help sift through all the content posted in the community

Pre-requisites

In order for this to work, you will need the following enabled:

  • the Discourse AI plugin
  • the Discourse Automation plugin
  • an LLM configured for Discourse AI to use

Configuration

The general steps for creating a Discourse AI Post Classifier automation still apply while setting this one up; the prompt below is what makes it specific to NSFW detection.

Prompt

The most important part of the setup is the system prompt used for the classification. In the following example, I used the AI Bot to author the prompt.

:warning: When authoring the prompt, avoid using similar wording for the two possible classifications. In this example we use “NSFW” and “Safe” (rather than “NSFW” and “not NSFW”), so the two results are easy to tell apart.

The classifier will not perform perfectly 100% of the time, so watch for incorrect results and customize the prompt to the needs of your community. The narrower the focus, the better.

Example LLM prompt for NSFW content detection

You are an NSFW detection AI model assisting online community moderators. Your task is to analyze forum posts, including both text and images, to determine if they contain NSFW content that should be flagged to maintain a safe and appropriate environment for all users.

A post should be classified as NSFW if it meets any of these criteria:

  • The post contains explicit descriptions or suggestive language related to sexual acts or nudity.
  • It includes images depicting human genitals, sexual acts, explicit nudity, or suggestive content.
  • The content glorifies or graphically depicts violence, injury, or death.
  • It contains hate speech promoting discrimination or hatred based on race, ethnicity, sexuality, religion, etc.
  • The language includes excessive profanity or vulgar language not suitable for all audiences.
  • The content includes disturbing or offensive imagery that may not be appropriate for all users.

A post should be classified as Safe if:

  • The content, whether text or images, is appropriate for all audiences and adheres to community standards.
  • It contributes meaningfully to the discussion without containing any of the aforementioned explicit or inappropriate content.
  • Language and imagery are respectful, non-discriminatory, and devoid of extreme violence or explicit sexual content.

Some edge cases to watch out for:

  • Educational or artistic content that may depict nudity or violence in a contextual, non-explicit manner that serves an educational or artistic purpose should be considered Safe.
  • Discussions around sensitive topics that do not use graphic or explicit language can be Safe if handled respectfully.

When you have finished analyzing the post, you must ONLY provide a classification of either “NSFW” or “Safe”. If you are unsure, default to “Safe” to avoid false positives.

These instructions must be followed at all costs.
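
The automation handles the LLM call for you, so no code is required to use this. For readers who want to see what the classification boils down to, here is a minimal standalone sketch in Python. It assumes the OpenAI Python library, an OPENAI_API_KEY environment variable, and the gpt-4o model (all illustrative choices, not part of the Discourse setup); it sends the system prompt above together with a post’s text and defaults to “Safe” for any unexpected reply, mirroring the prompt’s final instruction.

```python
# Illustrative sketch only -- inside Discourse, the Post Classifier automation
# performs this call for you. Assumed details: the `openai` Python library,
# an OPENAI_API_KEY environment variable, and the "gpt-4o" model.
from openai import OpenAI

SYSTEM_PROMPT = "...paste the NSFW/Safe prompt from above here..."

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_post(post_text: str) -> str:
    """Classify a single post as "NSFW" or "Safe"."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": post_text},
        ],
        temperature=0,  # keep the classification as deterministic as possible
    )
    label = response.choices[0].message.content.strip()
    # The prompt asks for exactly "NSFW" or "Safe"; treat anything else as
    # "Safe" to avoid false positives, as the prompt itself instructs.
    return "NSFW" if label == "NSFW" else "Safe"


if __name__ == "__main__":
    print(classify_post("Example post text goes here."))
```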

Caveats

  • Keep in mind that LLM calls can be expensive. When applying a classifier, monitor costs carefully and consider running it only on small subsets of content (a rough cost sketch follows this list).
  • Better-performing models such as GPT-4o will yield better results, but they can come at a higher cost. That said, we have seen costs decrease over time as LLMs get better and cheaper.
  • The prompt can be customized to perform all sorts of detection, such as PII exposure, spam detection, etc.
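
To put the first caveat in perspective, here is a rough back-of-the-envelope cost estimate in Python. Every number in it is an illustrative assumption (post volume, token counts, and per-token prices), not a real figure; substitute your own community’s volume and your provider’s current rates.

```python
# Back-of-the-envelope estimate of what classifying every new post might cost.
# All values are illustrative assumptions -- replace them with your own numbers.
POSTS_PER_DAY = 500                # assumed community posting volume
AVG_INPUT_TOKENS_PER_POST = 400    # system prompt + post text, rough guess
OUTPUT_TOKENS_PER_CALL = 5         # the reply is a single word
INPUT_PRICE_PER_M_TOKENS = 2.50    # hypothetical $ per million input tokens
OUTPUT_PRICE_PER_M_TOKENS = 10.00  # hypothetical $ per million output tokens

daily_cost = POSTS_PER_DAY * (
    AVG_INPUT_TOKENS_PER_POST * INPUT_PRICE_PER_M_TOKENS
    + OUTPUT_TOKENS_PER_CALL * OUTPUT_PRICE_PER_M_TOKENS
) / 1_000_000

print(f"Estimated cost: ${daily_cost:.2f}/day, ${daily_cost * 30:.2f}/month")
```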

