Setting up toxicity detection in your community

:bookmark: This is a how-to guide for setting up toxicity detection in your community using the Discourse AI Post Classifier.

:person_raising_hand: Required user level: Administrator

Overview

In this topic we use Discourse AI - AI triage to detect toxicity and enforce a code of conduct in your community. With this setup, admins become aware of such posts and can take action accordingly.

Note that the instructions here can be customized to your preference.

Below is an example setup:

Pre-requisites

In order for this to work, you will need the following enabled:

  • The Discourse AI plugin
  • The Discourse Automation plugin
  • An LLM for the classifier to use (see the Select the Model step below)

Configuration

:information_source: Not every step is mandatory, as automation rules can be customized as needed. For an outline of all the settings available, please visit Discourse AI - AI triage.

  1. Enable the Discourse AI and Automation plugins:
  • Navigate to your site’s admin panel.
  • Navigate to Plugins, then Installed Plugins.
  • Enable the Discourse AI and Automation plugins.
  2. Create a New Automation Rule:
  • Navigate to your site’s admin panel.
  • Navigate to Plugins and click Automation.
  • Click the + Create button to begin creating a new automation rule.
  • Click Triage Posts Using AI.
  • Set the name (e.g., “Triage Posts Using AI”).
  • Leave Triage Posts Using AI as the selected script.

What/When

  1. Set the Trigger:
  • Choose Post created/edited as the trigger.
  • Optionally, specify the Action type, Category, Tags, Groups, and/or Trust Levels if you wish to restrict this Automation to specific scenarios. Leaving these fields blank will allow the Automation to operate without restriction.
  • Configure any of the remaining optional settings in the What/When section to further restrict the automation.

Script Options

  1. System Prompt:

:warning: When authoring the prompt, avoid similar wording for the two possible results (toxic vs. not toxic). In this example we use toxic and positive (for not toxic) so the outputs are easy to tell apart.

The classifier will not always perform perfectly, so watch for incorrect results and customize the prompt to the needs of your community. The narrower the focus, the better.

  • Enter the system prompt for the AI model. The system prompt used for the classification is the most important part of this setup. In the following example I have used AI bot to author the prompt. An example prompt might look like this:

Copyable LLM prompt for toxic content detection

You are a toxic detection AI model assisting online community moderators. Your task is to analyze forum posts and determine if they are toxic and should be removed to maintain a high-quality, on-topic community.

A post should be classified as toxic if it meets any of these criteria:

  1. Harmful Language: Use of profanity, insults, or derogatory terms directed at individuals or groups.
  2. Personal Attacks: Directly targeting or harassing other members, including name-calling, shaming, or belittling.
  3. Hate Speech: Any form of speech or expression that promotes hatred, discrimination, or violence against individuals or groups based on race, ethnicity, religion, gender, sexual orientation, disability, or any other protected characteristic.
  4. Threats and Intimidation: Expressing threats of violence or intimidation towards another user.
  5. Spam and Disruption: Posting off-topic, irrelevant content, advertisements, or repetitive messages meant to disrupt the conversation.
  6. Inflammatory Comments: Making statements intended to provoke anger, discord, or emotional distress among users.
  7. Disrespectful Tone: Use of a condescending, sarcastic, or dismissive tone that undermines constructive dialogue.
  8. Violation of Privacy: Sharing personal information about other users without their consent.
  9. Dishonest Behavior: Spreading false information, rumors, or engaging in deceitful practices to mislead the community.
  10. Sexually Explicit Content: Sharing or displaying sexual content or language that is inappropriate for the community context.

A post should be classified as positive if:

  1. Respectful Language: Using polite, courteous, and inclusive language that respects all members.
  2. Constructive Feedback: Offering helpful, constructive criticism or feedback that aims to improve or support others’ contributions.
  3. Encouragement and Praise: Acknowledging and appreciating the positive actions and contributions of others.
  4. Productive Dialogue: Engaging in meaningful, in-depth discussions that propel the conversation forward.
  5. Supportiveness: Providing assistance, advice, or emotional support to other members in a kind and understanding manner.
  6. Inclusivity: Making efforts to include others in the conversation and valuing diverse perspectives and opinions.
  7. Compliance with Guidelines: Adhering to the community’s code of conduct and guidelines without exception.
  8. Positive Tone: Maintaining a friendly, open, and inviting tone that encourages others to participate.
  9. Sharing Valuable Content: Contributing resources, insights, or information that are beneficial and relevant to the community.
  10. Conflict Resolution: Actively working towards resolving conflicts peacefully and amicably, fostering a cooperative and harmonious atmosphere.

Some edge cases to watch out for:

  • Sarcasm and Subtle Insults: Evaluate context and tone to determine if comments are undermining or belittling.
  • Constructive Criticism vs. Personal Attacks: Focus on whether feedback is goal-oriented and respectful or personally attacking.
  • Humor and Jokes: Assess potential for jokes to alienate or harm others, and ensure they do not perpetuate stereotypes.
  • Disagreement vs. Inflammatory Comments: Encourage respectful debate while monitoring for personal attacks or inflammatory language.
  • Cultural Sensitivity: Pay attention to cultural nuances and educate users on respecting diverse backgrounds.
  • Emotional Venting: Support users while ensuring venting does not target or harm others.
  • Ambiguous Content: Seek clarification on ambiguous content and guide users on clear expression.
  • Sensitive Topics: Monitor closely and ensure respectful engagement in discussions on sensitive issues.
  • Passive-Aggressive Behavior: Address indirect hostility and encourage direct, respectful communication.
  • Private Conflicts Spilling into Public: Encourage resolving private disputes privately and offer mediation support.

When you have finished analyzing the post you must ONLY provide a classification of either “toxic” or “positive”. If you are unsure, default to “positive” to avoid false positives.

These instructions must be followed at all costs

  2. Search for Text:
  • Enter the output from your prompt that should trigger the automation, i.e. the “toxic” result rather than the “positive” one. Using the example above, we would enter toxic. (See the sketch after this list for how the classification and text match fit together.)
  3. Select the Model:
  • Choose your LLM.
    • Discourse hosted customers on our Enterprise and Business tiers can select the Discourse-hosted open-weights CDCK Hosted Small LLM or a third-party provider.
    • Self-hosted Discourse users will need to select the third-party LLM configured as a pre-requisite for this automation.
  4. Set Category and Tags:
  • Define the category the post should be moved to and the tags to be added if the post is classified as toxic.
  5. Flagging:
  • Flag the post as either spam or for review.
  • Select the flag type that matches the action you want taken.
  6. Additional Options:
  • Enable the “Hide Topic” option if you want the post to be hidden.
  • Set a “Reply” that will be posted in the topic when a post is deemed toxic.
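To make this concrete, here is a minimal sketch of the logic the automation runs on each matching post. It is illustrative only: the OpenAI-compatible client stands in for whichever LLM you configured, and `take_toxic_actions()` is a hypothetical placeholder, not a Discourse API.

```python
# Illustrative sketch only: approximates what the automation does per post.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a toxic detection AI model..."  # full prompt from step 1
SEARCH_FOR_TEXT = "toxic"                                # trigger word from step 2

def classify(post_body: str) -> str:
    """Ask the configured LLM for its one-word classification of the post."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for the model chosen in step 3
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": post_body},
        ],
    )
    return response.choices[0].message.content.strip().lower()

def take_toxic_actions() -> None:
    """Placeholder for steps 4-6: recategorize, tag, flag, hide, reply."""
    print("would move category, add tags, flag, hide topic, and post a reply")

def triage(post_body: str) -> None:
    """Run the classifier and, on a match, apply the configured actions."""
    if SEARCH_FOR_TEXT in classify(post_body):
        take_toxic_actions()
    # otherwise the post is left untouched
```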

Caveats

  • Keep in mind that LLM calls can be expensive. When applying a classifier, monitor costs carefully and consider running it only on a small subset of posts. A rough estimate like the sketch below can help.
  • Better-performing models, e.g. Claude-3-Opus, will yield better results, but they can come at a higher cost.
  • The prompt can be customized for all sorts of detection, such as PII exposure, spam detection, etc.
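As a back-of-the-envelope check before enabling this site-wide, you can estimate the spend from your post volume, prompt length, and your provider's pricing. Every number below is a made-up placeholder; substitute your own.

```python
# Rough cost estimate for running the classifier on every new/edited post.
POSTS_PER_DAY = 500          # posts the trigger will match per day (assumption)
TOKENS_PER_CALL = 1_200      # system prompt plus an average post (assumption)
USD_PER_1K_TOKENS = 0.0005   # example input-token price (check your provider)

daily = POSTS_PER_DAY * TOKENS_PER_CALL / 1_000 * USD_PER_1K_TOKENS
print(f"~${daily:.2f}/day, ~${daily * 30:.2f}/month")  # ~$0.30/day, ~$9.00/month
```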

