Disorder - Automatic toxicity detection for your community

:discourse2: Summary Disorder helps moderators by automatically flagging potentially toxic content on your Discourse forum.
:hammer_and_wrench: Repository Link https://github.com/xfalcox/disorder
:open_book: Install Guide How to install plugins in Discourse

If you are using our official hosting service, please contact our support to register your interest for this plugin.

Toxicity

As @hawk said in Dealing with Toxicity in Online Communities, managing toxicity in your community is fundamental.

While Discourse ships out of the box with many different tools to help manage toxicity in your community, we are always researching ways to improve it further. In particular, I’ve been researching possible applications of Machine Learning and AI in online forums. The result is now an experimental plugin, available to all communities.

The plugin

Disorder leverages Artificial Intelligence and Machine Learning to help you moderate your community, making it easier for your moderation team to stay on top of potentially problematic content and, optionally, nudging your users to revise toxic posts before posting.

This is a first foray into using self-hosted ML models in Discourse, and while it’s a simple model, it sets a pattern that can be reused to apply more complex models down the road.

Features

Background Flagging

This is Disorder’s main mode of operation, as it’s completely transparent to your users, who will not be aware of any changes.

Whenever a new post (or chat message using Discourse Chat) is created, it is put in a classification queue and processed asynchronously. If the classification comes back above a configurable threshold, the post/chat message is flagged so your moderation team is warned about it and can make the final decision on the flag.

New post intervention

If you think that prevention is the best medicine, you may be interested in this more active option.

You can enable synchronous classification of any new post. If the post scores above a configurable toxicity threshold, this triggers an intervention in the new post flow, asking the user to review and amend a message that may fall outside the boundaries set by your community rules.

This will only happen once, and after closing the modal the user will be able to post normally.

How does it work?

This plugin integrates the open source models from Detoxify, using a remote API call model to allow admins to properly scale the inference rate to each community’s needs.

We provide a simple image that exposes a thin HTTP API, which Discourse calls to perform content classification. It can be run either on the same server where you run Discourse or on a different server altogether.

The Discourse plugin listens to new post / new chat message events and enqueues a classification job in the background queue. Results are stored in the database so you can extract reports, and we flag content using a separate bot user so we can track its flag accuracy over time.
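The flow described above can be sketched roughly as follows. This is an illustrative Python sketch, not the plugin's actual code (the plugin itself runs inside Discourse). The `Moderation` class, its method names, the 0 to 100 score scale, and the threshold value are all assumptions made for the example, and `classify` stands in for the HTTP call to the classification service.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Callable, Dict, List

Scores = Dict[str, float]  # label -> score, assumed to be on a 0-100 scale


@dataclass
class Moderation:
    # Stand-in for the remote classification call (hypothetical signature).
    classify: Callable[[str], Scores]
    # Hypothetical threshold, not the plugin default.
    threshold: float = 70.0
    jobs: Queue = field(default_factory=Queue)
    results: Dict[int, Scores] = field(default_factory=dict)
    flagged: List[int] = field(default_factory=list)

    def on_post_created(self, post_id: int, raw: str) -> None:
        # Posting is not blocked: the text is only queued for later classification.
        self.jobs.put((post_id, raw))

    def run_pending_jobs(self) -> None:
        # Background worker: classify, store the result, flag if above threshold.
        while not self.jobs.empty():
            post_id, raw = self.jobs.get()
            scores = self.classify(raw)
            self.results[post_id] = scores  # kept so reports can be built later
            if any(score > self.threshold for score in scores.values()):
                self.flagged.append(post_id)  # moderators make the final decision
```

Because the user-facing step only enqueues work, a slow or unavailable classifier in this sketch delays flags but never delays posting.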

Options

First, the plugin works out of the box, so it’s not necessary to change any settings right away. However, if you want to change the plugin’s behavior, there are a few knobs you can use.

We provide 3 different classification models that you can pick in the plugin options:

  • unbiased (default): A model that tries to reduce the unintended model bias in toxicity classification

  • multilingual: A model that can classify Italian, French, Russian, Portuguese, Spanish and Turkish.

  • original: The simplest model.

You can also tweak whether the plugin will:

  • automatically flag
  • enable sync intervention on toxic posts with warning (experimental)
  • enable sync intervention on toxic posts (not recommended)

All the above only happen when the comment is classified to be above the thresholds for each classification type:

  • toxicity
  • severe_toxicity
  • identity_attack
  • insult
  • threat
  • sexual_explicit

You can tweak each of the classification thresholds for automatic actions.
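As a rough illustration of per-label thresholds, here is a minimal Python sketch. It assumes scores and thresholds share a 0 to 100 scale and that exceeding any single label's threshold is enough to trigger an action. Both are assumptions for the example rather than confirmed plugin behavior, and the function names are hypothetical.

```python
from typing import Dict, List

# The classification types listed above.
LABELS = [
    "toxicity",
    "severe_toxicity",
    "identity_attack",
    "insult",
    "threat",
    "sexual_explicit",
]


def labels_over_threshold(
    scores: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Return the labels whose score is strictly above that label's threshold."""
    return [
        label
        for label in LABELS
        if label in scores
        and label in thresholds
        and scores[label] > thresholds[label]
    ]


def should_act(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Assumption: one label over its threshold is enough to act."""
    return bool(labels_over_threshold(scores, thresholds))
```

For example, with every threshold set to 85, scores of `{"toxicity": 91, "insult": 88, "threat": 3}` would exceed the `toxicity` and `insult` thresholds but not `threat`.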

Classification Service

The plugin comes pre-configured and works out of the box. To do that, it contacts a service run by Discourse (CDCK) to classify the user content. That classifier API service is open source, and you can run your own copy of the service if necessary.

30 Likes

Just out of curiosity, what are the differences between “Disorder” and Discourse’s implementation of the Google Perspective API?

6 Likes

Code wise, they are completely different plugins.

From a 10,000-foot view they cover the same need, but they are engineered differently:

  • Disorder works with chat and posts, Perspective only with posts

  • Perspective relies on a proprietary, third-party API, with all the privacy, reliability and transparency implications of that.

  • Disorder sets a pattern that allows the addition of new models easily, so we can evolve the service or even add brand new features

  • Disorder’s self-hostable API gives flexibility and freedom from pay-per-call pricing and rate limits.

  • Disorder’s front-end surface is much smaller, so it should be more resilient across Discourse updates.

13 Likes

Cool. Where and how do we do that?

4 Likes

Email team@discourse.org :slight_smile:

5 Likes

No need to reply here, but if you’re looking for suggestions about where to go next, an AI tag suggester based on a topic’s text could be useful. I’m imagining something similar to how Soundcloud suggests musical genre tags after running an analysis on an upload. It’s useful for organizing user-generated content on a busy site.

6 Likes

Do I understand correctly that a Disorder API instance should be launched to accompany the plugin? There is a setting, disorder inference service api endpoint, pre-filled with https://disorder-testing.demo-by-discourse.com. Yet the disorder inference service api key setting is empty by default.

We are interested in giving this plugin a test, as we face a lot of toxic behavior between users. It eventually gets resolved by flagging and the help of Leaders, yet we would like to proactively prevent users from spreading negative posts if possible, and this plugin seems to fit such a role.

Can we use any ready endpoint to give it a try? Fair warning, we have ~150k page views daily and it might clog up some unprepared servers.

We are standalone.

1 Like

While you can run your own API server, the plugin comes pre-configured pointing to https://disorder-testing.demo-by-discourse.com/ so it works out of the box.

Please do use this endpoint for your instance, as it’s provided exactly for your use case of self-hosted instances wanting to give this plugin a try. In the default configuration, all the API calls happen in the background, so the API being down won’t impact your site in any way and it’s safe to use.

The api key setting is optional, and only needed if your API server has it enabled. The public instance at https://disorder-testing.demo-by-discourse.com/ doesn’t have it enabled.

3 Likes

Thank you! Sounds perfect and will give it a try in upcoming days :heart:

3 Likes

Are there other ML applications planned for the future?

1 Like

I tried this for a week, and it was absurdly aggressive at flagging posts. I recommend using this only if you have a huge site without enough mods. Hope the AI gets better, but it’s just not there yet.

3 Likes

This is great feedback! Would you be willing to share some debugging stats to help me understand exactly what went down?

Something like the result of

SELECT
  pcf.value,
  p.raw
FROM
  post_custom_fields AS pcf
INNER JOIN
  posts AS p ON p.id = pcf.post_id
WHERE
  pcf.name = 'disorder'

here or in a PM would be immensely helpful.

2 Likes

Ahh yes, I forgot all about that! Here you go. There really weren’t that many, but they were just unnecessary and members and mods found them annoying. I also am unsure about it scanning DMs, I know there could be value there if someone is harassing someone via DM, but most of the time it’s just going to trigger people knowing that we’re looking at their DMs

1 Like

Do you use chat? Were all the annoying flags in posts / PMs ?

We do use chat, but I’m pretty sure all the flags were in posts and PMs.

1 Like

First of all, I’m very grateful for both the feedback and the data you shared that allowed me to debug this further.

Now to my findings!

During this week, you had 1942 new posts from non-staff users. Quite an active community! However, I would not say that the AI is “absurdly aggressive at flagging posts”, as only 7 posts were flagged.

That said, of those 7, half are clearly false positives triggered by default thresholds that are too low, the other half are cases where it is trickier for the AI to understand the context (calling your interlocutor a jerk vs telling a story about how someone was a jerk to you today while you were shopping), and one is, IMO, a correct hit.

If you are willing to give it another try, moving all the thresholds to 85 and moving to the original model may solve almost all trigger-happy flagging issues you had so far. I’ll add a site setting to allow skipping PMs as I can see how that can be annoying for some communities too.

4 Likes

Thanks Falco, I apologize for saying it was absurdly aggressive. I had a lot of drama happening on the site already and the flagging just added to that and I was quite annoyed at the time.

I appreciate the suggestions and will give it another try. Question, what happens when you disable disorder flag automatically? Will I still be notified somehow if a post is deemed disorderly? This would be nice to test it out and figure out what settings work without having posts flagged.

1 Like

Without that setting it will run the posts against the AI but won’t take any actions. You can leave it like that and then run that Data Explorer query to do some analysis of the false positive/false negative rates.

There is also another setting that allows you to add groups to a skip list, where you could, for example, skip posts from TL4/3 from being classified. That may also help.
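For the false positive/false negative analysis mentioned above, a minimal Python sketch could look like the following. It assumes you have exported one score per post (on a 0 to 100 scale) and added your own manual judgment of whether each post was actually toxic. The input shape and the `error_rates` function are hypothetical and not something the plugin provides.

```python
from typing import Dict, Iterable, Tuple


def error_rates(
    rows: Iterable[Tuple[float, bool]], threshold: float
) -> Dict[str, float]:
    """rows holds (score, is_actually_toxic) pairs; the bool is a manual judgment."""
    tp = fp = tn = fn = 0
    for score, is_toxic in rows:
        would_flag = score > threshold
        if would_flag and is_toxic:
            tp += 1
        elif would_flag and not is_toxic:
            fp += 1
        elif not would_flag and is_toxic:
            fn += 1
        else:
            tn += 1
    return {
        # Share of harmless posts that would have been flagged.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of toxic posts that would have been missed.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

Running this for a few candidate thresholds on the same data shows the trade-off: raising the threshold tends to lower the false positive rate and raise the false negative rate.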