Disorder - Automatic toxicity detection for your community

First of all, I’m very grateful for both the feedback and the data you shared that allowed me to debug this further.

Now to my findings!

During this week, you had 1942 new posts from non-staff users. Quite an active community! However, I would not say that the AI is "absurdly aggressive at flagging posts", as only 7 of those posts were flagged.

That said, of those 7, half are clearly false positives triggered by default thresholds that are too low, the other half are cases where the context is trickier for the AI to understand (calling your interlocutor a jerk vs. telling a story about how someone was a jerk to you while you were shopping today), and one is, IMO, a correct hit.

If you are willing to give it another try, moving all the thresholds to 85 and switching to the original model may solve almost all of the trigger-happy flagging issues you've had so far. I'll also add a site setting to allow skipping PMs, as I can see how that can be annoying for some communities too.

8 Likes

Thanks Falco, I apologize for saying it was absurdly aggressive. I had a lot of drama happening on the site already, the flagging just added to that, and I was quite annoyed at the time.

I appreciate the suggestions and will give it another try. Question: what happens when I disable the "disorder flag automatically" setting? Will I still be notified somehow if a post is deemed disorderly? It would be nice to test it out and figure out which settings work without having posts flagged.

4 Likes

With that setting disabled, it will still run posts against the AI but won't take any action. You can leave it like that and then run that Data Explorer query to do some analysis of the false positive/false negative rates.

There is also another setting that lets you add groups to a skip list, so you could, for example, skip posts from TL3/TL4 users from being classified. That may also help.

Dear @Falco,

We started testing Disorder out. The overall feedback is positive: it really does detect inappropriate things, while also flagging a lot of things which our community accepts. Due to the nature of the forum where we are testing this plugin (adult), the communication involves several aspects which trigger Disorder to flag many, many posts. Your SQL query really does help with checking which thresholds to adjust, but may I suggest adding those scores to the Reviewable Scoring table for each flagged post?


I don't know if it's possible for a plugin to introduce its own data into this view, but it would help staff a lot to understand which criteria to adjust to reduce false positives for us. The way I see it is adding a dropdown with a per-criterion breakdown within this view. There is no need to include criteria equal to 0; those above 0 should be present, but only those which exceed the currently configured thresholds should be marked bold/red.

Disorder Scoring example
  • Toxicity 65% [1]
  • Insult 73% [2]
  • Threat 12% [3]
  • Sexual explicit 2% [4]

If needed, I can provide you with the SQL query results. We are far from finished reviewing the flag queue…
We are using the multilingual model and haven't tried the others. We decided it would be a good one to start with, considering we have some users who prefer posting in their original language.


  1. exceeding, red font ↩︎

  2. exceeding, red font ↩︎

  3. normal, normal font ↩︎

  4. normal, normal font ↩︎

1 Like

Hi again,

Wanted to let you know that we are getting errors in the logs related to Disorder when using the "original" model. I just switched it back to multilingual to see if that makes a difference.

Job exception: undefined method `>=' for nil:NilClass
@classification[label] >= SiteSetting.send("disorder_flag_threshold_#{label}")
                       ^^

Details

/var/www/discourse/plugins/disorder/lib/classifier.rb:39:in `block in consider_flagging’

/var/www/discourse/plugins/disorder/lib/classifier.rb:38:in `filter’

/var/www/discourse/plugins/disorder/lib/classifier.rb:38:in `consider_flagging’

/var/www/discourse/plugins/disorder/lib/classifier.rb:25:in `classify!’

/var/www/discourse/plugins/disorder/app/jobs/regular/classify_post.rb:14:in `execute’

/var/www/discourse/app/jobs/base.rb:249:in `block (2 levels) in perform’

rails_multisite-4.0.1/lib/rails_multisite/connection_management.rb:80:in `with_connection'

/var/www/discourse/app/jobs/base.rb:236:in `block in perform'

/var/www/discourse/app/jobs/base.rb:232:in `each’

/var/www/discourse/app/jobs/base.rb:232:in `perform’

sidekiq-6.5.8/lib/sidekiq/processor.rb:202:in `execute_job’

sidekiq-6.5.8/lib/sidekiq/processor.rb:170:in `block (2 levels) in process’

sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:177:in `block in invoke’

/var/www/discourse/lib/sidekiq/pausable.rb:134:in `call’

sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:179:in `block in invoke’

sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:182:in `invoke’

sidekiq-6.5.8/lib/sidekiq/processor.rb:169:in `block in process’

sidekiq-6.5.8/lib/sidekiq/processor.rb:136:in `block (6 levels) in dispatch’

sidekiq-6.5.8/lib/sidekiq/job_retry.rb:113:in `local’

sidekiq-6.5.8/lib/sidekiq/processor.rb:135:in `block (5 levels) in dispatch’

sidekiq-6.5.8/lib/sidekiq.rb:44:in `block in <module:Sidekiq>'

sidekiq-6.5.8/lib/sidekiq/processor.rb:131:in `block (4 levels) in dispatch’

sidekiq-6.5.8/lib/sidekiq/processor.rb:263:in `stats’

sidekiq-6.5.8/lib/sidekiq/processor.rb:126:in `block (3 levels) in dispatch’

sidekiq-6.5.8/lib/sidekiq/job_logger.rb:13:in `call’

sidekiq-6.5.8/lib/sidekiq/processor.rb:125:in `block (2 levels) in dispatch’

sidekiq-6.5.8/lib/sidekiq/job_retry.rb:80:in `global’

sidekiq-6.5.8/lib/sidekiq/processor.rb:124:in `block in dispatch’

sidekiq-6.5.8/lib/sidekiq/job_logger.rb:39:in `prepare’

sidekiq-6.5.8/lib/sidekiq/processor.rb:123:in `dispatch’

sidekiq-6.5.8/lib/sidekiq/processor.rb:168:in `process’

sidekiq-6.5.8/lib/sidekiq/processor.rb:78:in `process_one’

sidekiq-6.5.8/lib/sidekiq/processor.rb:68:in `run’

sidekiq-6.5.8/lib/sidekiq/component.rb:8:in `watchdog’

sidekiq-6.5.8/lib/sidekiq/component.rb:17:in `block in safe_thread’

Details 2
hostname
process_id 65460
application_version 2f8ad17aed81bbfa2fd20b6cc9210be92779bd74
current_db default
current_hostname
job Jobs::ClassifyPost
problem_db default
time 1:52 pm
opts
post_id 604063
current_site_id default

P.S. Yes, the multilingual model does not produce these errors. The unbiased model does not produce errors either.

1 Like

I have also modified your query to display the scoring in a more convenient way using Data Explorer.
Credits go to ChatGPT and to PostgreSQL clues from Leonardo:

SELECT
  json_extract_path_text(pcf.value::json, 'classification', 'toxicity') AS toxicity,
  json_extract_path_text(pcf.value::json, 'classification', 'severe_toxicity') AS severe_toxicity,
  json_extract_path_text(pcf.value::json, 'classification', 'obscene') AS obscene,
  json_extract_path_text(pcf.value::json, 'classification', 'identity_attack') AS identity_attack,
  json_extract_path_text(pcf.value::json, 'classification', 'insult') AS insult,
  json_extract_path_text(pcf.value::json, 'classification', 'threat') AS threat,
  json_extract_path_text(pcf.value::json, 'classification', 'sexual_explicit') AS sexual_explicit,
  json_extract_path_text(pcf.value::json, 'model') AS model,
  pcf.created_at,
  p.raw
FROM
  post_custom_fields AS pcf
INNER JOIN
  posts AS p ON p.id = pcf.post_id
INNER JOIN
  topics AS t ON t.id = p.topic_id
WHERE
  pcf.name = 'disorder' 
  AND t.archetype = 'regular'
ORDER BY created_at DESC
And this modification will return only the rows where any of the classification values is greater than 50 (or whatever threshold you set):
-- [params]
-- int :threshold = 50
SELECT DISTINCT ON (p.id, pcf.created_at)
  json_extract_path_text(pcf.value::json, 'classification', 'toxicity') AS toxicity,
  json_extract_path_text(pcf.value::json, 'classification', 'severe_toxicity') AS severe_toxicity,
  json_extract_path_text(pcf.value::json, 'classification', 'obscene') AS obscene,
  json_extract_path_text(pcf.value::json, 'classification', 'identity_attack') AS identity_attack,
  json_extract_path_text(pcf.value::json, 'classification', 'insult') AS insult,
  json_extract_path_text(pcf.value::json, 'classification', 'threat') AS threat,
  json_extract_path_text(pcf.value::json, 'classification', 'sexual_explicit') AS sexual_explicit,
  json_extract_path_text(pcf.value::json, 'model') AS model,
  p.id as post_id,
  pcf.created_at,
  p.raw
FROM
  post_custom_fields AS pcf
INNER JOIN
  posts AS p ON p.id = pcf.post_id
INNER JOIN
  topics AS t ON t.id = p.topic_id
WHERE
  pcf.name = 'disorder' 
  AND t.archetype = 'regular'
GROUP BY p.id, pcf.value, pcf.created_at
HAVING 
  CAST(json_extract_path_text(pcf.value::json, 'classification', 'toxicity') AS FLOAT) > :threshold 
  OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'severe_toxicity') AS FLOAT) > :threshold 
  OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'obscene') AS FLOAT) > :threshold 
  OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'identity_attack') AS FLOAT) > :threshold 
  OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'insult') AS FLOAT) > :threshold 
  OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'threat') AS FLOAT) > :threshold 
  OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'sexual_explicit') AS FLOAT) > :threshold
ORDER BY pcf.created_at DESC, p.id

You can also modify it by introducing several more parameters, so that you can report on a different threshold per classification label in Data Explorer; see the sketch below.
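For example, here is a minimal sketch of that idea with one parameter per label. The parameter names and default values are only placeholders I picked for illustration, so adjust them to match your own settings; the remaining labels (severe_toxicity, obscene, identity_attack, sexual_explicit) can be added the same way:

-- [params]
-- int :toxicity_threshold = 80
-- int :insult_threshold = 80
-- int :threat_threshold = 60
SELECT
  json_extract_path_text(pcf.value::json, 'classification', 'toxicity') AS toxicity,
  json_extract_path_text(pcf.value::json, 'classification', 'insult') AS insult,
  json_extract_path_text(pcf.value::json, 'classification', 'threat') AS threat,
  p.id AS post_id,
  pcf.created_at,
  p.raw
FROM
  post_custom_fields AS pcf
INNER JOIN
  posts AS p ON p.id = pcf.post_id
INNER JOIN
  topics AS t ON t.id = p.topic_id
WHERE
  pcf.name = 'disorder'
  AND t.archetype = 'regular'
  -- no aggregation here, so the per-label comparisons can live in WHERE instead of HAVING
  AND (
    CAST(json_extract_path_text(pcf.value::json, 'classification', 'toxicity') AS FLOAT) > :toxicity_threshold
    OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'insult') AS FLOAT) > :insult_threshold
    OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'threat') AS FLOAT) > :threat_threshold
  )
ORDER BY pcf.created_at DESC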

Please note: this will return Public posts only, without accessing private messages.

3 Likes

We are working on this exact feature right now!

We are also planning to use the false positive/negative rates to run an optimizer that can suggest the best thresholds for each option, so keep that information around, as it will be useful in the near future.

5 Likes

Sounds great. Glad to hear that.
So far, I tend to decline/ignore all the flags Disorderbot makes, even with thresholds raised up to the maximum of 90-100. But, due to the nature of the forum we're testing it on (NSFW), the AI is easily confused about whether the communication is really toxic or not. As long as it is not that reliable for our use case, we will continue using it, but we will use its reports only to "reinforce" other reports on really toxic posts.

As soon as we find better thresholds to use long-term, we will be able to enable precautionary warnings when a user tries to post something really toxic.

That's what I suspect will happen as AI becomes mainstream: it will enable censorship and limit the genuine questioning of the status quo that's necessary for the health of every community in the world.

Don't limit or ban; educate and discuss. Perhaps there is a way to use these tools without that side effect (though my concern is that it's the intended effect), but I see that it's not possible at the moment.

Thanks for your feedback, it is valuable to me. And of course, thanks to the team for keeping Discourse updated and improving, as always :slight_smile:

Setting all thresholds to 100 and relying only on the more extreme ones, like “severe toxicity” and “threat”, is something that I can see being adopted in communities like that.
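If you want a rough idea of the impact before committing to that, a variant of the Data Explorer queries above (just a sketch; the default of 60 is an arbitrary example) can count how many already-classified public posts would exceed those two labels alone:

-- [params]
-- int :threshold = 60
SELECT
  COUNT(*) AS would_be_flagged
FROM
  post_custom_fields AS pcf
INNER JOIN
  posts AS p ON p.id = pcf.post_id
INNER JOIN
  topics AS t ON t.id = p.topic_id
WHERE
  pcf.name = 'disorder'
  AND t.archetype = 'regular'
  AND (
    CAST(json_extract_path_text(pcf.value::json, 'classification', 'severe_toxicity') AS FLOAT) > :threshold
    OR CAST(json_extract_path_text(pcf.value::json, 'classification', 'threat') AS FLOAT) > :threshold
  )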

3 Likes

Thanks. It is currently set like this and it is still too sensitive. I will raise some thresholds even further and see how it goes.

1 Like

Would have to see the raw classifications, but I’d increase the insult one first too.

I'd better keep you away from reading those :smiley: They may be really NSFW, even in text form.
I've raised the first threshold to 100 too; we'll see how it goes now :smiley:

1 Like

I really hope future versions make it possible for Disorder not to check (or at least not to report on) private messages. We do not access them ourselves, and having an AI check private conversations feels highly unethical.

4 Likes

Yeah, that is the same thing @davidkingham asked for; we will put it on our roadmap.

3 Likes

…and English? :sweat_smile:

Also, I’m wondering to what degree this can replace Akismet. We’re at a 97% disagree rate on Akismet’s flags right now. It seems to simply react to posts with a lot of digits in them, so if you’re posting job logs, where every line starts with a timestamp…

1 Like

The arms race between spam and spam detection just went nuclear with the advent of widely available LLMs. We are hard at work on features using a wide range of models, and while spam isn't our priority right now, it's something we will investigate.

4 Likes

Okay, so: I turned it on. How do I know it’s working?

Other than turning the thresholds down really low to catch everything, I mean.

Is there a diagnostic mode or log where I can see what a given post has scored?

2 Likes

The easiest way is to provoke it by posting something insulting. Make sure your user's group is not on the skip list in the plugin settings.

The better way is to query Data Explorer. Please refer to one of my queries earlier in this topic.
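Or, for a quick spot check of a single post you already suspect, a stripped-down sketch like this simply returns the raw JSON that Disorder stored for one post ID (the :post_id parameter is something you fill in at run time):

-- [params]
-- int :post_id
SELECT
  pcf.value AS disorder_classification,
  pcf.created_at
FROM
  post_custom_fields AS pcf
WHERE
  pcf.name = 'disorder'
  AND pcf.post_id = :post_id
ORDER BY pcf.created_at DESC

If it returns no rows, the post most likely hasn't been classified yet.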

1 Like

Thanks. That’s returning 0s across the board for all posts so far… is that to be expected?

1 Like