Discourse AI - Toxicity

:bookmark: This topic covers the configuration of the Toxicity feature of the Discourse AI plugin.

:person_raising_hand: Required user level: Administrator

The Toxicity module can automatically assign a toxicity score to every new post and chat message in your Discourse instance. You can also enable automatic flagging of content that crosses a configurable threshold.

Classifications are stored in the database, so as soon as you enable the plugin you can use Data Explorer to report on the classifications made for new content in Discourse. We will soon ship some default Data Explorer queries with the plugin to make this easier.

Settings

  • ai_toxicity_enabled: Enables or disables the module

  • ai_toxicity_inference_service_api_endpoint: URL where the API for the toxicity module is running. If you are using CDCK hosting, this is handled automatically for you. If you are self-hosting, check the self-hosting guide.

  • ai_toxicity_inference_service_api_key: API key for the toxicity API configured above. If you are using CDCK hosting, this is handled automatically for you. If you are self-hosting, check the self-hosting guide.

  • ai_toxicity_inference_service_api_model: We offer three different models: original, unbiased, and multilingual. unbiased is recommended over original because it tries not to carry biases introduced by the training material into the classification. For multilingual communities, the multilingual model supports Italian, French, Russian, Portuguese, Spanish, and Turkish.

  • ai_toxicity_flag_automatically: Automatically flag posts/chat messages when the classification for a specific category surpasses the configured threshold. Available categories are toxicity, severe_toxicity, obscene, identity_attack, insult, threat, and sexual_explicit. There’s an ai_toxicity_flag_threshold_${category} setting for each one; see the sketch after this list for how those thresholds are applied.

  • ai_toxicity_groups_bypass: Users in these groups will not have their posts classified by the toxicity module. By default this includes staff users.
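
To make the flagging behaviour above concrete, here is a minimal Python sketch of the idea (not the plugin’s actual code): the category names match the settings list, while the scoring scale, threshold values, and the should_flag helper are illustrative assumptions.

```python
# Sketch of how per-category thresholds could drive automatic flagging.
CATEGORIES = [
    "toxicity", "severe_toxicity", "obscene",
    "identity_attack", "insult", "threat", "sexual_explicit",
]

# Illustrative stand-ins for the ai_toxicity_flag_threshold_${category}
# settings (scale and values are assumptions, not the plugin's defaults).
thresholds = {category: 80 for category in CATEGORIES}

def should_flag(classification: dict) -> bool:
    """Flag when any category score crosses its configured threshold."""
    return any(
        classification.get(category, 0) >= thresholds[category]
        for category in CATEGORIES
    )

print(should_flag({"toxicity": 45, "insult": 91}))  # True  (insult >= 80)
print(should_flag({"toxicity": 12, "insult": 3}))   # False
```

Tuning the threshold per category lets you flag, say, threats aggressively while staying more lenient on mild insults.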

Additional resources

Last edited by @hugh 2024-08-06T05:37:39Z

Last checked by @hugh 2024-08-06T05:37:44Z

10 Likes

Tuning this a bit right now, am I correct in assuming that a higher threshold is more stringent and a lower one more lenient?

1 Like

I would say the higher the threshold, the more lenient it would be. A lower threshold is more apt to flag a post as toxic, since it takes less to trigger a flag; a higher threshold requires more to trigger one.
Low threshold = easy to cross
High threshold = harder to cross
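
A quick worked example of that point, with made-up numbers (the 0–100 scale is an assumption for illustration):

```python
# Hypothetical toxicity score for a borderline post (0-100 scale assumed).
score = 75

print(score >= 60)  # True  -> a lower threshold flags it (stricter)
print(score >= 90)  # False -> a higher threshold lets it through (more lenient)
```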

2 Likes

I want to have a mechanism to catch attempts at commercial activity on our site - not toxicity per se, but very damaging to our community.

This is close, but not quite the thing we are looking for.

Have you considered this dimension?

That’s covered by Discourse AI Post Classifier - Automation rule. Let me know how it goes.

4 Likes

Can someone help me set it up with the Google Perspective API? I’d put an ad in the marketplace, but I think here is more appropriate.

I know this was a year ago, but please let me know how this implementation went! I am personally invested in it ^^ That said, please correct me if I’m wrong @Discourse, but the attributes you mention on this page ARE Perspective’s atomic metrics, as implemented through Detoxify, so adding Perspective is a bit of a moot point, right?

  • ai_toxicity_flag_automatically: Automatically flag posts/chat messages when the classification for a specific category surpasses the configured threshold. Available categories are toxicity, severe_toxicity, obscene, identity_attack, insult, threat, and sexual_explicit. There’s an ai_toxicity_flag_threshold_${category} setting for each one.

Regardless, Detoxify can be implemented by the Kaggle community. That’s a great place to find someone to implement it, because that’s precisely what Kaggle does :slight_smile:

2 Likes

We integrated the Detoxify models (GitHub - unitaryai/detoxify: trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges, built using ⚡ Pytorch Lightning and 🤗 Transformers) to handle automatic toxicity classification of posts and to flag them automatically when a score crosses a configurable threshold.
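
For anyone who wants to reproduce these classifications outside of Discourse, the Detoxify models are available as a Python package; a minimal sketch, assuming `pip install detoxify`:

```python
# pip install detoxify
from detoxify import Detoxify

# "original", "unbiased" and "multilingual" mirror the
# ai_toxicity_inference_service_api_model options documented above.
model = Detoxify("unbiased")

scores = model.predict("example text to classify")
# predict() returns a dict of category -> probability between 0 and 1,
# covering toxicity, severe_toxicity, obscene, identity_attack, insult,
# threat and sexual_explicit for the unbiased checkpoint.
for category, score in scores.items():
    print(f"{category}: {score:.3f}")
```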

What we found is that while it works well if you have zero tolerance for typical toxicity on your instance, as more “brand”-owned instances tend to, for more community-oriented Discourse instances the toxicity models were too strict, generating too many flags.

Because of that, our current plan is to deprecate Toxicity and move this feature to our AI Triage plugin, where we give admins a customizable prompt so they can adapt automatic toxicity detection to what is allowed in their instance.

We also plan on offering our customers a hosted moderation LLM, along the lines of ShieldGemma | Google AI for Developers or [2312.06674] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, which performed very well in our internal evals against the same dataset used in the original Jigsaw Kaggle competition that spawned Detoxify.

4 Likes