Discourse AI - Toxicity

:bookmark: This topic covers the configuration of the Toxicity feature of the Discourse AI plugin.

:person_raising_hand: Required user level: Administrator

The Toxicity module can automatically assign a toxicity score to every new post and chat message in your Discourse instance. You can also enable automatic flagging of content that crosses a configurable threshold.

Classifications are stored in the database, so as soon as you enable the plugin you can use Data Explorer to report on the classifications made for new content in Discourse. We will soon ship some default Data Explorer queries with the plugin to make this easier.

Settings

  • ai_toxicity_enabled: Enables or disables the module

  • ai_toxicity_inference_service_api_endpoint: URL where the API for the toxicity module is running. If you are using CDCK hosting, this is handled automatically for you. If you are self-hosting, check the self-hosting guide.

  • ai_toxicity_inference_service_api_key: API key for the toxicity API configured above. If you are using CDCK hosting, this is handled automatically for you. If you are self-hosting, check the self-hosting guide.

  • ai_toxicity_inference_service_api_model: We offer three different models: original, unbiased, and multilingual. unbiased is recommended over original because it tries not to carry biases introduced by the training material into the classification. For multilingual communities, the multilingual model supports Italian, French, Russian, Portuguese, Spanish, and Turkish.

  • ai_toxicity_flag_automatically: Automatically flag posts/chat messages when the classification for a specific category surpasses the configured threshold. Available categories are toxicity, severe_toxicity, obscene, identity_attack, insult, threat, and sexual_explicit. There’s an ai_toxicity_flag_threshold_${category} setting for each one; see the sketch after this list for how those thresholds are applied.

  • ai_toxicity_groups_bypass: Users in these groups will not have their posts classified by the toxicity module. By default this includes staff users.
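
To make the flagging behaviour above concrete, here is a minimal Python sketch of the idea (not the plugin’s actual code): the category names match the settings list, while the scoring scale, threshold values, and the should_flag helper are illustrative assumptions.

```python
# Sketch of how per-category thresholds could drive automatic flagging.
CATEGORIES = [
    "toxicity", "severe_toxicity", "obscene",
    "identity_attack", "insult", "threat", "sexual_explicit",
]

# Illustrative stand-ins for the ai_toxicity_flag_threshold_${category}
# settings (scale and values are assumptions, not the plugin's defaults).
thresholds = {category: 80 for category in CATEGORIES}

def should_flag(classification: dict) -> bool:
    """Flag when any category score crosses its configured threshold."""
    return any(
        classification.get(category, 0) >= thresholds[category]
        for category in CATEGORIES
    )

print(should_flag({"toxicity": 45, "insult": 91}))  # True  (insult >= 80)
print(should_flag({"toxicity": 12, "insult": 3}))   # False
```

Tuning the threshold per category lets you flag, say, threats aggressively while staying more lenient on mild insults.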

Additional resources

Last edited by @hugh 2024-08-06T05:37:39Z

Last checked by @hugh 2024-08-06T05:37:44Z

10 Likes

Tuning this a bit right now, am I correct in assuming that a higher threshold is more stringent and a lower one more lenient?

1 Like

I would say the higher the threshold, the more lenient it would be. A lower threshold is more apt to flag a post as toxic, since it takes less to trigger a flag; a higher threshold requires more to trigger one.
Low threshold = easy to cross
High threshold = harder to cross
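
A quick worked example of that point, with made-up numbers (the 0–100 scale is an assumption for illustration):

```python
# Hypothetical toxicity score for a borderline post (0-100 scale assumed).
score = 75

print(score >= 60)  # True  -> a lower threshold flags it (stricter)
print(score >= 90)  # False -> a higher threshold lets it through (more lenient)
```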

2 Likes

I want to have a mechanism to catch attempts at commercial activity on our site - not toxicity per se, but very damaging to our community.

This is close, but not quite the thing we are looking for.

Have you considered this dimension?

That’s covered by Discourse AI Post Classifier - Automation rule. Let me know how it goes.

4 Likes

Can someone help me set it up with the Google Perspective API? I’d put an ad in the marketplace, but I think here is more appropriate.

I know this was a year ago, but please let me know how this implementation went! I am personally invested in it ^^ That said, please correct me if I’m wrong @Discourse, but the attributes you mention on this page ARE Perspective’s atomic metrics, as implemented through Detoxify, so adding Perspective is a bit of a moot point, right?

  • ai_toxicity_flag_automatically: Automatically flag posts/chat messages when the classification for a specific category surpasses the configured threshold. Available categories are toxicity, severe_toxicity, obscene, identity_attack, insult, threat, and sexual_explicit. There’s an ai_toxicity_flag_threshold_${category} setting for each one.

Regardless, Detoxify can be implemented by the Kaggle community. That’s a great place to find someone to implement it, because that’s precisely what Kaggle does :slight_smile:

2 Likes

We integrated the Detoxify models (GitHub - unitaryai/detoxify: trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges, built using ⚡ Pytorch Lightning and 🤗 Transformers) to handle automatic toxicity classification of posts and to flag them automatically when a score crosses a configurable threshold.
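
For anyone who wants to reproduce these classifications outside of Discourse, the Detoxify models are available as a Python package; a minimal sketch, assuming `pip install detoxify`:

```python
# pip install detoxify
from detoxify import Detoxify

# "original", "unbiased" and "multilingual" mirror the
# ai_toxicity_inference_service_api_model options documented above.
model = Detoxify("unbiased")

scores = model.predict("example text to classify")
# predict() returns a dict of category -> probability between 0 and 1,
# covering toxicity, severe_toxicity, obscene, identity_attack, insult,
# threat and sexual_explicit for the unbiased checkpoint.
for category, score in scores.items():
    print(f"{category}: {score:.3f}")
```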

What we found is that while it works well if you have zero tolerance for typical toxicity on your instance, as more “brand”-owned instances tend to, for more community-oriented Discourse instances the toxicity models were too strict, generating too many flags.

Because of that, our current plan is to deprecate Toxicity and move this feature to our AI Triage plugin, where we give admins a customizable prompt so they can adapt automatic toxicity detection to what is allowed in their instance.

We also plan on offering our customers a hosted moderation LLM, along the lines of ShieldGemma | Google AI for Developers or [2312.06674] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, which performed very well in our internal evals against the same dataset used in the original Jigsaw Kaggle competition that spawned Detoxify.

4 Likes