This topic covers the configuration of the Toxicity feature of the Discourse AI plugin.
Required user level: Administrator
The Toxicity modules can automatically classify the toxicity score of every new post and chat message in your Discourse instance. You can also enable automatic flagging of content that crosses a threshold.
Classifications are stored in the database, so you can enable the plugin and use Data Explorer for reports of the classification happening for new content in Discourse immediately. We will soon ship some default Data Explorer queries with the plugin to make this easier.
Settings
ai_toxicity_enabled: Enables or disables the module
ai_toxicity_inference_service_api_endpoint: URL where the API is running for the toxicity module. If you are using CDCK hosting this is automatically handled for you. If you are self-hosting check the self-hosting guide.
ai_toxicity_inference_service_api_key: API key for the toxicity API configured above. If you are using CDCK hosting this is automatically handled for you. If you are self-hosting check the self-hosting guide.
ai_toxicity_inference_service_api_model: ai_toxicity_inference_service_api_model: We offer three different models: original, unbiased, and multilingual. unbiased is recommended over original because it’ll try not to carry over biases introduced by the training material into the classification. For multilingual communities, the last model supports Italian, French, Russian, Portuguese, Spanish, and Turkish.
ai_toxicity_flag_automatically: Automatically flag posts/chat messages when the classification for a specific category surpasses the configured threshold. Available categories are toxicity, severe_toxicity, obscene, identity_attack, insult, threat, and sexual_explicit. There’s an ai_toxicity_flag_threshold_${category} setting for each one.
ai_toxicity_groups_bypass: Users on those groups will not have their posts classified by the toxicity module. By default includes staff users.
I would say the higher the threshold, the more lenient it would be. A lower threshold would be more apt to flag a post as being toxic since it would take less to trigger a flag, thus a higher threshold would require more to trigger a flag.
Low threshold = easy to cross
High threshold = harder to cross
I know this was a year ago but please let me know how this implementation went! I am personally vested in it ^^ That said, please correct me if I’m wrong @Discourse, but the attributes you mention on this page ARE Perspective’s atomic metrics, as implemented through Detoxify so adding Perspective is a bit of a moot point right?
ai_toxicity_flag_automatically: Automatically flag posts/chat messages when the classification for a specific category surpasses the configured threshold. Available categories are toxicity, severe_toxicity, obscene, identity_attack, insult, threat, and sexual_explicit. There’s an ai_toxicity_flag_threshold_${category} setting for each one.
Regardless, Detoxify can be implemented by the Kaggle community community. That’s a great place to find someone to implement it because that’s precisely what Kaggle does
What we found, is that while it works great if you have a zero tolerance for typical toxicity on your instances, like what more “brand” owned instance are, for other more community oriented Discourse instances, the toxicity models were too strict, generating too much flags in more lenient instances.
Because of that our current plan is to Depreate Toxicity and move this feature to our AI Triage plugin, where we give a customizable prompt for admins to adapt their automatic Toxicity detection to the levels of what are allowed in their instance.