We integrated [Detoxify](https://github.com/unitaryai/detoxify) models, trained to predict toxic comments on all three Jigsaw Toxic Comment Challenges, to classify post toxicity and automatically flag posts that score above a configurable threshold.
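For reference, here is roughly what that classification step looks like using Detoxify's public Python API. This is a minimal sketch, not the plugin's actual code: the threshold value and the `should_flag` helper are illustrative.

```python
from detoxify import Detoxify

# "original" is the checkpoint trained on the first Jigsaw challenge;
# "unbiased" and "multilingual" cover the later two.
model = Detoxify("original")

FLAG_THRESHOLD = 0.8  # configurable per instance (illustrative default)

def should_flag(post_text: str) -> bool:
    # predict() returns a dict of per-label probabilities in [0, 1],
    # e.g. toxicity, severe_toxicity, obscene, threat, insult, ...
    scores = model.predict(post_text)
    return max(scores.values()) >= FLAG_THRESHOLD

print(should_flag("You are all wonderful people!"))  # False
```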
What we found is that while this works great for instances with zero tolerance for typical toxicity, such as brand-owned instances, the models were too strict for more community-oriented Discourse instances, generating too many flags in those more lenient communities.
Because of that, our current plan is to deprecate Toxicity and move this feature into our AI Triage plugin, which gives admins a customizable prompt so they can adapt automatic toxicity detection to the levels of what is allowed in their instance.
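To sketch the idea (this is not the plugin's implementation), an admin-editable prompt can drive the moderation decision directly. The example below assumes an OpenAI-compatible chat API and uses a hypothetical prompt for a community where heated language is tolerated:

```python
from openai import OpenAI

client = OpenAI()

# Admins tune this text to their community's norms instead of relying on
# a fixed classifier threshold (hypothetical example wording).
TRIAGE_PROMPT = """You are moderating a gaming community where trash talk
between players is acceptable, but personal attacks, slurs, and threats
are not. Reply with exactly one word: "flag" or "ok"."""

def triage(post_text: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": TRIAGE_PROMPT},
            {"role": "user", "content": post_text},
        ],
    )
    return response.choices[0].message.content.strip().lower() == "flag"
```

A stricter, brand-owned instance would simply swap in a less permissive prompt, which is the flexibility the fixed Detoxify threshold could not offer.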
We also plan on offering our customers a hosted moderation LLM, along the lines of [ShieldGemma](https://ai.google.dev/gemma/docs/shieldgemma) or [Llama Guard](https://arxiv.org/abs/2312.06674), both of which performed very well in our internal evals against the same dataset used in the original Jigsaw Kaggle competition that spawned Detoxify.
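For the curious, this is roughly how a Llama Guard-style safeguard model is queried through 🤗 Transformers, following its public model card; the interface of our hosted offering may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # gated model; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def moderate(post_text: str) -> str:
    # Llama Guard's chat template wraps the conversation in its safety
    # taxonomy prompt; the model replies "safe", or "unsafe" followed by
    # the violated category.
    chat = [{"role": "user", "content": post_text}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(moderate("How do I make a great community?"))  # "safe"
```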