A production-grade toxic-comment filter for real-time gaming chat moderation. It fine-tunes DistilBERT on 50,000 stratified samples for multi-label classification across six toxicity categories (toxic, severe toxic, obscene, threat, insult, identity hate), with severity scoring and automated action routing.
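As a rough illustration of how multi-label scores could feed severity scoring and action routing, here is a minimal sketch. The severity weights, the 0.5 threshold, the score cut-offs, and the action names are all hypothetical and are not taken from this project; only the six label names come from the results below.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical severity weights per label (assumption, not from the project).
SEVERITY_WEIGHT = {
    "toxic": 1.0, "severe_toxic": 3.0, "obscene": 1.0,
    "threat": 3.0, "insult": 1.0, "identity_hate": 2.0,
}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(logits, threshold=0.5):
    """Multi-label: each label gets its own sigmoid, so several can fire at once."""
    probs = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    flagged = [label for label, p in probs.items() if p >= threshold]
    return probs, flagged

def severity(probs, flagged):
    """Largest weight * probability among flagged labels; 0.0 if nothing is flagged."""
    return max((SEVERITY_WEIGHT[l] * probs[l] for l in flagged), default=0.0)

def route(score):
    """Map a severity score to a moderation action (hypothetical cut-offs)."""
    if score == 0.0:
        return "allow"
    if score < 1.0:
        return "flag_for_review"
    if score < 2.0:
        return "hide_message"
    return "mute_user"

# Example: a message scored as toxic and insulting at the same time.
probs, flagged = classify([2.0, -3.0, -3.0, -3.0, 1.0, -3.0])
action = route(severity(probs, flagged))
```

Using one sigmoid per label rather than a softmax over all six is what lets co-occurring labels such as toxic and insult be reported together.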
Online gaming platforms host millions of daily active users chatting in real time. Manual moderation is too slow and too expensive at that scale, so the platform needs automated moderation that classifies each message in under 100 ms.
The main challenge was extreme class imbalance (89.8% clean comments, only 1.0% severe toxic), compounded by linguistic complexity: obfuscation ("f*ck"), sarcasm, evolving slang, and multi-label co-occurrence (toxic and insult often appear together).
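One common way to counter this kind of imbalance is to up-weight the positive class in the loss. The sketch below is an assumption about how that could look, not a description of what this project did; it follows the same idea as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`, written in plain Python so the arithmetic is visible. Only the 1.0% severe-toxic rate comes from the text above.

```python
import math

def pos_weight(positive_rate):
    """Ratio of negatives to positives for a single label."""
    return (1.0 - positive_rate) / positive_rate

def weighted_bce(logit, target, weight):
    """Binary cross-entropy on one logit, with the positive term scaled by `weight`."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(weight * target * math.log(p) + (1 - target) * math.log(1.0 - p))

# Severe toxic makes up about 1.0% of comments, so negatives outnumber
# positives roughly 99 to 1.
w_severe = pos_weight(0.010)

# With an uninformative logit of 0, a missed positive costs about 99 times
# as much as a misjudged negative.
loss_missed_positive = weighted_bce(0.0, 1, w_severe)
loss_negative = weighted_bce(0.0, 0, w_severe)
```

Weighting of this kind tends to trade precision for recall on rare labels, so the weight would normally be tuned against the precision target rather than set directly from the class ratio.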
| Requirement | Target | Achieved | Status |
|---|---|---|---|
| Severe Toxic Precision | ≥ 80% | 63.2% | △ Approaching |
| False Positive Rate | ≤ 5% | 0.9% | ✓ Exceeded |
| Toxic Recall | ≥ 60% | 75.6% | ✓ Exceeded |
| Multi-label Capability | Required | ✓ | ✓ Met |
| Inference Latency | < 100 ms | ~50 ms | ✓ Met |

| Label | Precision | Recall | F1 | ROC-AUC | vs Baseline |
|---|---|---|---|---|---|
| Toxic | 0.78 | 0.82 | 0.80 | 0.971 | +12.7% |
| Severe Toxic | 0.63 | 0.45 | 0.53 | 0.982 | +43.2% |
| Obscene | 0.85 | 0.79 | 0.82 | 0.986 | +6.5% |
| Threat | 0.42 | 0.18 | 0.25 | 0.979 | +47.1% |
| Insult | 0.76 | 0.71 | 0.73 | 0.977 | +10.6% |
| Identity Hate | 0.58 | 0.38 | 0.46 | 0.981 | +24.3% |

Overall macro F1: 0.557 (+11.2% over the TF-IDF baseline). All ROC-AUC scores exceed 0.97.
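For readers who want to check how the columns relate, the sketch below defines the metrics named above from confusion-matrix counts and recomputes each label's F1 as the harmonic mean of its precision and recall. The precision and recall pairs are the rounded values from the table; the function names and the confusion counts in the final example are hypothetical.

```python
def precision(tp, fp):
    """Share of flagged comments that were truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Share of truly positive comments that were flagged."""
    return tp / (tp + fn) if tp + fn else 0.0

def false_positive_rate(fp, tn):
    """Share of clean comments that were wrongly flagged."""
    return fp / (fp + tn) if fp + tn else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# (precision, recall) per label, as rounded in the table above.
TABLE = {
    "Toxic": (0.78, 0.82),
    "Severe Toxic": (0.63, 0.45),
    "Obscene": (0.85, 0.79),
    "Threat": (0.42, 0.18),
    "Insult": (0.76, 0.71),
    "Identity Hate": (0.58, 0.38),
}

per_label_f1 = {label: f1(p, r) for label, (p, r) in TABLE.items()}

# Hypothetical counts, for illustration only: 9 false positives among
# 1,000 clean comments gives a false positive rate of 0.9%.
example_fpr = false_positive_rate(fp=9, tn=991)
```

Because the inputs are rounded to two decimals, recomputed F1 values can differ from the table in the last digit.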