NLP · Deep Learning · Multi-Label · Project 02

Toxic Comment Classifier

A production-grade toxic comment filtering system for real-time gaming chat moderation. Fine-tuned DistilBERT on 50,000 stratified samples to perform multi-label classification across six toxicity categories with severity scoring and automated action routing.

0.557 Macro F1
0.9% False Positive Rate
75.6% Toxic Recall
63.2% Severe Toxic Precision
50K Samples
<100ms Inference
Problem Statement

Online gaming platforms host millions of daily active users who chat in real time. Manual moderation is prohibitively expensive and slow, so platforms need automated moderation that classifies toxicity with under 100ms latency.

The challenge was extreme class imbalance (89.8% clean comments, only 1.0% severe toxic) combined with linguistic complexity: obfuscation ("f*ck"), sarcasm, evolving slang, and multi-label co-occurrence (toxic and insult often appear together).

Business Requirements Status
Requirement | Target | Achieved | Status
Severe Toxic Precision | ≥ 80% | 63.2% | △ Approaching
False Positive Rate | ≤ 5% | 0.9% | ✓ Exceeded
Toxic Recall | ≥ 60% | 75.6% | ✓ Exceeded
Multi-label Capability | Required | | ✓ Met
Inference Latency | < 100ms | ~50ms | ✓ Met
Approach & Methodology
1. Dataset & Stratified Sampling
Used the Jigsaw Toxic Comment Classification Challenge dataset (159,571 Wikipedia comments). Took a stratified subset of 50,000 samples that preserves the rare-class distribution (severe toxic: 1.02% vs 1.00% in the original). Train/Val/Test split: 50K / 7.5K / 2K.
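A minimal sketch of stratified sampling for multi-label data, using only the standard library. The source does not say how the strata were defined, so grouping by the full label combination is an assumption, as are the function name `stratified_sample`, the dict-per-row layout, and the underscore-style label keys. The returned sample size is approximate because each stratum is rounded independently.

```python
import random
from collections import defaultdict

# Assumed column names for the six labels.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def stratified_sample(rows, n_samples, seed=42):
    """Sample about n_samples rows, keeping each label combination's share intact."""
    rng = random.Random(seed)

    # Assumption: one stratum per distinct combination of the six labels.
    strata = defaultdict(list)
    for row in rows:
        key = tuple(row[label] for label in LABELS)
        strata[key].append(row)

    fraction = n_samples / len(rows)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so rare combinations are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, min(k, len(members))))

    rng.shuffle(sample)
    return sample
```

Comparing the share of each label before and after sampling is a quick way to confirm that the rare classes kept their proportion.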
2. DistilBERT Fine-tuning with Focal Loss
Fine-tuned distilbert-base-uncased (66M parameters, 6 transformer layers). Used Focal Loss (α=0.25, γ=2.0) instead of standard BCE to down-weight easy examples and focus learning on hard, rare classes like severe toxic. Effective batch size: 16 via gradient accumulation.
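To make the down-weighting concrete, here is a framework-free sketch of binary focal loss applied per label. It follows the common convention in which α weights the positive class and 1 − α the negative class; whether the project used exactly this convention is an assumption, and the function names are illustrative. In training, the same formula would be applied to tensors of sigmoid outputs.

```python
import math


def binary_focal_loss(prob, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one label: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    p_t = prob if target == 1 else 1.0 - prob  # probability given to the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def multilabel_focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Mean focal loss over the labels of a single comment."""
    losses = [binary_focal_loss(p, t, alpha, gamma) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)
```

With γ=2.0, a confident correct prediction (p_t = 0.95) is scaled by (0.05)² = 0.0025, while a badly wrong one (p_t = 0.1) is scaled by (0.9)² = 0.81, so the hard examples dominate the gradient.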
3. Multi-Label Classification
Classified 6 labels simultaneously: toxic, severe toxic, obscene, threat, insult, identity hate. Applied per-class threshold optimization: the default threshold is 0.50, and values up to 0.78 were evaluated for severe toxic to trade recall for precision. Hamming Loss: 0.039 (−17% vs baseline).
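The threshold search and the Hamming loss can be sketched as follows. This is an illustrative stdlib version rather than the project's code: the helper names, the 0.50 to 0.99 grid, and the "lowest threshold that meets the precision target" rule are all assumptions.

```python
def precision_recall(y_true, y_score, threshold):
    """Precision and recall for one label at a given decision threshold."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted:
            fp += 1
        elif truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_threshold_for_precision(y_true, y_score, target_precision, grid=None):
    """Return (threshold, precision, recall) for the lowest grid value meeting the target."""
    grid = grid or [i / 100 for i in range(50, 100)]
    for threshold in grid:
        precision, recall = precision_recall(y_true, y_score, threshold)
        if precision >= target_precision:
            return threshold, precision, recall
    return None  # no threshold on the grid reaches the target


def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for t_row, p_row in zip(y_true, y_pred) for t, p in zip(t_row, p_row))
    return wrong / total
```

Picking the lowest qualifying threshold keeps as much recall as possible once the precision target is met, which is the trade-off described for severe toxic.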
4. Severity Scoring & Action Routing
Built a 0–5 weighted severity scoring system (severe toxic weight=5, threat=4, identity hate=4, toxic=3, obscene=2, insult=2). Scores route to 5 actions: allow (0–1.5), warning (1.5–3.0), mute 1h (3.0–4.0), mute 24h (4.0–4.5), auto-ban (4.5–5.0). 95% of the validation set was allowed.
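One way the scoring and routing could fit together is sketched below. The weights and score bands come from the description above; everything else is an assumption. In particular, the source does not say how the weights are aggregated, so this sketch takes the largest weight × probability (which keeps the score in the 0–5 range), and it treats each band as including its lower bound and excluding its upper bound.

```python
# Weights from the description above; dictionary keys are assumed label names.
WEIGHTS = {
    "severe_toxic": 5,
    "threat": 4,
    "identity_hate": 4,
    "toxic": 3,
    "obscene": 2,
    "insult": 2,
}

# (exclusive upper bound, action); scores of 4.5 and above fall through to auto_ban.
ACTION_BANDS = [
    (1.5, "allow"),
    (3.0, "warning"),
    (4.0, "mute_1h"),
    (4.5, "mute_24h"),
]


def severity_score(probs):
    """Assumed aggregation: the largest weight * probability across labels."""
    return max(WEIGHTS[label] * probs.get(label, 0.0) for label in WEIGHTS)


def route_action(score):
    """Map a 0-5 severity score to a moderation action."""
    for upper, action in ACTION_BANDS:
        if score < upper:
            return action
    return "auto_ban"
```

For example, a comment scored `{"toxic": 0.9, "insult": 0.8}` gets a severity of about 2.7 and routes to a warning under these assumptions.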
Per-Class Model Performance
Label | Precision | Recall | F1 | ROC-AUC | vs Baseline
Toxic | 0.78 | 0.82 | 0.80 | 0.971 | +12.7%
Severe Toxic | 0.63 | 0.45 | 0.53 | 0.982 | +43.2%
Obscene | 0.85 | 0.79 | 0.82 | 0.986 | +6.5%
Threat | 0.42 | 0.18 | 0.25 | 0.979 | +47.1%
Insult | 0.76 | 0.71 | 0.73 | 0.977 | +10.6%
Identity Hate | 0.58 | 0.38 | 0.46 | 0.981 | +24.3%

Overall Macro F1: 0.557 (+11.2% over TF-IDF baseline). All ROC-AUC scores > 0.97.

Key Challenges
Extreme Class Imbalance
89.8% of comments are clean and only 1% are severe toxic (~500 samples in 50K). Standard BCE loss biased the model toward predicting clean. Solved with Focal Loss (γ=2.0) to focus on hard examples and stratified sampling to guarantee rare-class representation.
CPU-Only Training Constraint
No GPU access meant 14.73 hours of training time for 50K samples; the full dataset (160K) would take ~47 hours. This limited the hyperparameter search. Used gradient accumulation (steps=2) to simulate a larger effective batch size of 16 on limited hardware.
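Gradient accumulation with steps=2 and an effective batch size of 16 implies micro-batches of 8. The toy sketch below, for a one-parameter model y = w · x with mean squared error, shows why this works: averaging the micro-batch gradients reproduces the full-batch gradient. The model and function names are illustrative assumptions, not the project's training loop.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, accumulation_steps):
    """Accumulate gradients over equal micro-batches before a single update."""
    # Assumes len(batch) is divisible by accumulation_steps.
    micro_size = len(batch) // accumulation_steps
    total = 0.0
    for step in range(accumulation_steps):
        micro = batch[step * micro_size:(step + 1) * micro_size]
        # Scale each micro-batch gradient so the sum matches the full-batch mean.
        total += mse_grad(w, micro) / accumulation_steps
    return total
```

In PyTorch the same effect is usually obtained by dividing the loss by the number of accumulation steps, calling `backward()` on every micro-batch, and calling `optimizer.step()` and `optimizer.zero_grad()` only once per accumulation cycle.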
Precision vs. Recall Trade-off
Achieving 80% severe toxic precision requires threshold=0.78, but this drops recall to just 12%. The current threshold=0.50 gives 63.2% precision with 45% recall. The model has strong ranking ability (AUC=0.982): the gap is data scarcity, not model capacity.
Tech Stack
Python 3.11 · DistilBERT · PyTorch 2.1 · Transformers 4.36 · HuggingFace · Scikit-learn · Pandas · NumPy · Matplotlib · Seaborn · Focal Loss · Multi-label
Project Info
Type: NLP Classification
Domain: Content Moderation
Dataset: Jigsaw / Kaggle
Samples: 50K trained
Labels: 6 categories
Training: 14.73 hrs (CPU)
Latency: ~50ms inference
Organisation: Newton AI Tech
Metrics
Macro F1: 55.7%
Toxic Recall: 75.6%
FPR (lower = better): 0.9%
Severe Toxic Precision: 63.2%
Avg ROC-AUC: 97.9%