Toxic Comment Classifier

Problem Statement

Online gaming platforms host millions of daily active users chatting in real time. Manual moderation is too slow and too expensive at that scale, so toxicity has to be classified automatically, with under 100 ms latency.

The challenge was extreme class imbalance (89.8% clean comments, only 1.0% severe toxic) combined with linguistic complexity: obfuscation ("f*ck"), sarcasm, evolving slang, and multi-label co-occurrence (toxic and insult often appear together).

Business Requirements Status

Requirement	Target	Achieved	Status
Severe Toxic Precision	≥ 80%	63.2%	△ Approaching
False Positive Rate	≤ 5%	0.9%	✓ Exceeded
Toxic Recall	≥ 60%	75.6%	✓ Exceeded
Multi-label Capability	Required	✓	✓ Met
Inference Latency	< 100ms	~50ms	✓ Met

Approach & Methodology

Dataset & Stratified Sampling

Used the Jigsaw Toxic Comment Classification Challenge dataset (159,571 Wikipedia comments). Took a stratified subset of 50,000 samples that preserves the rare-class distribution: severe toxic makes up 1.02% of the subset vs. 1.00% of the original. Train/Val/Test split: 50K / 7.5K / 2K.
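
A minimal sketch of proportional stratified sampling, assuming each comment is a dict and the stratum is a single label flag. The function name `stratified_sample` and the row layout are hypothetical; the project may well have used a library routine or a multi-label stratum instead.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n, seed=42):
    """Draw roughly n rows while keeping each stratum's share of the data.

    `key` maps a row to its stratum (for example, the severe_toxic flag).
    Every non-empty stratum contributes at least one row, so the result
    can differ from n by a few rows because of rounding.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    total = len(rows)
    sample = []
    for members in groups.values():
        # Proportional allocation, with a floor of one row per stratum.
        k = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Toy data: 1,000 comments, 1% flagged severe toxic.
rows = [{"id": i, "severe_toxic": int(i < 10)} for i in range(1000)]
subset = stratified_sample(rows, key=lambda r: r["severe_toxic"], n=100)
```

The floor of one row per stratum is the design choice that guarantees rare classes survive the down-sampling.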

DistilBERT Fine-tuning with Focal Loss

Fine-tuned distilbert-base-uncased (66M parameters, 6 transformer layers). Used Focal Loss (α=0.25, γ=2.0) instead of standard BCE to down-weight easy examples and focus learning on hard, rare classes like severe toxic. Effective batch size: 16 via gradient accumulation.
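
To illustrate why Focal Loss down-weights easy examples, here is a plain-Python sketch of the per-output loss next to standard binary cross-entropy. It assumes the common convention in which α weights the positive class and 1−α the negative class; the source does not say which convention was used, and real training would apply the same formula to PyTorch tensors across all 6 sigmoid outputs.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Standard BCE for one predicted probability p and binary target y."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p if y == 1 else 1.0 - p)


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one predicted probability p and binary target y.

    p_t is the probability assigned to the true class. The factor
    (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    Assumption: alpha applies to positives, (1 - alpha) to negatives.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p=0.9) vs. a hard positive (p=0.1).
easy_vs_hard_bce = binary_cross_entropy(0.9, 1) / binary_cross_entropy(0.1, 1)
easy_vs_hard_focal = binary_focal_loss(0.9, 1) / binary_focal_loss(0.1, 1)
```

With γ=2.0 the easy example contributes a far smaller share of the loss under Focal Loss than under BCE, which is the effect the paragraph above relies on.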

Multi-Label Classification

Classified 6 labels simultaneously: toxic, severe toxic, obscene, threat, insult, identity hate. Applied per-class threshold optimization: the default threshold is 0.50, and the severe toxic threshold can be raised as far as 0.78 to trade recall for precision. Hamming Loss: 0.039 (17% lower than baseline).
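
The per-class thresholding and the Hamming Loss metric can be sketched as follows. The label keys and helper names are illustrative assumptions, not the project's actual identifiers.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def apply_thresholds(probs, thresholds, default=0.50):
    """Turn 6 sigmoid probabilities into 0/1 labels.

    `thresholds` overrides the default cut-off for selected labels,
    e.g. {"severe_toxic": 0.78}.
    """
    return [int(p >= thresholds.get(label, default)) for label, p in zip(LABELS, probs)]


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for t_row, p_row in zip(y_true, y_pred) for t, p in zip(t_row, p_row))
    return wrong / total


probs = [0.60, 0.60, 0.10, 0.10, 0.10, 0.10]
default_pred = apply_thresholds(probs, {})                     # severe_toxic fires at 0.50
strict_pred = apply_thresholds(probs, {"severe_toxic": 0.78})  # severe_toxic suppressed
```

Hamming Loss counts each of the 6 labels separately, so one wrong label on one comment costs 1/6 rather than a whole sample.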

Severity Scoring & Action Routing

Built a 0–5 weighted severity scoring system (severe toxic weight=5, threat=4, identity hate=4, toxic=3, obscene=2, insult=2). Routes to 5 actions: allow (0–1.5), warning (1.5–3.0), mute 1h (3.0–4.0), mute 24h (4.0–4.5), auto-ban (4.5–5.0). 95% of validation set allowed.
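
One way such a scorer could work is sketched below. Two things are assumptions, because the source gives only the weights and the bands: the score is taken as the maximum of weight × probability over the labels (which keeps it in 0–5), and each band includes its lower bound and excludes its upper bound.

```python
WEIGHTS = {
    "severe_toxic": 5,
    "threat": 4,
    "identity_hate": 4,
    "toxic": 3,
    "obscene": 2,
    "insult": 2,
}

# (lower bound, action), checked from the most severe band down.
BANDS = [
    (4.5, "auto-ban"),
    (4.0, "mute 24h"),
    (3.0, "mute 1h"),
    (1.5, "warning"),
    (0.0, "allow"),
]


def severity_score(probs):
    """Assumed aggregation: highest weight * probability across labels."""
    return max(weight * probs.get(label, 0.0) for label, weight in WEIGHTS.items())


def route(score):
    """Map a 0-5 severity score to a moderation action."""
    for lower, action in BANDS:
        if score >= lower:
            return action
    return "allow"


action = route(severity_score({"severe_toxic": 0.95, "toxic": 0.99}))
```

Under this aggregation only severe toxic can reach the auto-ban band, since no other weight exceeds 4.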

Per-Class Model Performance

Label	Precision	Recall	F1	ROC-AUC	vs Baseline
Toxic	0.78	0.82	0.80	0.971	+12.7%
Severe Toxic	0.63	0.45	0.53	0.982	+43.2%
Obscene	0.85	0.79	0.82	0.986	+6.5%
Threat	0.42	0.18	0.25	0.979	+47.1%
Insult	0.76	0.71	0.73	0.977	+10.6%
Identity Hate	0.58	0.38	0.46	0.981	+24.3%

Overall Macro F1: 0.557 (+11.2% over TF-IDF baseline). All ROC-AUC scores > 0.97.

Key Challenges

Extreme Class Imbalance

89.8% clean, only 1% severe toxic (~500 samples in 50K). Standard BCE loss biased the model toward predicting clean. Solved with Focal Loss (γ=2.0) to focus on hard examples and stratified sampling to guarantee rare-class representation.

CPU-Only Training Constraint

No GPU access meant 14.73 hours of training time for 50K samples. The full dataset (160K) would take ~47 hours at the same rate, which limited hyperparameter search. Used gradient accumulation (steps=2) to simulate a larger effective batch size of 16 on limited hardware.
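
The idea behind gradient accumulation can be checked on a toy one-parameter model: averaging the gradients of 2 micro-batches of 8 gives the same gradient as a single batch of 16. This is a framework-free illustration with hypothetical function names, not the project's training loop; in PyTorch the same effect comes from dividing the loss by the number of accumulation steps and calling the optimizer step only after every second micro-batch.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error w.r.t. w for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches.

    Each micro-batch gradient is scaled by 1/steps so that the sum
    matches the gradient of the full batch.
    """
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        total += grad_mse(w, micro_batch) / steps
    return total


# 16 toy points; two micro-batches of 8 stand in for steps=2.
batch = [(float(i), 2.0 * i + 1.0) for i in range(16)]
full = grad_mse(0.5, batch)
accumulated = accumulated_grad(0.5, batch, micro_batch_size=8)
```

The equivalence holds exactly only when the micro-batches are the same size, which is the case for 16 split into 2 × 8.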

Precision vs. Recall Trade-off

Achieving 80% severe toxic precision requires threshold=0.78, but this drops recall to just 12%. The current threshold=0.50 gives 63.2% precision with 45% recall. The model ranks well (AUC=0.982), which suggests the gap is data scarcity, not model capacity.
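
A threshold that meets a precision target can be found by scanning candidate cut-offs on a validation set. The sketch below uses made-up scores and labels, and its helper names are hypothetical; it only shows the mechanics of the trade-off, not the project's numbers.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_threshold_for_precision(scores, labels, target):
    """Return (threshold, precision, recall) for the lowest cut-off that
    reaches the target precision, or None if no cut-off does."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(scores, labels, threshold)
        if precision >= target:
            return threshold, precision, recall
    return None


# Toy validation scores for one label (1 = truly severe toxic).
scores = [0.95, 0.85, 0.80, 0.60, 0.55, 0.30]
labels = [1, 1, 0, 1, 0, 0]
result = lowest_threshold_for_precision(scores, labels, target=0.80)
```

Choosing the lowest qualifying threshold keeps as much recall as possible while still meeting the precision requirement.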

Tech Stack

Python 3.11, DistilBERT, PyTorch 2.1, Transformers 4.36, HuggingFace, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn, Focal Loss, Multi-label

Project Info

Type: NLP Classification
Domain: Content Moderation
Dataset: Jigsaw / Kaggle
Samples: 50K trained
Labels: 6 categories
Training: 14.73 hrs (CPU)
Latency: ~50ms inference
Organisation: Newton AI Tech

Metrics

Macro F1: 55.7%
Toxic Recall: 75.6%
FPR (lower = better): 0.9%
Severe Toxic Precision: 63.2%
Avg ROC-AUC: 97.9%
