ML · Unsupervised · Anomaly Detection · Project 05
Influencer Fraud Detection
An anomaly-detection tool scoring Instagram/TikTok handles for fake engagement using Isolation Forest + LOF ensemble, achieving 87.3% precision across 5,000 influencer profiles.
87.3%Precision
82.1%Recall
84.6%F1 Score
0.91ROC-AUC
5,000Profiles
15+Features
Problem Statement
Brands lose millions annually by partnering with influencers who have inflated follower counts and fake engagement. Manual vetting at scale is impossible. The goal was to build an automated system that scores any Instagram/TikTok profile on a 0–100 fraud risk scale using only public metrics — no private data required.
Risk Distribution — 5,000 Profiles
Fraud risk category distribution across 5,000 analyzed profiles
Approach & Methodology
-
1
Data Generation & CollectionSimulated 5,000 influencer profiles with realistic distributions — 70% legitimate, 20% suspicious, 10% fraud. Log-normal follower distribution (1K–1M+). Engagement rates 0.1%–15%.
-
2
Feature Engineering (15+ Features)Engineered engagement metrics (likes/follower ratio), growth patterns (spike detection via z-scores), content metrics (post frequency), and anomaly indicators (bot follower % estimation, authenticity scoring).
-
3
Ensemble Model TrainingTrained Isolation Forest (100 estimators, contamination=0.1) and Local Outlier Factor (20 neighbors). Combined both — flags profile if either model detects anomaly. Averages scores for final risk calculation.
-
4
Fraud Scoring & ValidationInverted anomaly scores → min-max normalized to 0–100. Applied 20% boost if flagged by both models. Validated on 200 manually labeled profiles. Achieved 87.3% precision on fraud class.
Risk Categories & Business Actions
| Score Range | Risk Level | Profiles | Action |
|---|---|---|---|
| 0 – 20 | Very Low | 1,234 (24.7%) | Auto-approve |
| 20 – 40 | Low | 1,456 (29.1%) | Auto-approve |
| 40 – 60 | Moderate | 1,123 (22.5%) | Manual Review |
| 60 – 80 | High | 789 (15.8%) | Recommend Reject |
| 80 – 100 | Very High | 398 (8.0%) | Auto-reject |
Key Challenges
No Ground Truth Labels
Unsupervised problem — no pre-labeled fraud data available. Solved by manually labeling 200 profiles using heuristic rules (engagement anomalies, growth spikes, bot indicators) for validation.
Single Model Reliability
Isolation Forest alone missed local density anomalies. Added Local Outlier Factor to catch neighborhood-specific patterns. Ensemble approach improved precision by ~6% over single model.
Interpretability for Business
Raw anomaly scores are not actionable. Built a 0–100 scoring system with 5 named risk tiers and specific business actions (auto-approve/reject/review) so non-technical stakeholders can act on results.