ML · Unsupervised · Anomaly Detection · Project 05

Influencer Fraud Detection

An anomaly-detection tool scoring Instagram/TikTok handles for fake engagement using Isolation Forest + LOF ensemble, achieving 87.3% precision across 5,000 influencer profiles.

87.3%Precision

82.1%Recall

84.6%F1 Score

0.91ROC-AUC

5,000Profiles

15+Features

Problem Statement

Brands lose millions annually by partnering with influencers who have inflated follower counts and fake engagement. Manual vetting at scale is impossible. The goal was to build an automated system that scores any Instagram/TikTok profile on a 0–100 fraud risk scale using only public metrics — no private data required.

Risk Distribution — 5,000 Profiles

Fraud risk category distribution across 5,000 analyzed profiles

Approach & Methodology

1

Data Generation & Collection

Simulated 5,000 influencer profiles with realistic distributions — 70% legitimate, 20% suspicious, 10% fraud. Log-normal follower distribution (1K–1M+). Engagement rates 0.1%–15%.
2

Feature Engineering (15+ Features)

Engineered engagement metrics (likes/follower ratio), growth patterns (spike detection via z-scores), content metrics (post frequency), and anomaly indicators (bot follower % estimation, authenticity scoring).
3

Ensemble Model Training

Trained Isolation Forest (100 estimators, contamination=0.1) and Local Outlier Factor (20 neighbors). Combined both — flags profile if either model detects anomaly. Averages scores for final risk calculation.
4

Fraud Scoring & Validation

Inverted anomaly scores → min-max normalized to 0–100. Applied 20% boost if flagged by both models. Validated on 200 manually labeled profiles. Achieved 87.3% precision on fraud class.

Risk Categories & Business Actions

Score Range	Risk Level	Profiles	Action
0 – 20	Very Low	1,234 (24.7%)	Auto-approve
20 – 40	Low	1,456 (29.1%)	Auto-approve
40 – 60	Moderate	1,123 (22.5%)	Manual Review
60 – 80	High	789 (15.8%)	Recommend Reject
80 – 100	Very High	398 (8.0%)	Auto-reject

Key Challenges

No Ground Truth Labels

Unsupervised problem — no pre-labeled fraud data available. Solved by manually labeling 200 profiles using heuristic rules (engagement anomalies, growth spikes, bot indicators) for validation.

Single Model Reliability

Isolation Forest alone missed local density anomalies. Added Local Outlier Factor to catch neighborhood-specific patterns. Ensemble approach improved precision by ~6% over single model.

Interpretability for Business

Raw anomaly scores are not actionable. Built a 0–100 scoring system with 5 named risk tiers and specific business actions (auto-approve/reject/review) so non-technical stakeholders can act on results.

PreviousData Cleaning Pipeline

Back toAll Projects

Tech Stack

Python Scikit-learn Isolation Forest LOF Pandas NumPy Matplotlib Seaborn Jupyter PyYAML

Project Info

TypeAnomaly Detection

DomainSocial Media / ML

Profiles5,000 analyzed

Labeled200 for validation

Features15+ engineered

OrganizationNewton AI Tech

Metrics

Precision87.3%

Recall82.1%

F1 Score84.6%

ROC-AUC91%

View on GitHub