Handling Class Imbalance
The Problem (Recap)
- Dataset: 1% fraud, 99% normal transactions
- Naive model: Always predict "normal" → 99% accuracy but 0% fraud detection!
- Metric problem: Accuracy is misleading on imbalanced data. Judge the model on minority-class metrics such as precision and recall instead.
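The recap above can be checked with a few lines of arithmetic. This is a minimal sketch on made-up labels (1,000 transactions, 1% fraud); the data and variable names are illustrative assumptions, not part of the lesson's dataset.

```python
# Hypothetical toy labels: 10 fraud (1) and 990 normal (0) transactions.
y_true = [1] * 10 + [0] * 990

# Naive "model": always predict normal.
y_pred = [0] * len(y_true)

# Accuracy = share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on fraud = fraud cases caught / all fraud cases.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy     = {accuracy:.2f}")  # accuracy     = 0.99
print(f"fraud recall = {recall:.2f}")    # fraud recall = 0.00
```

The model looks excellent by accuracy while catching no fraud at all, which is why recall (and precision) on the fraud class are the numbers to watch.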
Solution 1: Adjust Class Weights
Tell the algorithm: "Penalize False Negatives (missed fraud) roughly 100x more than False Positives."
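from sklearn.linear_model import LogisticRegression

# Logistic Regression with weights inversely proportional to class frequency
model = LogisticRegression(class_weight='balanced')
# Class weight formula (applied to each class c):
# weight_c = n_samples / (n_classes * count_c)
# For 99% vs 1%: weight_0 = 100 / (2 * 99) ≈ 0.505, weight_1 = 100 / (2 * 1) = 50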
The minority class (1) gets about 100x more weight (50 / 0.505 = 99x, to be exact)!
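To make the weight formula concrete, here is a small sketch that applies it by hand. `balanced_class_weights` is a hypothetical helper written for this illustration, not a scikit-learn function; it only reproduces the formula quoted in the comment above.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c) for each class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 99 normal (0) and 1 fraud (1) sample, matching the 99% vs 1% example.
labels = [0] * 99 + [1] * 1
weights = balanced_class_weights(labels)
ratio = weights[1] / weights[0]

print(f"weight_0 = {weights[0]:.3f}")  # weight_0 = 0.505
print(f"weight_1 = {weights[1]:.3f}")  # weight_1 = 50.000
print(f"ratio    = {ratio:.0f}x")      # ratio    = 99x
```

The ratio equals the ratio of class counts (99 to 1), so each class contributes the same total weight to the loss.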
Solution 2: Resampling
Undersampling
Randomly remove majority-class samples until the classes are balanced.
- Pros: Fast training (smaller dataset)
- Cons: Discards potentially useful majority-class data
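Random undersampling can be sketched in plain Python as follows. The function name `random_undersample`, the toy data, and the fixed seed are assumptions made for illustration; libraries such as imbalanced-learn provide ready-made samplers for real use.

```python
import random

def random_undersample(X, y, majority_label=0, seed=0):
    """Keep every minority sample and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    kept = rng.sample(majority, k=len(minority))  # sample without replacement
    indices = sorted(kept + minority)
    return [X[i] for i in indices], [y[i] for i in indices]

# Hypothetical data: 100 samples, 5 of them fraud.
X = [[float(i)] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]

X_small, y_small = random_undersample(X, y)
print(len(y_small), sum(y_small))  # 10 5
```

Here 90 of the 95 normal samples are thrown away, which illustrates the data-loss drawback listed above.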
Oversampling
Duplicate minority-class samples (sampling with replacement) until the classes are balanced.
- Pros: Keep all data
- Cons: Risk of overfitting to the repeated samples
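A matching sketch for random oversampling is shown here. Again, `random_oversample` and the toy data are hypothetical names chosen for the example, not a library API.

```python
import random

def random_oversample(X, y, minority_label=1, seed=0):
    """Append randomly chosen copies of minority samples until both classes match in size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Draw with replacement, so the same sample may be copied several times.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    indices = list(range(len(y))) + extra
    return [X[i] for i in indices], [y[i] for i in indices]

# Hypothetical data: 100 samples, 5 of them fraud.
X = [[float(i)] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]

X_big, y_big = random_oversample(X, y)
print(len(y_big), sum(y_big))  # 190 95
```

Resampling should be applied to the training split only; duplicating samples before the train/test split lets copies of the same point leak into the test set.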
SMOTE (Synthetic Minority Over-sampling)
Generate synthetic minority examples by interpolating between existing samples.
- Pros: Synthetic data avoids exact duplicates
- Cons: Extra computation (a nearest-neighbor search per sample); can create noisy points where the classes overlap
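The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration under stated assumptions (brute-force neighbor search, a hypothetical `smote_sketch` function, made-up 2D points), not the reference implementation from any library.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of `base`, excluding `base` itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical minority-class points with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))  # 6
```

Each synthetic point lies on the line segment between a real minority sample and one of its nearest minority neighbors, so it is new but stays inside the region the minority class already occupies.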
Solution 3: Threshold Adjustment
By default, logistic regression uses threshold=0.5.
- If probability >= 0.5 → Class 1
- If probability < 0.5 → Class 0
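For imbalanced data, the default is rarely the best choice. Raising the threshold to 0.7 or 0.8 makes the model more conservative: fewer false alarms, but more missed fraud. Lowering it (for example to 0.2 or 0.3) catches more fraud at the cost of more false alarms. Pick the threshold on a validation set according to which error is more costly.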
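The effect of moving the threshold can be seen on a handful of made-up predictions. The probabilities, labels, and helper names below are assumptions for illustration; in practice the probabilities would come from something like the model's `predict_proba` output on a validation set.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Class 1 if probability >= threshold, else class 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true labels and predicted fraud probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.35, 0.55, 0.4, 0.2, 0.1, 0.1, 0.05, 0.05]

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = predict_with_threshold(probs, threshold)
    results[threshold] = precision_recall(y_true, y_pred)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.3: precision=0.60, recall=1.00
# threshold=0.5: precision=0.67, recall=0.67
# threshold=0.7: precision=1.00, recall=0.33
```

On this toy data, a higher threshold trades recall for precision and a lower one does the opposite, so the right value depends on whether a missed fraud case or a false alarm is more expensive.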