Handling Class Imbalance · Page 1 of 2

The Imbalance Problem

The Problem (Recap)

  • Dataset: 1% fraud, 99% normal transactions
  • Naive model: Always predict "normal" → 99% accuracy but 0% fraud detection!
  • Metric problem: Accuracy is misleading on imbalanced data because the majority class dominates the score.
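The recap above can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels (1,000 hypothetical transactions, 10 of them fraud), not real data: it scores an "always normal" predictor on accuracy and on fraud recall.

```python
# Hypothetical toy data: 1,000 transactions, 1% fraud (label 1).
y_true = [1] * 10 + [0] * 990

# Naive "model": always predict the majority class (normal = 0).
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: fraud cases caught / all fraud cases.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, fraud recall = {recall:.2f}")
# prints: accuracy = 0.99, fraud recall = 0.00
```

The accuracy looks excellent, yet no fraud case is caught, which is why recall (or precision and recall together) is the more informative number here.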

Solution 1: Adjust Class Weights

Tell the algorithm: "Penalize False Negatives (missed fraud) roughly 100x more than False Positives."

# Logistic Regression
from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency
model = LogisticRegression(class_weight='balanced')

# Class weight formula:
# weight_class_k = n_samples / (n_classes * count_class_k)
# For 99% vs 1%: weight_0 = 100 / (2 * 99) ≈ 0.505, weight_1 = 100 / (2 * 1) = 50

The minority class (1) gets about 99x (roughly 100x) more weight than the majority class: 50 / 0.505 ≈ 99.
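To make the weight arithmetic concrete, here is a small dependency-free sketch of the formula quoted above. The helper name `balanced_class_weights` is hypothetical and written only for illustration; it is not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Hypothetical helper: weight_k = n_samples / (n_classes * count_k)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 100 samples: 99 normal (0) and 1 fraud (1), matching the 99% vs 1% example.
y = [0] * 99 + [1] * 1
weights = balanced_class_weights(y)

print(weights)                   # class 0 is about 0.505, class 1 is 50.0
print(weights[1] / weights[0])   # ratio between the two weights, about 99
```

In practice scikit-learn applies this weighting internally when `class_weight='balanced'` is set, and `sklearn.utils.class_weight.compute_class_weight` can be used to inspect the values directly.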

Solution 2: Resampling

Undersampling

Randomly drop majority class samples until the classes are balanced.

  • Pros: Fast training
  • Cons: Discards potentially useful majority data
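Random undersampling can be sketched in a few lines. The function below is a hypothetical illustration using only the standard library; it assumes a binary problem in which the majority class is labeled 0 and is larger than the minority class.

```python
import random

def random_undersample(X, y, majority_label=0, seed=0):
    """Hypothetical helper: keep all minority rows and an equal-sized
    random subset of majority rows."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    kept = rng.sample(majority, k=len(minority))  # sample without replacement
    indices = sorted(kept + minority)             # preserve original row order
    return [X[i] for i in indices], [y[i] for i in indices]

# Toy data: 100 rows, the first 5 are the minority class (1).
X = [[float(i)] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]

X_bal, y_bal = random_undersample(X, y)
print(len(y_bal), sum(y_bal))  # prints: 10 5
```

Of the 95 majority rows, 90 are thrown away, which is the "lose data" cost listed above.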

Oversampling

Duplicate minority class samples (sampling with replacement) until the classes are balanced.

  • Pros: Keep all data
  • Cons: Risk of overfitting to the repeated samples
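Random oversampling is the mirror image. The sketch below is again a hypothetical standard-library illustration, assuming a binary problem where the minority class is labeled 1.

```python
import random

def random_oversample(X, y, minority_label=1, seed=0):
    """Hypothetical helper: append randomly chosen copies of minority rows
    until both classes have the same count."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_extra = max(0, n_majority - len(minority))
    extra = rng.choices(minority, k=n_extra)      # sample with replacement
    indices = list(range(len(y))) + extra         # originals first, then copies
    return [X[i] for i in indices], [y[i] for i in indices]

# Toy data: 100 rows, the first 5 are the minority class (1).
X = [[float(i)] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]

X_over, y_over = random_oversample(X, y)
print(len(y_over), sum(y_over))  # prints: 190 95
```

Every original row is kept, but each minority row now appears about 19 times on average, which is where the overfitting risk comes from. A common precaution is to resample only the training split, so copies of the same row never end up in both training and evaluation data.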

SMOTE (Synthetic Minority Over-sampling)

Generate synthetic minority examples by interpolating between existing samples.

  • Pros: Adds new, varied points instead of exact duplicates
  • Cons: More computation than plain duplication; can create unrealistic points where the classes overlap
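The interpolation idea can be sketched as follows. This is a simplified, SMOTE-style illustration written for this page, not the reference algorithm or the imbalanced-learn implementation; the function name `smote_sketch` and the toy points are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Simplified SMOTE-style sampling: pick a minority point, pick one of
    its k nearest minority neighbours, and place a new point at a random
    position on the line segment between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class points with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))  # prints: 6
```

Each synthetic point lies between two real minority points, so it is new but still plausible for the class. The nearest-neighbour search is the extra computation compared with plain duplication.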

Solution 3: Threshold Adjustment

By default, a logistic regression classifier predicts with a probability threshold of 0.5:

  • If probability >= 0.5 → Class 1
  • If probability < 0.5 → Class 0

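For imbalanced data, move the threshold to match the cost of each error. Raising it to 0.7 or 0.8 makes fraud predictions more conservative (fewer false alarms, but more missed fraud). Lowering it, for example to 0.2 or 0.3, catches more fraud at the cost of more false alarms.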
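To see how the threshold trades missed fraud against false alarms, here is a small sketch on made-up probabilities. The labels and scores are hypothetical; in a real pipeline the scores would come from something like `model.predict_proba(X)[:, 1]`.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Label a sample as class 1 when its probability is >= threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true labels and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.05, 0.10, 0.20, 0.35, 0.55, 0.75, 0.30, 0.60, 0.72, 0.90]

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = predict_with_threshold(proba, threshold)
    results[threshold] = precision_recall(y_true, y_pred)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# prints:
# threshold=0.3: precision=0.57, recall=1.00
# threshold=0.5: precision=0.60, recall=0.75
# threshold=0.7: precision=0.67, recall=0.50
```

On this toy data, a higher threshold raises precision and lowers recall. Which direction to move depends on whether a missed fraud case or a false alarm is more expensive, and the threshold is best chosen on a validation set rather than the test set.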
