Module

Advanced ML & Model Interpretability

Handling Class Imbalance · Page 1 of 2

The Imbalance Problem

The Problem (Recap)

Dataset: 1% fraud, 99% normal transactions
Naive model: Always predict "normal" → 99% accuracy but 0% fraud detection!
Metric problem: Accuracy is misleading on imbalanced data. Judge the model on minority-class metrics such as precision, recall, and F1 instead.
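The recap above can be checked with a few lines of arithmetic. This is a minimal sketch on hypothetical toy labels (1,000 transactions, 10 of them fraud), not real data. It uses only the standard library so the calculation stays visible.

```python
# Hypothetical toy data: 1,000 transactions, 1% fraud (label 1).
y_true = [1] * 10 + [0] * 990

# Naive "model": always predict the majority class (normal = 0).
y_pred = [0] * len(y_true)

# Accuracy = correct predictions / all predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class = fraud caught / all actual fraud.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy     = {accuracy:.2%}")  # accuracy     = 99.00%
print(f"fraud recall = {recall:.2%}")    # fraud recall = 0.00%
```

The high accuracy comes entirely from the 990 normal transactions; recall on the fraud class shows that the model catches nothing.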

Solution 1: Adjust Class Weights

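Tell the algorithm to penalize mistakes on the minority class more heavily, so that a missed fraud case (False Negative) costs far more than a false alarm (False Positive).

from sklearn.linear_model import LogisticRegression

# Logistic Regression with weights inversely proportional to class frequency
model = LogisticRegression(class_weight='balanced')

# Class weight formula:
# weight_class_c = n_samples / (n_classes * count_class_c)
# For 99% vs 1%: weight_0 = 100 / (2 * 99) ≈ 0.505, weight_1 = 100 / (2 * 1) = 50

With 'balanced' weights, the minority class (1) gets about 99x the weight of the majority class (50 / 0.505 ≈ 99).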
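The weight formula can be reproduced by hand. The sketch below assumes a hypothetical 100-sample label list with a 99:1 split; `balanced_class_weights` is an illustrative helper written for this example, not a library function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return weight_c = n_samples / (n_classes * count_c) for each class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 99 normal (0), 1 fraud (1).
y = [0] * 99 + [1] * 1
weights = balanced_class_weights(y)

print(weights)                   # class 0 is about 0.505, class 1 is 50.0
print(weights[1] / weights[0])   # ratio of minority to majority weight, about 99
```

The ratio equals the majority-to-minority count ratio (99 / 1), which is why the minority weight is roughly, but not exactly, 100x larger.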

Solution 2: Resampling

Undersampling

Randomly remove majority-class samples until the classes are balanced.

Pros: Fast training on a smaller dataset
Cons: Discards potentially useful majority-class data
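Random undersampling might look like the sketch below. The function name `random_undersample`, the fixed seed, and the toy data are assumptions made for illustration.

```python
import random


def random_undersample(X, y, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)

    # Group rows by class label.
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)

    target = min(len(rows) for rows in by_class.values())

    X_out, y_out = [], []
    for label, rows in by_class.items():
        kept = rng.sample(rows, target)  # sampling without replacement
        X_out.extend(kept)
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical toy data: 1,000 rows, the first 10 are fraud.
X = [[i] for i in range(1000)]
y = [1 if i < 10 else 0 for i in range(1000)]

X_bal, y_bal = random_undersample(X, y)
print(len(X_bal), y_bal.count(0), y_bal.count(1))  # 20 10 10
```

Only 20 of the original 1,000 rows survive, which makes the data-loss cost concrete.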

Oversampling

Randomly duplicate minority-class samples until the classes are balanced.

Pros: Keep all data
Cons: Risk of overfitting to the repeated minority examples
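A minimal sketch of random oversampling follows. The helper name `random_oversample`, the seed, and the toy data are hypothetical.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate random rows of smaller classes until all match the largest class."""
    rng = random.Random(seed)

    # Group rows by class label.
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)

    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw the missing rows with replacement from the existing ones.
        extra = rng.choices(rows, k=target - len(rows))
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical toy data: 1,000 rows, the first 10 are fraud.
X = [[i] for i in range(1000)]
y = [1 if i < 10 else 0 for i in range(1000)]

X_over, y_over = random_oversample(X, y)
print(len(X_over), y_over.count(0), y_over.count(1))  # 1980 990 990
```

Every added row is an exact copy of one of the 10 original fraud rows, which is where the overfitting risk comes from. Resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.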

SMOTE (Synthetic Minority Over-sampling)

Generate synthetic minority examples by interpolating between existing samples.

Pros: Synthetic samples avoid exact duplicates
Cons: More computation (nearest-neighbor search), and interpolation can produce unrealistic points where the classes overlap
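The interpolation idea can be sketched as below. This is a simplified illustration, not a full SMOTE implementation: `smote_sketch`, the choice of `k`, and the four toy minority points are assumptions. In practice a tested implementation is preferable, such as the `SMOTE` class in the imbalanced-learn package.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]

        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)

        # Interpolate: base + gap * (neighbour - base), with gap in [0, 1).
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points with two features each.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sketch(minority, n_new=5, k=2)

for point in new_points:
    print([round(v, 3) for v in point])
```

Each synthetic point lies between two real minority points, so it is new but stays inside the region the minority class already occupies.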

Solution 3: Threshold Adjustment

By default, logistic regression uses threshold=0.5.

If probability >= 0.5 → Class 1
If probability < 0.5 → Class 0

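For imbalanced data, choose the threshold deliberately instead of accepting 0.5. Lowering it (e.g., to 0.2 or 0.3) flags more cases as fraud and raises recall at the cost of more false alarms. Raising it (e.g., to 0.7 or 0.8) is more conservative: fewer false alarms, but more missed fraud. Pick the value on validation data according to the relative cost of each error.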
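The effect of moving the threshold can be seen on a small example. The probabilities and labels below are hypothetical, and `predict_with_threshold` and `precision_recall` are illustrative helpers written for this sketch.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Class 1 if probability >= threshold, otherwise class 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted fraud probabilities and true labels.
probs = [0.05, 0.20, 0.35, 0.55, 0.65, 0.75, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    preds = predict_with_threshold(probs, threshold)
    results[threshold] = precision_recall(labels, preds)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")

# threshold=0.3: precision=0.80, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.7: precision=1.00, recall=0.50
```

On this toy data, the lowest threshold catches every fraud case and the highest one produces no false alarms but misses half of the fraud.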
