Weight Initialization, Regularization & Dropout

Weight Initialization

Weight Initialization & Regularization

Why Weight Initialization Matters

Scenario 1: All weights = 0

All neurons in a layer produce the same output and receive the same gradient
No diversity → the symmetry is never broken, so the network can't learn!

Scenario 2: Random huge weights (e.g., N(0, 100))

Activations explode → Gradients explode → Training unstable

Scenario 3: Random tiny weights (e.g., N(0, 0.0001))

Activations too small → Gradients vanish → Learning too slow

Goal: Find the Goldilocks zone!
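
The three scenarios can be reproduced with a small experiment. The sketch below is illustrative only: the helper name last_layer_scale, the depth of 10, the width of 64, the use of ReLU layers without biases, and treating the second argument of N(0, ...) as a standard deviation are all assumptions made for this example, not part of the lesson's own code. It uses plain Python with no external libraries.

```python
import math
import random

def last_layer_scale(weight_std, depth=10, width=64, seed=0):
    """Push one random input through `depth` ReLU layers whose weights have
    mean 0 and standard deviation `weight_std`; return the mean |activation|
    at the last layer."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # A fresh width x width weight matrix for every layer.
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        # Dense layer without bias, followed by ReLU.
        x = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]
    return sum(abs(v) for v in x) / width

if __name__ == "__main__":
    scales = {
        "all zeros": 0.0,
        "huge (std 100)": 100.0,
        "tiny (std 0.0001)": 1e-4,
        "scaled to layer width": math.sqrt(2.0 / 64),
    }
    for name, std in scales.items():
        print(f"{name:>22}: {last_layer_scale(std):.3e}")
```

Under these assumptions, the zero case outputs exactly zero everywhere, the huge case grows by many orders of magnitude, the tiny case shrinks toward zero, and only the width-scaled case stays at a moderate magnitude.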

Xavier (Glorot) Initialization

W ~ Uniform(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))

Or Gaussian (the second argument is the standard deviation):
W ~ Normal(0, √(2/(n_in + n_out)))

Intuition: Scale weights based on layer size

  • Large layer → smaller weights
  • Small layer → larger weights
  • Keeps activations from exploding/vanishing

When: For sigmoid/tanh layers
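
As a minimal sketch of the two Xavier formulas above, the following plain-Python helpers build a weight matrix as a list of lists. The function names, the n_out x n_in shape convention, and the optional rng parameter are assumptions for illustration; a real project would normally rely on its framework's built-in initializers.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng=random):
    """n_out x n_in matrix with entries from Uniform(-limit, limit)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

def xavier_normal(n_in, n_out, rng=random):
    """n_out x n_in matrix with entries of mean 0 and
    standard deviation sqrt(2 / (n_in + n_out))."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)]
            for _ in range(n_out)]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    W = xavier_uniform(n_in=256, n_out=128, rng=rng)
    print(len(W), "rows x", len(W[0]), "columns")
```

Both variants give the weights the same variance, 2/(n_in + n_out), because a Uniform(-a, a) distribution has variance a²/3.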

He Initialization

W ~ Normal(0, √(2/n_in))   (standard deviation √(2/n_in))

Better for ReLU:

  • ReLU doesn't saturate (unbounded on positive side)
  • ReLU zeroes about half of its inputs, so He uses a larger variance (2/n_in) to compensate
  • Keeps the signal magnitude stable in deep ReLU networks

When: For ReLU layers (the modern default)
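
The sketch below illustrates the He formula and checks, for a single layer, that the average squared activation is roughly preserved through a ReLU. The helper names (he_normal, relu_layer, mean_square) and the layer sizes are hypothetical choices for this example; only the standard deviation √(2/n_in) comes from the formula above.

```python
import math
import random

def he_normal(n_in, n_out, rng=random):
    """n_out x n_in matrix with entries of mean 0 and
    standard deviation sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)]
            for _ in range(n_out)]

def relu_layer(W, x):
    """Dense layer without bias, followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def mean_square(values):
    return sum(v * v for v in values) / len(values)

if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(128)]
    y = relu_layer(he_normal(128, 2048, rng), x)
    # With He scaling the two numbers should be of similar size.
    print("input  mean square:", mean_square(x))
    print("output mean square:", mean_square(y))
```

The factor 2 in the variance offsets the half of the pre-activations that ReLU sets to zero, which is why the output scale tracks the input scale here.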

Comparison

Xavier:  Works OK for sigmoid
He:      Better for ReLU
Unscaled random:  Bad! Don't use!

Modern practice: Use He initialization for ReLU networks!

Layer Normalization / Batch Normalization

Problem: Even with good initialization, activations drift during training.

Solution: Normalize activations before each layer!

Batch Normalization:
x_norm = (x - batch_mean) / √(batch_var + ε)
x_scaled = γ × x_norm + β

γ, β are learnable!

Effect: Stabilizes training, allows higher learning rates

Benefits:

  • Faster convergence
  • Less sensitive to initialization
  • Acts as a mild regularizer
  • Allows higher learning rates

When: Add after dense/conv layers, before activation
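
The two batch normalization formulas can be sketched in plain Python as follows. This is a training-mode forward pass only; the function name batch_norm_forward, the list-of-rows input layout, and the default eps of 1e-5 are assumptions for illustration. Framework implementations also track running statistics for use at inference time, which this sketch leaves out.

```python
import math

def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    batch: list of samples, each a list of d feature values
    gamma, beta: lists of d learnable scale and shift values
    """
    n = len(batch)
    d = len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            x_norm = (column[i] - mean) * inv_std      # x_norm in the formula
            out[i][j] = gamma[j] * x_norm + beta[j]    # x_scaled in the formula
    return out

if __name__ == "__main__":
    batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    for row in batch_norm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0]):
        print(row)
```

With γ = 1 and β = 0, each output feature has mean close to 0 and variance close to 1 over the batch, regardless of the original scale of that feature.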
