Parameter Norm Penalties
- Also called weight regularization.
- Adds a penalty term to the loss function that discourages large weight values.
- Idea: smaller weights → simpler models → less chance of overfitting.

L2 Regularization (Ridge)

Penalty:
$$ \lambda \sum_{i} w_i^2 $$
- Encourages weights to be small and spread out.
- Prevents overfitting, improves stability, and works well in deep learning.
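A minimal PyTorch sketch of adding this penalty to a training loss; the model, data, and `l2_lambda` value are illustrative assumptions, not part of the notes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)                      # assumed toy model
x, y = torch.randn(32, 10), torch.randn(32, 1)

l2_lambda = 1e-4  # assumed penalty strength (lambda)
loss = F.mse_loss(model(x), y)
# lambda * sum_i w_i^2 over all trainable parameters
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
(loss + l2_lambda * l2_penalty).backward()
```

With plain SGD, the same effect is usually obtained through the optimizer's `weight_decay` argument instead of an explicit penalty term.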

L1 Regularization (Lasso)

Penalty:
$$ \lambda \sum_{i} |w_i| $$
- Encourages sparsity (many weights become zero).
- Useful for feature selection.
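The same pattern for the L1 penalty (model and `l1_lambda` again assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

l1_lambda = 1e-3  # assumed penalty strength
loss = F.mse_loss(model(x), y)
# lambda * sum_i |w_i|; pushes many weights toward exactly zero
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + l1_lambda * l1_penalty).backward()
```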
Elastic Net
- Combination of L1 and L2 penalties.
- Balances sparsity with weight shrinkage.
- Slows down growth of weights, forcing the optimizer to find smoother solutions.
- Works well with gradient-based optimizers.
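A hedged sketch of the combined penalty, where the assumed `alpha` controls the L1/L2 mix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

lam, alpha = 1e-3, 0.5  # assumed strength and L1/L2 mixing ratio
l1 = sum(p.abs().sum() for p in model.parameters())
l2 = sum((p ** 2).sum() for p in model.parameters())
loss = F.mse_loss(model(x), y) + lam * (alpha * l1 + (1 - alpha) * l2)
loss.backward()
```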

Dataset Augmentation
- Artificially increasing the size and diversity of training data by applying transformations.
- Prevents overfitting by exposing the network to more variations of data.

Computer Vision:
- Flipping, rotation, cropping, scaling, brightness/contrast changes, adding noise.
- Cutout, Mixup, CutMix (advanced augmentations).

NLP:
- Synonym replacement, back-translation, random word insertion/deletion, paraphrasing.

Audio:
- Time-shifting, pitch shifting, background noise, speed changes.

- Slows training a bit (more diverse inputs), but improves generalization.
- Acts like data-driven regularization.
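One way a vision pipeline might look with `torchvision.transforms`; the particular transforms and parameters are illustrative choices:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # flipping
    transforms.RandomRotation(degrees=15),                 # rotation
    transforms.RandomResizedCrop(224),                     # cropping + scaling
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # brightness/contrast
    transforms.ToTensor(),
])
# Transforms run on the fly, so every epoch sees new variants of each image.
```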

Dropout
- Randomly “drops” (sets to zero) a fraction of neurons during training.
- Prevents co-adaptation of neurons (when neurons rely too much on each other).
- Each training step uses a slightly different network architecture.
- Equivalent to training an ensemble of smaller networks and averaging their predictions.
- Slows convergence slightly (since fewer neurons are active each step).
- Greatly improves generalization.
- At inference time, all neurons are used, with activations scaled by the keep probability (or, with inverted dropout, scaling is done during training instead) so expected magnitudes match.
- Spatial Dropout: Drops entire feature maps in CNNs.
- DropConnect: Randomly drops weights instead of activations.
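A minimal sketch of dropout's train/eval behavior in PyTorch, with an assumed `p=0.5` (PyTorch implements inverted dropout, so scaling happens during training):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # assumed drop probability
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the units zeroed; survivors scaled by 1/(1-p)

drop.eval()
print(drop(x))  # inference: identity mapping, all units active
```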

Batch Normalization
- A technique introduced to normalize activations within a layer during training.
- Each mini-batch’s activations are normalized to have zero mean and unit variance, then scaled and shifted with learnable parameters ($\gamma, \beta$):

$$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta $$

where:
- $\mu_B$, $\sigma_B^2$ → mean and variance of the mini-batch.
- $\epsilon$ → small constant for numerical stability.
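The formula above computed step by step in PyTorch; the batch shape and parameter initializations are assumptions:

```python
import torch

x = torch.randn(32, 16)              # assumed mini-batch: 32 samples, 16 features
gamma, beta = torch.ones(16), torch.zeros(16)  # learnable scale and shift
eps = 1e-5                           # numerical-stability constant

mu = x.mean(dim=0)                   # mu_B: per-feature batch mean
var = x.var(dim=0, unbiased=False)   # sigma_B^2: per-feature batch variance
x_hat = (x - mu) / torch.sqrt(var + eps)
y = gamma * x_hat + beta             # scale and shift
```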

Reduces Internal Covariate Shift:
- As parameters update, the distribution of activations changes.
- BN stabilizes these distributions, making training smoother.

Allows Higher Learning Rates:
- Without BN, high learning rates often cause divergence.
- BN smooths the loss surface, so larger steps can be taken.

Acts as Regularization:
- Adds small noise due to batch statistics.
- Reduces overfitting, sometimes making Dropout less necessary.

Improves Gradient Flow:
- Prevents gradients from vanishing or exploding, especially in deep networks.

Speeds Up Convergence:
- Networks often converge in fewer epochs when BN is used.

Limitations:
- Depends on batch size (unstable if batches are too small).
- Adds computational overhead.
- For Recurrent Neural Networks (RNNs), BN is tricky due to sequence dependency → alternatives like Layer Normalization or Group Normalization are used.
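A small sketch contrasting the two (shapes and the tiny batch are assumptions): BatchNorm's statistics come from across the batch, so they get noisy when the batch is small, while LayerNorm computes them within each sample and is batch-size independent:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 16)       # deliberately tiny batch of 2

bn = nn.BatchNorm1d(16)      # statistics over the batch dimension: noisy here
ln = nn.LayerNorm(16)        # statistics within each sample

out_bn, out_ln = bn(x), ln(x)  # both return shape (2, 16)
```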
| Technique | Primary Goal | How it Works | Effect on Training | Effect on Generalization |
|---|---|---|---|---|
| Parameter Norm Penalty (L1/L2) | Control weight growth | Adds penalty term to loss | Slower but stable updates | Reduces overfitting, sparsity (L1) |
| Dataset Augmentation | Increase data diversity | Transformations of training data | Slightly longer training on richer inputs | Strong generalization |
| Dropout | Prevent co-adaptation | Random neuron removal | Slower convergence | Excellent generalization |
| Batch Normalization | Stabilize distributions | Normalize activations | Faster convergence | Mild regularization |
✅ Together, these techniques often complement each other:
- Use L2 regularization + BatchNorm for stability.
- Add Dropout for generalization.
- Apply Dataset Augmentation if the dataset is small.
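A sketch that puts these pieces together; layer sizes and hyperparameters are illustrative assumptions (for plain SGD, `weight_decay` is equivalent to an L2 penalty on the weights):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # stabilize activation distributions
    nn.ReLU(),
    nn.Dropout(p=0.3),     # prevent co-adaptation of neurons
    nn.Linear(256, 10),
)
# weight_decay applies the L2-style shrinkage during the update step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```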