machine-learning

Definition

Vanishing Gradient Problem

The vanishing gradient problem is a phenomenon encountered during the training of deep artificial neural networks where the gradients of the loss function with respect to the network weights become increasingly small as they are propagated backwards through the layers. This results in the weights of the early layers receiving negligible updates, effectively halting the learning process.

Mathematical Mechanism

Chain Rule Multiplication: During backpropagation, the gradient for an early layer is obtained by chaining together the local derivatives of all subsequent layers. In a network with n layers, the gradient reaching the initial layer therefore involves a product with roughly one partial-derivative factor per layer. If these factors are consistently smaller than 1, the product decays exponentially with the number of layers.
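
A minimal numeric sketch (not from the source) of this decay: if every layer contributes a factor of about 0.25, the maximum value of the sigmoid derivative, the gradient scale collapses within a few tens of layers.

```python
# Illustrative assumption: each layer multiplies the gradient by a constant
# factor smaller than 1 (here 0.25, the maximum of the sigmoid derivative).
per_layer_factor = 0.25

for n in (5, 10, 20, 50):
    scale = per_layer_factor ** n
    print(f"{n:>2} layers: gradient scale ~ {scale:.3e}")
```

Already at 20 layers the factor is on the order of 1e-12, far smaller than any update a typical learning rate can deliver.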

Activation Function Saturation: Traditional activation functions such as the sigmoid or hyperbolic tangent have derivatives that never exceed 1 (specifically, the sigmoid derivative is at most 0.25, and both derivatives approach zero as the unit saturates). When these functions are used in deep architectures, the repeated multiplication of these small values rapidly drives the gradient towards zero in the lower layers.
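
To make this concrete, the following toy sketch (a hypothetical example, not from the source) pushes a random input through a stack of sigmoid layers and backpropagates a unit gradient, printing its norm every few layers; the norm shrinks by many orders of magnitude on the way back to the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 30, 64
# Hypothetical weights, scaled so forward activations stay in a sane range.
weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
           for _ in range(n_layers)]

# Forward pass, keeping each layer's activations for the backward pass.
a = rng.normal(size=width)
activations = []
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: chain rule through the sigmoid (derivative a * (1 - a)) and W.
grad = np.ones(width)
for depth, (W, a_l) in enumerate(zip(reversed(weights), reversed(activations)), start=1):
    grad = W.T @ (grad * a_l * (1.0 - a_l))
    if depth % 5 == 0:
        print(f"{depth:>2} layers back: |grad| = {np.linalg.norm(grad):.3e}")
```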

Mitigation Strategies

Activation Function Selection: Utilising non-saturating activation functions such as ReLU (Rectified Linear Unit) helps preserve gradient magnitude, as their derivative is exactly 1 for all positive inputs, preventing the exponential decay associated with sigmoid-like functions.
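
As an illustrative sketch (an assumed example, not from the source), comparing the two derivatives directly shows why ReLU preserves gradient magnitude where the sigmoid does not:

```python
import numpy as np

def relu_deriv(x):
    # 1 for positive inputs, 0 otherwise: no shrinking factor on active units.
    return (x > 0).astype(float)

def sigmoid_deriv(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)   # never exceeds 0.25, near 0 once the unit saturates

x = np.array([-2.0, 0.5, 3.0, 8.0])
print("ReLU derivative:   ", relu_deriv(x))       # [0. 1. 1. 1.]
print("sigmoid derivative:", sigmoid_deriv(x))    # all <= 0.25, tiny at x = 8
```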

Weight Initialisation: Careful initialisation schemes, such as Xavier/Glorot or He initialisation, ensure that the variance of activations and gradients remains stable across layers, preventing them from shrinking too rapidly at the start of training.
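
A minimal sketch of the two schemes (simplified NumPy versions under stated assumptions; real libraries also handle fan-in/fan-out conventions and gain factors):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Glorot/Xavier: variance chosen so signal variance is roughly preserved
    # in both the forward and backward pass for near-linear / tanh units.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He: variance doubled relative to fan-in because ReLU zeroes about half
    # of its inputs, which would otherwise halve the signal variance.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W = he_normal(256, 256)
print(W.std())  # close to sqrt(2/256) ~ 0.088
```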

Architectural Innovations: Techniques such as Batch Normalisation re-scale activations to maintain a healthy gradient flow. Similarly, Residual Connections (skip connections) provide a direct path for gradients to bypass layers, significantly reducing the impact of the vanishing gradient problem in extremely deep networks.
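
A simplified sketch of both ideas (omitting the learnable scale and shift parameters of Batch Normalisation and any convolution details; the function names are illustrative only):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Re-centre and re-scale each feature over the batch so activations stay
    # in a range where the subsequent non-linearity does not saturate.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def residual_block(x, W1, W2):
    # Output is F(x) + x: the identity skip path gives the gradient a direct
    # route past the block, so it cannot vanish through the transformed path.
    h = np.maximum(0.0, batch_norm(x @ W1))   # linear -> batch norm -> ReLU
    return h @ W2 + x
```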