machine-learning

Definition

Layer Normalisation

Layer Normalisation is a technique for stabilising the training of deep artificial neural networks by normalising the activations across the features for each training case individually. Unlike Batch Normalisation, which normalises across the mini-batch, Layer Normalisation computes statistics from all the hidden units in the same layer for a single input. This makes it particularly effective for Recurrent Neural Networks (RNNs) and Transformers, where batch statistics can be unstable or ill-defined.
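To make the difference in normalisation axes concrete, here is a minimal NumPy sketch (the shape (batch, features) and the variable names are illustrative assumptions):

```python
import numpy as np

x = np.random.randn(32, 64)  # (batch, features): 32 examples, 64 hidden units

# Batch Normalisation: one mean per feature, computed across the mini-batch.
bn_mean = x.mean(axis=0)     # shape (64,)

# Layer Normalisation: one mean per example, computed across the features.
ln_mean = x.mean(axis=1)     # shape (32,)
```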

Transformation Mechanism

Normalisation Step: For a single training instance represented by a vector of activations $\mathbf{a} = (a_1, \dots, a_H)$ (where $H$ is the number of hidden units), the layer mean and variance are computed as:

$$\mu = \frac{1}{H} \sum_{i=1}^{H} a_i, \qquad \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (a_i - \mu)^2$$

The activations are then normalised: $\hat{a}_i = \frac{a_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\epsilon$ is a small constant added for numerical stability.
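The normalisation step can be written as a short NumPy sketch (the function name and $\epsilon = 10^{-5}$ are illustrative choices):

```python
import numpy as np

def normalise_layer(a, eps=1e-5):
    """Normalise one activation vector across its H units."""
    mu = a.mean()                      # layer mean
    var = a.var()                      # layer variance (1/H convention)
    return (a - mu) / np.sqrt(var + eps)

a = np.random.randn(128)               # H = 128 hidden units, one example
a_hat = normalise_layer(a)
print(a_hat.mean(), a_hat.std())       # approximately 0 and 1
```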

Learnable Transformation: Similar to Batch Normalisation, a learnable scale $\gamma$ and shift $\beta$ are applied: $y_i = \gamma_i \hat{a}_i + \beta_i$. Each hidden unit has its own scale and shift parameter; these are shared across all training cases (and across time steps in recurrent networks) but are specific to each layer.
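Putting both steps together, a full forward pass might look like the following sketch (gamma and beta stand in for the learnable per-unit parameters; initialising them to ones and zeros is the common convention):

```python
import numpy as np

def layer_norm(a, gamma, beta, eps=1e-5):
    """Layer Normalisation for a batch of activation vectors.

    a:     (batch, H) activations
    gamma: (H,) learnable scale, typically initialised to ones
    beta:  (H,) learnable shift, typically initialised to zeros
    """
    mu = a.mean(axis=-1, keepdims=True)    # per-example mean
    var = a.var(axis=-1, keepdims=True)    # per-example variance
    a_hat = (a - mu) / np.sqrt(var + eps)
    return gamma * a_hat + beta

H = 64
x = np.random.randn(8, H)
y = layer_norm(x, gamma=np.ones(H), beta=np.zeros(H))
```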

Advantages over Batch Normalisation

Batch Independence: Since Layer Normalisation does not depend on other examples in the mini-batch, it behaves identically during training and inference. This eliminates the need for maintaining running averages of statistics.
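This can be checked directly; the sketch below assumes PyTorch, whose torch.nn.LayerNorm keeps no running statistics, unlike torch.nn.BatchNorm1d:

```python
import torch

ln = torch.nn.LayerNorm(8)
bn = torch.nn.BatchNorm1d(8)
x = torch.randn(4, 8)

ln.train(); y_train = ln(x)
ln.eval();  y_eval = ln(x)
print(torch.allclose(y_train, y_eval))   # True: identical in both modes

bn.train(); bn(x)                        # training updates running mean/var
bn.eval(); y_bn = bn(x)                  # eval uses the running averages
```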

Sequence Flexibility: In sequence models like RNNs, the length of sequences can vary. Layer Normalisation can be applied to each time step independently, whereas Batch Normalisation requires careful handling of varying sequence lengths across the batch.
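For example (a PyTorch sketch; the shapes are illustrative), the same module applies unchanged to sequences of any length, because statistics are computed over the feature axis only:

```python
import torch

d_model = 16
ln = torch.nn.LayerNorm(d_model)

short = torch.randn(2, 5, d_model)    # (batch, time, features)
long = torch.randn(2, 50, d_model)    # ten times longer: no changes needed
print(ln(short).shape, ln(long).shape)
```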

Stable Statistics: In very deep or complex architectures, batch statistics can have high variance, leading to unstable training. Normalising across the layer’s units often provides more stable statistics, especially when the number of hidden units is large.
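A small numerical sketch illustrates the point (the choice of batch size 2 against $H = 512$ units is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
trials, batch, H = 1000, 2, 512
x = rng.normal(size=(trials, batch, H))   # standard-normal activations

batch_means = x.mean(axis=1)   # Batch Norm style: 2 samples per estimate
layer_means = x.mean(axis=2)   # Layer Norm style: 512 units per estimate

print(batch_means.std())       # ~0.71: noisy estimates of the true mean 0
print(layer_means.std())       # ~0.04: far more concentrated
```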