Batch Normalisation
Definition
Batch Normalisation is a technique used to improve the training of deep artificial neural networks by normalising the activations of each layer for every mini-batch. It involves re-centring and re-scaling the inputs to a layer to have zero mean and unit variance, followed by a learnable linear transformation. This stabilises learning and significantly reduces the number of training epochs required to train deep networks.
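As a minimal usage sketch (assuming PyTorch and an illustrative fully connected network; the layer sizes are not from the source), a Batch Normalisation layer is typically placed between a linear layer and its non-linearity:

import torch
import torch.nn as nn

# Illustrative network: 784-dimensional inputs, 10 output classes (assumed sizes)
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalises the 256 activations over each mini-batch
    nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)   # a mini-batch of 32 examples
logits = model(x)          # in training mode, BatchNorm1d uses mini-batch statistics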
Transformation Mechanism
Normalisation Step: For a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$, the algorithm first computes the batch mean $\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i$ and the batch variance $\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$. Each element is then normalised as $\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$, where $\epsilon$ is a small constant for numerical stability.
Learnable Affine Transformation: To ensure that the normalisation does not restrict the representational power of the network, the normalised values are transformed by $y_i = \gamma \hat{x}_i + \beta$. The parameters $\gamma$ (scale) and $\beta$ (shift) are learned during training via backpropagation, which allows the network to undo the normalisation if that is optimal for the task (see the sketch after this list).
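The two steps above can be illustrated with a minimal NumPy sketch for a single layer, assuming activations of shape (batch_size, num_features); the names gamma, beta, and eps are illustrative and correspond to the scale, shift, and stability constant:

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalisation step
    return gamma * x_hat + beta              # learnable affine transformation

x = np.random.randn(32, 4) * 3.0 + 5.0       # mini-batch with non-zero mean and variance
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.var(axis=0))         # approximately 0 and 1 per feature

With $\gamma = 1$ and $\beta = 0$ the output simply has zero mean and unit variance per feature; during training these parameters move away from that initialisation as the optimiser sees fit.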
Training and Inference
Internal Covariate Shift: By maintaining stable distributions of layer inputs throughout training, Batch Normalisation mitigates the problem of internal covariate shift, where changes in early layer parameters force later layers to constantly adapt to new distributions.
Regularisation Effect: During training, the mean and variance are estimated from mini-batches, introducing slight noise into the activations. This stochasticity acts as a form of regularisation, often reducing the need for techniques such as Dropout.
Inference Procedure: During inference, the mini-batch statistics are replaced by population statistics (typically running averages of the mean and variance computed during training). This ensures that the output depends deterministically on the input rather than on other elements in a batch.
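A minimal sketch of this training/inference split, assuming the running statistics are tracked with an exponential moving average (the momentum value and update convention are illustrative; frameworks differ in the exact formulation):

import numpy as np

class BatchNorm:
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps, self.momentum = eps, momentum

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # update population estimates that will be used at inference
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # inference: use stored running averages, not mini-batch statistics
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm(4)
for _ in range(100):                                           # simulated training steps
    bn(np.random.randn(32, 4) * 3.0 + 5.0, training=True)
out = bn(np.random.randn(1, 4) * 3.0 + 5.0, training=False)    # deterministic per example

Because the inference path depends only on the stored running statistics, the output for a given input no longer varies with the other examples that happen to share its batch.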