Quadratic Loss

Definition

Quadratic Loss

Quadratic loss is the loss function that penalises a prediction by the square of its residual. For a numerical target $y \in \mathbb{R}$ and prediction $\hat{y} \in \mathbb{R}$, let
$r = \hat{y} - y .$
The quadratic loss is
$ℓ (\hat{y}, y) = \frac{1}{2} (\hat{y} - y)^{2} = \frac{1}{2} r^{2} .$
The factor $1/2$ does not change the minimiser; it cancels the $2$ when differentiating. The loss is small near the correct prediction and grows quadratically as the residual moves away from zero, so large errors are penalised disproportionately.
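As a small illustration, the definition can be sketched in Python. The function names `quadratic_loss` and `quadratic_loss_grad` are chosen here for illustration and are not part of the notes; the numbers are arbitrary.

```python
def quadratic_loss(y_hat: float, y: float) -> float:
    """Return 0.5 * (y_hat - y)**2, the quadratic loss of one prediction."""
    r = y_hat - y  # residual
    return 0.5 * r * r


def quadratic_loss_grad(y_hat: float, y: float) -> float:
    """Derivative with respect to y_hat; the 1/2 cancels the 2, leaving the residual."""
    return y_hat - y


# Doubling the residual quadruples the loss.
loss_small = quadratic_loss(3.0, 2.0)  # residual 1
loss_large = quadratic_loss(4.0, 2.0)  # residual 2
print(loss_small, loss_large)
```

The derivative is simply the residual, which is the practical reason for carrying the factor $1/2$.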

From residuals to a quadratic surface

For a dataset $S = \{(x_{i}, y_{i})\}_{i = 1}^{m}$, the average quadratic loss is the mean squared error up to the factor $1/2$:

$$L (w) = \frac{1}{2 m} \sum_{i = 1}^{m} (h_{w} (x_{i}) - y_{i})^{2} .$$

If the model is linear, $h_{w} (x) = w^{⊤} x$, then $L (w)$ is a quadratic function of the parameters:

$$L (w) = \frac{1}{2} w^{⊤} H w - b^{⊤} w + c,$$

where $H$ is positive semidefinite. This is why squared-error regression has a bowl-shaped optimisation surface: the residuals are linear in $w$, and squaring them turns the objective into a quadratic surface.
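The identity can be checked numerically. Expanding the squared residuals for the linear model gives $H = \frac{1}{m} X^{⊤} X$, $b = \frac{1}{m} X^{⊤} y$ and $c = \frac{1}{2 m} y^{⊤} y$, where $X$ is the matrix whose rows are the inputs $x_{i}$ (a symbol introduced here, not in the notes). The sketch below uses made-up toy data and plain Python lists, and compares the loss computed from the residuals with the quadratic form.

```python
# Toy data: m = 3 examples with 2 features each (values are made up).
X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
y = [1.0, 0.5, 2.0]
m, d = len(X), len(X[0])

# H = X^T X / m, b = X^T y / m, c = y^T y / (2m)
H = [[sum(X[i][j] * X[i][k] for i in range(m)) / m for k in range(d)]
     for j in range(d)]
b = [sum(X[i][j] * y[i] for i in range(m)) / m for j in range(d)]
c = sum(yi * yi for yi in y) / (2 * m)


def loss_from_residuals(w):
    """Average quadratic loss computed directly from the residuals."""
    total = 0.0
    for x, yi in zip(X, y):
        prediction = sum(wj * xj for wj, xj in zip(w, x))
        total += (prediction - yi) ** 2
    return total / (2 * m)


def loss_from_quadratic_form(w):
    """The same loss written as 0.5 * w^T H w - b^T w + c."""
    Hw = [sum(H[j][k] * w[k] for k in range(d)) for j in range(d)]
    quad = 0.5 * sum(w[j] * Hw[j] for j in range(d))
    lin = sum(b[j] * w[j] for j in range(d))
    return quad - lin + c


w = [0.3, -0.7]  # arbitrary parameter vector
print(loss_from_residuals(w), loss_from_quadratic_form(w))
```

Both values agree up to floating-point rounding. Since $w^{⊤} H w = \frac{1}{m} \lVert X w \rVert^{2} \geq 0$ for every $w$, this form also shows why $H$ is positive semidefinite.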

Gradient and curvature

In one dimension, write

$$E (w) = \frac{1}{2} a w^{2} + b w + c, \qquad a > 0.$$

Then

$$E' (w) = a w + b, \qquad E'' (w) = a .$$

The minimiser is the point where the derivative vanishes:

$$w^{⋆} = - \frac{b}{a} .$$

Because the second derivative is constant, the local quadratic approximation is not merely an approximation. It is the whole objective.
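A quick numerical check of both claims, with illustrative coefficients that are not taken from the notes: the derivative vanishes at $-b/a$, and the second-order Taylor expansion around an arbitrary point reproduces $E$ exactly.

```python
a, b, c = 2.0, -3.0, 1.0  # illustrative coefficients with a > 0


def E(w):
    """The one-dimensional quadratic objective."""
    return 0.5 * a * w * w + b * w + c


def dE(w):
    """First derivative E'(w) = a*w + b."""
    return a * w + b


def taylor2(w, w0):
    """Second-order Taylor expansion of E around w0, evaluated at w."""
    return E(w0) + dE(w0) * (w - w0) + 0.5 * a * (w - w0) ** 2


w_star = -b / a  # point where the derivative vanishes
print(w_star, dE(w_star))
print(E(5.0), taylor2(5.0, w0=-1.0))
```

The expansion point `w0 = -1.0` is far from the evaluation point `5.0`, yet the two values coincide, because a quadratic has no terms beyond second order.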

Relation to the learning rate

For gradient descent on the one-dimensional quadratic,

$$w_{k + 1} = w_{k} - η E' (w_{k}),$$

the error $e_{k} = w_{k} - w^{⋆}$ evolves as

$$e_{k + 1} = (1 - η a) e_{k} .$$

Thus the optimal fixed learning rate in this one-dimensional case is

$$η^{⋆} = \frac{1}{a} = \frac{1}{E'' (w)} .$$

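With this choice the factor $1 - η a$ is zero, so gradient descent reaches the minimiser in one step. If $0 < η a < 1$, the error keeps its sign and convergence is monotone; if $1 < η a < 2$, the error changes sign at every step while shrinking, so convergence oscillates; if $η a = 2$, the iterates bounce between two points without converging; if $η a > 2$, the iterates diverge.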
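These regimes can be observed directly by running the update. The following is a minimal sketch with illustrative coefficients; the helper `run_gd` and its defaults are assumptions made for this example.

```python
a, b = 2.0, -3.0   # illustrative coefficients; the minimiser is -b / a = 1.5
w_star = -b / a


def run_gd(eta, w0=0.0, steps=5):
    """Return the errors e_k = w_k - w_star for k = 0..steps."""
    w = w0
    errors = [w - w_star]
    for _ in range(steps):
        w = w - eta * (a * w + b)  # gradient step with E'(w) = a*w + b
        errors.append(w - w_star)
    return errors


# One learning rate per regime, expressed through the product eta * a.
for eta_a in (0.5, 1.0, 1.5, 2.5):
    errors = run_gd(eta=eta_a / a)
    print(f"eta*a = {eta_a}:", [round(e, 4) for e in errors])
```

Each printed sequence is the initial error multiplied repeatedly by $1 - η a$: the factors here are $0.5$, $0$, $-0.5$ and $-1.5$, giving monotone convergence, one-step convergence, damped oscillation and divergence respectively.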

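In several dimensions, the same analysis applies separately along each eigenvector of the Hessian matrix, with $a$ replaced by the corresponding eigenvalue. A single scalar learning rate must compromise between directions of small and large curvature: it has to stay below $2 / λ_{\max}$ to avoid divergence along the direction of largest curvature, which makes progress slow along directions with small eigenvalues.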
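The compromise can be made concrete with a hypothetical two-dimensional example whose Hessian has eigenvalues $1$ and $10$ (values chosen only for illustration). Along each eigenvector the error is multiplied by $|1 - η λ|$ per step, so the sketch below compares these factors for three candidate learning rates. The third candidate, $2 / (λ_{\min} + λ_{\max})$, is the standard choice that equalises the two extreme factors; it is not discussed in the notes.

```python
eigenvalues = [1.0, 10.0]  # illustrative curvatures along two Hessian eigenvectors


def contraction_factors(eta):
    """Per-direction factor |1 - eta * lam| multiplying the error each step."""
    return [abs(1.0 - eta * lam) for lam in eigenvalues]


def worst_factor(eta):
    """The largest factor sets the overall rate; a value >= 1 means no convergence."""
    return max(contraction_factors(eta))


lam_min, lam_max = min(eigenvalues), max(eigenvalues)
candidates = [
    ("1 / lam_max", 1.0 / lam_max),
    ("1 / lam_min", 1.0 / lam_min),
    ("2 / (lam_min + lam_max)", 2.0 / (lam_min + lam_max)),
]
for label, eta in candidates:
    print(label, contraction_factors(eta), worst_factor(eta))
```

With $η = 1 / λ_{\max}$ the steep direction is solved in one step but the flat direction only shrinks by a factor $0.9$ per step; with $η = 1 / λ_{\min}$ the flat direction is solved in one step but the steep direction diverges.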

Lukas' Notes
