A Large Learning-Rate Ratio Makes Descent Slow

A neural network loss landscape is not usually shaped like a round bowl. Near a point, some directions can be sharp and others can be flat. A learning rate is only one scalar, so it has to move through all of those directions with the same global scale.

The difficulty can be measured by the spread of the direction-wise optimal learning rates:

\frac{\max_{i} η_{i, opt}}{\min_{i} η_{i, opt}} .

When this ratio is large, convergence is slow because the directions disagree about what a good step size is.

Picture the direction-wise optimal learning rates laid out on an axis. The left end (small rates) is controlled by sharp directions. The right end (large rates) is controlled by flat directions. A first-order method must choose one scalar $η$, so it usually chooses something safe for the sharp directions. That makes the flat directions painfully slow.

Each direction has its own ideal step

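Near a well-behaved minimum $θ^{⋆}$, where the gradient vanishes, a smooth loss can be approximated by a quadratic. Written in the eigenbasis of the Hessian matrix, with $z$ the displacement from $θ^{⋆}$ in those coordinates: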

L(θ^{⋆} + z) \approx L(θ^{⋆}) + \frac{1}{2} \sum_{i} λ_{i} z_{i}^{2} .

Here the eigenvalue $λ_{i}$ is the local curvature in direction $i$. In that one direction the gradient is $λ_{i} z_{i}$, so gradient descent with learning rate $η$ behaves like

z_{t + 1, i} = (1 - η λ_{i}) z_{t, i} .

The step size that makes the multiplier $1 - η λ_{i}$ vanish, and so solves this one-dimensional quadratic in one step, is

η_{i, opt} = \frac{1}{λ_{i}} .

So sharp directions, with large $λ_{i}$, have small optimal learning rates. Flat directions, with small $λ_{i}$, have large optimal learning rates.
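As a small numerical check of the update rule above, the following Python sketch applies one gradient-descent step to a diagonal quadratic, first with per-direction rates and then with a single scalar rate. The three curvatures and the helper name `gd_step` are made up for illustration; they are not taken from any particular network.

```python
import numpy as np

# Hypothetical curvatures (Hessian eigenvalues): sharp, medium, flat.
lams = np.array([100.0, 1.0, 0.01])
z0 = np.ones_like(lams)  # initial error in each eigen-direction


def gd_step(z, eta, lams):
    """One gradient-descent step on L(z) = 0.5 * sum(lams * z**2).

    `eta` may be a scalar or a per-direction array.
    """
    return (1.0 - eta * lams) * z


# Per-direction ideal rates: small for sharp directions, large for flat ones.
eta_opt = 1.0 / lams
z_ideal = gd_step(z0, eta_opt, lams)

# A single scalar rate that is ideal only for the sharpest direction.
eta_scalar = eta_opt.min()
z_scalar = gd_step(z0, eta_scalar, lams)

print("per-direction rates:", eta_opt)
print("error after one step, per-direction rates:", z_ideal)
print("error after one step, one scalar rate:    ", z_scalar)
```

With per-direction rates every coordinate is driven to (numerically) zero in one step. With the single scalar rate, only the sharpest direction is solved and the flattest direction keeps almost all of its error.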

A scalar step must obey the sharpest direction

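For the error to shrink in every positive-curvature direction, each multiplier must satisfy $|1 - η λ_{i}| < 1$. The binding constraint comes from the largest curvature, so the scalar learning rate must respect it: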

η < \frac{2}{λ_{\max}} .

But the flattest direction wants a step on the scale of

\frac{1}{λ_{\min}} .

The disagreement is exactly the spread of direction-wise ideal rates:

\frac{\max_{i} η_{i, opt}}{\min_{i} η_{i, opt}} = \frac{1 / λ_{\min}}{1 / λ_{\max}} = \frac{λ_{\max}}{λ_{\min}} .
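A minimal sketch of how this ratio could be computed for a concrete Hessian, using NumPy's `numpy.linalg.eigvalsh`. The 2×2 matrix `H` is a hypothetical example, and the sketch assumes the Hessian is symmetric positive definite so that every curvature is positive.

```python
import numpy as np

# Hypothetical symmetric positive-definite Hessian, for illustration only.
H = np.array([[10.0, 3.0],
              [3.0, 1.0]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
lams = np.linalg.eigvalsh(H)
lam_min, lam_max = lams[0], lams[-1]

ratio = lam_max / lam_min   # spread of the direction-wise ideal rates
eta_limit = 2.0 / lam_max   # stability bound on a single scalar rate
eta_flat = 1.0 / lam_min    # step scale the flattest direction would prefer

print(f"lam_min = {lam_min:.4f}, lam_max = {lam_max:.4f}")
print(f"ratio = {ratio:.1f}")
print(f"stability limit = {eta_limit:.4f}, flat-direction ideal = {eta_flat:.4f}")
```

For a symmetric positive-definite matrix, `ratio` is the same quantity as the 2-norm condition number of `H`.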

When this number is large, the safe scalar step is set by the sharpest direction, while progress in the flattest direction is controlled by a multiplier close to one:

1 - η λ_{\min} \approx 1.

That means the error in the flat direction shrinks only a little at each iteration.
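To put a number on "only a little", this sketch counts how many iterations the flattest direction needs before its error falls to a fraction `target` of its starting value, assuming the learning rate is `eta_scale / λ_max`. The function name `iterations_to_shrink` and its parameters are illustrative choices, not part of any library.

```python
import math


def iterations_to_shrink(ratio, eta_scale=1.0, target=1e-3):
    """Iterations until the flattest direction's error is <= target * initial.

    ratio     : lam_max / lam_min, assumed >= 1
    eta_scale : learning rate in units of 1 / lam_max (stable for 0 < eta_scale < 2)
    target    : desired fraction of the initial error, between 0 and 1
    """
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1")
    if not 0.0 < eta_scale < 2.0:
        raise ValueError("need 0 < eta_scale < 2 for stability")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    # Per-iteration multiplier in the flattest direction: |1 - eta * lam_min|.
    multiplier = abs(1.0 - eta_scale / ratio)
    if multiplier == 0.0:
        return 1  # the direction is solved exactly in one step
    # Smallest t with multiplier**t <= target.
    return math.ceil(math.log(target) / math.log(multiplier))


for ratio in (10.0, 1e3, 1e5):
    print(f"ratio = {ratio:>8.0f}  iterations = {iterations_to_shrink(ratio)}")
```

The count grows roughly in proportion to the ratio, because the logarithm of the multiplier is close to `-eta_scale / ratio` when the ratio is large.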

Why this matters for neural networks

In neural networks, the Hessian spectrum can be very wide. Some parameter directions change the loss sharply; others barely change it. A first-order optimisation method sees the gradient, but its scalar learning rate cannot separately tune each curvature direction.

This is why training can feel slow even when every step is stable. The learning rate is not simply “too small”. It is small because some direction would become unstable if it were larger. The optimiser is moving with the speed limit imposed by the sharpest curvature while trying to make progress through much flatter directions.

A second-order optimisation method addresses this mismatch directly by using curvature information to rescale directions differently. Plain scalar learning-rate methods do not have that geometry built in.
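As an illustration of that rescaling, the sketch below compares plain gradient descent with a single Newton step on a two-dimensional quadratic. Newton's method is used here as one standard example of a second-order method; the Hessian `H`, the starting point, and the iteration count are hypothetical values chosen to make the contrast visible.

```python
import numpy as np

H = np.diag([100.0, 1.0])       # hypothetical Hessian: one sharp, one flat direction
theta0 = np.array([1.0, 1.0])   # starting offset from the minimum at the origin


def grad(theta):
    """Gradient of L(theta) = 0.5 * theta^T H theta."""
    return H @ theta


# Gradient descent with the scalar rate that is ideal for the sharp direction.
eta = 1.0 / 100.0
theta_gd = theta0.copy()
for _ in range(100):
    theta_gd = theta_gd - eta * grad(theta_gd)

# One Newton step: rescale the gradient by the inverse Hessian.
theta_newton = theta0 - np.linalg.solve(H, grad(theta0))

print("after 100 gradient steps:", theta_gd)
print("after 1 Newton step:     ", theta_newton)
```

On an exactly quadratic loss the Newton step lands on the minimum, while gradient descent still carries more than a third of its initial error in the flat direction after 100 steps. Real network losses are not exactly quadratic, so this is only a picture of the mechanism.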

A large ratio of optimal directional learning rates is therefore a warning that the landscape is badly scaled: one clock is being asked to keep time for many directions.

Lukas' Notes
