One Learning Rate Has to Please Every Direction

A learning rate looks like a small technical choice: multiply the gradient by a scalar, take the step, repeat. On a round loss landscape, that scalar feels natural. Every direction has the same curvature, so one step size fits the whole space.

Most loss landscapes are not round. They are stretched, tilted, and uneven. Then one scalar has to serve every direction at once. It must be small enough not to explode in the steep direction, but large enough to make progress in the shallow direction. Those two wishes can contradict each other.

The property that matters here is curvature, not distance. A direction is steep if the loss bends sharply along it, and shallow if the loss bends slowly along it. A scalar learning rate cannot tell these directions apart; it multiplies the whole gradient vector by the same number.

The round case is special

Near a minimum, a smooth loss often behaves like a quadratic bowl. Measuring the loss from its minimum value and writing $θ^{⋆}$ for the minimiser, the local model is

$$L(θ) \approx \frac{1}{2} (θ - θ^{⋆})^{⊤} H (θ - θ^{⋆}),$$

where $H$ is the Hessian matrix. If the bowl is perfectly round, then

$$H = λ I .$$

Every direction has the same eigenvalue $λ$, so every direction has the same curvature. Gradient descent gives

$$θ_{t + 1} - θ^{⋆} = (1 - η λ) (θ_{t} - θ^{⋆}) .$$
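This update follows in one line from the quadratic model. As a short derivation, assuming the quadratic approximation holds exactly and writing $\nabla L$ for the gradient:

$$
\begin{aligned}
\nabla L(θ_{t}) &= H (θ_{t} - θ^{⋆}) = λ (θ_{t} - θ^{⋆}), \\
θ_{t + 1} &= θ_{t} - η \nabla L(θ_{t}) = θ_{t} - η λ (θ_{t} - θ^{⋆}) .
\end{aligned}
$$

Subtracting $θ^{⋆}$ from both sides of the second line leaves the single factor $(1 - η λ)$.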

Now the scalar learning rate is enough. The same multiplier controls every direction. In the exact quadratic case, the choice

$$η = \frac{1}{λ}$$

jumps to the minimum in one step.
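The one-step claim is easy to check numerically. The sketch below assumes an exactly quadratic loss; the helper `gd_step`, the curvature value and the start point are illustrative choices, not part of the original notes.

```python
import numpy as np

def gd_step(theta, theta_star, H, eta):
    """One gradient-descent step on L(θ) = 0.5 (θ - θ*)ᵀ H (θ - θ*)."""
    grad = H @ (theta - theta_star)   # gradient of the quadratic model
    return theta - eta * grad

lam = 4.0                                  # shared curvature (illustrative value)
H_round = lam * np.eye(3)                  # round bowl: H = λ I
theta_star = np.array([1.0, -2.0, 0.5])    # minimiser (illustrative)
theta0 = np.array([5.0, 3.0, -1.0])        # arbitrary start point

# With η = 1/λ the multiplier (1 - η λ) is zero in every direction.
theta1 = gd_step(theta0, theta_star, H_round, eta=1.0 / lam)
print(np.allclose(theta1, theta_star))     # prints True
```

Any other start point gives the same result, because the multiplier does not depend on where the iterate is.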

This is the clean world where learning-rate optimality feels simple.

A stretched bowl has many clocks

If the bowl is stretched, the Hessian has different eigenvalues $λ_{i}$. Writing $e_{t, i}$ for the component of the error $θ_{t} - θ^{⋆}$ along eigen-direction $i$, gradient descent behaves like

$$e_{t + 1, i} = (1 - η λ_{i}) e_{t, i} .$$
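A brief derivation of this decoupling, assuming $H$ is symmetric with eigendecomposition $H = Q Λ Q^{⊤}$ (the names $Q$ and $Λ$ are introduced here for illustration) and collecting the error components into the vector $e_{t} = Q^{⊤} (θ_{t} - θ^{⋆})$:

$$
θ_{t + 1} - θ^{⋆} = (I - η H) (θ_{t} - θ^{⋆})
\quad \Longrightarrow \quad
e_{t + 1} = Q^{⊤} (I - η Q Λ Q^{⊤}) Q \, e_{t} = (I - η Λ) \, e_{t} .
$$

Because $Λ$ is diagonal, component $i$ is multiplied by $1 - η λ_{i}$ and never mixes with the others.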

Each direction has its own ideal learning rate:

$$η_{i}^{⋆} = \frac{1}{λ_{i}} .$$

A steep direction has large $λ_{i}$, so it needs a small step. A shallow direction has small $λ_{i}$, so it needs a large step. But ordinary gradient descent chooses one scalar $η$.

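That scalar must obey the steep direction first. The error along direction $i$ shrinks only if $|1 - η λ_{i}| < 1$, and the largest eigenvalue gives the tightest bound:

$$η < \frac{2}{λ_{\max}} .$$

At exactly this limit the steep direction oscillates forever without decaying; beyond it, the oscillation grows and the iterates diverge. Once $η$ is made that small, the shallow direction may barely move, because $1 - η λ_{\min}$ is close to $1$.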
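To put rough numbers on this trade-off, the sketch below computes the per-direction multipliers for a two-eigenvalue example. The eigenvalues, the two step sizes and the helper names `contraction_factors` and `steps_to_shrink` are assumptions made for illustration.

```python
import numpy as np

def contraction_factors(eigenvalues, eta):
    """Per-direction multipliers 1 - η λ_i for gradient descent on a quadratic."""
    return 1.0 - eta * np.asarray(eigenvalues, dtype=float)

def steps_to_shrink(factor, target=1e-3):
    """Steps needed for |factor|**t to fall below `target` (needs 0 < |factor| < 1)."""
    return int(np.ceil(np.log(target) / np.log(abs(factor))))

eigenvalues = [100.0, 1.0]               # illustrative: λ_max = 100, λ_min = 1
eta_safe = 1.0 / max(eigenvalues)        # ideal for the steep direction
eta_too_big = 2.1 / max(eigenvalues)     # just past the limit 2 / λ_max

safe = contraction_factors(eigenvalues, eta_safe)          # steep ≈ 0, shallow ≈ 0.99
unstable = contraction_factors(eigenvalues, eta_too_big)   # steep ≈ -1.1, so it grows

print("safe factors:    ", safe)
print("unstable factors:", unstable)
print("steps for the shallow direction to shrink 1000x:", steps_to_shrink(safe[1]))
```

With these values the steep direction is solved almost immediately, while the shallow direction needs several hundred steps for the same thousandfold reduction. The ratio $λ_{\max} / λ_{\min}$ sets the scale of that gap.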

This is why narrow valleys produce the familiar zig-zag. The step is restricted by the wall of the valley, not by the long floor of the valley. The optimiser keeps correcting the steep coordinate while slowly crawling along the shallow one.
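The zig-zag can be reproduced on a two-dimensional quadratic valley. This is a minimal sketch under assumed values: the diagonal Hessian, the step size just below the stability limit and the start point are all chosen only to make the pattern visible.

```python
import numpy as np

H = np.diag([50.0, 1.0])       # coordinate 0: steep wall, coordinate 1: shallow floor
eta = 1.9 / 50.0               # just below the limit 2 / λ_max (illustrative choice)

theta = np.array([1.0, 1.0])   # start point; the minimiser is the origin
path = [theta.copy()]
for _ in range(20):
    theta = theta - eta * (H @ theta)   # plain gradient descent step
    path.append(theta.copy())
path = np.array(path)

# The steep coordinate flips sign every step (multiplier ≈ -0.9),
# while the shallow coordinate keeps its sign and decays slowly (multiplier ≈ 0.962).
sign_flips = int(np.sum(path[:-1, 0] * path[1:, 0] < 0))
print("sign flips in steep coordinate:", sign_flips, "out of", len(path) - 1)
print("shallow coordinate after 20 steps:", path[-1, 1])
```

Plotting `path[:, 0]` against `path[:, 1]` shows the iterate bouncing between the valley walls while drifting slowly along the floor.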

The scalar step is not wrong; the geometry is mismatched

A bad learning rate is often not bad in isolation. It is bad for a particular direction of the landscape.

A large learning rate may be good for shallow directions but unstable for steep directions.
A small learning rate may be safe for steep directions but painfully slow for shallow directions.
A single scalar can be optimal for all directions only when the local curvature is the same in all directions.

Second-order methods make this explicit. Newton’s method multiplies the gradient by $H^{-1}$, so the component along each eigen-direction is scaled by its own $1/λ_{i}$. Adaptive optimisers and preconditioners try to approximate this idea more cheaply: do not use the same ruler in every direction when the landscape itself is not measured the same way in every direction.
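The contrast with a scalar step can be sketched on a stretched, rotated quadratic. The example assumes an exactly quadratic loss with a known Hessian, which is rarely available in practice; the rotation angle, the eigenvalues and the variable names are illustrative.

```python
import numpy as np

# Build a stretched bowl whose axes are rotated away from the coordinate axes.
angle = np.pi / 6
Q = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
H = Q @ np.diag([100.0, 1.0]) @ Q.T      # eigenvalues 100 (steep) and 1 (shallow)

theta_star = np.zeros(2)                 # minimiser of the quadratic model
theta0 = np.array([1.0, 1.0])
grad = H @ (theta0 - theta_star)

# Scalar step: η = 1 / λ_max fixes the steep direction but barely moves the shallow one.
theta_gd = theta0 - (1.0 / 100.0) * grad

# Newton step: solve H x = grad instead of forming H^{-1} explicitly.
theta_newton = theta0 - np.linalg.solve(H, grad)

gd_error = np.linalg.norm(theta_gd - theta_star)
newton_error = np.linalg.norm(theta_newton - theta_star)
print("error after one scalar step:", gd_error)
print("error after one Newton step:", newton_error)
```

On this model the Newton step lands on the minimiser up to rounding error, while the scalar step leaves almost all of the shallow-direction error in place.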

The learning-rate problem is therefore a geometry problem. A scalar step size works beautifully in a round bowl, but a stretched loss landscape asks for different clocks along different axes.

Lukas' Notes
