A Smooth Proxy Can Miss a Separator

It is tempting to think that backpropagation should find a separator whenever a separator exists. If the network can represent a perfect classifier, and if optimisation finds the global minimum, why would it ever leave a training point on the wrong side?

The quiet trap is that backpropagation does not usually minimise classification error directly. It minimises a smooth proxy, such as cross-entropy loss or another divergence function. The proxy is differentiable. The actual training error, counted by zero-one loss, is not.
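
To make the contrast concrete, here is a sketch of the two losses for a single training point. The notation is an assumption, not something fixed above: labels y in {-1, +1}, a real-valued score s (for a linear model, s = w^T x), and the logistic form of cross-entropy.

```latex
\ell_{01}(s, y) = \mathbf{1}[\, y s \le 0 \,], \qquad
\ell_{\mathrm{CE}}(s, y) = \log\bigl(1 + e^{-y s}\bigr).
```

The zero-one loss is flat everywhere except for a jump at y s = 0, so its gradient carries no signal. The cross-entropy decreases smoothly in y s and never reaches zero, so it keeps rewarding larger margins even on points that are already classified correctly.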

So there are two landscapes. The classifier is judged by one landscape and trained on another. The smooth landscape is not a harmless copy of the hard one. It is a replacement.

A rare point can be quiet in an average

The perceptron rule is mistake-driven. If a point is misclassified, that point speaks loudly: it triggers an update. For linearly separable data, the perceptron algorithm keeps moving until no training point is wrong.
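
A minimal sketch of that mistake-driven loop, assuming labels in {-1, +1} and a bias term updated alongside the weights. The function name `perceptron`, the `max_epochs` cap, and the three toy points are illustrative choices, not part of the argument above.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Mistake-driven perceptron with a bias term; returns (w, b, converged)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # A point on the wrong side (or exactly on the boundary) triggers an update.
            if y_i * (x_i @ w + b) <= 0:
                w = w + y_i * x_i   # w <- w + y x
                b = b + y_i
                mistakes += 1
        if mistakes == 0:           # a full pass with no mistakes: every point is separated
            return w, b, True
    return w, b, False

# Toy, linearly separable data (hypothetical): two ordinary points and one awkward positive.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [-2.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])

w, b, converged = perceptron(X, y)
print("converged:", converged)
print("margins:", y * (X @ w + b))
```

The loop only stops early when a full pass produces no update, so on separable data the returned classifier has zero training error. On non-separable data it would simply run until `max_epochs`.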

A smooth average loss behaves differently. Each point contributes only part of the total loss. If there are many ordinary points and only one difficult point, the difficult point may not pull the optimum far enough, especially when weights are bounded or regularised. A separator may exist, but the minimum of the proxy may still be a non-separating compromise.

The orange line separates every point. The dashed cyan line is the kind of compromise a smooth proxy may prefer: it assigns good probability scores to most points, but leaves the spoiler, the single difficult point, on the wrong side. A perfect separator is available, but the surrogate objective has not been asked to prefer it at all costs.

This is not mainly an optimisation failure. Even if the smooth loss is minimised perfectly, the minimiser may not be the classifier with zero training error.
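
The point can be checked numerically with a small sketch. Everything specific here is an assumption made for illustration: the toy data (two tight majority clusters collapsed to single locations, plus one rare positive), the L2 penalty strength `lam=0.1`, the step size, and the helper names. The penalty plays the role of the regularised weights mentioned earlier; the objective is convex, so plain gradient descent should reach its minimiser.

```python
import numpy as np

# Toy data (hypothetical): 50 positives, 50 negatives, and one rare positive "spoiler".
X = np.array([[2.0, 0.0]] * 50 + [[-2.0, 0.0]] * 50 + [[-2.0, 2.0]])
y = np.array([1.0] * 50 + [-1.0] * 50 + [1.0])

def n_errors(w, b):
    """Zero-one training error count for the classifier sign(w.x + b)."""
    return int(np.sum(y * (X @ w + b) <= 0))

def fit_logistic(lam, lr=0.5, steps=5000):
    """Gradient descent on mean logistic loss + (lam / 2) * ||w||^2 (bias unpenalised)."""
    w, b = np.zeros(2), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        # Derivative of log(1 + exp(-z)) with respect to the score, per point.
        coeff = -y / (1.0 + np.exp(margins))
        w = w - lr * (X.T @ coeff / len(y) + lam * w)
        b = b - lr * coeff.mean()
    return w, b

w_sep, b_sep = np.array([1.0, 3.0]), 0.0   # a hand-picked separating line
w_reg, b_reg = fit_logistic(lam=0.1)

print("errors of the hand-picked separator:", n_errors(w_sep, b_sep))
print("errors of the regularised minimiser:", n_errors(w_reg, b_reg))
print("spoiler score under the minimiser:", X[-1] @ w_reg + b_reg)
```

On this data the hand-picked line makes no mistakes, while the regularised fit is expected to leave the spoiler with a negative score. Tilting the boundary enough to rescue one point out of 101 would cost more in penalty than it saves in average loss.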

The perceptron listens to mistakes

The difference is not that the perceptron is wiser. It is more brittle, but more direct.

The perceptron rule reacts to a single mistake as a full event:

w \leftarrow w + y x .

The average divergence reacts to that same point as one term among many:

L(w) = \frac{1}{m} \sum_{i=1}^{m} \ell_{i}(w).
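
A short worked step shows how small that one term's pull can be. It assumes, for illustration only, the logistic loss \ell_{i}(w) = \log(1 + e^{-y_{i} w^{\top} x_{i}}) with labels in {-1, +1} and \sigma(z) = 1 / (1 + e^{-z}).

```latex
\nabla L(w) = \frac{1}{m} \sum_{i=1}^{m} \nabla \ell_{i}(w), \qquad
\nabla \ell_{i}(w) = -\,\sigma(-y_{i}\, w^{\top} x_{i})\, y_{i} x_{i}, \qquad
\Bigl\| \tfrac{1}{m} \nabla \ell_{i}(w) \Bigr\| \le \frac{\| x_{i} \|}{m}.
```

A misclassified point moves the perceptron by the full vector y x. Under the average loss, the same point contributes at most a 1/m share of its own length to each gradient step.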

This is the learning-theory trade-off. The perceptron can have low approximation error on separable training data because it insists on a zero-error separator if one is reachable. But it may have high estimation error: a single extra point can swing the boundary a long way.

Backpropagation through a smooth divergence often behaves more stably. A few added points may only slightly change the average objective, so the learned boundary moves less. That lowers sensitivity to the sample, but it can introduce a systematic compromise: the model may fail to separate even when a separating function exists in the network class.
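
The stability claim can be probed with a rough experiment: fit both learners with and without one extra point and measure how far the weight vector turns. The data, the penalty `lam=0.1`, the step size, and all function names are assumptions for this sketch. The perceptron's result also depends on the order in which points are presented.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Mistake-driven updates until a full pass makes no mistake; returns w."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w + b) <= 0:
                w, b, mistakes = w + y_i * x_i, b + y_i, mistakes + 1
        if mistakes == 0:
            break
    return w

def regularised_logistic(X, y, lam=0.1, lr=0.5, steps=5000):
    """Gradient descent on mean logistic loss + (lam / 2) * ||w||^2; returns w."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        coeff = -y / (1.0 + np.exp(y * (X @ w + b)))
        w = w - lr * (X.T @ coeff / len(y) + lam * w)
        b = b - lr * coeff.mean()
    return w

def angle_deg(u, v):
    """Angle between two weight vectors, in degrees."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical sample: 100 ordinary points, then the same sample plus one awkward positive.
X_base = np.array([[2.0, 0.0]] * 50 + [[-2.0, 0.0]] * 50)
y_base = np.array([1.0] * 50 + [-1.0] * 50)
X_plus = np.vstack([X_base, [[-2.0, 2.0]]])
y_plus = np.append(y_base, 1.0)

swing_perceptron = angle_deg(perceptron(X_base, y_base), perceptron(X_plus, y_plus))
swing_logistic = angle_deg(regularised_logistic(X_base, y_base),
                           regularised_logistic(X_plus, y_plus))
print("perceptron swing (degrees):", swing_perceptron)
print("regularised logistic swing (degrees):", swing_logistic)
```

In a hand trace of this ordering the perceptron's weight vector turns by 45 degrees, while the regularised fit is expected to turn far less. The perceptron pays for that swing with a separating boundary; the smooth fit stays put and leaves the extra point misclassified.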

The lesson is gentle but sharp. Backpropagation is not wrong when this happens. It is faithfully optimising a smoother question than the one classification error asks.

Lukas' Notes
