Lukas' Notes

You have a perceptron and you want to train it. You count misclassifications — the binary error. Move the decision boundary, count again. If the count drops, you moved in the right direction. If it rises, go back.

This sounds reasonable. It is not.

Move the line a little. Does any point change side? No. The binary error is unchanged. Move it more. Still no change. More. Still nothing. Then — one point crosses the boundary and the error jumps. But you don’t know whether the jump was toward the optimum or away from it, because everything in between was flat.

The binary error metric is a step function of the weights. Zero derivative almost everywhere. At the jumps, the derivative does not exist. You can vary the weights by a lot without changing the error at all. The landscape gives no signal — no slope, no gradient, no hint of which direction to move.
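A small sketch can make the flatness concrete. The four-point 1-D dataset, the fixed weight of 1.0, and the name `binary_error` are all made up for illustration; they are not from any particular library.

```python
# Hypothetical 1-D dataset: negative points are class 0, positive points are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def binary_error(w, b):
    """Count misclassifications for the rule: predict 1 if w*x + b > 0."""
    return sum(int((w * x + b > 0) != bool(y)) for x, y in zip(xs, ys))

# Slide the bias, which moves the decision boundary x = -b/w along the axis.
biases = [-1.5, -1.2, -0.5, 0.0, 0.5, 0.9, 1.5]
errors = [binary_error(1.0, b) for b in biases]
print(errors)  # [1, 1, 0, 0, 0, 0, 1]
```

Between the jumps the count does not move at all, so comparing two nearby settings of the bias says nothing about which way to go.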

The problem is the activation function. A perceptron uses a hard threshold: the output is 0 or 1, with a vertical step at the threshold. The derivative is zero everywhere (flat) except at the threshold, where it is undefined (a wall). There is no useful gradient. Small weight changes produce zero change in output, and therefore zero change in error. The training signal is blind.
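The same point can be checked numerically. This is a minimal sketch that assumes a 0/1 output convention for the threshold; `step` and `central_difference` are illustrative names, and the finite-difference estimate stands in for the derivative.

```python
def step(z):
    """Hard threshold: 1 if z > 0, else 0 (the 0/1 convention is an assumption)."""
    return 1.0 if z > 0 else 0.0

def central_difference(f, z, h=1e-4):
    """Symmetric finite-difference estimate of the slope of f at z."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Away from the threshold the estimated slope is exactly zero.
slopes_away = [central_difference(step, z) for z in (-2.0, -0.5, 0.5, 2.0)]

# At the threshold the estimate is 1/(2h): it grows without bound as h shrinks.
slopes_at_threshold = [central_difference(step, 0.0, h) for h in (1e-2, 1e-4, 1e-6)]
```

Neither result is usable as a training signal: one is always zero, the other depends entirely on the step size chosen.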

This is not just an inconvenience. It compounds. In a multi-layer perceptron, every neuron uses the same hard threshold. The whole network is a flat, non-differentiable function. You can vary every weight in the network significantly without changing the final output. There is no direction. No gradient. No training.
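To illustrate how the flatness carries through a whole network, here is a sketch of a tiny two-layer network built only from hard thresholds. The weights, the input, and the helper names `forward` and `shifted` are invented for this example.

```python
def step(z):
    """Hard threshold (0/1 output convention assumed for illustration)."""
    return 1.0 if z > 0 else 0.0

def forward(x, W1, b1, w2, b2):
    """Two-layer network in which every neuron uses the hard threshold."""
    hidden = [step(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return step(sum(w * h for w, h in zip(w2, hidden)) + b2)

def shifted(delta):
    """Network output after adding delta to every weight and bias."""
    W1 = [[1.0 + delta, -1.0 + delta], [0.5 + delta, 2.0 + delta]]
    b1 = [0.2 + delta, -0.3 + delta]
    w2 = [1.0 + delta, -1.5 + delta]
    b2 = 0.1 + delta
    return forward([1.0, 0.5], W1, b1, w2, b2)

# Nudge all nine parameters down, not at all, and up.
outputs = [shifted(d) for d in (-0.05, 0.0, 0.05)]
print(outputs)  # [0.0, 0.0, 0.0]
```

Every parameter moved, and the output did not. Nothing in the result tells you which of the nine changes, if any, was an improvement.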

The realisation that training an MLP was a combinatorial optimisation problem — searching over discrete labellings for hidden neurons, groping blindly through a flat error landscape — stalled neural network research for well over a decade.

The solution required two changes. First, replace the hard threshold with a smooth activation: the sigmoid, whose derivative is positive everywhere, so any small change in the input moves the output in a definite direction. Second, replace the binary error with a continuous loss, cross-entropy, that changes smoothly as predictions move closer to or farther from the targets.
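Both replacements are easy to write down. The sketch below uses the standard logistic sigmoid and the binary cross-entropy for a single prediction; the function names and the sample inputs are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: a smooth replacement for the hard threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid, s * (1 - s); strictly positive for every z."""
    s = sigmoid(z)
    return s * (1.0 - s)

def cross_entropy(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and target y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

zs = [-4.0, -1.0, 0.0, 1.0, 4.0]
derivs = [sigmoid_derivative(z) for z in zs]          # never zero
losses = [cross_entropy(sigmoid(z), 1) for z in zs]   # target is 1
```

With a target of 1, the loss falls at every step as `z` increases, including the steps where the hard-threshold prediction would not have changed at all.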

With a differentiable activation and a differentiable loss, the entire network becomes a differentiable function. The gradient exists. Its negative points downhill. You can follow it.
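Following the gradient can be sketched for the simplest case, a single sigmoid neuron trained with cross-entropy. The dataset, the starting point (deliberately pointing the wrong way), the learning rate, and the step count are all assumptions made for this example.

```python
import math

# Hypothetical 1-D dataset: negative points are class 0, positive points are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b):
    """Mean binary cross-entropy of the neuron sigmoid(w*x + b) over the dataset."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def gradient(w, b):
    """Gradient of mean_loss; for sigmoid plus cross-entropy it is mean((p - y) * x), mean(p - y)."""
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    return gw / len(xs), gb / len(xs)

w, b, lr = -1.0, 0.5, 0.5  # start with the weight pointing the wrong way
history = [mean_loss(w, b)]
for _ in range(200):
    gw, gb = gradient(w, b)
    w -= lr * gw  # step against the gradient, i.e. downhill
    b -= lr * gb
    history.append(mean_loss(w, b))

# Binary error of the trained neuron, counted the old way.
misclassified = sum(int((w * x + b > 0) != bool(y)) for x, y in zip(xs, ys))
```

The misclassification count is still available as a way to report the result. It just plays no part in deciding which way to move.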

The binary error metric is not useless because it counts the wrong thing. It is useless because it is flat. A flat landscape gives no direction. And without direction, you cannot learn.