Lukas' Notes

A single perceptron can learn any linearly separable task. The perceptron algorithm cycles through the data, nudges the hyperplane on every mistake, and converges with a perfect classifier. The geometry is clean: every training point has a label, every label tells the neuron exactly what to do.

Now add a hidden layer. An MLP with one hidden layer can represent complex decision boundaries — non-convex shapes, XOR, any Boolean function. The architecture works. The representation exists. But how do you train it?

The perceptron algorithm needs a target output for every neuron. For the output neuron, the target is given — the training label. For a hidden neuron, there is no target. The data only says what the final output should be, not what each intermediate neuron should fire on.

The data tells you: this region is , the rest is . An MLP can draw this shape by composing several linear classifiers — each hidden neuron is a line, and the output neuron ANDs or ORs them together. But each hidden neuron needs its own classification problem, with its own labels for every training point.

What labels? Consider one hidden neuron that must learn one edge of the shape. It should fire on one side of its line, on the other. But which training points belong on which side? The original data only gives the final label — if the point is inside the yellow region, if outside. It does not tell this particular hidden neuron what to do.

Every point on the plane receives a label — or for this hidden neuron. But those labels are not in the training data. The data only says whether the point is inside the final yellow region. For the hidden neuron, you must invent a labelling that makes its task linearly separable and that, when combined with the other hidden neurons, reproduces the correct final output.

The number of possible labellings is exponential in the number of training points. For each hidden neuron, for each labelling, you must check whether the neuron can learn it — and whether the combination of all hidden neurons with those labellings yields the correct final boundary. This is a combinatorial optimisation problem, exponential in the number of inputs.

The same problem recurs at every hidden layer. A deep network has many layers of neurons, each without a teacher. The output layer knows what it should produce. The layer before it must guess. The layer before that must guess again. Error must propagate backwards — but the perceptron algorithm has no mechanism for that.

This is the credit assignment problem: how do you assign credit or blame to individual hidden neurons when only the final output is judged? The perceptron algorithm cannot answer it. The solution — backpropagation — would not arrive until 1986, when Rumelhart, Hinton, and Williams showed that error signals can be propagated backward through the network via the chain rule. But that is another story.

For the perceptron algorithm, hidden neurons have no teacher. And a neuron without a teacher cannot learn.