MLPs are universal classifiers

A perceptron on real-valued inputs draws a hyperplane. Everything on one side is $1$, everything on the other is $0$. This is a linear classifier, and nothing more.
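As a minimal sketch of that hyperplane rule, assuming a hard threshold at zero and hypothetical weights `w` and bias `b` chosen only for illustration (the boundary here is the line $x_1 + x_2 = 1$):

```python
import numpy as np

def perceptron(x, w, b):
    """Threshold unit: 1 where w.x + b >= 0, else 0."""
    return (np.asarray(x) @ w + b >= 0).astype(int)

# Hypothetical weights and bias: the boundary is the line x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = -1.0

points = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, 0.5]])
labels = perceptron(points, w, b)
print(labels)  # [1 0 1]
```

Changing `w` rotates the line and changing `b` shifts it, but the boundary stays a single straight cut.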

A single line. Useful, but limited. The world rarely divides neatly into two half-planes.

Add a hidden layer and the picture changes. Each hidden neuron is a linear classifier — its own line. The output neuron ANDs them together: fire only if every hidden neuron fires. Geometrically, this is the intersection of half-planes. The result is a convex polygon.

One hidden layer gives you convex regions. Each edge is a hidden neuron. The output says: fire only if you are on the correct side of every edge. Five neurons, one pentagon.
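The pentagon can be sketched directly. This is an illustrative construction, not a trained network: the function name `polygon_net`, the regular layout of the edges, and the `apothem` parameter (distance from the center to each edge) are assumptions made for the example.

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

def polygon_net(x, n_edges=5, apothem=1.0):
    """One hidden layer of edge neurons, ANDed by the output neuron."""
    angles = 2 * np.pi * np.arange(n_edges) / n_edges
    # Outward unit normal of each edge of a regular polygon (assumed layout).
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Hidden neuron i fires when the point is on the inner side of edge i.
    hidden = step(apothem - np.atleast_2d(x) @ normals.T)
    # AND: the output fires only if all n_edges hidden neurons fire.
    return step(hidden.sum(axis=1) - (n_edges - 0.5))

pts = [[0.0, 0.0], [2.0, 0.0], [-1.1, 0.0], [-1.3, 0.0]]
print(polygon_net(pts))  # [1 0 1 0]
```

The point $(-1.1, 0)$ lies outside the unit circle but inside the pentagon, near one of its corners, which shows that five edges are still visibly a polygon.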

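Add more edges and the polygon rounds out. In the limit of many edges, a regular convex polygon becomes a circle. Place many such circles and OR them together, which takes a second hidden layer: the first layer holds the edges, the second ANDs each group of edges into a circle, and the output neuron ORs the circles. With that you can tile any shape, convex or not, to arbitrary precision.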
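A rough sketch of the AND-then-OR construction, assuming hard-threshold units and 64-edge polygons standing in for circles. The names `disc_units` and `union_net` and the two centers are hypothetical, picked so that the union is not convex.

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

def disc_units(x, centers, radius=1.0, n_edges=64):
    """Edge neurons, then one AND neuron per (approximate) disc."""
    angles = 2 * np.pi * np.arange(n_edges) / n_edges
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    x = np.atleast_2d(x).astype(float)
    # offsets[p, c] is point p measured from the center of disc c.
    offsets = x[:, None, :] - np.asarray(centers, dtype=float)[None, :, :]
    edges = step(radius - offsets @ normals.T)        # (points, discs, edges)
    return step(edges.sum(axis=2) - (n_edges - 0.5))  # (points, discs)

def union_net(x, centers):
    """Output neuron ORs the discs: fire if at least one disc fires."""
    return step(disc_units(x, centers).sum(axis=1) - 0.5)

centers = [[0.0, 0.0], [1.5, 0.0]]
pts = [[0.0, 0.9], [1.5, 0.9], [0.75, 0.9], [0.75, 0.0]]
print(union_net(pts, centers))  # [1 1 0 1]
```

The third point sits midway between the first two, which are both inside the region, yet it is classified as outside. A single convex polygon cannot do that.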

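This is the circle-packing argument, and it can be flattened into a single hidden layer. Drop the AND neurons and let the output neuron sum the edge neurons directly: a point inside a circle with $N$ edges turns on all $N$ of its neurons, while a point far outside turns on only about $N/2$. Each circle is a sub-network of hidden neurons. The output neuron sums them and thresholds. Where the sum is high enough, the output is $1$. With enough small circles, each with enough edges, the approximation becomes arbitrarily precise. A one-hidden-layer MLP with enough neurons can approximate any classification boundary.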
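One way to see why a plain sum can stand in for an explicit AND is to count how many edge neurons of a single circle fire at a given point. The sketch below assumes one circle at the origin with hard-threshold edge neurons; `edge_sum` and its parameters are hypothetical names for this example.

```python
import numpy as np

def edge_sum(point, n_edges=1000, radius=1.0):
    """Number of edge neurons that fire at `point` for one circle at the origin."""
    angles = 2 * np.pi * np.arange(n_edges) / n_edges
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Neuron i fires when the point is on the inner side of edge i.
    fires = radius - normals @ np.asarray(point, dtype=float) >= 0
    return int(fires.sum())

n = 1000
inside = edge_sum([0.3, 0.2], n)   # every edge neuron fires
far = edge_sum([100.0, 0.0], n)    # roughly half of them fire
print(inside, far / n)
```

Inside the circle the count is exactly $N$. Far away it settles near $N/2$, so subtracting $N/2$ leaves a bump over the circle and little elsewhere. Just outside the circle the count is still close to $N$, which is part of why the argument relies on many edges and many small circles.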

A one-hidden-layer MLP is a universal classifier.

But the price is width. A complex decision boundary may require an enormous number of hidden neurons — each edge, each circle, each tiny piece of the boundary demands its own neuron. A single hidden layer that tiles a complicated shape can be impractically wide, even infinite in the limit.

Depth changes the geometry more efficiently. A deeper network can compose boundaries hierarchically: first layer detects simple features, second layer combines them into contours, third layer assembles contours into shapes. The same decision boundary that would require a vast flat layer can be built with far fewer neurons stacked in a few layers.

MLPs are universal classifiers. A single hidden layer can approximate any decision boundary. But universality is the wrong metric. What matters is the cost of the representation: for some boundaries, a shallow network needs exponentially many neurons where a deeper network needs only polynomially many. A shallow network may need a galaxy of neurons to draw a shape that a deep network captures with a handful of layers.

The geometry of classification is not about whether a boundary exists. It is about how many neurons it costs to draw it.

Lukas' Notes
