A perceptron on real-valued inputs draws a hyperplane. Everything on one side is labeled 1, everything on the other 0. This is a linear classifier, and nothing more.
A single line. Useful, but limited. The world rarely divides neatly into two half-planes.
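A minimal sketch of that single line, assuming a hard threshold activation and 1/0 output labels. The weights `w`, the bias `b`, and the sample points are arbitrary values chosen for illustration.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 where w.x + b >= 0 and 0 elsewhere (hard threshold)."""
    return (np.asarray(x) @ w + b >= 0).astype(int)

# Hypothetical parameters: the boundary is the line x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = -1.0

points = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, 0.5]])
labels = perceptron(points, w, b)
print(labels)  # [1 0 1]
```

Changing `w` rotates the line and changing `b` shifts it, but no choice of either bends it.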
Add a hidden layer and the picture changes. Each hidden neuron is a linear classifier — its own line. The output neuron ANDs them together: fire only if every hidden neuron fires. Geometrically, this is the intersection of half-planes. The result is a convex polygon.
One hidden layer gives you convex regions. Each edge is a hidden neuron. The output says: fire only if you are on the correct side of every edge. Five neurons, one pentagon.
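The pentagon can be sketched directly. This assumes hard threshold units and a regular pentagon centered at the origin. The helper names `make_polygon_layer` and `polygon_mlp` and the `inradius` parameter are illustrative, not taken from any library.

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

def make_polygon_layer(n_edges, inradius=1.0):
    """One hidden neuron per edge of a regular polygon centered at the origin."""
    angles = 2 * np.pi * np.arange(n_edges) / n_edges
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # outward normals
    W = -normals                   # neuron i fires when normal_i . x <= inradius
    b = np.full(n_edges, inradius)
    return W, b

def polygon_mlp(x, W, b):
    hidden = step(x @ W.T + b)     # shape (n_points, n_edges)
    # Output neuron acts as AND: it fires only if all edge neurons fire.
    return step(hidden.sum(axis=1) - (len(b) - 0.5))

W, b = make_polygon_layer(5)       # five neurons, one pentagon
points = np.array([[0.0, 0.0], [0.5, 0.2], [2.0, 0.0], [-1.5, 0.0]])
pentagon_labels = polygon_mlp(points, W, b)
print(pentagon_labels)  # [1 1 0 0]
```

The AND is itself just a perceptron: unit weights on every hidden neuron and a bias that is only overcome when all of them fire.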
Add more edges and the polygon rounds out. In the limit of many edges, a regular polygon becomes a circle. Place many such circles, give each its own AND neuron in a second hidden layer, and OR those neurons together at the output: you can tile any shape, convex or not, to arbitrary precision.
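A sketch of the OR step, assuming hard threshold units and 32-edge polygons standing in for circles. The two discs, their centers, and the helper names `polygon_unit` and `union_mlp` are hypothetical choices for illustration.

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

def polygon_unit(x, center, inradius, n_edges):
    """Edge neurons (first hidden layer) feeding one AND neuron (second layer)."""
    angles = 2 * np.pi * np.arange(n_edges) / n_edges
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    edges = step(inradius - (x - center) @ normals.T)  # 1 = correct side of edge
    return step(edges.sum(axis=1) - (n_edges - 0.5))   # AND over all edges

def union_mlp(x, discs, n_edges=32):
    # One AND neuron per disc, stacked as columns.
    inside = np.stack(
        [polygon_unit(x, np.array(c), r, n_edges) for c, r in discs], axis=1
    )
    # Output neuron acts as OR: it fires if at least one disc fires.
    return step(inside.sum(axis=1) - 0.5)

# Two separate discs: their union is not convex.
discs = [((0.0, 0.0), 1.0), ((3.0, 0.0), 1.0)]
points = np.array([[0.0, 0.0], [3.0, 0.5], [1.5, 0.0], [0.0, 2.0]])
union_labels = union_mlp(points, discs)
print(union_labels)  # [1 1 0 0]
```

The point halfway between the two centers is rejected even though both centers are accepted, which no single convex region could do.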
This is the circle-packing argument. Carried far enough, it shows that a one-hidden-layer MLP with enough neurons can approximate any classification boundary. Each circle is a sub-network of hidden neurons. The output neuron sums them and thresholds. Where enough circles overlap, the output is 1. With enough small circles, the approximation becomes arbitrarily precise.
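The summing step can be probed numerically. The sketch below assumes hard threshold units and a single 100-edge polygon of inradius 1 centered at the origin, and counts how many edge neurons fire at increasing distance from the center. The function name `edge_sum` and the probe distances are illustrative. Under these assumptions every edge neuron fires inside the polygon, and the count falls toward roughly half of them far away, which is what lets a plain sum followed by a threshold pick out the interior approximately.

```python
import numpy as np

def edge_sum(x, n_edges=100, inradius=1.0):
    """Number of edge neurons that fire at each point (hard threshold units)."""
    angles = 2 * np.pi * np.arange(n_edges) / n_edges
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    fires = inradius - x @ normals.T >= 0
    return fires.sum(axis=1)

# Probe points along the positive x-axis, from the center to far outside.
distances = np.array([0.0, 0.9, 1.5, 3.0, 50.0])
points = np.stack([distances, np.zeros_like(distances)], axis=1)
sums = edge_sum(points)

for d, s in zip(distances, sums):
    print(f"distance {d:5.1f}: {s} of 100 edge neurons fire")
```

The falloff is gradual rather than sharp, so the single-layer version recovers each circle only approximately. That is one reason the argument needs many edges and many small circles.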
A one-hidden-layer MLP is a universal classifier.
But the price is width. A complex decision boundary may require an enormous number of hidden neurons — each edge, each circle, each tiny piece of the boundary demands its own neuron. A single hidden layer that tiles a complicated shape can be impractically wide, even infinite in the limit.
Depth changes the geometry more efficiently. A deeper network can compose boundaries hierarchically: first layer detects simple features, second layer combines them into contours, third layer assembles contours into shapes. The same decision boundary that would require a vast flat layer can be built with far fewer neurons stacked in a few layers.
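A toy illustration of composition, not the geometric construction above. It swaps the threshold units for ReLU units and uses a one-dimensional input, both of which are assumptions made to keep the example small. Each layer applies a two-neuron "tent" map, and the final output is thresholded at 0.5. The names `tent_layer`, `deep_classifier`, and `count_flips` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent_layer(x):
    """Two ReLU neurons computing the tent map on [0, 1]: up to 1 at 0.5, back to 0."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_classifier(x, depth):
    """Stack `depth` tent layers, then threshold the result at 0.5."""
    h = x
    for _ in range(depth):
        h = tent_layer(h)
    return (h >= 0.5).astype(int)

def count_flips(depth, n_points=10007):
    """Count how often the predicted label changes across [0, 1]."""
    x = np.linspace(0.0, 1.0, n_points)
    return int(np.count_nonzero(np.diff(deep_classifier(x, depth))))

for depth in (1, 2, 4, 6):
    print(f"depth {depth}: {2 * depth} hidden neurons, "
          f"{count_flips(depth)} boundary crossings")
```

The number of boundary crossings doubles with each added layer, while the neuron count grows by two. A one-hidden-layer ReLU network on a scalar input is piecewise linear with at most one more piece than it has hidden neurons, so matching depth k that way would take on the order of 2^k neurons.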
MLPs are universal classifiers. A single hidden layer can approximate any decision boundary. But universality is the wrong metric. What matters is the cost of the representation: for some boundaries, a shallow network needs exponentially many neurons where a deep network needs only polynomially many. A shallow network may need a galaxy of neurons to draw a shape that a deep network captures with a handful of layers.
The geometry of classification is not about whether a boundary exists. It is about how many neurons it costs to draw it.