A perceptron with a threshold activation outputs or . Wire three perceptrons together — two in a hidden layer, one summing the output — and you get something more useful: a square pulse.
The first hidden neuron fires when . The second fires when . The output neuron sums them with weights and , firing only when is between the two thresholds. Three neurons produce a pulse of height on the interval , and everywhere else.
Scale this up. Vary the thresholds. Multiply each pulse by a different height. Sum them all together.
Each pulse is a sub-network of three neurons. Stack enough pulses side by side, each with its own height, and their sum traces a staircase. Make the pulses narrower and you can follow any curve. In the limit of infinitely many infinitesimally narrow pulses, the staircase converges to the function itself.
This is the pulse-approximation argument. Every pulse is a one-hidden-layer network. Their sum is a single output neuron with no activation — just addition. So a one-hidden-layer MLP with a linear output can approximate any continuous function on a bounded interval to arbitrary precision.
The argument scales to higher dimensions. In two dimensions, the pulse becomes a cylinder — a circle detector that outputs inside and outside. A two-dimensional function is approximated by a sum of scaled, shifted cylinders, each implemented by a sub-network of hidden neurons.
In dimensions, cylinders become hyperspheres. The idea is the same: tile the input space with localised bumps, scale each by the desired function value, and sum. A one-hidden-layer MLP with a linear output is a universal function approximator in any finite dimension.
A one-hidden-layer MLP is a universal function approximator.
But again, width is the hidden cost. Approximating a function that varies sharply may require an enormous number of narrow pulses — each demanding its own hidden neurons. A shallow network that tiles a high-dimensional space with hyperspheres can grow exponentially in the input dimension.
MLPs are universal approximators. But the architecture that actually fits is the one with enough layers to compose the function hierarchically, not the one that tiles the input space with an exponential number of bumps. What matters is not that a network can represent a function. It is how many neurons it takes to do so.