Definition
Perceptron
The perceptron is a learning algorithm traditionally used in online learning for training a binary linear classifier.
Given a sequence of training examples $(x_1, y_1), (x_2, y_2), \dots$, the algorithm iteratively updates the weight vector $w_t$ whenever an instance is misclassified.
Update Rule
Let $x_t \in \mathcal{X}$ be an element from the input space and $y_t \in \mathcal{Y} = \{-1, +1\}$ be an element from the label space. Further, let $\hat{y}_t = \operatorname{sign}(\langle w_t, x_t \rangle)$ be the hypothesis (the model's prediction) at time $t$.
If the model is correct, i.e. $\hat{y}_t = y_t$, the weights remain unchanged: $w_{t+1} = w_t$. If the model is wrong, we add the input vector scaled by the true label to the weights: $w_{t+1} = w_t + y_t x_t$,
where $w_t$ is the weight vector at time $t$ (the orientation of the decision boundary), $\langle \cdot, \cdot \rangle$ is the inner product, and $\operatorname{sign}$ is the signum function.
Example
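To make the rule concrete, here is a minimal sketch in Python (NumPy assumed; the function name and toy data are made up for illustration):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Online perceptron on examples (x_t, y_t) with labels in {-1, +1}."""
    w = np.zeros(X.shape[1])                   # start with w_0 = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t >= 0 else -1  # sign(<w_t, x_t>)
            if y_hat != y_t:                   # misclassified:
                w = w + y_t * x_t              # w_{t+1} = w_t + y_t * x_t
                mistakes += 1
        if mistakes == 0:                      # a full clean pass: done
            break
    return w

# Made-up, linearly separable toy data; the last coordinate is a
# constant 1, so the corresponding weight acts as a bias term.
X = np.array([[ 2.0,  1.0, 1.0], [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_train(X, y))
```

Encoding the bias as an extra constant-1 feature keeps the update rule exactly as stated above; alternatively, one can carry an explicit bias term and update it with $y_t$ on every mistake.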
Why does this work?
Assume that we made a wrong prediction at time $t$ with input $x_t$ and true label $y_t$. The model predicted the wrong sign, i.e.:
- If $y_t = +1$, the model predicted $\hat{y}_t = -1$, i.e. the inner product $\langle w_t, x_t \rangle$ was negative.
- If $y_t = -1$, the model predicted $\hat{y}_t = +1$, i.e. the inner product was positive.
We want to update $w_t$ to $w_{t+1}$ such that the new inner product $\langle w_{t+1}, x_t \rangle$ is closer to the correct sign than the old one.
Derivation: We want to see how the inner product (the score) changes for this specific input after the update. So we take the inner product with $x_t$ on both sides of the update rule $w_{t+1} = w_t + y_t x_t$:

$$\langle w_{t+1}, x_t \rangle = \langle w_t + y_t x_t, x_t \rangle$$
The inner product is linear, thus:

$$\langle w_{t+1}, x_t \rangle = \langle w_t, x_t \rangle + y_t \langle x_t, x_t \rangle$$
By definition of the Euclidean norm, the inner product of a vector with itself is its squared length, $\langle x_t, x_t \rangle = \|x_t\|^2$. Substituting gives:

$$\langle w_{t+1}, x_t \rangle = \langle w_t, x_t \rangle + y_t \|x_t\|^2$$
The term $\|x_t\|^2$ is always positive (for $x_t \neq 0$), which means that the direction of the change is entirely controlled by the label $y_t$.
- Case A: $y_t = +1$. The model predicted $\hat{y}_t = -1$, meaning the old score was negative, i.e. $\langle w_t, x_t \rangle < 0$. We want the score to become positive. Since $y_t = +1$, the correction term is positive: $y_t \|x_t\|^2 > 0$. The new score increases and moves from the negative side closer to zero, towards the positive side.
- Case B: $y_t = -1$. The model predicted $\hat{y}_t = +1$, meaning the old score was positive, i.e. $\langle w_t, x_t \rangle > 0$. We want the score to become negative. Since $y_t = -1$, the correction term is negative: $y_t \|x_t\|^2 < 0$. The new score decreases and moves from the positive side closer to zero, towards the negative side.
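As a concrete check with made-up numbers: take $w_t = (-1, 0)$, $x_t = (2, 1)$, and $y_t = +1$ (Case A). The old score is $\langle w_t, x_t \rangle = -2$. The update gives $w_{t+1} = w_t + x_t = (1, 1)$, and the new score is $\langle w_{t+1}, x_t \rangle = 3$, matching $-2 + y_t \|x_t\|^2 = -2 + 5$. Note that a single update is not guaranteed to flip the sign; it only moves the score in the right direction.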
Expressiveness
Geometrically, the perceptron draws a hyperplane through the data. Everything on one side is classified as $+1$, everything on the other side as $-1$. It can represent simple Boolean functions like AND and OR. But since it is a linear classifier, it is not very expressive: it cannot represent functions whose classes are not linearly separable, e.g. XOR.
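For instance, one possible weight vector for AND (with inputs in $\{0, 1\}^2$ and a constant-1 bias coordinate) is $w = (1, 1, -1.5)$: the score $x_1 + x_2 - 1.5$ is positive exactly when $x_1 = x_2 = 1$.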
Fixes: Kernel Perceptron, Multi-Layer Perceptron
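In brief, the kernel perceptron keeps the mistake-driven update but replaces the inner product with a kernel: the prediction becomes $\hat{y} = \operatorname{sign}\left(\sum_{i \in M} y_i K(x_i, x)\right)$, where $M$ is the set of examples misclassified so far. With a non-linear kernel this can represent non-linear decision boundaries such as XOR.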
Convergence Theorem
Novikoff (1962)
If the training data is linearly separable with margin $\gamma > 0$ (i.e., there exists a unit vector $w^*$ such that $y_t \langle w^*, x_t \rangle \geq \gamma$ for all $t$) and the instances are bounded by $\|x_t\| \leq R$, then the perceptron algorithm converges after at most $\left(\frac{R}{\gamma}\right)^2$ mistakes.
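For instance (made-up numbers): if all instances satisfy $\|x_t\| \leq R = 10$ and the margin is $\gamma = 0.5$, the bound guarantees at most $(10 / 0.5)^2 = 400$ mistakes, independent of the length of the training sequence.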
The XOR Problem
A simple linear model, such as the perceptron, fails on data that is not linearly separable. A classic example is the XOR function, whose classes cannot be separated by a single hyperplane. Solving such tasks requires composing multiple layers, which leads to artificial neural networks.
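As a sanity check, the sketch below (same hypothetical setup as the Example above, with XOR encoded in $\pm 1$) never reaches a clean pass over the data:

```python
import numpy as np

# XOR with inputs and labels in {-1, +1}; the last coordinate is a
# constant 1 acting as a bias. No hyperplane separates these four points.
X = np.array([[-1.0, -1.0, 1.0], [-1.0,  1.0, 1.0],
              [ 1.0, -1.0, 1.0], [ 1.0,  1.0, 1.0]])
y = np.array([-1, 1, 1, -1])   # XOR: +1 iff the inputs differ

w = np.zeros(3)
for epoch in range(1000):
    mistakes = 0
    for x_t, y_t in zip(X, y):
        if (1 if w @ x_t >= 0 else -1) != y_t:   # wrong prediction
            w = w + y_t * x_t                     # perceptron update
            mistakes += 1
    if mistakes == 0:                             # never happens for XOR
        break
print(f"epochs run: {epoch + 1}, mistakes in last pass: {mistakes}")
```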