Definition
Perceptron
The perceptron is a learning algorithm traditionally used in online learning for training a binary linear classifier.
Given a sequence of training examples $(x_1, y_1), (x_2, y_2), \dots$, the algorithm iteratively updates the weight vector $w_t$ whenever an instance is misclassified.
Update Rule
Let $x_t \in \mathcal{X}$ be an element from the input space and $y_t \in \mathcal{Y} = \{-1, +1\}$ be an element from the label space. Further, let $\hat{y}_t = \operatorname{sign}(\langle w_t, x_t \rangle)$ be the hypothesis (the model's prediction) at time $t$.
If the model is correct, i.e. $\hat{y}_t = y_t$, the weights remain unchanged: $w_{t+1} = w_t$. If the model is wrong, we add the input vector scaled by the true label to the weights: $w_{t+1} = w_t + y_t x_t$,
where $w_t$ is the weight vector at time $t$ (the orientation of the decision boundary), $\langle \cdot, \cdot \rangle$ is the inner product, and $\operatorname{sign}$ is the signum function.
Example
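To make the rule concrete, here is a minimal sketch in Python (NumPy assumed; the function name and toy data are made up for illustration):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Online perceptron on examples (x_t, y_t) with labels in {-1, +1}."""
    w = np.zeros(X.shape[1])                   # start with w_0 = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t >= 0 else -1  # sign(<w_t, x_t>)
            if y_hat != y_t:                   # misclassified:
                w = w + y_t * x_t              # w_{t+1} = w_t + y_t * x_t
                mistakes += 1
        if mistakes == 0:                      # a full clean pass: done
            break
    return w

# Made-up, linearly separable toy data; the last coordinate is a
# constant 1, so the corresponding weight acts as a bias term.
X = np.array([[ 2.0,  1.0, 1.0], [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_train(X, y))
```

Encoding the bias as an extra constant-1 feature keeps the update rule exactly as stated above; alternatively, one can carry an explicit bias term and update it with $y_t$ on every mistake.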
Why does this work?
Assume that we made a wrong prediction at time $t$ with input $x_t$ and true label $y_t$. The model predicted the wrong sign, i.e.:
- If $y_t = +1$, the model predicted $\hat{y}_t = -1$, i.e. the inner product $\langle w_t, x_t \rangle$ was negative.
- If $y_t = -1$, the model predicted $\hat{y}_t = +1$, i.e. the inner product was positive.
We want to update $w_t$ to $w_{t+1}$ such that the new inner product $\langle w_{t+1}, x_t \rangle$ is closer to the correct sign than the old one.
Derivation: We want to see how the inner product (the score) changes for this specific input after the update. So we take the inner product with $x_t$ on both sides of the update rule $w_{t+1} = w_t + y_t x_t$:

$$\langle w_{t+1}, x_t \rangle = \langle w_t + y_t x_t, x_t \rangle$$
The inner product is linear, thus:

$$\langle w_{t+1}, x_t \rangle = \langle w_t, x_t \rangle + y_t \langle x_t, x_t \rangle$$
By definition of the Euclidean norm, the inner product of a vector with itself is its squared length, $\langle x_t, x_t \rangle = \|x_t\|^2$. Substituting gives:

$$\langle w_{t+1}, x_t \rangle = \langle w_t, x_t \rangle + y_t \|x_t\|^2$$
The term $\|x_t\|^2$ is always positive (for $x_t \neq 0$), which means that the direction of the change is entirely controlled by the label $y_t$.
- Case A: $y_t = +1$. The model predicted $\hat{y}_t = -1$, meaning the old score was negative, i.e. $\langle w_t, x_t \rangle < 0$. We want the score to become positive. Since $y_t = +1$, the correction term is positive: $y_t \|x_t\|^2 > 0$. The new score increases and moves from the negative side closer to zero, towards the positive side.
- Case B: $y_t = -1$. The model predicted $\hat{y}_t = +1$, meaning the old score was positive, i.e. $\langle w_t, x_t \rangle > 0$. We want the score to become negative. Since $y_t = -1$, the correction term is negative: $y_t \|x_t\|^2 < 0$. The new score decreases and moves from the positive side closer to zero, towards the negative side.
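As a concrete check with made-up numbers: take $w_t = (-1, 0)$, $x_t = (2, 1)$, and $y_t = +1$ (Case A). The old score is $\langle w_t, x_t \rangle = -2$. The update gives $w_{t+1} = w_t + x_t = (1, 1)$, and the new score is $\langle w_{t+1}, x_t \rangle = 3$, matching $-2 + y_t \|x_t\|^2 = -2 + 5$. Note that a single update is not guaranteed to flip the sign; it only moves the score in the right direction.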
Expressiveness
Geometrically, the perceptron draws a hyperplane through the data. Everything on one side is classified as $+1$, everything on the other side as $-1$. It can represent simple Boolean functions like AND and OR. But since it is a linear classifier, it is not very expressive: it cannot represent functions whose classes are not linearly separable, e.g. XOR.
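For instance, one possible weight vector for AND (with inputs in $\{0, 1\}^2$ and a constant-1 bias coordinate) is $w = (1, 1, -1.5)$: the score $x_1 + x_2 - 1.5$ is positive exactly when $x_1 = x_2 = 1$.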
Fixes: Kernel Perceptron, Multi-Layer Perceptron
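In brief, the kernel perceptron keeps the mistake-driven update but replaces the inner product with a kernel: the prediction becomes $\hat{y} = \operatorname{sign}\left(\sum_{i \in M} y_i K(x_i, x)\right)$, where $M$ is the set of examples misclassified so far. With a non-linear kernel this can represent non-linear decision boundaries such as XOR.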
Convergence Theorem
Novikoff (1962)
If the training data is linearly separable with margin $\gamma > 0$ (i.e., there exists a unit vector $w^*$ such that $y_t \langle w^*, x_t \rangle \geq \gamma$ for all $t$) and the instances are bounded by $\|x_t\| \leq R$, then the perceptron algorithm converges after at most $\left(\frac{R}{\gamma}\right)^2$ mistakes.
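For instance (made-up numbers): if all instances satisfy $\|x_t\| \leq R = 10$ and the margin is $\gamma = 0.5$, the bound guarantees at most $(10 / 0.5)^2 = 400$ mistakes, independent of the length of the training sequence.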
The XOR Problem
A simple linear model, such as the perceptron, fails on data that is not linearly separable. A classic example is the XOR function, whose classes cannot be separated by a single hyperplane. Solving such tasks requires composing multiple layers, which leads to artificial neural networks.
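As a sanity check, the sketch below (same hypothetical setup as the Example above, with XOR encoded in $\pm 1$) never reaches a clean pass over the data:

```python
import numpy as np

# XOR with inputs and labels in {-1, +1}; the last coordinate is a
# constant 1 acting as a bias. No hyperplane separates these four points.
X = np.array([[-1.0, -1.0, 1.0], [-1.0,  1.0, 1.0],
              [ 1.0, -1.0, 1.0], [ 1.0,  1.0, 1.0]])
y = np.array([-1, 1, 1, -1])   # XOR: +1 iff the inputs differ

w = np.zeros(3)
for epoch in range(1000):
    mistakes = 0
    for x_t, y_t in zip(X, y):
        if (1 if w @ x_t >= 0 else -1) != y_t:   # wrong prediction
            w = w + y_t * x_t                     # perceptron update
            mistakes += 1
    if mistakes == 0:                             # never happens for XOR
        break
print(f"epochs run: {epoch + 1}, mistakes in last pass: {mistakes}")
```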