statistics

Definition

Variance

Variance measures how far a set of numbers is spread out from their mean. It is the average of the squared differences from the mean. A high variance indicates a large spread of the numbers in the dataset.

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

where $N$ is the number of elements in the population and $\mu$ is the mean of the population.
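As a minimal sketch of the population formula, the snippet below computes the average squared deviation by hand. The function name `population_variance` and the sample values are illustrative assumptions, not part of the original note.

```python
def population_variance(values):
    """Average squared deviation from the mean (divides by N, not N - 1)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty dataset is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


# Illustrative data (an assumption for this sketch); its mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
```

The squared deviations here are 9, 1, 1, 1, 0, 0, 4 and 16, which sum to 32; dividing by the 8 elements gives 4.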

Variance

Variance is a measure of dispersion of a random variable $X$ and is given by:

$$
\operatorname{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]
$$

Properties

Let $X, Y$ be two random variables and $a, b$ be scalars. Then:

Linearity

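Unlike the expectation, the variance is not linear. For any real-valued random variable $X$ and any real-valued scalars $a, b$, a scale factor comes out squared and a constant shift has no effect:

$$
\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)
$$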
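A quick numerical check of how variance responds to scaling and shifting: multiplying every value by $a$ multiplies the variance by $a^2$, and adding a constant $b$ changes nothing. The helper `pvar` and the numbers are assumptions chosen for illustration.

```python
def pvar(values):
    """Population variance: mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


xs = [1.0, 2.0, 4.0, 7.0]   # illustrative data
a, b = 3.0, 10.0            # illustrative scale and shift

transformed = [a * x + b for x in xs]

print(pvar(xs))           # 5.25
print(pvar(transformed))  # 47.25, i.e. a**2 * 5.25
```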

Sum

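$$
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)
$$

Note that if $X$ and $Y$ are independent, then $\operatorname{Cov}(X, Y) = 0$ and:

$$
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)
$$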
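The sum rule can be checked on a small paired dataset. This is only a sketch: the helpers `pcov` and `pvar` use the population (divide by $N$) convention, and the data are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def pcov(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pvar(xs):
    """Variance is the covariance of a sequence with itself."""
    return pcov(xs, xs)


xs = [1.0, 2.0, 3.0, 4.0]   # illustrative data
ys = [2.0, 1.0, 4.0, 5.0]
sums = [x + y for x, y in zip(xs, ys)]

lhs = pvar(sums)
rhs = pvar(xs) + pvar(ys) + 2 * pcov(xs, ys)
print(lhs, rhs)  # 6.75 6.75
```

Here the covariance is 1.5, so ignoring the cross term would give 3.75 instead of 6.75. The cross term vanishes only when the covariance is zero, which independence guarantees.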

Geometric Analogy

This equation reminds me of the law of cosines from trigonometry.

The formula for the variance of a sum of random variables is the statistical equivalent of the geometric Law of Cosines. This reveals an analogy where random variables can be treated as vectors in an abstract space.

The direct correspondence is as follows:

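| Geometric Concept (Vectors) | Statistical Concept (Random Variables) |
| --- | --- |
| Vector, $\vec{a}$ | Random variable, $X$ |
| Squared length, $\lVert \vec{a} \rVert^2$ | Variance, $\operatorname{Var}(X)$ |
| Dot product, $\vec{a} \cdot \vec{b}$ | Covariance, $\operatorname{Cov}(X, Y)$ |
| Cosine of angle, $\cos\theta$ | Correlation, $\rho_{X,Y}$ |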

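The Law of Cosines for the sum of two vectors $\vec{a}$ and $\vec{b}$ is:

$$
\lVert \vec{a} + \vec{b} \rVert^2 = \lVert \vec{a} \rVert^2 + \lVert \vec{b} \rVert^2 + 2\,\vec{a} \cdot \vec{b}
$$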

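This perfectly mirrors the variance formula:

$$
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)
$$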

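In this analogy, the correlation coefficient ($\rho_{X,Y}$) is literally the cosine of the angle between the two random-variable vectors, once each variable is centred on its mean and covariance is used as the inner product. That is what makes the connection mathematically precise.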

The connection goes even deeper. The dot product in geometry is defined as:

$$
\vec{a} \cdot \vec{b} = \lVert \vec{a} \rVert \, \lVert \vec{b} \rVert \cos\theta
$$

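In probability theory, the covariance is related to correlation by:

$$
\operatorname{Cov}(X, Y) = \rho_{X,Y} \, \sigma_X \, \sigma_Y
$$

where $\sigma_X = \sqrt{\operatorname{Var}(X)}$ and $\sigma_Y = \sqrt{\operatorname{Var}(Y)}$ are the standard deviations, which play the role of the vector lengths.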
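For a finite dataset the analogy can be tested directly: centre each sample on its mean, treat the results as ordinary vectors, and compare the cosine of the angle between them with the correlation coefficient. The helper names and the data below are assumptions made for this sketch.

```python
import math


def centered(values):
    """Subtract the mean so the data become a deviation vector."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_of_angle(u, v):
    """cos(theta) = (u . v) / (|u| |v|)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def correlation(xs, ys):
    """Correlation as covariance divided by both standard deviations."""
    n = len(xs)
    cx, cy = centered(xs), centered(ys)
    cov = dot(cx, cy) / n
    sd_x = math.sqrt(dot(cx, cx) / n)
    sd_y = math.sqrt(dot(cy, cy) / n)
    return cov / (sd_x * sd_y)


xs = [1.0, 2.0, 3.0, 4.0]   # illustrative data
ys = [2.0, 1.0, 4.0, 5.0]

rho = correlation(xs, ys)
cos_theta = cosine_of_angle(centered(xs), centered(ys))
print(math.isclose(rho, cos_theta))  # True
```

The factors of $1/N$ in the covariance and the two standard deviations cancel, which is why the two quantities agree.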

Intuition


After finding the centre, the next logical question is: “How spread out are the data points from this centre?” Are they all clustered tightly around the mean, or are they widely dispersed?

To measure this, one could try calculating the average signed distance of each point from the mean, i.e.:

$$
\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)
$$

However, some points lie to the left of the mean (negative distance) and some lie to the right (positive distance). These cancel each other out exactly, so the result is always zero.
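The cancellation is easy to see on a small example. The data below are an illustrative assumption; with arbitrary floating-point data the sum may come out as a tiny non-zero rounding error rather than exactly zero.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data with mean 5
mean = sum(data) / len(data)

# Signed deviations: negative to the left of the mean, positive to the right.
deviations = [x - mean for x in data]

print(deviations)                    # [-3.0, -1.0, -1.0, -1.0, 0.0, 0.0, 2.0, 4.0]
print(sum(deviations) / len(data))   # 0.0
```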

Thus, every distance should be made non-negative before averaging. There are two obvious options:

  1. Absolute value: $|x_i - \mu|$
  2. Squaring: $(x_i - \mu)^2$

Statisticians prefer squaring for a few reasons: it is mathematically easier to work with (the square is differentiable everywhere, whereas the absolute value has a kink at zero), and it heavily penalises values that are far from the mean, which is often a desirable property.
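To see how differently the two options treat a far-away value, the sketch below compares the mean absolute deviation with the mean squared deviation on two made-up datasets that differ in a single point. The function names and data are assumptions for illustration.

```python
def mean_absolute_deviation(values):
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


def mean_squared_deviation(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


tight = [4, 5, 5, 5, 6]          # clustered around 5
with_outlier = [4, 5, 5, 5, 16]  # same data, last point moved far away

print(mean_absolute_deviation(tight), mean_squared_deviation(tight))
# 0.4 0.4
print(mean_absolute_deviation(with_outlier), mean_squared_deviation(with_outlier))
# 3.6 20.4
```

Moving one point makes the absolute measure 9 times larger but the squared measure 51 times larger, which is the heavy penalty on distant values described above.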

This measure, the average of the squared distances from the mean, is exactly the variance:

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$
