machine-learning

Definition

Risk

Risk, also known as the generalisation error or true risk, is the expected loss of a model over the true, underlying data distribution. It quantifies how well a model is expected to perform on unseen data points drawn from the same distribution as the training data.

Let $P(X, Y)$ be the true joint probability distribution over the input-output space $\mathcal{X} \times \mathcal{Y}$. Let $h$ be our model (hypothesis) and $L(h(x), y)$ be the loss function that measures the cost of predicting $h(x)$ when the true label is $y$.

The risk of the hypothesis $h$, denoted as $R(h)$, is the expected value of the loss with respect to the true data distribution $P(X, Y)$:

$$
R(h) = \mathbb{E}_{(x, y) \sim P}\left[L(h(x), y)\right]
$$

This can be expressed as an integral (for continuous variables) or a sum (for discrete variables) over all possible data points, weighted by their probability of occurrence.
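Concretely, using the notation above (with $p(x, y)$ introduced here for a density in the continuous case), the expectation unrolls as

$$
R(h) = \int_{\mathcal{X} \times \mathcal{Y}} L(h(x), y)\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y
\quad \text{or} \quad
R(h) = \sum_{(x, y)} L(h(x), y)\, P(x, y).
$$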

True Risk vs. Empirical Risk

It is crucial to distinguish between true risk and empirical risk:

True Risk:

Definition

True Risk

True risk $R(h)$, where $h$ is a model, is the expected loss over all possible data, including future, unseen examples. In practice, $R(h)$ is incomputable because the true distribution $P(X, Y)$ is unknown.


Empirical Risk:

Definition

Empirical Risk

Empirical risk $\hat{R}(h)$, where $h$ is a model, is the average loss calculated on a finite training dataset $S = \{(x_i, y_i)\}_{i=1}^{n}$. It serves as a practical, computable estimate of the true risk $R(h)$.

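Written out with this notation, the empirical risk is the sample mean of the loss over $S$:

$$
\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)
$$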

The ultimate goal of a machine learning algorithm is to find a model that minimises the true risk $R(h)$. However, since we only have access to the training data, we minimise the empirical risk $\hat{R}(h)$ as a proxy. A significant gap between low empirical risk and high true risk is known as overfitting.
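As an illustration, the following minimal Python sketch (the synthetic sine-plus-noise data, squared-error loss, and polynomial model are assumptions made here, not taken from this note) contrasts the empirical risk on a small training set with a Monte Carlo estimate of the true risk computed on a large held-out sample from the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y_pred, y_true):
    """Per-example squared-error loss L(h(x), y)."""
    return (y_pred - y_true) ** 2

def sample(n):
    """Draw n points from an assumed distribution: y = sin(x) + Gaussian noise."""
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + rng.normal(scale=0.1, size=n)
    return x, y

# Small training set S, and a large sample standing in for the true distribution P(X, Y).
x_train, y_train = sample(25)
x_test, y_test = sample(100_000)

# Fit a high-degree polynomial h by minimising the empirical risk (least squares).
coeffs = np.polyfit(x_train, y_train, deg=12)
h = np.poly1d(coeffs)

empirical_risk = squared_loss(h(x_train), y_train).mean()      # R_hat(h) on S
estimated_true_risk = squared_loss(h(x_test), y_test).mean()   # Monte Carlo estimate of R(h)

print(f"empirical risk:      {empirical_risk:.4f}")
print(f"estimated true risk: {estimated_true_risk:.4f}")
# An estimated true risk much larger than the empirical risk is the overfitting gap described above.
```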