194.025 Introduction to Machine Learning

Machine Learning
Learning Paradigms:
Learning Methods:
Kernel Methods:
- Kernel Function
- Gram Matrix
- Hinge Loss
- Common Kernels:
- Ensemble Learning
  - Bagging
  - Boosting
- Artificial Neural Networks (ANN)
Dimensionality Reduction:
- Dimensionality Reduction
Risk and Evaluation:
- ML Evaluation
- Empirical Risk
- True Risk
- Underfitting
- Overfitting
- Bias-Variance Tradeoff
- Dataset Splitting
- k-fold Cross-Validation
- Baselines
- Student-t Test
- Metrics:
  - MAE
  - RMSE
  - Confusion Matrix
  - Accuracy
  - Precision
  - Sensitivity
  - F1-Score
Probabilistic Machine Learning:
- Bayes Optimal Classifier
- Discriminative Learning
- Generative Learning
- Inference and Estimation:
- Probabilistic Graphical Models:
  - Bayesian Network
  - Naive Bayes
Optimisation:
Fundamental Assumptions:
Learning Theory:
Mathematical Foundations:
Graph Representations:
Bias and Fairness:
- Ethical AI
Data Analysis:
Data Preprocessing:
Data Types:

Lecture 1 Overview and Mathematical Framing

This lecture positions machine learning as a mathematical discipline for constructing predictors from data under uncertainty. The central object is a mapping $h : X \to Y$ selected from a hypothesis class $H$ , with performance evaluated relative to an unknown data-generating distribution $P$ .

Problem Formulation

Let $(X, Y) \sim P$ with input space $X$ and label space $Y$ . A training sample is

S = {(x_{i}, y_{i})}_{i = 1}^{n} \sim P^{n},

with the standard i.i.d. assumption. Learning is the task of selecting

\hat{h} \in H

such that generalisation to unseen draws from $P$ is good.

In the classification setting with 0-1 loss,

L (\hat{h} (x), y) = 1 {\hat{h} (x) \neq = y},

the true risk and empirical risk are

R_{P} (h) = (x, y) \sim P Pr (h (x) \neq = y), R_{S} (h) = \frac{1}{n} i = 1 \sum n 1 {h (x_{i}) \neq = y_{i}} .

The course repeatedly studies the gap between these two quantities and conditions under which minimising $R_{S}$ controls $R_{P}$ .

Conceptual Position of Machine Learning

The opening slides place ML at the interface of statistics, optimisation, pattern recognition, data mining, and AI. A useful operational distinction is:

data mining emphasises extraction of structure from existing data;
machine learning emphasises construction of predictive rules with explicit generalisation goals.

This distinction is not absolute, but exam questions typically expect you to articulate learning in terms of hypotheses, loss, distributions, and guarantees.

Historical Arc and Methodological Consequence

The historical timeline (Bayes, Markov, perceptron, PAC, SVM, deep learning, transformers) motivates a recurring pattern: representational power, optimisation tractability, and statistical guarantees must be balanced. The modern toolkit is best understood as combinations of these three axes rather than as isolated algorithms.

Perceptron as Canonical Linear Learner

For binary labels $y_{t} \in {- 1, + 1}$ and inputs $x^{(t)} \in R^{d}$ , the perceptron predicts

f_{t} (x) = sign (⟨ w^{(t)}, x ⟩) .

Given example $(x^{(t)}, y_{t})$ , the update is

w^{(t + 1)} = {w^{(t)} w^{(t)} + y_{t} x^{(t)} if f_{t} (x^{(t)}) = y_{t}, otherwise .

Equivalent mistake form:

⟨ x^{(t)}, w^{(t + 1)} ⟩ = ⟨ x^{(t)}, w^{(t)} ⟩ + y_{t} ∥ x^{(t)} ∥^{2} .

Interpretation: each mistake moves the separating hyperplane towards correct classification of the current point.

Noise, Plausible Hypotheses, and Margin Principle

The geometric sequence of slides illustrates:

with low noise, many linear separators can interpolate the sample;
with increasing noise, the set of plausible separators contracts;
maximum-margin selection yields a robust separator by maximising the minimum signed distance to training points.

This motivates support vector machines and, more generally, the role of inductive bias in choosing one predictor among many empirically adequate ones.

Expressivity Limits and Non-Linearity

Linear threshold functions can represent AND/OR-type concepts but fail on XOR in the original input space. Two standard remedies are introduced:

Kernel lifting: linear separation in a feature space $ϕ (x)$ via inner products $k (x, x^{'}) = ⟨ ϕ (x), ϕ (x^{'})⟩$ .
Multi-layer perceptron composition: stacked affine maps and non-linear activations to realise non-linear decision boundaries.

This is the first appearance of the representation trick: make the task linearly simple in a transformed space.

Toy Generalisation Guarantee from Threshold Learning

The airport suitcase example instantiates one-dimensional threshold learning. Let $θ^{*}$ denote the true threshold and $\hat{θ}$ the learned threshold from $n$ i.i.d. examples. Suppose the disagreement region between $θ^{*}$ and $\hat{θ}$ has probability mass $ε$ .

An error larger than $ε$ occurs only if no training point falls inside that disagreement region. Hence

Pr (bad event) = (1 - ε)^{n} .

Using $(1 - ε)^{n} \leq e^{- ε n}$ , it suffices to require

e^{- ε n} \leq δ ⟺ n \geq \frac{1}{ε} ln \frac{1}{δ} .

Therefore,

Pr (R_{P} (\hat{h}) \leq ε) \geq 1 - δ

for $n$ above the bound. This is the core PAC-style sample-complexity template that recurs throughout the course.

Exam-Oriented Takeaways

You should be able to reconstruct without notes:

the formal learning setup $(X, Y, P, H, L)$ ;
perceptron prediction and update equations;
why linear models fail on XOR and how kernels/layers address this;
the threshold-learning bound $n \geq \frac{1}{ε} ln \frac{1}{δ}$ from $(1 - ε)^{n}$ .

These four items are the mathematical backbone of the introductory lecture and connect directly to later topics in PAC learning, sample complexity, generalisation bounds, and VC dimension.

Lecture 2 Data and (Pre-)Processing

This lecture formalises the pipeline from raw observations to vector representations suitable for learning algorithms. The core principle is that model quality is bounded by representation quality: if the data map is inadequate, no downstream optimiser can recover the lost structure.

Data Model and Data Types

Let a dataset be written as

X = {x_{i}}_{i = 1}^{n}, x_{i} \in X,

with optional labels $y_{i} \in Y$ . The feature space $X$ may be heterogeneous, combining numeric, categorical, textual, image, or graph-valued attributes. In practice, most classical learners require an embedding

ϕ : X \to R^{d}

that preserves task-relevant structure.

Numeric attributes are either discrete or continuous. Categorical attributes are nominal (unordered) or ordinal (ordered). This distinction determines admissible pre-processing: ordinal variables may be relabelled monotonically, while nominal variables require non-ordinal encodings such as one-hot encoding.

Data Analysis Before Pre-Processing

Pre-processing starts with exploratory data analysis rather than immediate transformation. For each feature one should inspect scale, missingness, outliers, and empirical distribution. The minimal summary comprises median and quantiles (5-number summary), mean, variance/standard deviation, and pairwise dependency diagnostics (e.g. correlation matrices for numeric attributes).

The statistical purpose is to estimate whether a transformation will improve numerical conditioning and comparability across features.

Numeric Pre-Processing and Scaling Maps

Let $x_{ij}$ denote feature $j$ of example $i$ . Scaling is performed per feature (column-wise), not globally across all coordinates.

For feature $j$ , define

m_{j} = i min x_{ij}, M_{j} = i max x_{ij}, μ_{j} = \frac{1}{n} i = 1 \sum n x_{ij}, σ_{j}^{2} = \frac{1}{n} i = 1 \sum n (x_{ij} - μ_{j})^{2} .

The main transforms are:

min-max to [0, 1] : x_{ij}^{'} = \frac{x _{ij} - m _{j}}{M _{j} - m _{j}},

min-max to [a, b] : x_{ij}^{'} = a + \frac{x _{ij} - m _{j}}{M _{j} - m _{j}} (b - a),

mean normalisation : x_{ij}^{'} = \frac{x _{ij} - μ _{j}}{M _{j} - m _{j}},

standardisation : x_{ij}^{'} = \frac{x _{ij} - μ _{j}}{σ _{j}} .

Standardisation is typically preferred for gradient-based models and margin-based methods because it centres and rescales coordinates to comparable magnitudes.

Categorical and String Encodings

For a nominal feature with alphabet $A = {a_{1}, \dots, a_{k}}$ , one-hot encoding defines

ψ (a_{r}) = e_{r} \in {0, 1}^{k},

where $e_{r}$ is the $r$ -th standard basis vector. Missing categories are represented either by a dedicated “unknown” symbol or by imputation before encoding.

The same principle extends to strings at symbol level: with alphabet $Σ$ , each symbol maps to a one-hot vector in ${0, 1}^{∣Σ∣}$ . Concatenating symbol vectors yields a sparse representation. Word-level encodings in text follow the same algebraic idea, with vocabulary size replacing alphabet size.

Image and Text Feature Construction

For images, a traditional pipeline is: geometric/photometric pre-processing (resize, greyscale, augmentation), followed by feature extraction (e.g. HOG/SIFT/SURF). Deep architectures instead consume tensor-valued pixel arrays directly and learn feature maps end-to-end.

For text, a basic pipeline is token normalisation, stop-word handling, stemming/lemmatisation, then vectorisation (TF, TF-IDF, BM25, neural embeddings). The key geometric fact is sparsity: bag-of-words vectors live in high-dimensional spaces with mostly zero coordinates.

Distances as Inductive Geometry

Distance-based methods induce neighbourhood structure. A map $d : X \times X \to R_{\geq 0}$ is a metric if for all $x, u, v$ :

d (x, v) = d (v, x), d (x, v) \geq 0, d (x, x) = 0, d (x, v) \leq d (x, u) + d (u, v) .

For vectors in $R^{m}$ :

d_{ℓ_{2}} (x, v) = i = 1 \sum m (x_{i} - v_{i})^{2}, d_{ℓ_{1}} (x, v) = i = 1 \sum m ∣ x_{i} - v_{i} ∣,

d_{ℓ_{p}} (x, v) = (i = 1 \sum m ∣ x_{i} - v_{i} ∣^{p})^{1/ p}, p \geq 1.

For strings, the Levenshtein distance counts minimum edit operations (insert/delete/replace).

The cosine similarity

cos (x, v) = \frac{⟨ x , v ⟩}{∥ x ∥ _{2} ∥ v ∥ _{2}}

is a similarity, not a metric; it captures orientation rather than absolute magnitude.

Graph Data and Representation Dependence

A graph is $G = (V, E)$ , optionally augmented to

G = (V, E, X, W),

where $X$ stores vertex features and $W$ stores edge features. Common encodings are adjacency set, adjacency list, and adjacency matrix $A \in {0, 1}^{∣ V ∣ \times ∣ V ∣}$ .

Two structural difficulties drive graph learning theory:

permutation non-uniqueness: one graph admits many equivalent vertex orderings;
variable size: $∣ V ∣$ and $∣ E ∣$ differ across instances.

Hence representation must be permutation-invariant or permutation-equivariant.

The Weisfeiler-Leman (WL) algorithm computes iterative colour refinements. Let $c_{x}^{(t)}$ be the colour of vertex $x$ at iteration $t$ . Initialisation is constant colouring $c^{(0)}$ . Update rule:

c_{x}^{(t + 1)} = HASH (c_{x}^{(t)}, {{c_{y}^{(t)} : y \in N (x)}}),

where ${{\cdot}}$ denotes a multiset and HASH is injective on its input tuples. Iteration stops when the number of distinct colours stabilises.

A finite-dimensional graph feature vector is obtained by counting colour frequencies across iterations. If $C$ is the global colour index set, one graph-level representation is

Φ (G) = (# {x \in V : c_{x}^{(t)} = γ})_{γ \in C, t = 0, \dots, T} .

Dot products between such vectors induce the Weisfeiler-Leman graph kernel.

Exam-Oriented Takeaways

For this lecture, you should be able to derive and explain:

feature-wise scaling maps (min-max, mean normalisation, standardisation);
formal metric axioms and the $ℓ_{1} / ℓ_{2} / ℓ_{p}$ family;
why cosine is a similarity measure rather than a metric;
graph representation choices and permutation issues;
the WL update equation and how colour histograms yield graph vectors.

This lecture underpins later choices of distance-based methods, feature engineering, kernel methods, and graph learning pipelines.

Lecture 3 Core Concepts of Machine Learning

This lecture consolidates the pipeline from learning paradigms to optimisation, then to statistical evaluation and complexity control. The unifying question is: how can a model fit observed data while maintaining predictive reliability on unseen data?

Learning Paradigms

Let $X$ denote the instance space and $Y$ the target space. In supervised learning, we observe labelled samples

S = {(x_{i}, y_{i})}_{i = 1}^{m} \in (X \times Y)^{m},

and learn a predictor $h : X \to Y$ . In unsupervised learning, only ${x_{i}}_{i = 1}^{m}$ is observed and structure must be inferred without explicit labels.

The lecture also positions semi-supervised, self-supervised, active/passive, and online/batch settings as protocol variants that change information access or update timing, not the core objective of generalisable prediction.

Supervised Tasks and Loss Functions

Two canonical supervised tasks are:

classification: $Y$ finite/discrete,
regression: $Y \subseteq R$ continuous.

Given loss $L (h, x, y)$ , the empirical risk is

R_{S} (h) = \frac{1}{m} i = 1 \sum m L (h, x_{i}, y_{i}) .

Typical losses from the lecture:

0-1 loss: L (h, x, y) = 1 {h (x) \neq = y},

squared loss: L (h, x, y) = (h (x) - y)^{2} .

For linear regression with parameters $θ = (a, b)^{⊤}$ and $h_{θ} (x) = a x + b$ , training corresponds to minimising mean squared error on $S$ .

Empirical Risk Minimisation and Training Objective

Let $H$ be a hypothesis class. Empirical risk minimisation seeks

h^{*} \in ar g h \in H min R_{S} (h) .

This is an optimisation problem over parameter space (for parametric models) or function space (more generally). The key caveat highlighted in the lecture: low training error alone is insufficient, because optimisation can over-adapt to sample idiosyncrasies.

Gradient Descent and Parameter Updates

For differentiable objective $f : R^{d} \to R$ (e.g. empirical risk), gradient descent uses

θ_{t} = θ_{t - 1} - η \nabla f (θ_{t - 1}),

with learning rate $η > 0$ . Since $\nabla f (θ)$ points to steepest ascent, $- \nabla f (θ)$ is a local steepest-descent direction.

In stochastic variants (SGD), the full gradient is replaced by a mini-batch estimate. This introduces gradient noise but improves scalability and often helps escape sharp local structures.

Risk, Empirical Risk, and Generalisation

With data-generating distribution $D$ , true risk is

R_{D} (h) = E_{(x, y) \sim D} [L (h, x, y)] .

The empirical risk $R_{S} (h)$ is a sample estimate of $R_{D} (h)$ . Their mismatch defines generalisation error:

gen (h) = R_{D} (h) - R_{S} (h) .

Model selection must therefore target low true risk, not merely low training error.

Bias-Variance and Fit Regimes

As complexity increases, approximation error (bias) tends to decrease, while estimation sensitivity (variance) tends to increase. The resulting bias-variance trade-off explains:

underfitting at low complexity (high bias, low variance),
overfitting at high complexity (low bias, high variance),
an intermediate regime of best expected generalisation.

The “good fit” in the lecture corresponds to this intermediate region.

Regularisation as Complexity Control

Regularisation augments the data-fit term by a complexity penalty:

L_{reg} (h, x, y) = L (h, x, y) + λ Ω (h),

or dataset-level objective

h \in H min R_{S} (h) + λ Ω (h) .

For parameter vector $ω \in R^{d}$ :

Ω_{ℓ_{1}} (ω) = i = 1 \sum d ∣ ω_{i} ∣, Ω_{ℓ_{2}} (ω) = i = 1 \sum d ω_{i}^{2} .

$ℓ_{1}$ tends to induce sparse solutions; $ℓ_{2}$ shrinks weights smoothly. The hyperparameter $λ$ controls the fit-complexity trade-off.

Exam-Oriented Takeaways

For this lecture, you should be able to state and manipulate:

supervised setup $(X, Y, S, h, L)$ and empirical risk definition;
task-specific losses (0-1 and squared loss) and when they apply;
ERM objective and gradient descent update equation;
true risk vs empirical risk and the meaning of generalisation;
regularised objective with $ℓ_{1} / ℓ_{2}$ penalties and the role of $λ$ .

This lecture is the conceptual bridge from introductory examples to formal learning theory, optimisation analysis, and robust model selection.

Lecture 4 Basic Algorithms I

This lecture develops least-squares regression from first principles, then lifts the same optimisation pattern to polynomial and multivariate settings. The central message is that many seemingly different regressors are linear models in a transformed feature space.

Why Fit Models

Given data pairs, a fitted model serves four technical roles: compression of observations into a small parameter vector, interpolation/explanation of observed patterns, prediction on unseen inputs, and downstream decision support. In mathematical terms, we seek a map with low approximation error and acceptable generalisation behaviour while maintaining low representational complexity.

Linear Regression Setup

Assume training data

S = {(x_{i}, y_{i})}_{i = 1}^{m} \subseteq R \times R,

and a linear hypothesis

h_{w} (x) = w_{1} x + w_{0},

with parameters $w = (w_{0}, w_{1})^{⊤} \in R^{2}$ . A standard noise model is

y_{i} = w_{1} x_{i} + w_{0} + ε_{i} .

The least-squares objective (empirical risk with squared loss) is

R_{S} (w) = \frac{1}{m} i = 1 \sum m (y_{i} - (w_{1} x_{i} + w_{0}))^{2} .

Optimisation by Gradient Descent

The gradient components are

\frac{\partial R _{S}}{\partial w _{1}} = - \frac{2}{m} i = 1 \sum m x_{i} (y_{i} - (w_{1} x_{i} + w_{0})),

\frac{\partial R _{S}}{\partial w _{0}} = - \frac{2}{m} i = 1 \sum m (y_{i} - (w_{1} x_{i} + w_{0})) .

With learning rate $η > 0$ , gradient descent updates are

w^{(t + 1)} = w^{(t)} - η \nabla R_{S} (w^{(t)}) .

This is the iterative route used broadly in machine learning, but linear least squares also admits an exact solution.

Closed-Form Least-Squares Solution

Because $R_{S} (w)$ is quadratic in $w$ , its minimiser is obtained by setting partial derivatives to zero. The scalar closed form is

w_{1} = \frac{\sum _{i = 1}^{m} ( x _{i} - x ˉ ) ( y _{i} - y ˉ )}{\sum _{i = 1}^{m} ( x _{i} - x ˉ ) ^{2}}, w_{0} = \overset{y}{ˉ} - w_{1} \overset{x}{ˉ},

where

\overset{x}{ˉ} = \frac{1}{m} i = 1 \sum m x_{i}, \overset{y}{ˉ} = \frac{1}{m} i = 1 \sum m y_{i} .

Interpretation: $w_{1}$ is empirical covariance divided by empirical variance of the input coordinate.

Matrix Form and Normal Equations

Using the lecture orientation (samples as columns), define

X = [1 x_{1} \dots \dots 1 x_{m}] \in R^{2 \times m}, y = [y_{1} \dots y_{m}]^{⊤} \in R^{m} .

Then $\overset{y}{^} = X^{⊤} w$ and minimisation yields normal equations

X X^{⊤} w = X y,

hence

w = (X X^{⊤})^{- 1} X y .

Equivalent row-sample notation gives the familiar $w = (X^{⊤} X)^{- 1} X^{⊤} y$ .

Polynomial Regression as Linear Regression in Features

For degree $p$ , define

h_{w} (x) = k = 0 \sum p w_{k} x^{k} .

Introduce feature map

ϕ_{p} (x) = [1 x x^{2} \dots x^{p}]^{⊤} .

Then $h_{w} (x) = w^{⊤} ϕ_{p} (x)$ is linear in parameters. With design matrix

Φ = [ϕ_{p} (x_{1}) \dots ϕ_{p} (x_{m})] \in R^{(p + 1) \times m},

the same normal-equation solution applies:

w = (Φ Φ^{⊤})^{- 1} Φ y .

Thus polynomial regression is not a different optimiser; it is a different representation.

Multiple Polynomial Regression

Now let $x_{i} \in R^{d}$ and $y_{i} \in R$ . The lecture writes

h (x) = w_{0} + k = 1 \sum p w_{k}^{⊤} x^{⊙ k},

where $x^{⊙ k}$ denotes component-wise powers. This keeps linearity in parameter blocks $w_{k} \in R^{d}$ and yields a stacked parameter vector in $R^{p d + 1}$ .

More generally, one may use any polynomial basis (including interaction monomials) and solve the same linear least-squares system in the induced feature space.

Numerical Remarks

The inverse in normal equations requires non-singularity of the Gram matrix ( $X X^{⊤}$ or $Φ Φ^{⊤}$ ). In ill-conditioned settings, one uses pseudoinverse or regularised least squares:

w = (Φ Φ^{⊤} + λ I)^{- 1} Φ y,

which connects directly to L2 regularisation from the previous lecture.

Exam-Oriented Takeaways

For this lecture, you should be able to derive and explain:

least-squares objective for linear regression and its gradient;
scalar closed-form parameters $(w_{0}, w_{1})$ via vanishing derivatives;
matrix normal equations and orientation-dependent formula forms;
polynomial regression as linear regression after feature expansion;
multivariate polynomial parameterisation and dimension counting.

This lecture establishes the algebraic template reused by many later linear-in-parameter models.

Lecture 5 Basic Algorithms II

This lecture extends linear modelling from regression to classification, then addresses two central limitations: non-linearly separable categorical patterns and model misspecification. The resulting toolkit combines discriminative linear models, tree-based symbolic partitioning, and ensemble aggregation.

Linear Classifiers and Hyperplane Geometry

Given

S = {(x_{i}, y_{i})}_{i = 1}^{m}, x_{i} \in R^{d}, y_{i} \in {- 1, + 1},

we seek a hypothesis $h : R^{d} \to {- 1, + 1}$ of the form

h_{a, b} (x) = sign (x^{⊤} a - b),

with normal vector $a \in R^{d}$ and threshold $b \in R$ . The decision boundary is the hyperplane

H = {x \in R^{d} : x^{⊤} a = b} .

Using augmented coordinates $\overset{x}{ˉ} = (1, x^{⊤})^{⊤}$ and $w = (- b, a^{⊤})^{⊤}$ , this becomes $h_{w} (x) = sign (w^{⊤} \overset{x}{ˉ})$ .

A Simple Linear Classifier via Least Squares

Direct optimisation of the sign loss is discontinuous. The lecture motivates a surrogate chain

sign (z) \approx tanh (z) \approx z

near $z = 0$ via the expansion $tanh (z) = z - \frac{1}{3} z^{3} + O (z^{5})$ . This yields a least-squares classifier:

w^{*} \in ar g \overset{w}{^} min ∥ X^{⊤} \overset{w}{^} - y ∥_{2}^{2},

with

X = [1 x_{1} \dots \dots 1 x_{m}] \in R^{(d + 1) \times m}, y = (y_{1}, \dots, y_{m})^{⊤} \in {- 1, + 1}^{m} .

When $X X^{⊤}$ is invertible,

w^{*} = (X X^{⊤})^{- 1} X y,

and prediction is performed by $sign (w^{* ⊤} \overset{x}{ˉ})$ rather than by the raw linear score.

The Perceptron as Online Learning

The perceptron is presented as an online learning algorithm. At round $t$ , observe $x_{t}$ , predict $\overset{y}{^}_{t} = sign (w_{t}^{⊤} x_{t})$ , reveal $y_{t} \in {- 1, + 1}$ , and update only on mistakes:

w_{t + 1} = {w_{t}, w_{t} + y_{t} x_{t}, y_{t} w_{t}^{⊤} x_{t} > 0, y_{t} w_{t}^{⊤} x_{t} \leq 0.

Thus the algorithm is mistake-driven and data-stream compatible, unlike closed-form least squares requiring a full batch and matrix inversion.

Novikoff Convergence Theorem and Mistake Bound

Assume there exist $R, γ > 0$ and a unit vector $u$ such that

∥ x_{i} ∥ \leq R, y_{i} ⟨ x_{i}, u ⟩ \geq γ \forall i .

Then the perceptron makes at most

M \leq (\frac{R}{γ})^{2}

mistakes.

Proof structure used in class:

∥ w_{t + 1} ∥^{2} = ∥ w_{t} + y_{t} x_{t} ∥^{2} = ∥ w_{t} ∥^{2} + 2 y_{t} ⟨ w_{t}, x_{t} ⟩ + ∥ x_{t} ∥^{2} \leq ∥ w_{t} ∥^{2} + R^{2},

hence after $M$ mistakes, $∥ w_{M} ∥^{2} \leq M R^{2}$ . Conversely,

⟨ w_{t + 1}, u ⟩ = ⟨ w_{t}, u ⟩ + y_{t} ⟨ x_{t}, u ⟩ \geq ⟨ w_{t}, u ⟩ + γ,

so $⟨ w_{M}, u ⟩ \geq M γ$ and therefore $∥ w_{M} ∥ \geq M γ$ . Combining both inequalities gives

M^{2} γ^{2} \leq ∥ w_{M} ∥^{2} \leq M R^{2} \Rightarrow M \leq (R / γ)^{2} .

Logistic Regression

Logistic regression models class probability by

p_{w} (x) = σ (w^{⊤} x), σ (z) = \frac{1}{1 + e ^{- z}} .

Hence $p_{w} (x) = Pr (Y = 1 ∣ X = x, w)$ and $1 - p_{w} (x) = Pr (Y = 0 ∣ X = x, w)$ , giving a confidence score in addition to a label.

With $y_{i} \in {- 1, + 1}$ , one objective used in the lecture is

R_{S} (w) = i = 1 \sum m lo g (1 + exp (- y_{i} w^{⊤} x_{i})) .

Equivalent Bernoulli negative log-likelihood form (for $y_{i} \in {0, 1}$ ):

L (w) = - i = 1 \sum m [y_{i} lo g p_{w} (x_{i}) + (1 - y_{i}) lo g (1 - p_{w} (x_{i}))] .

The objective is convex, but there is no closed-form minimiser; optimisation is done with gradient methods.

Categorical Data, XOR, and Decision Trees

For Boolean attributes, linear separators can fail even after numeric encoding. The canonical obstruction is XOR-type behaviour, which is not linearly separable in the original two-bit space.

The lecture motivates decision trees via Boolean formula structure. Any Boolean function can be represented in disjunctive normal form (DNF):

h (x_{1}, \dots, x_{d}) = c \in C_{h} ⋁ j = 1 ⋀ d b_{c, j}, b_{c, j} \in {x_{j}, \neg x_{j}} .

Trees implement this logic as hierarchical tests: internal split nodes evaluate attributes; leaves store class outputs. A prediction follows one root-to-leaf path, so trees naturally classify unseen attribute combinations.

Split Quality, Top-Down Induction, and Stopping

To choose a split attribute $A$ , impurity is minimised. For class set $Y$ ,

I (A = a) = 1 - c \in Y \sum p (c ∣ A = a)^{2}, I (A) = a \sum p (A = a) I (A = a),

which is the Gini index criterion. Lower impurity indicates better class separation.

The induction pattern is recursive: start at the full sample, choose best split, partition into child subsets, remove/restrict used tests, and repeat. Without stopping, trees can memorise training data; practical constraints include maximum depth, maximum number of leaves, and minimum subset size. Impure terminal nodes typically predict the majority class.

Ensemble Methods: Bagging, Forests, and Boosting

Ensembling combines weak or unstable learners to improve predictive robustness.

For binary classifiers $H$ , majority voting is

h_{B} (x) = {+ 1, - 1, \sum_{h \in H} h (x) \geq 0, otherwise .

Bagging creates diversity by bootstrap resampling: train base models on sampled subsets $S_{s} \subseteq S$ (with replacement), then aggregate votes. Random forests instantiate this with decision trees (often with feature subspace sampling at splits).

Boosting builds an additive model by sequential error correction. In regression form, with residuals $r_{t} (x_{i}) = y_{i} - f_{t} (x_{i})$ ,

h_{t} \approx ar g h \in H min i = 1 \sum m (r_{t} (x_{i}) - h (x_{i}))^{2}, f_{t + 1} = f_{t} + η_{t} h_{t} .

A fixed $η_{t} = η$ is common in simple derivations; practical variants use line search

η_{t} \in ar g η^{'} \in R min L_{S} (f_{t} + η^{'} h_{t}) .

The name gradient boosting reflects that $h_{t}$ approximates a descent direction in function space.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

linear classifier geometry $h (x) = sign (x^{⊤} a - b)$ and augmented-vector form;
least-squares linear classification as a surrogate for sign optimisation;
perceptron mistake-driven update and Novikoff bound $M \leq (R / γ)^{2}$ ;
logistic regression objective, probabilistic interpretation, and convex optimisation properties;
why XOR defeats linear separation and how decision trees solve this via hierarchical tests;
Gini-based split selection, stopping criteria, and overfitting control;
bagging/majority vote, random forests, and boosting as additive residual correction.

This lecture closes the transition from single-model linear methods to structured and ensemble predictors that trade interpretability, statistical robustness, and representational capacity.

Lecture 6 Experiment Design and Evaluation

This lecture formalises the end-to-end experimental protocol for machine learning: choose hyperparameters, train models, estimate generalisation, and verify that performance reflects intended behaviour rather than artefacts. The core principle is methodological separation of concerns: model fitting, model selection, and final assessment must use disjoint information.

Hyperparameters and Their Role

A hyperparameter is a configuration variable fixed before or outside parameter optimisation (for example, tree depth limits, number of estimators, regularisation strengths, or search budgets). If a model class is written as

h_{θ, λ},

then $θ$ denotes trainable parameters, while $λ$ denotes hyperparameters selected by an outer procedure.

The lecture stresses that predictive performance can vary strongly with $λ$ , but the search space often scales combinatorially with the number of tuned dimensions.

Why Dataset Splitting Is Necessary

There are three logically distinct tasks:

train model parameters;
choose hyperparameters;
estimate out-of-sample performance.

Hence one should separate a dataset $S$ into disjoint subsets:

S = S_{train} \dot{\cup} S_{val} \dot{\cup} S_{test} .

Training uses $S_{train}$ , hyperparameter selection uses $S_{val}$ , and only the final locked model is evaluated on $S_{test}$ . Reusing test information during tuning biases the estimated generalisation error downward.

Holdout and k-fold Cross-Validation

For sufficiently large datasets, holdout splitting is often adequate. For smaller datasets, the variance of holdout estimates can be high due to small validation/test subsets.

k-fold cross-validation reduces estimator variance by partitioning data into folds $F_{1}, \dots, F_{k}$ and averaging fold scores:

\hat{M}_{CV} = \frac{1}{k} j = 1 \sum k M (h^{(- j)}, F_{j}),

where $h^{(- j)}$ is trained on $⋃_{i \neq = j} F_{i}$ . Stratified splits are preferred in classification so label proportions remain approximately stable across folds.

Data Contamination and Preprocessing Order

Any preprocessing step whose statistics depend on data (for example standardisation parameters) must be fitted on training data only. For a transform $T_{ϕ}$ with fitted state $ϕ$ :

ϕ \leftarrow fit (S_{train}), \tilde{S}_{train} = T_{ϕ} (S_{train}), \tilde{S}_{val} = T_{ϕ} (S_{val}), \tilde{S}_{test} = T_{ϕ} (S_{test}) .

Fitting transforms on full data leaks information from validation/test into training and invalidates evaluation.

Evaluation Metrics by Task

Evaluation is a mapping from predictions and ground truth to a scalar criterion, and must match the task objective.

For regression $(y_{i}, \overset{y}{^}_{i}) \in R^{2}$ :

MAE = \frac{1}{n} i = 1 \sum n ∣ y_{i} - \overset{y}{^}_{i} ∣, MSE = \frac{1}{n} i = 1 \sum n (y_{i} - \overset{y}{^}_{i})^{2}, RMSE = MSE .

For binary classification with confusion-matrix counts $(TP, FP, TN, FN)$ :

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, Precision = \frac{TP}{TP + FP},

Recall = \frac{TP}{TP + FN}, F1 = \frac{2 Precision \cdot Recall}{Precision + Recall}, FPR = \frac{FP}{FP + TN} .

The lecture highlights that metric choice is application-dependent, because false positives and false negatives may have asymmetric costs.

Hyperparameter Tuning as Outer Optimisation

Let $Λ$ be a hyperparameter search space and $M_{val} (λ)$ the validation score of the model trained with $λ$ . Tuning is

λ^{*} \in ar g λ \in Λ max M_{val} (λ)

(or $ar g min$ for loss metrics).

Two strategies discussed in class:

grid search: exhaustive over a finite grid $Λ_{grid}$ ;
random search: sample $λ \sim P_{Λ}$ for a fixed trial budget.

Grid search is simple but scales poorly in dimension; random search often uses budgets more efficiently when only a subset of hyperparameters strongly affects performance.

Baselines, Randomness, and Significance

Raw scores are uninterpretable without baselines. Typical baselines include majority-class prediction in classification and mean prediction in regression. A model should substantially exceed these references on unseen data.

Because many pipelines are stochastic (random splits, random forests, random search), reproducibility requires fixed seeds where possible. The lecture explicitly warns against tuning the random seed as if it were a model hyperparameter.

When comparing models across folds, apparent improvements may arise from sampling noise. The recommended protocol is to predefine a significance level $α$ (for example $0.05$ ), then test pairwise differences with a Student-t test on fold-wise scores.

Practical Reliability and Failure Modes

A high test score is necessary but not sufficient. Models may exploit spurious correlates (for example background artefacts) rather than task-causal structure. Therefore, evaluation must include sanity checks and targeted verification beyond aggregate metrics.

Practical guidance from the lecture:

improve and inspect data quality before increasing model complexity;
start with simple baselines and small prototypes;
inspect raw inputs and transformed features;
verify model behaviour under perturbations and plausible distribution shifts.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

why train/validation/test separation is required for unbiased generalisation estimates;
holdout versus k-fold cross-validation and when each is appropriate;
leakage-safe preprocessing order (fit on train only, apply elsewhere);
regression and classification metric formulas and their trade-offs;
grid versus random search as hyperparameter optimisation strategies;
the role of baselines, fixed seeds, and statistical significance testing;
why good test metrics still require behavioural verification.

This lecture provides the experimental discipline needed to convert algorithmic knowledge into trustworthy, reproducible machine learning practice.

Lecture 7 Machine Learning Theory

This lecture introduces the statistical-learning-theory core of the course: a formal data-generating model, explicit assumptions, and finite-sample guarantees for generalisation. The central question is not whether a model fits observed data, but when fitting implies low true risk.

Formal Learning Setup

Let $X$ be the instance space, $Y = {0, 1}$ the label space, and $D$ an unknown distribution on $X \times Y$ . A sample of size $m$ is

S = ((x_{1}, y_{1}), \dots, (x_{m}, y_{m})) \sim D^{m} .

For hypothesis class $H$ and loss

L (h, x, y) = 1 {h (x) \neq = y},

the empirical and true risks are

R_{S} (h) = \frac{1}{m} i = 1 \sum m L (h, x_{i}, y_{i}), R_{D} (h) = E_{(x, y) \sim D} [L (h, x, y)] .

A hypothesis is consistent if $R_{S} (h) = 0$ . The theoretical objective is to guarantee small $R_{D} (h)$ .

Hypothesis Classes and Learning Algorithms

The hypothesis class $H$ encodes representational bias (for example rectangles, linear separators, perceptrons, trees). It determines what can be learned in principle.

A learning algorithm is a map

A : (X \times Y)^{m} \to H,

returning $A (S) \in H$ . In particular, an ERM algorithm outputs

h_{S} \in ar g h \in H min R_{S} (h) .

Core Assumptions

Two assumptions are made explicit:

i.i.d. assumption: examples are independent draws from one fixed distribution $D$ ;
realisability assumption: there exists $h^{*} \in H$ with $R_{D} (h^{*}) = 0$ .

Realisability means no irreducible label noise relative to $H$ and perfect class-model alignment.

Realisable PAC Learning

PAC learning asks for probabilistic approximate correctness. A class $H$ is realisable PAC-learnable if there exists an algorithm $A$ and sample-complexity function $m_{H} : (0, 1)^{2} \to N$ such that for all $ε, δ \in (0, 1)$ , for all realisable $D$ , and all $m \geq m_{H} (ε, δ)$ ,

S \sim D^{m} Pr (R_{D} (A (S)) \leq ε) \geq 1 - δ .

Here $ε$ is the target error and $δ$ the failure probability.

Finite-Class Mistake Bound

If $∣ H ∣ < \infty$ and realisability holds, returning any consistent hypothesis yields PAC learnability with

m_{H} (ε, δ) \leq \frac{1}{ε} (ln ∣ H ∣ + ln \frac{1}{δ}) .

Proof skeleton from the lecture:

For a bad $h$ with $R_{D} (h) > ε$ ,

Pr (h is consistent on S) \leq (1 - ε)^{m} \leq e^{- ε m} .

Applying a union bound over all bad hypotheses,

Pr (\exists h \in H : R_{S} (h) = 0 \land R_{D} (h) > ε) \leq ∣ H ∣ e^{- ε m} .

Imposing this probability to be at most $δ$ gives the stated bound.

Papaya Example and Sample Complexity Interpretation

The lecture’s papaya toy calculation instantiates the bound numerically. With $ε = 0.1$ , $δ = 0.05$ , and a finite class size estimate $∣ H ∣ \leq 11816$ :

m \geq \frac{1}{0.1} (ln (11816) + ln \frac{1}{0.05}) \approx 124.

Interpretation: about 124 i.i.d. labelled examples suffice to guarantee error at most $10%$ with confidence at least $95%$ under realisability.

Infinite Classes, Shattering, and VC Dimension

Finite-cardinality bounds do not directly control infinite classes. The structural replacement is shattering.

For $X = {x_{1}, \dots, x_{n}} \subseteq X$ , class $H$ shatters $X$ if for every label vector $y \in {0, 1}^{n}$ there exists $h \in H$ with

h (x_{i}) = y_{i} \forall i \in {1, \dots, n} .

The VC dimension is

vc (H) = sup {∣ X ∣ : X \subseteq X is shattered by H},

with value $\infty$ if arbitrarily large finite sets are shattered.

Standard examples from class:

vc (H) \leq lo g_{2} ∣ H ∣ for finite H, vc (thresholds on R) = 1,

vc (axis-aligned rectangles in R^{2}) = 4, vc (linear separators in R^{d}) = d + 1.

VC Dimension and PAC Learnability

For classes with finite VC dimension $d = vc (H) < \infty$ , ERM is PAC-learnable with sample complexity of order

m_{H} (ε, δ) = O (\frac{d ln ( 1/ ε ) + ln ( 1/ δ )}{ε}) .

Thus capacity control moves from counting hypotheses to measuring combinatorial expressiveness via shattering.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

the formal tuple $(X, Y, D, H, L)$ and definitions of $R_{S}$ and $R_{D}$ ;
i.i.d. and realisability assumptions and why each is needed in guarantees;
the realisable PAC statement with $(ε, δ)$ quantifiers;
the finite-class sample-complexity derivation via consistency probability, union bound, and $1 - x \leq e^{- x}$ ;
shattering and VC dimension definitions and canonical examples;
the qualitative VC-based bound and its dependence on $d$ , $ε$ , and $δ$ .

This lecture establishes the theoretical bridge from empirical training success to explicit, distribution-level generalisation guarantees.

Lecture 8 Support Vector Machines and Kernel Methods

This lecture develops the max-margin principle for linear classification, then shows how kernelisation transfers linear methods to non-linear feature spaces without explicit coordinate lifting. The unifying theme is geometric control of generalisation through margin and function-class complexity.

From Linear Classifiers to Max-Margin Separation

For binary labels $y_{i} \in {- 1, + 1}$ and inputs $x_{i} \in R^{d}$ , a linear decision function is

f_{w} (x) = ⟨ w, x ⟩, h_{w} (x) = sign (f_{w} (x)) .

Max-margin geometry introduces a margin parameter $γ > 0$ and normalised constraints

y_{i} ⟨ x_{i}, \frac{w}{∥ w ∥} ⟩ \geq γ \forall i,

equivalently $y_{i} ⟨ w, x_{i} ⟩ \geq γ ∥ w ∥$ . The corresponding margin hyperplanes are

{x : ⟨ x, \frac{w}{∥ w ∥} ⟩ = γ}, {x : ⟨ x, \frac{w}{∥ w ∥} ⟩ = - γ} .

Points attaining equality are support vectors; they determine the boundary.

SVM Optimisation View

Under separability, the hard-margin SVM can be written (after standard scaling) as

w, b min \frac{1}{2} ∥ w ∥^{2} s.t. y_{i} (⟨ w, x_{i} ⟩ + b) \geq 1 \forall i .

For non-separable data, slack variables $ξ_{i} \geq 0$ yield the soft-margin form

w, b, ξ min \frac{1}{2} ∥ w ∥^{2} + C i = 1 \sum m ξ_{i} s.t. y_{i} (⟨ w, x_{i} ⟩ + b) \geq 1 - ξ_{i} .

This objective corresponds to regularised minimisation with hinge loss

ℓ_{hinge} (y, f (x)) = max {0, 1 - y f (x)},

which is a convex upper bound on classification error.

Why Perceptrons Fail on XOR and How to Fix It

The lecture revisits the XOR obstruction: single linear separators cannot represent certain Boolean patterns in original coordinates. Two remedies are contrasted:

layering (multi-layer compositions) to increase representational depth;
mapping (feature lifting) so the problem becomes linear in transformed space.

Kernel methods implement the second route while avoiding explicit high-dimensional coordinates.

Kernel Perceptron and the Kernel Trick

Perceptron-style predictors admit an expansion

f (x) = i = 1 \sum m α_{i} ⟨ x_{i}, x ⟩,

where $α_{i}$ are update-dependent coefficients. Replacing inner products by a kernel function $k$ gives

f_{k} (x) = i = 1 \sum m α_{i} k (x_{i}, x) .

If $k (x, x^{'}) = ⟨ ϕ (x), ϕ (x^{'}) ⟩_{H}$ for some feature map $ϕ$ , then this is linear learning in feature space $H$ without computing $ϕ (x)$ explicitly.

Example Kernel Construction

For $a, b \in R^{2}$ , consider

k (a, b) = ⟨ a, b ⟩^{2} = (a_{1} b_{1} + a_{2} b_{2})^{2} .

Expanding,

k (a, b) = a_{1}^{2} b_{1}^{2} + 2 a_{1} a_{2} b_{1} b_{2} + a_{2}^{2} b_{2}^{2} = ⟨ ϕ (a), ϕ (b)⟩

with one valid map

ϕ (x) = (x_{1}^{2}, 2 x_{1} x_{2}, x_{2}^{2})^{⊤} .

This illustrates polynomial lifting through kernel evaluation alone.

Common Kernels and Inductive Bias

The lecture highlights standard families:

k_{lin} (x, x^{'}) = ⟨ x, x^{'} ⟩,

k_{poly} (x, x^{'}) = (⟨ x, x^{'} ⟩ + c)^{d},

k_{rbf} (x, x^{'}) = exp (- γ ∥ x - x^{'} ∥^{2}) .

These correspond to different smoothness/locality assumptions and therefore different generalisation behaviour.

Kernel Matrices and PSD Structure

Given data $x_{1}, \dots, x_{n}$ , the kernel matrix (Gram matrix)

K \in R^{n \times n}, K_{ij} = k (x_{i}, x_{j})

must be symmetric and positive semidefinite:

c^{⊤} Kc \geq 0 \forall c \in R^{n} .

Equivalent PSD characterisations used in the appendix:

$c^{⊤} Kc \geq 0$ for all $c$ ;
factorisation $K = F^{⊤} F$ ;
eigendecomposition $K = U D U^{⊤}$ with diagonal $D_{ii} \geq 0$ .

Closure example: if $G, H ⪰ 0$ , then $G + H ⪰ 0$ , so sums of valid kernels remain valid kernels.

A Small Learning-Theory Bridge

The lecture connects kernel methods to generalisation bounds of the form

test error ≲ training surrogate error + capacity term + concentration term .

Because direct 0-1 risk minimisation is non-convex/intractable, one minimises a convex surrogate (hinge-type objective) plus norm control. In RKHS formulations, restricting $∥ f ∥_{H} \leq θ$ controls complexity (linked to Rademacher complexity bounds).

Applications and Structured Data

Kernel methods are presented as a generic similarity-learning framework beyond vectors: spectra, spatial statistics, and graphs. For graph kernels, naive subgraph-based kernels can be computationally hard; practical work uses tractable embeddings or restricted graph classes.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

max-margin constraints and the role of support vectors;
hard-margin versus soft-margin SVM objectives and the effect of $C$ ;
kernel perceptron form $f (x) = \sum_{i} α_{i} k (x_{i}, x)$ ;
explicit feature-map recovery for simple polynomial kernels;
PSD kernel-matrix criteria and their equivalent formulations;
why convex surrogate minimisation (hinge) is used instead of direct 0-1 loss;
how kernels extend linear methods to non-linear and structured domains.

This lecture completes the line from linear separators to high-capacity similarity-based methods with explicit geometric and statistical control.

Lecture 9 Probabilistic Aspects of Machine Learning

This lecture shifts from deterministic label prediction to probabilistic inference. The central objective is to model uncertainty, incorporate prior beliefs, and exploit distributional structure beyond pure empirical-risk language.

Why Probabilistic Modelling

Compared with strict point prediction, a probabilistic model additionally provides confidence and calibrated uncertainty for decisions. It also enables prior-informed inference when data are scarce and allows modelling assumptions on the joint distribution rather than only worst-case distribution-free guarantees.

Bayes Optimal Classifier

Given data-generating distribution $D$ over $X \times Y$ , the Bayes-optimal classifier is

f_{D} (x) = ar g y \in Y max P (Y = y ∣ X = x) .

For 0-1 loss it minimises true risk among all measurable classifiers $g : X \to Y$ :

R_{D} (f_{D}) \leq R_{D} (g) \forall g .

Two modelling routes follow:

discriminative learning: model $P (Y ∣ X)$ directly;
generative learning: model $P (X, Y)$ (or $P (X ∣ Y), P (Y)$ ), then infer posteriors.

Probability Identities Used for Inference

For random variables $A, B, C$ :

P (A) = b \sum P (A, B = b), P (A) = b, c \sum P (A, B = b, C = c),

P (A, B) = P (A ∣ B) P (B) = P (B ∣ A) P (A),

P (A, B, C) = P (A ∣ B, C) P (B ∣ C) P (C) .

Conditioning follows

P (A ∣ B = b) = \frac{P ( A , B = b )}{P ( B = b )} = \frac{P ( A , B = b )}{\sum _{a} P ( A = a , B = b )} .

Bayes Theorem and Coin-Type Example

For hypothesis variable $Θ$ and observed data $X$ ,

P (Θ ∣ X) = \frac{P ( X ∣ Θ ) P ( Θ )}{P ( X )} = \frac{P ( X ∣ Θ ) P ( Θ )}{\sum _{θ} P ( X ∣ Θ = θ ) P ( Θ = θ )} .

Interpretation in the urn example: prior over coin types is updated by toss outcomes into a posterior over coin types. Sequential evidence is incorporated recursively by reusing posterior-as-prior.

MLE, MAP, and Full Bayesian Inference

Three inference levels are distinguished:

θ_{MLE}^{*} \in ar g θ max P (X ∣ Θ = θ),

which ignores priors;

θ_{MAP}^{*} \in ar g θ max P (Θ = θ ∣ X) = ar g θ max P (X ∣ Θ = θ) P (Θ = θ),

which selects the posterior mode;

full Bayesian inference: retain the complete posterior $P (Θ ∣ X)$ rather than a point estimate.

Posterior Predictive Distribution

For observed data $D$ and future datum $\tilde{D}$ , the posterior predictive distribution is

P (\tilde{D} ∣ D) = θ \sum P (\tilde{D} ∣ Θ = θ) P (Θ = θ ∣ D),

or an integral in continuous-parameter models. It averages predictions across hypotheses weighted by posterior plausibility and therefore propagates parameter uncertainty into predictions.

Bayesian Networks and Factorised Joint Models

A Bayesian network is a DAG over variables ${X_{i}}_{i = 1}^{n}$ with factorisation

P (X_{1}, \dots, X_{n}) = i = 1 \prod n P (X_{i} ∣ pa (X_{i})) .

The graph encodes conditional independences that reduce representation size and inference cost. Without structure, a joint over $n$ binary variables needs $2^{n} - 1$ parameters; conditional-independence factorisations can be exponentially more compact.

For chain-like structures, marginals can be computed by ordered elimination instead of brute-force summation over all assignments, reducing practical complexity substantially.

Naive Bayes as a Special Bayesian Network

In Naive Bayes, class variable $Θ$ is a parent of all features $x_{i}$ , and features are conditionally independent given class:

P (X ∣ Θ) = i = 1 \prod n P (x_{i} ∣ Θ), P (X, Θ) = P (Θ) i = 1 \prod n P (x_{i} ∣ Θ) .

Prediction is

θ^{*} \in ar g θ max P (Θ = θ ∣ X) = ar g θ max P (Θ = θ) i = 1 \prod n P (x_{i} ∣ Θ = θ),

since $P (X)$ does not depend on $θ$ for argmax classification.

Empirical estimators from labelled data:

\hat{P} (Θ = θ) = \frac{# { j : y _{j} = θ }}{m}, \hat{P} (x_{i} = 1 ∣ Θ = θ) = \frac{# { j : y _{j} = θ , x _{j, i} = 1 }}{# { j : y _{j} = θ }} .

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

Bayes-optimal decision rule and why it minimises 0-1 risk;
discriminative versus generative modelling viewpoints;
Bayes theorem derivation and prior-likelihood-posterior decomposition;
conceptual and mathematical differences between MLE, MAP, and full posterior inference;
posterior predictive averaging over hypotheses;
Bayesian-network factorisation and why it improves representational/inference efficiency;
Naive Bayes assumptions, classification rule, and parameter estimation from counts.

This lecture introduces the probabilistic foundation for uncertainty-aware machine learning and structured probabilistic modelling.

Lecture 10 Dimensionality Reduction and Distance-Based Algorithms

This lecture studies two connected themes: reducing high-dimensional representations while preserving useful structure, and learning directly from distances in feature space. The practical motivation is that many real datasets are high-dimensional, noisy, and geometrically sparse.

Why Dimensionality Reduction

High-dimensional geometry behaves counterintuitively. A standard concentration intuition is that, as dimension grows, volume mass shifts in ways that make naive neighbourhood reasoning brittle (often called the curse of dimensionality). Consequently, reducing dimension can improve computation, denoising, and visual interpretability.

The lecture frames reduction as a trade-off between:

retaining informative variance/geometry;
minimising reconstruction or neighbourhood distortion;
reducing computational cost.

Principal Component Analysis

Let centred data matrix be $X \in R^{n \times p}$ and covariance

C = \frac{1}{n - 1} X^{⊤} X .

For a unit direction $w$ ( $∥ w ∥_{2} = 1$ ), one-dimensional projection is $Xw$ . PCA chooses $w$ maximising projected variance:

∥ w ∥_{2} = 1 max Var (Xw) = ∥ w ∥_{2} = 1 max w^{⊤} Cw .

Equivalent viewpoint: minimise rank-1 reconstruction error

∥ w ∥_{2} = 1 min ∥ X - Xw w^{⊤} ∥_{F}^{2} .

Using the Lagrangian

L (w, λ) = w^{⊤} Cw - λ (w^{⊤} w - 1),

stationarity gives $Cw = λ w$ , so the first principal component is the eigenvector of $C$ with largest eigenvalue. Higher components are orthogonal eigenvectors ordered by eigenvalue magnitude.

Random Projection and the Johnson-Lindenstrauss Principle

When $p$ is very large, PCA may be costly (eigendecomposition typically scales poorly with $p$ ). Random projection uses a random linear map to dimension $k ≪ p$ with complexity roughly $O (n p k)$ .

The Johnson-Lindenstrauss guarantee states that for $n$ points, one can choose

k = Ω (\frac{lo g n}{ε ^{2}})

so that pairwise distances are preserved up to $(1 \pm ε)$ multiplicative distortion with high probability.

Proof sketch idea in class: with Gaussian random matrix $R \in R^{k \times p}$ and map

f (x) = \frac{1}{k} R x,

the squared norm of projected difference vectors concentrates sharply around the original norm; then a union bound over all $(2 n)$ pairs yields the global preservation result.

t-SNE

t-SNE is a non-linear embedding method emphasising local neighbourhood preservation for visualisation. It converts distances into neighbour probabilities in high-dimensional space, then finds a low-dimensional configuration whose induced probabilities are close (via divergence minimisation).

Operational properties stressed in lecture:

non-linear;
stochastic/non-deterministic;
hyperparameter-sensitive (not parameter-free);
typically strong for visual cluster separation, but not a faithful global metric map.

k-Nearest Neighbours

k-NN predicts by local majority vote (classification) or local averaging (regression) among the $k$ closest training points under a chosen metric $d$ .

For binary classification,

\overset{y}{^} (x) = mode {y_{i} : x_{i} \in N_{k} (x)},

where $N_{k} (x)$ is the set of $k$ nearest neighbours of $x$ . The hyperparameter $k$ controls bias-variance behaviour: small $k$ gives high variance/local sensitivity, large $k$ gives smoother but potentially biased decisions.

k-Means Clustering

k-means partitions points into $k$ clusters by minimising within-cluster squared distances:

S_{1}, \dots, S_{k} min i = 1 \sum k x \in S_{i} \sum ∥ x - μ_{i} ∥_{2}^{2}, μ_{i} = \frac{1}{∣ S _{i} ∣} x \in S_{i} \sum x .

Lloyd-style iteration:

initialise $k$ centroids (often from random data points);
assign each point to nearest centroid;
recompute centroids as cluster means;
repeat until assignments stabilise.

The algorithm monotonically decreases objective value but can converge to local minima; initialisation therefore matters.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

why high-dimensional settings motivate dimension reduction;
PCA objective, covariance-eigenvector solution, and variance/reconstruction equivalence;
random projection complexity and JL scaling $k \sim lo g (n) / ε^{2}$ ;
conceptual difference between PCA and t-SNE (linear-global vs non-linear-local focus);
k-NN decision rule and effect of $k$ ;
k-means objective and alternating optimisation steps.

This lecture links geometric representation learning with simple, powerful distance-based prediction and clustering methods.

Lecture 11 Deep Neural Networks and Backpropagation

This lecture extends perceptron models to deep compositions of affine maps and non-linear activations. The central result is that depth increases representational capacity, but efficient training requires structured gradient computation via backpropagation.

From Linear Separation to Deep Non-Linear Models

Stacking purely linear layers collapses to one linear map, so depth alone does not resolve non-linearly separable tasks such as XOR. Let

h (x) = W_{k} (W_{k - 1} (\dots (W_{1} x))) .

Then $h (x) = W x$ for $W = W_{k} \dots W_{1}$ , hence no additional expressive class over a single linear classifier.

Introducing elementwise activation functions $σ$ yields a non-linear hypothesis class:

h (x) = W_{k} σ (W_{k - 1} σ (\dots σ (W_{1} x + b_{1}) \dots) + b_{k - 1}) + b_{k} .

Typical activations discussed in lecture are sigmoid, tanh, and ReLU.

Multi-Layer Perceptron Notation and Forward Pass

For layers $l = 1, \dots, L$ (input $l = 1$ , output $l = L$ ), define:

s^{l} = W^{l - 1} a^{l - 1} + b^{l - 1}, a^{l} = σ (s^{l}) .

Here $a^{1} = x$ is the input, $s^{l}$ is the pre-activation vector, and $a^{l}$ the activation vector. For a supervised target $y$ , prediction is $\overset{y}{^} = a^{L}$ .

This gives the standard multi-layer perceptron forward recursion used throughout deep learning architectures.

Universal Approximation Perspective

The lecture states the universal approximation theorem intuition: for suitable non-linear activations, a feed-forward network with at least one hidden layer and finitely many parameters can approximate continuous functions on compact domains arbitrarily well.

A representative form is:

\forall f \in C ([0, 1]^{n}), \forall ε > 0, \exists network g such that x \in [0, 1]^{n} sup ∣ f (x) - g (x) ∣ < ε .

This is an expressivity statement, not an optimisation guarantee: approximation existence does not imply easy training or strong generalisation.

Learning Objective and Loss Design

Training minimises a differentiable empirical objective

Θ min \frac{1}{n} i = 1 \sum n L (\overset{y}{^}_{i}, y_{i}), Θ = {W^{l}, b^{l}}_{l = 1}^{L - 1} .

Lecture emphasis:

for classification, cross-entropy is standard;
for regression, mean squared error is standard;
differentiability of the computational path is required for gradient-based learning.

From Gradient Descent to Backpropagation

Vanilla gradient descent updates parameters by

θ \leftarrow θ - η \nabla_{θ} L,

but naive symbolic differentiation of a deep composite is computationally redundant. Backpropagation exploits chain-rule factor reuse by traversing the network in reverse topological order.

For a scalar example $f (x, y, z) = (x + y) z$ , write $q = x + y$ , $f = q z$ . Then

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} = z \cdot 1, \frac{\partial f}{\partial y} = z, \frac{\partial f}{\partial z} = q .

The same decomposition principle scales to full neural networks.

Error Signals, Jacobians, and Parameter Gradients

Define layer error (pre-activation gradient)

δ^{l} := \frac{\partial L}{\partial s ^{l}} .

Backward recursion for hidden layers (vector form):

δ^{l - 1} = (W^{l - 1})^{⊤} δ^{l} ⊙ σ^{'} (s^{l - 1}),

which is the Jacobian-chain-rule form highlighted in lecture. Parameter gradients then become

\frac{\partial L}{\partial W ^{l - 1}} = δ^{l} (a^{l - 1})^{⊤}, \frac{\partial L}{\partial b ^{l - 1}} = δ^{l} .

Hence one training step is: forward pass, compute loss, backward pass for all $δ^{l}$ , then gradient update for all $(W^{l}, b^{l})$ .

Building-Block View of Deep Architectures

The lecture reframes deep learning as composition of modules with forward and backward interfaces. A linear feed-forward layer is one block; recurrent or attention/convolutional modules are alternative blocks. Architectures are formed by:

stacking blocks (depth);
adding skip connections;
branching into multiple outputs.

This abstraction explains how MLPs, CNNs, RNNs, transformers, and generative architectures share the same optimisation principle while differing in structural priors.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

why linear-layer stacking without activations is still linear;
forward equations $s^{l} = W^{l - 1} a^{l - 1} + b^{l - 1}$ and $a^{l} = σ (s^{l})$ ;
role of differentiable loss functions and typical choices by task;
backpropagation as efficient chain-rule reuse rather than a different optimiser;
error recursion $δ^{l - 1} = (W^{l - 1})^{⊤} δ^{l} ⊙ σ^{'} (s^{l - 1})$ ;
gradient formulas for weights and biases;
modular building-block view and its connection to modern deep architectures.

This lecture provides the computational core that makes high-capacity non-linear models trainable in practice.

Lecture 12 Deep Learning Architectures as Composable Building Blocks

This lecture reframes deep learning implementation as composition of reusable computational modules. The key abstraction is that each module exposes a forward map and a backward gradient map, so complex networks can be assembled while preserving end-to-end trainability.

Building-Block Abstraction

Let a block be a parametric map

z = ϕ (x; θ),

with local Jacobian $\partial z / \partial x$ and parameter gradient $\partial z / \partial θ$ . During backpropagation, upstream error $δ_{z} = \partial L / \partial z$ induces

δ_{x} = δ_{z} \frac{\partial z}{\partial x}, \frac{\partial L}{\partial θ} = δ_{z} \frac{\partial z}{\partial θ} .

Hence block internals are hidden once these interfaces are available.

Composition Patterns for Network Design

The lecture identifies three core composition patterns:

depth by stacking blocks sequentially (standard feed-forward composition);
skip connections by routing an earlier representation into a later block;
branching by producing multiple outputs from shared intermediate features.

A residual-style skip pattern can be written as

h_{l + 1} = h_{l} + F (h_{l}; θ_{l}),

illustrating that architecture design is fundamentally graph construction over differentiable operators.

Canonical Block Families

A linear feed-forward layer (MLP block) is the baseline module:

s = W x + b, a = σ (s) .

The lecture then positions additional families by inductive bias:

recurrent blocks for sequential dependencies;
convolutional blocks for local spatial structure;
attention-based blocks for content-dependent interactions.

All remain trainable with the same gradient-flow principle from Lecture 11.

Architecture-Level Taxonomy

The closing overview maps tasks to common deep architectures:

MLP for generic tabular/feed-forward settings;
RNN for sequential data;
CNN for vision-like local structure;
transformer models for attention-centric sequence modelling;
generative families including auto-encoder (AE), variational auto-encoder (VAE), and generative adversarial network (GAN);
graph neural architectures for relational/graph-structured data.

The common theme is not the optimisation algorithm, but the chosen structural prior encoded by block topology.

Differentiability and Practical Constraint

A direct implementation requirement is that optimisation by backpropagation needs differentiable computational paths. The lecture notes that non-differentiable deep models can exist, but then standard backpropagation is not directly applicable and alternative optimisation schemes are required.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

the forward/backward interface view of a neural block;
why stacking, skipping, and branching are the core architecture operators;
how different architecture families encode different inductive biases;
why all differentiable architectures share the same backpropagation machinery;
when differentiability assumptions fail and what that implies for training.

This lecture bridges mathematical training rules and practical neural-network system design via a unified compositional perspective.

Lecture 13 Bias and Fairness in AI

This lecture formalises fairness questions for machine-learning systems that directly affect human decisions. The core message is that fairness is not a single mathematical property but a family of partially incompatible constraints that must be selected with domain-specific policy judgement.

Decision Setting and Error Decomposition

Assume binary classification with predicted label $\hat{Y} \in {0, 1}$ , true label $Y \in {0, 1}$ , and protected attribute $A \in {0, 1}$ . For each group $A = a$ , fairness analysis starts from confusion-matrix quantities:

FPR_{a} = P (\hat{Y} = 1 ∣ Y = 0, A = a), FNR_{a} = P (\hat{Y} = 0 ∣ Y = 1, A = a) .

Overall error,

Err = \frac{FP + FN}{TN + FP + FN + TP},

is often insufficient because two systems with similar total error can distribute harms differently across groups.

COMPAS as a Motivating Case

The COMPAS recidivism-risk system illustrates that close aggregate error rates do not imply comparable group-level treatment. In the lecture data, error rates are numerically similar across groups, but false-positive and false-negative rates differ substantially between Black and White defendants. Hence fairness auditing must inspect conditional error profiles, not only aggregate accuracy.

This is a canonical example where stakeholders value different error types: defendants are disproportionately affected by high false-positive risk assignments, while judicial institutions may focus on false negatives related to missed high-risk cases.

Fairness Through Unawareness Is Insufficient

Removing protected attributes from the input does not generally remove discriminatory effects. Let $X$ denote non-sensitive features and $A$ protected status; in practice $X$ may contain proxies with high mutual information with $A$ (for example through location, spending, or administrative history). A learner can therefore reconstruct group information implicitly and reproduce disparate outcomes even when $A$ is omitted.

Group Fairness Criteria

The lecture contrasts three standard criteria.

Demographic parity (naive statistical parity):

P (\hat{Y} = 1 ∣ A = 1) = P (\hat{Y} = 1 ∣ A = 0) .

This equalises acceptance rates but ignores label prevalence and can be satisfied by trivial predictors.

Calibration of risk scores $S$ :

P (Y = 1 ∣ A = 1, S = s) = P (Y = 1 ∣ A = 0, S = s) = s .

This requires equal semantic meaning of the score across groups, but does not force equal error rates.

Error-rate balance:

P (\hat{Y} = 1 ∣ Y = 0, A = 1) = P (\hat{Y} = 1 ∣ Y = 0, A = 0),

P (\hat{Y} = 0 ∣ Y = 1, A = 1) = P (\hat{Y} = 0 ∣ Y = 1, A = 0) .

This enforces parity of false-positive and false-negative rates across groups.

Impossibility and Policy Choice

A central theoretical result is that, except for special cases (notably perfect prediction or equal base rates $P (Y = 1 ∣ A = 0) = P (Y = 1 ∣ A = 1)$ ), one cannot in general satisfy calibration and error-rate balance simultaneously. Therefore fairness design is not a purely technical optimisation target with a unique solution; it is a constrained policy choice requiring explicit normative priorities.

Individual Fairness Perspective

Group criteria can be complemented by individual fairness: similar individuals should receive similar predictions. If $d_{X} (x, x^{'})$ is a task-relevant metric on feature space, one seeks score mappings $S$ with controlled Lipschitz behaviour, informally

∣ S (x) - S (x^{'}) ∣ \leq L d_{X} (x, x^{'}) .

The major practical difficulty is defining a defensible similarity metric $d_{X}$ .

Fair Regression Construction

For linear regression with sensitive variable $s$ ,

y = b_{0} + b_{1} x_{1} + \dots + b_{k} x_{k} + β s,

the lecture presents a simple post-estimation adjustment:

\overset{y}{^}_{fair} = b_{0} + b_{1} x_{1} + \dots + b_{k} x_{k} + c,

where a group-invariant constant $c$ replaces the sensitive term. Typical choices include matching one reference group (for example $c = 0$ or $c = β$ depending on coding) or preserving the training-set mean effect via $c = β \overset{s}{ˉ}$ . This is an average correction, not a full guarantee of non-discrimination.

Bias, Cost Asymmetry, and Transparency

Dataset bias propagates into learned predictors and may even be amplified by modelling choices. In deployment, false positives and false negatives often carry asymmetric costs, so objective design should weight them according to domain risk rather than default symmetric error minimisation.

To support accountability, the lecture highlights model cards as structured documentation of intended use, evaluation setting, limitations, and ethical considerations.

Human Decision Makers and Risk Scores

The Kentucky Pretrial Risk Assessment (KPRA) example shows that “human in the loop” does not automatically neutralise bias. After a policy change that made non-financial bonds the default for low/moderate risk categories, evidence indicates stronger judicial deviation from recommendations for moderate-risk Black defendants than for comparable White defendants. Thus unequal outcomes can emerge from interaction between model outputs, policy defaults, and human discretion.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

why fairness analysis must separate aggregate error from group-conditional error;
formal definitions and trade-offs of demographic parity, calibration, and error-rate balance;
why fairness through unawareness is insufficient under proxy features;
the incompatibility insight linking base rates, calibration, and error-rate parity;
individual fairness as a metric-based constraint and its practical bottleneck;
the fair-regression constant-replacement construction and its limits;
why transparency artefacts and human-in-the-loop deployment are necessary but not sufficient safeguards.

This lecture shifts machine learning from pure predictive performance to socio-technical system design under explicit fairness constraints.

Lecture 14 Reinforcement Learning

This lecture introduces reinforcement learning (RL) as a sequential decision framework in which an agent interacts with an environment, receives rewards, and learns a policy that optimises long-run return rather than one-step prediction accuracy.

Core Agent-Environment Formalism

At time step $t$ , the agent observes state $S_{t}$ , chooses action $A_{t}$ , receives reward $R_{t + 1}$ , and transitions to $S_{t + 1}$ . For discount factor $γ \in [0, 1)$ , return is

G_{t} := R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + \dots = k = 0 \sum \infty γ^{k} R_{t + k + 1} .

The objective is to find policy $π$ maximising

E_{π} [G_{t} ∣ S_{t} = s] .

Hence RL optimisation is dynamic and trajectory-level, not i.i.d. sample-level as in supervised learning.

Value Functions and Bellman Structure

For a fixed policy $π$ , the state-action value function is

q_{π} (s, a) := E_{π} [G_{t} ∣ S_{t} = s, A_{t} = a] .

Control methods approximate optimal action values and derive greedy or near-greedy policies. The lecture emphasises that Bellman operators provide the structural recursion behind update rules.

Q-Learning

Given transition sample $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ , tabular Q-learning updates only the visited pair:

Q_{t + 1} (s_{t}, a_{t}) = Q_{t} (s_{t}, a_{t}) + α_{t} (r_{t + 1} + γ a max Q_{t} (s_{t + 1}, a) - Q_{t} (s_{t}, a_{t})),

with learning rate $α_{t} \in [0, 1]$ . This is a stochastic approximation / semi-gradient form with temporal-difference target

r_{t + 1} + γ a max Q_{t} (s_{t + 1}, a) .

Speedy Q-Learning

The lecture discusses Speedy Q-learning, which introduces a correction using both $Q_{t}$ and $Q_{t - 1}$ under the empirical Bellman operator to accelerate convergence behaviour. Conceptually, it modifies the basic TD recursion by adding a term that compensates for drift between successive iterates.

Distributional Reinforcement Learning

Instead of learning only expected return, distributional RL models the full return random variable

Z_{π} (s, a) := t = 0 \sum \infty γ^{t} R_{t}, Q_{π} (s, a) = E [Z_{π} (s, a)] .

This yields risk-sensitive information (tail behaviour, spread, multimodality) unavailable from expectation-only methods. In categorical distributional RL, one discretises return support and studies contraction of projected Bellman updates in suitable metrics (here, Cramer-type $ℓ_{2}$ distance).

PAC-Style Guarantees for Distributional RL

A key theoretical block in the lecture is a Probably Approximately Correct (PAC) analysis for a speedy categorical distributional update under assumptions including finite state-action space, bounded rewards, $γ < 1$ , and Robbins-Monro style step-size conditions.

The resulting high-probability bound has the canonical form

ℓ_{2} (η^{C}, η_{k}) \leq C (γ^{k} + \frac{lo g ( 1/ δ )}{k})

with probability at least $1 - δ$ , where $η_{k}$ is the iterate and $η^{C}$ the fixed point of the projected distributional Bellman operator. The first term captures contraction bias decay; the second captures stochastic estimation error.

Proof Architecture and Concentration

The proof strategy presented in the lecture follows five components: stability of iterates as probability measures, martingale-difference representation of error processes, inductive recursion bounds, maximal Hoeffding-Azuma concentration at atom level, and inversion to obtain explicit $ε (δ)$ confidence-style guarantees.

This connects RL convergence analysis to the same probability toolkit used in generalisation theory: concentration inequalities translate random-sample effects into explicit confidence bounds.

Policy Evaluation and Off-Policy Difficulty

After learning a policy, independent evaluation is essential. In particular, off-policy evaluation (estimating performance of a target policy from data generated by another behaviour policy) is highlighted as intrinsically difficult and variance-sensitive, motivating robust lower-confidence validation procedures in high-stakes applications.

Empirical Scope and Application Cases

The lecture situates RL across game domains (Atari, chess, Go), control (autonomous driving, industrial automation), and scientific/medical decision support. AlphaGo/AlphaZero-style results are used to illustrate that a single RL framework with minimal domain priors can achieve superhuman performance in complex sequential tasks.

Applied case studies from autonomous driving, cryogenic detector control, and sepsis treatment are used to motivate distributional RL and policy-evaluation methodology in safety-critical settings.

Exam-Oriented Takeaways

For this lecture, you should be able to reconstruct and justify:

the agent-environment loop and discounted return objective;
definitions of $q_{π}$ , Bellman-style targets, and the Q-learning update;
why Speedy Q-learning modifies standard temporal-difference recursion;
expectation-based versus distributional RL, including $Z_{π}$ and $Q_{π} = E [Z_{π}]$ ;
the structure of PAC-style high-probability error bounds in distributional RL;
where martingale concentration (Hoeffding-Azuma) enters the analysis;
why off-policy evaluation is difficult and central for reliable deployment.

This lecture extends machine learning from static prediction to sequential optimisation under uncertainty, where both asymptotic convergence and finite-sample reliability must be analysed.

Lukas' Notes

194.025 Introduction to Machine Learning

Lecture 1 Overview and Mathematical Framing

Problem Formulation

Conceptual Position of Machine Learning

Historical Arc and Methodological Consequence

Perceptron as Canonical Linear Learner

Noise, Plausible Hypotheses, and Margin Principle

Expressivity Limits and Non-Linearity

Toy Generalisation Guarantee from Threshold Learning

Exam-Oriented Takeaways

Lecture 2 Data and (Pre-)Processing

Data Model and Data Types

Data Analysis Before Pre-Processing

Numeric Pre-Processing and Scaling Maps

Categorical and String Encodings

Image and Text Feature Construction

Distances as Inductive Geometry

Graph Data and Representation Dependence

Weisfeiler-Leman Refinement and Graph Vectors

Exam-Oriented Takeaways

Lecture 3 Core Concepts of Machine Learning

Learning Paradigms

Supervised Tasks and Loss Functions

Empirical Risk Minimisation and Training Objective

Gradient Descent and Parameter Updates

Risk, Empirical Risk, and Generalisation

Bias-Variance and Fit Regimes

Regularisation as Complexity Control

Exam-Oriented Takeaways

Lecture 4 Basic Algorithms I

Why Fit Models

Linear Regression Setup

Optimisation by Gradient Descent

Closed-Form Least-Squares Solution

Matrix Form and Normal Equations

Polynomial Regression as Linear Regression in Features

Multiple Polynomial Regression

Numerical Remarks

Exam-Oriented Takeaways

Lecture 5 Basic Algorithms II

Linear Classifiers and Hyperplane Geometry

A Simple Linear Classifier via Least Squares

The Perceptron as Online Learning

Novikoff Convergence Theorem and Mistake Bound

Logistic Regression

Categorical Data, XOR, and Decision Trees

Split Quality, Top-Down Induction, and Stopping

Ensemble Methods: Bagging, Forests, and Boosting

Exam-Oriented Takeaways

Lecture 6 Experiment Design and Evaluation

Hyperparameters and Their Role

Why Dataset Splitting Is Necessary

Holdout and k-fold Cross-Validation

Data Contamination and Preprocessing Order

Evaluation Metrics by Task

Hyperparameter Tuning as Outer Optimisation

Baselines, Randomness, and Significance

Practical Reliability and Failure Modes

Exam-Oriented Takeaways

Lecture 7 Machine Learning Theory

Formal Learning Setup

Hypothesis Classes and Learning Algorithms

Core Assumptions

Realisable PAC Learning

Finite-Class Mistake Bound

Papaya Example and Sample Complexity Interpretation

Infinite Classes, Shattering, and VC Dimension

VC Dimension and PAC Learnability

Exam-Oriented Takeaways

Lecture 8 Support Vector Machines and Kernel Methods

From Linear Classifiers to Max-Margin Separation

SVM Optimisation View

Why Perceptrons Fail on XOR and How to Fix It

Kernel Perceptron and the Kernel Trick

Example Kernel Construction

Common Kernels and Inductive Bias

Kernel Matrices and PSD Structure

A Small Learning-Theory Bridge

Applications and Structured Data