k-Means Clustering

Definition

k-means clustering is an unsupervised partitioning method that divides a set of $d$-dimensional observations $\{x_{1}, \dots, x_{n}\}$ into $k \leq n$ disjoint sets $S = \{S_{1}, \dots, S_{k}\}$ so as to minimise the within-cluster sum of squares (WCSS). Formally, it seeks the partition $S$ that attains:

$\arg\min_{S} \sum_{i = 1}^{k} \sum_{x \in S_{i}} \lVert x - \mu_{i} \rVert^{2}$

where $\mu_{i}$ is the centroid (arithmetic mean) of the points in cluster $S_{i}$.
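As a small illustration of the objective, the sketch below evaluates the WCSS of a given partition. The helper names `centroid` and `wcss` and the four sample points are made up for this example; only the Python standard library is assumed.

```python
def centroid(cluster):
    """Arithmetic mean of the points in one cluster (points are tuples)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def wcss(partition):
    """Within-cluster sum of squares for a list of clusters."""
    total = 0.0
    for cluster in partition:
        mu = centroid(cluster)
        # Squared Euclidean distance of every point to its cluster centroid.
        total += sum(sum((a - b) ** 2 for a, b in zip(x, mu)) for x in cluster)
    return total


# Hypothetical data: the same four points, partitioned in two different ways.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(wcss(tight))  # 4.0
print(wcss(loose))  # 100.0
```

The partition that groups nearby points has the lower WCSS, which is exactly what the minimisation favours.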

Lloyd’s Algorithm

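Finding the exact optimum is NP-hard, so in practice the problem is tackled with Lloyd’s algorithm, an iterative heuristic that converges to a local (not necessarily global) minimum of the WCSS. Starting from $k$ initial centroids, it alternates between two steps until the assignments stop changing: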

Assignment Step: Each observation $x_{p}$ is assigned to the cluster whose centroid $\mu_{i}$ is closest in terms of (squared) Euclidean distance, with ties broken so that every observation belongs to exactly one cluster:

$S_{i}^{(t)} = \{ x_{p} : \lVert x_{p} - \mu_{i}^{(t)} \rVert^{2} \leq \lVert x_{p} - \mu_{j}^{(t)} \rVert^{2} \ \forall j,\ 1 \leq j \leq k \}$

Update Step: The centroids are recalculated based on the new cluster assignments:

$\mu_{i}^{(t + 1)} = \frac{1}{\lvert S_{i}^{(t)} \rvert} \sum_{x_{j} \in S_{i}^{(t)}} x_{j}$
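The two steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names are made up, ties go to the lowest cluster index, and an empty cluster simply keeps its previous centroid (one of several possible conventions, assumed here for simplicity).

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point.

    min() returns the first minimum, so ties go to the lowest index.
    """
    return [min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points]


def update(points, labels, centroids):
    """Update step: each centroid becomes the mean of its assigned points."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        else:
            new_centroids.append(
                tuple(sum(coords) / len(members) for coords in zip(*members)))
    return new_centroids


def lloyd(points, centroids, max_iter=100):
    """Alternate the two steps until the assignments stop changing."""
    centroids = [tuple(map(float, c)) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: no point changed cluster
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return centroids, labels


# Hypothetical data: two well-separated pairs of points.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, labels = lloyd(points, [(0, 0), (10, 0)])
print(centroids)  # [(0.0, 1.0), (10.0, 1.0)]
print(labels)     # [0, 0, 1, 1]
```

Comparing plain distances instead of squared distances gives the same assignment, since squaring does not change which centroid is nearest.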

Limitations and Variants

Geometric Assumptions: Because it minimises squared Euclidean distance to a centroid, k-means is biased toward compact, roughly spherical clusters of similar size. It generally fails to recover non-convex structures, such as the two-moon dataset, for which methods like spectral clustering are usually preferred.
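A quick way to see the geometric limitation is to run k-means with $k = 2$ on two concentric rings. The rings are used here as a simpler stand-in for the two-moon dataset; the data, radii, and function names are all made up for this sketch. With two centroids the boundary between clusters is a straight line, so no choice of centroids can separate an inner ring from the ring that surrounds it.

```python
from math import cos, sin, pi, dist


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius."""
    return [(radius * cos(2 * pi * i / n), radius * sin(2 * pi * i / n))
            for i in range(n)]


def two_means(points, centroids, max_iter=100):
    """Compact Lloyd iteration for k = 2; returns a label per point."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [min((0, 1), key=lambda i: dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in (0, 1):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


inner, outer = ring(1.0, 20), ring(5.0, 40)
points = inner + outer
truth = [0] * len(inner) + [1] * len(outer)  # ring membership

labels = two_means(points, [points[0], points[-1]])

# Agreement with ring membership, allowing for swapped label names.
matches = sum(l == t for l, t in zip(labels, truth))
accuracy = max(matches, len(points) - matches) / len(points)
print(accuracy < 1.0)  # True: the rings are never recovered exactly
```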

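Initialisation Sensitivity: Lloyd’s algorithm only converges to a local minimum of the WCSS, and which one it reaches depends on the initial placement of the centroids. Seeding schemes such as k-means++, which picks each new initial centroid with probability proportional to its squared distance from the nearest centroid already chosen, tend to give better starting points. Running the algorithm from several initialisations and keeping the lowest-WCSS result is another common remedy.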
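The k-means++ seeding rule can be sketched as below: the first centre is drawn uniformly at random, and each later centre is drawn with probability proportional to its squared distance from the nearest centre chosen so far. The function name `kmeanspp_init`, the `seed` parameter, and the sample points are assumptions made for this example.

```python
import random
from math import dist


def kmeanspp_init(points, k, seed=None):
    """Pick k initial centres from `points` using squared-distance weighting."""
    rng = random.Random(seed)
    centres = [rng.choice(points)]  # first centre: uniform at random
    while len(centres) < k:
        # Squared distance from each point to its nearest chosen centre.
        weights = [min(dist(p, c) ** 2 for c in centres) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centre: fall back to uniform.
            centres.append(rng.choice(points))
        else:
            centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


# Hypothetical data: three loose groups of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8), (5, 9)]
print(kmeanspp_init(points, k=3, seed=0))
```

Points that are already centres have weight zero and are not drawn again, so the seeds tend to be spread across the data rather than bunched together.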

Medoids vs Centroids: While k-means uses the arithmetic mean as the cluster centre, k-medoids restricts the centre to be one of the actual data points in the cluster, which makes it more robust to outliers.
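The difference is easy to see on a single cluster containing one outlier. In this sketch the medoid is taken to be the data point with the smallest total Euclidean distance to the other points, which is one common choice of dissimilarity; the function names and data are made up for the example.

```python
from math import dist


def mean_centre(points):
    """Arithmetic mean of the points (the k-means centre)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """Data point with the smallest total distance to all points."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


# Hypothetical cluster: four nearby points plus one distant outlier.
cluster = [(1, 1), (1, 2), (2, 1), (2, 2), (100, 100)]

print(mean_centre(cluster))  # roughly (21.2, 21.2), dragged toward the outlier
print(medoid(cluster))       # (2, 2), still inside the main group
```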

Lukas' Notes
