machine-learning clustering statistics

Definition

k-Means Clustering

k-means clustering is an unsupervised partitioning algorithm that seeks to divide a set of -dimensional observations into disjoint sets to minimise the within-cluster sum of squares (WCSS). Formally, the algorithm identifies the partition that satisfies:

where is the centroid (arithmetic mean) of the points in cluster .

Lloyd’s Algorithm

The optimisation problem is NP-hard; however, it is typically solved using the iterative Lloyd’s algorithm, which alternates between two steps:

Assignment Step: Each observation is assigned to the cluster whose centroid is closest in terms of Euclidean distance:

Update Step: The centroids are recalculated based on the new cluster assignments:

Limitations and Variants

Geometric Assumptions: k-means is biased toward finding spherical, equally-sized clusters. It fails to correctly identify non-convex structures, such as the two-moon dataset, where spectral clustering is preferred.

Initialisation Sensitivity: The algorithm is sensitive to the initial placement of centroids and may converge to a local minimum. Techniques such as k-means++ are utilised to ensure better initial seeding.

Medoids vs Centroids: While k-means uses the arithmetic mean as the cluster centre, k-medoids restricted the centre to be one of the actual data points from the set, providing robustness to outliers.