machine-learning clustering statistics
Definition
k-Means Clustering
k-means clustering is an unsupervised partitioning algorithm that seeks to divide a set of -dimensional observations into disjoint sets to minimise the within-cluster sum of squares (WCSS). Formally, the algorithm identifies the partition that satisfies:
where is the centroid (arithmetic mean) of the points in cluster .
Lloyd’s Algorithm
The optimisation problem is NP-hard; however, it is typically solved using the iterative Lloyd’s algorithm, which alternates between two steps:
Assignment Step: Each observation is assigned to the cluster whose centroid is closest in terms of Euclidean distance:
Update Step: The centroids are recalculated based on the new cluster assignments:
Limitations and Variants
Geometric Assumptions: k-means is biased toward finding spherical, equally-sized clusters. It fails to correctly identify non-convex structures, such as the two-moon dataset, where spectral clustering is preferred.
Initialisation Sensitivity: The algorithm is sensitive to the initial placement of centroids and may converge to a local minimum. Techniques such as k-means++ are utilised to ensure better initial seeding.
Medoids vs Centroids: While k-means uses the arithmetic mean as the cluster centre, k-medoids restricted the centre to be one of the actual data points from the set, providing robustness to outliers.