machine-learning unsupervised-learning visualisation
Definition
t-distributed Stochastic Neighbour Embedding
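t-distributed Stochastic Neighbour Embedding (t-SNE) is a non-linear, stochastic dimensionality reduction technique designed for visualising high-dimensional datasets in 2D or 3D. Formally, it converts the high-dimensional Euclidean distances between points $x_i$ and $x_j$ into conditional probabilities $p_{j|i}$ that represent similarities, using a Gaussian centred on $x_i$ with bandwidth $\sigma_i$:
$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$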
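The algorithm then finds a low-dimensional embedding $y_1, \dots, y_n$ that minimises the Kullback–Leibler divergence between the high-dimensional distribution $P$ and a low-dimensional distribution $Q$ based on a Student t-distribution with one degree of freedom:
$$C = \mathrm{KL}(P \parallel Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
Here $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$ symmetrises the conditional probabilities over the $n$ points, and $q_{ij} = \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1} / \sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}$ is the corresponding similarity between embedded points.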
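The two ingredients of the objective (Gaussian similarities in the input space, Student-t similarities in the embedding, compared by KL divergence) can be sketched in plain Python. This is a minimal illustration, not a full t-SNE implementation. It assumes a single shared bandwidth `sigma` instead of per-point bandwidths, and it only evaluates the cost for two hand-made embeddings without optimising anything. The function names and the toy data are invented for this example.

```python
import math


def conditional_probabilities(points, sigma=1.0):
    """P[i][j] = p_{j|i}: Gaussian similarity of j to i, normalised over j != i.

    Assumption: one shared bandwidth `sigma` for every point (t-SNE proper
    tunes a separate bandwidth per point from the perplexity).
    """
    n = len(points)
    P = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


def joint_probabilities(cond):
    """Symmetrise: p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]


def student_t_similarities(embedding):
    """q_ij proportional to 1 / (1 + ||y_i - y_j||^2), normalised over all pairs."""
    n = len(embedding)
    W = [
        [0.0 if i == j else 1.0 / (1.0 + math.dist(embedding[i], embedding[j]) ** 2)
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(map(sum, W))
    return [[w / total for w in row] for row in W]


def kl_divergence(P, Q):
    """KL(P || Q) summed over all pairs with p_ij > 0 (skips the diagonal)."""
    return sum(
        p * math.log(p / q)
        for row_p, row_q in zip(P, Q)
        for p, q in zip(row_p, row_q)
        if p > 0.0
    )


# Toy data: two tight clusters of three points each in 3D.
high_dim = [(0, 0, 0), (0.5, 0, 0), (0, 0.5, 0), (4, 4, 4), (4.5, 4, 4), (4, 4.5, 4)]
P = joint_probabilities(conditional_probabilities(high_dim))

# One 2D layout that keeps the clusters together, one that swaps two points across them.
faithful = [(0, 0), (0.5, 0), (0, 0.5), (4, 4), (4.5, 4), (4, 4.5)]
scrambled = [(0, 0), (4, 4), (0, 0.5), (0.5, 0), (4.5, 4), (4, 4.5)]

kl_faithful = kl_divergence(P, student_t_similarities(faithful))
kl_scrambled = kl_divergence(P, student_t_similarities(scrambled))
print(f"KL faithful:  {kl_faithful:.4f}")
print(f"KL scrambled: {kl_scrambled:.4f}")
```

The layout that keeps neighbours together should score the lower cost, which is the quantity gradient descent drives down when it moves the embedded points.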
Comparison with PCA
Unlike PCA, which is a deterministic, linear projection onto the directions of greatest global variance, t-SNE is non-deterministic and non-linear. It prioritises local structure: points that are close in the high-dimensional space tend to remain close in the low-dimensional embedding, which often makes it better than PCA at revealing clusters and sub-manifolds. The trade-off is that distances between well-separated groups in a t-SNE plot are not reliably meaningful.
Computational Considerations
Stochastic Nature: The objective is non-convex and is typically minimised by gradient descent from a random initialisation, so multiple runs of t-SNE may produce different visual representations unless the random seed is fixed.
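Parameter Sensitivity: The results depend strongly on the perplexity parameter, which can be read as a smooth measure of the effective number of neighbours each point has and so balances the model’s focus between local and global aspects of the data. Typical values lie between 5 and 50.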
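To make the role of perplexity concrete: in the original formulation it is defined as $2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy (in bits) of the conditional distribution around point $i$, and each point's Gaussian bandwidth is found by a binary search so that this value matches the user's setting. The sketch below illustrates that search for a single point in plain Python. The function names `row_perplexity` and `calibrate_sigma`, the search bounds, and the sample distances are assumptions made for this example, not part of any library.

```python
import math


def row_perplexity(sq_dists, sigma):
    """Perplexity 2**H of the Gaussian neighbour distribution for one point.

    `sq_dists` holds the squared distances from that point to every other point.
    """
    d_min = min(sq_dists)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(d - d_min) / (2 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2 ** entropy_bits


def calibrate_sigma(sq_dists, target_perplexity, lo=1e-3, hi=1e3, iters=60):
    """Bisection on a log scale; perplexity grows as sigma grows."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if row_perplexity(sq_dists, mid) < target_perplexity:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


# Illustrative squared distances: three near neighbours, four distant points.
sq_dists = [0.1, 0.2, 0.4, 9.0, 10.0, 12.0, 15.0]

sigma_local = calibrate_sigma(sq_dists, target_perplexity=3.0)
sigma_global = calibrate_sigma(sq_dists, target_perplexity=6.0)
print(f"sigma for perplexity 3: {sigma_local:.3f}")
print(f"sigma for perplexity 6: {sigma_global:.3f}")
```

A low perplexity yields a narrow Gaussian that attends only to the nearest neighbours, while a higher perplexity widens it so that distant points also contribute, which is why the same data can look quite different across settings.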