Definition
Data
In the context of machine learning, data refers to the collection of observations or instances utilised to train and evaluate models. Formally, a dataset is a set where each instance resides in an instance space . Data serves as the experience from which a learning algorithm identifies patterns or models.
Data Modality
The structural nature of the data determines the appropriate model architecture and preprocessing pipeline.
Non-sequential Data: Observations where the ordering of instances does not convey information. This is typically represented as a static feature vector in a tabular format (e.g., medical records, physical dimensions).
Sequential Data: Data where the relative ordering of elements is critical to its semantic meaning. Examples include:
- Text: Sequences of characters or tokens.
- Audio/Speech: Temporal waveforms or spectral sequences.
- Time Series: Successive measurements over time (e.g., stock prices).
Representation
Regardless of modality, data must be transformed into a numerical format, typically a tensor or vector, to be processed by an algorithm. This transformation is achieved through feature engineering or the learning of latent embeddings.