data-analysis

Definition

Data

In the context of machine learning, data refers to the collection of observations or instances utilised to train and evaluate models. Formally, a dataset is a set where each instance resides in an instance space . Data serves as the experience from which a learning algorithm identifies patterns or models.

Data Modality

The structural nature of the data determines the appropriate model architecture and preprocessing pipeline.

Non-sequential Data: Observations where the ordering of instances does not convey information. This is typically represented as a static feature vector in a tabular format (e.g., medical records, physical dimensions).

Sequential Data: Data where the relative ordering of elements is critical to its semantic meaning. Examples include:

  • Text: Sequences of characters or tokens.
  • Audio/Speech: Temporal waveforms or spectral sequences.
  • Time Series: Successive measurements over time (e.g., stock prices).

Representation

Regardless of modality, data must be transformed into a numerical format, typically a tensor or vector, to be processed by an algorithm. This transformation is achieved through feature engineering or the learning of latent embeddings.