Machine-Learning Preprocessing

Definition

Feature Engineering

Feature engineering is the process of utilising domain knowledge to extract, transform, or select attributes from raw data that improve the performance of a learning algorithm. Formally, it involves defining a mapping into a representation space that more effectively captures the relevant structural properties of the problem.
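As a minimal formalisation (the notation here is introduced for illustration and is not taken from the original text): if X_raw denotes the raw input space, feature engineering specifies a map φ : X_raw → ℝ^d, and the learning algorithm is then trained on the transformed pairs (φ(x_i), y_i) rather than on the raw inputs x_i.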

Images

The transformation of raw pixel grids into informative feature vectors typically follows a two-stage pipeline (a short code sketch appears after the list of techniques below):

Preprocessing: Initial operations to normalise the data, including resizing to a fixed resolution, greyscale conversion, and geometric transformations (e.g., rotation, mirroring) that encourage invariance to such changes.

Feature Extraction: The application of hand-crafted descriptors to identify salient patterns. Common techniques include:

  • HOG (Histogram of Oriented Gradients): Captures object shape and structure.
  • SIFT (Scale-Invariant Feature Transform): Identifies points of interest invariant to scale and rotation.
  • Haar Cascades: Apply rectangular Haar-like features in a cascaded classifier for fast object detection (e.g., face detection).
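As a concrete illustration of this two-stage pipeline, the sketch below uses scikit-image (an assumed dependency); the sample image, target resolution, and HOG parameters are illustrative choices, not prescribed by the text:

    from skimage import data
    from skimage.color import rgb2gray
    from skimage.feature import hog
    from skimage.transform import resize

    # Preprocessing: greyscale conversion and resizing to a fixed resolution
    image = rgb2gray(data.astronaut())   # bundled sample RGB image -> greyscale
    image = resize(image, (128, 64))     # fixed 128x64 resolution

    # Feature extraction: HOG descriptor capturing local shape and structure
    features = hog(image,
                   orientations=9,
                   pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
    print(features.shape)                # fixed-length feature vector

The resulting fixed-length vector can then be passed to any standard classifier.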

Text

Natural Language Processing (NLP) requires the conversion of unstructured sequences into numerical formats through systematic pipelines:

Preprocessing: Includes the removal of stop-words (common, low-information words like “the”, “is”) and vocabulary reduction (a short code sketch appears after the list below):

  • Stemming: The heuristic removal of suffixes to identify word roots (e.g., “transporting” → “transport”).
  • Lemmatisation: Context-aware conversion of words to their base dictionary form (e.g., “better” → “good”).
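A minimal sketch of these preprocessing steps using NLTK (an assumed dependency); the token list is illustrative, and the 'stopwords' and 'wordnet' corpora must be downloaded beforehand:

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # one-off downloads: nltk.download('stopwords'); nltk.download('wordnet')

    tokens = ["the", "trucks", "are", "transporting", "better", "goods"]

    # Stop-word removal: discard common, low-information words
    content_words = [w for w in tokens if w not in stopwords.words("english")]

    # Stemming: heuristic suffix stripping
    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in content_words])   # "transporting" -> "transport"

    # Lemmatisation: dictionary-based and part-of-speech aware
    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("better", pos="a"))    # adjective "better" -> "good"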

Vectorisation: Mapping words or sequences to the instance space (a short code sketch appears after the list below):

  • One-Hot Encoding: Represents unique words as binary basis vectors.
  • TF-IDF (Term Frequency-Inverse Document Frequency): Quantifies word importance relative to a corpus.
  • Embeddings: Learned mappings (e.g., Word2Vec) that capture semantic relationships in continuous latent spaces.
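A minimal sketch of TF-IDF vectorisation with scikit-learn (an assumed dependency); the toy corpus is illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets",
    ]

    # Learn the vocabulary and compute TF-IDF weights for each document
    vectorizer = TfidfVectorizer(stop_words="english")   # built-in stop-word removal
    X = vectorizer.fit_transform(corpus)                  # sparse document-term matrix

    print(vectorizer.get_feature_names_out())             # learned vocabulary
    print(X.shape)                                         # (n_documents, n_terms)

One-hot encoding or learned embeddings would replace the TF-IDF step with, respectively, binary indicator vectors or a lookup into trained vectors such as Word2Vec.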

Feature Importance

Feature importance refers to techniques used to quantify the contribution of each individual attribute to the model’s predictive performance.

Global Importance: Evaluates the overall impact of a feature across the entire dataset. In decision trees and Random Forests, this is often calculated based on the total reduction in the Gini impurity (Gini importance) provided by that feature.
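A minimal sketch of Gini importance with scikit-learn's Random Forest; the iris dataset and hyperparameters are illustrative choices:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(data.data, data.target)

    # feature_importances_ aggregates the Gini impurity reduction attributable
    # to each feature across all trees, normalised to sum to 1
    for name, score in zip(data.feature_names, model.feature_importances_):
        print(f"{name}: {score:.3f}")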

Permutation Importance: A model-agnostic technique that measures the increase in the model’s prediction error after randomly shuffling the values of a single feature. A significant increase in error indicates that the model relied heavily on that feature for generalisation.
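A minimal sketch of permutation importance with scikit-learn, reusing the same illustrative dataset and model as above; the importance is measured on a held-out split:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    data = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Shuffle each feature in turn and record the drop in test accuracy;
    # a large drop means the model relied heavily on that feature
    result = permutation_importance(model, X_test, y_test,
                                    n_repeats=10, random_state=0)
    for name, mean in zip(data.feature_names, result.importances_mean):
        print(f"{name}: {mean:.3f}")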