machine-learning statistics

Definition

Dataset Splitting

Dataset splitting is the practice of partitioning a dataset into disjoint subsets to ensure unbiased model development and evaluation. To avoid data contamination (leakage), information used for preprocessing or parameter estimation must only be derived from the training portion.

Standard partitions include:

  1. Training Set (): Utilised for the estimation of model parameters .
  2. Validation Set (): Utilised for the selection of hyperparameters and model selection.
  3. Test Set (): Utilised exclusively for the final evaluation of the model’s generalisation performance. It must remain strictly unseen during all stages of training and tuning.