Imputation

Definition

Imputation is the statistical process of replacing missing data with estimated values so that standard learning algorithms, which generally require complete datasets, can be applied. Formally, for a dataset $D$ with missing entries, an imputation function $f$ produces a complete dataset $\hat{D} = f(D)$ such that the underlying joint distribution $P(X, Y)$ of features $X$ and target $Y$ is preserved as accurately as possible.

Common Strategies

Mean/Median Imputation: Replacing missing numerical values with the arithmetic mean or median of the observed instances for that feature. While simple, it shrinks the variance of the imputed feature, because every filled entry sits at the centre of the observed values, and it ignores inter-feature correlations.

Mode Imputation: Used for categorical data, where the most frequent label in the column fills the missing entries.
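
The two simple strategies above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not a reference implementation: the helper `impute_column`, the convention of marking missing entries with `None`, and the toy columns `ages` and `colours` are all assumptions made for the example.

```python
import statistics

def impute_column(values, strategy="mean"):
    """Fill None entries using a statistic of the observed values (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Toy data (assumed): None marks a missing entry.
ages = [20, None, 30, 40, None]
colours = ["red", "blue", None, "red"]

ages_mean = impute_column(ages, "mean")
colours_mode = impute_column(colours, "mode")

# Compare the spread of the observed values with the spread after imputation.
var_observed = statistics.pvariance([v for v in ages if v is not None])
var_imputed = statistics.pvariance(ages_mean)
print(ages_mean, colours_mode, var_observed, var_imputed)
```

In this toy column the population variance falls after mean imputation, which is the variance reduction noted for this strategy.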

Predictive Imputation: Modelling the feature with missing values as a target variable and using other available features to predict its value (e.g., using k-NN or linear regression). This preserves structural relationships between variables more effectively than simple statistical replacement.
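
As a rough sketch of the k-NN variant, the snippet below fills a missing value with the average of that feature over the k most similar complete rows. The function `knn_impute`, the use of Euclidean distance, the `None` marker, and the height/weight toy data are assumptions for illustration; it also assumes that every column other than the target is fully observed.

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill None in column `target_idx` with the mean of that column over the
    k nearest complete rows, measured on the remaining columns (hypothetical helper)."""
    complete = [r for r in rows if r[target_idx] is not None]
    if not complete:
        raise ValueError("no complete rows to learn from")

    def distance(a, b):
        # Euclidean distance over every column except the one being imputed.
        return math.sqrt(sum((a[i] - b[i]) ** 2
                             for i in range(len(a)) if i != target_idx))

    imputed = []
    for row in rows:
        filled = list(row)
        if filled[target_idx] is None:
            neighbours = sorted(complete, key=lambda r: distance(row, r))[:k]
            filled[target_idx] = sum(r[target_idx] for r in neighbours) / len(neighbours)
        imputed.append(filled)
    return imputed

# Toy data (assumed): columns are height in cm and weight in kg.
rows = [
    [150.0, 50.0],
    [152.0, 52.0],
    [180.0, 80.0],
    [182.0, 82.0],
    [181.0, None],   # weight is missing
]
filled = knn_impute(rows, target_idx=1, k=2)
print(filled[-1])
```

Here the two nearest rows by height are the 180 cm and 182 cm rows, so the estimate follows the height-weight relationship, whereas the column mean of the observed weights (66 kg) would not.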

Deletion: The removal of instances (rows) or features (columns) containing missing values. This is only recommended when the data is missing completely at random (MCAR), so that the retained rows remain an unbiased sample, and when the remaining sample size is sufficient.
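
Row-wise and column-wise deletion can be sketched as follows. The helper names, the `None` marker, and the toy table are assumptions for illustration. Checking the retained fraction is one simple way to judge whether the remaining sample is still large enough.

```python
def drop_incomplete_rows(rows):
    """Keep only the rows that have no missing (None) entries."""
    return [list(r) for r in rows if all(v is not None for v in r)]

def drop_incomplete_columns(rows):
    """Keep only the columns that are observed in every row."""
    keep = [j for j in range(len(rows[0]))
            if all(r[j] is not None for r in rows)]
    return [[r[j] for j in keep] for r in rows]

# Toy table (assumed): id, score, label.
table = [
    [1, 2.0, "a"],
    [2, None, "b"],
    [3, 4.0, "c"],
    [4, 5.0, None],
]

complete_rows = drop_incomplete_rows(table)
complete_cols = drop_incomplete_columns(table)
retained = len(complete_rows) / len(table)
print(complete_rows, complete_cols, retained)
```

In this example, row deletion discards half of the instances, while column deletion keeps every instance but only the `id` column.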

Lukas' Notes
