data-analysis preprocessing statistics
Definition
Imputation
Imputation is the statistical process of replacing missing data with estimated values so that standard learning algorithms that require complete datasets can be applied. Formally, given a dataset with missing entries, an imputation function maps it to a complete dataset, with the aim of preserving the underlying distribution as accurately as possible.
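The formal statement can be written out in symbols. The notation below is an assumption introduced for illustration only (the data matrix X, the missingness mask M, and the imputation function f are not named in the definition itself):

```latex
% X: dataset with n instances and d features, entries possibly missing (NA)
X \in (\mathbb{R} \cup \{\mathrm{NA}\})^{n \times d}, \qquad
M_{ij} =
\begin{cases}
  1 & \text{if } X_{ij} \text{ is missing} \\
  0 & \text{otherwise}
\end{cases}

% f: imputation function; observed entries are left unchanged
\hat{X} = f(X) \in \mathbb{R}^{n \times d}, \qquad
\hat{X}_{ij} = X_{ij} \ \text{ whenever } M_{ij} = 0
```

Under this notation, the strategies differ only in how f chooses the entries where M_{ij} = 1.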
Common Strategies
Mean/Median Imputation: Replacing missing numerical values with the arithmetic mean or median of the observed instances for that feature. While simple, it artificially reduces the variance of the imputed feature and ignores inter-feature correlations.
Mode Imputation: Used for categorical data, where the most frequent label in the column fills the missing entries.
Predictive Imputation: Modelling the feature with missing values as a target variable and using other available features to predict its value (e.g., using k-NN or linear regression). This preserves structural relationships between variables more effectively than simple statistical replacement.
Deletion: The removal of instances (rows) or features (columns) containing missing values. This is generally recommended only when the data are missing completely at random (MCAR) and the remaining sample size is sufficient; otherwise, the retained sample may no longer be representative.
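The strategies above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names (impute_column, drop_incomplete_rows, knn_impute) are hypothetical, missing entries are assumed to be encoded as None, and the k-NN variant assumes that only the target column has gaps.

```python
import math
import statistics

MISSING = None  # assumption: missing entries are encoded as None


def impute_column(values, strategy="mean"):
    """Fill missing entries with a statistic of the observed values."""
    observed = [v for v in values if v is not MISSING]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":  # suitable for categorical labels
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is MISSING else v for v in values]


def drop_incomplete_rows(rows):
    """Deletion of instances: keep only rows without missing entries."""
    return [row for row in rows if all(v is not MISSING for v in row)]


def knn_impute(rows, target, k=2):
    """Predictive imputation: fill column `target` with the mean of that
    column over the k nearest complete rows (Euclidean distance on the
    remaining columns). Assumes only the target column has gaps."""
    donors = drop_incomplete_rows(rows)
    result = []
    for row in rows:
        filled = list(row)
        if row[target] is MISSING:
            def distance(donor):
                return math.sqrt(sum(
                    (donor[j] - row[j]) ** 2
                    for j in range(len(row)) if j != target
                ))
            nearest = sorted(donors, key=distance)[:k]
            filled[target] = statistics.mean(d[target] for d in nearest)
        result.append(filled)
    return result


if __name__ == "__main__":
    print(impute_column([1, None, 2, 9], "mean"))
    print(impute_column(["red", None, "blue", "red"], "mode"))
    print(drop_incomplete_rows([[1, 2], [None, 3], [4, 5]]))
    print(knn_impute([[1.0, 10.0], [2.0, 20.0], [10.0, 100.0], [1.5, None]], target=1))
```

Note that the k-NN fill depends on the other feature of the incomplete row, whereas the mean fill would be the same for every missing entry in the column; this is the sense in which predictive imputation preserves relationships between variables.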