Definition
Numerical Data
Numerical data (or quantitative data) is a data type consisting of numerical values that represent measurable quantities. Formally, numerical data is characterised by values from a set where arithmetic operations (e.g., addition, subtraction) and ordinal comparisons are well-defined. Numerical data is further categorised into two fundamental types:
- Discrete Data: Observations that take distinct, separate values (typically integers), such as counts.
- Continuous Data: Observations that can take any value within a given interval, such as measurements of length or time.
Discrete vs. Continuous
Discrete
These variables are restricted to a countable set of values. In the context of the course examples, this includes the number of items (e.g., {1, 2, 3} apples) or specific timestamps.
Continuous
These variables reside in an uncountable set, typically the real numbers ℝ or a sub-interval thereof. This includes physical dimensions such as height, weight, or the diameter of a mushroom cap.
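The distinction can be illustrated with a minimal sketch. The values and variable names below are hypothetical and not taken from the course material; the sketch only shows that arithmetic and comparisons are well-defined for both kinds, while only the continuous variable admits values between any two observations.

```python
from statistics import mean

# Hypothetical example values, chosen only for illustration.
apple_counts = [1, 2, 3, 2]            # discrete: counts, integers only
cap_diameters_cm = [4.2, 5.75, 3.1]    # continuous: any value in an interval

# Arithmetic and comparisons are well-defined for both kinds.
total_apples = sum(apple_counts)
mean_diameter = mean(cap_diameters_cm)
largest_cap = max(cap_diameters_cm)

# A discrete variable only takes separate values ...
all_counts_integral = all(isinstance(c, int) for c in apple_counts)

# ... whereas between two continuous values another valid value exists.
midpoint = (cap_diameters_cm[0] + cap_diameters_cm[1]) / 2
```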
Preprocessing
- Deal with missing values (delete or impute)
- Discretise where necessary/applicable (e.g. put age values into age groups, such as 15-25)
  - Especially useful for numeric labels: discretising turns a regression problem into a classification problem, and coarser bins reduce the number of classes
- Scale your data (e.g. min-max scaling, mean normalisation, …)
  - Scaling should be done at the feature level, not at the data set level
  - Each feature/attribute is scaled individually, so that no statistics from other features influence the scaling
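The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`impute_mean`, `discretise`, `min_max_scale`), the toy data, and the age-group boundaries are all hypothetical. Missing values are assumed to be encoded as `None`, mean imputation stands in for "impute", and min-max scaling is taken to be x' = (x - min) / (max - min), computed per feature.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def discretise(value, upper_edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    labels has one more entry than upper_edges; the last label is the
    open-ended top bin.
    """
    for upper, label in zip(upper_edges, labels):
        if value <= upper:
            return label
    return labels[-1]

def min_max_scale(values):
    """Scale ONE feature to [0, 1] using only that feature's min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy data set: one list per feature, None marks a missing value.
data = {
    "age": [20, None, 30, 40],
    "height_cm": [160.0, 180.0, None, 200.0],
}

# 1. Handle missing values (here: impute, per feature).
imputed = {name: impute_mean(col) for name, col in data.items()}

# 2. Discretise where applicable (hypothetical age groups).
age_labels = ["up to 14", "15-25", "26-40", "over 40"]
age_groups = [discretise(a, [14, 25, 40], age_labels) for a in imputed["age"]]

# 3. Scale each feature individually, never with data-set-wide statistics.
scaled = {name: min_max_scale(col) for name, col in imputed.items()}
```

Because each column is passed to `min_max_scale` on its own, the large height values do not compress the age values; scaling with one global minimum and maximum across the whole data set would let one feature's range dominate the other.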