Decision Tree
Definition
A decision tree is a non-parametric supervised learning model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Formally, it is a directed tree structure consisting of the following components (a minimal code sketch follows the list):
- Split Nodes: Represent a test on an attribute (e.g., “is $x_j \le \theta$?” for a numeric feature, or “is $x_j = v$?” for a categorical one).
- Edges: Represent the outcome of the test, leading to child nodes.
- Leaf Nodes: Store the final predicted value or class label.
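To make this concrete, here is a minimal sketch of such a node in Python; the field names (`feature`, `threshold`, `prediction`) are illustrative choices, not a standard interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Split node: which attribute to test and the threshold to compare against.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    # Edges: outcomes of the test, each leading to a child node.
    left: Optional["Node"] = None   # taken when the test is satisfied
    right: Optional["Node"] = None  # taken otherwise
    # Leaf node: the stored class label or regression value.
    prediction: Optional[float] = None

    def is_leaf(self) -> bool:
        return self.prediction is not None


def predict(node: Node, x) -> float:
    """Route an example down the tree until a leaf is reached."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction
```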
Representational Expressivity
Any Boolean function can be represented by a decision tree. This is demonstrated by converting the function into its Disjunctive Normal Form (DNF), where each product term corresponds to a path from the root to a leaf with label $1$ (all remaining leaves are labeled $0$). As a result, a sufficiently deep decision tree can achieve zero empirical risk on any consistent binary training set, which typically necessitates stopping criteria to prevent overfitting.
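As a quick illustration of this point (a sketch assuming scikit-learn's `DecisionTreeClassifier` as the implementation), a tree grown without any stopping criterion memorizes even random labels, driving training error to zero:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30))  # random Boolean feature vectors
y = rng.integers(0, 2, size=200)        # random labels: no signal to learn

clf = DecisionTreeClassifier(random_state=0)  # no depth limit or stopping rule
clf.fit(X, y)
print(clf.score(X, y))  # 1.0: zero empirical risk, pure memorization
```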
Top-Down Induction
Trees are typically constructed recursively using a greedy approach (e.g., ID3, C4.5, CART). At each node, the process identifies the attribute that achieves the optimal value of a splitting criterion $C$, such as the Gini index or entropy.
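For reference, here is a minimal sketch of the two criteria just mentioned, computed from empirical class frequencies (the function names are illustrative):

```python
import numpy as np


def entropy(labels: np.ndarray) -> float:
    """Shannon entropy H(S) = -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def gini(labels: np.ndarray) -> float:
    """Gini index G(S) = 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())


def information_gain(labels: np.ndarray, partitions: list) -> float:
    """Entropy reduction achieved by splitting `labels` into `partitions`."""
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - weighted
```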
Construction Algorithm
Given a dataset $S$ and a splitting criterion $C$ (a recursive sketch in code follows the steps):
Step 1 (Initialization): Assign all examples $(x, y) \in S$ to the root node.
Step 2 (Attribute Evaluation): For each attribute $a$ in the attribute set $A$, partition the examples at the current node by the attribute’s value and compute the criterion $C$ for this partitioning.
Step 3 (Splitting): Identify the attribute $a^*$ that achieves the optimal value of $C$ and set it as the splitting feature. If the node is already pure (e.g., $C = 0$), designate it as a leaf and return.
Step 4 (Recursion): Create child nodes for each partition. For each child node, if it is pure, designate it as a leaf; otherwise, repeat from Step 2 using the subset of data in the child node.
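Putting the four steps together, here is a compact recursive sketch of the procedure for categorical attributes, reusing the `information_gain` helper from the previous sketch; the dictionary-based tree representation is an illustrative choice, not part of any standard.

```python
import numpy as np
from collections import Counter


def build_tree(X: np.ndarray, y: np.ndarray, attributes: list) -> dict:
    """ID3-style top-down induction over categorical attributes.

    Leaves are {'label': c}; split nodes are
    {'attribute': a, 'children': {value: subtree}}.
    """
    # Steps 1 and 3: a pure node (or one with no attributes left to test)
    # becomes a leaf storing the majority label.
    if len(set(y.tolist())) == 1 or not attributes:
        return {"label": Counter(y.tolist()).most_common(1)[0][0]}

    # Step 2: evaluate the criterion (information gain) for every attribute.
    gains = {
        a: information_gain(y, [y[X[:, a] == v] for v in np.unique(X[:, a])])
        for a in attributes
    }
    best = max(gains, key=gains.get)

    # Step 4: create one child per observed attribute value and recurse
    # on the corresponding subset of the data.
    remaining = [a for a in attributes if a != best]
    children = {}
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        children[v] = build_tree(X[mask], y[mask], remaining)
    return {"attribute": best, "children": children}


# Example usage: tree = build_tree(X, y, attributes=list(range(X.shape[1])))
```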