Classification and Regression Tree (CART)
CART is a learning method for the conditional probability distribution of the output random variable Y given the input random variable X.
Definition
CART assumes that the decision tree is a binary tree in which each internal node tests a feature, with the left branch taking the value "yes" and the right branch taking the value "no". Such a decision tree is equivalent to recursively splitting on one feature at a time, dividing the input space (i.e., the feature space) into a finite number of units and determining the predicted probability distribution on each unit, that is, the conditional probability distribution of the output given the input.
The CART algorithm consists of the following two steps:
- Tree generation: Generate a decision tree based on the training data set. The generated decision tree should be as large as possible.
- Tree pruning: Use the validation data set to prune the generated tree and select the optimal subtree, with minimization of the loss function as the pruning criterion.
The generation of a decision tree is a process of recursively constructing a binary decision tree: the square error minimization criterion is used for regression trees, and the Gini index minimization criterion is used for classification trees, to perform feature selection and generate the binary tree.
Generation of regression trees
Assume that X and Y are the input and output variables respectively, and that Y is a continuous variable. Given the training data set $latex D=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$, consider how to generate a regression tree.
A regression tree corresponds to a partition of the input space (i.e., the feature space) and an output value on each unit of the partition. Assume that the input space has been divided into M units $latex R_{1}, R_{2}, \ldots, R_{M}$, and that each unit $latex R_{m}$ has a fixed output value $latex c_{m}$. The regression tree model can then be expressed as $latex f(x)=\sum_{m=1}^{M} c_{m} I\left(x \in R_{m}\right)$.
When the partition of the input space is determined, the square error $latex \sum_{x_{i} \in R_{m}}\left(y_{i}-f\left(x_{i}\right)\right)^{2}$ can be used to measure the prediction error of the regression tree on the training data, and the optimal output value on each unit is solved by the minimum square error criterion. It is easy to see that the optimal value $latex \hat{c}_{m}$ of $latex c_{m}$ on unit $latex R_{m}$ is the mean of the outputs $latex y_{i}$ for all input instances $latex x_{i}$ in $latex R_{m}$, that is

$latex \hat{c}_{m}=\operatorname{ave}\left(y_{i} \mid x_{i} \in R_{m}\right)$
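For example, if unit $latex R_{m}$ contains training points with outputs 2, 4, and 6, the best constant prediction under square error is their mean, $latex \hat{c}_{m}=(2+4+6)/3=4$.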
The question is how to divide the input space. Here a heuristic method is used: select the $latex j$-th variable $latex x^{(j)}$ and a value $latex s$ it takes as the splitting variable and splitting point, and define two regions:

$latex R_{1}(j, s)=\left\{x \mid x^{(j)} \leqslant s\right\} \quad \text{and} \quad R_{2}(j, s)=\left\{x \mid x^{(j)}>s\right\}$
Then find the optimal splitting variable $latex j$ and the optimal splitting point $latex s$. Specifically, solve

$latex \min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)}\left(y_{i}-c_{2}\right)^{2}\right]$
For a fixed splitting variable $latex j$, the optimal splitting point $latex s$ can be found, and the two inner minimizations are solved by the means:

$latex \hat{c}_{1}=\operatorname{ave}\left(y_{i} \mid x_{i} \in R_{1}(j, s)\right) \quad \text{and} \quad \hat{c}_{2}=\operatorname{ave}\left(y_{i} \mid x_{i} \in R_{2}(j, s)\right)$
Traverse all input variables to find the optimal splitting variable $latex j$, forming a pair $latex (j, s)$ that divides the input space into two regions. The division process is then repeated on each region until a stopping condition is met, yielding a least-squares regression tree.
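To make the search concrete, here is a minimal Python sketch of the exhaustive scan over pairs $latex (j, s)$. The function name best_split and the NumPy-based interface are illustrative assumptions, not part of the original text.

```python
import numpy as np

def best_split(X, y):
    """Search for the pair (j, s) minimizing the total square error.

    A minimal sketch: X is an (N, d) array, y an (N,) array.
    Returns (j, s, error) for the best split found.
    """
    best = (None, None, np.inf)
    n, d = X.shape
    for j in range(d):                    # traverse all input variables
        for s in np.unique(X[:, j]):      # candidate splitting points
            left = y[X[:, j] <= s]        # outputs in R1(j, s)
            right = y[X[:, j] > s]        # outputs in R2(j, s)
            if len(left) == 0 or len(right) == 0:
                continue
            # inner minimizations: the optimal constants are the means
            err = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best
```

The double loop mirrors the criterion directly: for each fixed j, scanning the observed values of $latex x^{(j)}$ finds the optimal s, and the outer loop over variables selects the best pair.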
Generation of classification trees
The classification tree uses the Gini index to select the optimal feature and determine the optimal binary cut point for the feature.
Gini Index
In the classification problem, suppose there are K classes and the probability that a sample point belongs to the k-th class is $latex p_{k}$. Then the Gini index of the probability distribution is defined as:

$latex \operatorname{Gini}(p)=\sum_{k=1}^{K} p_{k}\left(1-p_{k}\right)=1-\sum_{k=1}^{K} p_{k}^{2}$
For a two-class classification problem, if the probability that a sample point belongs to the first class is $latex p$, the Gini index of the probability distribution is:

$latex \operatorname{Gini}(p)=2 p(1-p)$
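For example, a node with $latex p=0.5$ attains the maximum Gini index $latex 2 \times 0.5 \times 0.5=0.5$, while a pure node ($latex p=0$ or $latex p=1$) has a Gini index of 0.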
For a given sample set D, its Gini index is:

$latex \operatorname{Gini}(D)=1-\sum_{k=1}^{K}\left(\frac{\left|C_{k}\right|}{|D|}\right)^{2}$
Here, $latex C_{k}$ is the subset of samples in D belonging to the k-th class, and K is the number of classes.
If the sample set D is divided into two parts $latex D_{1}$ and $latex D_{2}$ according to whether feature A takes a certain possible value a, that is,

$latex D_{1}=\{(x, y) \in D \mid A(x)=a\}, \quad D_{2}=D-D_{1}$
then, under the condition of feature A, the Gini index of set D is defined as:

$latex \operatorname{Gini}(D, A)=\frac{\left|D_{1}\right|}{|D|} \operatorname{Gini}\left(D_{1}\right)+\frac{\left|D_{2}\right|}{|D|} \operatorname{Gini}\left(D_{2}\right)$
The Gini index $latex \operatorname{Gini}(D)$ represents the uncertainty of set D, and the Gini index $latex \operatorname{Gini}(D, A)$ represents the uncertainty of set D after being partitioned by the test A=a. The larger the Gini index, the greater the uncertainty of the sample set, which is similar to entropy.
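These two definitions translate directly into code. Below is a minimal Python sketch; the function names gini and gini_split are illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini index of a sample set: 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels_d1, labels_d2):
    """Gini index of D under the test A=a, i.e. the size-weighted
    average of the Gini indices of the two partitions D1 and D2."""
    n = len(labels_d1) + len(labels_d2)
    return (len(labels_d1) / n) * gini(labels_d1) \
         + (len(labels_d2) / n) * gini(labels_d2)
```

For example, gini(["yes", "yes", "no", "no"]) returns 0.5, the maximum uncertainty for two classes.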
Generation Algorithm
Input: training data set D and the condition for stopping the computation
Output: CART decision tree
According to the training data set, starting from the root node, recursively perform the following operations on each node to build a binary decision tree:
- Let D be the training data set at the node, and calculate the Gini index of each existing feature with respect to the data set. That is, for each feature A and each of its possible values a, split D into two parts $latex D_{1}$ and $latex D_{2}$ according to whether a sample point satisfies the test A=a ("yes" or "no"), and compute the Gini index $latex \operatorname{Gini}(D, A)$ for A=a.
- Among all possible features A and all their possible split points a, select the feature with the smallest Gini index and its corresponding split point as the optimal feature and the optimal split point. Based on these, generate two child nodes from the current node and distribute the training data to the two child nodes according to the feature test.
- Recursively apply the two steps above to each child node until the stopping condition is met.
- The CART decision tree is generated (a minimal sketch of the procedure follows this list).
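Putting the steps together, the following is a minimal Python sketch of the generation algorithm for categorical features. It reuses the gini_split helper sketched earlier; the dictionary-based node representation and the min_samples stopping condition are assumptions for illustration, not part of the original text.

```python
def build_cart(X, y, min_samples=2):
    """Recursively build a binary classification tree using the
    Gini index minimization criterion described above.

    X is a list of feature dicts, y a list of class labels.
    """
    # stopping condition: node is pure or too small -> majority-label leaf
    if len(set(y)) == 1 or len(y) < min_samples:
        return {"leaf": True, "label": max(set(y), key=y.count)}

    best = None  # (gini, feature, value, yes_indices, no_indices)
    for feature in X[0]:                          # all possible features A
        for value in {row[feature] for row in X}: # all possible values a
            yes = [i for i, row in enumerate(X) if row[feature] == value]
            no = [i for i, row in enumerate(X) if row[feature] != value]
            if not yes or not no:
                continue
            g = gini_split([y[i] for i in yes], [y[i] for i in no])
            if best is None or g < best[0]:
                best = (g, feature, value, yes, no)

    if best is None:  # no valid split exists
        return {"leaf": True, "label": max(set(y), key=y.count)}

    _, feature, value, yes, no = best
    # split on the optimal (feature, value) test and recurse on both parts
    return {
        "leaf": False, "feature": feature, "value": value,
        "yes": build_cart([X[i] for i in yes], [y[i] for i in yes], min_samples),
        "no": build_cart([X[i] for i in no], [y[i] for i in no], min_samples),
    }
```

Each recursive call performs exactly the two steps of the algorithm: compute the Gini index for every feature-value test, then split on the smallest one; the recursion bottoms out at the stopping condition.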