Pruning
Pruning is a method for stopping a decision tree from branching further. It is used to address the problem of overfitting in decision trees, chiefly by simplifying the tree.
Pruning is needed because, during decision tree learning, the algorithm keeps generating nodes in order to classify the training samples as correctly as possible. This leaves the tree with too many branches and reduces its efficiency; pruning is then applied to simplify the tree.
The significance of pruning
The decision tree algorithm needs to determine the optimal size of the tree. A tree that is too large overfits the training data and generalizes poorly to new samples, while a tree that is too small may fail to capture important structure in the sample space.
It is also hard to decide when to stop growing the tree, because one cannot tell in advance whether adding a single extra node will reduce the error rate. The most common strategy is therefore to grow the tree until each node contains only a small number of instances, and then use pruning to remove the nodes that provide no additional benefit.
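A minimal sketch of this grow-then-prune strategy, assuming scikit-learn and the Iris dataset (both illustrative choices, not part of the original text):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow the tree without constraints, until leaves contain few instances.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Enumerate the effective alphas at which whole subtrees get pruned away.
path = full_tree.cost_complexity_pruning_path(X, y)

# Refit with a nonzero ccp_alpha: larger alpha means more aggressive pruning.
pruned_tree = DecisionTreeClassifier(
    random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print("nodes before:", full_tree.tree_.node_count,
      "after:", pruned_tree.tree_.node_count)
```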
Ideas and methods of pruning
The central question in pruning is how to determine the right size of the decision tree. Common approaches include:
- Use a training set and a separate validation set to evaluate the effect of pruning a given node (a sketch of this approach follows the list);
- Use the entire training set for training, but apply a statistical test to estimate whether pruning a specific node is likely to improve performance on data outside the training set;
- Use an explicit measure of the combined complexity of the training examples and the decision tree (e.g., the minimum description length principle).
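A minimal sketch of the first approach, assuming scikit-learn's cost-complexity pruning as the pruning mechanism and an illustrative 70/30 train/validation split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths, derived from the fully grown tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Train one tree per candidate strength and keep the one that scores best
# on the held-out validation set.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("selected tree has", best.tree_.node_count, "nodes")
```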
The concrete operation of pruning is to remove certain subtrees or leaf nodes from the decision tree, turning the root of each removed subtree (or its parent node) into a leaf, which is typically labeled with the majority class of the training samples that reach it.
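To make the operation concrete, here is a toy sketch built around a hypothetical Node class (not from any library): collapsing a subtree drops its children, so its root becomes a leaf that predicts the majority class of the training samples that reached it.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: list                      # class labels of the training samples reaching this node
    children: list = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def majority_class(self):
        return Counter(self.labels).most_common(1)[0][0]

def collapse_to_leaf(node: Node) -> None:
    """Prune the subtree rooted at `node`: drop its children so the node
    itself becomes a leaf predicting its majority class."""
    node.children = []

root = Node(labels=["yes", "yes", "no"],
            children=[Node(labels=["yes", "yes"]), Node(labels=["no"])])
collapse_to_leaf(root)
print(root.is_leaf(), root.majority_class())  # True yes
```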
Classification of pruning
Pruning is usually divided into two categories: Pre-Pruning, which halts tree growth early according to a stopping criterion, and Post-Pruning, which first grows a complete tree and then removes unhelpful subtrees.
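As a rough illustration of the difference, assuming scikit-learn (the parameter values are arbitrary): pre-pruning constrains the tree while it is being grown, whereas post-pruning trims a fully grown tree afterwards.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stopping criteria applied during growth.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow fully, then cut back via cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print("depths:", pre_pruned.get_depth(), post_pruned.get_depth())
```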