Python implementation of a decision tree for classifying a meteorological dataset
Here I used an improved CART model as my basic decision tree model. At each node it splits the dataset on the best feature, chosen by either Gini impurity or entropy.
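The split criterion can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: the function names `gini`, `entropy`, and `best_threshold` are hypothetical, and it assumes a single numeric feature with binary splits of the form `value <= threshold`.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels, impurity=gini):
    """Return (threshold, weighted impurity) for the best split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit impurity, the threshold is None.
    """
    n = len(labels)
    best = (None, impurity(labels))
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best[1]:
            best = (t, score)
    return best
```

Passing `impurity=entropy` switches the criterion without changing the search itself.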
Several tree pruning methods were implemented in class `DecisionTree` to avoid overfitting:
- Reduced Error Pruning
- Pessimistic Pruning: unlike the other pruning methods, pessimistic pruning is normally a top-down algorithm that visits the nodes starting from the root of the tree. Here I also implemented a bottom-up variant, which was presented in a USC lecture years ago (Machine Learning CSCI-567).
- Minimum Error Pruning
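To illustrate the first method in the list, here is a minimal sketch of reduced error pruning. It is not the `DecisionTree` class from this repository: the dict-based node layout (`feature`, `threshold`, `left`, `right`, `label`) and the function names are assumptions made for the example. A subtree is replaced by a leaf carrying its majority label whenever that does not increase the error on a held-out validation set.

```python
def predict(node, x):
    """Walk from `node` down to a leaf and return its label."""
    while "feature" in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


def errors(node, X, y):
    """Number of misclassified samples for the subtree rooted at `node`."""
    return sum(predict(node, x) != target for x, target in zip(X, y))


def reduced_error_prune(node, X_val, y_val):
    """Bottom-up pruning: children first, then try collapsing this node."""
    if "feature" not in node:
        return node  # already a leaf
    f, t = node["feature"], node["threshold"]
    left = [(x, c) for x, c in zip(X_val, y_val) if x[f] <= t]
    right = [(x, c) for x, c in zip(X_val, y_val) if x[f] > t]
    node["left"] = reduced_error_prune(
        node["left"], [x for x, _ in left], [c for _, c in left])
    node["right"] = reduced_error_prune(
        node["right"], [x for x, _ in right], [c for _, c in right])
    leaf = {"label": node["label"]}  # majority label stored at training time
    # Ties favour the simpler tree. A node that no validation sample reaches
    # is therefore collapsed as well (a choice made for this sketch).
    if errors(leaf, X_val, y_val) <= errors(node, X_val, y_val):
        return leaf
    return node


# Toy tree whose right subtree overfits: its extra split hurts on validation data.
tree = {
    "feature": 0, "threshold": 5, "label": "a",
    "left": {"label": "a"},
    "right": {
        "feature": 1, "threshold": 0, "label": "b",
        "left": {"label": "b"},
        "right": {"label": "a"},
    },
}
X_val = [[1, 0], [2, 1], [7, -1], [8, 1]]
y_val = ["a", "a", "b", "b"]
pruned = reduced_error_prune(tree, X_val, y_val)
```

In this toy case the root split is kept, since collapsing it would misclassify two validation samples, while the right subtree becomes a single leaf.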
For now, I just use the ensemble methods provided by scikit-learn:
- AdaBoost
- Bagging
- Random Forest
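A minimal sketch of how these three scikit-learn ensembles can be trained and compared. The data here is synthetic, generated with `make_classification` as a stand-in for the meteorological dataset, and the hyperparameters are illustrative rather than the ones used in this project. With default settings, `BaggingClassifier` builds its ensemble from decision trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this example).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each ensemble and record its accuracy on the held-out split.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```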
Q: Why is it recommended not to prune the trees when training a random forest or a bagging ensemble?
A: Pruning is usually used to prevent overfitting. Random forests sample the training data with replacement and also pick a random subset of features at each split, so the correlation between the individual trees is low and their errors largely average out. As a result, random forests generally do well with fully grown trees. As for bagging, averaging can only reduce variance, not bias (roughly, high bias corresponds to underfitting and high variance to overfitting; see the bias-variance tradeoff). We therefore want the individual trees to have low bias, and fully grown, overfitting trees are well suited to that.
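The role of correlation can be made concrete with a standard result: if each of `n_trees` predictors has variance `sigma2` and every pair has correlation `rho`, the variance of their average is `rho * sigma2 + (1 - rho) * sigma2 / n_trees`. The sketch below simply evaluates that formula; the function name is hypothetical, and the assumption of identically distributed trees is an idealisation.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of n_trees equally correlated predictors."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# Compare uncorrelated, moderately correlated, and fully correlated trees.
for rho in (0.0, 0.5, 1.0):
    print(rho, ensemble_variance(1.0, rho, 100))
```

Adding trees shrinks only the second term, so lowering `rho` (which random feature selection aims to do) is what lets an ensemble of high-variance, unpruned trees still average out well.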
The dataset contains 40,000 entries of hourly meteorological data, taken from the paper "Assessing Beijing's PM 2.5 pollution: severity, weather impact, APEC and winter heating" and from the China Meteorological Data Service Center.