Python implementation of decision tree for classification of meteorological dataset

DennisHanyuanXu/Decision-Tree

Decision Tree Project

Python implementation of decision tree for classification of meteorological dataset

1. Models and Methods

1.1 Decision Tree Models

Here I used an improved CART model as my basic decision tree model. At each node it splits the dataset on the feature that gives the best split according to either Gini impurity or entropy.
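As a rough illustration of the split criterion, the sketch below computes Gini impurity and entropy and searches for the binary split with the lowest weighted impurity. It is a minimal sketch, not the code from this repository: the function names (`gini`, `entropy`, `best_split`) and the exhaustive threshold search over numeric features are assumptions made for the example.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels, criterion=gini):
    """Return (feature_index, threshold, weighted_impurity) of the best
    binary split `row[feature] <= threshold`, or None if no split exists."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        # Try every observed value of feature j as a candidate threshold.
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * criterion(left) + len(right) * criterion(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical toy data: feature 0 separates the two classes perfectly.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
labels = ["a", "a", "b", "b"]
split_gini = best_split(rows, labels, criterion=gini)
split_entropy = best_split(rows, labels, criterion=entropy)
print(split_gini, split_entropy)
```

Passing `criterion=entropy` instead of `criterion=gini` switches the impurity measure without changing the search itself.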

1.2 Pruning Methods

Several tree pruning methods were implemented in class DecisionTree to avoid overfitting.

  1. Reduced Error Pruning
  2. Pessimistic Pruning. Unlike the other pruning methods, pessimistic pruning is normally a top-down algorithm that visits the nodes starting from the root of the tree. Here I also implemented a bottom-up variant, which was presented in a USC lecture years ago (Machine Learning, CSCI-567).
  3. Minimum Error Pruning
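To make the first method in the list concrete, here is a minimal sketch of reduced error pruning on a held-out validation set. It does not reproduce the `DecisionTree` class from this repository: the dict-based node layout (`feature`, `threshold`, `left`, `right`, `label`) and the function names are assumptions made for the example. Each internal node stores the majority training label in `label`, and a subtree is replaced by that leaf whenever doing so does not increase the validation error.

```python
def predict(node, row):
    """Walk down the tree until a leaf (a node without 'feature') is reached."""
    while "feature" in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


def errors(node, rows, labels):
    """Number of misclassified validation samples under this (sub)tree."""
    return sum(predict(node, r) != y for r, y in zip(rows, labels))


def reduced_error_prune(node, rows, labels):
    """Bottom-up pruning: prune the children first, then collapse this node
    into a leaf if that does not increase the error on the validation
    samples that reach it."""
    if "feature" not in node:
        return node
    j, t = node["feature"], node["threshold"]
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    node["left"] = reduced_error_prune(
        node["left"], [r for r, _ in left], [y for _, y in left])
    node["right"] = reduced_error_prune(
        node["right"], [r for r, _ in right], [y for _, y in right])
    leaf = {"label": node["label"]}
    if errors(leaf, rows, labels) <= errors(node, rows, labels):
        return leaf
    return node


# Hypothetical tree: the right subtree overfits with an extra split on feature 1.
tree = {
    "feature": 0, "threshold": 5, "label": "a",
    "left": {"label": "a"},
    "right": {
        "feature": 1, "threshold": 2, "label": "b",
        "left": {"label": "b"},
        "right": {"label": "a"},
    },
}
val_rows = [[1, 0], [2, 9], [7, 1], [8, 5], [9, 7]]
val_labels = ["a", "a", "b", "b", "b"]
pruned = reduced_error_prune(tree, val_rows, val_labels)
print(pruned)
```

In this toy case the extra split in the right subtree is collapsed into a single leaf, while the root split is kept because removing it would increase the validation error.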

1.3 Ensemble Methods

For now, I just use the ensemble methods provided in Scikit-Learn.

  1. AdaBoost
  2. Bagging
  3. Random Forest
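A minimal sketch of how the three scikit-learn ensembles can be trained and compared is shown below. The synthetic data from `make_classification` and all hyperparameter values are assumptions for illustration only. They stand in for the meteorological dataset and are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 samples, 10 features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Both AdaBoost and Bagging use decision trees as their default base learner.
models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```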

1.4 QnA

Q: Why is it recommended not to prune the trees while training random forest / bagging?
A: Pruning is usually used to prevent overfitting. Random forests sample the training set with replacement and also select a random subset of features at each node when splitting, so the correlation between the weak learners (the individual trees) is low. Averaging such weakly correlated trees already removes much of their variance, so random forests generally do well with full-depth trees. As for bagging, the bagging process can only reduce variance, not bias (high bias can be seen as underfitting and high variance as overfitting; see the bias-variance tradeoff). So we want the individual trees to have low bias, and fully grown, overfitting trees are well suited for that.
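The variance argument can be made concrete with the standard formula for the variance of an average of B identically distributed predictors, each with variance sigma^2 and pairwise correlation rho: rho * sigma^2 + (1 - rho) * sigma^2 / B. The sketch below evaluates it for a few correlation values. The function name `ensemble_variance` and the chosen numbers are illustrative assumptions.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of n_trees identically distributed predictors
    with individual variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# The lower the correlation between trees, the more averaging helps.
for rho in (0.0, 0.3, 0.9):
    print(f"rho={rho}: variance of 100-tree average = "
          f"{ensemble_variance(1.0, rho, 100):.4f}")
```

The first term does not shrink as more trees are added, which is why decorrelating the trees matters more than pruning them. The formula says nothing about bias, which averaging leaves unchanged.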

2. Dataset

The dataset contains 40,000 entries of hourly meteorological data, taken from the paper "Assessing Beijing's PM 2.5 pollution: severity, weather impact, APEC and winter heating" and from the China Meteorological Data Service Center.
