This repository contains a C++ implementation of the CART algorithm (currently supporting classification only), offering an API for decision trees and random forests similar to the scikit-learn implementation. In addition to the C++ library, the Python package twigy offers Python bindings for the main user-facing classes.
There are three ways to get started with twigy in Python:

```
pip install twigy
```
The manylinux wheel is built in a CentOS 7 Docker container, so you need Docker installed. Also make sure you have pulled the pybind11 git submodule:

```
git submodule update --init --recursive
```
Then run

```
./build_manylinux_wheel.sh
```

which builds the manylinux wheel. You can then install it with

```
pip install ./build/wheelhouse/twigy-0.0.1-cp36-cp36m-manylinux2014_x86_64.whl
```
To build the Python extension library directly, you need CMake >= 3.10 and Boost >= 1.66 on your system. Make sure you have pulled the pybind11 git submodule:

```
git submodule update --init --recursive
```

Then

```
cd ./build && cmake ..
cmake --build . --target twigy
```
Implements a CART decision tree classifier. See also `./example.py` for usage.
- `impurity_measure`: sets the measure of impurity used for the splits. Takes a `twigy.ImpurityMeasure` value; the default is `twigy.ImpurityMeasure.gini`.
- `max_depth`: sets the maximum depth to which the tree is grown. Default is `-1`, which corresponds to no restriction on the depth.
- `min_samples_split`: sets the minimum number of samples for a node to be split. Default is `2`.
- `min_samples_leaf`: sets the minimum number of samples at a leaf node. Splits that would lead to a lower number are not considered. Default is `1`.
- `max_features`: sets the maximum number of randomly selected features to be considered at each split. Default is `-1`, which corresponds to choosing the number of features according to the `max_features_method`.
- `max_features_method`: sets the method by which the number of features considered at each split is chosen, unless it is explicitly specified by `max_features`. Possible values are `twigy.MaxFeaturesMethod.sqrt_method`, `twigy.MaxFeaturesMethod.log2_method`, and `twigy.MaxFeaturesMethod.all_method`. Default is `twigy.MaxFeaturesMethod.all_method`.
- `min_impurity_split`: sets the minimal impurity for a node to be considered for another split. Default is `0.0`.
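For intuition, the Gini impurity of a node with class proportions `p_k` is `1 - sum_k p_k**2`; splits are chosen to reduce it. A standalone sketch of the computation (not part of the twigy API):

```python
from collections import Counter

def gini_impurity(labels):
    # Gini impurity: 1 - sum over classes of (class fraction)^2.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity([0, 0, 0, 0]))  # pure node -> 0.0
print(gini_impurity([0, 0, 1, 1]))  # balanced binary node -> 0.5
```

A pure node scores `0.0`; impurity grows as the class mix becomes more even.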
- `build_tree(X, y)`: grows the tree on the training set given by the features `X` and the labels `y`. Note that the class labels need to be given by `0, 1, 2, ..., n_classes - 1`.
- `print_tree()`: prints a list of the nodes of the decision tree.
- `predict_classes(X)`: predicts the class labels for the given samples `X`.
Implements a random forest classifier. See also `./example.py` for usage.
- `n_estimators`: sets the number of decision tree estimators to train.
- `impurity_measure`: sets the measure of impurity used for the splits. Takes a `twigy.ImpurityMeasure` value; the default is `twigy.ImpurityMeasure.gini`.
- `max_depth`: sets the maximum depth to which each tree is grown. Default is `-1`, which corresponds to no restriction on the depth.
- `min_samples_split`: sets the minimum number of samples for a node to be split. Default is `2`.
- `min_samples_leaf`: sets the minimum number of samples at a leaf node. Splits that would lead to a lower number are not considered. Default is `1`.
- `max_features`: sets the maximum number of randomly selected features to be considered at each split. Default is `-1`, which corresponds to choosing the number of features according to the `max_features_method`.
- `max_features_method`: sets the method by which the number of features considered at each split is chosen, unless it is explicitly specified by `max_features`. Possible values are `twigy.MaxFeaturesMethod.sqrt_method`, `twigy.MaxFeaturesMethod.log2_method`, and `twigy.MaxFeaturesMethod.all_method`. Default is `twigy.MaxFeaturesMethod.sqrt_method` (note that this default differs from that of the `DecisionTreeClassifier`).
- `min_impurity_split`: sets the minimal impurity for a node to be considered for another split. Default is `0.0`.
- `max_samples`: sets the number of randomly selected samples used to train each individual tree. Default is `-1`, which corresponds to all samples being used for each tree.
- `build_forest(X, y)`: trains the random forest on the training set given by the features `X` and the labels `y`. Note that the class labels need to be given by `0, 1, 2, ..., n_classes - 1`.
- `predict_classes(X)`: predicts the class labels for the given samples `X`.
twigy has been benchmarked (on an i3-7100 CPU @ 3.90 GHz with 16 GB memory) against scikit-learn:

The benchmark shows that twigy is up to 8 times faster on this dataset. For more details see `benchmark.py`.
twigy can also be used as a C++ library, as illustrated in `example.cpp`.
To build the example you need CMake >= 3.10 and Boost >= 1.66 on your system. Then run

```
cd ./build && cmake ..
cmake --build . --target example
```