Dask Backward Feature Selection

Backward step-wise feature selection using Dask, scikit-learn compatible.

Scale out feature selection with distributed computing using Dask!

I created this because mlxtend's SequentialFeatureSelector does not use joblib in a Dask-compatible way.
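For readers new to the technique, backward step-wise selection can be sketched in plain Python as below. This is a sequential illustration only, not this package's implementation: `backward_select`, `score_fn`, and `toy_score` are hypothetical names, and the scoring function stands in for the cross-validated score that would be computed for each candidate subset (the part that is worth distributing with Dask).

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


def backward_select(
    features: Iterable[int],
    score_fn: Callable[[FrozenSet[int]], float],
    k_features: int = 1,
) -> Dict[int, Tuple[FrozenSet[int], float]]:
    """Greedy backward elimination (illustrative sketch).

    Starts from all features and repeatedly drops the single feature whose
    removal gives the highest score, until only k_features remain.
    Returns {subset size: (best subset of that size, its score)}.
    """
    current = frozenset(features)
    history = {len(current): (current, score_fn(current))}
    while len(current) > k_features:
        # Every candidate is the current subset with exactly one feature removed.
        candidates = [current - {f} for f in sorted(current)]
        # Each candidate is scored independently, so this loop could run in parallel.
        scored = [(score_fn(c), c) for c in candidates]
        best_score, best_subset = max(scored, key=lambda pair: pair[0])
        current = best_subset
        history[len(current)] = (current, best_score)
    return history


def toy_score(subset: FrozenSet[int]) -> float:
    """Hypothetical score: features 0 and 2 are useful, every feature has a small cost."""
    useful = {0, 2}
    return len(useful & subset) - 0.1 * len(subset)


history = backward_select(range(4), toy_score, k_features=2)
best_two, best_two_score = history[2]
```

Because the candidates within one step do not depend on each other, each step is a natural unit of parallel work.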

Install

pip install git+https://github.com/pr38/dask_backward_feature_selection

Example Usage

import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes

from dask.distributed import Client, LocalCluster

from dask_backward_feature_selection import DaskBackwardFeatureSelector

# In production, use one of Dask's YARN or Kubernetes cluster deployments.
# If you only plan to run locally, you are better off using mlxtend's SequentialFeatureSelector.
cluster = LocalCluster(n_workers=3)
client = Client(cluster)

# load_boston was removed in scikit-learn 1.2; load_diabetes is used here as a stand-in regression dataset.
diabetes = load_diabetes()
X = diabetes['data']
y = diabetes['target']

dfs = DaskBackwardFeatureSelector(DecisionTreeRegressor(), client)
# Keyword arguments for DaskBackwardFeatureSelector:
# k_features: the size of the smallest feature combination DaskBackwardFeatureSelector will examine.
# cv: if an int, the number of cross-validation folds used for each feature combination tested;
#     it can also be a scikit-learn cross-validation splitter.
# scoring: either a string (see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.get_scorer.html#sklearn.metrics.get_scorer)
#     or a scikit-learn scorer object.
# scatter: if True, each thread in the cluster keeps a copy of the training data and estimator.

dfs.fit(X,y)

# Positions (column indices in X) of the best-performing combination of features.
dfs.k_feature_idx_

# After fitting, DaskBackwardFeatureSelector can be used as an estimator.
dfs.predict(X)


# DaskBackwardFeatureSelector can also act as a transformer.
dfs.transform(X, y)

# Finally, we can examine the best-performing feature combination at each step,
# which supports other use cases (e.g. the one-standard-error rule).
pd.DataFrame(dfs.metric_dict_)
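The one-standard-error rule mentioned above picks the smallest feature set whose score is within one standard error of the best score. The sketch below shows the rule on made-up numbers. The layout of `metric_dict_` is not documented here, so the `(n_features, mean_score, std_error)` tuples and the function name `one_standard_error_choice` are assumptions for illustration; you would need to derive those three values from the per-step results yourself.

```python
from typing import List, Tuple


def one_standard_error_choice(results: List[Tuple[int, float, float]]) -> int:
    """Return the smallest feature count that passes the one-standard-error rule.

    results: hypothetical per-step summaries as
             (n_features, mean CV score, standard error of that mean),
             where a higher score is better.
    """
    # Find the step with the best mean score.
    _, best_mean, best_se = max(results, key=lambda row: row[1])
    # Any step scoring at least (best mean - its standard error) is "good enough".
    threshold = best_mean - best_se
    acceptable = [n for n, mean, _ in results if mean >= threshold]
    # Prefer the most parsimonious acceptable model.
    return min(acceptable)


# Made-up example values, not output from a real run.
example_results = [
    (5, 0.80, 0.02),
    (4, 0.81, 0.03),
    (3, 0.79, 0.02),
    (2, 0.70, 0.02),
]
chosen_size = one_standard_error_choice(example_results)
```

Here the best mean score belongs to the four-feature step, but the three-feature step is within one standard error of it, so the rule selects three features.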