Backward stepwise feature selection using Dask; scikit-learn compatible.
Scale out feature selection using distributed computing with Dask!
I created this because mlxtend's SequentialFeatureSelector does not use joblib in a Dask-compatible way.
pip install git+https://github.com/pr38/dask_backward_feature_selection
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing  # load_boston was removed in scikit-learn 1.2
from dask.distributed import Client, LocalCluster
from dask_backward_feature_selection import DaskBackwardFeatureSelector
# You should be using Dask's YARN or Kubernetes cluster deployments.
# If you are going to run this locally, you are better off using mlxtend's SequentialFeatureSelector.
cluster = LocalCluster(n_workers=3)
client = Client(cluster)
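# On a real cluster you would swap LocalCluster for one of Dask's managed
# deployments instead; a hedged sketch using dask-yarn (the environment
# path and worker count below are illustrative assumptions):
# from dask_yarn import YarnCluster
# cluster = YarnCluster(environment="environment.tar.gz", n_workers=8)
# client = Client(cluster)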
housing = fetch_california_housing()
X = housing['data']
y = housing['target']
dfs = DaskBackwardFeatureSelector(DecisionTreeRegressor(), client)
# Keyword arguments for DaskBackwardFeatureSelector (an example with them spelled out follows below):
# k_features: the smallest combination of features DaskBackwardFeatureSelector will examine.
# cv: if an int, the number of cross-validation folds used for each feature combination tested;
#     it can also be a scikit-learn CV splitter.
# scoring: a scorer name string (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.get_scorer.html)
#     or a scikit-learn scorer object.
# scatter: if True, each worker in the cluster keeps a copy of the training data and estimator.
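# A minimal sketch with these keyword arguments spelled out; the values
# below are illustrative, not recommendations:
dfs = DaskBackwardFeatureSelector(
    DecisionTreeRegressor(),
    client,
    k_features=3,  # stop once combinations of 3 features have been examined
    cv=5,  # 5-fold cross-validation for each feature combination
    scoring="neg_mean_squared_error",  # any scikit-learn scorer name
    scatter=True,  # keep a copy of the data and estimator on each worker
)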
dfs.fit(X, y)
# Positions of the top-performing combination of features in the X matrix.
dfs.k_feature_idx_
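# For instance, the selected columns can be pulled straight out of X with
# NumPy indexing (assuming k_feature_idx_ holds integer column positions,
# per the comment above):
X_selected = X[:, list(dfs.k_feature_idx_)]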
# We can treat DaskBackwardFeatureSelector as an estimator after training.
dfs.predict(X)
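# For example, the fitted selector's predictions can be scored with any
# standard scikit-learn metric:
from sklearn.metrics import r2_score
r2_score(y, dfs.predict(X))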
# DaskBackwardFeatureSelector can also act as a transformer.
dfs.transform(X, y)
# Finally, we can examine the best-performing feature combination at each step, for other use cases (e.g., the one-standard-error rule).
pd.DataFrame(dfs.metric_dict_)
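# A hedged sketch of the one-standard-error rule on top of metric_dict_.
# The keys 'avg_score' and 'std_err' below are assumptions about
# metric_dict_'s layout (mlxtend's get_metric_dict() uses these names);
# adjust them to whatever the DataFrame above actually contains:
# metrics = pd.DataFrame(dfs.metric_dict_).T  # one row per step (layout assumed)
# cutoff = metrics['avg_score'].max() - metrics.loc[metrics['avg_score'].idxmax(), 'std_err']
# within_one_se = metrics[metrics['avg_score'] >= cutoff]  # simplest models within one SE of the best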