Feature Selection for Regression problems

Extensive Feature reduction and Feature Selection using multiple Techniques


Hello everyone! In this project we show various feature reduction and feature selection techniques for preprocessing data before feeding it to a machine learning model.

Getting Started

Prerequisites

  1. Install and set up Anaconda. Find an easy installation and setup guide using this link. Make sure you install Anaconda for Python 3.6 or above.

  2. Install the required packages: Open Anaconda Prompt and run this command

conda install pandas numpy matplotlib seaborn xgboost catboost scikit-learn statsmodels

Verify the installation by running

import pandas as pd
pd.__version__
  3. Download the data set: We will be using the Superconductivity Data Set. The goal here is to predict the critical temperature based on the extracted features. The data can be downloaded through this link

  4. Clone this repository and unzip it into a folder

How to run?

  • Open 'Anaconda Prompt' from the Start Menu
  • Change the working directory to the 'Feature_Selector' folder
  • Run jupyter notebook to open Jupyter
  • Open the Feature_Section_Regression.ipynb file and run the cells one by one

Some Theory

Techniques used for Feature reduction/Selection:

1. Univariate feature selection

  • Univariate feature selection works by selecting the best features based on univariate statistical tests.
  • GenericUnivariateSelect allows performing univariate feature selection with a configurable strategy. This makes it possible to choose the best univariate selection strategy with a hyper-parameter search estimator. It takes as input a scoring function that returns univariate scores and p-values. (A short sketch follows this list.)
  • modes:
    ‘percentile’ - removes all but a user-specified highest-scoring percentage of features
    ‘k_best’ - removes all but the 'k' highest-scoring features
    ‘fpr’ - false positive rate
    ‘fdr’ - false discovery rate
    ‘fwe’ - family-wise error

  • score_func:
    For regression: f_regression, mutual_info_regression
    For classification: chi2, f_classif, mutual_info_classif
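
A minimal sketch of univariate selection with GenericUnivariateSelect, using synthetic data for illustration (the notebook works on the superconductivity dataset instead):

from sklearn.datasets import make_regression
from sklearn.feature_selection import GenericUnivariateSelect, f_regression

# Synthetic regression data standing in for the real dataset
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, random_state=0)

# Keep the 10 highest-scoring features according to the univariate F-test
selector = GenericUnivariateSelect(score_func=f_regression, mode='k_best', param=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 10)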

2. Backward Elimination using Statistical Significance

  • This method uses p-values for elimination of features.
  • The significance level can be set using p_threshold.
  • We use OLS (Ordinary Least Squares) regression (commonly known as linear regression) for finding the p-values. A sketch of the procedure follows this list.
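
A minimal sketch of backward elimination by p-value, assuming X is a pandas DataFrame of predictors and y the target Series (the helper name and threshold are illustrative, not the notebook's exact code):

import statsmodels.api as sm

def backward_elimination(X, y, p_threshold=0.05):
    features = list(X.columns)
    while features:
        # OLS needs an explicit intercept column
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        p_values = model.pvalues.drop('const')
        worst = p_values.idxmax()
        if p_values[worst] > p_threshold:
            features.remove(worst)  # drop the least significant feature
        else:
            break                   # all remaining features are significant
    return features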

3. Model-based (Select-from-Model)

  • SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting.
  • Features are considered unimportant and removed if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. (A short sketch follows this list.)
  • Available heuristics are “mean”, “median” and float multiples of these such as “0.1*mean”.
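
A minimal sketch of SelectFromModel with an ExtraTreesRegressor, again on synthetic data for illustration:

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=200, n_features=50, n_informative=10, random_state=0)

# Keep only features whose importance exceeds the mean feature importance
selector = SelectFromModel(ExtraTreesRegressor(n_estimators=100, random_state=0),
                           threshold='mean')
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)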

4. RFE (Recursive feature elimination) and RFE-CV (Recursive feature elimination with Cross Validation)

  • Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.
  • First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features.
  • This procedure is repeated recursively on the pruned set until the desired number of features is eventually reached. A sketch of RFE-CV follows this list.
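
A minimal sketch of RFE-CV with a linear model, letting cross-validation choose the number of features (synthetic data for illustration):

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=50, n_informative=10, random_state=0)

selector = RFECV(LinearRegression(), step=1, cv=5, scoring='r2')
selector.fit(X, y)
print(selector.n_features_)   # number of features retained
print(selector.support_)      # boolean mask of selected features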

5. 'feature_selector' module

  • Feature selector is a tool for dimensionality reduction of machine learning datasets.

Techniques available to identify features to remove:

  • Missing Values
  • Single Unique Values
  • Collinear Features
  • Zero Importance Features
  • Low Importance Features

Importance graph
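
A rough sketch of how the feature_selector tool (linked under Resources) is typically driven; the method and parameter names follow its README, but treat them as an assumption and check the linked repository:

from feature_selector import FeatureSelector

# X: DataFrame of features, y: target column (assumed already loaded)
fs = FeatureSelector(data=X, labels=y)

fs.identify_missing(missing_threshold=0.6)          # features with many missing values
fs.identify_single_unique()                         # features with a single unique value
fs.identify_collinear(correlation_threshold=0.98)   # highly correlated feature pairs
fs.identify_zero_importance(task='regression',      # zero-importance features (GBM-based)
                            eval_metric='l2',
                            n_iterations=10,
                            early_stopping=True)
fs.identify_low_importance(cumulative_importance=0.99)

X_removed = fs.remove(methods='all')                # drop everything identified above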

Analysis of Model score vs the number of features using the RFE

In this analysis we use the feature importances obtained from the RFE technique and the number of features selected by the RFE-CV technique. We then plot the model score against the number of features, adding features in descending order of importance. The plots are shown below; a sketch of how such a curve can be produced follows them.

1. Linear Regression

Linear Regression Graph

2. Lasso Regression

Lasso Regression Graph

3. Decision Tree Regression

Decision Tree Regression Graph

4. Extra Trees Regression

ExtraTreesRegressor Graph

5. XGB Regressor

XGB Regressor Graph
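
A minimal sketch of how these score-vs-number-of-features curves can be generated, assuming X (a NumPy feature matrix) and y are already loaded from the dataset:

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

estimator = LinearRegression()

# Rank all features with RFE (rank 1 = most important)
ranking = RFE(estimator, n_features_to_select=1).fit(X, y).ranking_
order = np.argsort(ranking)              # feature indices, most important first

scores = []
for k in range(1, X.shape[1] + 1):
    subset = X[:, order[:k]]             # top-k features by RFE ranking
    scores.append(cross_val_score(estimator, subset, y, cv=5, scoring='r2').mean())

# Plot range(1, X.shape[1] + 1) against scores to reproduce the curves above;
# swap in Lasso, DecisionTreeRegressor, ExtraTreesRegressor or XGBRegressor
# for the other panels.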

Conclusion

  • Extra Trees Regression and XGBoost Regression are the best models for feature selection, as their scores rise with far fewer features compared to the other models.
  • Lasso Regression does not need separate feature selection, since its L1 penalty shrinks the coefficients of low-importance features toward zero by itself. Hence Lasso is the worst model to use for feature selection tasks.
  • Extra Trees Regression outperforms every other model used in this analysis, achieving a score of 0.9258.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests to us.

Authors

Acknowledgments

Resources

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
https://github.com/WillKoehrsen/feature-selector
