In this project, I used Python to build and evaluate several machine learning models to predict credit risk. I employed the following techniques:
- Oversample the data using the `RandomOverSampler` and `SMOTE` algorithms.
- Undersample the data using the `ClusterCentroids` algorithm.
- Use a combinatorial approach of over- and undersampling using the `SMOTEENN` algorithm.
- Compare two machine learning models that reduce bias, `BalancedRandomForestClassifier` and `EasyEnsembleClassifier`.
I then evaluated the performance of these models and made a written recommendation on whether they should be used to predict credit risk.
- Data Source: LoanStats_2019Q1.csv
- Software and Tools: Python, Anaconda, Jupyter Notebook & Git Bash

Random oversampling (RandomOverSampler):
- Accuracy Score is 65.4%
- Precision High Risk Score is 1%
- Precision Low Risk Score is 100%
- Recall High Risk Score is 73%
- Recall Low Risk Score is 58%

Oversampling (SMOTE):
- Accuracy Score is 66.3%
- Precision High Risk Score is 1%
- Precision Low Risk Score is 100%
- Recall High Risk Score is 63%
- Recall Low Risk Score is 69%

Undersampling (ClusterCentroids):
- Accuracy Score is 66.3%
- Precision High Risk Score is 1%
- Precision Low Risk Score is 100%
- Recall High Risk Score is 69%
- Recall Low Risk Score is 40%

Combined over- and undersampling (SMOTEENN):
- Accuracy Score is 54.5%
- Precision High Risk Score is 1%
- Precision Low Risk Score is 100%
- Recall High Risk Score is 79%
- Recall Low Risk Score is 56%

Balanced Random Forest Classifier:
- Accuracy Score is 78.9%
- Precision High Risk Score is 3%
- Precision Low Risk Score is 100%
- Recall High Risk Score is 70%
- Recall Low Risk Score is 87%

Easy Ensemble Classifier:
- Accuracy Score is 93.2%
- Precision High Risk Score is 9%
- Precision Low Risk Score is 100%
- Recall High Risk Score is 92%
- Recall Low Risk Score is 94%

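
To make the metrics above concrete, the sketch below computes per-class precision and recall from a 2x2 confusion matrix. The counts are hypothetical and chosen only to show how heavy class imbalance can produce high recall alongside very low precision for the high-risk class; they are not taken from the project. It also computes balanced accuracy (the mean of the two recall values), on the assumption that this is the accuracy being reported: most of the accuracy figures above are close to the mean of their two recall scores.

```python
def class_metrics(tp, fn, fp, tn):
    """Per-class precision/recall and balanced accuracy.

    "Positive" means high risk:
      tp = high-risk loans predicted high risk
      fn = high-risk loans predicted low risk
      fp = low-risk loans predicted high risk
      tn = low-risk loans predicted low risk
    """
    recall_high = tp / (tp + fn)
    recall_low = tn / (tn + fp)
    return {
        "precision_high": tp / (tp + fp),
        "precision_low": tn / (tn + fn),
        "recall_high": recall_high,
        "recall_low": recall_low,
        "balanced_accuracy": (recall_high + recall_low) / 2,
    }


# Hypothetical test set: 100 high-risk loans and 17,000 low-risk loans.
metrics = class_metrics(tp=70, fn=30, fp=7000, tn=10000)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, 70% of high-risk loans are caught, yet only about 1% of the loans flagged as high risk really are, because the false positives from the much larger low-risk class swamp the true positives.
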
In summary, the two ensemble classifiers performed best. The Balanced Random Forest Classifier and Easy Ensemble Classifier had the highest accuracy scores of all the models, at 78.9% and 93.2% respectively. Because the goal of this analysis is to find the model that best detects high-risk loans, the recall score for high-risk loans is the most important metric to compare. On that metric, the Easy Ensemble Classifier scored highest, at 92%. I therefore recommend the Easy Ensemble Classifier for predicting high-risk loans, based on its high-risk recall and its strong accuracy (93.2%) and low-risk recall (94%).
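
The comparison behind this recommendation can be reproduced from the reported figures. The sketch below ranks the models by high-risk recall. It assumes the six result groups are listed in the same order as the techniques in the introduction; only the last two attributions are confirmed by the summary, so the other four labels are an assumption.

```python
# High-risk recall per model, copied from the results.
# The first four labels assume the results follow the order of the introduction.
high_risk_recall = {
    "RandomOverSampler": 0.73,
    "SMOTE": 0.63,
    "ClusterCentroids": 0.69,
    "SMOTEENN": 0.79,
    "BalancedRandomForestClassifier": 0.70,
    "EasyEnsembleClassifier": 0.92,
}

# Sort from highest to lowest recall for the high-risk class.
ranking = sorted(high_risk_recall.items(), key=lambda item: item[1], reverse=True)
best_model, best_recall = ranking[0]

for name, recall in ranking:
    print(f"{name:32s} {recall:.0%}")
print(f"Highest high-risk recall: {best_model} ({best_recall:.0%})")
```
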