Credit Scoring Model Comparison

Predictive Credit Scoring using Machine Learning algorithms under the SEMMA methodology in SAS Enterprise Miner

The dataset used can be found by clicking here


Project Introduction and Goals

Introduction:

Customer churn, also known as customer attrition, is the loss of clients or customers. Banks rely on exploratory data analysis and predictive techniques to discover the most striking behaviors of the customers most likely to churn.


Aim of the project

Build a model that determines the most prominent characteristics of customers likely to churn on their credit card service, so that banks can be proactive and prevent customer attrition.


Strategy

Use SAS Enterprise Miner, following the SEMMA methodology, to compare the prediction effectiveness of the following models:

  • Decision Trees
  • Logistic Regression
  • Gradient Boosting

Hypothesis

Total Transaction Count (last 12 months) explains whether a customer will churn better than the customer's age and credit limit do. We expect gradient boosting, a model widely used in credit scoring, to perform best (ZhenyaTian, 2020).


Bank Churners Dataset Description

  • 21 Variables
  • 6 Class variables
  • 15 numerical variables (ordinal and interval)
  • No missing values

Diagram


Sample


  • Training set: 80%
  • Validation set: 10%
  • Test set: 10%
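As a rough illustration outside SAS Enterprise Miner, the same stratified 80/10/10 partition could be reproduced in Python. The file and column names below (BankChurners.csv, Attrition_Flag) come from the public Kaggle version of the dataset and are assumptions, not part of the SAS flow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file/column names, taken from the public BankChurners dataset.
df = pd.read_csv("BankChurners.csv")

# 80% train, then split the remaining 20% in half: 10% validation, 10% test.
# Stratifying on the target keeps the same churn rate in every partition.
train, rest = train_test_split(
    df, train_size=0.80, stratify=df["Attrition_Flag"], random_state=42
)
valid, test = train_test_split(
    rest, train_size=0.50, stratify=rest["Attrition_Flag"], random_state=42
)
```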

Explore


DM Project (1)

The target variable is imbalanced.

Most customers did not churn; churners make up only 8.6% of the total.


In this node we identify the input variables that are useful for predicting the target variable.

We use the Chi-square selection criterion (this method is available only for binary targets).

  • Number of Bins: default 50

  • Maximum Pass Number: default 6

  • Minimum Chi-Square: default 3.84

DM Project (2)

Variables with a Chi-square statistic higher than 3.84 are accepted for training the model, since for those variables we reject the null hypothesis that the feature is independent of the target variable.
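A minimal sketch of that acceptance rule in Python: each binned input is tested for independence against the binary target, and features whose chi-square statistic clears 3.84 are kept. SAS Enterprise Miner's binned chi-square procedure differs in detail (it builds binary splits, so comparing the raw statistic to 3.84 here mirrors the node's rule rather than a dof-adjusted test); the helper name chi_square_screen is hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_screen(df, target, features, n_bins=50, min_chi2=3.84):
    """Keep inputs whose binned contingency table rejects independence
    from the binary target at the 3.84 cutoff (alpha = 0.05, 1 df)."""
    kept = []
    for col in features:
        # Quantile-bin the input, capped at 50 bins as in the node settings.
        bins = pd.qcut(df[col], q=min(n_bins, df[col].nunique()), duplicates="drop")
        stat = chi2_contingency(pd.crosstab(bins, df[target]))[0]
        if stat >= min_chi2:
            kept.append(col)
    return kept
```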

DM Project (3)


This node helps us to choose the best variables or cluster components for analysis.

Variable clustering removes collinearity, decreases variable redundancy, and helps reveal the underlying structure of the input variables in a data set.

Since the clustering source is based on the covariance matrix, variables with larger variances have more importance in the analysis.

We include class variables through the use of dummy variables.

We keep the hierarchies option on in order to create a hierarchical cluster structure.
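SAS's variable-clustering node is proprietary, but a rough approximation of the idea is hierarchical clustering on the correlation matrix, keeping one representative variable per cluster. A minimal sketch, assuming dummy-encoded numeric inputs (the helper name cluster_variables is hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_variables(X: pd.DataFrame, n_clusters: int = 10):
    """Group correlated inputs and keep one representative per cluster."""
    corr = X.corr().abs()
    # Distance = 1 - |correlation|, in condensed form for linkage().
    dist = squareform(1 - corr.values, checks=False)
    labels = fcluster(linkage(dist, method="average"), n_clusters, criterion="maxclust")
    reps = []
    for k in np.unique(labels):
        members = corr.columns[labels == k]
        # Keep the member most correlated with the rest of its cluster.
        reps.append(corr.loc[members, members].sum().idxmax())
    return reps
```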


Modify


When using the variable selection method beforehand:

DM Project (6)

When not using the variable selection method beforehand:

DM Project (7)

Interactive binning results, using the Gini statistic as the variable selection method. Rare levels are grouped using a cutoff value of 0.5%.
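A minimal sketch of that rare-level rule, assuming plain string columns: levels below the 0.5% frequency cutoff are merged into a single catch-all level (the helper name group_rare_levels is hypothetical).

```python
import pandas as pd

def group_rare_levels(s: pd.Series, cutoff: float = 0.005, other: str = "_OTHER_"):
    """Merge class levels whose relative frequency is below the cutoff (0.5%)."""
    s = s.astype("object")  # avoid categorical-dtype errors on the new level
    freq = s.value_counts(normalize=True)
    rare = freq[freq < cutoff].index
    return s.where(~s.isin(rare), other)
```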


Model

Decision Trees


  • Decision Tree (1): variable selection & clustering.
    Target criterion = Gini coefficient; leaf size = 5.

  • Decision Tree (2): interactive binning, without variable selection or clustering.
    Target criterion = Gini coefficient; leaf size = 50.

  • Decision Tree (3): without binning, variable selection, or clustering.
    Target criterion = Gini coefficient; leaf size = 5.
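For reference, a rough scikit-learn analogue of Decision Tree (1)'s settings (Gini criterion, leaf size 5); the SAS Enterprise Miner tree options don't map one-to-one, and X_train/y_train are assumed to come from the 80/10/10 split above.

```python
from sklearn.tree import DecisionTreeClassifier

# Gini split criterion and a minimum leaf size of 5, as in Decision Tree (1).
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5, random_state=42)
tree.fit(X_train, y_train)
print("validation accuracy:", tree.score(X_valid, y_valid))
```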

DM Project (8)

The ROC curve above compares the three decision trees. The vertical axis plots sensitivity (the true positive rate), while the horizontal axis plots 1 − specificity (the false positive rate). Performance is greatest when the true positive rate is maximized while false positives are minimized. The curve dips on the test set, meaning more false positives: the model is somewhat overfit.

DM Project (9)

Logistic Regression


  • Logistic Regression (1): variable selection & interactive binning; stepwise selection model.

  • Logistic Regression (2): variable selection; stepwise selection model.

  • Logistic Regression (3): variable selection with no selection model.
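SAS Enterprise Miner's stepwise method adds and drops terms using p-value entry/stay thresholds; a loose Python stand-in is forward sequential feature selection around a logistic regression. A minimal sketch, assuming numeric, dummy-encoded inputs in a DataFrame:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Forward selection as a stand-in for stepwise: stop adding features once the
# cross-validated score improvement falls below tol.
logit = LogisticRegression(max_iter=1000)
selector = SequentialFeatureSelector(
    logit, n_features_to_select="auto", tol=1e-4, direction="forward"
)
selector.fit(X_train, y_train)
print(list(X_train.columns[selector.get_support()]))
```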

DM Project (10)

The first regression model has the strongest predictive power of the three, as indicated by the smallest misclassification rate of 0.081935.

DM Project (11)

Order of importance of the grouped variables in the stepwise selection process, keeping those with a p-value lower than 0.05.

DM Project (12)

Gradient Boosting Model

Since this is a classification problem, we use the misclassification rate as our assessment criterion. The misclassification rates are impressively low:

  • Train: 0.035
  • Valid: 0.028
  • Test: 0.030
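A hedged sketch of the same assessment in Python: fit a gradient boosting classifier and report the misclassification rate (1 − accuracy) on each partition, plus the top features by impurity importance. The hyperparameters are illustrative, not the SAS Enterprise Miner settings.

```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)

# Misclassification rate = 1 - accuracy on each partition.
parts = {"train": (X_train, y_train), "valid": (X_valid, y_valid), "test": (X_test, y_test)}
for name, (X, y) in parts.items():
    print(name, "misclassification:", round(1 - gb.score(X, y), 3))

# Three most important inputs by impurity-based importance.
print(sorted(zip(gb.feature_importances_, X_train.columns), reverse=True)[:3])
```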

DM Project (13)

Results from the gradient boosting model indicate these variables as most important:

  • Total_Trans_Ct
  • Total_Trans_Amt
  • Total_Revolving_Bal

DM Project (14)

Gradient boosting is a tree-based algorithm that improves itself by building each new tree on the errors of the previous ones, which by nature makes it prone to overfitting. The lowest misclassification rate occurs around the 200th iteration.
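With scikit-learn's staged predictions one can locate that optimum directly; a minimal sketch, reusing the gb model fitted above:

```python
import numpy as np

# Validation misclassification after each boosting iteration.
errors = [np.mean(y_pred != y_valid) for y_pred in gb.staged_predict(X_valid)]
best_iter = int(np.argmin(errors)) + 1
print("lowest validation misclassification at iteration", best_iter)
```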

DM Project (15)

Final Model Assessment

Sensitivity vs specificity:

DM Project (16)

Based on the largest AUC, gradient boosting most accurately predicts both the customers who will churn and those who will not.

DM Project (17)

Observing the cumulative lift, gradient boosting provides the best prediction rate of the three models over the first 50% of observations.
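A minimal sketch of both final metrics, assuming the target is encoded 1 = churner / 0 = non-churner and reusing the fitted gb model: AUC from predicted probabilities, and cumulative lift as the churn rate among the top-scored 50% of customers divided by the overall churn rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

proba = gb.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))

def cumulative_lift(y_true, scores, depth=0.5):
    """Churn rate in the top `depth` fraction of scores vs. the overall rate."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)[::-1]
    top = y_true[order][: int(len(scores) * depth)]
    return top.mean() / y_true.mean()

print("lift at 50% depth:", cumulative_lift(y_test, proba))
```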

DM Project (18)
