Credit risk represents the likelihood of borrowers failing to meet their loan obligations, posing a significant challenge for financial institutions. To mitigate this risk, lenders monitor critical metrics such as Days Past Due (DPD), which tracks overdue payments. Loans that remain unpaid for over 90 days are classified as Non-Performing Assets (NPAs), signifying increased default risk. To evaluate portfolio health, institutions use the Portfolio at Risk (PAR) metric, which quantifies the Outstanding Principal (OSP) of delinquent loans. This proactive approach empowers banks to implement targeted risk management strategies, maintain effective oversight of lending operations, and safeguard their financial stability.
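As a rough illustration of how these metrics relate, the sketch below flags NPAs and computes PAR on a toy loan book. The column names (`osp`, `dpd`), the figures, and the helper `portfolio_at_risk` are hypothetical. Expressing PAR as the delinquent OSP divided by the total OSP is one common convention and is assumed here; the text above only says that PAR quantifies the OSP of delinquent loans.

```python
import pandas as pd

# Hypothetical loan book: names and figures are illustrative only.
loans = pd.DataFrame({
    "loan_id": ["A", "B", "C", "D"],
    "osp": [100_000.0, 50_000.0, 200_000.0, 150_000.0],  # outstanding principal
    "dpd": [0, 35, 120, 95],                              # days past due
})


def portfolio_at_risk(df: pd.DataFrame, dpd_threshold: int) -> float:
    """Share of total OSP held in loans more than `dpd_threshold` days past due."""
    at_risk_osp = df.loc[df["dpd"] > dpd_threshold, "osp"].sum()
    return at_risk_osp / df["osp"].sum()


# Loans more than 90 days overdue are treated as NPAs.
loans["is_npa"] = loans["dpd"] > 90

par_30 = portfolio_at_risk(loans, 30)  # loans B, C and D count as at risk
par_90 = portfolio_at_risk(loans, 90)  # only loans C and D count as at risk
print(f"PAR30 = {par_30:.0%}, PAR90 = {par_90:.0%}")
```

Varying the DPD threshold shows how much of the portfolio sits in each delinquency band.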
The primary goal of this project is to create a predictive model that can:
- Analyze factors influencing credit risk and loan approval.
- Classify individuals into four priority categories based on their credit profiles to streamline loan approval and minimize NPAs.
- CIBIL dataset: Aggregates credit history across multiple banks.
- Key features:
- Credit Score: Ranges from 469 to 811, a key indicator of creditworthiness.
- Trade Lines: Information on loans and credit accounts across banks.
- Payment History: Tracks payment behavior over time (e.g., 6 and 12 months), including delinquency indicators like Days Past Due (DPD).
- Bank's internal dataset: Provides a granular view of the customer’s relationship with the bank.
- Key features:
- Total/Active Trade Lines: Number of credit accounts showing engagement with credit products.
- Assets: Details on assets held with the bank, contributing to the financial profile.
One major challenge was missing data, common in financial datasets due to data entry errors or unreported information. For instance, the CIBIL dataset had 35k missing values for the delinquency column (max_del), accounting for 70% of the data. Instead of imputing, this column was removed to avoid bias. Columns with more than 10,000 missing values (represented by -99999) were similarly removed. This ensured that 70-80% of the data was retained, maintaining model reliability.
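The column-dropping rule described above can be sketched in pandas as follows. This is a minimal illustration and not the project's actual code: the helper `drop_sparse_columns`, the `credit_score` column, and the toy values are assumptions, and the threshold is lowered in the demo so the effect is visible on four rows.

```python
import pandas as pd

SENTINEL = -99999     # placeholder for missing values, as stated in the text
MAX_MISSING = 10_000  # columns with more missing entries than this are dropped


def drop_sparse_columns(df, sentinel=SENTINEL, max_missing=MAX_MISSING):
    """Drop every column whose count of sentinel values exceeds `max_missing`."""
    missing_counts = (df == sentinel).sum()
    to_drop = missing_counts[missing_counts > max_missing].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Toy frame (hypothetical values); threshold lowered to 2 for the demo.
toy = pd.DataFrame({
    "max_del": [-99999, -99999, -99999, 2],
    "credit_score": [650, 700, -99999, 720],
})
cleaned, dropped = drop_sparse_columns(toy, max_missing=2)
print(dropped)                  # columns removed by the rule
print(cleaned.columns.tolist()) # columns that survive
```

Counting sentinels per column before dropping anything also gives a quick report of where the gaps are concentrated.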
- Chi-Square Test: All categorical features were retained because their p-values were ≤ 0.05, indicating a statistically significant relationship with the target variable.
- Variance Inflation Factor (VIF): Numerical features with VIF > 6 were removed to reduce multicollinearity, shrinking the feature set from 72 to 39.
- ANOVA: Applied to the remaining numerical features, retaining those with p-values ≤ 0.05, which left 37 statistically significant predictors of the approval outcome.
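The three tests above can be sketched with NumPy, pandas, and SciPy. This is an illustrative outline under several assumptions: the helper names and the synthetic data are hypothetical, VIF is computed directly as 1 / (1 - R²) with an intercept in each auxiliary regression, and features are dropped one at a time starting with the highest VIF. The source does not say which removal order was used, so that loop is only one reasonable choice.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

ALPHA = 0.05         # significance level quoted in the text
VIF_THRESHOLD = 6.0  # VIF cut-off quoted in the text


def chi_square_pvalue(feature: pd.Series, target: pd.Series) -> float:
    """p-value of the chi-square independence test between two categoricals."""
    table = pd.crosstab(feature, target)
    return chi2_contingency(table)[1]


def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = float(((y - A @ coef) ** 2).sum())
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return float("inf") if ss_res == 0.0 else ss_tot / ss_res


def drop_high_vif(df: pd.DataFrame, threshold: float = VIF_THRESHOLD) -> list:
    """Repeatedly drop the feature with the largest VIF until all are <= threshold."""
    cols = list(df.columns)
    while len(cols) > 1:
        X = df[cols].to_numpy(dtype=float)
        vifs = [vif(X, j) for j in range(len(cols))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        cols.pop(worst)
    return cols


def anova_pvalue(feature: pd.Series, target: pd.Series) -> float:
    """p-value of a one-way ANOVA of a numeric feature across target classes."""
    groups = [feature[target == label] for label in target.unique()]
    return f_oneway(*groups).pvalue


# Synthetic demo data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 400
class_idx = rng.integers(0, 4, size=n)
target = pd.Series(np.array(["P1", "P2", "P3", "P4"])[class_idx])
a = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=n),  # nearly collinear with "a"
    "c": rng.normal(size=n),                      # independent of "a" and "b"
    "signal": class_idx + rng.normal(scale=0.5, size=n),
})
segment = pd.Series(np.where(class_idx >= 2, "high", "low"))

kept = drop_high_vif(demo[["a", "b", "c"]])     # one of "a"/"b" is removed
chi_p = chi_square_pvalue(segment, target)      # categorical vs. target
anova_p = anova_pvalue(demo["signal"], target)  # numeric vs. target
print(kept, chi_p <= ALPHA, anova_p <= ALPHA)
```

Running the VIF step before ANOVA, as in the list above, means the significance test is applied only to features that are not near-duplicates of each other.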
After cleaning the data and selecting relevant features, the focus was on categorizing loan applicants into four priority categories (P1 to P4) to enhance loan approval decisions based on risk profiles and repayment likelihood. The XGBoost algorithm was selected as the base model and tuned to achieve:
- Train accuracy: 81%
- Test accuracy: 78%
- Precision, recall, and F1 scores were also calculated for each class, going beyond overall accuracy to assess model performance thoroughly.
- This evaluation revealed class-specific issues, with P3 showing lower accuracy due to an ambiguous decision boundary within the credit score range.
- P3 Credit Score Range: Spans from 489 to 776, much broader than other classes (P1, P2, P4), affecting classification accuracy.
- Re-evaluating how credit scores are utilized for P3 could significantly improve model accuracy.
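To make the per-class evaluation concrete, the sketch below computes precision, recall, and F1 for each priority class from a confusion matrix. The labels and predictions are toy values invented for illustration and are not the project's results; the helper `per_class_metrics` is hypothetical. In practice, scikit-learn's `classification_report` reports the same quantities.

```python
import numpy as np

CLASSES = ["P1", "P2", "P3", "P4"]


def per_class_metrics(y_true, y_pred, labels=CLASSES):
    """Return {label: {"precision", "recall", "f1"}} from true and predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    # Rows are true classes, columns are predicted classes.
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1

    report = {}
    for label, i in index.items():
        tp = int(cm[i, i])
        predicted = int(cm[:, i].sum())  # everything predicted as this class
        actual = int(cm[i, :].sum())     # everything truly in this class
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy labels (hypothetical): P3 is confused with its neighbours P2 and P4.
y_true = ["P1", "P1", "P2", "P2", "P3", "P3", "P3", "P4"]
y_pred = ["P1", "P1", "P2", "P3", "P2", "P3", "P4", "P4"]

report = per_class_metrics(y_true, y_pred)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

A class whose recall is much lower than the others, as P3 is in this toy data, is the kind of class-specific weakness that a single overall accuracy figure hides.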