This project predicts financial risk for companies, focusing on whether they will default on their debt obligations. It analyzes financial data from 2015 and derives the default label from net worth data for 2016. Key methods include data preprocessing, feature engineering, machine learning, and visualization.
Defaults in companies can lead to lower credit ratings, higher borrowing costs, and challenges in raising capital. The objective is to predict the likelihood of default using historical financial data to help stakeholders make informed decisions.
- Description: Contains 67 columns representing financial metrics like Net Worth, Total Debt, Revenue, and Profit.
- Target Variable: `Networth_Next_Year`, used to derive the `default` variable.
- Description: Weekly stock prices for companies from 2014 to 2020.
- Column Renaming: Standardized column names (e.g., replaced spaces and special characters with underscores).
- Outlier Treatment: Capped outliers using the 5th and 95th percentiles.
- Missing Value Imputation: Used median imputation for filling missing values.
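
The three preprocessing steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the helper names (`clean_column_names`, `cap_outliers`, `impute_median`), the sample columns, and the choice to cap before imputing are all assumptions.

```python
import re

import numpy as np
import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Replace runs of spaces/special characters with a single underscore."""
    renamed = {
        col: re.sub(r"[^0-9a-zA-Z]+", "_", str(col)).strip("_")
        for col in df.columns
    }
    return df.rename(columns=renamed)


def cap_outliers(df: pd.DataFrame, lower: float = 0.05, upper: float = 0.95) -> pd.DataFrame:
    """Clip every numeric column to its own 5th/95th percentile range."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out


def impute_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with the column median."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Hypothetical sample data, only to exercise the helpers.
raw = pd.DataFrame({
    "Net Worth": [10.0, 12.0, 11.0, 500.0, np.nan],
    "Total Debt (Cr.)": [5.0, np.nan, 7.0, 6.0, -300.0],
})
clean = impute_median(cap_outliers(clean_column_names(raw)))
print(clean)
```

Capping per column keeps each metric on its own scale; a single global threshold would not make sense across metrics with different units.
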
- Created the binary target variable `default`:
  - 1: `Networth_Next_Year < 0` (Defaulted).
  - 0: `Networth_Next_Year > 0` (Non-Defaulted).
- Addressed multicollinearity using Variance Inflation Factor (VIF).
- Selected features based on univariate and bivariate analysis.
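
As a rough sketch, the `default` flag described above could be derived like this. The function name `add_default_flag` and the sample values are hypothetical. The text does not say how a net worth of exactly zero is labelled; this sketch assumes it counts as non-default.

```python
import pandas as pd


def add_default_flag(df: pd.DataFrame, networth_col: str = "Networth_Next_Year") -> pd.DataFrame:
    """Add a binary `default` column: 1 if next year's net worth is negative."""
    out = df.copy()
    # Assumption: exactly zero net worth is labelled 0 (the source leaves this open).
    out["default"] = (out[networth_col] < 0).astype(int)
    return out


# Hypothetical values for illustration only.
companies = pd.DataFrame({"Networth_Next_Year": [120.5, -30.2, 4.0, -0.1]})
labelled = add_default_flag(companies)
print(labelled)
```
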
- Boxplots and Heatmaps:
  - Variables like `Networth` and `Capital_Employed` showed significant separation between default and non-default groups.
- Correlation Matrix:
  - Highlighted multicollinearity among independent variables like `Gross_Block`, `PBIDT`, and `Total_Debt`.
Variable | Correlation with Target |
---|---|
Networth | 0.85 |
Capital_Employed | 0.78 |
PBIDT | 0.72 |
- Approach A: Reduced multicollinearity by removing variables with VIF > 5.
- Approach B: Started with all variables and iteratively removed those with p-values > 0.05.
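
To make Approach A concrete, here is a NumPy-only sketch of VIF-based filtering. The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the others. Dropping the single worst variable and recomputing is one common way to apply the threshold; whether the project did it iteratively is an assumption, and the function and variable names are hypothetical.

```python
import numpy as np


def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)


def drop_high_vif(X, names, threshold: float = 5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names


# Synthetic data: x1_copy is almost identical to x1, so one of the two should go.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x1_copy = x1 + 0.01 * rng.normal(size=200)
X_kept, kept = drop_high_vif(np.column_stack([x1, x2, x1_copy]), ["x1", "x2", "x1_copy"])
print(kept)
```
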
- Built a base model with default parameters.
- Tuned hyperparameters using GridSearchCV.
- Applied SMOTE to address class imbalance.
- Explored LDA for classification but noted weaker performance compared to Random Forest.
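
The idea behind SMOTE can be illustrated in a few lines of NumPy: each synthetic minority row is placed at a random point on the line between a real minority row and one of its nearest minority neighbours. This is a simplified sketch for intuition, not the implementation used in the project (in practice a library such as imbalanced-learn would normally be used), and the function name and data are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_min, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic minority rows by interpolating towards near neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                     # random real rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical minority-class features (e.g. defaulted companies).
X_minority = np.random.default_rng(1).normal(size=(10, 3))
synthetic = smote_like_oversample(X_minority, n_new=40)
print(synthetic.shape)
```

Oversampling is normally applied to the training split only, so that the test set keeps the real class balance.
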
- Metrics Used:
  - Recall: Prioritized to minimize false negatives.
  - Precision: Evaluated to avoid false positives.
  - Accuracy: Provided overall performance.
Metric | Train Data | Test Data |
---|---|---|
Accuracy | 98% | 94% |
Recall | 93% | 91% |
Precision | 87% | 84% |
F1-Score | 90% | 87% |
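
For reference, the four metrics in the table are all derived from confusion-matrix counts. The sketch below computes them in plain Python on a small made-up set of labels; the function name and the labels are illustrative and unrelated to the project's data.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision and F1 for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed defaults
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Made-up labels: 3 actual defaults, of which 2 are caught, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The reported F1 values are consistent with this definition: on the training data, 2 × 0.87 × 0.93 / (0.87 + 0.93) ≈ 0.90.
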
- Visualization:
  - Weekly stock price trends for companies like Infosys and SAIL.
  - Highlighted volatility using boxplots.
Stock | Mean Return | Standard Deviation (Risk) |
---|---|---|
Shree Cement | 5.2% | 2.4% |
Infosys | 4.8% | 1.8% |
Idea Vodafone | -3.4% | 5.8% |
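
A mean return and a standard deviation like those in the table can be computed from a weekly price series as sketched below. The text does not say how returns were defined, so the use of log returns is an assumption, and the price series are invented for illustration.

```python
import math
import statistics


def weekly_log_returns(prices):
    """Log return between each pair of consecutive weekly prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


def return_and_risk(prices):
    """Mean weekly return and its sample standard deviation (used as risk)."""
    returns = weekly_log_returns(prices)
    return statistics.mean(returns), statistics.stdev(returns)


# Invented price series: one rising steadily, one swinging and ending lower.
steady = [100, 101, 102, 103, 104]
volatile = [100, 90, 105, 85, 95]

steady_mean, steady_risk = return_and_risk(steady)
volatile_mean, volatile_risk = return_and_risk(volatile)
print(f"steady:   mean={steady_mean:.4f} risk={steady_risk:.4f}")
print(f"volatile: mean={volatile_mean:.4f} risk={volatile_risk:.4f}")
```
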
- Model B (Approach B, retaining only variables with p-values < 0.05) outperformed Model A (Approach A, VIF-filtered variables).
- The GridSearchCV-tuned model with SMOTE showed the highest performance.
- High-risk stocks like Idea Vodafone and Jet Airways showed negative returns and high volatility.
- Shree Cement and Infosys emerged as high-return, low-risk stocks.
- Investment Strategy:
  - Focus on high-return, low-volatility stocks (e.g., Shree Cement, Infosys).
  - Avoid high-risk stocks with low returns and high volatility.
- Model Deployment:
  - Use the Random Forest model with SMOTE for default prediction.
  - Regularly update the model with new data.