Credit card fraud detection is a critical task for financial institutions. The goal of this project is to build a system that identifies fraudulent transactions in a dataset where such cases are rare (about 0.17% of transactions). We model the probability density of legitimate transactions with a multivariate normal distribution and flag transactions whose density falls below a threshold as fraudulent.
The dataset used in this project contains credit card transactions made by European cardholders in September 2013. It is highly imbalanced, with only 0.172% of transactions being fraudulent.
- Clone this repository:
  `git clone https://github.com/your-username/credit-card-fraud-detection.git`
  `cd credit-card-fraud-detection`
- Install the required libraries:
  `pip install numpy pandas matplotlib seaborn scikit-learn tqdm psutil`
- Download the dataset from Kaggle and place it in the project directory.
- Dataset Loading: Load the dataset and split it into two main categories: legitimate transactions (`Class = 0`) and fraudulent transactions (`Class = 1`).
- Splitting Data: The legitimate transactions are further divided into training, validation, and testing sets. The fraudulent data is split into validation and testing sets only.
- Merging Validation and Testing Sets: Combine the legitimate and fraudulent data for validation and testing.
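The loading, splitting, and merging steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `split_transactions`, the 80/10/10 split of legitimate rows, the 50/50 split of fraudulent rows, and the synthetic demo frame are all assumptions. Only the `Class` column and the overall scheme come from the description.

```python
import numpy as np
import pandas as pd


def split_transactions(df, train_frac=0.8, seed=0):
    """Return (train, val, test); train holds legitimate rows only.

    train_frac and the even validation/test split are illustrative
    assumptions, not values taken from the project.
    """
    # Shuffle each class separately so the slices below are random.
    legit = df[df["Class"] == 0].sample(frac=1.0, random_state=seed)
    fraud = df[df["Class"] == 1].sample(frac=1.0, random_state=seed)

    # Legitimate rows go to training, validation, and testing.
    n_train = int(len(legit) * train_frac)
    n_val = (len(legit) - n_train) // 2
    legit_train = legit.iloc[:n_train]
    legit_val = legit.iloc[n_train:n_train + n_val]
    legit_test = legit.iloc[n_train + n_val:]

    # Fraudulent rows go to validation and testing only.
    half = len(fraud) // 2
    fraud_val, fraud_test = fraud.iloc[:half], fraud.iloc[half:]

    # Merge both classes for validation and testing.
    val = pd.concat([legit_val, fraud_val])
    test = pd.concat([legit_test, fraud_test])
    return legit_train, val, test


# Synthetic stand-in for the real data; in practice the frame would come
# from pd.read_csv(...) on the CSV downloaded from Kaggle.
demo = pd.DataFrame({"Amount": np.arange(1000.0), "Class": [1] * 20 + [0] * 980})
train, val, test = split_transactions(demo)
```

Keeping fraudulent rows out of the training set means the fitted distribution describes legitimate behaviour only, which is what the density threshold relies on.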
- Time Feature Transformation: Decompose the `Time` feature into `Day`, `Hour`, `Minute`, and `Second` to extract more meaningful patterns.
- Amount Transformation: Apply a log transformation to the `Amount` feature to reduce skewness and stabilize variance.
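One way to implement the two transformations above is shown below. It is a sketch under two assumptions: that `Time` holds seconds elapsed since the first transaction in the dataset, and that the log transformation is `log(1 + Amount)`, chosen here so that zero amounts stay finite. The helper name `add_time_and_amount_features` is hypothetical; the column name `Amount_transformed` is the one this README uses.

```python
import numpy as np
import pandas as pd


def add_time_and_amount_features(df):
    """Add Day/Hour/Minute/Second and a log-scaled amount column."""
    out = df.copy()
    # Assumption: Time is a count of seconds from the first transaction.
    seconds = out["Time"].astype(int)
    out["Day"] = seconds // 86_400
    out["Hour"] = (seconds % 86_400) // 3_600
    out["Minute"] = (seconds % 3_600) // 60
    out["Second"] = seconds % 60
    # Assumption: log1p(x) = log(1 + x) is used as the log transformation.
    out["Amount_transformed"] = np.log1p(out["Amount"])
    return out


demo = pd.DataFrame(
    {"Time": [0.0, 3_661.0, 90_061.0], "Amount": [0.0, 9.99, 250.0]}
)
features = add_time_and_amount_features(demo)
```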
- Histogram and KDE Plots: Visualize the distribution of key features like `Time`, `Hour`, and `Amount_transformed` to understand the underlying data patterns.
- Multivariate Normal Distribution: Fit a multivariate normal distribution to the training data by calculating the mean and standard deviation of each selected feature.
- Probability Density Calculation: Calculate the joint probability density function for the features, and classify transactions as fraudulent if their density falls below a certain threshold.
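The fitting and density steps above might look like the following sketch. Estimating a mean and standard deviation per feature corresponds to a multivariate normal with a diagonal covariance matrix, so the joint density is the product of per-feature normal densities; that reading is an assumption about the project. The code works with the log of the density to avoid numerical underflow for very small values. The function names, the synthetic data, and the threshold `1e-6` are illustrative only.

```python
import numpy as np


def fit_gaussian(X):
    """Estimate the per-feature mean and standard deviation."""
    return X.mean(axis=0), X.std(axis=0)


def log_density(X, mu, sigma):
    """Log joint density under independent per-feature normals."""
    z = (X - mu) / sigma
    per_feature = -0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    # Summing log densities equals taking the log of their product.
    return per_feature.sum(axis=1)


# Synthetic "legitimate" training data with two features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(5_000, 2))
mu, sigma = fit_gaussian(X_train)

# One typical point and one far from the training distribution.
X_new = np.array([[0.0, 5.0], [8.0, -10.0]])
log_p = log_density(X_new, mu, sigma)

# Hypothetical threshold, expressed in log space.
is_fraud = log_p < np.log(1e-6)
```

If correlations between features matter, `scipy.stats.multivariate_normal` with a full covariance matrix is an alternative to the per-feature form used here.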
- Tuning the Threshold: Iterate over a range of candidate threshold values (alpha) and select the one that maximizes the F2-score. The F2-score weights recall more heavily than precision, which prioritizes reducing false negatives on this imbalanced dataset.
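The threshold search can be sketched as a simple scan over candidate values. This example assumes the scan runs on validation-set scores and that thresholds are compared in log-density space. The helper names `fbeta` and `best_threshold`, the toy scores, and the candidate grid are hypothetical.

```python
import numpy as np


def fbeta(y_true, y_pred, beta=2.0):
    """F-beta score computed from confusion counts; 1 marks fraud."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return 0.0 if denom == 0 else float((1 + beta**2) * tp / denom)


def best_threshold(log_p, y_true, candidates, beta=2.0):
    """Return the candidate with the highest F-beta and that score."""
    # A transaction is flagged when its log density is below the candidate.
    scores = [fbeta(y_true, (log_p < a).astype(int), beta) for a in candidates]
    i = int(np.argmax(scores))
    return float(candidates[i]), scores[i]


# Toy validation scores: the two fraud cases have much lower log density.
log_p_val = np.array([-2.0, -3.0, -2.5, -4.0, -30.0, -45.0, -5.0])
y_val = np.array([0, 0, 0, 0, 1, 1, 0])
candidates = np.linspace(-50.0, -1.0, 50)

alpha, score = best_threshold(log_p_val, y_val, candidates)
```

`sklearn.metrics.fbeta_score(y_true, y_pred, beta=2)` computes the same quantity and could replace the hand-written `fbeta` helper.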
- Confusion Matrix: Generate the confusion matrix to visualize the model's performance in classifying transactions.
- Performance Metrics: Calculate accuracy, precision, recall, F1-score, F2-score, and MCC to evaluate the model.
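The metrics listed above all follow from the four confusion-matrix counts. The sketch below computes them directly with NumPy; the function names and the toy labels are assumptions made for illustration. The matrix layout `[[TN, FP], [FN, TP]]` follows the convention used by scikit-learn.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) with 1 marking the fraud class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    """Compute the confusion matrix and the summary metrics."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def fbeta(beta):
        denom = beta**2 * precision + recall
        return (1 + beta**2) * precision * recall / denom if denom else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    mcc_denom = float(np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))))
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": fbeta(1.0),
        "f2": fbeta(2.0),
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


# Toy labels: 3 true positives, 5 true negatives, 1 false positive, 1 false negative.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
report = evaluate(y_true, y_pred)
```

With a fraud rate this low, accuracy alone is uninformative, since predicting "legitimate" for every transaction already scores above 99%. That is why recall-oriented metrics and MCC are reported alongside it. scikit-learn provides `confusion_matrix`, `fbeta_score`, and `matthews_corrcoef` in `sklearn.metrics` for the same calculations.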
The optimal threshold value was approximately 3.87 × 10^-19, resulting in an F2-score of 0.836 on the validation set and 0.815 on the test set. The confusion matrix illustrates the classification performance.
This anomaly detection system effectively identifies fraudulent transactions by modeling legitimate transactions using a multivariate normal distribution. Key takeaways include:
- Feature Selection: The choice of meaningful features significantly impacts model performance.
- Threshold Optimization: Proper tuning of the decision threshold is essential for handling imbalanced datasets.
- Real-World Application: The model achieves an F2-score of 0.815 on the test set, which makes it a candidate for deployment in real-world fraud detection systems.
This project uses the credit card fraud dataset from Kaggle.
Feel free to contribute to this project by submitting pull requests or suggesting improvements.