This project develops and compares machine learning and deep learning models for detecting diabetes from the clinical and physical measurements in the Pima Indians Diabetes dataset. The goal is an effective system for early diagnosis of diabetes.
- Pima Indians Diabetes Dataset: This dataset contains 768 observations of individuals, including information such as glucose levels, blood pressure, BMI, and whether the individual has diabetes (outcome).
- Class imbalance was addressed with three dataset balancing methods: SMOTE, Instance Hardness Threshold, and Edited Nearest Neighbors.
- The data was cleaned and normalized before model training.
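
The normalization and oversampling steps listed above can be sketched roughly as follows. This is a minimal NumPy illustration, not the project's actual pipeline: the helper names `standardize` and `smote_like_oversample`, the toy data, and the choice of `k` are all assumptions, and the oversampler is a simplified SMOTE-style interpolation rather than a full implementation. In practice, the imbalanced-learn library provides `SMOTE`, `InstanceHardnessThreshold`, and `EditedNearestNeighbours` classes for the three methods named here.

```python
import numpy as np


def standardize(X):
    """Scale each feature column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std


def smote_like_oversample(X, y, k=3, seed=0):
    """Simplified SMOTE-style oversampling for binary labels (hypothetical helper).

    New minority points are interpolated between a minority sample and one
    of its k nearest minority neighbours until both classes have equal size.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = int(counts.max() - counts.min())
    X_min = X[y == minority]
    if n_new == 0 or len(X_min) < 2:
        return X.copy(), y.copy()

    # Pairwise distances among minority samples; exclude self-matches.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])

    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority, dtype=y.dtype)])
    return X_out, y_out


# Toy stand-in for the real data: 8 features, 2:1 class imbalance (assumed).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(90, 8))
y_toy = np.array([0] * 60 + [1] * 30)

X_scaled = standardize(X_toy)
X_bal, y_bal = smote_like_oversample(X_scaled, y_toy)
```

For simplicity the sketch scales and resamples the whole toy set. In a real evaluation, both steps are normally fitted on the training split only, so that the test data does not influence the scaling and synthetic samples do not leak into it.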
- Random Forest
- K-Nearest Neighbors (KNN)
- XGBoost
- LightGBM
- LSTM (Long Short-Term Memory)
- GRU (Gated Recurrent Unit)
- CNN + LSTM
- CNN + GRU
- CNN + LSTM + GRU
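
As an illustration of how two of the classical models above might be compared, here is a minimal scikit-learn sketch. It is not the project's code: the data is a synthetic stand-in generated with `make_classification` (768 samples and 8 features, chosen to resemble the dataset's shape), and the hyperparameters and 5-fold cross-validation setup are assumptions. XGBoost, LightGBM, and the recurrent and convolutional models are omitted to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): imbalanced binary classification.
X, y = make_classification(
    n_samples=768,
    n_features=8,
    n_informative=5,
    n_redundant=1,
    weights=[0.65, 0.35],
    random_state=0,
)

models = {
    # Tree ensembles do not need feature scaling.
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # KNN is distance-based, so features are standardized inside the pipeline.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

Putting the scaler inside the pipeline means it is refitted on each training fold, which keeps the held-out fold out of the preprocessing.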
- The best-performing model was Random Forest combined with the Instance Hardness Threshold balancing method, achieving an accuracy of 92.66%.
- Ensemble models provided robust performance but were not significantly better than Random Forest for this dataset.
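
Because the dataset is imbalanced, accuracy is usually read alongside precision and recall. The sketch below shows how these metrics follow from the confusion-matrix counts. The function name `binary_metrics` and the toy labels are hypothetical and are not the project's results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical): 3 positive cases, 7 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.8 while recall is only 2/3, because one of the three positive cases is missed. For a diagnostic task, that gap is often more important than the headline accuracy.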
This project demonstrates that machine learning techniques, particularly Random Forest combined with effective dataset balancing, can achieve high accuracy in diagnosing diabetes. Future work could focus on further refining the models and testing them on larger datasets.
- Pima Indians Diabetes Dataset
- SMOTE for Imbalanced Classification
- Additional academic papers and resources on diabetes prediction and dataset balancing methods.