Predicting the Fate of Titanic Passengers
This repository applies data science and machine learning to the RMS Titanic disaster, analyzing and predicting passenger survival from factors such as age, sex, and family relationships. It reports insights into survival probabilities and compares six classification models.
The main objectives of this project are:
- Data Exploration: Understanding the dataset and identifying trends.
- Feature Engineering: Creating meaningful features to improve model performance.
- Model Implementation: Comparing different machine learning algorithms.
- Model Evaluation: Using metrics to assess model accuracy and performance.
- Insights and Future Directions: Drawing conclusions and suggesting improvements.

The data was prepared in the following steps:
- Data Cleaning:
  - Addressed missing values in columns such as `Age` and `Embarked`.
  - Removed outliers that could skew the results.
  - Standardized and normalized features for consistent model input.
- Feature Engineering:
  - Created new features like:
    - `FamilySize` (combination of `SibSp` and `Parch`).
    - `IsAlone` (indicating whether a passenger was alone or with family).
    - `Title` (extracted from passenger names).
  - Encoded categorical variables (e.g., `Sex`, `Embarked`) using one-hot encoding.
- Data Splitting:
  - Divided the dataset into:
    - Training Set: 80% of the data for model training.
    - Testing Set: 20% of the data for evaluation.
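
The preprocessing steps above can be sketched with pandas and scikit-learn. This is a minimal illustration, not the repository's actual code: the five-row sample is made up, the column names follow the Kaggle Titanic dataset, and the median/mode imputation, the `+ 1` in `FamilySize`, the title regular expression, and `random_state=42` are all assumptions. Outlier removal and scaling are omitted for brevity.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows using Kaggle Titanic column names (illustrative only, not real data).
df = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Doe, Mrs. Jane", "Brown, Miss. Ann",
             "Lee, Master. Tom", "Gray, Mr. Paul"],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, None, 26.0, 4.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 2, 0],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Survived": [0, 1, 1, 1, 0],
})

# Data cleaning: fill missing Age with the median and missing Embarked
# with the most frequent port (assumed imputation strategy).
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Feature engineering. The "+ 1" counts the passenger themselves (assumption).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
# Take the text between the comma and the first period, e.g. "Mr" or "Miss".
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# One-hot encode the categorical columns.
X = pd.get_dummies(
    df.drop(columns=["Name", "Survived"]),
    columns=["Sex", "Embarked", "Title"],
)
y = df["Survived"]

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

On the full dataset, passing `stratify=y` to `train_test_split` would keep the survival ratio similar in both sets; it is left out here only because the sample is too small to stratify.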
The following machine learning models were implemented and evaluated:
Model | Accuracy |
---|---|
Decision Tree Classifier | 0.7989 |
Random Forest Classifier | 0.7877 |
Logistic Regression | 0.7821 |
AdaBoost | 0.7374 |
k-Nearest Neighbors | 0.6704 |
Support Vector Classifier | 0.6369 |
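
A comparison like the one in the table could be produced with a loop over scikit-learn estimators. The sketch below uses synthetic data from `make_classification` as a stand-in for the preprocessed Titanic features and mostly default hyperparameters, so it will not reproduce the accuracies above; the sample size, seeds, and `max_iter` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features (assumption, not Titanic data).
X, y = make_classification(n_samples=891, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "k-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
}

# Fit each model on the training set and score it on the held-out test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28} {acc:.4f}")
```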

The following metrics are used to evaluate the models:
- Accuracy: Percentage of all predictions that are correct.
- Precision: Ratio of true positives to all predicted positives (true positives plus false positives).
- Recall: Ratio of true positives to all actual positives (true positives plus false negatives).
- F1-Score: The harmonic mean of precision and recall.
- Confusion Matrix: A table of true positive, true negative, false positive, and false negative counts.
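
To make the metric definitions concrete, the sketch below derives each one from the four confusion-matrix counts in plain Python. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen for easy arithmetic; they are not results from this project.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 30 true positives, 50 true negatives,
# 10 false positives, 10 false negatives.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts, accuracy is 80/100 = 0.80, while precision and recall are both 30/40 = 0.75, so F1 is also 0.75. scikit-learn offers the same calculations through `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `confusion_matrix` in `sklearn.metrics`.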

The accuracies in the table lead to the following observations:
- Decision Tree Classifier was the top performer, achieving the highest accuracy (0.7989).
- Random Forest Classifier (0.7877) and Logistic Regression (0.7821) came close behind and are viable options for further optimization.
- AdaBoost, k-Nearest Neighbors, and Support Vector Classifier had lower accuracies (0.6369 to 0.7374) and may require additional tuning or feature adjustments.
To further improve the analysis and predictions, the following steps are proposed:
- Hyperparameter Tuning:
  - Use grid search or randomized search to optimize hyperparameters for each model.
- Ensemble Methods:
  - Combine models (e.g., bagging or stacking) to leverage the strengths of multiple algorithms.
- Advanced Feature Engineering:
  - Explore domain-specific features or additional transformations to enhance predictive power.
- Deep Learning Approaches:
  - Implement deep neural networks to uncover complex patterns in the data.
- Class Imbalance Handling:
  - Apply techniques like SMOTE (Synthetic Minority Oversampling Technique) to address class imbalance.
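
As one example of the proposed steps, hyperparameter tuning with grid search might look like the sketch below. It uses scikit-learn's `GridSearchCV` on a decision tree with synthetic data; the parameter grid, the 5-fold cross-validation, and the data are assumptions for illustration, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training features (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values to try; 3 x 3 = 9 combinations in total (illustrative grid).
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# Evaluate every combination with 5-fold cross-validation on accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

`RandomizedSearchCV` from the same module takes a similar form and samples a fixed number of combinations, which can be cheaper when the grid is large.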

To get started, clone the repository and change into its directory:

    git clone https://github.com/AzaamAhmed/Titanic-Survival-Prediction-Analysis.git
    cd Titanic-Survival-Prediction-Analysis

Ensure you have Python and the required libraries installed. Install dependencies using:

    pip install -r requirements.txt

Then work through the following steps:
- Data Preprocessing:
  - Execute the notebook or script to clean and preprocess the dataset.
- Model Training:
  - Train the machine learning models on the training dataset.
- Evaluation:
  - Use the testing dataset to evaluate and compare model performance.
- Visualization:
  - Generate visualizations to understand data trends and model predictions.
Titanic-Survival-Prediction-Analysis/
├── data/ # Dataset files
├── notebooks/ # Jupyter notebooks for analysis and modeling
├── scripts/ # Python scripts for data processing and model implementation
├── requirements.txt # Dependencies and libraries
└── README.md # Project documentation
Contributions are welcome! If you'd like to improve the project, feel free to submit a pull request. Ensure that your changes are well-documented and tested.
This project is licensed under the MIT License. See the `LICENSE` file for details.
- Kaggle Titanic Dataset for providing the dataset.
- Open-source contributors and the data science community for inspiration and guidance.
Enjoy exploring the Titanic dataset and building predictive models!