Spaceship Titanic Prediction

Objective

The dataset was downloaded from the Kaggle competition Spaceship Titanic on 18 October 2024.
This dataset is part of an open Kaggle competition, where the task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly.
The data comes as 2 separate files, train.csv and test.csv. Each file contains a set of personal records recovered from the ship's damaged computer system. There are 13 columns of personal records, and the 14th column in train.csv is the target.

Technology Used

python
pandas
matplotlib
seaborn
scikit-learn
xgboost
lightgbm
optuna
SHAP

Approach and Methodology

I approached this problem by following a structured methodology of data exploration, preprocessing, feature selection, model selection, modeling, and evaluation.

Data Cleaning: Examined data types and identified columns that needed encoding, splitting, or removal.
Data Exploration: The feature ‘CryoSleep’ showed the highest correlation with the target variable. Hypothesis testing confirmed that passengers with different values in categorical features had distinct probabilities of being transported.
Data Preprocessing: Dropped irrelevant features, imputed missing values, encoded categorical features, and scaled numerical data using pipelines for scalability.
Feature Selection: Compared models using all features versus those selected by mutual information and PCA. Chose mutual information for its simplicity and similar performance to PCA.
Model Selection: Compared Random Forest, Logistic Regression, XGBoost, and LightGBM models. Used a DummyClassifier as a baseline to ensure the model performance exceeded basic accuracy. Evaluated models using Accuracy (primary metric) and ROC AUC and F1 Score (secondary metrics). Selected the best threshold for the chosen model.
Modeling: Performed hyperparameter tuning using Optuna on the selected model. Retrained on the full dataset and tested on the test data.
Model Insights: Interpreted the model with feature importance and SHAP values, linking findings to the EDA analysis.
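
The preprocessing, feature selection, and baseline comparison steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the data is synthetic, only four of the Kaggle column names are used, the choice of k=3 and the imputation strategies are assumptions, and Logistic Regression stands in for the full set of compared models.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for train.csv (made-up values, illustrative only).
rng = np.random.default_rng(0)
n = 400
cryo = rng.random(n) < 0.4
df = pd.DataFrame({
    "HomePlanet": pd.Series(rng.choice(["Earth", "Europa", "Mars"], size=n), dtype=object),
    "CryoSleep": pd.Series(np.where(cryo, "True", "False"), dtype=object),
    "Age": rng.integers(0, 80, size=n).astype(float),
    "Spa": np.where(cryo, 0.0, rng.exponential(300.0, size=n)),
})
# Target loosely tied to CryoSleep (20% label noise) so there is signal to learn.
y = (cryo ^ (rng.random(n) < 0.2)).astype(int)
# Remove a few values so the imputers have something to do.
df.loc[rng.choice(n, 20, replace=False), "Age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "HomePlanet"] = np.nan

CAT_COLS = ["HomePlanet", "CryoSleep"]
NUM_COLS = ["Age", "Spa"]


def make_model(classifier):
    """Impute, encode, scale, select features by mutual information, then classify."""
    preprocess = ColumnTransformer([
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
        ]), CAT_COLS),
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), NUM_COLS),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("select", SelectKBest(partial(mutual_info_classif, random_state=0), k=3)),
        ("clf", classifier),
    ])


candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=1000),
}
# Mean 5-fold accuracy per candidate; every step is refit inside each fold.
scores = {
    name: cross_val_score(make_model(clf), df, y, cv=5, scoring="accuracy").mean()
    for name, clf in candidates.items()
}
print(scores)
```

Keeping imputation, encoding, scaling, and selection inside one Pipeline means cross-validation refits them on each training fold, which avoids leaking information from the validation fold into the preprocessing.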

Results

The features ranked highly by both feature importance and SHAP are consistent with the EDA. However, the order and strength shown by feature importance were surprising.

SHAP and Feature Importance
As the EDA showed, CryoSleep, followed by the luxury amenity features, separates the 2 classes most clearly in SHAP. In feature importance, however, these appear in the opposite order. This may be because CryoSleep is not used for many splits in the decision trees but is critical in specific contexts. CryoSleep is also correlated with several other features, so its importance may be partly absorbed by them.
It was also surprising to see Age, which was not given much attention in the EDA, rank high on feature importance.
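
One possible contributor to this gap, not verified against this project's model, is that impurity-based importance in tree ensembles tends to favour continuous features, which offer many candidate split points, over binary ones. The sketch below illustrates the effect on synthetic data with a scikit-learn random forest, and uses permutation importance (not SHAP) as the comparison. Both features are hypothetical stand-ins, and nothing here implies that Age is uninformative in the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical binary flag that drives the target (80% agreement).
binary_signal = rng.integers(0, 2, size=n)
# Hypothetical continuous column with no relation to the target.
continuous_noise = rng.random(n) * 80
X = np.column_stack([binary_signal, continuous_noise])
y = binary_signal ^ (rng.random(n) < 0.2)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importance: computed from the splits made on the training data.
impurity = forest.feature_importances_
# Permutation importance: drop in held-out accuracy when a column is shuffled.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
).importances_mean

for name, imp, p in zip(["binary_signal", "continuous_noise"], impurity, perm):
    print(f"{name:17s} impurity={imp:.3f} permutation={p:.3f}")
```

If the two rankings disagree in this way on the real model, that would suggest the feature importance plot reflects how often a feature is split on more than how much the predictions depend on it.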

Challenges and Learnings

This is the first iteration of this project, with a submission score of 0.79962. One part of the project that is not visible in the final version is that LabelEncoder was used at first, then replaced with OrdinalEncoder after hyperparameter tuning showed the latter performed better by a few percent.

Future Work
For the second iteration, I would spend more time on the following improvements:

More thorough EDA and feature extraction/creation, because the final model interpretation had a few surprising details. For example, the luxury amenity features could be combined into a 'Total Spending' feature, which might have more impact.
Deeper analysis of encoding and imputation methods.
Test whether PCA or Boruta would provide better feature selection.
Try CatBoost, as the data has several categorical features, as well as other models.
Hold out one more subset of the data for a final evaluation before submission.
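
Two of the ideas above, the 'Total Spending' feature and the extra hold-out set, could be sketched as follows. This is only an illustration: the five amenity column names follow the Kaggle data, the toy values are made up, TotalSpending and add_total_spending are hypothetical names, and treating missing amounts as 0 is an assumption rather than the project's imputation strategy.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]


def add_total_spending(df):
    """Return a copy of df with a TotalSpending column summed over the amenity columns."""
    out = df.copy()
    # Assumption: a missing amount counts as 0 in the sum.
    out["TotalSpending"] = out[SPEND_COLS].fillna(0).sum(axis=1)
    return out


# Toy rows with made-up amounts, standing in for train.csv.
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0, None, 0.0],
    "FoodCourt": [0.0, 9.0, 3576.0, 1283.0],
    "ShoppingMall": [0.0, 25.0, 0.0, 371.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
    "VRDeck": [0.0, 44.0, 49.0, 193.0],
    "Transported": [False, True, False, False],
})
featured = add_total_spending(toy)

# Set aside a final hold-out set that is never touched during model selection or tuning.
dev, holdout = train_test_split(featured, test_size=0.25, random_state=0)
print(featured["TotalSpending"].tolist())
print(len(dev), len(holdout))
```

On the full training data, passing stratify=featured["Transported"] to train_test_split would keep the class balance the same in both parts.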

How do I use it

curl --location 'https://spaceship-prediction-953e7e237ee4.herokuapp.com/predict' \
--header 'Content-Type: application/json' \
--data '{
    "features": [
        ["0029_01", "Europa", true, "B/2/P", "55 Cancri e", 21.0, false, 0.0, 0.0, 0.0, 0.0, 0.0, "Aldah Ainserfle"],
        ["0029_01", "Europa", true, "B/2/P", "55 Cancri e", 21.0, false, 0.0, 0.0, 0.0, 0.0, 0.0, "Aldah Ainserfle"]
    ],
    "threshold": 0.96
}
'

output: { "prediction": [ 1, 1 ], "probability": [ 0.9741316826843636, 0.9741316826843636 ] }
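
The same request can be sent from Python with the standard library alone. This is a sketch based only on the curl example above: the helper names are hypothetical, and the thresholding rule in apply_threshold (probability greater than or equal to the threshold gives 1) is an assumption that happens to be consistent with the sample output, not a documented behaviour of the service.

```python
import json
import urllib.request

URL = "https://spaceship-prediction-953e7e237ee4.herokuapp.com/predict"


def build_payload(rows, threshold):
    """Encode the request body in the shape used by the curl example."""
    return json.dumps({"features": rows, "threshold": threshold}).encode("utf-8")


def predict(rows, threshold, url=URL, timeout=30):
    """POST the rows to the service and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=build_payload(rows, threshold),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def apply_threshold(probabilities, threshold):
    """Assumed rule: predict 1 when the probability reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]


# One row in the same column order as the curl example.
passenger = ["0029_01", "Europa", True, "B/2/P", "55 Cancri e", 21.0, False,
             0.0, 0.0, 0.0, 0.0, 0.0, "Aldah Ainserfle"]
payload = build_payload([passenger, passenger], threshold=0.96)

# Uncomment to call the live service (requires network access):
# print(predict([passenger, passenger], threshold=0.96))
```

Each row lists the 13 feature values in the order of the original columns, without the target.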

CelineChiLamNg/spaceship_prediction

Folders and files

Latest commit

History

Repository files navigation

Spaceship Titanic Prediction

Objective

Technology Used

Approach and Methodology

Results

Challenges and Learnings

How do I use it

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages