This project was developed by:
- Afonso Coelho (Bugss05) - FCUP_IACD:202305085
- Diogo Amaral (damaral31) - FCUP_IACD:202305187
- Miguel Carvalho (miguel-c05) - FCUP_IACD:202305229
This repository serves as the workspace for the Kaggle competition "Titanic - Machine Learning from Disaster", a Machine Learning exercise best suited for beginners, especially those new to the Kaggle platform.
The aim of the competition is to study a dataset of passengers aboard the Titanic and to train a Machine Learning model on the relevant information extracted from it, so that the model can predict, as accurately as possible, whether a given passenger is likely to have survived the shipwreck.
Training the model requires a sufficiently large amount of data. Kaggle therefore gives participants a dataset (train.csv) to work on, containing each passenger's name, age, sex and number of siblings, among other attributes. The dataset also contains missing values and outliers, both of which harm data analysis and model training. How these cases were handled is explained later.
At first glance, women and children are likely to have a higher chance of survival, simply because these groups were prioritized for evacuation and rescue. Some cabins may also show a higher survival rate solely because of their location on the ship.
To handle missing values effectively, several methods were researched, following the well-known paper "Improved Heterogeneous Distance Functions" by D. Randall Wilson and Tony R. Martinez. The Heterogeneous Value Difference Metric (HVDM) was chosen, not only because of its efficiency but also because it is easy to implement. The algorithm is expressed in terms of the following symbols:
- $x$ and $y$ are passengers;
- $m$ is the number of attributes of a passenger;
- $a$ is an attribute;
- $\sigma$ is the standard deviation (of the values of attribute $a$);
- $c$ is the output class;
- $C$ is the number of output classes.
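For reference, the symbols above fit together as follows in Wilson and Martinez's paper. This is a transcription under assumptions rather than this project's exact formulation: the counts $N_{a,x}$ (number of training passengers whose attribute $a$ has value $x$) and $N_{a,x,c}$ (those among them with output class $c$) come from the paper and are not defined in this README, and of the several normalizations the paper offers for nominal attributes, only the simplest is shown. In the first expression $x$ and $y$ are whole passengers; in the others they stand for the two values of attribute $a$ being compared.

$$
HVDM(x, y) = \sqrt{\sum_{a=1}^{m} d_a(x_a, y_a)^2}
$$

$$
d_a(x, y) =
\begin{cases}
1 & \text{if } x \text{ or } y \text{ is unknown} \\
\text{normalized\_vdm}_a(x, y) & \text{if } a \text{ is nominal} \\
\text{normalized\_diff}_a(x, y) & \text{if } a \text{ is linear}
\end{cases}
$$

$$
\text{normalized\_diff}_a(x, y) = \frac{|x - y|}{4\sigma_a}
\qquad
\text{normalized\_vdm}_a(x, y) = \sum_{c=1}^{C} \left| \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right|
$$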
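So, in layman's terms, HVDM finds the distance between two passengers by measuring how much they differ on each attribute and combining those differences into a single value. It is similar to the HEOM algorithm, differing mainly in how each attribute is normalized: HVDM scales numeric attributes by their standard deviation rather than their range, and compares nominal attributes through class-conditional frequencies instead of a simple match/mismatch test.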
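The description above can be sketched in plain Python. This is a minimal illustration, not the code used in this repository: the function names (`fit_hvdm`, `attribute_distance`, `hvdm`), the use of `None` for missing values, the population standard deviation, and the fallbacks for unseen nominal values and zero variance are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from statistics import pstdev


def fit_hvdm(rows, labels, nominal):
    """Precompute the statistics HVDM needs.

    rows: list of equal-length tuples, None marks a missing value.
    labels: output class of each row.
    nominal: set of column indices holding nominal attributes.
    """
    m = len(rows[0])
    classes = sorted(set(labels))
    sigma = {}   # column -> standard deviation of its known values
    counts = {}  # column -> value -> Counter of output classes
    for a in range(m):
        if a in nominal:
            per_value = defaultdict(Counter)
            for row, label in zip(rows, labels):
                if row[a] is not None:
                    per_value[row[a]][label] += 1
            counts[a] = dict(per_value)
        else:
            sigma[a] = pstdev(v[a] for v in rows if v[a] is not None)
    return {"m": m, "classes": classes, "nominal": set(nominal),
            "sigma": sigma, "counts": counts}


def attribute_distance(stats, a, x, y):
    """Distance between two values of attribute a."""
    if x is None or y is None:
        return 1.0  # unknown values get the fixed distance 1
    if a in stats["nominal"]:
        cx = stats["counts"][a].get(x, Counter())
        cy = stats["counts"][a].get(y, Counter())
        nx, ny = sum(cx.values()), sum(cy.values())
        if nx == 0 or ny == 0:
            # Assumption: a value never seen in training is maximally far
            return 0.0 if x == y else 1.0
        return sum(abs(cx[c] / nx - cy[c] / ny) for c in stats["classes"])
    s = stats["sigma"][a]
    if s == 0:
        # Assumption: constant column, avoid dividing by zero
        return 0.0 if x == y else 1.0
    return abs(x - y) / (4 * s)


def hvdm(stats, x, y):
    """Combine the per-attribute distances of two rows."""
    return math.sqrt(sum(attribute_distance(stats, a, x[a], y[a]) ** 2
                         for a in range(stats["m"])))


# Toy data (made up for the example): (sex, age), label = survived
rows = [("male", 22.0), ("female", 38.0), ("female", 26.0), ("male", None)]
labels = [0, 1, 1, 0]
stats = fit_hvdm(rows, labels, nominal={0})
```

With these toy rows, `hvdm(stats, rows[0], rows[3])` depends only on the missing age, because both passengers share the same value for the nominal attribute.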
Finally, once the distances between all passengers are known, a missing value in a given attribute is filled by searching for the "closest"
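The filling step mentioned above could look roughly like the sketch below. The text does not say how many neighbours are consulted or how their values are combined, so this example assumes the simplest case: the value is copied from the single nearest passenger that has the attribute. `impute_nearest` and `toy_distance` are hypothetical names, and `toy_distance` is only a stand-in for HVDM so that the block runs on its own.

```python
def impute_nearest(rows, column, distance):
    """Return a copy of rows in which each missing value (None) in `column`
    is copied from the nearest row that has a known value there."""
    donors = [r for r in rows if r[column] is not None]
    filled = []
    for row in rows:
        if row[column] is None and donors:
            nearest = min(donors, key=lambda d: distance(row, d))
            # Tuples are immutable, so build a new row with the donor's value
            row = row[:column] + (nearest[column],) + row[column + 1:]
        filled.append(row)
    return filled


def toy_distance(x, y):
    """Stand-in metric: sum of absolute differences over known values."""
    return sum(abs(p - q) for p, q in zip(x, y)
               if p is not None and q is not None)


# Made-up numeric rows; the last one is missing its second attribute
passengers = [(1.0, 10.0), (5.0, 50.0), (1.5, None)]
completed = impute_nearest(passengers, 1, toy_distance)
```

Passing the distance as a parameter keeps the imputation independent of the metric, so an HVDM implementation can be swapped in without changing this function.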