First, the data was checked for missing values, and the features below were identified as having them.
Then the summary statistics for each available feature were examined.
A correlation heat map was then plotted for the numerical features, and no significant correlation was found between any pair of features.
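The exploratory steps above can be sketched roughly as follows. This is a minimal illustration using pandas, not the original analysis code: the toy DataFrame and its values are made up, and only the correlation matrix is computed (the heat map itself would be drawn from this matrix with a plotting library).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real training data.
df = pd.DataFrame({
    "amount_tsh": [0.0, 25.0, 50.0, 0.0],
    "population": [109, 280, 250, 58],
    "scheme_name": ["Roman", None, "Nyumba", None],
    "permit": [True, False, None, True],
})

# Count missing values per column and keep only the columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Summary statistics for every column, numeric or categorical.
summary = df.describe(include="all")

# Pairwise Pearson correlation between the numeric columns only;
# this is the matrix a correlation heat map would visualise.
corr = df.select_dtypes(include="number").corr()

print(missing)
print(corr)
```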
The scheme_name feature has a lot of missing values. These were filled using the mode value for each region.
The scheme_management feature has only 12 unique values, and its missing values were replaced using the mode value of scheme_management.
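A minimal sketch of the two imputation strategies described above, assuming pandas and a toy DataFrame with made-up values. The helper `first_mode` is a hypothetical name introduced here; when several values tie for the mode it simply takes the first one, which may differ from how ties were handled originally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; region and scheme values are invented.
df = pd.DataFrame({
    "region": ["Iringa", "Iringa", "Iringa", "Mara", "Mara", "Mara"],
    "scheme_name": ["A", "A", None, "B", None, "B"],
    "scheme_management": ["VWC", None, "VWC", "WUG", "VWC", None],
})

def first_mode(series):
    """Return the most frequent non-null value, or NaN if there is none."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Per-region mode: compute one mode per region, then map it back onto rows.
region_modes = df.groupby("region")["scheme_name"].agg(first_mode)
df["scheme_name"] = df["scheme_name"].fillna(df["region"].map(region_modes))

# Global mode: a single value fills every gap in the column.
global_mode = first_mode(df["scheme_management"])
df["scheme_management"] = df["scheme_management"].fillna(global_mode)

print(df)
```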
Missing values in the public_meeting and permit features were replaced using the mode value of the respective field. For the funder and installer features, only the top 10 values were kept as separate categories. All other values were grouped into a single category named “other”.
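The top-N grouping for high-cardinality features such as funder and installer could look something like this. It is a sketch under assumptions: `keep_top_categories` is a hypothetical helper, the sample values are made up, and `n=2` is used only to keep the example small (the write-up uses 10). Note that in this version missing values also end up in the "other" category.

```python
import pandas as pd

def keep_top_categories(series, n=10, other_label="other"):
    """Keep the n most frequent values; replace everything else with other_label."""
    top = series.value_counts().nlargest(n).index
    # where() keeps values for which the condition holds and replaces the rest.
    return series.where(series.isin(top), other_label)

# Hypothetical sample of a funder-like column.
funder = pd.Series(
    ["Danida", "Danida", "Hesawa", "Hesawa", "Hesawa", "Rwssp", "Unicef"]
)
grouped = keep_top_categories(funder, n=2)
print(grouped.tolist())
```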
Dropped the subvillage feature.
The longitude and latitude were then plotted to observe the geographical distribution of the data. This revealed 1812 rows with longitude and latitude of (0, 0), which are clearly outliers caused by false values.
The data has an additional field, region_code, which gives some idea of geographical location. The median longitude and latitude were calculated for each region, and the respective values were used to correct the outliers. After correcting the outliers, the longitude and latitude distribution was as follows.
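One way to implement the coordinate correction described above is sketched below with pandas. The toy coordinates are invented, and one detail is an assumption rather than something stated in the write-up: the (0, 0) rows are blanked out before the medians are computed so that they do not pull the regional medians towards zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two regions, each with one (0, 0) placeholder row.
df = pd.DataFrame({
    "region_code": [11, 11, 11, 20, 20, 20],
    "longitude": [34.9, 35.1, 0.0, 33.0, 0.0, 33.4],
    "latitude": [-9.8, -9.6, 0.0, -2.0, 0.0, -2.4],
})
cols = ["longitude", "latitude"]

# Flag rows whose coordinates are the (0, 0) placeholder.
bad = (df["longitude"] == 0) & (df["latitude"] == 0)

# Blank out the flagged rows (assumption: they are excluded from the medians).
coords = df[cols].copy()
coords.loc[bad, cols] = np.nan

# Median per region, broadcast back to one value per row; NaNs are skipped.
region_median = coords.groupby(df["region_code"]).transform("median")

# Overwrite only the flagged rows with their region's median coordinates.
df.loc[bad, cols] = region_median.loc[bad, cols]

print(df)
```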
Other dropped columns: [date_recorded, gps_height, wpt_name, num_private, subvillage, lga, ward, recorded_by, extraction_type, management, management_group, payment, quality_group, quantity, source_type, waterpoint_type_group, region]
K-fold cross-validation was used to evaluate the models. Models tested:
- Random Forest
- XGBoost classifier
- SVM
Random forest achieved the best cross-validation scores, so it was used for the final prediction after training on the whole dataset. This model achieved a score of 0.8124 on the DrivenData test dataset.
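The model comparison above can be illustrated with scikit-learn as follows. This is a sketch, not the original pipeline: the data is synthetic (generated with `make_classification`), the number of folds and all hyperparameters are assumptions, and the XGBoost classifier is left out so that the example depends only on scikit-learn. An `xgboost.XGBClassifier` could be added to the `models` dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features and labels.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Candidate models; hyperparameters here are illustrative assumptions.
models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": SVC(),
}

# Stratified 5-fold split (fold count is an assumption) reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

# Pick the best-scoring model and refit it on the whole dataset.
best_name = max(scores, key=scores.get)
final_model = models[best_name].fit(X, y)

print(scores, best_name)
```

Reusing the same `cv` object for every model means each candidate is scored on identical folds, which makes the comparison fairer than drawing a fresh random split per model.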