Data cleaning and analysis of the birth and death records for the Open Data Hackathon organized by PMC Open Data Portal.
Problem Statement: opendatapune/birth_and_death_records
- Combine various years datasets into a clean machine-readable CSV
- Derive date-wise, month-wise, year-wise totals by gender and type (birth/death).
- See monthly trends and year-on-year trends
- Locate anomalies in the data
- Compare with historical events, passing of laws like legislation on early sex determination, weather events
- Compare between the genders
- Create time series based visualizations, similar to stock market/currency charts.
-
Install and setup Anaconda Find an easy installation and setup guide using this link Make sure you install Anaconda for python 3.6 or above
-
Install the required packages: Open Anaconda prompt and run these commands
conda install pandas, numpy, matplotlib, plotly, psutil, cufflinks, plotly-orca
Verify Installation by running
import pandas as pd
pd.__version__
- Download the data sets:
Dataset for birth_and_death_records
Optional: Download the whole repository
We have taken the input dataset and converted it into pure text CSV format for easy cleaning process.
"xlsx_to_csv.py" is a script that will automatically convert all the original datasets stored in folder datasets to text CSV format and store them in "converted/datasets".
Steps to convert the datasets:
- Open Anaconda prompt and go to folder "opendatapune-birth_and_death_records/converted_datasets".
- Run the commands stored in "conversion_commands.txt" for the required conversion.
Example:
python xlsx_to_csv.py ../datasets/"List of Birth And Death rate for the year 2010.xlsx" 2010 birth_and_death_2010.csv
After running all the commands u will see that all the xlsx files are converted to csv files and a essage "Successfully completed" shown after each conversion
The csv files are then cleaned using "Data_Cleaning_Preparation.ipynb"
Steps in Data Cleaning:
- Corrected miss interpreted dates to dd/mm/yyyy
- Split the "Date of Event" and replaced it with "Day", "Month" and "Year"
- Used One-Hot-encoding technique to store "Type" and "Gender" attributed
"Type" -> "Birth" = 0 / "Death" = 1
"Gender" -> "Male" = 0 / "Female" = 1 - Created Cleaned datasets from year 2010 to 2018 and stored them in folder "converted_datasets"
- Combined data from years 2010 to 2018 into single dataset "combined_cleaned_birth_and_death_2010-2018.csv"
How to run:
- Open Anaconda prompt
- Go to folder "opendatapune-birth_and_death_records"
- Type following command to open jupyter notebook
jupyter notebook
- Open "Data_Cleaning_Preparation.ipnyb" and run each row one by one to clean
Processed the combined raw dataset to get the following information:
-
Day-wise data of number of male,female and total births and deaths
- Attributes: Day,Month,Year,male_births,female_births,total_births,male_deaths,female_deaths,total_deaths
- Stored in file "Day_wise_information.csv" in folder "Insights"
-
Month-wise data of number of male, female and total births and deaths
- Attributes: Month,Year,male_births,female_births,total_births,male_deaths,female_deaths,total_deaths
- Stored in file "Month_wise_information.csv" in folder "Insights"
-
Year-wise data of number of male,female and total births and deaths
- Attributes: Year,male_births,female_births,total_births,male_deaths,female_deaths,total_deaths
- Stored in file "Year_wise_information.csv" in folder "Insights"
-
Male, female and total Population Growth per year for
- Atributes: Year,male_births,female_births,total_births,male_deaths,female_deaths,total_deaths
- Stored in file "Recording_differences_between_datasets.csv" in folder "Insights"
How to run
- "Data_Cleaning_Preparation.ipnyb" itself does the Data Preparation part.
- Note: Clicking on the plot name takes you to the chart which is hosted online on plotly.*
Visualized the information, extracted from the data, through the following plots:
- Daily Birth and Death Trend
- Monthly Birth and Death Trend
- Yearly Birth and Death Trend
- Trend for Male Births
- Trend for Female Births
- Trend for Total Births
- Trend for Male Deaths
- Trend for Female Deaths
- Trend for Total Deaths
- Yearly population growth
- Yearly Increase/Decrease in Birth Rate
- Average per Month
- Swine-flu vs road accidents deaths
- Seasonal Decompositional analysis for Birth Trend
- Seasonal Decompositional analysis for Death Trend
- Future Birth Prediction
How to run:
- Open Plots "Data_Visualization.ipynb" and run all code blocks
Where to find plots:
- Plots can be visualized in the "Data_Visualization.ipynb" file
- Plots are stored in "plots" folder in image format as well as HTML format.
- Plots can be viewed online through the links provided next to the plot names.
Note: Make use of these interactive features of plots for Visualization.
- Legend(top right) can be used to show or hide a particular trace/bar on the map.
- Buttons(top left) can be used to select a predefined configuration of traces/bars.
- Hovering over the plot shows the values corresponding to the x-axis for all traces/bars.
- Range slider(below x-axis) can be used to choose the range to display on the plot.
- Pinch to zoom in/out option is available
- Insights on Daily information
- Insights on Montly information
- Birth and Death Records not registered in "Detected cases and deaths due to diseases" dataset
- After comparing our data with the data provided in "Detected cases and deaths due to diseases" dataset available at link we found that the later dataset missed many birth and death records
- The below table shows the number of birth and death records missing in the later dataset.
. - Stored in file "Recording_differences_between_datasets.csv" in folder "Insights"
- After comparing our data with the data provided in "Detected cases and deaths due to diseases" dataset available at link we found that the later dataset missed many birth and death records
How to run:
- Open "Analysis_and_Insights.ipynb" file and run all code cells.
We used the monthly data from the year 2010 to 2018 to predict the number of births in the year 2019.
ARIMA model is used for future prediction which is a state-of-art algorithm in Time-Series Forecasting.
How to run:
- Open "Forecasting.ipynb" and run all the code cells
- The results can be seen in the file "Future Birth Prediction.jpeg" or "2019_births_prediction.html" saved in "plots" folder.
- A drastic drop in recorded Birth records was seen in year 2014. Our insights show a strong correlation with the implementation of the Two-child policy in Maharashtra, in the year 2014.
- In Maharashtra, Pune has the second highest Road fatalities. Our Analysis shows Road accidents cause 1.64% of total deaths in Pune, compared to 0.18% of that of deadly swine-flu.
- October shows Highest Average Total Births of 2643 followed by April with 2508. Data shows strong relation with the fact that, in Maharashtra, most of the *marriages happen in the month of January and September.
- Comparing the data found in "Detected cases and deaths due to diseases" dataset(available on PMC Open Data Store) and our datasets shows dramatic differences in the number of birth and death records. The exact difference in a number of recorded births and deaths from 2 datasets can be seen in file "Recording_differences_between_datasets.csv".
- Future Forecasting predicts the Number of Total Births in year 2019 will be 48,824
- The number of Birth records on 28 and 29 Apr 2018 were suspiciously low than mean (average) daily births.
- The number of Death records between 30 Jun to 30 Jul 2017 were suspiciously low. News shows no evidence of hospital shutdowns or strikes.
- Prior half of Aug 2016 shows high number deaths indicating some kind of epidemic, but analysis of news from July and Aug 2016 shows no evidence of an epidemic in Pune city. The reason for such high death records is not yet clear.
- Oct 2013 recorded highest births of 5640 in any month in past decade. The data looks anomalous, but after comparing the data to the "Detected cases and deaths due to diseases" dataset, it was found that the recorded data was true.
- The number of Birth records in year 2010 were lowest indicating poor birth data recording that year.
- Year 2016 recorded the highest number of deaths 31,374 in any year. The year 2016 is the only year that does not follow the normal trend in death records of 25k-30k.
- Highest #Birth records of 298 was recorded on 10 Nov, 2011, in past decade.
- Female to Male Population growth was 123.5% shows increase in female population than male population in Pune.
- Female to Male Birth Ratio was 92.2% indicating almost same male vs female births.
- Female to Male Death Ratio was 69.73% indicating lot more male deaths than female deaths.
- Highest population growth of 30,928 was recorded in year 2013 and Lowest population growth of 24,255 was recorded in year 2016.
- Aug 2016 recorded highest deaths of 3217 in any month in past decade.
- Jul 2017 recorded lowest deaths of 1177 in any month in past decade.
- Drastic Decrease in Birth Rate was seen in years 2014 (-14.25%) and 2016 (-8.18%) from espective previous years.
- In Maharashtra state, Pune city has Highest Road Fatalities after Mumbai City. Every year, over 400 people lose their lives in road accidents. Stats show that only 16 of every 100 people are seen wearing helmets in Pune. While Departments of Insect Control and Epidemic Disease Control of Pune Municipal Corporation are doing a good job to keep the Deaths caused due to diseases as low as 0.2% of total deaths, Pune road fatalities cause more than 1.6% deaths of all deaths.
- Pune Road safety authorities must do something to bring down the road fatalities to decrease the deaths in Pune city.
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
- Sushant Gundla - Github Profile
- Pune Open Data portal for providing with data