United States Census Bureau Data Analysis using Apache Spark
Project Overview
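This project analyzes the United States Census Bureau's 2017 Basic Monthly Current Population Survey (CPS) data using Apache Spark with Python. The goal is to extract selected fields from the raw file with the help of the data dictionary, then answer a fixed set of questions about family income, geography, race, and telephone access.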
Dataset
- Source: United States Census Bureau's 2017 Basic Monthly CPS
- File: December DOS/Windows zip file, extracted to a .dat file
- Data Dictionary: used to locate each field in the raw records and decode its values
Extracted Data
The following data was extracted:
- Full household identifier
- Time of interview in YYYY/MMM format
- Final outcome of the survey
- Type of housing unit
- Household type
- Whether the apartment/household has a telephone
- Whether the apartment/household can access a telephone elsewhere
- Whether a telephone interview is acceptable to the respondent
- Type of interview
- Family income range
- Geographical division/location
- Race
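Extracting these fields from the .dat file comes down to slicing fixed column ranges out of each record. The snippet below is a minimal sketch of that idea in plain Python so it runs without a Spark session. The field names, the column positions in FIELD_POSITIONS, and the sample record are illustrative assumptions, not values taken from this repository; real positions must come from the data dictionary.

```python
from typing import Dict, Tuple

# Hypothetical 1-based, inclusive (start, end) column positions.
# Look up the real positions in the CPS data dictionary.
FIELD_POSITIONS: Dict[str, Tuple[int, int]] = {
    "household_id": (1, 15),
    "month": (16, 17),
    "year": (18, 21),
}

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def parse_record(line: str) -> Dict[str, str]:
    """Slice each named field out of one fixed-width record."""
    return {
        name: line[start - 1:end].strip()
        for name, (start, end) in FIELD_POSITIONS.items()
    }


def interview_time(record: Dict[str, str]) -> str:
    """Combine year and month into the YYYY/MMM form, e.g. 2017/Dec."""
    return f"{record['year']}/{MONTH_ABBR[int(record['month']) - 1]}"


# Made-up record: 15-character id, 2-digit month, 4-digit year.
sample_line = "000004795110719" + "12" + "2017"
record = parse_record(sample_line)
print(record["household_id"], interview_time(record))
```

In Spark the same slicing can be expressed with pyspark.sql.functions.substring, which also takes a 1-based start position, applied to a single text column holding the raw line.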
Analysis Questions
The following questions were answered:
- Count of respondents per family income range
- Count of respondents per geographical division/location and race (top 10)
- Number of respondents who have no telephone at home but can access one elsewhere, and for whom a telephone interview is acceptable
- Number of respondents who can access a telephone but for whom a telephone interview is not acceptable
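The four questions above reduce to two grouped counts and two filtered counts. The sketch below shows that logic on a handful of made-up rows in plain Python, so the predicates can be checked without a Spark session. The field names and the decoded values ("Yes", "No", "Not applicable") are assumptions for illustration and may not match the notebook; in Spark, the grouped counts would typically use groupBy(...).count() and the filtered counts filter(...).count().

```python
from collections import Counter

# Made-up decoded rows; names and values are illustrative only.
rows = [
    {"income": "50,000 to 59,999", "division": "Pacific", "race": "White only",
     "has_phone": "Yes", "phone_elsewhere": "Not applicable", "phone_ok": "Yes"},
    {"income": "50,000 to 59,999", "division": "Pacific", "race": "White only",
     "has_phone": "No", "phone_elsewhere": "Yes", "phone_ok": "Yes"},
    {"income": "Less than 5,000", "division": "Mountain", "race": "Black only",
     "has_phone": "No", "phone_elsewhere": "Yes", "phone_ok": "No"},
    {"income": "Less than 5,000", "division": "Pacific", "race": "Asian only",
     "has_phone": "Yes", "phone_elsewhere": "Not applicable", "phone_ok": "No"},
    {"income": "50,000 to 59,999", "division": "Mountain", "race": "Black only",
     "has_phone": "No", "phone_elsewhere": "No", "phone_ok": "No"},
]

# Question 1: respondents per family income range.
income_counts = Counter(r["income"] for r in rows)

# Question 2: top 10 (division, race) pairs by respondent count.
top_division_race = Counter(
    (r["division"], r["race"]) for r in rows
).most_common(10)

# Question 3: no phone at home, phone available elsewhere, phone interview accepted.
no_phone_but_reachable = sum(
    1 for r in rows
    if r["has_phone"] == "No"
    and r["phone_elsewhere"] == "Yes"
    and r["phone_ok"] == "Yes"
)

# Question 4: some phone access (home or elsewhere), phone interview not accepted.
access_but_declined = sum(
    1 for r in rows
    if (r["has_phone"] == "Yes" or r["phone_elsewhere"] == "Yes")
    and r["phone_ok"] == "No"
)

print(income_counts)
print(top_division_race)
print(no_phone_but_reachable, access_but_declined)
```

Question 4 is read here as "telephone at home or elsewhere"; if the analysis only considers telephones at home, the first half of that predicate would need to be narrowed.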
Code
- The code is written in Python using Apache Spark and is available in this repository.
- The analysis lives in the Jupyter Notebook main.ipynb.
- schema.py holds the schema information and the mappings used to decode data elements.
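Decoding a data element usually means translating a numeric code from the raw record into a readable label. The sketch below shows one way such a mapping could look. The names DIVISION_LABELS and decode are hypothetical and are not taken from schema.py, and the codes are assumed to follow the standard Census division numbering, which should be confirmed against the data dictionary.

```python
from typing import Dict, Optional

# Assumed code-to-label mapping for the geographical division field.
DIVISION_LABELS: Dict[int, str] = {
    1: "New England",
    2: "Middle Atlantic",
    3: "East North Central",
    4: "West North Central",
    5: "South Atlantic",
    6: "East South Central",
    7: "West South Central",
    8: "Mountain",
    9: "Pacific",
}


def decode(code: Optional[str], labels: Dict[int, str],
           default: str = "Unknown") -> str:
    """Map a raw code string to its label, falling back to a default
    when the code is blank, non-numeric, or not in the mapping."""
    try:
        return labels.get(int(code), default)
    except (TypeError, ValueError):
        return default


print(decode("9", DIVISION_LABELS))
print(decode("", DIVISION_LABELS))
```

Keeping the mappings as plain dictionaries makes them easy to reuse, either inside a Spark UDF or as a small lookup DataFrame joined onto the extracted data.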
Requirements
- Apache Spark
- Python 3.8
- pandas
Installation
- Clone this repository
- Install Apache Spark
- From the repository root, install the required Python libraries using pip install -r requirements.txt