Skip to content

Latest commit

 

History

History
37 lines (35 loc) · 6.71 KB

README.md

File metadata and controls

37 lines (35 loc) · 6.71 KB

Data Science and Machine Learning Capstone Project

Objective: To analyze housing complaints in New York City and develop a predictive model to forecast future complaints.

Introduction and Objectives

The capstone project for the Data Science and Machine Learning course is designed to apply and showcase the comprehensive skills and knowledge acquired throughout the program. This project explores the realm of housing complaints in New York City, focusing on the Department of Housing Preservation and Development's dataset. The primary objective is to identify patterns, relationships, and predictive factors concerning housing complaints, ultimately providing actionable insights for city departments and stakeholders.

Data Description and Sources

The datasets utilized in this project were sourced from the New York City Open Data portal, specifically targeting the Department of Housing Preservation and Development's complaints records. These datasets encompass a wide range of variables, including complaint types, dates, locations (boroughs, ZIP codes, and street names), and specific characteristics of the housing units involved. The data cleaning and preprocessing steps were meticulously documented, ensuring the analysis was conducted on reliable and relevant data.

Methodologies Applied

The project employed a variety of data science techniques, tailored to address each specific research question: Geographical Analysis: For identifying areas with high complaint volumes, spatial analysis techniques were applied, including mapping and ZIP code-level aggregation. Relationship Analysis: To explore the connection between complaint types and housing characteristics, statistical tests and exploratory data analysis methods were used. Predictive Modeling: For the development of a predictive model, machine learning algorithms were evaluated, including regression models and decision trees, with an emphasis on model selection, training, and validation processes.

Key Findings and Results

•	Geographical Focus: The analysis revealed certain boroughs and ZIP codes with disproportionately high numbers of complaints, suggesting targeted intervention could be more effective.
•	Complaint Characteristics: A clear relationship was identified between specific complaint types and housing characteristics, highlighting potential areas for regulatory and inspection focus.
•	Predictive Model: The project successfully developed a predictive model with reasonable accuracy, capable of forecasting future complaint occurrences based on historical data and housing attributes. This model could serve as a tool for proactive measures in housing preservation efforts.

Conclusions and Future Work

The project demonstrated the power of data science in addressing complex urban issues, providing a data-driven basis for decision-making in housing preservation and development. The findings suggest that targeted, informed interventions can potentially lead to significant improvements in housing conditions. For future work, it is recommended to integrate more granular data, explore additional predictive factors, and deploy the model in a real-world testing environment to monitor its effectiveness and adjust as needed. This technical report outlines the comprehensive approach taken in the capstone project, from data acquisition and cleaning to in-depth analysis and modeling. It showcases the application of data science methodologies to real-world problems, contributing valuable insights and tools for urban development and housing quality improvement.

Libraries and Methods Utilized

Throughout the capstone project, a wide array of libraries and methods were employed to facilitate data analysis, visualization, and modeling. Below is a detailed breakdown of the key libraries and the methodologies applied within each stage of the project: Libraries Pandas and NumPy: These foundational Python libraries for data manipulation and numerical computations were extensively used for data cleaning, preprocessing, and analysis tasks. Pandas facilitated the handling of datasets through dataframes, enabling efficient data operations, while NumPy supported complex numerical analyses. Matplotlib and Seaborn: For data visualization, Matplotlib provided a comprehensive suite of plotting tools, allowing for the creation of a wide range of static, interactive, and animated figures. Seaborn, built on top of Matplotlib, enabled the creation of more aesthetically pleasing and complex visualizations with simpler code, particularly for statistical graphics. Scikit-learn: This machine learning library was pivotal for the predictive modeling aspect of the project. It was utilized for preprocessing data, splitting datasets into training and test sets, model selection, training, and evaluation. Algorithms from scikit-learn, such as regression models and decision trees, were key components of the predictive analysis. Geopandas and Folium: For geographical data processing and mapping, Geopandas extended the functionalities of Pandas to allow spatial operations on geometric types. Folium, on the other hand, was used for generating interactive maps to visualize geographical distributions of complaints and insights derived from spatial analysis. Methods

Exploratory Data Analysis (EDA):

Initial data exploration involved summarizing the main characteristics of the datasets, using methods like .describe(), .info(), and visualization techniques such as histograms, box plots, and scatter plots to understand distributions, correlations, and potential outliers.

Data Cleaning and Preprocessing:

Techniques included handling missing values, outlier detection and removal, and feature engineering to prepare the datasets for analysis. Methods such as .fillna(), .drop(), and encoding categorical variables were used.

Statistical Analysis:

To explore relationships between complaint types and housing characteristics, statistical tests like chi-square tests for independence were applied. This helped in identifying significant associations between variables.

Machine Learning Modeling:

The project adopted a structured approach to machine learning, starting with feature selection, followed by model training using algorithms like linear regression and decision trees. Model evaluation was performed using metrics such as accuracy, precision, recall, and F1-score to assess the performance of the predictive models. The combination of these libraries and methods formed the backbone of the project, enabling a rigorous analysis of the housing complaints data and the development of a predictive model. This methodological framework not only facilitated the achievement of the project's objectives but also illustrated the practical application of data science techniques in addressing real-world challenges.