Design and implement a data warehouse to manage automobile accident cases across all 49 states in the US, using a star schema and Snowflake for the data warehouse architecture.
- Data Source: This project uses two datasets from Kaggle: **US Accidents (2016 - 2023)** and **Traffic Accidents and Vehicles**.
  - **US Accidents (2016 - 2023)**: a countrywide car accident dataset covering 49 states of the USA; the accident data were collected from February 2016 to March 2023.
  - **Traffic Accidents and Vehicles**: each line in the file represents the involvement of a unique vehicle in a unique traffic accident, with various vehicle and passenger properties as columns.
- Extract Data: Data is extracted from `csv` files and ingested into the `bronze` folder of the `MinIO` data lake using `Python` and `Airflow`.
- Transform Data: Data is retrieved from `MinIO`'s `bronze` directory using `Spark` and `FastAPI` for transformation and cleaning, and the output is loaded into `MinIO`'s `silver` directory.
- Load Data: Once the data has been cleaned, it is loaded into the `Staging` schema of the `Snowflake` data warehouse using `Python` and `Airflow`.
- Warehouse: From the `staging` schema in `Snowflake`, the data warehouse is built and deployed with a `Star Schema` architecture by creating `dimension` and `fact` tables; `dbt` is used to transform and test the data.
- Serving: Analyze the data to improve road safety: identify high-risk accident areas so preventative measures can be implemented, and identify factors that contribute to accidents (weather, road conditions, human error). Visualizations and reports are built with `Power BI`.
- Package and Orchestration: Components are packaged using `Docker` and orchestrated with `Apache Airflow`.
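In the project the transformation step runs on Spark, but the kind of cleaning applied between the `bronze` and `silver` layers can be illustrated with a minimal plain-Python sketch. The field names `Severity`, `Start_Time`, and `State` come from the US Accidents dataset; the specific cleaning rules below (drop rows with a missing severity or unparseable timestamp, normalize state codes) are illustrative assumptions, not the project's exact logic:

```python
from __future__ import annotations

from datetime import datetime


def clean_record(raw: dict) -> dict | None:
    """Apply simple cleaning rules to one accident record.

    Returns the cleaned record, or None if the row should be dropped.
    """
    # Drop rows without a severity rating or a start time.
    if not raw.get("Severity") or not raw.get("Start_Time"):
        return None
    try:
        start = datetime.strptime(raw["Start_Time"], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Unparseable timestamp -> drop the row.
        return None
    return {
        "severity": int(raw["Severity"]),
        "start_time": start.isoformat(),
        "state": raw.get("State", "").strip().upper(),
    }


records = [
    {"Severity": "3", "Start_Time": "2019-06-01 08:15:00", "State": "ca"},
    {"Severity": "", "Start_Time": "2019-06-01 09:00:00", "State": "TX"},
]
cleaned = [r for r in (clean_record(x) for x in records) if r is not None]
```

In the pipeline itself the same rules would be expressed as Spark DataFrame operations so they scale to the full dataset.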
Apache Airflow
Apache Spark
Docker
dbt
Snowflake
MinIO
FastAPI
Power BI
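As a sketch of how a fact table in the `Star Schema` might be modeled with `dbt`, the snippet below joins a staging model to date, location, and weather dimensions. All model and column names here (`stg_accidents`, `dim_date`, `dim_location`, `dim_weather`, and their keys) are hypothetical examples, not the project's actual models:

```sql
-- models/marts/fct_accidents.sql (hypothetical model name)
-- Builds the central fact table of the star schema from the staging layer,
-- linking each accident to its date, location, and weather dimensions.
select
    stg.accident_id,
    dd.date_key,
    dl.location_key,
    dw.weather_key,
    stg.severity
from {{ ref('stg_accidents') }}   as stg
join {{ ref('dim_date') }}        as dd on dd.date_day  = cast(stg.start_time as date)
join {{ ref('dim_location') }}    as dl on dl.state     = stg.state
join {{ ref('dim_weather') }}     as dw on dw.condition = stg.weather_condition
```

With the dimensions in place, `dbt test` can enforce uniqueness and referential integrity on the surrogate keys as part of the "check data" step described above.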