Formula1-Racing-Cloud-Data-Platform

An end-to-end Formula 1 project that uses Azure Databricks to build the ETL pipeline, PySpark and Spark SQL for data analysis, and Power BI for data visualization.

Overview

Formula One is the highest class of international racing for open-wheel single-seater formula racing cars. A season takes place once a year, and each race is held over a weekend (Friday to Sunday) at its own circuit. Ten teams/constructors take part, with two drivers per team. Saturday is the qualifying round that sets the grid for Sunday's race, which runs for 50-70 laps. Pit stops allow teams to change tyres or repair damage. Race results feed the driver standings and the constructor standings.

Source Data Files

We use open-source data from the Ergast Developer API, covering seasons from 1950 through 2022.
File Name      File Type
Circuits       CSV
Races          CSV
Constructors   Single Line JSON
Drivers        Single Line Nested JSON
Results        Single Line JSON
PitStops       Multi Line JSON
LapTimes       Split CSV Files
Qualifying     Split Multi Line JSON Files
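
The file type determines how each source is read in Databricks: single-line JSON works with the default reader, multi-line JSON needs the multiLine option, and split files are read folder-wise. A rough sketch (the mount point and file names are assumptions; spark is the SparkSession a Databricks notebook provides):

  # Illustrative PySpark read patterns for the source file types above.
  # The mount point /mnt/formula1/raw and the file names are assumptions.
  circuits_df = spark.read.option("header", True).csv("/mnt/formula1/raw/circuits.csv")

  # Single-line JSON: one JSON object per line, the default reader handles it.
  results_df = spark.read.json("/mnt/formula1/raw/results.json")

  # Multi-line JSON: each record spans several lines, so multiLine is required.
  pit_stops_df = spark.read.option("multiLine", True).json("/mnt/formula1/raw/pit_stops.json")

  # Split files: point the reader at the folder and every part file is read.
  lap_times_df = spark.read.csv("/mnt/formula1/raw/lap_times/")
  qualifying_df = spark.read.option("multiLine", True).json("/mnt/formula1/raw/qualifying/")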


Data Model (http://ergast.com/images/ergast_db.png)

Azure Architecture

Security Set Up

set up folder

All connections in the configuration files are created and handled through Azure Key Vault, which provides a secure connection between the data storage account and the application registered for the project.
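
In Databricks this typically surfaces as a Key Vault-backed secret scope, so notebooks fetch the service principal credentials at run time instead of hard-coding them. A minimal sketch, assuming a secret scope named formula1-scope, illustrative secret names, and a storage account called formula1dl (all of these names are assumptions):

  # Fetch service principal credentials from the Key Vault-backed secret scope.
  client_id     = dbutils.secrets.get(scope="formula1-scope", key="databricks-app-client-id")
  tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="databricks-app-tenant-id")
  client_secret = dbutils.secrets.get(scope="formula1-scope", key="databricks-app-client-secret")

  # Configure OAuth access to the ADLS Gen2 account (account name assumed).
  spark.conf.set("fs.azure.account.auth.type.formula1dl.dfs.core.windows.net", "OAuth")
  spark.conf.set("fs.azure.account.oauth.provider.type.formula1dl.dfs.core.windows.net",
                 "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  spark.conf.set("fs.azure.account.oauth2.client.id.formula1dl.dfs.core.windows.net", client_id)
  spark.conf.set("fs.azure.account.oauth2.client.secret.formula1dl.dfs.core.windows.net", client_secret)
  spark.conf.set("fs.azure.account.oauth2.client.endpoint.formula1dl.dfs.core.windows.net",
                 f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")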

Data Ingestion Requirements

ingestion folder

  • Ingest All 8 Files into the data lake
  • Ingested data must have the schema applied
  • Ingested data must have audit columns
  • Ingested data must be stored in a columnar format
  • Must be able to analyze the ingested data
  • Ingestion logic must be able to handle incremental load
Incremental load: the initial data is loaded in full; on subsequent runs the new data is compared against what is already stored and only the matching records are overwritten, which is kept efficient by partitioning the data. In this project all the raw data from the different source files is read, transformed into Parquet files, and stored in the ADLS Gen2 data lake.
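
As an illustration, the explicit schema, audit columns, and partition-level incremental overwrite might look like this in PySpark (the schema, column names, paths, and partition key are assumptions, and only a few columns are shown):

  from pyspark.sql.functions import current_timestamp, lit
  from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

  # Apply an explicit schema instead of relying on inference (abridged columns).
  results_schema = StructType([
      StructField("resultId", IntegerType(), False),
      StructField("raceId", IntegerType(), True),
      StructField("driverId", IntegerType(), True),
      StructField("constructorId", IntegerType(), True),
      StructField("position", IntegerType(), True),
      StructField("points", DoubleType(), True),
  ])

  results_df = (spark.read.schema(results_schema)
                .json("/mnt/formula1/raw/results.json")
                .withColumn("ingestion_date", current_timestamp())   # audit column
                .withColumn("data_source", lit("Ergast API")))       # audit column

  # Incremental load: with dynamic partition overwrite, only the race
  # partitions present in the new file are rewritten; older partitions stay.
  spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
  (results_df.write
   .mode("overwrite")
   .partitionBy("raceId")
   .parquet("/mnt/formula1/processed/results"))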

Data Transformation Requirements

transformation folder

  • Join the key information required for reporting to create a new table
  • Join the key information required for analysis to create a new table
  • Transformed tables must have audit columns
  • Must be able to analyze the transformed data via SQL
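
One way the join and audit-column steps could look in PySpark, building on the processed files from the ingestion step (all table, path, and column names are illustrative):

  from pyspark.sql.functions import current_timestamp

  races_df   = spark.read.parquet("/mnt/formula1/processed/races")
  drivers_df = spark.read.parquet("/mnt/formula1/processed/drivers")
  results_df = spark.read.parquet("/mnt/formula1/processed/results")

  # Join the key columns needed for reporting into one presentation table.
  race_results_df = (results_df
      .join(races_df, results_df.raceId == races_df.raceId)
      .join(drivers_df, results_df.driverId == drivers_df.driverId)
      .select(races_df.year, races_df.name.alias("race_name"),
              drivers_df.name.alias("driver_name"),
              results_df.position, results_df.points)
      .withColumn("created_date", current_timestamp()))   # audit column

  # Register the transformed table so it can be analysed with Spark SQL.
  spark.sql("CREATE DATABASE IF NOT EXISTS f1_presentation")
  race_results_df.write.mode("overwrite").format("parquet").saveAsTable("f1_presentation.race_results")
  spark.sql("SELECT race_name, driver_name, points FROM f1_presentation.race_results LIMIT 10").show()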

Reporting Requirements

  • Driver Standings
  • Constructor Standings
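
For example, the driver standings can be derived by aggregating the presentation table per season and ranking by points (a sketch against the assumed f1_presentation.race_results table from the transformation step):

  from pyspark.sql.functions import sum, count, when, col, desc, rank
  from pyspark.sql.window import Window

  race_results_df = spark.read.table("f1_presentation.race_results")

  # Total points and wins per driver per season.
  driver_standings_df = (race_results_df
      .groupBy("year", "driver_name")
      .agg(sum("points").alias("total_points"),
           count(when(col("position") == 1, True)).alias("wins")))

  # Rank drivers within each season: points first, wins as the tie-breaker.
  driver_rank_window = Window.partitionBy("year").orderBy(desc("total_points"), desc("wins"))
  final_df = driver_standings_df.withColumn("rank", rank().over(driver_rank_window))
  final_df.filter("year = 2020").orderBy("rank").show()

The constructor standings follow the same pattern, grouping by team instead of driver.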

Analysis Requirements

analysis folder

  • Dominant Drivers
  • Dominant Teams
  • Visualize the output
  • Create Databricks Dashboard
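
The dominance analysis can be written directly in Spark SQL on the same presentation table and visualised with the notebook's built-in charting; for example, ranking drivers by average points per race (column names are assumed and the race-count threshold is arbitrary):

  # Dominant drivers across all seasons: highest average points per race,
  # with a minimum race count so one-off results do not skew the ranking.
  dominant_drivers_df = spark.sql("""
      SELECT driver_name,
             COUNT(1)    AS total_races,
             SUM(points) AS total_points,
             AVG(points) AS avg_points
        FROM f1_presentation.race_results
       GROUP BY driver_name
      HAVING COUNT(1) >= 50
       ORDER BY avg_points DESC
  """)

  # display() is Databricks-specific and renders the result as a table or chart
  # that can be pinned to a Databricks dashboard.
  display(dominant_drivers_df)

The same query grouped by team name yields the dominant teams.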

Scheduling Requirements

I created an Azure Data Factory instance with pipelines that run this process weekly, using tumbling window triggers to schedule the jobs.

  • Scheduled to run every Sunday at 10 PM
  • Ability to monitor pipelines
  • Ability to re-run failed pipelines
  • Ability to set-up alerts on failures

Technologies/Tools Used:

  • PySpark
  • Spark SQL
  • Delta Lake
  • Azure Databricks
  • Azure Data Factory
  • Azure Data Lake Storage Gen2
  • Azure Key Vault
  • Power BI
