A Formula 1 project using Azure Databricks to build an ETL pipeline, PySpark/SQL for data analysis, and Power BI for data visualization.
Formula One is the highest class of international racing for open-wheel single-seater formula racing cars. A season is held once a year, and each race takes place over a weekend (Friday to Sunday) at an individual circuit. 10 teams/constructors participate, with 2 drivers per team. Saturday hosts the qualifying round that sets the grid for Sunday's race. Each race runs for 50-70 laps, with pit stops available to change tires or repair damage. Race results feed both the driver standings and the constructor standings.
We use open-source data from the Ergast Developer API website. Data is available from 1950 through 2022.

File Name | File Type |
---|---|
Circuits | CSV |
Races | CSV |
Constructors | Single Line JSON |
Drivers | Single Line Nested JSON |
Results | Single Line JSON |
PitStops | Multi Line JSON |
LapTimes | Split CSV Files |
Qualifying | Split Multi Line JSON Files |
Data Model (http://ergast.com/images/ergast_db.png)
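As a rough illustration of how these source formats are read in PySpark (paths, mount points, and the nested field names are assumptions, not the project's exact values):

```python
from pyspark.sql.functions import col

# Single-line JSON (one object per line) is Spark's default JSON mode.
constructors_df = spark.read.json("/mnt/formula1dl/raw/constructors.json")

# Multi-line JSON (an array spanning many lines) needs the multiLine option.
pit_stops_df = spark.read.option("multiLine", True).json("/mnt/formula1dl/raw/pit_stops.json")

# Nested JSON: flatten the nested name struct into top-level columns.
drivers_df = (spark.read.json("/mnt/formula1dl/raw/drivers.json")
              .withColumn("forename", col("name.forename"))
              .withColumn("surname", col("name.surname")))

# Split files: point the reader at the folder and Spark loads every part file.
lap_times_df = spark.read.csv("/mnt/formula1dl/raw/lap_times")
qualifying_df = spark.read.option("multiLine", True).json("/mnt/formula1dl/raw/qualifying")
```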
All connections defined in the configuration files are handled through Azure Key Vault, which secures the credentials used between the data lake storage and the registered application (service principal).
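A minimal sketch of this setup, assuming a Key Vault-backed Databricks secret scope named `formula1-scope` and a service principal with access to the storage account; all scope, key, container, and account names below are placeholders:

```python
# Read service principal credentials from the Key Vault-backed secret scope.
client_id     = dbutils.secrets.get(scope="formula1-scope", key="client-id")
tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

# OAuth configuration for ADLS Gen2 access via the service principal.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the raw container of the ADLS Gen2 account so notebooks can use /mnt paths.
dbutils.fs.mount(
    source="abfss://raw@formula1dl.dfs.core.windows.net/",
    mount_point="/mnt/formula1dl/raw",
    extra_configs=configs,
)
```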
- Ingest all 8 files into the data lake
- Ingested data must have the schema applied
- Ingested data must have audit columns
- Ingested data must be stored in a columnar format
- Must be able to analyze the ingested data
- Ingestion logic must be able to handle incremental loads (a PySpark sketch follows this list)
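A minimal PySpark sketch of these ingestion requirements, assuming a hypothetical mount point and simplified column names (the actual notebooks define one schema per file):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pyspark.sql.functions import current_timestamp, lit

# Requirement: apply an explicit schema instead of relying on inference.
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
    StructField("alt", IntegerType(), True),
    StructField("url", StringType(), True),
])

circuits_df = (spark.read
               .option("header", True)
               .schema(circuits_schema)
               .csv("/mnt/formula1dl/raw/circuits.csv"))

# Requirement: audit columns so every row can be traced to a run and a source.
circuits_final = (circuits_df
                  .withColumn("ingestion_date", current_timestamp())
                  .withColumn("data_source", lit("Ergast API")))

# Requirement: columnar storage. Delta also enables the incremental patterns
# (MERGE / partition overwrite) used for race-by-race files such as results.
(circuits_final.write
 .mode("overwrite")
 .format("delta")
 .save("/mnt/formula1dl/processed/circuits"))
```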
- Join the key information required for reporting to create a new table
- Join the key information required for analysis to create a new table (sketched after this list)
- Transformed tables must have audit columns
- Must be able to analyze the transformed data via SQL
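A sketch of the reporting/analysis join, assuming the processed table locations and column names shown below (the project's real names may differ); the `created_date` column serves as the audit column:

```python
from pyspark.sql.functions import current_timestamp

# Load the processed (ingested) tables; paths are placeholders.
races        = spark.read.format("delta").load("/mnt/formula1dl/processed/races")
circuits     = spark.read.format("delta").load("/mnt/formula1dl/processed/circuits")
drivers      = spark.read.format("delta").load("/mnt/formula1dl/processed/drivers")
constructors = spark.read.format("delta").load("/mnt/formula1dl/processed/constructors")
results      = spark.read.format("delta").load("/mnt/formula1dl/processed/results")

# Join the key reporting information into a single presentation table.
race_results = (results
    .join(races, results["race_id"] == races["race_id"])
    .join(circuits, races["circuit_id"] == circuits["circuit_id"])
    .join(drivers, results["driver_id"] == drivers["driver_id"])
    .join(constructors, results["constructor_id"] == constructors["constructor_id"])
    .select(races["race_year"],
            races["name"].alias("race_name"),
            circuits["location"].alias("circuit_location"),
            drivers["name"].alias("driver_name"),
            constructors["name"].alias("team"),
            results["position"],
            results["points"])
    .withColumn("created_date", current_timestamp()))  # audit column

(race_results.write
 .mode("overwrite")
 .format("delta")
 .save("/mnt/formula1dl/presentation/race_results"))
```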
- Driver Standings
- Constructor Standings
- Dominant Drivers
- Dominant Teams
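A hedged example of the SQL layer for one of these outputs, driver standings per season, built on the presentation table sketched above (table and column names are assumptions); constructor standings and the dominance views follow the same aggregation pattern over team or over all seasons:

```python
# Expose the presentation table to Spark SQL via a temporary view.
spark.read.format("delta").load("/mnt/formula1dl/presentation/race_results") \
     .createOrReplaceTempView("race_results")

driver_standings = spark.sql("""
    SELECT race_year,
           driver_name,
           team,
           SUM(points)                              AS total_points,
           COUNT(CASE WHEN position = 1 THEN 1 END) AS wins
    FROM race_results
    GROUP BY race_year, driver_name, team
    ORDER BY race_year DESC, total_points DESC
""")
driver_standings.show(20)
```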
- Visualize the output
- Create Databricks Dashboard
I have created an Azure Data Factory service with pipelines that run this process weekly, and added tumbling window triggers to schedule the jobs.
- Scheduled to run every Sunday at 10 PM
- Ability to monitor pipelines
- Ability to re-run failed pipelines
- Ability to set up alerts on failures
- PySpark
- Spark SQL
- Delta Lake
- Azure Databricks
- Azure Data Factory
- Azure Data Lake Storage Gen2
- Azure Key Vault
- Power BI