This project was created as part of the 2024 Data Engineering Zoomcamp.
As someone who depends on the Berlin Airport and has flown a decent number of times over the last few years, I've witnessed how chaotic flying in Europe can be, especially at this airport. The airport has been known for delays and issues in both its development and its day-to-day operations. As a result, I decided to create a project that helps me visualize the airport's arrivals and departures and track delays: average delays, delays by airline, and other interesting statistics such as arrival and departure destinations.
If this ELT orchestration project were left to run long enough, it would collect enough data to visualize the airport's performance over the long run and identify the specific patterns that drive its timeliness. For this project, given the budgetary, data-availability, and time constraints, I've decided to cover only one day to showcase what the final output would look like.
Note: Due to the rate limiter of the API, the DAG was designed to collect only one day of data (yesterday's) per run. With cloud costs running high and time limited, only one day of data is shown for demonstration purposes: the dashboard contains data for 2024-02-28 only. Also, the screenshot displays the sum rather than the average of the delay values in the single-number tiles on the right; this has since been fixed.
This is an end-to-end project, from API to visual dashboard.
- Terraform is used as "infrastructure as code"
- Mage is used throughout the whole ELT process, from pulling arrivals and departures data from the API to combining the data in BigQuery
- Google Cloud Storage (GCS) is used as the data lake
- BigQuery is used as the analytical query engine
- PySpark (set up on Dataproc) is used as a data transformation tool to combine arrivals and departures
For this project, the following requirements are needed:
- Docker
- GCS Buckets (GCP)
- BigQuery (GCP)
- Mage
- Java (Optional for Option A)
- Apache Spark (Optional for Option A)
If the prerequisite installations have not already been done, please follow the Zoomcamp tutorials for weeks 1 (Docker), 2 (Mage), and 5 (Spark), as the setup for these is quite extensive. Notably for this project, Mage and PySpark are the primary tools. If text is your preferred form of instruction, you can also use the notes made by students, which cover the setup process extensively.
Additionally, for the Mage orchestration, you will need to install the Python packages listed in requirements.txt for the DAG to run.
Terraform will be used to set up the GCS bucket. To do so, follow these steps:
- `cd` into the `terraform` directory.
- Apply the correct variables in the `variables.tf` file. These will be used to create the GCS bucket.
- In the terminal, initialize Terraform by entering `terraform init`.
- To apply the `main.tf` file, enter `terraform apply`.
- To destroy the setup, enter `terraform destroy` in the terminal.
To begin Mage, please `cd` into the `mage-zoomcamp` directory and then run the following:
- `docker compose build`
- `docker pull mageai/mageai:latest` (to update to the latest Mage image)
- `docker compose up`
Please make sure Docker is running and the mage-zoom container is up. Then you can open http://localhost:6789/ in your browser to run Mage locally.
A tutorial on the local Mage setup can be found here.
To connect Mage with GCS, you will need to set up a service account with proper access, and also have GCS credentials prepared for Mage to use.
Please refer to this tutorial by Mage for the GCS setup.
Additionally, it may be helpful to view the ETL: API to GCS and ETL: API to BigQuery tutorials for a full explanation.
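As a rough illustration, below is a minimal sketch of a Mage data exporter block writing a DataFrame to GCS using the credentials configured in io_config.yaml. It follows Mage's standard GCS exporter template; the bucket name and object key are placeholders, not the actual names used in this project.

```python
from os import path

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_gcs(df, **kwargs) -> None:
    # io_config.yaml points at the service account key file
    # (GOOGLE_SERVICE_ACC_KEY_FILEPATH) prepared for Mage.
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    bucket_name = 'your-gcs-bucket'      # placeholder: your data lake bucket
    object_key = 'berlin_arrivals.csv'   # placeholder: object name in the bucket

    GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        bucket_name,
        object_key,
    )
```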
The pipelines should all be connected and set up; you will see three pipelines:
- berlin_arrivals_etl_to_gcs_bq: This pipeline pulls the arrivals data from the API and uploads it as a CSV to a GCS bucket and a corresponding BigQuery table (a sketch of the API pull is shown after this list).
- berlin_departures_etl_to_gcs_bq: This pipeline pulls the departures data from the API and uploads it as a CSV to a GCS bucket and a corresponding BigQuery table.
- berlin_union_departures_arrivals: This pipeline pulls the data from the arrivals and departures tables and unions them into a combined table.
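For reference, the API pull in the first two pipelines roughly follows the pattern sketched below. This is a hedged illustration using the FlightRadarAPI package rather than the exact loader block from the repo; in particular, the nested keys of the schedule payload and the timestamp column name are assumptions that should be verified against the actual API response.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd
from FlightRadar24 import FlightRadar24API  # pip install FlightRadarAPI


def load_berlin_arrivals() -> pd.DataFrame:
    """Pull yesterday's arrivals for Berlin Brandenburg (BER)."""
    fr_api = FlightRadar24API()

    # Airport details include the arrivals/departures schedule.
    # NOTE: the nested keys below are an assumption about the payload shape;
    # inspect the returned dict for your library version before relying on them.
    details = fr_api.get_airport_details("BER")
    arrivals = details["airport"]["pluginData"]["schedule"]["arrivals"]["data"]

    df = pd.json_normalize([entry["flight"] for entry in arrivals])

    # The DAG collects one day (yesterday) per run because of the API rate
    # limiter, so keep only flights scheduled for that date.
    yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).date()
    scheduled = pd.to_datetime(df["time.scheduled.arrival"], unit="s", utc=True)
    return df[scheduled.dt.date == yesterday]
```

The departures pipeline mirrors this, reading the departures side of the schedule instead.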
With Mage, you can create triggers to run the pipelines daily. Run them in the order listed above, with the union pipeline last to combine the data, and make sure the first two pipelines have enough time to finish before the two tables are combined in BigQuery.
To run Mage on Google Cloud so it can run autonomously without needing a local machine, follow the setup guide by Mage (Matt Palmer) here:
Note: You will need to recreate the pipelines and schedules on the cloud.
Alternatively, as shown in Option A, the repo also includes the file `berlin_airport_pyspark_dataproc.py`, the PySpark script used to transform and combine the two datasets. In that Python file, the correct GCS bucket names need to be filled in for the data and for `spark.conf.set('temporaryGcsBucket', "dataproc bucket")`. Both Mage and Spark aim to achieve the same task here: transforming the arrivals and departures data and then unioning them into a combined table.
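For orientation, here is a condensed sketch of what the Dataproc job does. It is not the full berlin_airport_pyspark_dataproc.py, and the bucket names, file paths, and BigQuery table name are placeholders to replace with your own.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("berlin_airport_union").getOrCreate()

# Temporary bucket used by the Spark BigQuery connector when writing from
# Dataproc (placeholder name -- use your own staging bucket).
spark.conf.set('temporaryGcsBucket', "dataproc bucket")

# Read the raw CSVs from the data lake (placeholder bucket and paths).
arrivals = spark.read.option("header", True).csv("gs://your-data-bucket/berlin_arrivals.csv")
departures = spark.read.option("header", True).csv("gs://your-data-bucket/berlin_departures.csv")

# Tag each side so the combined table keeps the flight direction, then union.
combined = (
    arrivals.withColumn("flight_type", F.lit("arrival"))
    .unionByName(
        departures.withColumn("flight_type", F.lit("departure")),
        allowMissingColumns=True,
    )
)

# Write the combined table to BigQuery (placeholder dataset.table name).
(
    combined.write.format("bigquery")
    .option("table", "berlin_airport.flights_combined")
    .mode("overwrite")
    .save()
)
```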
Please follow this video tutorial to understand how to set up a Dataproc cluster and how to run the file. For this project, you can use the Python file in the repository as opposed to the one in the tutorial. Note that running a cluster can be prohibitively expensive and doesn't make much sense for a small-scale project like this, but it can be useful to learn for educational purposes.
I would like to thank Jean Loui Bernard Silva de Jesus (JeanExtreme002) for the FlightRadarAPI that was used to pull the flight data.
The GitHub repository can be found here: https://github.com/JeanExtreme002/FlightRadarAPI?tab=MIT-1-ov-file
Special thank you to the team at Data Engineering Zoomcamp for organizing this project. They are great folks, and you can check them out here:
- Github: https://github.com/DataTalksClub/data-engineering-zoomcamp
- Slack: https://datatalks.club/slack.html
Lastly, I would like to give a shoutout to Mage for partnering with the Zoomcamp and providing a smooth tutorial and experience on using the tool.
You can find the Mage Zoomcamp here: https://github.com/mage-ai/mage-zoomcamp