Streamlined ETL Process: Unleashing Airflow, Polars, SODA, and YData Profiling
==============================

Project Summary:

This ETL (Extract, Transform, Load) project employs several Python libraries, including Polars, Airflow, SODA, YData Profiling, Requests, BeautifulSoup, and Loguru, to streamline the extraction, transformation, and loading of CSV datasets from the [U.S. government's data repository](https://catalog.data.gov), such as the [Chicago Sidewalk Cafe Permits](https://catalog.data.gov/dataset/sidewalk-cafe-permits) dataset. The notebook in the notebooks directory extracts, transforms, and loads datasets from the U.S. government's data repository, while the Airflow workflow extracts, transforms, and loads the Chicago Sidewalk Cafe Permits dataset.

Project Objectives:

Extraction: I utilize the Requests library and BeautifulSoup to scrape datasets from https://catalog.data.gov, including the Chicago Sidewalk Cafe Permits dataset.
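
As a rough illustration, a scraping step along these lines can collect CSV links from a dataset page (the page-layout assumption and the `.csv` filter here are illustrative, not the project's actual parsing code):

```python
import requests
from bs4 import BeautifulSoup

# Illustrative target page; the real workflow may scrape other catalog pages.
URL = "https://catalog.data.gov/dataset/sidewalk-cafe-permits"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect every anchor on the page that points at a CSV resource.
csv_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".csv")]
print(csv_links)
```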

Transformation: Data manipulation and cleaning are accomplished using Polars, a high-performance data manipulation library written in Rust.
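
A minimal sketch of what a Polars transformation can look like, assuming hypothetical column names rather than the dataset's real schema; the final `write_csv` call corresponds to the loading step below:

```python
import polars as pl

# The file paths and column names below are assumptions for illustration.
df = pl.read_csv("data/raw/sidewalk_cafe_permits.csv")

df = (
    df.drop_nulls(subset=["ACCOUNT NUMBER", "LEGAL NAME"])  # drop rows missing key fields
    .rename({"ACCOUNT NUMBER": "account_number", "LEGAL NAME": "legal_name"})
    .with_columns(pl.col("legal_name").str.strip_chars())  # trim stray whitespace
)

# Loading step: persist the cleaned data as a CSV file.
df.write_csv("data/transformed/sidewalk_cafe_permits.csv")
```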

Data Profiling: YData Profiling is employed to create dynamic data reports and facilitate data profiling, quality assessment, and visualization, providing insights into data quality and characteristics.
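
A minimal sketch of generating a report with ydata-profiling, which operates on pandas DataFrames, so the Polars frame is converted first (file paths are assumptions):

```python
import polars as pl
from ydata_profiling import ProfileReport

df = pl.read_csv("data/transformed/sidewalk_cafe_permits.csv")

# ydata-profiling expects a pandas DataFrame, so convert the Polars frame.
report = ProfileReport(df.to_pandas(), title="Sidewalk Cafe Permits Profile")
report.to_file("sidewalk_cafe_permits_report.html")
```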

Loading: Transformed data is saved in CSV files using Polars.

Data quality: SODA is employed to run data quality checks against the transformed data.
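
A minimal sketch of running a SODA Core scan programmatically against the DuckDB configuration in include/soda; the data source name here is an assumption and must match configuration.yml:

```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("duckdb")  # assumption: must match configuration.yml
scan.add_configuration_yaml_file("include/soda/configuration.yml")
scan.add_sodacl_yaml_files("include/soda/checks/transformation.yml")
scan.execute()

# Raise an error if any data quality check failed.
scan.assert_no_checks_fail()
```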

Tests: Pytest is employed for code validation.
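
A minimal sketch of a pytest test; the `format_url` helper below is a hypothetical stand-in, as the real helper in the repository may have a different signature:

```python
import pytest

# Hypothetical stand-in for the project's format_url helper.
def format_url(base: str, path: str) -> str:
    return f"{base.rstrip('/')}/{path.lstrip('/')}"

@pytest.mark.parametrize(
    ("base", "path", "expected"),
    [
        ("https://catalog.data.gov", "dataset", "https://catalog.data.gov/dataset"),
        ("https://catalog.data.gov/", "/dataset", "https://catalog.data.gov/dataset"),
    ],
)
def test_format_url(base, path, expected):
    assert format_url(base, path) == expected
```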

Linting: Ruff is employed to lint the code and ensure code quality.

Formatting: Ruff is also employed to format the code.

Orchestration: Airflow is employed to orchestrate the whole ETL process.
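
A minimal sketch of the orchestration idea using Airflow's TaskFlow API; the task bodies are placeholders, not the project's real DAG:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def streamlined_etl():
    @task
    def extract() -> str:
        # Placeholder: download the raw CSV and return its path.
        return "data/raw/sidewalk_cafe_permits.csv"

    @task
    def transform(raw_path: str) -> str:
        # Placeholder: clean the data with Polars and return the output path.
        return "data/transformed/sidewalk_cafe_permits.csv"

    @task
    def load(transformed_path: str) -> None:
        # Placeholder: publish or archive the transformed file.
        print(f"Loaded {transformed_path}")

    load(transform(extract()))

streamlined_etl()
```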

CI: GitHub Actions is used for continuous integration, running checks whenever code is pushed to GitHub.

By automating these ETL tasks, I establish a robust data pipeline that transforms raw data into valuable assets, supporting informed decision-making and data-driven insights.


Project Structure:

```
Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/
| ├── drop_missing.py # Method to drop rows with missing values in specific columns
| ├── format_url.py # Method to format the URL
| ├── get_time_period.py # Method to get current time period
| ├── modify_file_name.py # Method to create a formatted file name
| └── rename_columns.py # Method to rename DataFrame columns
├── include
| ├── checks # Directory containing data quality rules yml files
| | └── transformation.yml # Data quality rules for transformation step
| ├── check_function.py # Helpful function for running SODA data quality checks
| └── configuration.yml # Configurations to connect Soda to a data source (DuckDB)
├── notebooks # COLAB notebooks
├── LICENSE
├── lint.sh # Bash script to lint code with ruff
├── Makefile # Makefile with some helpful commands
├── packages.txt
├── README.md
├── requirements.txt # Required Python libraries
├── setup_data_folders.sh # Bash script to create some directories
├── source_env_linux.sh # Bash script to create a Python virtual environment in linux
├── source_env_windows.sh # Bash script to create a Python virtual environment in windows
└── test.sh # Bash script to test code with pytest
```


## Exploring datasets

You can explore some datasets by using this notebook: [gov_etl.ipynb](https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/blob/master/notebooks/gov_etl.ipynb)

Below you can see some images of it:

Fetching datasets<br/>
Output

## TODO

Next, unpause the DAG by clicking on the toggle button next to the DAG name.

## TODO !insert image here


Finally, click on the play button to trigger the workflow.
