
Commit

Update README.md
mathewsrc authored Nov 2, 2023
1 parent 2f7f6df commit 0d50c69
Showing 1 changed file with 27 additions and 16 deletions.
43 changes: 27 additions & 16 deletions README.md
@@ -1,13 +1,13 @@
Streamlined ETL Process: Unleashing Polars, YData Profiling and Airflow
Streamlined ETL Process: Unleashing Airflow, Polars, SODA, and YData Profiling
==============================

Project Summary:

This ETL (Extract, Transform, Load) project employs several Python libraries, including Polars, Airflow, Dataprep, Requests, BeautifulSoup, and Loguru, to streamline the extraction, transformation, and loading of CSV datasets from the U.S. government's data repository at https://catalog.data.gov.
This ETL (Extract, Transform, Load) project employs several Python libraries, including Polars, Airflow, YData Profiling, Requests, BeautifulSoup, and Loguru, to streamline the extraction, transformation, and loading of CSV datasets from the [U.S. government's data repository](https://catalog.data.gov) and the [Chicago Sidewalk Cafe Permits](https://catalog.data.gov/dataset/sidewalk-cafe-permits) dataset. The notebook in the notebooks directory is used to extract, transform, and load datasets from the U.S. government's data repository, while the Airflow workflow extracts, transforms, and loads the Chicago Sidewalk Cafe Permits dataset.
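
Since YData Profiling does not get a dedicated objective below, here is a minimal sketch of how a profile report could be produced from a Polars DataFrame. The file path and report title are illustrative, and the conversion to pandas is an assumption (YData Profiling operates on pandas DataFrames), so this is not the project's exact code:

```python
# Illustrative profiling sketch: build an HTML profile report from a CSV dataset.
import polars as pl
from ydata_profiling import ProfileReport

df = pl.read_csv("data/raw/sidewalk_cafe_permits.csv")  # hypothetical input path

# YData Profiling expects a pandas DataFrame, so convert the Polars frame first.
report = ProfileReport(df.to_pandas(), title="Sidewalk Cafe Permits Profile")
report.to_file("sidewalk_cafe_permits_profile.html")
```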

Project Objectives:

Extraction: I utilize the requests library and BeautifulSoup to scrape datasets from https://catalog.data.gov, a repository of various data formats, including CSV, XLS, and HTML.
Extraction: I utilize the requests library and BeautifulSoup to scrape datasets from https://catalog.data.gov, including the Chicago Sidewalk Cafe Permits dataset.
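
As a rough illustration of this step, here is a minimal Requests + BeautifulSoup sketch that collects CSV links from a dataset page; the link selector and the assumption that CSV resources end in `.csv` are mine, not the repository's actual scraper:

```python
# Illustrative scraping sketch: list CSV resource links on a catalog.data.gov dataset page.
import requests
from bs4 import BeautifulSoup

url = "https://catalog.data.gov/dataset/sidewalk-cafe-permits"
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Keep only anchor tags whose href points at a CSV file (an assumption about the page markup).
csv_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".csv")]
print(csv_links)
```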

Transformation: Data manipulation and cleaning are accomplished using Polars, a high-performance data manipulation library written in Rust.
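
A hedged sketch of the kind of Polars cleaning pass this implies, ending with the CSV write used in the loading step; the column names and file paths are hypothetical:

```python
# Illustrative Polars transformation: drop all-null rows, rename a column, drop missing keys, save.
import polars as pl

df = pl.read_csv("data/raw/permits.csv")  # hypothetical input path

cleaned = (
    df.filter(~pl.all_horizontal(pl.all().is_null()))  # drop rows where every value is null
    .rename({"LEGAL NAME": "legal_name"})  # hypothetical column rename
    .drop_nulls(subset=["legal_name"])  # drop rows missing a key column
)

cleaned.write_csv("data/processed/permits_clean.csv")  # loading step: write the cleaned CSV
```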

@@ -17,9 +17,19 @@ Loading: Transformed data is saved in CSV files using Polars.

Logging: Loguru is chosen for logging, ensuring transparency, and facilitating debugging throughout the ETL process.
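
For example, a minimal Loguru setup of the sort this suggests; the log file path, rotation policy, and messages are illustrative assumptions:

```python
# Illustrative Loguru configuration: log to a rotating file and record ETL progress and failures.
from loguru import logger

logger.add("logs/etl_{time}.log", rotation="10 MB", level="INFO")  # hypothetical sink

logger.info("Starting extraction")
try:
    raise ValueError("example failure")  # stand-in for a real ETL error
except ValueError:
    logger.exception("Extraction failed")  # logs the message together with the traceback
```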

Data quality: SODA is employed to ensure data quality.
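
Given the `include/soda` files listed in the project tree below, a plausible sketch of a programmatic Soda Core scan; the data source name is a placeholder that would have to match the one defined in `configuration.yml`:

```python
# Illustrative Soda Core scan: run the SodaCL checks against the configured DuckDB data source.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("duckdb_local")  # placeholder: must match configuration.yml
scan.add_configuration_yaml_file("include/soda/configuration.yml")
scan.add_sodacl_yaml_files("include/soda/checks/transformation.yml")
scan.execute()

print(scan.get_logs_text())
scan.assert_no_checks_fail()  # raise if any data quality check failed
```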

Tests: Pytest is employed for code validation.
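
As an illustration, a small pytest case for a helper-style Polars transformation; the function under test is defined inline as a stand-in rather than imported from the repository:

```python
# Illustrative pytest test: verify that rows consisting only of nulls are dropped.
import polars as pl


def drop_full_null_rows(df: pl.DataFrame) -> pl.DataFrame:
    """Stand-in for a helper like dags/methods/drop_full_null_rows.py."""
    return df.filter(~pl.all_horizontal(pl.all().is_null()))


def test_drop_full_null_rows():
    df = pl.DataFrame({"a": [1, None], "b": ["x", None]})
    result = drop_full_null_rows(df)
    assert result.height == 1
    assert result["a"].to_list() == [1]
```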

Linting: Ruff is employed to ensure code quality.

Formatting: Ruff is also employed to format the code.

Orchestration: Airflow is employed to orchestrate the whole ETL process.
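
A hedged sketch of what such a DAG could look like with Airflow's TaskFlow API; the DAG name, schedule, and task bodies are placeholders rather than the repository's actual workflow:

```python
# Illustrative Airflow DAG (TaskFlow API): extract -> transform -> load.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2023, 11, 1), catchup=False, tags=["etl"])
def sidewalk_cafe_permits_etl():
    @task
    def extract() -> str:
        return "data/raw/permits.csv"  # placeholder: download the CSV here

    @task
    def transform(raw_path: str) -> str:
        print(f"Transforming {raw_path}")  # placeholder: Polars cleaning would happen here
        return "data/processed/permits_clean.csv"

    @task
    def load(clean_path: str) -> None:
        print(f"Loaded {clean_path}")  # placeholder: final write and quality checks here

    load(transform(extract()))


sidewalk_cafe_permits_etl()
```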

Through the automation of these ETL tasks, I establish a robust data pipeline that transforms raw data into valuable assets, supporting informed decision-making and data-driven insights.
CI: GitHub Actions is used for continuous integration whenever code is pushed to GitHub.

By automating these ETL tasks, I establish a robust data pipeline that transforms raw data into valuable assets, supporting informed decision-making and data-driven insights.


Project Organization
@@ -38,8 +48,8 @@ Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/
| ├── drop_full_null_columns.py # Method to drop columns if all values are null
| ├── drop_full_null_rows.py # Method to drop rows if all values in a row are null
| ├── drop_missing.py # Method to drop rows with missing values in specific columns
| ├── format_url.py # Method to format the url
| ├── get_time_period.py # Method to get currently time period
| ├── format_url.py # Method to format the URL
| ├── get_time_period.py # Method to get current time period
| ├── modify_file_name.py # Method to create a formatted file name
| └── rename_columns.py # Method to rename DataFrame columns
├── include
@@ -48,7 +58,7 @@ Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/
| └── soda # Directory with SODA files
| ├── checks # Directory containing data quality rules yml files
| | └── transformation.yml # Data quality rules for transformation step
| ├── check_function.py # Helpful function for run SODA data quality checks
| ├── check_function.py # Helpful function for running SODA data quality checks
| └── configuration.yml # Configurations to connect Soda to a data source (DuckDB)
├── README.md
├── notebooks # COLAB notebooks
@@ -60,13 +70,13 @@ Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/
├── format.sh # Bash script to format code with ruff
├── LICENSE
├── lint.sh # Bash script to lint code with ruff
├── Makefile # Makefile with some helpfull commands
├── Makefile # Makefile with some helpful commands
├── packages.txt
├── README.md
├── requirements.txt # Required Python libraries
├── setup_data_folders.sh # Bash script to create some directories
├── source_env_linux.sh # Bash script to create an Python virtual enviroment in linux
├── source_env_windows.sh # Bash script to create an Python virtual enviroment in windows
├── source_env_linux.sh # Bash script to create a Python virtual environment in linux
├── source_env_windows.sh # Bash script to create a Python virtual environment in windows
└── test.sh # Bash script to test code with pytest
```

@@ -84,9 +94,9 @@ Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/

## Exploring datasets

You can can explore some datasets by using this notebook: [gov_etl.ipynb](https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/blob/master/notebooks/gov_etl.ipynb)
You can explore some datasets by using this notebook: [gov_etl.ipynb](https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/blob/master/notebooks/gov_etl.ipynb)

Bellow you can see some images of it:
Below you can see some images of it:

Fetching datasets<br/>

@@ -107,13 +117,13 @@ First things first, we need to start astro project. you have two options:
astro dev start
```

2. Use the Makefile command `astro-start` on `Terminal`. Notice that you maybe need to install Makefile utility on your machine.
2. Use the Makefile command `astro-start` in the terminal. Note that you might need to install the `make` utility on your machine.

```bash
astro-start
```

Now you can visit the Airflow Webserver at: http://localhost:8080 and trigger the ETL workflow or run the Astro command `astro dev ps` to see running containers
Now you can visit the Airflow Webserver at http://localhost:8080 and trigger the ETL workflow, or run the Astro command `astro dev ps` to see the running containers.

```bash
astro dev ps
@@ -128,7 +138,7 @@ Output

## TODO

Next, unpause by clicking in the toggle button next to the dag name
Next, unpause the DAG by clicking on the toggle button next to the DAG name.

## TODO !insert image here

Expand All @@ -138,7 +148,7 @@ Finally, click on the play button to trigger the workflow
## TODO !insert image here


If everything goes well you will see a result like this one bellow
If everything goes well, you will see a result like the one below.

## TODO !insert image here

@@ -150,6 +160,7 @@ If everything goes well you will see a result like this one bellow
- [ ] Record how to create a DuckDB connection in airflow
- [ ] Record workflow running
- [ ] Create the architecture image of the project
- [ ] Review README (add ETL section for notebook and airflow)
- [ ] Completed!

