Streamlined ETL Process: Unleashing Airflow, Polars, SODA, and YData Profiling
==============================

Project Summary:

This ETL (Extract, Transform, Load) project employs several Python libraries, including Polars, Airflow, SODA, YData Profiling, Requests, BeautifulSoup, and Loguru, to streamline the extraction, transformation, and loading of CSV datasets from the [U.S. government's data repository](https://catalog.data.gov), such as the [Chicago Sidewalk Cafe Permits](https://catalog.data.gov/dataset/sidewalk-cafe-permits) dataset. The notebook in the notebooks directory extracts, transforms, and loads datasets from the U.S. government's data repository, while the Airflow workflow extracts, transforms, and loads the Chicago Sidewalk Cafe Permits dataset.

Project Objectives:

Extraction: I utilize the Requests library and BeautifulSoup to scrape datasets from https://catalog.data.gov, including the Chicago Sidewalk Cafe Permits dataset.
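
As a rough illustration, a scraping step along these lines can collect CSV links from a dataset page (the page-layout assumption and the `.csv` filter here are illustrative, not the project's actual parsing code):

```python
import requests
from bs4 import BeautifulSoup

# Illustrative target page; the real workflow may scrape other catalog pages.
URL = "https://catalog.data.gov/dataset/sidewalk-cafe-permits"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect every anchor on the page that points at a CSV resource.
csv_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".csv")]
print(csv_links)
```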

Transformation: Data manipulation and cleaning are accomplished using Polars, a high-performance data manipulation library written in Rust.
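
A minimal sketch of what a Polars transformation can look like, assuming hypothetical column names rather than the dataset's real schema; the final `write_csv` call corresponds to the loading step below:

```python
import polars as pl

# The file paths and column names below are assumptions for illustration.
df = pl.read_csv("data/raw/sidewalk_cafe_permits.csv")

df = (
    df.drop_nulls(subset=["ACCOUNT NUMBER", "LEGAL NAME"])  # drop rows missing key fields
    .rename({"ACCOUNT NUMBER": "account_number", "LEGAL NAME": "legal_name"})
    .with_columns(pl.col("legal_name").str.strip_chars())  # trim stray whitespace
)

# Loading step: persist the cleaned data as a CSV file.
df.write_csv("data/transformed/sidewalk_cafe_permits.csv")
```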

Data Profiling: YData Profiling is employed to create dynamic data reports and facilitate data profiling, quality assessment, and visualization, providing insights into data quality and characteristics.
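
A minimal sketch of generating a report with ydata-profiling, which operates on pandas DataFrames, so the Polars frame is converted first (file paths are assumptions):

```python
import polars as pl
from ydata_profiling import ProfileReport

df = pl.read_csv("data/transformed/sidewalk_cafe_permits.csv")

# ydata-profiling expects a pandas DataFrame, so convert the Polars frame.
report = ProfileReport(df.to_pandas(), title="Sidewalk Cafe Permits Profile")
report.to_file("sidewalk_cafe_permits_report.html")
```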

Loading: Transformed data is saved in CSV files using Polars.

Data quality: SODA is employed to run data quality checks against the transformed data.
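
A minimal sketch of running a SODA Core scan programmatically against the DuckDB configuration in include/soda; the data source name here is an assumption and must match configuration.yml:

```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("duckdb")  # assumption: must match configuration.yml
scan.add_configuration_yaml_file("include/soda/configuration.yml")
scan.add_sodacl_yaml_files("include/soda/checks/transformation.yml")
scan.execute()

# Raise an error if any data quality check failed.
scan.assert_no_checks_fail()
```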

Tests: Pytest is employed for code validation.
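
A minimal sketch of a pytest test; the `format_url` helper below is a hypothetical stand-in, as the real helper in the repository may have a different signature:

```python
import pytest

# Hypothetical stand-in for the project's format_url helper.
def format_url(base: str, path: str) -> str:
    return f"{base.rstrip('/')}/{path.lstrip('/')}"

@pytest.mark.parametrize(
    ("base", "path", "expected"),
    [
        ("https://catalog.data.gov", "dataset", "https://catalog.data.gov/dataset"),
        ("https://catalog.data.gov/", "/dataset", "https://catalog.data.gov/dataset"),
    ],
)
def test_format_url(base, path, expected):
    assert format_url(base, path) == expected
```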

Linting: Ruff is employed to lint the code and ensure code quality.

Formatting: Ruff is also employed to format the code.

Orchestration: Airflow is employed to orchestrate the whole ETL process.
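
A minimal sketch of the orchestration idea using Airflow's TaskFlow API; the task bodies are placeholders, not the project's real DAG:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def streamlined_etl():
    @task
    def extract() -> str:
        # Placeholder: download the raw CSV and return its path.
        return "data/raw/sidewalk_cafe_permits.csv"

    @task
    def transform(raw_path: str) -> str:
        # Placeholder: clean the data with Polars and return the output path.
        return "data/transformed/sidewalk_cafe_permits.csv"

    @task
    def load(transformed_path: str) -> None:
        # Placeholder: publish or archive the transformed file.
        print(f"Loaded {transformed_path}")

    load(transform(extract()))

streamlined_etl()
```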

CI: GitHub Actions is used for continuous integration, running checks whenever code is pushed to GitHub.

By automating these ETL tasks, I establish a robust data pipeline that transforms raw data into valuable assets, supporting informed decision-making and data-driven insights.


Project Structure:

```
Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/
| ├── drop_missing.py # Method to drop rows with missing values in specific columns
| ├── format_url.py # Method to format the URL
| ├── get_time_period.py # Method to get current time period
| ├── modify_file_name.py # Method to create a formatted file name
| └── rename_columns.py # Method to rename DataFrame columns
├── include
| ├── checks # Directory containing data quality rules yml files
| | └── transformation.yml # Data quality rules for transformation step
| ├── check_function.py # Helpful function for running SODA data quality checks
| └── configuration.yml # Configurations to connect Soda to a data source (DuckDB)
├── notebooks # COLAB notebooks
├── LICENSE
├── lint.sh # Bash script to lint code with ruff
├── Makefile # Makefile with some helpful commands
├── packages.txt
├── README.md
├── requirements.txt # Required Python libraries
├── setup_data_folders.sh # Bash script to create some directories
├── source_env_linux.sh # Bash script to create a Python virtual environment in linux
├── source_env_windows.sh # Bash script to create a Python virtual environment in windows
└── test.sh # Bash script to test code with pytest
```


## Exploring datasets

You can explore some datasets by using this notebook: [gov_etl.ipynb](https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Polars-Dataprep-and-Airflow/blob/master/notebooks/gov_etl.ipynb)

Below you can see some images of it:

Fetching datasets<br/>
Output

## TODO

Next, unpause the DAG by clicking on the toggle button next to the DAG name.

## TODO !insert image here


Finally, click on the play button to trigger the workflow.
