Update README.md
mathewsrc authored Nov 15, 2023
1 parent f40e1ad commit 9391033
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions README.md
@@ -43,6 +43,7 @@ This ETL (Extract, Transform, Load) project employs several Python libraries, in
1. I created a concurrency of 1 using the BashOperator to avoid two or more executions against DuckDB as allowing two or more calls to DuckDB would cause an error
2. I loaded the CSV file using an HTTP call by leveraging the Astro Python SDK `load_file()` function and the DuckDB connection that I created in Airflow `Admin/Connections`
3. Then, I create a task to check raw data quality using [Soda](https://docs.soda.io/)

3.1 Check the number of rows

<img src="https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Airflow-Soda-Polars-and-YData-Profiling/assets/94936606/7861bcba-09d4-4f3b-b00f-5a86e6288f40" width=40%><br/>
@@ -55,16 +56,17 @@ This ETL (Extract, Transform, Load) project employs several Python libraries, in

<img src="https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Airflow-Soda-Polars-and-YData-Profiling/assets/94936606/70904470-35c5-487a-be84-9fa431524d00" width=40%><br/>

-4. Next, I created tasks to count the number of rows and to create a data profiling
-5. Finally, I create a transform task to apply the following transformations: lower column name, remove duplicated rows, remove missing values, and drop a row if all values are null
+5. Next, I created tasks to count the number of rows and to create a data profiling
+6. Finally, I create a transform task to apply the following transformations: lower column name, remove duplicated rows, remove missing values, and drop a row if all values are null
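
The four transformations named in the final step can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's own Polars code: the function name `transform_rows` and the sample records are hypothetical, a "null" is taken to mean Python `None`, and "remove missing values" is read as dropping any row that contains a null.

```python
from typing import Any, Dict, List


def transform_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: apply the four transformations to record dicts.

    Assumes every value is hashable (str, int, float, None, ...).
    """
    seen = set()
    result = []
    for row in rows:
        # 1. Lowercase the column names.
        row = {key.lower(): value for key, value in row.items()}
        values = list(row.values())
        # 2. Drop a row if all of its values are null.
        if all(value is None for value in values):
            continue
        # 3. Remove rows that still contain a missing value.
        if any(value is None for value in values):
            continue
        # 4. Remove duplicated rows, keeping the first occurrence.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        result.append(row)
    return result


# Hypothetical sample data, not taken from the project's CSV file.
sample = [
    {"Name": "Ana", "Age": 30},
    {"Name": "Ana", "Age": 30},     # exact duplicate
    {"Name": "Bob", "Age": None},   # one missing value
    {"Name": None, "Age": None},    # all values null
    {"Name": "Cy", "Age": 41},
]
cleaned = transform_rows(sample)
print(cleaned)
```

Under this reading, step 3 already covers step 2 (a row whose values are all null also contains a null); both are kept separate here only to mirror the list in the README.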


### Part 2

![image](https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Airflow-Soda-Polars-and-YData-Profiling/assets/94936606/8b325417-bdc9-4adb-8a22-cf2a04d7171e)

1. After the transformation of data I used Soda to check data quality to ensure that data was transformed as expected
-1.1 Check the number of rows
+
+1.1 Check the number of rows

<img src="https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Airflow-Soda-Polars-and-YData-Profiling/assets/94936606/c51ba209-2f06-4a05-8a76-e5b74a89b4fd" width=40%><br/>
