From 93910336dc98ede452cb550821dbcad4ee19f2cd Mon Sep 17 00:00:00 2001
From: Matheus Ribeiro <94936606+mathewsrc@users.noreply.github.com>
Date: Wed, 15 Nov 2023 16:30:50 -0300
Subject: [PATCH] Update README.md

---
 README.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 79874ea..8d28f1f 100644
--- a/README.md
+++ b/README.md
@@ -43,6 +43,7 @@ This ETL (Extract, Transform, Load) project employs several Python libraries, in
 1. I enforced a concurrency of 1 using the BashOperator to avoid two or more simultaneous executions against DuckDB, since concurrent calls to DuckDB would cause an error
 2. I loaded the CSV file using an HTTP call by leveraging the Astro Python SDK `load_file()` function and the DuckDB connection that I created in Airflow `Admin/Connections`
 3. Then, I created a task to check raw data quality using [Soda](https://docs.soda.io/)
+   3.1 Check the number of rows
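A minimal sketch of how steps 1 and 2 above could look in DAG code. The README names the BashOperator for serializing access; this sketch uses the DAG-level `max_active_tasks` setting as a stand-in, and the CSV URL, connection id, and table name are placeholders, not values from the repository:

```python
# Sketch only: the CSV URL, conn_id, and table name below are assumptions.
import pendulum
from airflow.decorators import dag
from astro import sql as aql
from astro.files import File
from astro.table import Table


@dag(
    start_date=pendulum.datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
    max_active_tasks=1,  # one task at a time, so DuckDB never sees concurrent calls
)
def etl_pipeline():
    # load_file() fetches the CSV over HTTP and writes it into DuckDB
    # through the connection configured under Admin/Connections
    aql.load_file(
        input_file=File(path="https://example.com/source.csv"),  # placeholder URL
        output_table=Table(name="raw_data", conn_id="duckdb_default"),  # placeholder ids
    )


etl_pipeline()
```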
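For the Soda check in step 3, a row-count rule like 3.1 can be expressed in SodaCL and run through Soda Core's Python API. A sketch, assuming a data source named `duckdb_local`, a `configuration.yml`, and a `raw_data` table (all hypothetical):

```python
# Sketch only: data source name, config path, and table name are assumptions.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("duckdb_local")              # assumed data source name
scan.add_configuration_yaml_file("configuration.yml")  # assumed config path

# SodaCL equivalent of "3.1 Check the number of rows"
scan.add_sodacl_yaml_str(
    """
checks for raw_data:
  - row_count > 0
"""
)

exit_code = scan.execute()   # non-zero exit code when a check fails
print(scan.get_logs_text())
```

The same pattern would apply to the post-transform check 1.1 in Part 2, pointed at the transformed table instead.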
@@ -55,8 +56,8 @@ This ETL (Extract, Transform, Load) project employs several Python libraries, in
-4. Next, I created tasks to count the number of rows and to create a data profiling
-5. Finally, I create a transform task to apply the following transformations: lower column name, remove duplicated rows, remove missing values, and drop a row if all values are null
+5. Next, I created tasks to count the number of rows and to create a data profile
+6. Finally, I created a transform task to apply the following transformations: lowercase column names, remove duplicated rows, remove missing values, and drop a row if all values are null

 ### Part 2

@@ -64,7 +65,8 @@ This ETL (Extract, Transform, Load) project employs several Python libraries, in
 ![image](https://github.com/mathewsrc/Streamlined-ETL-Process-Unleashing-Airflow-Soda-Polars-and-YData-Profiling/assets/94936606/8b325417-bdc9-4adb-8a22-cf2a04d7171e)

 1. After transforming the data, I used Soda to check data quality and ensure the data was transformed as expected
-   1.1 Check the number of rows
+
+   1.1 Check the number of rows
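The transformations listed in the new item 6 map naturally onto Polars. A sketch, with the input file name assumed:

```python
# Sketch only: the input file name is an assumption.
import polars as pl

df = pl.read_csv("raw_data.csv")

transformed = (
    df.filter(~pl.all_horizontal(pl.all().is_null()))  # drop a row if all values are null
    .rename({c: c.lower() for c in df.columns})        # lowercase column names
    .unique()                                          # remove duplicated rows
    .drop_nulls()                                      # remove rows with missing values
)

print(transformed.height)  # the row count from item 5
```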
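The profiling task from item 5 could use ydata-profiling, which operates on pandas DataFrames. A sketch with assumed file names:

```python
# Sketch only: input and output file names are assumptions.
import polars as pl
from ydata_profiling import ProfileReport

df = pl.read_csv("transformed_data.csv")

# ydata-profiling expects pandas, so convert the Polars frame first
report = ProfileReport(df.to_pandas(), title="ETL Data Profile")
report.to_file("data_profile.html")
```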