diff --git a/cookbooks/Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb b/cookbooks/Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb
index 9c74372..9c8e5c7 100644
--- a/cookbooks/Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb
+++ b/cookbooks/Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-    "id": "31d69dd0-cec7-4554-b4de-3f3db4d62add",
+    "id": "0",
    "metadata": {},
    "source": [
     "# Cookbook 2: Validate data during ingestion (take action on failures)"
@@ -10,7 +10,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "6b8b7fd6-f986-4d5e-a627-17c550e1348f",
+    "id": "1",
    "metadata": {},
    "source": [
     "This cookbook showcases a sample GX data validation workflow characteristic of data ingestion at the start of the data pipeline. Data is loaded into a Pandas dataframe, cleaned, validated, and then ingested into a Postgres database table. This cookbook explores the validation workflow first in a notebook setting, then embedded within an Airflow pipeline.\n",
@@ -22,7 +22,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "4e7e703b-73f9-4550-8759-5aabfb5f0669",
+    "id": "2",
    "metadata": {},
    "source": [
     "## Imports"
@@ -30,7 +30,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "d187e1a6-915e-4a68-8630-e04a4f4bf98c",
+    "id": "3",
    "metadata": {},
    "source": [
     "This tutorial features the `great_expectations` library.\n",
@@ -43,7 +43,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "20879e58-685f-4ed9-b1d3-56b3521c1a02",
+    "id": "4",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -59,7 +59,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "c4325870-7d39-48a1-a06e-da86903c20c9",
+    "id": "5",
    "metadata": {},
    "source": [
     "## Load raw data"
@@ -67,7 +67,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "733a5bd9-0b7f-4707-83ba-9791a0f28a82",
+    "id": "6",
    "metadata": {},
    "source": [
     "In this tutorial, you will clean and validate a dataset containing synthesized product data. The data is loaded from a CSV file into a Pandas DataFrame."
@@ -76,7 +76,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "8320005b-5560-40be-a797-0c3720036dd6",
+    "id": "7",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -88,7 +88,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "c97df7e7-5b97-41f4-adf1-55d7d4fc678d",
+    "id": "8",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -99,7 +99,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "5aac1e31-7473-47da-9607-d96e968da270",
+    "id": "9",
    "metadata": {},
    "source": [
     "## Examine destination tables"
@@ -107,7 +107,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "09e3fe6a-63e3-4f16-a328-96830e4c6de1",
+    "id": "10",
    "metadata": {},
    "source": [
     "The product data will be normalized and loaded into multiple Postgres tables:\n",
@@ -121,7 +121,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "3b71f661-71bd-4ad6-bce0-0bdd6eb3e807",
+    "id": "11",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -131,7 +131,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "124a885e-759b-4af6-8bcd-bb5346983a6b",
+    "id": "12",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -141,7 +141,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "fd334e4e-d4db-4fb5-802e-8809f090f947",
+    "id": "13",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -150,7 +150,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "46e93824-4f7b-42a0-94c7-92dfc5fa6b67",
+    "id": "14",
    "metadata": {},
    "source": [
     "## Clean product data"
@@ -158,7 +158,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "77abd0ab-0b79-4e98-9bde-55cb09ca51cd",
+    "id": "15",
    "metadata": {},
    "source": [
     "To clean the product data and separate it into three dataframes to normalize the data, you will use a pre-prepared function, `clean_product_data`. The cleaning code is displayed below, and then invoked to clean the raw product data."
@@ -167,7 +167,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "68245d77-d37e-49a4-9e04-c2df620684b0",
+    "id": "16",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -177,7 +177,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "2fa3c45b-a5ea-4aa0-9cab-74beafa063b3",
+    "id": "17",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -189,7 +189,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "5f5e7e37-618f-481a-8ad6-704782c0b117",
+    "id": "18",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -201,7 +201,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "25489fb3-368b-4873-a402-b759fe9de53c",
+    "id": "19",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -213,7 +213,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "91ed5ebc-b58d-419c-8dfe-b25f94f861e5",
+    "id": "20",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -224,7 +224,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "684b1b89-78f0-4b0a-8f06-398815ce56e2",
+    "id": "21",
    "metadata": {},
    "source": [
     "## GX data validation workflow"
@@ -232,7 +232,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "a7ee9db7-1495-450f-a784-f392120185ab",
+    "id": "22",
    "metadata": {},
    "source": [
     "You will validate the cleaned product data using GX prior to loading it into a Postgres database table.\n",
@@ -254,7 +254,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "b3a2b58f-f1ca-4b9a-83e2-19a647a171b0",
+    "id": "23",
    "metadata": {},
    "source": [
     "### Set up the GX validation workflow\n",
@@ -267,7 +267,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "df334b32-703c-4c3f-932a-0940ca02b035",
+    "id": "24",
    "metadata": {},
    "source": [
     "```{admonition} Reminder: Adding GX components to the Data Context\n",
@@ -283,7 +283,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "4cae696e-209b-48be-8104-a2b30ffae51d",
+    "id": "25",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -345,7 +345,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "bff73224-938a-471c-9dc9-ebdc6c4e54bd",
+    "id": "26",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -354,7 +354,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "a4401526-dc15-4392-90d8-469e6a4dee62",
+    "id": "27",
    "metadata": {},
    "source": [
     "### Extend the validation workflow"
@@ -362,7 +362,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "a6f51c51-61b5-4816-a64e-df57159d9e5a",
+    "id": "28",
    "metadata": {},
    "source": [
     "A **Validation Definition** pairs a Batch Definition with an Expectation Suite. It defines what data you want to validate using which Expectations."
@@ -371,7 +371,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "908e39e0-da9d-44a9-aab8-1e60c35e7fae",
+    "id": "29",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -397,7 +397,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "26890cd8-53b9-4904-ae97-bd895748588b",
+    "id": "30",
    "metadata": {},
    "source": [
     "A **Checkpoint** executes data validation based on the specifications of the Validation Definition. Checkpoints also enable actions to be tied to data validation, and \n",
@@ -410,7 +410,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "bcc1d65b-db64-44fe-bd10-4f087ed4594b",
+    "id": "31",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -446,7 +446,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "fd65b819-588b-419d-a82c-414f4f7e1dc3",
+    "id": "32",
    "metadata": {},
    "source": [
     "Next, run the Checkpoint. When validating dataframe Data Sources, the dataframe must be supplied to the Checkpoint at runtime."
@@ -455,7 +455,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "b3a798e3-dd87-4513-b757-134a94ad4edd",
+    "id": "33",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -464,7 +464,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "0d7faa28-ea0f-40d3-a954-90df8e6993a5",
+    "id": "34",
    "metadata": {},
    "source": [
     "## Examine Validation Result"
@@ -473,7 +473,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "b9e12ed7-be50-40a4-8d7c-d6d493baee7d",
+    "id": "35",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -486,7 +486,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "1fd62090-1003-4560-9f60-7c198ba968d9",
+    "id": "36",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -495,7 +495,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "43e9597a-9c02-44a5-a29c-4f5f8f7aa1c0",
+    "id": "37",
    "metadata": {},
    "source": [
     "```\n",
@@ -511,7 +511,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "dbc43a93-a931-482c-8e8b-1c2d077decc0",
+    "id": "38",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -526,7 +526,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "05770580-3fbe-4c55-b8c9-0a9a2535a65a",
+    "id": "39",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -538,7 +538,7 @@
   },
   {
    "cell_type": "markdown",
-    "id": "c2f1e408-d51b-45af-a69a-3080955a70d2",
+    "id": "40",
    "metadata": {},
    "source": [
     "## pull out bad rows"
@@ -547,7 +547,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "3190040d-a52e-42d7-8f6d-fe19470677e2",
+    "id": "41",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -562,7 +562,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "bc45a6bc-2e12-44ef-abaa-218362ecb6ef",
+    "id": "42",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -576,7 +576,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "0b54b2d4-3d63-4569-86c2-c26362af0949",
+    "id": "43",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -587,7 +587,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "e880a999-7abf-4684-9a8d-9afbbfd9e87d",
+    "id": "44",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -602,7 +602,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "41880604-b5cc-4c34-87e9-389bfbc8da78",
+    "id": "45",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -618,7 +618,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-    "id": "dd532b9d-e720-4676-9cf3-998548bdf1ec",
+    "id": "46",
    "metadata": {},
    "outputs": [],
    "source": []
diff --git a/environment/airflow/dags/cookbook1_ingest_customer_data.py b/environment/airflow/dags/cookbook1_ingest_customer_data.py
index 2e184c4..714d042 100644
--- a/environment/airflow/dags/cookbook1_ingest_customer_data.py
+++ b/environment/airflow/dags/cookbook1_ingest_customer_data.py
@@ -4,11 +4,10 @@
 import pathlib
 
 import pandas as pd
+import tutorial_code as tutorial
 from airflow import DAG
 from airflow.operators.python import PythonOperator
 
-import tutorial_code as tutorial
-
 
 log = logging.getLogger("GX validation")
 
diff --git a/environment/airflow/dags/cookbook2_ingest_product_data_handle_invalid_data.py b/environment/airflow/dags/cookbook2_ingest_product_data_handle_invalid_data.py
index d3ba4d0..744bc46 100644
--- a/environment/airflow/dags/cookbook2_ingest_product_data_handle_invalid_data.py
+++ b/environment/airflow/dags/cookbook2_ingest_product_data_handle_invalid_data.py
@@ -4,11 +4,10 @@
 import pathlib
 
 import pandas as pd
+import tutorial_code as tutorial
 from airflow import DAG
 from airflow.operators.python import PythonOperator
 
-import tutorial_code as tutorial
-
 
 log = logging.getLogger("GX validation")
 
diff --git a/tests/test_cookbook1.py b/tests/test_cookbook1.py
index 88daff2..5f439b3 100644
--- a/tests/test_cookbook1.py
+++ b/tests/test_cookbook1.py
@@ -6,7 +6,6 @@
 import great_expectations as gx
 import pandas as pd
 import pytest
-
 import tutorial_code as tutorial
 
 
diff --git a/tests/test_cookbook2.py b/tests/test_cookbook2.py
index 5d7b3db..a330c7a 100644
--- a/tests/test_cookbook2.py
+++ b/tests/test_cookbook2.py
@@ -6,7 +6,6 @@
 import great_expectations as gx
 import pandas as pd
 import pytest
-
 import tutorial_code as tutorial
 
 
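A note for reviewers following the notebook hunks above: the re-ID'd cells walk through the GX Core 1.x component chain (Data Source -> Data Asset -> Batch Definition -> Expectation Suite -> Validation Definition -> Checkpoint), with the dataframe handed to the Checkpoint at run time. For anyone unfamiliar with that API, here is a minimal sketch of the pattern the cookbook describes; the names (`products`, `product_suite`, `product_checkpoint`) and the single Expectation are illustrative stand-ins, not the cookbook's actual identifiers.

```python
import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd

context = gx.get_context()

# Data Source -> Data Asset -> Batch Definition for a runtime dataframe.
batch_definition = (
    context.data_sources.add_pandas("pandas")
    .add_dataframe_asset(name="products")  # illustrative asset name
    .add_batch_definition_whole_dataframe("products_batch")
)

# Expectation Suite: what "valid" means for this data.
suite = context.suites.add(gx.ExpectationSuite(name="product_suite"))
suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="product_id"))

# Validation Definition: pairs the Batch Definition with the Suite.
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="product_validation", data=batch_definition, suite=suite
    )
)

# Checkpoint: runs the Validation Definition (and any attached actions).
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="product_checkpoint",
        validation_definitions=[validation_definition],
    )
)

# Dataframe Data Sources receive the dataframe at runtime as a batch parameter.
df = pd.DataFrame({"product_id": [101, 102, None]})
result = checkpoint.run(batch_parameters={"dataframe": df})
print(result.success)  # False here: the null product_id fails the Expectation
```

Passing the dataframe as a batch parameter, rather than baking it into the Data Source, is what lets the same Checkpoint run unchanged in the notebook and inside the Airflow DAGs touched by this diff.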