Cookbook 3: Validate data with GX Core and GX Cloud (#5)
Add Cookbook 3, which walks the user through validating data in a data pipeline with GX and persisting the results in GX Cloud.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
rachhouse and pre-commit-ci[bot] authored Jan 9, 2025
1 parent 5b81388 commit b6b2b8f
Showing 37 changed files with 1,801 additions and 101 deletions.
10 changes: 8 additions & 2 deletions .github/workflows/ci.yaml
@@ -17,9 +17,12 @@ jobs:
        uses: actions/checkout@v4

      - name: Docker compose up
+       env:
+         GX_CLOUD_ORGANIZATION_ID: ${{ secrets.GX_CLOUD_ORGANIZATION_ID }}
+         GX_CLOUD_ACCESS_TOKEN: ${{ secrets.GX_CLOUD_ACCESS_TOKEN }}
        run: |
          echo ---Starting compose setup---
-         docker compose up --build --detach --wait --wait-timeout 120
+         GX_CLOUD_ORGANIZATION_ID=${GX_CLOUD_ORGANIZATION_ID} GX_CLOUD_ACCESS_TOKEN=${GX_CLOUD_ACCESS_TOKEN} docker compose up --build --detach --wait --wait-timeout 120
          echo ---Compose is running---

      - name: Run tutorial integration tests
@@ -36,7 +39,10 @@ jobs:
      - name: Docker compose down
        if: success() || failure()
+       env:
+         GX_CLOUD_ORGANIZATION_ID: ${{ secrets.GX_CLOUD_ORGANIZATION_ID }}
+         GX_CLOUD_ACCESS_TOKEN: ${{ secrets.GX_CLOUD_ACCESS_TOKEN }}
        run: |
          echo ---Spinning down compose---
-         docker compose down --volumes
+         GX_CLOUD_ORGANIZATION_ID=${GX_CLOUD_ORGANIZATION_ID} GX_CLOUD_ACCESS_TOKEN=${GX_CLOUD_ACCESS_TOKEN} docker compose down --volumes
          echo ---Compose is down---
68 changes: 53 additions & 15 deletions README.md
@@ -1,23 +1,29 @@
# tutorial-gx-in-the-data-pipeline
-This repo hosts hands-on tutorials that guide you through working examples of GX data validation in an Airflow pipeline.
+This repo hosts hands-on tutorials that guide you through working examples of GX data validation in a data pipeline.

-If you are new to GX, these tutorials will introduce you to GX concepts and guide you through creating GX data validation workflows that can be triggered and run using Airflow.
+While Airflow is used as the data pipeline orchestrator for the tutorials, these examples are meant to show how GX can be integrated into any orchestrator that supports Python code.

-If you are an experienced GX user, these tutorials will provide code examples of GX and Airflow integration that can be used as a source of best practices and techniques that can enhance your current data validation pipeline implementations.
+If you are new to GX, these tutorials will introduce you to GX concepts and guide you through creating GX data validation workflows that can be triggered and run using a Python-enabled orchestrator.

+If you are an experienced GX user, these tutorials will provide code examples of GX and orchestrator integration that can be used as a source of best practices and techniques that can enhance your current data validation pipeline implementations.

## README table of contents
1. [Prerequisites](#prerequisites)
1. [Quickstart](#quickstart)
1. [Cookbooks](#cookbooks)
1. [Tutorial environment](#tutorial-environment)
1. [Tutorial data](#tutorial-data)
+1. [Troubleshooting](#troubleshooting)
1. [Additional resources](#additional-resources)

## Prerequisites
* Docker: You use Docker compose to run the containerized tutorial environment. [Docker Desktop](https://www.docker.com/products/docker-desktop/) is recommended.

* Git: You use Git to clone this repository and access the contents locally. Download Git [here](https://git-scm.com/downloads).

+* GX Cloud [organization id and access token](https://docs.greatexpectations.io/docs/cloud/connect/connect_python#get-your-user-access-token-and-organization-id): Cookbook 3 uses GX Cloud to store and visualize data validation results. Sign up for a free GX Cloud account [here](https://hubs.ly/Q02TyCZS0).


## Quickstart
1. Clone this repo locally.
```
@@ -29,15 +35,24 @@ If you are an experienced GX user, these tutorials will provide code examples of
cd tutorial-gx-in-the-data-pipeline
```
-3. Start the tutorial environment using Docker compose.
-   ```
-   docker compose up --build --detach --wait
-   ```
+3. Start the tutorial environment using Docker compose. **If you are running Cookbook 3, supply your GX Cloud credentials.**
+   * To run the environment for Cookbooks 1 or 2:
+     ```
+     docker compose up --build --detach --wait
+     ```
+   * To run the environment for Cookbooks 1, 2, or 3, replace `<my-gx-cloud-org-id>` and `<my-gx-cloud-access-token>` with your GX Cloud organization id and access token values, respectively:
+     ```
+     export GX_CLOUD_ORGANIZATION_ID="<my-gx-cloud-org-id>"
+     export GX_CLOUD_ACCESS_TOKEN="<my-gx-cloud-access-token>"
+     docker compose up --build --detach --wait
+     ```
> [!IMPORTANT]
> The first time that you start the Docker compose instance, the underlying Docker images need to be built. This process can take several minutes.
>
-> **When environment is ready, you will see the following output in the terminal:**
+> **When the environment is ready, you will see the following output (or similar) in the terminal:**
>
>```
>✔ Network tutorial-gx-in-the-data-pipeline_gxnet Created
@@ -70,12 +85,11 @@ Cookbooks will be progressively added to this repo; the table below lists the cu
>
> If the tutorial environment is not running when you try to access the cookbook, you will receive a connection error.
-| No. | Cookbook topic | Cookbook status | Path to running tutorial cookbook | Path to static render of cookbook |
-| :--: | :-- | :-- | :-- | :-- |
-| 1 | Data validation during ingestion of data into database (happy path) | Available | [Click to open and run Cookbook 1](http://localhost:8888/lab/tree/Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) | [View Cookbook 1 on GitHub](cookbooks/Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) |
-| 2 | Data validation during ingestion of data into database (pipeline fail + then take action) | Available | [Click to open and run Cookbook 2](http://localhost:8888/lab/tree/Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb) | [View Cookbook 2 on GitHub](cookbooks/Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb) |
-| 3 | Data validation of Postgres database tables \* | Coming soon | | |
-| 4 | Data validation and automated handling in a medallion data pipeline \* | Coming soon | | |
+| No. | Cookbook topic | Path to running tutorial cookbook | Path to static render of cookbook |
+| :--: | :-- | :-- | :-- |
+| 1 | Data validation during ingestion of data into database (happy path) | [Click to open and run Cookbook 1](http://localhost:8888/lab/tree/Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) | [View Cookbook 1 on GitHub](cookbooks/Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) |
+| 2 | Data validation during ingestion of data into database (pipeline fail + then take action) | [Click to open and run Cookbook 2](http://localhost:8888/lab/tree/Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb) | [View Cookbook 2 on GitHub](cookbooks/Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb) |
+| 3 | Data validation with GX Core and GX Cloud \* | [Click to open and run Cookbook 3](http://localhost:8888/lab/tree/Cookbook_3_Validate_data_with_GX_Core_and_Cloud.ipynb) | [View Cookbook 3 on GitHub](cookbooks/Cookbook_3_Validate_data_with_GX_Core_and_Cloud.ipynb) |
<sup>\* Cookbook execution requires GX Cloud organization credentials. Sign up for a free GX Cloud account [here](https://hubs.ly/Q02TyCZS0).</sup>
@@ -88,7 +102,7 @@ Tutorials are hosted and executed within a containerized environment that is run
* **Postgres**. The containerized Postgres database hosts the sample data used by the tutorial cookbooks and pipelines.
-Cookbooks that feature GX Cloud-based data validation workflows connect to your GX Cloud organization.
+Cookbook 3 features a GX Cloud-based data validation workflow that connects to your GX Cloud organization.
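For orientation, connecting from Python amounts to creating a Cloud Data Context. A minimal sketch, assuming GX Core 1.x, which reads the `GX_CLOUD_ORGANIZATION_ID` and `GX_CLOUD_ACCESS_TOKEN` environment variables:

```python
import great_expectations as gx

# Assumes GX_CLOUD_ORGANIZATION_ID and GX_CLOUD_ACCESS_TOKEN are exported
# (e.g., before `docker compose up`); the Cloud Data Context uses them to
# authenticate to your GX Cloud organization.
context = gx.get_context(mode="cloud")
```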
## Tutorial data
@@ -97,6 +111,30 @@ dataset](https://www.kaggle.com/datasets/bhavikjikadara/global-electronics-retai
This dataset is used under the Creative Commons Attribution 4.0 International License. Appropriate credit is given to Bhavik Jikadara. The dataset has been modified to suit the requirements of this project. For more information about this license, please visit the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
+## Troubleshooting
+
+This section provides guidance on how to resolve potential errors and unexpected behavior when running the tutorial.
+
+### Docker compose errors
+
+If you receive unexpected errors when running `docker compose up`, or your containers do not become healthy, try recreating the tutorial Docker containers using the `--force-recreate` argument.
+```
+docker compose up --build --force-recreate --detach --wait
+```
+
+### GX Cloud environment variables warning
+
+The tutorial `docker-compose.yaml` captures the `GX_CLOUD_ORGANIZATION_ID` and `GX_CLOUD_ACCESS_TOKEN` environment variables to support Cookbook 3. If these variables are not provided when running `docker compose up`, you will see the following warnings:
+```
+WARN[0000] The "GX_CLOUD_ORGANIZATION_ID" variable is not set. Defaulting to a blank string.
+WARN[0000] The "GX_CLOUD_ACCESS_TOKEN" variable is not set. Defaulting to a blank string.
+WARN[0000] The "GX_CLOUD_ORGANIZATION_ID" variable is not set. Defaulting to a blank string.
+WARN[0000] The "GX_CLOUD_ACCESS_TOKEN" variable is not set. Defaulting to a blank string.
+```
+
+You can safely ignore these warnings if:
+* You are not trying to run Cookbook 3.
+* You are running `docker compose down --volumes` to stop the running Docker compose.
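A short, illustrative Python guard (hypothetical, not part of the tutorial code) makes the same check programmatic. Because compose substitutes blank strings for unset variables, test for non-empty values before running Cookbook 3's GX Cloud steps:

```python
import os

# Compose defaults unset variables to blank strings, so check for
# non-empty values rather than mere presence.
missing = [
    name
    for name in ("GX_CLOUD_ORGANIZATION_ID", "GX_CLOUD_ACCESS_TOKEN")
    if not os.environ.get(name)
]
if missing:
    raise RuntimeError(f"GX Cloud credentials not set: {', '.join(missing)}")
```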
## Additional resources
* To report a bug for any of the tutorials or code within this repo, [open an issue](https://github.com/greatexpectationslabs/tutorial-gx-in-the-data-pipeline/issues/new).
@@ -232,7 +232,7 @@
"metadata": {},
"outputs": [],
"source": [
"context = gx.get_context()"
"context = gx.get_context(mode=\"ephemeral\")"
]
},
{
@@ -300,7 +300,7 @@
"outputs": [],
"source": [
"expectation = gxe.ExpectTableColumnsToMatchOrderedList(\n",
" column_list=[\"customer_id\", \"name\", \"dob\", \"city\", \"state\", \"zip\", \"country\"]\n",
" column_list=[\"customer_id\", \"name\", \"city\", \"state\", \"zip\", \"country\"]\n",
")"
]
},
@@ -402,14 +402,13 @@
"source": [
"expectations = [\n",
" gxe.ExpectTableColumnsToMatchOrderedList(\n",
" column_list=[\"customer_id\", \"name\", \"dob\", \"city\", \"state\", \"zip\", \"country\"]\n",
" column_list=[\"customer_id\", \"name\", \"city\", \"state\", \"zip\", \"country\"]\n",
" ),\n",
" gxe.ExpectColumnValuesToBeOfType(column=\"customer_id\", type_=\"int\"),\n",
" *[\n",
" gxe.ExpectColumnValuesToBeOfType(column=x, type_=\"str\")\n",
" for x in [\"name\", \"city\", \"state\", \"zip\"]\n",
" ],\n",
" gxe.ExpectColumnValuesToMatchRegex(column=\"dob\", regex=r\"^\\d{4}-\\d{2}-\\d{2}$\"),\n",
" gxe.ExpectColumnValuesToBeInSet(\n",
" column=\"country\", value_set=[\"AU\", \"CA\", \"DE\", \"FR\", \"GB\", \"IT\", \"NL\", \"US\"]\n",
" ),\n",
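The `mode=\"ephemeral\"` change above is worth a note: in GX Core 1.x, an unqualified `gx.get_context()` can resolve to a Cloud Data Context when GX Cloud credentials are present in the environment. A minimal sketch of the distinction, assuming that resolution behavior:

```python
import great_expectations as gx

# With GX_CLOUD_ORGANIZATION_ID and GX_CLOUD_ACCESS_TOKEN exported for
# Cookbook 3, a bare gx.get_context() could silently return a Cloud Data
# Context; pinning the mode keeps this cookbook local and stateless.
context = gx.get_context(mode="ephemeral")
print(type(context).__name__)  # EphemeralDataContext
```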
@@ -303,7 +303,7 @@
"outputs": [],
"source": [
"# Create the Data Context.\n",
"context = gx.get_context()\n",
"context = gx.get_context(mode=\"ephemeral\")\n",
"\n",
"# Create the Data Source, Data Asset, and Batch Definition.\n",
"try:\n",
@@ -1002,7 +1002,7 @@
"source": [
"This cookbook has walked you through the process of validating data using GX, integrating the data validation workflow in an Airflow pipeline, and then programmatically handling invalid data in the pipeline when validation fails.\n",
"\n",
"[Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) and Cookbook 2 (this notebook) have focused on usage of [GX Core](https://docs.greatexpectations.io/docs/core/introduction/) to implement data validation in a data pipeline. Subsequent cookbooks will explore integrating [GX Cloud](https://docs.greatexpectations.io/docs/cloud/overview/gx_cloud_overview), GX Core, and an Airflow data pipeline to achieve end-to-end data validation workflows that make validation results available and shareable in the GX Cloud web UI."
"[Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) and Cookbook 2 (this notebook) have focused on usage of [GX Core](https://docs.greatexpectations.io/docs/core/introduction/) to implement data validation in a data pipeline. [Cookbook 3](Cookbook_3_Validate_data_with_GX_Core_and_Cloud.ipynb) explores integrating [GX Cloud](https://docs.greatexpectations.io/docs/cloud/overview/gx_cloud_overview), GX Core, and an Airflow data pipeline to achieve end-to-end data validation workflows that make validation results available and shareable in the GX Cloud web UI."
]
},
{
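Taken together, the changes sketch the pattern Cookbook 3 demonstrates: run validations with GX Core inside the pipeline while persisting configuration and results to GX Cloud. A rough sketch assuming GX Core 1.x APIs (the suite and Expectation below are illustrative, not the cookbook's actual definitions):

```python
import great_expectations as gx

# A Cloud Data Context persists what you create (suites, validation
# results) to your GX Cloud organization, viewable in the web UI.
context = gx.get_context(mode="cloud")

# Illustrative suite; Cookbook 3 defines its own Expectations.
suite = context.suites.add(gx.ExpectationSuite(name="pipeline_checks"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)
```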
