This workshop content is current as of 18 December 2024.
Welcome to our workshop! In this workshop, you'll learn how to connect your GX Cloud account to a sample Data Source, create Expectations, and run Validations.
- A GX Cloud account with Admin or Editor permissions.
You'll complete the following tasks in this workshop:
- Sign in to GX Cloud
- Use demo data as a Data Source and Data Asset
- Create Expectations
- Validate Expectations
- Update the failing Expectation and run the Validation again
- Profile Data
- Custom SQL Expectations
- Review Validation Schedule
- Review Alerts
If you're new to GX, an understanding of the following GX terminology will be helpful as you complete this workshop.
Sign in to GX Cloud.
You'll need to setup a data source from GX Cloud. We provide demo data in a PostgreSQL database for this workshop.
- In GX Cloud under, "Not ready to connect to your data?", click Use demo data.
Select the
nyc_taxi_data
Data Asset
- On the Select tables to import page, check the box next to
nyc_taxi_data
.- Click Add 1 Asset.
Congratulations! You have successfully added the nyc_taxi_data
demo Data Asset to your GX Cloud organization.
Expectations are a unique GX construct that enable you to make simple, declarative assertions about your data. You can think of Expectations as unit tests for your data. They make implicit assumptions about your data explicit, and they use self-explanatory language for describing data. Expectations can help you better understand your data and help you improve data quality.
In GX Cloud, you create Expectations for the Data Asset.
The nyc_taxi_data
Data Asset table contains New York City (NYC) taxi data from January 2022. The NYC Taxi data is a popular set of open source data that contains records of completed taxi cab trips in NYC, including information such as pick up and drop off times, the number of passengers, the fare collected, and so on.
You'll create Expectations to validate the taxi data. New Expectations are selected from the Expectation picker. The picker can be filtered using the search box or the drop down.
Your first Expectation will expect that there is an associated unique id
for each taxi trip. You expect that you should not see any null id
values.
Create your first Expectation
In the Data Assets list, click the
nyc_taxi_data
Data Asset.Click the New Expectation button.
Type
null
into the search box to filter Expectation namesClick the Expect column values to not be null Expectation.
Create an Expectation that verifies that there is a unique ID for each record:
- In the Column field, enter
id
.- Click Save.
Once created, your first Expectation is displayed on the Data Asset Expectations page.
Create a second Expectation that checks the passenger count values to determine if the maximum allowable capacity of four passengers is exceeded on any given trip. Typically, trips don't accommodate more than four passengers, since there are normally only four passenger seats in a taxi cab.
Create your second Expectation
Create an Expectation that asserts that there are no more than four passengers for any trip:
- Click back on the Overview tab and click on New Expectation.
- Click the Expect column maximum to be between Expectation.
- In the Column field, enter
passenger_count
.- In the first fixed value field, enter
4
.- In the second fixed value field, enter
4
.- Leave the other fields blank.
- Click Save.
Your new id
and passenger_count
Expectations appear in the Expectation list.
You have successfully created two Expectations. Now, make sure that they pass as expected when you validate your Data Asset.
Validate your Data Asset
On the Data Asset Expectations page, click Validate.
After you click Validate, GX Cloud runs queries based on the defined Expectations against the data in Postgres. GX Cloud uses the query results to determine if the data fails or meets your Expectations, and reports the results back to the UI.
After validation is completed, a notification appears indicating that the Validation results are ready. To view the results click on the Data Asset Validations tab.
You can see that the passenger_count
Expectation has failed. Click on the red (!) icon in the "Batches and run history" list, and you'll see that GX Cloud has an observed value of 7, not 4. This is because some of the larger New York City taxis in NYC are SUVs or minivans that can carry up to seven passengers.
Now that you know the assumption about taxi passenger capacity was incorrect, you need to update the Expectation so the Validation of the passenger_count
Expectation passes.
Update your Expectation
- Click the Expectations tab.
- Click Edit (the pencil icon) for the
passenger_count
Expectation.- In both fixed value fields, change
4
to7
.- Click Save.
After the Expectation is updated, click the Validate button again. When the notification indicating the Validation was successful appears, click the Validations tab. The passenger_count
Expectation was successful and you will see the history of all runs by default.
You might wonder if there is an easier way to create your Expectations instead of making assumptions or manually inspecting the data. Thankfully, GX Cloud lets you profile data assets directly, so you don't have to wonder!
When you profile a Data Asset, GX Cloud reads the Data Asset and returns a collection of descriptive metrics including column types, statistical summaries, and null percentages.
Profile Data for a Data Asset
- Click the Data Asset Metrics tab. The first time you visit this tab, it will fetch basic information about your Data Asset and display it in the Data Asset Information pane.
- Click the Profile Data button.
When the process completes, an updated view of your Data Asset appears. You can see the Data Asset row count as well as some key information about each of the columns. Take some time now to review the data included in metrics.
Once you have profiled the data for a Data Asset, you can use the introspected results when creating new Expectations. Let's create a new Expectation for this Data Asset. Note the subtle, but key, changes on the Expectation creation page.
- When selecting new Expectations types, the Column input provides a dropdown menu of existing Data Asset columns, rather than a freeform text field.
- Depending on the Expectation type and column selected, default values are populated automatically.
Examine creating a new Expectation using profiled data
- Click New Expectation.
- Click the Expect column maximum to be between Expectation.
- In the Column menu, select
passenger_count
.- The value
7
is automatically added to the fixed value fields.- Click the X at the top next to New Expectation or click Back to cancel.
We've already created this expectation, so go ahead and cancel creating a new expectation.
You can create custom SQL Expectations in GX Cloud. These will fail validation if the SQL query returns one or more rows. You can perform any query against the data that you wish.
Let's create a new expectation using the custom SQL Expectation form. We're going to modify the example query that is filled in by default.
Create Custom SQL Expectation
- On the Expectations tab, click New Expectation
- Click Create custom Expectation at the bottom of the panel
- Enter the description, "Queens to Newark Airport"
- Modify the
WHERE
clause in the query to look like this:
SELECT
*
FROM
{batch}
WHERE
pickup_borough = 'Queens'
AND dropoff_borough = 'EWR'
AND fare_amount < 100
Click Save.
In this query, we're selecting all rides from Queens
to the Newark Airport (code EWR
), where the fares are less than 100
($100 US). We know that Queens to EWR is generally a long drive that takes awhile, so we expect it to be expensive.
Now that the SQL Expectation is created click Validate. Go to the Validations tab, and see that the Expectation failed. Click on the latest run that failed, and note that the observed value returned 6 rows, indicating the failure.
If we were to query the data directly, we will see our assumption about the expensive rides was a bit too high:
> SELECT fare_amount
FROM nyc_taxi_data
WHERE
pickup_borough = 'Queens'
AND dropoff_borough = 'EWR'
AND fare_amount < 100;
fare_amount
-------------
78
76
81
83
98
92.5
(6 rows)
Go back to the Expectations tab and edit the Expectation. In the SQL code box, change fare_amount < 100
to fare_amount < 76
. Rerun the Validation again and view the results.
Congratulations! You've created a custom SQL Expectation.
GX Cloud will create a validation schedule when expectations are created. The schedule can be paused by clicking the "On" radio button to "Off". The schedule can also be modified by clicking the pencil "Edit Schedule" button.
The default schedule is to run every 24 hours, starting at the top of the next hour. Edit the validation schedule now.
Edit the Validation schedule
- Click Edit Schedule (the pencil icon).
- Click Frequency drop down.
- Select Every 6 hours.
- Click Save.
GX Cloud will automatically send alerts to users' email address. To disable or re-enable this, open the Alerts panel. This is controlled on a per-user basis, so other users within your organization will need to enable this if they would like to be alerted to failing validations.
Review Alerts
- Click the Alerts button.
- Click the radio button to turn off Email alerts, click it again to turn on Email alerts.
Congratulations! You've successfully completed th Get Started with GX Cloud Workshop. You have connected the demo Data Source and Data Asset, created Expectations, run some Validations, and fetched Metrics for the data. We hope you have a better understanding of how GX Cloud works and how it can work within your data pipeline.
- Connect to your own Data Source
- Create your own Expectations in GX Cloud
- Use the GX Python API to create Data Sources, Data Assets, and Expectations
- Connect to GX Cloud from an orchestrator (for example, Airflow)
- Invite others to work in your GX Cloud organization
- Explore our documentation