Skip to content

Latest commit

 

History

History
244 lines (171 loc) · 13.5 KB

workshop.md

File metadata and controls

244 lines (171 loc) · 13.5 KB

Get Started with GX Cloud

This workshop content is current as of 18 December 2024.

Welcome to our workshop! In this workshop, you'll learn how to connect your GX Cloud account to a sample Data Source, create Expectations, and run Validations.

Prerequisites

  • A GX Cloud account with Admin or Editor permissions.

Agenda

You'll complete the following tasks in this workshop:

  1. Sign in to GX Cloud
  2. Use demo data as a Data Source and Data Asset
  3. Create Expectations
  4. Validate Expectations
  5. Update the failing Expectation and run the Validation again
  6. Profile Data
  7. Custom SQL Expectations
  8. Review Validation Schedule
  9. Review Alerts

GX terminology

If you're new to GX, an understanding of the following GX terminology will be helpful as you complete this workshop.

Introductory GX terminology

Sign in to GX Cloud

Sign in to GX Cloud.

Use Demo Data

You'll need to setup a data source from GX Cloud. We provide demo data in a PostgreSQL database for this workshop.

  1. In GX Cloud under, "Not ready to connect to your data?", click Use demo data.

Use Demo Data

Select the nyc_taxi_data Data Asset

  1. On the Select tables to import page, check the box next to nyc_taxi_data.
  2. Click Add 1 Asset.

Add a Demo Data Asset

Congratulations! You have successfully added the nyc_taxi_data demo Data Asset to your GX Cloud organization.

Create Expectations

Expectations are a unique GX construct that enable you to make simple, declarative assertions about your data. You can think of Expectations as unit tests for your data. They make implicit assumptions about your data explicit, and they use self-explanatory language for describing data. Expectations can help you better understand your data and help you improve data quality.

In GX Cloud, you create Expectations for the Data Asset.

The nyc_taxi_data Data Asset table contains New York City (NYC) taxi data from January 2022. The NYC Taxi data is a popular set of open source data that contains records of completed taxi cab trips in NYC, including information such as pick up and drop off times, the number of passengers, the fare collected, and so on.

You'll create Expectations to validate the taxi data. New Expectations are selected from the Expectation picker. The picker can be filtered using the search box or the drop down.

Create a missingness Expectation

Your first Expectation will expect that there is an associated unique id for each taxi trip. You expect that you should not see any null id values.

Create your first Expectation

  1. In the Data Assets list, click the nyc_taxi_data Data Asset.

  2. Click the New Expectation button.

  3. Type null into the search box to filter Expectation names

  4. Click the Expect column values to not be null Expectation.

  5. Create an Expectation that verifies that there is a unique ID for each record:

    1. In the Column field, enter id.
    2. Click Save.

Use the search filter

Create a missingness Expectation

Once created, your first Expectation is displayed on the Data Asset Expectations page.

Create a second Expectation that checks the passenger count values to determine if the maximum allowable capacity of four passengers is exceeded on any given trip. Typically, trips don't accommodate more than four passengers, since there are normally only four passenger seats in a taxi cab.

Create your second Expectation

Create an Expectation that asserts that there are no more than four passengers for any trip:

  1. Click back on the Overview tab and click on New Expectation.
  2. Click the Expect column maximum to be between Expectation.
  3. In the Column field, enter passenger_count.
  4. In the first fixed value field, enter 4.
  5. In the second fixed value field, enter 4.
  6. Leave the other fields blank.
  7. Click Save.

Create a column max Expectation

Your new id and passenger_count Expectations appear in the Expectation list.

Validate a Data Asset

You have successfully created two Expectations. Now, make sure that they pass as expected when you validate your Data Asset.

Validate your Data Asset

On the Data Asset Expectations page, click Validate.

Validate a Data Asset

After you click Validate, GX Cloud runs queries based on the defined Expectations against the data in Postgres. GX Cloud uses the query results to determine if the data fails or meets your Expectations, and reports the results back to the UI.

After validation is completed, a notification appears indicating that the Validation results are ready. To view the results click on the Data Asset Validations tab.

You can see that the passenger_count Expectation has failed. Click on the red (!) icon in the "Batches and run history" list, and you'll see that GX Cloud has an observed value of 7, not 4. This is because some of the larger New York City taxis in NYC are SUVs or minivans that can carry up to seven passengers.

Validation results with passing and failing Expectations

Update the failing Expectation and run the Validation again

Now that you know the assumption about taxi passenger capacity was incorrect, you need to update the Expectation so the Validation of the passenger_count Expectation passes.

Update your Expectation

  1. Click the Expectations tab.
  2. Click Edit (the pencil icon) for the passenger_count Expectation.
  3. In both fixed value fields, change 4 to 7.
  4. Click Save.

After the Expectation is updated, click the Validate button again. When the notification indicating the Validation was successful appears, click the Validations tab. The passenger_count Expectation was successful and you will see the history of all runs by default.

Validation results with all passing Expectations

Profile Data

You might wonder if there is an easier way to create your Expectations instead of making assumptions or manually inspecting the data. Thankfully, GX Cloud lets you profile data assets directly, so you don't have to wonder!

When you profile a Data Asset, GX Cloud reads the Data Asset and returns a collection of descriptive metrics including column types, statistical summaries, and null percentages.

Profile Data for a Data Asset

  1. Click the Data Asset Metrics tab. The first time you visit this tab, it will fetch basic information about your Data Asset and display it in the Data Asset Information pane.
  2. Click the Profile Data button.

Profile data button for Demo Data Asset

When the process completes, an updated view of your Data Asset appears. You can see the Data Asset row count as well as some key information about each of the columns. Take some time now to review the data included in metrics.

Data Asset Metrics

Once you have profiled the data for a Data Asset, you can use the introspected results when creating new Expectations. Let's create a new Expectation for this Data Asset. Note the subtle, but key, changes on the Expectation creation page.

  • When selecting new Expectations types, the Column input provides a dropdown menu of existing Data Asset columns, rather than a freeform text field.
  • Depending on the Expectation type and column selected, default values are populated automatically.

Examine creating a new Expectation using profiled data

  1. Click New Expectation.
  2. Click the Expect column maximum to be between Expectation.
  3. In the Column menu, select passenger_count.
  4. The value 7 is automatically added to the fixed value fields.
  5. Click the X at the top next to New Expectation or click Back to cancel.

We've already created this expectation, so go ahead and cancel creating a new expectation.

Create a new Expectation using Data Asset Metrics

Custom SQL Expectations

You can create custom SQL Expectations in GX Cloud. These will fail validation if the SQL query returns one or more rows. You can perform any query against the data that you wish.

Let's create a new expectation using the custom SQL Expectation form. We're going to modify the example query that is filled in by default.

Create Custom SQL Expectation

  1. On the Expectations tab, click New Expectation
  2. Click Create custom Expectation at the bottom of the panel
  3. Enter the description, "Queens to Newark Airport"
  4. Modify the WHERE clause in the query to look like this:
SELECT
  *
FROM
  {batch}
WHERE
  pickup_borough = 'Queens'
  AND dropoff_borough = 'EWR'
  AND fare_amount < 100

Click Save.

In this query, we're selecting all rides from Queens to the Newark Airport (code EWR), where the fares are less than 100 ($100 US). We know that Queens to EWR is generally a long drive that takes awhile, so we expect it to be expensive.


Now that the SQL Expectation is created click Validate. Go to the Validations tab, and see that the Expectation failed. Click on the latest run that failed, and note that the observed value returned 6 rows, indicating the failure.


If we were to query the data directly, we will see our assumption about the expensive rides was a bit too high:

> SELECT fare_amount
FROM nyc_taxi_data
WHERE
  pickup_borough = 'Queens'
  AND dropoff_borough = 'EWR'
  AND fare_amount < 100;
 fare_amount
-------------
          78
          76
          81
          83
          98
        92.5
(6 rows)

Go back to the Expectations tab and edit the Expectation. In the SQL code box, change fare_amount < 100 to fare_amount < 76. Rerun the Validation again and view the results.


Congratulations! You've created a custom SQL Expectation.

Review Validation Schedule

GX Cloud will create a validation schedule when expectations are created. The schedule can be paused by clicking the "On" radio button to "Off". The schedule can also be modified by clicking the pencil "Edit Schedule" button.

Validation schedule highlighted in the expectation view

The default schedule is to run every 24 hours, starting at the top of the next hour. Edit the validation schedule now.

Edit the Validation schedule

  1. Click Edit Schedule (the pencil icon).
  2. Click Frequency drop down.
  3. Select Every 6 hours.
  4. Click Save.

Edit the validation schedule

Review Alerts

GX Cloud will automatically send alerts to users' email address. To disable or re-enable this, open the Alerts panel. This is controlled on a per-user basis, so other users within your organization will need to enable this if they would like to be alerted to failing validations.

Review Alerts

  1. Click the Alerts button.
  2. Click the radio button to turn off Email alerts, click it again to turn on Email alerts.

Alerts Button on Expectation Suite

Alerts Panel to Enable/Disable Email alerts

Conclusion

Congratulations! You've successfully completed th Get Started with GX Cloud Workshop. You have connected the demo Data Source and Data Asset, created Expectations, run some Validations, and fetched Metrics for the data. We hope you have a better understanding of how GX Cloud works and how it can work within your data pipeline.

What's next?