jaffle_shop is a fictional ecommerce store. This dbt (data build tool) project transforms raw data from an app database into an orders model ready for analytics.
This project has an even more simplified model than the jaffle_shop project from which it was inspired, focusing only on the status changes that happen on the orders.
An order has one of the statuses described in the order status documentation:
- placed
- shipped
- completed
- return_pending
- returned
The staging table contains the following columns:
- load_id: monotonically increasing sequence for the order changes
- order_id: the order identifier
- updated_at: the date when the order was updated
- status: completion status of the order
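For illustration, the staged entries for a single order might look like this (sample data matching the dtspec test scenario shown later in this document):

load_id  order_id  updated_at  status
1        order1    2018-01-01  placed
3        order1    2018-01-03  shipped
11       order1    2018-01-04  completed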
This project offers a proof of concept on how to answer the question:
What is the number of orders that were being shipped during the month of January?
NOTE that a shipment can take multiple days until the package reaches the customer and the order is marked as completed.
A naive answer would be to count the distinct orders from the staging table in which the status `shipped` appears:
SELECT COUNT(DISTINCT order_id)
FROM jaffle_shop.stg_orders
WHERE updated_at BETWEEN '2018-01-01' AND '2018-01-31'
AND status = 'shipped';
In this case, the orders that started being shipped before January and were completed during or after this month would not be taken into account.
One possible solution to answer the aforementioned question would be to historize the status changes performed on the orders. This would make it easy to find the shipment date ranges that overlap with the month for which the number of orders in shipment needs to be calculated:
-- number of orders in shipment during the month of January 2018
SELECT COUNT(DISTINCT order_id)
FROM jaffle_shop.fct_orders
WHERE valid_from < '2018-02-01' AND (valid_to >= '2018-01-01' OR valid_to IS NULL)
AND status = 'shipped';
-- number of orders in shipment during the month of February 2018
SELECT COUNT(DISTINCT order_id)
FROM jaffle_shop.fct_orders
WHERE valid_from < '2018-03-01' AND (valid_to >= '2018-02-01' OR valid_to IS NULL)
AND status = 'shipped';
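Both queries are instances of the classic interval-overlap test: a validity range overlaps a month when it starts before the month ends and when it ends (or is still open) after the month starts. A generic sketch of the query, using hypothetical :month_start and :next_month_start placeholders:

-- number of orders in shipment during the month [:month_start, :next_month_start)
SELECT COUNT(DISTINCT order_id)
FROM jaffle_shop.fct_orders
WHERE status = 'shipped'
AND valid_from < :next_month_start -- started shipping before the month ended
AND (valid_to >= :month_start OR valid_to IS NULL); -- still shipping, or shipping ended after the month started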
NOTE that the [`valid_from`, `valid_to`] date ranges corresponding to an order are adjacent to each other, as illustrated below. Only the latest entry corresponding to an order is unbounded (`valid_to` is set to `NULL`).
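For illustration, the historized entries for a single order might look like this (sample data matching the dtspec test scenario shown later in this document):

order_id  load_id  status     valid_from  valid_to
order1    1        placed     2018-01-01  2018-01-03
order1    3        shipped    2018-01-03  2018-01-04
order1    11       completed  2018-01-04  NULL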
This project provides a proof of concept on how to historize order status changes with dbt models on a Snowflake database.
As described in the introduction to dbt:
dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.
dbt does the T in ELT (Extract, Load, Transform) processes – it doesn't extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.
The jaffle_shop project is a useful minimum viable dbt project to get new dbt users up and running with their first dbt project. It includes seed files with generated data so that a user can run this project on their own warehouse.
For more information on dbt:
- Read the introduction to dbt.
- Read the dbt viewpoint.
Use virtualenv to create a virtual Python environment:
pip3 install virtualenv
virtualenv venv
source venv/bin/activate
Once the virtualenv is set up, proceed to install the requirements for the project:
(venv) ➜ pip3 install -r requirements.txt
Place the following content in the `~/.dbt/profiles.yml` file for interacting via dbt with the Snowflake database:
NOTE: be sure to change the coordinates of the database according to your Snowflake account.
# For more information on how to configure this file, please see:
# https://docs.getdbt.com/docs/profile
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your-account.your-snowflake-region
      port: 443
      user: "your-username"
      password: "your-password"
      role: accountadmin
      threads: 4
      database: playground
      warehouse: your-warehouse-name
      schema: jaffle_shop

config:
  send_anonymous_usage_stats: False
If everything is set up correctly, dbt can be used to seed the database with test data and also to fill the models:
(venv) ➜ dbt seed --profile jaffle_shop
(venv) ➜ dbt run --profile jaffle_shop
By using the Snowflake Query Browser, the number of orders in shipment during the month of January 2018 can now be retrieved by executing the following SQL query:
-- number of orders in shipment during the month of January 2018
SELECT COUNT(DISTINCT order_id)
FROM jaffle_shop.fct_orders
WHERE valid_from < '2018-02-01' AND (valid_to >= '2018-01-01' OR valid_to IS NULL)
AND status = 'shipped';
Deactivate the Python virtual environment:
(venv) ➜ deactivate
After running the demo, if the reader is interested in seeing what `dbt` is doing in the background, the content of the logs (`logs/dbt.log`) can be investigated to see how the historization is actually performed.
The SQL queries needed to achieve the historization functionality can be consulted in the historization-queries.sql file.
In a nutshell, the new staging entries containing order status changes get deduplicated and appended to the existing history log of order status changes.
dbt offers the possibility of using macros. Macros are pieces of code that can be reused multiple times - much like functions in other programming languages.
By using a specialized macro for the historization functionality, the complexity of building the historization queries (see historization-queries.sql) is fully abstracted away, and the query used for historizing the status changes becomes much more readable:
{{
config(
materialized = 'historized',
primary_key_column_name = 'order_id',
valid_from_column_name = 'updated_at',
load_id_column_name = 'load_id'
)
}}
select
load_id,
order_id,
updated_at,
status
from {{ ref('stg_orders') }}
The implementation details for the `historized` dbt macro can be found in the historized.sql file.
As can be seen in the code snippet above, just like a function, the `historized` macro takes a few parameters:
- `primary_key_column_name`: the name of the column used to identify the staged entity
- `valid_from_column_name`: the name of the column containing the timestamp when the staged entity was last updated
- `load_id_column_name`: the name of the column containing monotonically increasing sequence values used for distinguishing the precedence between order status changes that happen within the same timestamp
The workflow performed within the macro (a SQL sketch of the key steps follows this list):
- create a temporary table (suffixed by `__tmp`) which contains the staged entries along with an `md5_hash` column containing the hash of the concatenated columns used for historization (e.g. `status` in this example, but nothing speaks against historizing more columns of an entity). The hash column can be used to easily distinguish whether two adjacent entries (ordered by load id) are duplicates (i.e. have the same values for the versioned columns). For more information on hashing, read the "The most underutilized function in SQL" dbt blog post.
- in case there are new entries staged (from the `__tmp` suffixed source table) that correspond to an unbounded historized entity (`valid_to` column is `NULL`) in the target table, set the upper bound column `valid_to` to the minimum `valid_from` value of the staged entries corresponding to the target entity
- deduplicate the staged entries (based on the `md5_hash` column)
- obtain the validity range (`valid_from`, `valid_to`) for each of the staged records
- join the staged records with the records from the target table and filter out eventual duplicates (based on `md5_hash`)
- insert the staged records into the target table
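To make the deduplication and validity-range steps more concrete, below is a minimal SQL sketch of their logic (assuming Snowflake syntax and the column names used in this project; the statements actually generated by the macro can be consulted in historization-queries.sql):

-- hash the versioned column(s); here only `status` is historized
WITH hashed AS (
    SELECT
        load_id,
        order_id,
        updated_at,
        status,
        MD5(COALESCE(status, '')) AS md5_hash
    FROM jaffle_shop.stg_orders
),
-- drop entries whose hash equals the hash of the previous entry
-- for the same order (ordered by load_id)
deduplicated AS (
    SELECT load_id, order_id, updated_at, status, md5_hash
    FROM (
        SELECT
            hashed.*,
            LAG(md5_hash) OVER (PARTITION BY order_id ORDER BY load_id) AS prev_md5_hash
        FROM hashed
    )
    WHERE prev_md5_hash IS NULL OR md5_hash <> prev_md5_hash
)
-- derive the [valid_from, valid_to) ranges; the latest entry
-- per order stays unbounded (valid_to is NULL)
SELECT
    load_id,
    order_id,
    status,
    updated_at AS valid_from,
    LEAD(updated_at) OVER (PARTITION BY order_id ORDER BY load_id) AS valid_to
FROM deduplicated;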
For more information on dbt macros:
- Read the introduction to Jinja macros.
- Read about the dbt materializations and their corresponding implementation on Github.
It takes quite a while to develop a `dbt` macro, and afterwards it needs to be tested to verify that it works.
It may even happen that, while the macro is being used productively, a bugfix needs to be made or its functionality needs to be extended.
By using automated tests for the data transformations, it can be ensured that the macro works as expected, backed by an extensive battery of tests on a test environment.
dtspec is an open-source framework written in Python which can be used for specifying and testing data transformations.
Within `dtspec`, the following are specified in a yaml format:
- the source data in the table(s) to be used by the data transformation (`dbt`)
- the expected data from the target table(s)

The `dtspec` framework offers the means to read the yaml specification and, once the data transformation has been performed (via `dbt`), to match the data from the actual tables against the data from the specification scenario.
This project has introduced minor changes to the test code present in the jaffle_shop-dtspec project in order to allow it to run against a Snowflake database.
In order to run the tests, simply execute:
(venv) ➜ python tests/test.py
Below is a snippet of the output from running the tests:
Executing test specification tests/demo-spec.yml
Truncating data from the tables ['raw_orders', 'fct_orders']
Inserting input data into the source table raw_orders
/home/findinpath/dbt_jaffle_shop_historized/tests/..
Running with dbt=0.18.1
Found 2 models, 11 tests, 0 snapshots, 0 analyses, 148 macros, 0 operations, 1 seed file, 0 sources
21:51:29 | Concurrency: 4 threads (target='dev')
21:51:29 |
21:51:29 | 1 of 2 START table model jaffle_shop.stg_orders...................... [RUN]
21:51:31 | 1 of 2 OK created table model jaffle_shop.stg_orders................. [SUCCESS 1 in 2.02s]
21:51:31 | 2 of 2 START historized model jaffle_shop.fct_orders................. [RUN]
21:51:35 | 2 of 2 OK created historized model jaffle_shop.fct_orders............ [SUCCESS 1 in 4.39s]
21:51:36 |
21:51:36 | Finished running 1 table model, 1 historized model in 10.58s.
Completed successfully
Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
Loading data from the target table fct_orders
Loading actuals for target fct_orders
Asserting Building fct_orders out of the raw_orders when the fct_orders table is empty: Basic full refresh loading of the historized fact table `fct_orders`
Tested scenarios: - order1 : the expected historized entries should follow the representation: placed > shipped > completed - order2 : the expected historized entries should follow the representation: placed - order3 : contains duplicated (with the same `updated_at` and `status` values) entries. The duplicates should be filtered out from the processing. The expected historized entries should follow the representation: placed > shipped > return_pending
PASSED
NOTE: Make sure to first follow the steps described in the Demo section of this document.
To give the reader a hint about what happens when dtspec finds a mismatch while verifying the content of the target table against the specification, below is also a snippet of the output of a failing test:
DataFrame.iloc[:, 2] (column name="status") values are different (25.0 %)
[index]: [0, 1, 2, 3]
[left]: [placed, completed, shipped, placed]
[right]: [returned, completed, shipped, placed]
Actual:
order_id load_id status valid_from valid_to
0 order1 1 placed 2018-01-01 2018-01-03
1 order1 11 completed 2018-01-04 {NULL}
2 order1 3 shipped 2018-01-03 2018-01-04
3 order2 10 placed 2018-01-02 {NULL}
Expected:
order_id load_id status valid_from valid_to
0 order1 1 returned 2018-01-01 2018-01-03
1 order1 11 completed 2018-01-04 {NULL}
2 order1 3 shipped 2018-01-03 2018-01-04
3 order2 10 placed 2018-01-02 {NULL}
As already mentioned at the beginning of this document, this project represents a proof of concept and not necessarily a recipe on how to set up the test automation in the build pipeline for `dbt` when working with a Snowflake database.
When running the tests for this project, it can be noticed that they take quite some time to execute. This happens in part because Snowflake takes roughly a bit under a second to run any kind of query, even if the query is performed on an empty database.
Nevertheless, having automated tests for the `dbt` models, even though they are slow, is much better than doing the tests by hand or not doing them at all and leaving production as a playground for the developed models.
For more information on dtspec:
- Visit the dtspec Github project page
- Visit the jaffle_shop-dtspec Github project to get an introduction on how to work with `dtspec` & `dbt` for the jaffle-shop dbt tutorial project.
This proof of concept project compiles together several topics:
- dbt macros
- data transformation specification
- historization of entities
into a functioning prototype for historizing order status changes.
Feel free to provide feedback or alternative implementations to any of the topics presented in this project.