In this exercise, you will use several of the capabilities associated with dedicated SQL Pools to analyze the data.
SQL data warehouses have long been the centers of gravity in data platforms. Modern data warehouses provide high-performance, distributed, and governed workloads, regardless of the data volumes at hand.
The dedicated SQL pool in Azure Synapse is the new incarnation of the former Azure SQL Data Warehouse. It provides all the state-of-the-art SQL data warehousing features while benefiting from advanced integration with all the other Synapse services.
The tasks you will perform in this exercise are:
- Exercise 4 - High-Performance Analysis with Azure Synapse Dedicated SQL Pools
  - Task 1 - Use a dedicated SQL pool query to understand a dataset
  - Task 2 - Investigate query performance and table design
  - Bonus Challenge
Note: The tasks in this exercise must be run against the dedicated SQL pool (as opposed to the ones from exercise 1, which were run against the serverless SQL pool named Built-in). Make sure you have SQLPool01 selected before running each query.
In this task, you will try to understand who your best customers are.
Challenge: Can you author and run a query that aggregates the total quantity of items purchased by each customer and then visualize the result with a chart similar to the following?
Solution:
- Open Synapse Analytics Studio, and then navigate to the Develop hub.
- Under SQL scripts, select the script called Exercise 4 - Analyze Transactions.
- Change the Connect to drop-down to the SQLPool01 database.
- Select Run to execute the script against the SQL Pool database.
- When the results appear, for the View toggle, select Chart.
- For the Chart type, select Column.
- For the Category column, leave the selection at (none).
- For the Legend (series) column, select CustomerKey.
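The query itself lives in the Exercise 4 - Analyze Transactions script and is not reproduced here. As a reference, a minimal sketch of this kind of per-customer aggregation might look like the following; the table and column names (wwi_perf.FactSale_Fast, CustomerKey, Quantity) are assumptions based on the tables referenced later in this exercise, and the lab script may differ.

```sql
-- Sketch only: table and column names are assumptions; the lab script may differ.
-- Aggregate the quantity of items purchased per customer.
SELECT TOP 100
    CustomerKey,
    SUM(Quantity) AS TotalQuantity
FROM wwi_perf.FactSale_Fast
GROUP BY CustomerKey
ORDER BY TotalQuantity DESC;
```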
In this task, you will try to understand the implications of the table design at a general level. You will run the same set of queries against two fact tables (FactSale_Fast and FactSale_Slow). The two fact tables have (with one notable exception) the same structure and contain identical data.
First, let us set the stage by performing the following steps:
- Under SQL Scripts in the Develop hub within Synapse Analytics Studio, select the script called Exercise 4 - Investigate query performance.
- Change the Connect to drop-down to the SQLPool01 database.
- Select line 1 and then select Run. Notice the quick response time (usually under 3 seconds) and the result: 83.4 million records. If the SQL pool were scaled to DW500c, the query would typically finish in under 1 second.
- Select line 3 and then select Run. Again, notice the quick response time (usually under 3 seconds) and the same result of 83.4 million records.
- Select lines 5 to 20 and then select Run. Re-run the query 3 to 5 times until the execution time stabilizes (usually, the first "cold" execution takes longer than subsequent ones, which benefit from the initialization of various internal data and communication buffers). Make a note of the time needed to run the query (typically 15 to 30 seconds).
- Select lines 22 to 37 and then select Run. Re-run the query 3 to 5 times until the execution time stabilizes, as before. Make a note of the time needed to run the query (typically 3 to 5 seconds). A sketch of the kind of queries these steps run appears after this list.
Can you explain the significant difference in performance between the two seemingly identical tables? Furthermore, can you explain why the first set of queries (the simple counts) were not that far apart in execution times?
Solution:
- In Synapse Analytics Studio, navigate to the Data hub.
- Under Databases, expand the SQLPool01 node, expand Tables, and locate the wwi_perf.FactSale_Slow table.
- Right-click the table, and then select New SQL script, CREATE.
- In the CREATE script, note the DISTRIBUTION = ROUND_ROBIN option used to distribute the table.
- Repeat the same actions for the wwi_perf.FactSale_Fast table and note the DISTRIBUTION = HASH ( [CustomerKey] ) option used to distribute the table. Both distribution options are sketched just after this list.
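For reference, the two distribution options called out above look roughly like this in a CREATE TABLE statement. This is a trimmed sketch: the column lists in the generated scripts will be more extensive, and the CLUSTERED COLUMNSTORE INDEX option (the dedicated SQL pool default) is an assumption about these particular tables.

```sql
-- Trimmed sketch of the generated CREATE scripts; only the distribution
-- option differs between the two tables. Column lists are abbreviated.
CREATE TABLE wwi_perf.FactSale_Slow
(
    CustomerKey INT NOT NULL,
    Quantity    INT NOT NULL
    -- ...remaining fact columns...
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE wwi_perf.FactSale_Fast
(
    CustomerKey INT NOT NULL,
    Quantity    INT NOT NULL
    -- ...remaining fact columns...
)
WITH
(
    DISTRIBUTION = HASH ( [CustomerKey] ),
    CLUSTERED COLUMNSTORE INDEX
);
```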
This is the critical difference that has such a significant impact on the last two queries' performance. Because wwi_perf.FactSale_Slow is distributed in a round-robin fashion, each customer's data will end up living in multiple (if not all) distributions. When our query needs to consolidate each customer's data, a lot of data movement will occur between the distributions. This is what slows down the query significantly.

On the other hand, wwi_perf.FactSale_Fast is distributed using the hash of the customer identifier. This means that each customer's data will end up living in a single distribution. When the query needs to consolidate each customer's data, virtually no data movement occurs between distributions, which makes the query very fast.
By default, tables are Round Robin-distributed, enabling users to create new tables without deciding on the distribution. In some workloads, Round Robin tables have acceptable performance. However, in many cases, selecting a distribution column will perform much better.
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution. As a result, the system sometimes needs to invoke a data movement operation to better organize your data before resolving a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
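If you need to change an existing table's distribution, the usual approach in a dedicated SQL pool is to rebuild the table with CREATE TABLE AS SELECT (CTAS) and then swap the names. The sketch below follows the table and column names used in this exercise; adjust them to your schema, and note that the intermediate table name (FactSale_Slow_Hash) is just an illustrative placeholder.

```sql
-- Rebuild the round-robin table as a hash-distributed copy using CTAS.
CREATE TABLE wwi_perf.FactSale_Slow_Hash
WITH
(
    DISTRIBUTION = HASH ( [CustomerKey] ),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM wwi_perf.FactSale_Slow;

-- Swap the new table in place of the old one.
RENAME OBJECT wwi_perf.FactSale_Slow TO FactSale_Slow_Old;
RENAME OBJECT wwi_perf.FactSale_Slow_Hash TO FactSale_Slow;
```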
Finally, the first two queries (the counts) were not that far apart performance-wise because neither of them incurred any data movement: each distribution simply reported its local count, and the results were then aggregated.
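If you want to confirm whether a query incurred data movement, you can inspect its distributed execution plan through the dedicated SQL pool's request DMVs. A minimal sketch follows; the request ID shown (QID1234) is a placeholder that you would replace with the value returned by the first query.

```sql
-- Find the request_id of a recently submitted query.
SELECT TOP 10 request_id, submit_time, command
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;

-- Inspect its steps: a ShuffleMoveOperation or BroadcastMoveOperation step
-- indicates data movement between distributions.
SELECT step_index, operation_type, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'   -- placeholder: substitute the request_id from above
ORDER BY step_index;
```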
This simple example demonstrates one of the core challenges of modern, massively distributed data platforms: getting the design right. You witnessed first-hand how a single decision made at table design time can significantly influence query performance. You also got a glimpse of Azure Synapse Analytics' raw power: the more efficient table design enabled a non-trivial query involving more than 80 million records to execute in just a few seconds.