-
Examples of Time Series
- Stock market data
- Tide time tables
- Performance monitoring
- Health monitoring
- Population statistics
- Business performance
-
Data Ingest
-
Query Patterns
- Latest data
- Compare time periods
- Summarize time window
-
Sliding Window
- Window is a set of contiguous rows
- Sliding windows moves through table one row at a time
-
Tumbliing Windows
-
Denormalized Data
- Joining done once
- Useful when data queried multiple times
- Or, if query response time is more important than storage or preprocessing time
-
Trade-Offs
- Pros
- Faster query response
- Point-in-time history
- Cons
- Storage
- More complex ingest
- Pros
- What You Should Know
- SQL SELECT statements
- Tables and views
- Subqueries and joins
-
Time Series Data
-
Frequency
Measurement Time Unit CPU Utilization Microseconds Network I/O Seconds Units Produced Minutes Customers Served Hours Packages Delivered Days Auto Accidents Month Company Profit Quarterly Births and Deaths Annually - Fixed interval between measurements
- Varies by application
-
Unit of Measure
Measurement Uints of Measure CPU Utilization Percentage Network I/O MB Units Produced Count Customers Served Count Packages Delivered Count Auto Accidents Count Company Profit Monetary unit Births and Deaths Count - What do numbers represent?
- Varies by application
-
Metric Types
- Counter - monotonically increasing accumulator
- Gauge - numerical measure that can go up or down
- Summary - calculates values over time window, such as counts or rates
- Histograms - counts of items over buckets
-
This Course
- Time series in relational databases
- Query with SQL
- Data modeling choices
- Time series analysis
- Examples of Time Series
- Stock market data
- Tide time tables
- Performance monitoring
- Health monitoring
- Population statistics
- Business performance
-
Data Ingest
-
Sequentially Write
- Write data to disk in order it arrives
- Potential for large tables
- Which can be difficult to query and manage
-
Horizontaling Partitioning
- Split tables by rows into partitions
- Treat each partition like a table
-
Range Partitioning
- Type of horizontal partitioning
- Partition on non-overlapping keys
- Partition by date is common
-
When to Use Partition by Range
- Query latest data
- Comparative queries, for example, same time last year
- Drop or summarize data after a period of time
- All common in time series data
-
Query Patterns
- Latest data
- Compare time periods
- Summarize time window
-
Granularity
- Latest data is finest grained
- As data ages, less likely to query fine-grained data
- Summarize and aggregate data at larger time intervals
-
Information Drivers
- Latest data: What is current CPU utilization? Why is throughput down?
- Older data: What is average CPU utilization? Are we running efficiently?
- pgAdmin
- Query History
-
Time Window
- Window is a set of contiguous rows
- Size of window is number of rows
-
Example Time Series Table
CREATE TABLE time_series.utilization ( event_time timestamp without time zone NOT NULL, server_id integer NOT NULL, cpu_utilization real, free_memory real, session_cnt integer, CONSTRAINT utilization_pkey PRIMARY KEY (event_time, server_id) );
-
Example Query on Window
SELECT u.event_time, u.server_id, u.cpu_utilization, u.free_memory, u.session_cont FROM utilization u WHERE u.event_time BETWEEN to_timestamp('2019-05-01 00:01:00', 'YYYY-MM-DD hh24:mis:ss') AND to_timestamp('2019-05-01 00:30:00', 'YYYY-MM-DD hh24:mis:ss') ORDER BY event_time, server_id
-
Example Aggregate Query
SELECT avg(u.cpu_utilization), avg(u.session_cont) FROM utilization u WHERE u.event_time BETWEEN to_timestamp('2019-05-01 00:01:00', 'YYYY-MM-DD hh24:mis:ss') AND to_timestamp('2019-05-01 00:30:00', 'YYYY-MM-DD hh24:mis:ss')
-
Sliding Window
- Window is a set of contiguous rows
- Sliding windows moves through table one row at a time
-
Sliding Window Use Cases
- Monitoring change over time
- Continuous time series
- No logical boundaries
-
Window is a set of contiguous rows
-
Sliding windows moves through table one window at a time
-
Tumbling Window Use Cases
- Monitoring change over time
- Logical grouping of data by time
- Useful for aggregate summaries about time periods
-
Joining on Time
- Two time series may have different measurements over same time period
- Useful to combine multiple measurements
-
Potential Problems Joining on Time
- Times may have different precision
- More precise, less likely to match
- Same hour, minute, seconds
12:15:37.45
does not match12:15:37.46
clock skew - Clock skew
-
Truncate Time
- Drop most precise time elements
- Use second or minute, for example
- Truncating sub-seconds will lead to a match between
12:15:37.45
and12:15:37.46
- May have multiple rows with truncated time
-
Aggregate over Time Window
- GROUP BY time
- Choose degree of precision
- Use aggregate functions
- Produces one row per time period
-
Trade-Offs
- Pros
- Ability to join
- Account for some clock skew
- Summarize data
- Cons
- Loss of precision
- Loss of detail
- Pros
-
Denormalizing
- Combine columns from multiple tables in a single table
- Pre-joining
- Supported by materialized views
-
Denormalized Data
- Joining done once
- Useful when data required multiple times
- Or, if query response time is more important than storage or preprocessing time
-
Ingest Pipeline
-
Trade-Offs
- Pros
- Faster query response
- Point-in-time history
- Cons
- Storage
- More complex ingest
- Pros
-
Query
SELECT * FROM time_series.location_temp ORDER BY event_time, location_id LIMIT 100
=>
event_time temp_celcius location_id 2019-03-04 19:48:06 29 loc0 2019-03-04 19:48:06 31 loc1 2019-03-04 19:48:06 28 loc2 2019-03-04 19:48:06 24 loc3 ...
-
Query:
SELECT * FROM time_series.utilization LIMIT 100
=>
event_time server_id cpu_utilization free_memory session_cnt 2019-03-05 08:06:14 100 0.57 0.51 47 2019-03-05 08:11:14 100 0.47 0.62 43 ...
-
-
Query 1 result
dept_id server_id cpu_utilization lead 0 140 0.87 0.87 0 140 0.87 0.86 0 140 0.86 0.86 0 140 0.86 0.86 0 140 0.86 0.86 -
Query 2 result
dept_id server_id cpu_utilization lead 0 140 0.87 0.86 0 140 0.87 0.86 0 140 0.86 0.86 0 140 0.86 0.86 0 140 0.86 0.86
-
-
Query 1 result
dept_id server_id cpu_utilization lag 0 140 0.87 [null] 0 140 0.87 0.87 0 140 0.86 0.87 0 140 0.86 0.86 0 140 0.86 0.86
-
Query 1 result
dept_id server_id cpu_utilization rank 0 140 0.87 1 0 140 0.87 1 0 140 0.86 3 0 140 0.86 3 0 140 0.86 3
-
- Divided by total number of rows
-
Query 1 result
dept_id server_id cpu_utilization percent_rank 0 140 0.87 0 0 140 0.87 0 0 140 0.86 0.00209643605870021 0 140 0.86 0.00209643605870021 0 140 0.86 0.00209643605870021
-
Query result
event_date avg_temp 2019-03-04 00:00:00 28.0652941176470588
-
Query result
server_id cpu_utilization avg 100 0.57 0.486492146060105 100 0.47 0.486492146060105
-
Query result
event_date avg_temp avg_temp 2019-03-04 00:00:00 28.0652941176470588 [null] 2019-03-05 00:00:00 28.0688833333333333 28.0652941176470588
-
Query result
event_time server_id hourly_cpu_util 2019-03-05 08:06:14 100 0.569999992847443 2019-03-05 08:06:16 101 0.715000003576279
-
Query result(3 days weighted average)
event_date round ?column? 2019-03-04 00:00:00 28.07 [null] 2019-03-05 00:00:00 28.07 [null] 2019-03-06 00:00:00 28.06 [null] 2019-03-07 00:00:00 28.07 28.06500 2019-03-08 00:00:00 28.06 28.06667
-
Query result
predicted_value 0.363047333224612
-
Exponential Moving Average
- A type of moving average
- Weights decrease exponentially
- Used to smooth trends when large variance in data
-
Smoothing Parameter
- Typically 2 / (1 + number of intervals)
- For example, for 7 days the smoothing parameter is 0.25
- Called lambda
-
Lambda Formula
- (currentperiod_value * lambda) + (previousperiod_EWMA * (1 - lambda))
- Recursive
- Could be calculated using Common Table Expressions but not recommended for large data sets
- Use user defined function instead
- Using SQL for data science
- Tuning SQL queries
- Using R for advanced time series statistics