Home

Welcome to the spark-sandbox wiki!

cardinality
- bounded: data is finite in size. mostly processed by traditional batch engines
- unbounded: data is infinite in size (at least in theory). mostly processed by streaming or micro-batch engines
encoding
- table vs dataset
consistency
- at-most-once vs at-least-once vs exactly-once
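
A minimal sketch of how the three guarantees differ, assuming a toy consumer that sums values from `(message_id, value)` pairs. The function names and the sample `deliveries` list are made up for illustration and are not tied to any particular engine.

```python
from typing import Iterable, Tuple

Message = Tuple[int, int]  # (message_id, value), a hypothetical message shape


def sum_at_least_once(messages: Iterable[Message]) -> int:
    """Count every delivery, so a redelivered message is counted again."""
    return sum(value for _msg_id, value in messages)


def sum_effectively_once(messages: Iterable[Message]) -> int:
    """Deduplicate by message id so each message affects the result once."""
    seen = set()
    total = 0
    for msg_id, value in messages:
        if msg_id in seen:
            continue  # duplicate delivery, already applied
        seen.add(msg_id)
        total += value
    return total


# Message 2 is delivered twice, as can happen after a retry.
deliveries = [(1, 10), (2, 20), (2, 20), (3, 30)]

# At-most-once would instead risk losing message 2 entirely.
lossy_deliveries = [(1, 10), (3, 30)]

at_least_once_total = sum_at_least_once(deliveries)
exactly_once_total = sum_effectively_once(deliveries)
at_most_once_total = sum_at_least_once(lossy_deliveries)
```

With the true total being 60, the at-least-once sum overcounts, the at-most-once sum may undercount, and deduplication on top of at-least-once delivery recovers the exactly-once result.
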
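in processing-time windowing, there is no way to handle late data, because elements are assigned to windows by when they arrive rather than by when they occurred
in event-time windowing, more buffering of data is required (due to extended window lifetimes). the other issue is completeness (knowing when a window can be closed), which is estimated via watermarks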
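
A rough sketch of event-time tumbling windows with a watermark, in plain Python. It assumes a simple heuristic watermark (largest event time seen minus `allowed_lateness`); the function name and parameters are illustrative, and real engines compute watermarks in more elaborate ways.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, int]  # (event_time, value)


def event_time_windows(
    events: Iterable[Event], window_size: int, allowed_lateness: int
) -> Tuple[Dict[int, int], List[Event]]:
    """Sum values per tumbling event-time window.

    `events` is in arrival order, which may differ from event-time order.
    Returns (sum per window start, events dropped as too late).
    """
    open_windows: Dict[int, int] = defaultdict(int)  # buffered window state
    results: Dict[int, int] = {}
    dropped: List[Event] = []
    watermark = float("-inf")

    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            # The window was already declared complete: the data is too late.
            dropped.append((event_time, value))
        else:
            open_windows[window_start] += value

        # Heuristic watermark: assume nothing older than this still arrives.
        watermark = max(watermark, event_time - allowed_lateness)

        # Emit every buffered window that the watermark has passed.
        for start in sorted(open_windows):
            if start + window_size <= watermark:
                results[start] = open_windows.pop(start)

    results.update(open_windows)  # end of input: flush what is still buffered
    return results, dropped


arrivals = [(1, 1), (12, 1), (3, 1), (18, 1), (4, 1), (25, 1)]
sums, late = event_time_windows(arrivals, window_size=10, allowed_lateness=5)
```

Here the event at time 3 arrives out of order but is still counted, because window `[0, 10)` stays buffered until the watermark passes 10. The event at time 4 arrives after that point and is dropped.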

stream: an element-by-element view of a dataset over time
micro-batch streaming: uses repeated executions of a batch processing engine to process unbounded data
tuple-based windowing: windows whose sizes are counted in numbers of elements (esp. in SQL-based systems)
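
The last two definitions can be sketched together in plain Python. `tuple_windows` and `run_micro_batches` are hypothetical helper names; slicing the micro-batches by element count is a simplification made here, since micro-batch engines typically slice by a time interval instead.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def tuple_windows(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive windows holding `size` elements each.

    The final window may be shorter if the stream ends early.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    window: List[T] = []
    for element in stream:
        window.append(element)
        if len(window) == size:
            yield window
            window = []
    if window:
        yield window


def run_micro_batches(
    stream: Iterable[T], batch_fn: Callable[[List[T]], R], batch_size: int
) -> List[R]:
    """Run an ordinary batch function once per small slice of the stream."""
    return [batch_fn(batch) for batch in tuple_windows(stream, batch_size)]


windows = list(tuple_windows(range(7), 3))
batch_sums = run_micro_batches(range(7), sum, 3)
```

Because `tuple_windows` is a generator, it also works on an unbounded iterator: each window is emitted as soon as it fills, without waiting for the input to end.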
