Apache Spark Engine

This is the implementation of the Engine contract of Open Data Fabric using the Apache Spark data processing framework. It is currently used by the kamu-cli data management tool.

Features

  • The Spark engine currently provides the richest SQL dialect for map/filter-style transformations
  • Integrates GeoSpark to provide geo-spatial SQL functions
  • It is used by kamu-cli for ingesting data into Parquet
  • It is used by kamu-cli along with Apache Livy to provide SQL querying in Jupyter notebooks
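
To illustrate what a "map/filter style" transformation means, here is a minimal sketch: each output row is derived from exactly one input row (map), and rows are kept or dropped by a per-row predicate (filter), with no joins or aggregation. It uses Python's built-in sqlite3 purely as a stand-in so it runs without a Spark installation; it is not how this engine executes queries, and the `readings` table, its columns, and the `run_map_filter` helper are hypothetical.

```python
import sqlite3


def run_map_filter(rows):
    """Run a map/filter-style SQL query over (city, temp_c) rows.

    sqlite3 is only a stand-in here; the table and column names are
    hypothetical and not taken from this engine or from kamu-cli.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    query = """
        SELECT city,
               temp_c * 9.0 / 5.0 + 32.0 AS temp_f  -- map: per-row projection
        FROM readings
        WHERE temp_c > 0                            -- filter: per-row predicate
        ORDER BY city
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


rows = [("Vancouver", 10.0), ("Yellowknife", -20.0), ("Kyiv", 5.0)]
result = run_map_filter(rows)
print(result)
```

The same SELECT/WHERE shape is what a map/filter query looks like in any SQL dialect; Spark SQL adds a much larger set of built-in functions on top of it.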

Known Issues

  • Takes a long time to start up, which hurts the user experience
  • Does not support temporal table joins
    • You might be better off using the Flink-based engine for joining and aggregating event streams
  • TODO

Developing

See the Developer Guide