The incremental pipeline reads delta data from the source and merges it with the full-load data in the filesystem.
On each incremental run, all of the data in the current DWH is scanned and merged. This merge overwrites version histories with the latest values. There is also the overhead of re-reading the entire existing dataset on every run, which can become expensive as the data grows, especially in cloud storage.
An option to mitigate the ever-growing set of files through a data compaction job would also be beneficial.
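For illustration only, a compaction job could conceptually look like the following sketch, which reads the many small Parquet files of one resource type as a single dataset and rewrites them into fewer, larger files. The paths, thresholds, and the use of pyarrow here are assumptions for the example, not part of the pipeline:

```python
import pyarrow.dataset as ds

SOURCE_DIR = "dwh/Observation"        # assumed layout: one directory per resource type
COMPACTED_DIR = "dwh/Observation_compacted"

# Read all of the small incremental files as one logical dataset.
dataset = ds.dataset(SOURCE_DIR, format="parquet")

# Rewrite the dataset into fewer, larger Parquet files.
ds.write_dataset(
    dataset,
    COMPACTED_DIR,
    format="parquet",
    max_rows_per_file=4_000_000,      # example threshold; tune to the actual data
    max_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
)
```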
Following the offline conversations, here are a few points about this for posterity:
Our Parquet file generation used to be "append only"; this meant that if a FHIR resource was changed, two versions of it existed in the DWH and deduplication had to be done at query time.
That situation was not ideal for large resource tables, e.g., Observations; in other words, it can put a performance burden on queries.
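For concreteness, query-time deduplication over an append-only DWH could look roughly like the sketch below, which keeps only the newest version of each resource. The column names (`id`, `last_updated`) and path are assumptions for the example; the real Parquet schema nests the update timestamp under the resource's `meta` field:

```python
import pandas as pd

# Read one resource type's directory of Parquet files (path is hypothetical).
df = pd.read_parquet("dwh/Observation")

# Keep only the newest version of each resource: sort oldest-to-newest, then
# drop earlier duplicates of the same id.
latest = (
    df.sort_values("last_updated")
      .drop_duplicates(subset="id", keep="last")
)
```

In SQL engines the same effect is typically achieved with a window function, e.g. `ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC)`, filtering for row number 1.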
For this reason, we added the merger in Incremental merge #364. The merger's performance should be good because it reads from and writes to the filesystem (no FHIR-server/DB interaction). That said, if we know that deduplication is not necessary for some resources, for example that an Observation will not be modified after it is first created, then we can add an option to disable merging for certain resource types.
This obviously leaves the responsibility with the user to make sure that, if any deduplication is needed, it is taken care of at query time.
To summarize: the work to be done here is to add a feature that disables merging for some resource types and avoids copying their Parquet files into DWH snapshots (re-using one copy instead); a conceptual sketch follows.
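The following is a minimal, purely conceptual sketch of that behavior, not the pipeline's actual code: resource types declared immutable skip the read-and-merge step, and their existing Parquet files are re-used (here via symlinks) rather than copied into the new snapshot. All names, the configuration set, and the `merge_parquet` placeholder are hypothetical:

```python
import os
import shutil

# Hypothetical user configuration: resource types known to be immutable.
MERGE_DISABLED_TYPES = {"Observation"}

def merge_parquet(old_dir: str, delta_dir: str, new_dir: str) -> None:
    """Placeholder for the existing full-scan merge step (see #364)."""
    raise NotImplementedError

def build_snapshot(resource_type: str, old_snapshot: str,
                   new_snapshot: str, delta_dir: str) -> None:
    old_dir = os.path.join(old_snapshot, resource_type)
    new_dir = os.path.join(new_snapshot, resource_type)
    os.makedirs(new_dir, exist_ok=True)
    if resource_type in MERGE_DISABLED_TYPES:
        # Re-use the existing Parquet files via symlinks instead of copying
        # them into the new snapshot ...
        for name in os.listdir(old_dir):
            os.symlink(os.path.join(os.path.abspath(old_dir), name),
                       os.path.join(new_dir, name))
        # ... and simply add the delta files; no full scan, no merge.
        for name in os.listdir(delta_dir):
            shutil.copy(os.path.join(delta_dir, name), new_dir)
    else:
        # Current behavior: scan the existing data, merge it with the delta,
        # and write the merged result into the new snapshot.
        merge_parquet(old_dir, delta_dir, new_dir)
```

Deduplication for the skipped types is then deferred to query time, as described above.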