The incremental pipeline reads delta data from the source and merges it with the full-load data in the filesystem.
On each incremental run, all of the data in the current DWH is scanned and merged. This merge overwrites version histories with the latest values. There is also the overhead of re-reading the entire existing dataset on every run, which can become expensive as the data grows, especially in cloud storage.
An option to mitigate the ever-growing set of files through a data compaction job would also be beneficial.
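For illustration only, a compaction job could conceptually look like the following sketch, which reads the many small Parquet files of one resource type as a single dataset and rewrites them into fewer, larger files. The paths, thresholds, and the use of pyarrow here are assumptions for the example, not part of the pipeline:

```python
import pyarrow.dataset as ds

SOURCE_DIR = "dwh/Observation"        # assumed layout: one directory per resource type
COMPACTED_DIR = "dwh/Observation_compacted"

# Read all of the small incremental files as one logical dataset.
dataset = ds.dataset(SOURCE_DIR, format="parquet")

# Rewrite the dataset into fewer, larger Parquet files.
ds.write_dataset(
    dataset,
    COMPACTED_DIR,
    format="parquet",
    max_rows_per_file=4_000_000,      # example threshold; tune to the actual data
    max_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
)
```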
Following the offline conversations, here are a few points about this for posterity:
Our Parquet file generation used to be "append only"; this meant that if a FHIR resource was changed, two versions of it existed in the DWH and deduplication had to be done at query time.
That situation was not ideal for large resource tables, e.g., Observations; in other words, it can put a performance burden on queries.
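For concreteness, query-time deduplication over an append-only DWH could look roughly like the sketch below, which keeps only the newest version of each resource. The column names (`id`, `last_updated`) and path are assumptions for the example; the real Parquet schema nests the update timestamp under the resource's `meta` field:

```python
import pandas as pd

# Read one resource type's directory of Parquet files (path is hypothetical).
df = pd.read_parquet("dwh/Observation")

# Keep only the newest version of each resource: sort oldest-to-newest, then
# drop earlier duplicates of the same id.
latest = (
    df.sort_values("last_updated")
      .drop_duplicates(subset="id", keep="last")
)
```

In SQL engines the same effect is typically achieved with a window function, e.g. `ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC)`, filtering for row number 1.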
For this reason, we added the merger in Incremental merge #364. The merger's performance should be good because it reads from and writes to the filesystem (no FHIR-server/DB interaction). That said, if we know that deduplication is not necessary for some resources, for example that an Observation will not be modified after it is first created, then we can add an option to disable merging for certain resource types.
This obviously leaves the responsibility with the user to make sure that, if any deduplication is needed, it is taken care of at query time.
To summarize: the work to be done here is to add a feature that disables merging for some resource types and avoids copying their Parquet files into DWH snapshots (re-using one copy instead); a conceptual sketch follows.
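The following is a minimal, purely conceptual sketch of that behavior, not the pipeline's actual code: resource types declared immutable skip the read-and-merge step, and their existing Parquet files are re-used (here via symlinks) rather than copied into the new snapshot. All names, the configuration set, and the `merge_parquet` placeholder are hypothetical:

```python
import os
import shutil

# Hypothetical user configuration: resource types known to be immutable.
MERGE_DISABLED_TYPES = {"Observation"}

def merge_parquet(old_dir: str, delta_dir: str, new_dir: str) -> None:
    """Placeholder for the existing full-scan merge step (see #364)."""
    raise NotImplementedError

def build_snapshot(resource_type: str, old_snapshot: str,
                   new_snapshot: str, delta_dir: str) -> None:
    old_dir = os.path.join(old_snapshot, resource_type)
    new_dir = os.path.join(new_snapshot, resource_type)
    os.makedirs(new_dir, exist_ok=True)
    if resource_type in MERGE_DISABLED_TYPES:
        # Re-use the existing Parquet files via symlinks instead of copying
        # them into the new snapshot ...
        for name in os.listdir(old_dir):
            os.symlink(os.path.join(os.path.abspath(old_dir), name),
                       os.path.join(new_dir, name))
        # ... and simply add the delta files; no full scan, no merge.
        for name in os.listdir(delta_dir):
            shutil.copy(os.path.join(delta_dir, name), new_dir)
    else:
        # Current behavior: scan the existing data, merge it with the delta,
        # and write the merged result into the new snapshot.
        merge_parquet(old_dir, delta_dir, new_dir)
```

Deduplication for the skipped types is then deferred to query time, as described above.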