Feature Request: Requires Delta Data to be retained in incremental Snapshots #1013

Open
Charantl opened this issue Apr 3, 2024 · 1 comment
Labels
P2:should An issue to be addressed in a quarter or so.

Comments

Charantl commented Apr 3, 2024

The incremental pipeline process reads delta data from the source and merges it with the full-load data in the filesystem:

On each incremental pipeline run, the entire dataset in the current DWH is scanned and merged. This merge causes version histories to be overwritten with the latest values. There is also the overhead of reading all of the existing data on each pipeline run; this read could become expensive once the data grows (especially in cloud storage).
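The scan-and-overwrite behavior described above can be pictured as concatenating the delta with the full snapshot and keeping only the newest version per resource id. A minimal pandas sketch (the column names and schema are illustrative, not the pipeline's actual Parquet schema):

```python
import pandas as pd

# Existing full snapshot in the DWH (illustrative schema).
full = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "version": [1, 1, 1],
    "status": ["active", "active", "active"],
})

# Delta read from the source since the last run: p2 was updated, p4 is new.
delta = pd.DataFrame({
    "id": ["p2", "p4"],
    "version": [2, 1],
    "status": ["inactive", "active"],
})

# Merge: concatenate, then keep only the latest version of each resource.
# Note that this requires scanning the full snapshot and that the older
# version of p2 is overwritten, i.e., its history is lost.
merged = (
    pd.concat([full, delta])
    .sort_values("version")
    .drop_duplicates("id", keep="last")
    .sort_values("id")
    .reset_index(drop=True)
)
print(merged)
```

The cost of the `pd.concat` over the full snapshot is what grows with the DWH size, independent of how small the delta is.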

Also, an option to mitigate the ever-growing number of files through a data compaction job would be beneficial.

@bashir2 bashir2 added the P2:should An issue to be addressed in a quarter or so. label May 10, 2024
bashir2 (Collaborator) commented May 10, 2024

Following the offline conversations, here are a few points about this for posterity:

  • Our Parquet file generation used to be "append only"; this meant that if a FHIR resource was changed, two versions of it existed in the DWH and deduplication had to be done at query time.
  • The above situation was not ideal for large resource tables, e.g., Observations; in other words, it can put a performance burden on query time.
  • For this reason, we added the merger in Incremental merge #364. The merger's performance should be good because it reads from and writes to the filesystem (no FHIR-server/DB interaction). That said, if for some resources we know that deduplication is not necessary, for example if we know that an Observation will not be modified after it is first created, then we can add an option to disable merge for those resource types.
  • This obviously leaves it to the user to ensure that any needed deduplication is taken care of at query time.
  • Finally, this is also related to Data retrieved from the Hapi JPA Database only includes the latest information, with no historical data being fetched #1012; i.e., if we want to keep historical information for a resource type, we should also disable merge/deduplication for it and change queries to take that into account.
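If merge is disabled for a resource type, its append-only files will contain multiple versions of the same resource, and deduplication moves to query time as the bullets describe. A hedged pandas sketch of such a "latest version per id" filter (real queries would express this in SQL over the Parquet files; the version column name is an assumption):

```python
import pandas as pd

# Append-only table: two versions of Observation o1 coexist alongside o2.
rows = pd.DataFrame({
    "id": ["o1", "o1", "o2"],
    "meta_versionId": [1, 2, 1],
    "value": [10, 12, 7],
})

# Query-time deduplication: keep only the highest version per resource id.
latest = (
    rows.loc[rows.groupby("id")["meta_versionId"].idxmax()]
    .reset_index(drop=True)
)
print(latest)
```

Conversely, keeping historical information (as in #1012) means skipping this filter and querying all versions directly.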

To summarize: The work to be done here is to add an option to disable merge for selected resource types and to avoid copying their Parquet files into DWH snapshots (re-using one copy instead).
