Commit
Add TIMDEX provenance object to transformed records
Why these changes are being introduced:

Transitioning to a parquet dataset architecture for TIMDEX ETL provides additional data related to each transformed record as part of that record's row in the dataset. But this data is only helpful if you can tether the record you encounter in Opensearch to a row in the dataset.

Related to, but not dependent on, the parquet dataset change was the desire for more information about a record in TIMDEX, e.g. when it was transformed and indexed. We might consider this information "provenance" about the TIMDEX record as encountered in Opensearch and/or the TIMDEX API.

How this addresses that need:

A new "timdex_provenance" field is added to the TIMDEX data model that includes information about the origins of the TIMDEX record. As it pertains to the parquet dataset, this provenance data includes fields like "run_id" and "run_record_offset", which help pinpoint the row in the parquet dataset for this record. With this linkage, it becomes possible to very quickly retrieve the original source record for a transformed record.

In addition to supporting random-access reads of the dataset, this provenance data provides some immediately informative metadata about the TIMDEX record, like "run_date".

Side effects of this change:
* None, really. TIM will need to be updated to include this new field in the Opensearch mapping, but until then, it's just extra data in the transformed record.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-406
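
The linkage described above can be illustrated with a minimal sketch. Only the names "timdex_provenance", "run_id", "run_record_offset", and "run_date" come from this commit message; the class, the helper functions, the field types, and the in-memory list standing in for the parquet dataset are all assumptions made for illustration, not the actual implementation.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TimdexProvenance:
    # Field names come from the commit message; the types are assumptions.
    run_id: str
    run_record_offset: int
    run_date: str


def attach_provenance(record: dict, provenance: TimdexProvenance) -> dict:
    """Return a copy of a transformed record with a "timdex_provenance" field."""
    return {**record, "timdex_provenance": asdict(provenance)}


def find_dataset_row(rows: list, provenance: dict) -> Optional[dict]:
    """Find the dataset row matching a record's run_id and run_record_offset."""
    for row in rows:
        if (
            row["run_id"] == provenance["run_id"]
            and row["run_record_offset"] == provenance["run_record_offset"]
        ):
            return row
    return None


# Hypothetical stand-in for rows of the parquet dataset.
dataset_rows = [
    {"run_id": "run-a", "run_record_offset": 0, "source_record": "<record>first</record>"},
    {"run_id": "run-a", "run_record_offset": 1, "source_record": "<record>second</record>"},
]

# Hypothetical transformed record and provenance values.
transformed = attach_provenance(
    {"title": "Example title"},
    TimdexProvenance(run_id="run-a", run_record_offset=1, run_date="2024-01-01"),
)

# Use the provenance carried by the record to recover its source record.
row = find_dataset_row(dataset_rows, transformed["timdex_provenance"])
```

A linear scan is used here only to keep the sketch self-contained; against a real parquet dataset, the same two values would presumably be used to filter to a single run and then read the row at the given offset.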