-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TIMX 441 - support v2 transform command #314
Conversation
Why these changes are being introduced: We are adding feature flag support for parquet dataset writing, aka ETL_VERSION=2, and the first step will be supporting an updated transform command for Transmogrifier. How this addresses that need: * Builds out feature flag pathway _etl_v2_generate_transform_commands_method * Adds comments for what code is feature flagged and should be addressed after v2 work is established Side effects of this change: * If ETL_VERSION=2 for this application, new pathways will be used for transform command generation Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-441
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an effort to make feature flag removal easier later, "v1" tests will be moved here during each phase of work. Once v2 is fully implemented and v1 no longer supported, we can remove this file entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good approach!
for extract_output_file in extract_output_files: | ||
transform_command = [ | ||
f"--input-file=s3://{timdex_bucket}/{extract_output_file}", | ||
f"--output-location=s3://{timdex_bucket}/dataset", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is subtle but important: this effectively sets /dataset
as the TIMDEX parquet dataset.
We may well want to explore setting this via an SSM param / env var at some point, but I'd propose to keep this hardcoded at the moment. We did similar things with source
, where each application knew that the source
at hand was the folder to look for in S3. Now, it's kind of a simplication where everyone just knows that /dataset
is where the dataset is located.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to recap, this will result in the following file tree:
└── timdex-bucket/
└── dataset/
└── year=2024/
└── month=12/
└── day=01/
└── <uuid>-<optional-sequence>.parquet
where UUID uniquely identifies a run per source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes and tests are very clear and I think you've said this up well to be phased out in the near future!
Purpose and background context
This PR begins to utilize the v2 feature flag pathway for producing a transform command for Transmogrifier.
All said, the changes are actually pretty minimal:
run-id
from the payload sent to pipeline lambda, minting a UUID if not present (i.e. for "v1" invocations)_etl_v2_generate_transform_commands_method()
functionHow can a reviewer manually see the effects of these changes?
In Dev1 the SSM parameter
/apps/timdex-infrastructure/etl/version
has been set to2
, and terraform re-applied, resulting in Transmogrifier, pipeline lambdas, and TIM getting an environment variableETL_VERSION=2
. High level, this means both the v1 StepFunction and v2 StepFunction in Dev1 are attempting v2 runs at this time.As such, invoking the Dev1 v2 StepFunction will trigger this new pathway in the pipeline lambdas which send an updated command to Transmogrifier.
This required some small changes to the StepFunction as well, which now extract the StepFunction execution id and pass it to the pipeline lambdas whenever they are invoked.
This recent run demonstrates this new pipeline lambdas behavior working as expected.
This screenshot shows the transform pipeline lambda invoked with a new
run-id
node:And this shows the lambda returning a new Transmogrifier command with
--output-location
and--run-id
CLI arguments included:Given the updates to Transmogrifier from this PR, Transmogrifier successfully wrote its output to the TIMDEX parquet dataset for the first time (when invoked by the StepFunction) at
s3://timdex-extract-dev-222053980223/dataset/year=2024/month=12/day=18/4cfc36e4-4bb5-4f85-aed2-cee91bcc588d-0.parquet
.Lastly, this run of the StepFunction ultimately failed because v2 behavior is not yet set for TIM. After writing to the parquet dataset, the rest of the StepFunction was still looking for v1 files.
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES: if the
ETL_VERSION=2
in the TIMDEX StepFunction environment, this application and Transmogrifier will be using v2 behavior.What are the relevant tickets?
Developer
Code Reviewer(s)