TIMX 441 - support v2 transform command #314

ghukill · 2024-12-19T14:40:13Z

Purpose and background context

This PR begins to utilize the v2 feature flag pathway for producing a transform command for Transmogrifier.

All said, the changes are actually pretty minimal:

retrieve run-id from the payload sent to pipeline lambda, minting a UUID if not present (i.e. for "v1" invocations)
add logic to the previously stubbed _etl_v2_generate_transform_commands_method() function
produce a v2 transform command for Transmogrifier

How can a reviewer manually see the effects of these changes?

In Dev1 the SSM parameter /apps/timdex-infrastructure/etl/version has been set to 2, and terraform re-applied, resulting in Transmogrifier, pipeline lambdas, and TIM getting an environment variable ETL_VERSION=2. High level, this means both the v1 StepFunction and v2 StepFunction in Dev1 are attempting v2 runs at this time.

As such, invoking the Dev1 v2 StepFunction will trigger this new pathway in the pipeline lambdas which send an updated command to Transmogrifier.

This required some small changes to the StepFunction as well, which now extract the StepFunction execution id and pass it to the pipeline lambdas whenever they are invoked.

This recent run demonstrates this new pipeline lambdas behavior working as expected.

This screenshot shows the transform pipeline lambda invoked with a new run-id node:

And this shows the lambda returning a new Transmogrifier command with --output-location and --run-id CLI arguments included:

Given the updates to Transmogrifier from this PR, Transmogrifier successfully wrote its output to the TIMDEX parquet dataset for the first time (when invoked by the StepFunction) at s3://timdex-extract-dev-222053980223/dataset/year=2024/month=12/day=18/4cfc36e4-4bb5-4f85-aed2-cee91bcc588d-0.parquet.

Lastly, this run of the StepFunction ultimately failed because v2 behavior is not yet set for TIM. After writing to the parquet dataset, the rest of the StepFunction was still looking for v1 files.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: if the ETL_VERSION=2 in the TIMDEX StepFunction environment, this application and Transmogrifier will be using v2 behavior.

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-441

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: We are adding feature flag support for parquet dataset writing, aka ETL_VERSION=2, and the first step will be supporting an updated transform command for Transmogrifier. How this addresses that need: * Builds out feature flag pathway _etl_v2_generate_transform_commands_method * Adds comments for what code is feature flagged and should be addressed after v2 work is established Side effects of this change: * If ETL_VERSION=2 for this application, new pathways will be used for transform command generation Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-441

ghukill · 2024-12-19T14:44:00Z

tests/test_commands_v1.py

In an effort to make feature flag removal easier later, "v1" tests will be moved here during each phase of work. Once v2 is fully implemented and v1 no longer supported, we can remove this file entirely.

Good approach!

ghukill · 2024-12-19T15:59:20Z

lambdas/commands.py

+    for extract_output_file in extract_output_files:
+        transform_command = [
+            f"--input-file=s3://{timdex_bucket}/{extract_output_file}",
+            f"--output-location=s3://{timdex_bucket}/dataset",


This is subtle but important: this effectively sets /dataset as the TIMDEX parquet dataset.

We may well want to explore setting this via an SSM param / env var at some point, but I'd propose to keep this hardcoded at the moment. We did similar things with source, where each application knew that the source at hand was the folder to look for in S3. Now, it's kind of a simplication where everyone just knows that /dataset is where the dataset is located.

Just to recap, this will result in the following file tree:

└── timdex-bucket/ └── dataset/ └── year=2024/ └── month=12/ └── day=01/ └── <uuid>-<optional-sequence>.parquet

where UUID uniquely identifies a run per source?

ehanson8

The changes and tests are very clear and I think you've said this up well to be phased out in the near future!

ghukill marked this pull request as ready for review December 19, 2024 14:42

ghukill requested review from jonavellecuerdo and ehanson8 December 19, 2024 14:42

ghukill commented Dec 19, 2024

View reviewed changes

ehanson8 approved these changes Dec 19, 2024

View reviewed changes

jonavellecuerdo approved these changes Dec 20, 2024

View reviewed changes

ghukill merged commit d6015d6 into main Dec 20, 2024
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TIMX 441 - support v2 transform command #314

TIMX 441 - support v2 transform command #314

ghukill commented Dec 19, 2024 •

edited

Loading

ghukill Dec 19, 2024

ehanson8 Dec 19, 2024

ghukill Dec 19, 2024

jonavellecuerdo Dec 20, 2024

ehanson8 left a comment

TIMX 441 - support v2 transform command #314

TIMX 441 - support v2 transform command #314

Conversation

ghukill commented Dec 19, 2024 • edited Loading

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

ghukill Dec 19, 2024

Choose a reason for hiding this comment

ehanson8 Dec 19, 2024

Choose a reason for hiding this comment

ghukill Dec 19, 2024

Choose a reason for hiding this comment

jonavellecuerdo Dec 20, 2024

Choose a reason for hiding this comment

ehanson8 left a comment

Choose a reason for hiding this comment

ghukill commented Dec 19, 2024 •

edited

Loading