Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TIMX 441 - support v2 transform command #314

Merged
merged 1 commit into from
Dec 20, 2024
Merged

Conversation

ghukill
Copy link
Contributor

@ghukill ghukill commented Dec 19, 2024

Purpose and background context

This PR begins to utilize the v2 feature flag pathway for producing a transform command for Transmogrifier.

All said, the changes are actually pretty minimal:

  • retrieve run-id from the payload sent to pipeline lambda, minting a UUID if not present (i.e. for "v1" invocations)
  • add logic to the previously stubbed _etl_v2_generate_transform_commands_method() function
  • produce a v2 transform command for Transmogrifier

How can a reviewer manually see the effects of these changes?

In Dev1 the SSM parameter /apps/timdex-infrastructure/etl/version has been set to 2, and terraform re-applied, resulting in Transmogrifier, pipeline lambdas, and TIM getting an environment variable ETL_VERSION=2. High level, this means both the v1 StepFunction and v2 StepFunction in Dev1 are attempting v2 runs at this time.

As such, invoking the Dev1 v2 StepFunction will trigger this new pathway in the pipeline lambdas which send an updated command to Transmogrifier.

This required some small changes to the StepFunction as well, which now extract the StepFunction execution id and pass it to the pipeline lambdas whenever they are invoked.

This recent run demonstrates this new pipeline lambdas behavior working as expected.

This screenshot shows the transform pipeline lambda invoked with a new run-id node:

Screenshot 2024-12-19 at 9 35 10 AM

And this shows the lambda returning a new Transmogrifier command with --output-location and --run-id CLI arguments included:

Screenshot 2024-12-19 at 9 37 13 AM

Given the updates to Transmogrifier from this PR, Transmogrifier successfully wrote its output to the TIMDEX parquet dataset for the first time (when invoked by the StepFunction) at s3://timdex-extract-dev-222053980223/dataset/year=2024/month=12/day=18/4cfc36e4-4bb5-4f85-aed2-cee91bcc588d-0.parquet.

Lastly, this run of the StepFunction ultimately failed because v2 behavior is not yet set for TIM. After writing to the parquet dataset, the rest of the StepFunction was still looking for v1 files.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: if the ETL_VERSION=2 in the TIMDEX StepFunction environment, this application and Transmogrifier will be using v2 behavior.

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

We are adding feature flag support for parquet dataset
writing, aka ETL_VERSION=2, and the first step will be supporting
an updated transform command for Transmogrifier.

How this addresses that need:
* Builds out feature flag pathway _etl_v2_generate_transform_commands_method
* Adds comments for what code is feature flagged and should
be addressed after v2 work is established

Side effects of this change:
* If ETL_VERSION=2 for this application, new pathways will
be used for transform command generation

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-441
@ghukill ghukill marked this pull request as ready for review December 19, 2024 14:42
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an effort to make feature flag removal easier later, "v1" tests will be moved here during each phase of work. Once v2 is fully implemented and v1 no longer supported, we can remove this file entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good approach!

for extract_output_file in extract_output_files:
transform_command = [
f"--input-file=s3://{timdex_bucket}/{extract_output_file}",
f"--output-location=s3://{timdex_bucket}/dataset",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is subtle but important: this effectively sets /dataset as the TIMDEX parquet dataset.

We may well want to explore setting this via an SSM param / env var at some point, but I'd propose to keep this hardcoded at the moment. We did similar things with source, where each application knew that the source at hand was the folder to look for in S3. Now, it's kind of a simplication where everyone just knows that /dataset is where the dataset is located.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to recap, this will result in the following file tree:

└── timdex-bucket/
    └── dataset/
        └── year=2024/
            └── month=12/
                └── day=01/
                    └── <uuid>-<optional-sequence>.parquet

where UUID uniquely identifies a run per source?

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes and tests are very clear and I think you've said this up well to be phased out in the near future!

@ghukill ghukill merged commit d6015d6 into main Dec 20, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants