Dp 2977 #2995

Closed
wants to merge 4 commits into from

murdo-moj
Contributor

@murdo-moj murdo-moj commented Jan 18, 2024

  • Copy curated data in s3 between versions upon table deletion

@murdo-moj murdo-moj requested a review from a team January 18, 2024 09:29
@@ -200,6 +207,21 @@ def update_metadata_remove_schemas(self, schema_list: list[str]) -> str:
logger=self.logger,
).run()

# Copy data files in the curated bucket
Contributor

I thought this wasn't needed because it was already handled by the Athena query in CuratedDataCopier.

Rather than copying all the parquet files in S3, we have an UNLOAD query that creates the new parquet from the latest load timestamp in the previous version. Am I misunderstanding how this works?

    UNLOAD (
        SELECT
            *
        FROM {previous_major_database}.{curated_table.name}
        WHERE load_timestamp = (
            SELECT MAX(load_timestamp)
            FROM {previous_major_database}.{curated_table.name}
        )
    )
    TO '{self.table_path}'
    WITH (
        format = 'parquet',
        compression = 'SNAPPY',
        partitioned_by = ARRAY['load_timestamp']
    )

Triggered from https://github.com/ministryofjustice/data-platform/blob/main/containers/daap-python-base/src/var/task/curated_data/curated_data_loader.py#L215
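
For illustration, here is a minimal sketch of how a query like the one quoted above could be assembled in Python. It is not the repository's CuratedDataCopier code: the function name `build_unload_query`, its parameters, and the example database, table, and bucket names are all hypothetical. It only builds the SQL string, so that the "latest load_timestamp only" behaviour is easy to see; it does not submit anything to Athena.

```python
def build_unload_query(previous_major_database: str, table_name: str, table_path: str) -> str:
    """Build an Athena UNLOAD statement that copies only the most recent load.

    Hypothetical helper for illustration; the names are assumptions and do not
    come from the data-platform repository.
    """
    source = f"{previous_major_database}.{table_name}"
    return f"""
    UNLOAD (
        SELECT *
        FROM {source}
        WHERE load_timestamp = (
            SELECT MAX(load_timestamp)
            FROM {source}
        )
    )
    TO '{table_path}'
    WITH (
        format = 'parquet',
        compression = 'SNAPPY',
        partitioned_by = ARRAY['load_timestamp']
    )
    """


# Example with made-up identifiers (assumed, not taken from the PR).
query = build_unload_query(
    previous_major_database="example_db_v1",
    table_name="example_table",
    table_path="s3://example-bucket/curated/v2/example_table/",
)
print(query)
```

Because the subquery filters on MAX(load_timestamp), only rows from the newest load in the previous version are written to the new location; older loads are not carried over. One point that may be worth checking: Athena's UNLOAD documentation asks for partition columns to appear last in the SELECT list, which SELECT * only satisfies if load_timestamp is already the final column.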

Contributor

This pull request is being marked as stale because it has been open for 30 days with no activity. Remove the stale label or comment to keep the pull request open.

@github-actions github-actions bot added the stale label Feb 18, 2024
Contributor

This pull request is being closed because it has been open for a further 7 days with no activity. If this is still a valid pull request, please reopen it. Thank you!

@github-actions github-actions bot closed this Feb 25, 2024
@jacobwoffenden jacobwoffenden deleted the dp-2977 branch May 9, 2024 14:42
2 participants