Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

report pipeline validation errors from seqr #4581

Merged
merged 7 commits into from
Jan 17, 2025

Conversation

jklugherz
Copy link
Contributor

draft pr while I work on tests!

@jklugherz jklugherz requested review from hanars and bpblanken January 14, 2025 20:13
safe_post_to_slack(
SEQR_SLACK_LOADING_NOTIFICATION_CHANNEL, '\n\n'.join(messages),
)
with TemporaryDirectory() as temp_dir_name:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while it is only writing one file, I think using the write_multiple_files helper function in export utils would allow you to reuse most of that code and not need to reimplement so much boiler plate

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the usage would just be

write_multiple_files([(ERRORS_REPORTED_FILE_NAME, [], [])], run_directory, user)

@hanars
Copy link
Collaborator

hanars commented Jan 14, 2025

At a high level it looks like you are running 3 separate ls commands to check the runs - one to get the runs with validation errors, one to get the runs with the errors already reported and one to get the success runs. I think at this point its probably more effective to run a single ls with the file_name arg as a wildcard and then group the results by run_version to get which files exist for which runs in a single operation


local_files = [
'/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/_ERRORS_REPORTED',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added 3 new files in 2 run directories. The first directory has both _ERRORS_REPORTED and validation_errors.json and the second has just validation_errors.json.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency we should update RUN_PATHS to align with these changes

def _get_runs(self, **kwargs):
""" Returns two dictionaries:
- run_files, a mapping of the run directory to a set of filenames inside the directory
- run_args, a mapping of the run directory to a dict of args for that run
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably a better data structure to capture these mappings, but we're definitely only making one list_files/gsutil ls call now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could use a single structure with runs = defaultdict(lambda: {'files': set()}). Then where you are currently updating run_files and run_args you would do

runs[run_dirname]['files'].add(file_name)
runs[run_dirname].update(match_dict)

And then for example in _report_validation_errors the way you would use it is

for run_dir, run_details in runs.items():
    files = run_details['files']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is perfect, thanks!

@jklugherz jklugherz requested a review from hanars January 15, 2025 21:35
@jklugherz jklugherz marked this pull request as ready for review January 15, 2025 21:37
from collections import defaultdict
from tempfile import TemporaryDirectory
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no longer used

def _get_runs(self, **kwargs):
""" Returns two dictionaries:
- run_files, a mapping of the run directory to a set of filenames inside the directory
- run_args, a mapping of the run directory to a dict of args for that run
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could use a single structure with runs = defaultdict(lambda: {'files': set()}). Then where you are currently updating run_files and run_args you would do

runs[run_dirname]['files'].add(file_name)
runs[run_dirname].update(match_dict)

And then for example in _report_validation_errors the way you would use it is

for run_dir, run_details in runs.items():
    files = run_details['files']


local_files = [
'/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/_ERRORS_REPORTED',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency we should update RUN_PATHS to align with these changes

@@ -282,10 +287,14 @@ def _test_call(self, error_logs, reload_annotations_logs=None, run_loading_logs=
@mock.patch('seqr.management.commands.check_for_new_samples_from_pipeline.MAX_LOOKUP_VARIANTS', 1)
@mock.patch('seqr.views.utils.airtable_utils.BASE_URL', 'https://test-seqr.org/')
@mock.patch('seqr.views.utils.airtable_utils.MAX_UPDATE_RECORDS', 2)
@mock.patch('seqr.views.utils.export_utils.os.makedirs')
@mock.patch('seqr.views.utils.export_utils.mv_file_to_gs')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are already mocking the subprocess calls used for interacting with gs in this test case it would be better not to mock this helper function and instead assert that the subprocess calls all look right including the mv calls

@jklugherz jklugherz requested a review from hanars January 17, 2025 18:09
@jklugherz jklugherz merged commit a5a8c9c into dev Jan 17, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants