report pipeline validation errors from seqr #4581
Conversation
```python
safe_post_to_slack(
    SEQR_SLACK_LOADING_NOTIFICATION_CHANNEL, '\n\n'.join(messages),
)
with TemporaryDirectory() as temp_dir_name:
```
While it is only writing one file, I think using the `write_multiple_files` helper function in export utils would allow you to reuse most of that code and not need to reimplement so much boilerplate.
I think the usage would just be:

```python
write_multiple_files([(ERRORS_REPORTED_FILE_NAME, [], [])], run_directory, user)
```
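For reference, a minimal stand-in showing how that call would behave, assuming `write_multiple_files` takes `(file_name, header_columns, rows)` tuples, so an empty header and no rows produce an empty marker file (the real helper in seqr's export utils may differ):

```python
import os
import tempfile

# Hypothetical sketch of the write_multiple_files pattern; the real seqr
# helper's signature and file handling may differ.
def write_multiple_files(files, directory, user=None):
    os.makedirs(directory, exist_ok=True)
    for file_name, header, rows in files:
        with open(os.path.join(directory, file_name), 'w') as f:
            if header:
                f.write('\t'.join(header) + '\n')
            for row in rows:
                f.write('\t'.join(map(str, row)) + '\n')

ERRORS_REPORTED_FILE_NAME = '_ERRORS_REPORTED'
with tempfile.TemporaryDirectory() as run_directory:
    # Empty header and rows: the call just touches an empty marker file.
    write_multiple_files([(ERRORS_REPORTED_FILE_NAME, [], [])], run_directory)
    marker = os.path.join(run_directory, ERRORS_REPORTED_FILE_NAME)
    created, size = os.path.exists(marker), os.path.getsize(marker)
```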
At a high level it looks like you are running 3 separate
```python
local_files = [
    '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/_ERRORS_REPORTED',
```
I added 3 new files in 2 run directories. The first directory has both `_ERRORS_REPORTED` and `validation_errors.json`, and the second has just `validation_errors.json`.
For consistency we should update `RUN_PATHS` to align with these changes.
```python
def _get_runs(self, **kwargs):
    """ Returns two dictionaries:
    - run_files, a mapping of the run directory to a set of filenames inside the directory
    - run_args, a mapping of the run directory to a dict of args for that run
```
There's probably a better data structure to capture these mappings, but we're definitely only making one `list_files`/`gsutil ls` call now.
I think you could use a single structure with `runs = defaultdict(lambda: {'files': set()})`. Then where you are currently updating `run_files` and `run_args` you would do:

```python
runs[run_dirname]['files'].add(file_name)
runs[run_dirname].update(match_dict)
```

And then, for example, in `_report_validation_errors` the way you would use it is:

```python
for run_dir, run_details in runs.items():
    files = run_details['files']
```
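To make the suggestion concrete, here is a runnable sketch of that single structure. The listing below stands in for the result of the one `list_files`/`gsutil ls` call; the run names and `match_dict` contents are illustrative, not the pipeline's real output:

```python
from collections import defaultdict

# One mapping keyed by run directory, holding both the file set and any
# args parsed out of the path.
runs = defaultdict(lambda: {'files': set()})

# Illustrative stand-in for the single file-listing call's results:
listing = [
    ('manual__2025-01-13', '_ERRORS_REPORTED', {'genome_version': 'GRCh38'}),
    ('manual__2025-01-13', 'validation_errors.json', {'genome_version': 'GRCh38'}),
    ('manual__2025-01-14', 'validation_errors.json', {'genome_version': 'GRCh38'}),
]
for run_dirname, file_name, match_dict in listing:
    runs[run_dirname]['files'].add(file_name)  # accumulate filenames per run
    runs[run_dirname].update(match_dict)       # merge the parsed path args

# Consumers then iterate a single mapping:
for run_dir, run_details in runs.items():
    files = run_details['files']
```

One dict replaces the parallel `run_files`/`run_args` pair, so the two can never drift out of sync on their keys.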
this is perfect, thanks!
```python
from collections import defaultdict
from tempfile import TemporaryDirectory
```
this is no longer used
```diff
@@ -282,10 +287,14 @@ def _test_call(self, error_logs, reload_annotations_logs=None, run_loading_logs=
@mock.patch('seqr.management.commands.check_for_new_samples_from_pipeline.MAX_LOOKUP_VARIANTS', 1)
@mock.patch('seqr.views.utils.airtable_utils.BASE_URL', 'https://test-seqr.org/')
@mock.patch('seqr.views.utils.airtable_utils.MAX_UPDATE_RECORDS', 2)
@mock.patch('seqr.views.utils.export_utils.os.makedirs')
@mock.patch('seqr.views.utils.export_utils.mv_file_to_gs')
```
Since we are already mocking the subprocess calls used for interacting with `gs` in this test case, it would be better not to mock this helper function and instead assert that the subprocess calls all look right, including the `mv` calls.
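A hedged sketch of that assertion style: patch the subprocess layer once and check the exact command lines, rather than mocking the helper. The `mv_file_to_gs` body and paths here are simplified stand-ins, not seqr's real implementation:

```python
import subprocess
from unittest import mock

# Simplified stand-in for the real helper, which shells out to gsutil.
def mv_file_to_gs(local_path, gs_path):
    subprocess.Popen(f'gsutil mv {local_path} {gs_path}', shell=True).wait()

with mock.patch('subprocess.Popen') as mock_subprocess:
    mv_file_to_gs('/tmp/_ERRORS_REPORTED', 'gs://test-bucket/runs/')
    # The test can now assert on the full command, mv included:
    executed = mock_subprocess.call_args.args[0]
```

Asserting at the subprocess boundary keeps the helper's own logic under test, so a bug in how it builds the `gsutil mv` command would still fail the test.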
draft pr while I work on tests!