report pipeline validation errors from seqr #4581

jklugherz · 2025-01-14T20:13:31Z

draft pr while I work on tests!

hanars · 2025-01-14T20:18:10Z

seqr/management/commands/check_for_new_samples_from_pipeline.py

+            safe_post_to_slack(
+                SEQR_SLACK_LOADING_NOTIFICATION_CHANNEL, '\n\n'.join(messages),
+            )
+            with TemporaryDirectory() as temp_dir_name:


While it is only writing one file, I think using the write_multiple_files helper function in export utils would allow you to reuse most of that code and avoid reimplementing so much boilerplate.

I think the usage would just be

write_multiple_files([(ERRORS_REPORTED_FILE_NAME, [], [])], run_directory, user)
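To make the suggested call shape concrete, here is a minimal sketch of what a helper taking (file_name, header, rows) tuples could look like. It is not the export utils implementation: write_files_sketch is a hypothetical stand-in, the tuple layout is inferred from the call above, the value of ERRORS_REPORTED_FILE_NAME is assumed from the test fixture path, and the user argument is left out.

```python
import os
from tempfile import TemporaryDirectory

# Assumed value, taken from the file name used in the test fixture paths.
ERRORS_REPORTED_FILE_NAME = '_ERRORS_REPORTED'


def write_files_sketch(files, directory):
    """Hypothetical stand-in: write each (file_name, header, rows) tuple as a tab-delimited file."""
    os.makedirs(directory, exist_ok=True)
    written = []
    for file_name, header, rows in files:
        # An empty header and no rows produce an empty file, i.e. a marker.
        lines = ['\t'.join(header)] if header else []
        lines += ['\t'.join(str(row.get(col, '')) for col in header) for row in rows]
        path = os.path.join(directory, file_name)
        with open(path, 'w') as f:
            f.write('\n'.join(lines))
        written.append(path)
    return written


# Usage: an empty header and empty rows yield a zero-byte marker file.
with TemporaryDirectory() as run_directory:
    written = write_files_sketch([(ERRORS_REPORTED_FILE_NAME, [], [])], run_directory)
    marker_size = os.path.getsize(written[0])
```

The point is only that a one-file marker write fits the same multi-file interface, so the temp directory handling does not need to be repeated in the command.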

hanars · 2025-01-14T20:31:59Z

At a high level, it looks like you are running 3 separate ls commands to check the runs: one to get the runs with validation errors, one to get the runs with the errors already reported, and one to get the success runs. I think at this point it's probably more effective to run a single ls with the file_name arg as a wildcard and then group the results by run_version, so you get which files exist for which runs in a single operation.
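A rough sketch of the grouping step, assuming the wildcard ls returns a flat list of paths. The regex, the function names, the _SUCCESS file name, and the second and third run directories are illustrative assumptions, not code from this PR.

```python
import re
from collections import defaultdict

# Assumed layout: .../runs/<run_version>/<file_name>
RUN_FILE_RE = re.compile(r'/runs/(?P<run_version>[^/]+)/(?P<file_name>[^/]+)$')


def group_files_by_run(paths):
    """Map each run_version to the set of file names listed for it."""
    files_by_run = defaultdict(set)
    for path in paths:
        match = RUN_FILE_RE.search(path)
        if match:
            files_by_run[match.group('run_version')].add(match.group('file_name'))
    return dict(files_by_run)


def runs_needing_error_report(files_by_run):
    """Runs that have validation errors which have not been reported yet."""
    return sorted(
        run for run, files in files_by_run.items()
        if 'validation_errors.json' in files and '_ERRORS_REPORTED' not in files
    )


# Stand-in for the output of a single wildcard ls.
listing = [
    '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/_ERRORS_REPORTED',
    '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/validation_errors.json',
    '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-14/validation_errors.json',
    '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/auto__2025-01-12/_SUCCESS',
]
grouped = group_files_by_run(listing)
pending = runs_needing_error_report(grouped)
```

With one listing in hand, each of the three checks becomes a set membership test instead of another ls.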

jklugherz · 2025-01-15T21:34:13Z

seqr/management/tests/check_for_new_samples_from_pipeline_tests.py


        local_files = [
+            '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/_ERRORS_REPORTED',


I added 3 new files in 2 run directories. The first directory has both _ERRORS_REPORTED and validation_errors.json and the second has just validation_errors.json.

for consistency we should update RUN_PATHS to align with these changes

jklugherz · 2025-01-15T21:35:37Z

seqr/management/commands/check_for_new_samples_from_pipeline.py

+    def _get_runs(self, **kwargs):
+        """ Returns two dictionaries:
+            - run_files, a mapping of the run directory to a set of filenames inside the directory
+            - run_args, a mapping of the run directory to a dict of args for that run


There's probably a better data structure to capture these mappings, but we're definitely only making one list_files/gsutil ls call now.

I think you could use a single structure with runs = defaultdict(lambda: {'files': set()}). Then where you are currently updating run_files and run_args you would do

runs[run_dirname]['files'].add(file_name)
runs[run_dirname].update(match_dict)

And then for example in _report_validation_errors the way you would use it is

for run_dir, run_details in runs.items():
    files = run_details['files']

this is perfect, thanks!
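For reference, a self-contained sketch of the single-structure approach from this thread. The regex, the collect_runs name, and the second file in the listing are assumptions for illustration; only the defaultdict shape and the two update lines come from the suggestion itself.

```python
import re
from collections import defaultdict

# Assumed path layout: .../<genome_version>/<dataset_type>/runs/<run_version>/<file_name>
RUN_PATH_RE = re.compile(
    r'/(?P<genome_version>[^/]+)/(?P<dataset_type>[^/]+)/runs/(?P<run_version>[^/]+)/(?P<file_name>[^/]+)$'
)


def collect_runs(paths):
    """Build one dict per run directory holding both its files and its parsed args."""
    runs = defaultdict(lambda: {'files': set()})
    for path in paths:
        match = RUN_PATH_RE.search(path)
        if not match:
            continue
        match_dict = match.groupdict()
        file_name = match_dict.pop('file_name')
        run_dirname = path.rsplit('/', 1)[0]
        runs[run_dirname]['files'].add(file_name)
        runs[run_dirname].update(match_dict)
    return runs


listing = [
    '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/_ERRORS_REPORTED',
    '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/validation_errors.json',
]
runs = collect_runs(listing)
run_dir = '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13'
for run_dirname, run_details in runs.items():
    files = run_details['files']
```

One thing to watch with this shape: the parsed args are merged into the same dict as 'files', so a regex group named files would overwrite the set.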

hanars · 2025-01-15T22:01:28Z

seqr/management/commands/check_for_new_samples_from_pipeline.py

 from collections import defaultdict
+from tempfile import TemporaryDirectory


this is no longer used

hanars · 2025-01-15T22:11:21Z

seqr/management/commands/check_for_new_samples_from_pipeline.py

+    def _get_runs(self, **kwargs):
+        """ Returns two dictionaries:
+            - run_files, a mapping of the run directory to a set of filenames inside the directory
+            - run_args, a mapping of the run directory to a dict of args for that run


I think you could use a single structure with runs = defaultdict(lambda: {'files': set()}). Then where you are currently updating run_files and run_args you would do

runs[run_dirname]['files'].add(file_name)
runs[run_dirname].update(match_dict)

And then for example in _report_validation_errors the way you would use it is

for run_dir, run_details in runs.items():
    files = run_details['files']

hanars · 2025-01-15T22:16:02Z

seqr/management/tests/check_for_new_samples_from_pipeline_tests.py


        local_files = [
+            '/seqr/seqr-hail-search-data/GRCh38/SNV_INDEL/runs/manual__2025-01-13/_ERRORS_REPORTED',


for consistency we should update RUN_PATHS to align with these changes

hanars · 2025-01-15T22:17:01Z

seqr/management/tests/check_for_new_samples_from_pipeline_tests.py

@@ -282,10 +287,14 @@ def _test_call(self, error_logs, reload_annotations_logs=None, run_loading_logs=
    @mock.patch('seqr.management.commands.check_for_new_samples_from_pipeline.MAX_LOOKUP_VARIANTS', 1)
    @mock.patch('seqr.views.utils.airtable_utils.BASE_URL', 'https://test-seqr.org/')
    @mock.patch('seqr.views.utils.airtable_utils.MAX_UPDATE_RECORDS', 2)
+    @mock.patch('seqr.views.utils.export_utils.os.makedirs')
+    @mock.patch('seqr.views.utils.export_utils.mv_file_to_gs')


Since we are already mocking the subprocess calls used for interacting with gs in this test case, it would be better not to mock this helper function and instead assert that the subprocess calls all look right, including the mv calls.
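A small sketch of what asserting on the subprocess call could look like, as opposed to mocking the helper. mv_file_to_gs_sketch is a hypothetical stand-in, and the exact gsutil command string, the Popen arguments, and the bucket path are assumptions; the real helper may build its command differently.

```python
import subprocess
from unittest import mock


def mv_file_to_gs_sketch(local_path, gs_path):
    """Hypothetical stand-in for a helper that shells out to gsutil."""
    process = subprocess.Popen(
        f'gsutil mv {local_path} {gs_path}',
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True,
    )
    return process.wait()


def run_with_mocked_subprocess():
    """Patch only the subprocess layer and assert on the command that was issued."""
    with mock.patch('subprocess.Popen') as mock_popen:
        mock_popen.return_value.wait.return_value = 0
        exit_code = mv_file_to_gs_sketch('/tmp/run/*', 'gs://example-bucket/runs/manual__2025-01-13/')
        # The helper itself runs for real; only the process launch is faked.
        mock_popen.assert_called_once_with(
            'gsutil mv /tmp/run/* gs://example-bucket/runs/manual__2025-01-13/',
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True,
        )
    return exit_code, mock_popen.call_args_list


exit_code, calls = run_with_mocked_subprocess()
```

Patching at the subprocess level keeps the helper's own logic under test, so a wrong destination path or a missing mv shows up as a failed assertion rather than being hidden behind the mock.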

jklugherz added 2 commits January 13, 2025 17:49

first pass report errors

23c3e25

kwargs

6171b25

jklugherz requested review from hanars and bpblanken January 14, 2025 20:13

hanars reviewed Jan 14, 2025

jklugherz added 4 commits January 15, 2025 14:02

make one ls call

5734083

write files

32e35ad

test cases cover most new code

282284c

mock file stuff

ace6f09

jklugherz commented Jan 15, 2025

jklugherz requested a review from hanars January 15, 2025 21:35

jklugherz marked this pull request as ready for review January 15, 2025 21:37

hanars reviewed Jan 15, 2025

review comments

04747a6

jklugherz requested a review from hanars January 17, 2025 18:09

hanars approved these changes Jan 17, 2025

jklugherz merged commit a5a8c9c into dev Jan 17, 2025
7 checks passed
