Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unlinking old job out/err files. #6533

Merged
merged 5 commits into from
Jan 8, 2025
Merged

Conversation

hjoliver
Copy link
Member

@hjoliver hjoliver commented Dec 20, 2024

Close #6529

diff --git a/cylc/flow/job_runner_mgr.py b/cylc/flow/job_runner_mgr.py
index b8ddaaa05..de7817170 100644
--- a/cylc/flow/job_runner_mgr.py
+++ b/cylc/flow/job_runner_mgr.py
@@ -568,8 +568,8 @@ class JobRunnerManager():
         # Create NN symbolic link, if necessary
         self._create_nn(job_file_path)
         for name in JOB_LOG_ERR, JOB_LOG_OUT:
-            with suppress(OSError):
-                os.unlink(os.path.join(job_file_path, name))
+            with suppress(FileNotFoundError):
+                os.unlink(os.path.join(os.path.dirname(job_file_path), name))

         # Start new status file

Suppressing OSError here hid a bug:

    NotADirectoryError: [Errno 20] Not a directory: '/home/oliverh/cylc-run/bug/run32/log/job/1/a/01/job/job.err'

This fixes the path error and just suppresses FileNotFoundError.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
    • (there are no reports of users being affected by this)
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.
    • (simple fix going directly for 8.4 for immediate release)

@hjoliver hjoliver added bug Something is wrong :( small labels Dec 20, 2024
@hjoliver hjoliver added this to the 8.4.0 milestone Dec 20, 2024
@hjoliver hjoliver self-assigned this Dec 20, 2024
@hjoliver hjoliver marked this pull request as ready for review January 7, 2025 02:01
@hjoliver hjoliver force-pushed the fix-job-runner-mgr branch from dc087dd to de7a921 Compare January 7, 2025 02:25
@hjoliver hjoliver requested a review from ColemanTom January 7, 2025 02:35
cylc/flow/job_runner_mgr.py Outdated Show resolved Hide resolved
@hjoliver
Copy link
Member Author

hjoliver commented Jan 7, 2025

(Note all tests passed, 100% patch coverage, before I committed some non-coding changes with skip-ci)

@MetRonnie
Copy link
Member

MetRonnie commented Jan 7, 2025

Actually, is the unlink really needed if it has been broken this whole time?

Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing beyond what @MetRonnie said.

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
@hjoliver
Copy link
Member Author

hjoliver commented Jan 7, 2025

Actually, is the unlink really needed if it has been broken this whole time?

I thought about this. Presumably the code was put in for a reason, and maybe it got broken during Cylc 8 development.

In Cylc 7 it was common to re-run from scratch in an existing run directory. Maybe a job.err file from the previous run could be left behind, if not overwritten by the new run, to contaminate the record.

In Cylc 8, it is still possible to force a re-run from scratch in an existing run dir (delete the .service dir first). I think we have considered it supporting it properly as an option (e.g. to make iterative workflow development a bit quicker and more contained), not sure where we got with that discussion.

However:

  1. the background job runner now automatically overwrites job.err, even if no stderr is generated
  2. Slurm jobs also generate an empty job.err file even if no stderr is generated

So, the unlinking code is pretty much useless for background and Slurm jobs. The latter result is presumably down to Slurm itself. I don't have access to other batch systems.

SUMMARY:

I suspect that sometime in the past we had a job runner that did not generate an empty job.err file if the job did not generate any stderr. And we didn't like having old logs contaminating the record after doing in-place re-runs. In Cylc 8 in-place re-runs are not officially supported, but they can be forced and we have considered supporting them.

Options for this PR:

  • delete the unlinking code, as it is now very (if not totally) unlikely to be needed
  • leave it in just in case, and add a clarifying comment

Does Cylc+PBS nuke existing job.err files? If so, maybe we just delete the code?

@dpmatthews
Copy link
Contributor

PBS does replace any existing out/err files but only when the job completes which can be confusing. More importantly, we rely on the existence of the job.out file to determine whether the job output is present when doing log retrieval so it's essential we remove any old files.

@ColemanTom
Copy link
Contributor

One question, is there a reason this is going to master and not 8.3.7? Is 8.3.7 not going to exist as 8.4.0 will soon?

@wxtim wxtim merged commit 874bf8d into cylc:master Jan 8, 2025
27 checks passed
@wxtim
Copy link
Member

wxtim commented Jan 8, 2025

One question, is there a reason this is going to master and not 8.3.7? Is 8.3.7 not going to exist as 8.4.0 will soon?

releasing 8.4 this week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :( small
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An unlink for JOB_LOG_ERR and JOB_LOG_OUT will always do nothing in job_runner_mgr.py
5 participants