Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kill tasks during job prep #6535

Draft
wants to merge 10 commits into
base: 8.4.x
Choose a base branch
from
Draft

Kill tasks during job prep #6535

wants to merge 10 commits into from

Conversation

MetRonnie
Copy link
Member

@MetRonnie MetRonnie commented Dec 30, 2024

Supersedes #5749, closes #5746

This PR allows cylc kill [and cylc remove] to kill preparing tasks, resulting in the submit-failed state.

Tested by deliberately extending job preparation like this (and kill tasks whilst still in the preparing state)

platform = $(sleep 10; echo localhost)

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • No dependency changes
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • No docs needed.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@MetRonnie MetRonnie added the bug? Not sure if this is a bug or not label Dec 30, 2024
@MetRonnie MetRonnie added this to the 8.4.1 milestone Dec 30, 2024
@MetRonnie MetRonnie requested a review from hjoliver December 30, 2024 15:41
@MetRonnie MetRonnie self-assigned this Dec 30, 2024
@hjoliver hjoliver mentioned this pull request Jan 5, 2025
8 tasks
Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested as working, with a task somewhere in the preparing state: the killed preparing task goes to submit-failed, and gets a new submit number if retriggered.

Definitely nicer than my hack, if it covers all the bases. Should this work at any stage of the "job preparation pipeline"? (I don't recall why I tried my other approach, but I think I was worried about this aspect for some reason ... hopefully unnecessarily worried).

One thing: presumably the killed task should get held as well, to prevent automatic retries (until released).

@MetRonnie MetRonnie marked this pull request as draft January 6, 2025 10:06
@MetRonnie MetRonnie changed the base branch from master to 8.4.x January 9, 2025 18:32
@MetRonnie MetRonnie marked this pull request as ready for review January 13, 2025 16:35
@oliver-sanders
Copy link
Member

This doesn't actually kill the job preparation (e.g. if killed during remote-init, we won't actually kill the remote init command itself). This is probably the best we can do as truly "recalling" a preparation operation is tricky and may have unwanted consequences (e.g. a partially remote-installed workflow).

The orphaned operation (e.g. remote-init) is still tracked by the subprocpool and may still be logged, example:

setup to reproduce
# global.cylc

[platforms]
  [[myplatform]]
    rsync command = whatevs
    job runner = background
# ~/bin/whatevs

sleep 20
exit 42
$ cylc vip myworkflow
$ sleep 2
$ cylc kill 'myworkflow//*'
INFO - [1/foo:waiting] => preparing
INFO - platform: myplatform - remote init (on myplatform)
INFO - platform: myplatform - remote file install (on myplatform)
INFO - Command "kill_tasks" received. ID=6b6c8363-3bc8-4c99-ba76-9e68a1cb4b5f
    kill_tasks(tasks=['1/foo'])
INFO - [1/foo/01:preparing] => preparing(held)
ERROR - [jobs-submit cmd] (killed in job prep)
    [jobs-submit ret_code] 1
CRITICAL - [1/foo/01:preparing(held)] submission failed
INFO - [1/foo/01:preparing(held)] => submit-failed(held)
WARNING - [1/foo/01:submit-failed(held)] did not complete the required outputs:
    ⨯ ┆  succeeded
INFO - Command "kill_tasks" actioned. ID=6b6c8363-3bc8-4c99-ba76-9e68a1cb4b5f
ERROR - Incomplete tasks:
    * 1/foo did not complete the required outputs:
      ⨯ ┆  succeeded
CRITICAL - Workflow stalled
WARNING - P3D stall timer starts NOW
INFO - testing connectivity for myplatform using ['ssh', '-oBatchMode=yes', '-oConnectTimeout=8', '-oStrictHostKeyChecking=no', 'myplatform', 'true']
ERROR - platform: myplatform - initialisation did not complete
    COMMAND:
        meme --delete \
            --rsh=ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no \
            --include=/.service/ --include=/.service/server.key -a \
            --checksum --out-format=%o %n%L --no-t --exclude=log \
            --exclude=share --exclude=work --include=/ana/*** \
            --include=/app/*** --include=/bin/*** --include=/etc/*** \
            --include=/lib/*** --exclude=* \
            /home/users/oliver.sanders/cylc-run/tmp.d6SQ6gMhfZ/run11/ \
            myplatform:cylc-run/tmp.d6SQ6gMhfZ/run11/
    RETURN CODE:
        42

Strangely, in this situation, if I trigger a downstream task, it goes straight to submit-failed with no attempt to re-run the failed remote-fileinstall operation (bug). But if I re-trigger the downstream task a second time, then it repeats both the remote-init and remote-fileinstall (bug?)!?

Should this work at any stage of the "job preparation pipeline"

From testing it appears to abort at any stage 👍 (though I'm not entirely sure how this is guaranteed).

We could do with some tests to ensure that this works at each stage and that the following stages are correctly canceled:

  • host-select
  • bash syntax check
  • remote-init
  • remote-fileinstall

Hopefully easy to do with integration tests where we can mock the results of each stage of the process.

@MetRonnie
Copy link
Member Author

Strictly speaking this "kills the task" but doesn't kill any preparation process it has started.

I think it works by setting itask.waiting_on_job_prep = False so that the task will no longer submit a job once the orphaned preparation process finishes. And _prep_submit_task_job_error() creates a dummy submit-failed job.

I'll have a look at testing more thoroughly.

@hjoliver
Copy link
Member

My approach, on the superseded PR, was to set a new flag on the preparing task and then, based on that flag, abort job submission once prep had completed. There must have been a reason why I did not just use itask.waiting_on_job_prep = False. Unfortunately i don't recall. Doesn't that flag actually affect the job prep pipeline?

@MetRonnie MetRonnie marked this pull request as draft January 15, 2025 14:11
@MetRonnie
Copy link
Member Author

The bash syntax check is blocking (not using the subproc pool) so I don't think it matters?

if check_syntax:
try:
with Popen( # nosec
['/usr/bin/env', 'bash', '-n', tmp_name],
stderr=PIPE,
stdin=DEVNULL,
text=True
# * the purpose of this is to evaluate user defined code
# prior to it being executed
) as proc:
if proc.wait():
# This will leave behind the temporary file,
# which is useful for debugging syntax errors, etc.
raise RuntimeError(proc.communicate()[1])

@@ -346,7 +346,7 @@ def submit_livelike_task_jobs(
bc_mgr = self.task_events_mgr.broadcast_mgr
rtconf = bc_mgr.get_updated_rtconfig(itask)
try:
platform = get_platform(
platform = get_platform( # type: ignore[assignment]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in #6564

@@ -1278,7 +1278,7 @@ def _prep_submit_task_job(
workflow, itask, '(platform not defined)', exc)
return False
else:
itask.platform = platform
itask.platform = platform # type: ignore[assignment]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in #6564

Comment on lines +107 to +112
async def test_kill_preparing_pipeline(
flow, scheduler, run, monkeypatch: pytest.MonkeyPatch
):
"""Test killing a preparing task through various stages of the preparing
pipeline that involve submitting subprocesses and waiting for them to
complete."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oliver-sanders Is this what you were after?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug? Not sure if this is a bug or not
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kill should abort preparing tasks
3 participants