Job status query fails if slurm accounting storage is disabled #38

Open
prs513rosewood opened this issue Feb 23, 2024 · 4 comments

Comments

@prs513rosewood

I get an error when running a job on a Slurm instance whose accounting storage is disabled (i.e. the sacct command just replies Slurm accounting storage is disabled). Here's the stack trace:

The job status query failed with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-02-21T16:00 --endtime now --name ddcf5013-8004-418b-8832-9a563aaf5280
Error message: Slurm accounting storage is disabled

argument of type 'NoneType' is not iterable
Traceback (most recent call last):
  File "/home/rosewood/venvs/aging_stearic/lib/python3.11/site-packages/snakemake_interface_executor_plugins/executors/remote.py", line 190, in _wait_thread
    asyncio.run(self._wait_for_jobs())
  File "/home/rosewood/stage/spack-0.21.1/opt/spack/linux-centos7-k10/gcc-8.3.1/python-3.11.6-aunbzdzafawzwjwh4wtq45ftn2zjmnzw/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/rosewood/stage/spack-0.21.1/opt/spack/linux-centos7-k10/gcc-8.3.1/python-3.11.6-aunbzdzafawzwjwh4wtq45ftn2zjmnzw/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rosewood/stage/spack-0.21.1/opt/spack/linux-centos7-k10/gcc-8.3.1/python-3.11.6-aunbzdzafawzwjwh4wtq45ftn2zjmnzw/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/rosewood/venvs/aging_stearic/lib/python3.11/site-packages/snakemake_interface_executor_plugins/executors/remote.py", line 180, in _wait_for_jobs
    still_active_jobs = [
                        ^
  File "/home/rosewood/venvs/aging_stearic/lib/python3.11/site-packages/snakemake_interface_executor_plugins/executors/remote.py", line 180, in <listcomp>
    still_active_jobs = [
                        ^
  File "/home/rosewood/venvs/aging_stearic/lib/python3.11/site-packages/snakemake_executor_plugin_slurm/__init__.py", line 263, in check_active_jobs
    if j.external_jobid not in status_of_jobs:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable

Looks like there's some error handling here:

if status_of_jobs is None and sacct_query_duration is None:

However, once the loop over attempts to get the job status ends, the rest of the code assumes the query succeeded and treats status_of_jobs as a valid set, even though it can still be None.
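
A minimal sketch of the kind of guard that seems to be missing, assuming status_of_jobs is either None (every query attempt failed) or a mapping from job id to state. The function name split_active_jobs and the plain-string job ids are hypothetical and only illustrate the idea; this is not the plugin's actual code:

```python
from typing import Iterable, Optional


def split_active_jobs(
    job_ids: Iterable[str],
    status_of_jobs: Optional[dict[str, str]],
) -> tuple[list[str], list[str]]:
    """Return (known, unknown) job ids without assuming the query succeeded."""
    job_ids = list(job_ids)
    if status_of_jobs is None:
        # Every status query failed (e.g. accounting storage is disabled):
        # report all jobs as unknown instead of testing membership on None.
        return [], job_ids
    known = [j for j in job_ids if j in status_of_jobs]
    unknown = [j for j in job_ids if j not in status_of_jobs]
    return known, unknown
```

With a check like this, the caller can decide what to do with the unknown jobs (retry later, or report a clear error) rather than failing with a TypeError.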

The Slurm profile also uses sacct but falls back to scontrol if that fails, which might be a solution: https://github.com/Snakemake-Profiles/slurm/blob/c44315217d1ce36493dc7dccbd013528657747f9/%7B%7Bcookiecutter.profile_name%7D%7D/slurm-status.py#L40
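
For illustration, a rough sketch of what such an scontrol fallback could look like. It assumes that `scontrol -o show job <id>` prints a line containing a `JobState=...` field; the helper names parse_job_state and query_state_with_scontrol are made up here and are not taken from the profile or the plugin:

```python
import re
import subprocess
from typing import Optional


def parse_job_state(scontrol_output: str) -> Optional[str]:
    """Extract the JobState value from `scontrol show job` output, if present."""
    match = re.search(r"\bJobState=(\S+)", scontrol_output)
    return match.group(1) if match else None


def query_state_with_scontrol(job_id: str) -> Optional[str]:
    """Ask the controller directly; return None if the query cannot be made."""
    try:
        result = subprocess.run(
            ["scontrol", "-o", "show", "job", job_id],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # scontrol is not installed on this machine.
        return None
    if result.returncode != 0:
        return None
    return parse_job_state(result.stdout)
```

One caveat: scontrol only knows about a job for a limited time after it finishes, so a fallback like this needs to poll often enough, and it issues one call per job instead of a single query for all of them.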

@cmeesters
Member

Thank you for this report. We definitely need to update the error message!

We had the fallback in the executor, but decided to drop it to be able to check the states in asynchronous mode with one command. A cluster without accounting db is pretty unusual. Re-introducing the fallback might not be so easy.
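
To illustrate the "one command" approach: the sacct call from the report (`-X --parsable2 --noheader --format=JobIdRaw,State`) returns one pipe-separated line per job, so a single query can cover all jobs at once. A minimal parsing sketch, assuming that output format; the function name parse_sacct_output is hypothetical and not the plugin's actual code:

```python
def parse_sacct_output(output: str) -> dict[str, str]:
    """Map job id to state from `sacct --parsable2 --noheader` output."""
    statuses: dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        if "|" not in line:
            # Skip blank lines and anything that is not an id|state record.
            continue
        job_id, state = line.split("|", 1)
        words = state.split()
        # States such as "CANCELLED by <uid>" carry a suffix; keep the first word.
        statuses[job_id] = words[0] if words else ""
    return statuses
```

If the output contains no records at all, for example because accounting is disabled, the result is an empty dict rather than None, which keeps later membership tests safe.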

Is your particular cluster in an experimental stage?

@prs513rosewood
Author

prs513rosewood commented Feb 26, 2024

Thanks for looking at this, I know this is a weird edge case. The cluster in question is somewhat artisanal.

I think the slurm cluster profile may be a workable fallback for me. And it looks like 6a197ae fixes the issue of status_of_jobs being invalid.

@cmeesters
Member

> I think the slurm cluster profile may be a workable fallback for me.

Perhaps. Then again, you might want to use storage plugins and/or other plugins. That would be a mess. Is there any chance your admins set up the cluster ... eh, properly?

@hmkim

hmkim commented Sep 23, 2024

As you mentioned, Slurm accounting is sometimes not enabled on non-production clusters. Is the way to support this to run the jobs through the cluster-generic plugin, is that correct?


 snakemake \
         --executor cluster-generic \
         --cluster-generic-submit-cmd 'sbatch --job-name={rule} --partition={my_partion_name} --cpus-per-task={threads} --export=ALL --chdir=$PWD' \
         --jobs unlimited
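
With a setup like this, job states would probably have to come from a separate status script. As far as I understand, the cluster-generic plugin accepts one via --cluster-generic-status-cmd and expects it to print success, running or failed; it would also need sbatch to print only the job id (for example with --parsable). Below is a small sketch of the state mapping such a script could use. The function name to_snakemake_status and the choice of which Slurm states count as still running are assumptions, not something taken from the plugin:

```python
# Slurm states treated as "still in progress" (an assumption for this sketch).
RUNNING_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"}


def to_snakemake_status(slurm_state: str) -> str:
    """Translate a Slurm job state into success, running or failed."""
    words = slurm_state.split()
    # Drop suffixes such as "by <uid>" and a trailing "+" on truncated names.
    state = words[0].rstrip("+") if words else ""
    if state == "COMPLETED":
        return "success"
    if state in RUNNING_STATES:
        return "running"
    # Everything else (FAILED, TIMEOUT, CANCELLED, unknown) counts as failed.
    return "failed"
```

Treating unknown states as failed is a deliberate but debatable choice: it avoids waiting forever on a job whose state cannot be interpreted.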
