[BUG] slurm_par example only spins up 1 node instead of 2 #429

Open
bgunnar5 opened this issue Jun 29, 2023 · 0 comments
Labels
bug Something isn't working

Comments

@bgunnar5
Member

Bug Report

Description
When running the slurm_par example, the runs step fails for every sample because of a Slurm allocation issue. Each generated runs.slurm.err file contains the following error: srun: error: Only allocated 1 nodes asked for 2.

To Reproduce
Steps to reproduce the behavior:

  1. Pull the slurm_par example with merlin example slurm_par
  2. Cd into the slurm/ directory
  3. Queue the tasks with merlin run slurm_par.yaml
  4. Run the workers with merlin run-workers slurm_par.yaml
  5. When the run finishes, open runs/00/runs.slurm.err in the output directory to see the error
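
To check every sample at once instead of opening runs/00/runs.slurm.err by hand, a small script along these lines could scan the output directory. This is only a sketch: the names find_allocation_errors and ALLOC_ERROR are hypothetical and not part of merlin, and it assumes each sample directory holds a runs.slurm.err file as described in step 5.

```python
import re
from pathlib import Path

# Pattern for the srun message quoted in this report (hypothetical helper, not part of merlin).
ALLOC_ERROR = re.compile(r"srun: error: Only allocated (\d+) nodes asked for (\d+)")


def find_allocation_errors(output_dir):
    """Return (path, allocated, requested) for each runs.slurm.err containing the srun error."""
    hits = []
    # Assumption: error files sit somewhere below output_dir, e.g. runs/00/runs.slurm.err.
    for err_file in sorted(Path(output_dir).rglob("runs.slurm.err")):
        text = err_file.read_text(errors="replace")
        match = ALLOC_ERROR.search(text)
        if match:
            hits.append((err_file, int(match.group(1)), int(match.group(2))))
    return hits
```

Calling find_allocation_errors on the study's output directory returns one tuple per affected sample, so an empty list would mean no sample hit the allocation error.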

Expected behavior
Slurm should allocate two nodes for the runs step.

Please answer these questions to help us pinpoint the problem

  • Does the problem occur in merlin run --local mode, distributed mode or neither? Distributed
  • If a distributed problem, which backend and queue servers are you using? How are they configured? Broker is rabbitmq, results backend is redis. Configured through LaunchIT
  • On what machines/architectures are you running merlin? Is this bug on a specific machine or can you reproduce it elsewhere? rztopaz and reproduced on ruby

Additional context
Bug found by Casey Lamarche

@bgunnar5 bgunnar5 added the bug Something isn't working label Jun 29, 2023