Parallel inference #108
base: develop
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##           develop     #108   +/-   ##
=========================================
  Coverage    98.03%    98.03%
=========================================
  Files            3         3
  Lines           51        51
=========================================
  Hits            50        50
  Misses           1         1
=========================================
src/anemoi/inference/parallel.py (Outdated)
def init_network():
    """Reads Slurm environment to set master address and port for parallel communication"""
Here's one thing I wonder about in general -- say you have Anemoi inference inside a Slurm job which happens to do other things as well, e.g. run anemoi inference first, intended in non-distributed mode, and then some postprocessing in parallel. That makes the SLURM_<...> variables visible to the anemoi code, which is not expected to heed them. Wouldn't that cause misbehaviour, as in, make anemoi inference think it is running in distributed mode when it actually shouldn't be?
ftr, it's not a hypothetical question, I do exactly that in Cascade :) And we have two layers of the problem:
- how to allow Anemoi parallel inference as well as Anemoi-in-Cascade non-parallel, in a mutually non-disruptive way
- how to allow Anemoi parallel inference in Cascade, that is, slurm-within-slurm :)

For the first one, it may be easiest to use custom env vars (ANEMOI_NODELIST, ANEMOI_JOBID, ...), and have the example bash submit script set them from Slurm's vars. And for the second one... well, a more profound deliberation is needed, but let's just keep it in mind for now and solve it later.
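A rough illustration of that first idea is below; the `ANEMOI_*` variable names come from the suggestion above and are hypothetical, as is the helper itself. The submit script would then do something like exporting `ANEMOI_NODELIST=$SLURM_NODELIST`, so parallel mode stays an explicit opt-in rather than being inferred from Slurm's own variables.

```python
import os


def get_anemoi_parallel_env():
    """Look only at hypothetical ANEMOI_* variables, never at SLURM_* directly,
    so a Slurm job that merely contains anemoi-inference is not mistaken for a
    request to run distributed."""
    nodelist = os.environ.get("ANEMOI_NODELIST")
    jobid = os.environ.get("ANEMOI_JOBID")
    if nodelist is None or jobid is None:
        return None  # stay in plain single-process mode
    return {"nodelist": nodelist, "jobid": jobid}
```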
> making anemoi inference think it is running in dist mode when it actually shouldn't be?

Very good point.

I am reluctant to add more env vars. I think a `num-gpus` entry in the anemoi-inference config, which defaults to 1, would suffice. But yeah, more thought is needed here.
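As a sketch of that gating idea (the `num_gpus` key and the config access shown are hypothetical, defaulting to 1 as suggested above):

```python
def wants_parallel(config: dict) -> bool:
    """Parallel inference is enabled only when the config explicitly asks for more than one GPU."""
    return int(config.get("num_gpus", 1)) > 1
```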
@tmi I believe the first point is resolved now that parallelism is disabled by default.
I'm a bit concerned about adding the parallel stuff to the default Runner. I would rather create a separate ParallelRunner for it.
In develop we now have the option to instantiate runners from the config, so you would explicitly select it when you want to do parallel inference.
Opinions on factoring out this code into a new runner?
- Add a `Runner.log()` that wraps the logger.
- Factor out the output block into a `Runner.output()`.
- Then, in the ParallelRunner overrides of those functions, check the rank before calling `super()` (a sketch of this shape follows below).
- Passing of the `model_comm_group` can be done in the same way as we did for the CRPS runner that's now in develop (see my other comment about predict_step).
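Purely as an illustration of that shape (the method names follow the suggestion above; the real Runner API may look different), assuming the rank is known when the runner is constructed:

```python
import logging

LOG = logging.getLogger(__name__)


class Runner:
    """Base runner (heavily simplified for illustration)."""

    def log(self, msg, *args):
        # Thin wrapper so subclasses can decide whether to emit the message.
        LOG.info(msg, *args)

    def output(self, state):
        # The output block factored out of forecast(), so subclasses can override it.
        ...


class ParallelRunner(Runner):
    def __init__(self, global_rank: int):
        self.global_rank = global_rank

    def log(self, msg, *args):
        # Only rank 0 logs at info level, avoiding duplicated output across ranks.
        if self.global_rank == 0:
            super().log(msg, *args)

    def output(self, state):
        # Only rank 0 writes results; the other ranks just take part in the compute.
        if self.global_rank == 0:
            super().output(state)
```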
Thanks Cathal, this looks very interesting. I had a question about the Slurm dependency to launch the parallel processes. Do you mean it will only work using Slurm, or could this also work on AWS?

Hi @jswijnands. Yeah, currently you would need Slurm to get the `srun` program. Whether or not this works on AWS would depend on your exact setup. If you are using AWS ParallelCluster you could get Slurm on that cluster. It would be nice to get more details about your setup to make sure it is supported. Could you send me an email (cathal.obrien@ecmwf.int) or message me on Slack?
nice work!
src/anemoi/inference/runner.py (Outdated)

# Detach tensor and squeeze (should we detach here?)
output = np.squeeze(y_pred.cpu().numpy())  # shape: (values, variables)
if global_rank == 0:
pls see offline discussion :-)
src/anemoi/inference/runner.py (Outdated)

@@ -239,25 +241,39 @@ def model(self):
        return model

    def forecast(self, lead_time, input_tensor_numpy, input_state):

        # determine the process's rank for parallel inference and assign a device
        global_rank, local_rank, world_size = get_parallel_info()
In training we seed the whole parallel model group with the same random seed. I think it would be nice to do the same here as well. The challenge is to generate appropriate seeds...
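One possible approach, sketched here under the assumption that a `torch.distributed` process group is already initialised (this mirrors the training-side idea rather than any existing inference code):

```python
import torch
import torch.distributed as dist


def broadcast_seed() -> int:
    """Draw a seed on rank 0 and broadcast it so every rank seeds its RNGs identically."""
    # With an NCCL backend this tensor would need to live on the local GPU;
    # a CPU tensor is fine for gloo.
    seed = torch.empty(1, dtype=torch.int64)
    if dist.get_rank() == 0:
        seed.random_()  # any entropy source would do
    dist.broadcast(seed, src=0)
    torch.manual_seed(int(seed.item()))
    return int(seed.item())
```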
All the parallel code has now been refactored into its own parallel runner class, and parallelism is no longer automatic: you must add `runner: parallel` to the inference config file (and launch with srun). Duplicated logging is mostly gone now; logging for non-zero ranks is reduced to warnings and errors only.
# Use subprocess to execute scontrol and get the first hostname
result = subprocess.run(
    ["scontrol", "show", "hostname", slurm_nodelist], stdout=subprocess.PIPE, text=True, check=True
)
`scontrol` doesn't seem to work inside a container (exit code 127), and it also increases the dependency on Slurm, so I will try to find an alternative.
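For what it's worth, one Slurm-free alternative would be to parse `SLURM_NODELIST` directly. The helper below is only a best-effort sketch covering the common compressed formats, not something from this PR:

```python
import re


def first_slurm_hostname(nodelist: str) -> str:
    """Best-effort parse of SLURM_NODELIST to get the first hostname without calling scontrol.

    Handles simple cases like "node1,node2" and "node[001-004,007]"; more exotic
    compressed nodelists would need a fuller parser.
    """
    # Split on commas that are *outside* brackets and keep the first entry.
    first = re.split(r",(?![^[]*\])", nodelist)[0]
    match = re.match(r"(.*)\[([0-9]+)", first)
    if match:
        prefix, number = match.groups()
        return f"{prefix}{number}"
    return first
```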
Great job on the parallel runner, LGTM!
Future runners that need both single and parallel functionality will cause some difficulties, but let's tackle that when it comes.
Do you want to add a small entry to the docs on parallel inference, with the example job script in there too?
It would be good to also have a standalone mode without slurm, where the runner spawns its own subprocesses. That would be very useful for debugging and running in the cloud. But I would do that in a follow-up PR.
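For reference, such a standalone mode could look roughly like the following (a sketch only: the worker body, port choice and entry point are all assumptions). The runner would spawn one process per GPU itself and rendezvous on localhost instead of relying on srun.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int):
    """Entry point for each spawned process; no Slurm involved."""
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    # ... run the (hypothetical) parallel forecast loop here ...
    dist.destroy_process_group()


def run_standalone(world_size: int):
    # Spawn world_size local processes; each one calls _worker(rank, world_size).
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```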
@@ -244,18 +244,15 @@ def predict_step(self, model, input_tensor_torch, fcstep, **kwargs):
        return model.predict_step(input_tensor_torch)

    def forecast(self, lead_time, input_tensor_numpy, input_state):
        self.model.eval()
Why remove this?
This PR allows you to run inference across multiple GPUs and nodes.
Parallel inference relies on PR77 to anemoi-models. Running sequentially with an older version of models will still work; trying to run in parallel with an older version will prompt you to upgrade your anemoi-models.
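One way such a prompt could be implemented is sketched below; using `inspect.signature` to detect whether the installed anemoi-models `predict_step` accepts a `model_comm_group` argument is an assumption for illustration, not necessarily the mechanism used in this PR.

```python
import inspect


def check_parallel_support(model) -> None:
    """Fail early with a helpful message if the installed anemoi-models predates parallel support."""
    params = inspect.signature(model.predict_step).parameters
    if "model_comm_group" not in params:
        raise RuntimeError(
            "Parallel inference needs a newer anemoi-models (with PR77 included); "
            "please upgrade, or run sequentially."
        )
```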
When a newer version of models is released with PR77 included, it might be worthwhile bumping the minimum version required by anemoi-inference.
Currently Slurm is required to launch the parallel processes, and some Slurm env vars are read to set up the networking. Below is an example Slurm batch script.

One QOL feature would be to make only process 0 log. Currently you get a lot of spam when running in parallel because each process logs. Any ideas on how to do this nicely would be great.
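One common pattern, sketched here under the assumption that the process rank is known at start-up, is to raise the logging level on all non-zero ranks so only rank 0 emits info-level messages (warnings and errors still get through everywhere, which matches what the refactored runner now does):

```python
import logging


def configure_rank_logging(global_rank: int) -> None:
    """Silence routine log output on all ranks except rank 0."""
    if global_rank != 0:
        logging.getLogger().setLevel(logging.WARNING)
```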