-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] RL multinode #152
base: main
Are you sure you want to change the base?
[WIP] RL multinode #152
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff! Please see my comments below.
@@ -69,6 +69,11 @@ def __init__( | |||
self.stderr_file: Optional[TextIO] = None | |||
self.stats = {} | |||
|
|||
# Add node rank awareness | |||
self.node_rank = int(os.environ.get("RANK", 0)) | |||
self.port_offset = self.node_rank * 1000 # Ensure different port ranges for each node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why different port ranges for each node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically we should be able to update the sync port if the one selected by the toolkit environment is being used. But that's not true and I found out that the reason for those clashes is the subprocess subshell instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand. These vllms are running on different nodes, aren't they?
examples/rl_gsm8k/utils.py
Outdated
# Check GPU availability | ||
num_gpus = torch.cuda.device_count() | ||
print('###############################') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use logger.info, and maybe smth like messages, "I'm rank X, training on Y GPU"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's just sanity checking, it will be deleted soon. It was there to check that that number is the same as Num processes
, but things are clear now.
"--deepspeed_multinode_launcher", | ||
"standard", | ||
"--same_network", | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would training multi-node training without accelerate work without DeepSpeed? If not, should add an exception.
run = None | ||
|
||
def safe_wandb_log(metrics, step): | ||
if run is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just check the rank there and nothing unless it is 0?
Refactoring to mitigate NCCL timeouts and handle multi-node rollout generation |
No description provided.