[WIP] RL multinode #152
base: main
Can we just check the rank there and do nothing unless it is 0?
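A minimal sketch of the guard being suggested, assuming the rank is exposed through the conventional `RANK` environment variable set by torchrun/accelerate (the actual rank source in this codebase may differ):

```python
import os

def run_on_rank_zero_only(fn):
    """Hypothetical sketch of the suggested guard: read the rank from the
    conventional RANK environment variable (an assumption, not taken from
    this PR) and skip the call on every rank except 0."""
    rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        return None
    return fn()
```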
Why different port ranges for each node?
Theoretically we should be able to update the sync port when the one selected by the toolkit environment is already in use. But that didn't hold in practice; I found that the actual cause of those clashes is the subprocess subshell.
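For context, the kind of sync-port fallback being discussed might look like this sketch; `find_free_port` is a hypothetical helper, not code from this PR:

```python
import socket

def find_free_port(start: int, end: int) -> int:
    """Hypothetical fallback: scan a port range and return the first port
    that can actually be bound, for when the preselected sync port is
    already taken."""
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))  # bind succeeds only if the port is free
                return port
            except OSError:
                continue
    raise RuntimeError(f"no free port in [{start}, {end})")
```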
I'm not sure I understand. These vLLM instances are running on different nodes, aren't they?
Please use logger.info, and maybe a message like "I'm rank X, training on GPU Y".
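Something along these lines, where `rank` and `gpu_ids` are placeholders for whatever the launcher actually provides:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

rank = 0        # placeholder: would come from the launcher (e.g. RANK)
gpu_ids = "0"   # placeholder: the GPU(s) assigned to this process

logger.info("I'm rank %d, training on GPU %s", rank, gpu_ids)
```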
That's just sanity checking; it will be deleted soon. It was there to verify that that number is the same as "Num processes", but things are clear now.
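The temporary check described here might look roughly like this sketch, assuming torch.distributed is initialized; the parameter name is illustrative, not from the PR:

```python
import torch.distributed as dist

def check_world_size(expected_num_processes: int) -> None:
    """Hypothetical version of the temporary sanity check: confirm the
    distributed world size matches accelerate's reported "Num processes".
    Requires an initialized process group."""
    world_size = dist.get_world_size()
    if world_size != expected_num_processes:
        raise RuntimeError(
            f"world size {world_size} != expected {expected_num_processes}"
        )
```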
Would multi-node training without accelerate work without DeepSpeed? If not, we should add an exception.
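If that combination is indeed unsupported, the fail-fast check being suggested could be as simple as the following sketch; argument names are illustrative, not from the PR:

```python
def validate_launch_config(num_nodes: int, use_deepspeed: bool) -> None:
    """Hypothetical guard per the suggestion: reject multi-node runs
    without DeepSpeed, assuming that combination is unsupported."""
    if num_nodes > 1 and not use_deepspeed:
        raise ValueError(
            "Multi-node training currently requires DeepSpeed; enable "
            "DeepSpeed or run on a single node."
        )
```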