Provide a NCCL-like initialization mechanism #705
Comments
@manjugv FYI.
We discussed this in our WG. When the user does not provide an OOB, the implementation needs its own Allgather, and that feature is currently missing. Adding it requires a lot of developer cycles, so we are trying to figure out an easy way to implement it. :)
We don't expose UCP/UCX at the interface level because that would hurt portability. The UCC interfaces are agnostic of UCX.
I thought it might be possible to leverage the internal_oob or the internal team implementation for that.
I didn't mean to expose UCP/UCX at the interface level, but to use an opaque structure.
Some more context on why the current interface is not a good fit for Legion/Legate. The current UCC context creation API requires an out-of-band allgather callback. The obvious way to implement this is using … In the general case it's not possible to provide an allgather callback (using purely Legion primitives) that is guaranteed to work in every context.

Typically in a Legion application a single controller thread manages all the resources in a node and spawns a task per GPU/core to take part in communicator initialization. At the point where these tasks are running, they are no longer able to pass data to each other (that's what we're trying to initialize a communicator to do in the first place). It might be possible to work around this using lower-level primitives, but that would start to veer off the "blessed" Legion path.

The NCCL model, where a single rank produces a value that the calling code (externally to NCCL) broadcasts to all other ranks, fits much more easily within Legion's task model. At the very least we need to be in control of the out-of-band communication; having to provide a callback is what gets us into trouble. We could possibly build an allgather on top of Legion primitives, but we would need to be in charge of invoking it.
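To make the "we need to be in charge of invoking it" point concrete, here is a minimal sketch of a two-phase initialization in which the library never calls back into user code. Everything here is hypothetical: `PendingEndpoint`, `begin_init`, `finish` and `controller_allgather` are illustrative names, not UCC or Legion APIs, and the "addresses" are dummy byte strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingEndpoint:
    """Hypothetical stand-in for a half-initialized communicator endpoint."""
    rank: int
    local_address: bytes
    peers: Optional[List[bytes]] = None

    def finish(self, all_addresses: List[bytes]) -> None:
        # Phase 2: the caller hands back the addresses it gathered itself.
        if all_addresses[self.rank] != self.local_address:
            raise ValueError("gathered addresses do not match this rank")
        self.peers = list(all_addresses)


def begin_init(rank: int) -> PendingEndpoint:
    # Phase 1: purely local work; no communication and no callback.
    return PendingEndpoint(rank=rank, local_address=f"addr-of-rank-{rank}".encode())


def controller_allgather(endpoints: List[PendingEndpoint]) -> List[bytes]:
    # The controller (not the library) performs the exchange, using whatever
    # mechanism the runtime allows. Here it is just a list, ordered by rank.
    return [ep.local_address for ep in sorted(endpoints, key=lambda e: e.rank)]


nranks = 4
endpoints = [begin_init(r) for r in range(nranks)]   # one per task
gathered = controller_allgather(endpoints)           # caller-driven exchange
for ep in endpoints:
    ep.finish(gathered)                              # complete initialization
```

The difference from a callback-based design is only who drives the exchange: the gather happens between two library calls, at a point the application chooses, rather than inside one.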
The existing OOB initialization may block some users from adopting UCC.
I suggest adding another initialization alternative that lets users query a unique identifier (ucp_address?) and perform the communication themselves.
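For reference, the flow I have in mind mirrors NCCL's `ncclGetUniqueId` / `ncclCommInitRank` pair. Below is a rough in-process simulation of that shape, with threads standing in for ranks. The function names `get_unique_id` and `comm_init_rank` and the rendezvous table are assumptions made for illustration only; they are not part of UCC, and a real implementation would presumably encode a bootstrap address in the identifier rather than use a shared dictionary.

```python
import secrets
import threading

_lock = threading.Lock()
_rendezvous = {}  # uid -> (barrier, members); hypothetical bootstrap state


def get_unique_id() -> bytes:
    # Called by exactly one rank; the result is opaque to the caller.
    return secrets.token_bytes(16)


def comm_init_rank(uid: bytes, rank: int, nranks: int) -> dict:
    # Every rank calls this with the same uid, received from the caller's
    # own broadcast. No user-supplied callback is involved.
    with _lock:
        barrier, members = _rendezvous.setdefault(
            uid, (threading.Barrier(nranks), {})
        )
        members[rank] = f"addr-of-rank-{rank}"
    barrier.wait(timeout=10)  # returns once all nranks have registered
    return {"rank": rank, "peers": [members[r] for r in range(nranks)]}


nranks = 4
uid = get_unique_id()        # produced once, e.g. by the controller
results = [None] * nranks


def task(rank: int, uid_copy: bytes) -> None:
    # uid_copy represents the value the application broadcast to this rank.
    results[rank] = comm_init_rank(uid_copy, rank, nranks)


threads = [threading.Thread(target=task, args=(r, bytes(uid))) for r in range(nranks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The only thing the application has to move between ranks is a small opaque blob, and it can do so with whatever mechanism it already has.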