PMI2 support assumes it initializes job resources #928
Comments
Thanks for the report @benson31. Yeah, SOS only supports MPI+OpenSHMEM via the You might be our first MPI+OpenSHMEM user, so I'm curious about your use-case if you can say more... Our developer resources can be pretty limited for solving these sorts of problems. The PMI over MPI solution was actually an intern's summer project. I don't see a straightforward way to get Also, manually setting |
@davidozog I haven't had a chance to clean it up at all, so please forgive the debugging-centric nature of the code, but this patch worked for me:

    if (!PMI2_Initialized()) {
        // As you have it already
    }
    else
    {
        int flag, pmi_ret;
        char size_str[128];
        pmi_ret = PMI2_Info_GetJobAttr("universeSize", size_str, 128, &flag);
        if (PMI2_SUCCESS == pmi_ret)
        {
            if (flag)
                size = atoi(size_str);
            else
            {
                printf("No universeSize; the docs say this is predefined, so something's wrong.\n");
            }
        }
        else
        {
            printf("PMI2_Info_GetJobAttr(...) failed with error code: %d\n", pmi_ret);
        }
        pmi_ret = PMI2_Job_GetRank(&rank);
        if (PMI2_SUCCESS != pmi_ret)
            printf("ERROR: PMI2_Job_GetRank failed with error code: %d\n", pmi_ret);
    }

I was able to run our test (a simple "hello world" from all PEs) using PMI2 with this patch in place. I didn't try to run any of the other tests since that was sufficient for the person who asked me for help with this (and none of them have MPI anyway). It would obviously need to be cleaned up to match your error-handling format, etc., but I didn't have time to dig into that.

Re: the use-case. I work on a large-scale machine learning code, LBANN, and we are experimenting with the OpenSHMEM model for certain operations. The current use-case comes from @timmoon10 and the relevant code block is here.

Re: flags. When building with |
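The patch above converts the `universeSize` string with `atoi`, which cannot distinguish a malformed value from a valid one. As a minimal sketch of a stricter conversion, the helper below uses `strtol` instead. The name `parse_job_size` is hypothetical (it is not part of SOS or PMI2), and treating non-positive values as errors is an assumption about what a job size should look like.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not part of SOS or PMI2): convert a job-attribute
 * string, such as the value returned for "universeSize", into a positive int.
 * Returns 0 on success and stores the result in *out. Returns -1 if the
 * string is empty, has no digits, has trailing characters, is not positive,
 * or does not fit in an int. */
static int parse_job_size(const char *str, int *out)
{
    char *end = NULL;
    long value;

    if (str == NULL || out == NULL || *str == '\0')
        return -1;

    errno = 0;
    value = strtol(str, &end, 10);

    /* Reject overflow, strings without digits, and trailing garbage. */
    if (errno == ERANGE || end == str || *end != '\0')
        return -1;

    /* Assumption: a job size must be at least 1 and fit in an int. */
    if (value <= 0 || value > INT_MAX)
        return -1;

    *out = (int)value;
    return 0;
}
```

In the patch, a call such as `parse_job_size(size_str, &size)` could stand in for `size = atoi(size_str)`, with a nonzero return routed through whatever error handling the runtime layer already uses.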
@benson31 Thanks for all the info, and sorry it's taken so long to reply. I'm running through our backlog of issues towards a new SOS release and this came up. Is your patch to PMI2 still in use for LBANN (with @timmoon10)? Do you require PMI2 support when doing MPI+SHMEM? In other words, I'm a little uncertain as to whether the If we do incorporate the patch, we should probably take the time to do the same for PMI 1.0 and PMIx, but it's not a high priority on our end because we (arguably?) conform to the OpenSHMEM v1.5 specification's interoperability requirements with the |
The patch is preferable since @benson31's workaround with |
Oh yeah! I forgot to mention that I believe I figured out the environment issue - you have to set The brief reason for that is the SOS runtime is making MPI calls when you enable PMI-MPI, so you have to build with the MPI compiler wrapper. We're working on documenting this now. My apologies that it's been so poorly documented. Does setting Very cool. Side question - I looked quickly through the LBANN code base for an example using NVSHMEM, and I only see configury sorts of stuff (and |
@davidozog Thanks for following up on this. I'll need a little time to revive all these tests and remember what I was doing, but when I do, I'll give Re NVSHMEM in LBANN, this file has the only real use-case in LBANN. |
@benson31 and @timmoon10 - have you tried compiling with I want to verify that the environment issues only came about because building with I think we can close this issue, but it would be great if you could verify. The requirement is now documented on our troubleshooting wiki. Can you think of any reasons to include the above patch? One reason I'm skeptical is because hard-setting the runtime-pmi2 size to "universeSize" seems like it's specialized to SHMEM+MPI hybrid executions, not the PMI2 runtime layer itself. It could break PMI2 executions, right? For hybrid programs, SOS's supported path is through |
This block is missing an `else` statement to set the `size` and `rank` variables when someone else (e.g., MPI) has initialized PMI2 for this job.

This came up in a simple MPI+OpenSHMEM program:

(Inverting the order of `shmem_init` and `MPI_Init` caused a bigger mess involving FPEs and very large core dumps...)

The compiler was GCC@7.3.0 and MPI was OpenMPI@4.0.0, launched with `srun`.

A workaround was to configure with `--enable-pmi-mpi`. After manually setting `CPPFLAGS`, `LDFLAGS`, and `LIBS`, everything compiled fine and this example ran.