eee9f7ae3a
It turns out that there is an incompatibility between the Cray PMI library and the default configuration for building Open MPI (master). To work around this, we now disable use of aprun for direct launch of Open MPI jobs except under specific conditions. The problem is that packages are now getting initialized (on master) that do not work properly across a fork operation. As part of a constructor in the Cray PMI library, a fork operation is done to simplify use of shared memory between the processes in a job on the same node. This thoroughly disrupts the Open MPI initialization process when dlopen support is enabled. The initialization process gets about halfway through when the PMIX framework is opened and components are loaded, which triggers the Cray PMI constructor and hence the fork operation. There are two workarounds for this: 1) configure Open MPI for Cray XE/XC systems using aprun with the --disable-dlopen option; 2) set the PMI_NO_FORK environment variable in the shell in which the aprun command is run. Without these measures, an Open MPI job will simply hang at job startup in the first attempt to "thread-shift" the PMIx fence_nb operation. Additional hangs occur at shutdown if this problem is worked around, again due to the insertion of a fork operation halfway through the Open MPI initialization procedure. This commit detects whether the conditions that trigger the hang are present and, if so, prints a message and aborts the job launch. Note that on systems using Slurm, the PMI_NO_FORK environment variable is set as part of the srun job launch, so this issue is avoided on those systems. Signed-off-by: Howard Pritchard <howardp@lanl.gov> |
||
---|---|---|
.. | ||
configure.m4 | ||
help-pmix-cray.txt | ||
Makefile.am | ||
owner.txt | ||
pmix_cray_component.c | ||
pmix_cray_pmap_parser.c | ||
pmix_cray_pmap_parser.h | ||
pmix_cray.c | ||
pmix_cray.h |
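
The check described in the commit message can be pictured roughly as follows. This is only a sketch of the idea and is not taken from pmix_cray_component.c: the function names `cray_pmi_fork_hazard` and `check_launch_conditions`, the `dlopen_enabled` parameter, and the message text are all hypothetical. It assumes the hazard exists when dlopen support is enabled and PMI_NO_FORK is absent from the environment.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: report whether a direct launch is likely to hang.
 * Assumption: the hang needs both conditions from the commit message,
 * i.e. dlopen support is enabled AND PMI_NO_FORK is not set.
 */
static bool cray_pmi_fork_hazard(bool dlopen_enabled)
{
    /* getenv returns NULL when the variable is not in the environment. */
    const char *no_fork = getenv("PMI_NO_FORK");

    return dlopen_enabled && no_fork == NULL;
}

/*
 * Hypothetical caller: print a message and return an error so the
 * launch can be aborted, instead of hanging later during startup.
 */
static int check_launch_conditions(bool dlopen_enabled)
{
    if (cray_pmi_fork_hazard(dlopen_enabled)) {
        fprintf(stderr,
                "Cray PMI may fork during initialization; either set "
                "PMI_NO_FORK or build with --disable-dlopen.\n");
        return -1;
    }
    return 0;
}
```

In a real component, the build-time dlopen setting would come from the configure result rather than a function argument. The argument is used here only to keep the sketch self-contained.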