openmpi/opal/mca/pmix/cray
Howard Pritchard eee9f7ae3a pmix/cray: abort job if using aprun for general case
It turns out that there is an incompatibility between the Cray PMI
library and the default configuration for building Open MPI (master).
To work around this, we now disable use of aprun for direct launch
of Open MPI jobs except under specific conditions.

The problem is that there are now (on master) packages getting
initialized that do not work properly across a fork operation.
As part of a constructor in the Cray PMI library, a fork operation
is done to simplify use of shared memory between the
processes in a job on the same node.  This ends up thoroughly
messing up the Open MPI initialization process in the case
that dlopen support is enabled.  The initialization process gets
about half-way through when the PMIX framework is opened and
components are loaded, which triggers the Cray PMI constructor
and hence the fork operation.

There are two workarounds for this:
1) configure Open MPI for Cray XE/XC systems using aprun with the
   --disable-dlopen option
2) set the PMI_NO_FORK environment variable in the shell in which
   the aprun command is run.

Without taking these measures, an Open MPI job will just hang at
job startup in the first attempt to "thread-shift" the PMIx
fence_nb operation.  Additional hangs occur at shutdown if this
problem is worked around, again due to the insertion of a fork
operation halfway through the Open MPI initialization procedure.

This commit detects if the conditions that bring out the hang
situation are present, and if so, prints out a message and
aborts the job launch.
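
As a rough illustration of the check described above, the sketch below
tests for the two conditions named in this message: dlopen support
enabled and PMI_NO_FORK unset. It is not the code in
pmix_cray_component.c. The function names and the dlopen_enabled
parameter are hypothetical, and the real component may use additional
tests (for example, to tell whether the job was launched with aprun).

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns true when the hang conditions described
 * in the commit message are present, i.e. Open MPI was built with dlopen
 * support and PMI_NO_FORK is not set in the environment. */
bool cray_pmi_fork_hazard(bool dlopen_enabled)
{
    return dlopen_enabled && getenv("PMI_NO_FORK") == NULL;
}

/* Hypothetical helper: print a message and abort the launch when the
 * hazard is detected; otherwise return so startup can continue. */
void abort_if_fork_hazard(bool dlopen_enabled)
{
    if (cray_pmi_fork_hazard(dlopen_enabled)) {
        fprintf(stderr,
                "Direct launch with aprun is not supported in this "
                "configuration.\n"
                "Either configure Open MPI with --disable-dlopen or set "
                "PMI_NO_FORK before running aprun.\n");
        exit(EXIT_FAILURE);
    }
}
```

In a real build the dlopen_enabled argument would most likely come from
a configure-time macro rather than a runtime flag; it is a parameter
here only so the sketch stays self-contained.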

Note that on systems using Slurm, the PMI_NO_FORK environment variable
is set as part of the srun job launch, hence this issue is avoided
on those systems.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-25 06:28:19 -07:00
..
configure.m4 pmix/cray: fix disable-dlopen problem 2016-11-21 13:45:10 -06:00
help-pmix-cray.txt pmix/cray: abort job if using aprun for general case 2016-11-25 06:28:19 -07:00
Makefile.am pmix/cray: abort job if using aprun for general case 2016-11-25 06:28:19 -07:00
owner.txt add owner files to opa/ompi/orte mca directories 2015-02-22 15:10:23 -07:00
pmix_cray_component.c pmix/cray: abort job if using aprun for general case 2016-11-25 06:28:19 -07:00
pmix_cray_pmap_parser.c pmix/cray: whitespace cleanup 2016-11-18 19:30:40 -07:00
pmix_cray_pmap_parser.h Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
pmix_cray.c pmix/cray: fix disable-dlopen problem 2016-11-21 13:45:10 -06:00
pmix_cray.h Merge pull request #2444 from hppritcha/topic/cray_pmix_ws_cleanup 2016-11-21 06:03:56 -07:00