openmpi/orte/mca
Ralph Castain 89559396ea Resolve a race condition when running under a SLURM environment.
The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit.

If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state.

Mucho bad. :-/

This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths).
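
As a rough illustration of the "wait for the launcher to report complete before exiting" idea, here is a minimal POSIX sketch. It is not the ORTE/plm code: `spawn_launcher` and `wait_for_launcher` are hypothetical helper names, and an arbitrary child command stands in for srun. It only assumes that the parent fork/exec's the launcher and must not exit until that child has been reaped.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork/exec a launcher command (a stand-in for srun).
 * Returns the child's pid in the parent, or -1 if fork() failed. */
static pid_t spawn_launcher(char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: replace this process image with the launcher. */
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }
    return pid;
}

/* Hypothetical helper: block until the launcher has exited, so the parent
 * never terminates ahead of it. Returns the launcher's exit code, or -1 if
 * it could not be reaped or did not exit normally. */
static int wait_for_launcher(pid_t pid)
{
    int status = 0;
    pid_t rc;

    /* Retry if the wait is interrupted by a signal. */
    do {
        rc = waitpid(pid, &status, 0);
    } while (rc == -1 && errno == EINTR);

    if (rc != pid) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In a caller, the shutdown path would send its terminate command first and then call `wait_for_launcher()` on the pid returned by `spawn_launcher()` before exiting, rather than exiting as soon as the daemons acknowledge. A real implementation would more likely register an asynchronous child-exit callback than block in `waitpid`; the blocking form is used here only to keep the sketch short.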

This commit was SVN r20018.
2008-11-18 13:59:23 +00:00
errmgr - The intel compiler does not play nice with the 2008-08-08 16:26:09 +00:00
ess Fix C/R restart case by passing the correct address to the orte_ess_base_build_nidmap() function. This cropped up from r19866. 2008-11-10 15:19:28 +00:00
filem Some more work on the man pages: 2008-08-07 19:20:40 +00:00
grpcomm Cleanup modex logic to allow modex-less launch: 2008-11-03 21:48:52 +00:00
iof Partially restore the iof changes - this repairs the initial observation of inconsistent and incomplete output 2008-11-14 20:36:18 +00:00
notifier Correct the notifier default module to include the new added API 2008-11-13 18:03:41 +00:00
odls Partially restore the iof changes - this repairs the initial observation of inconsistent and incomplete output 2008-11-14 20:36:18 +00:00
oob Ensure we know how to route to a different job family when it connects to us 2008-11-03 14:25:14 +00:00
plm Resolve a race condition when running under a SLURM environment. 2008-11-18 13:59:23 +00:00
ras Remove linking components against ORTE and OPAL libs. This was 2008-11-08 00:56:57 +00:00
rmaps Remove a couple of mutex vars that were defined and used - but never initialized. No clear way to initialize them, and that area of the code should never see threads anyway. 2008-11-03 17:23:10 +00:00
rml This is a first step towards supporting fully-routed OOB communications: 2008-10-31 21:10:00 +00:00
routed Discovered while documenting the "preconnect" mca params that several of them didn't make sense any more. After chatting with Jeff, we agreed to the following: 2008-11-05 19:41:16 +00:00
snapc fix some typos. should be moved to v1.3 2008-11-10 19:05:26 +00:00