89559396ea
The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit. If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state. Mucho bad. :-/ This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths). This commit was SVN r20018. |
||
---|---|---|
errmgr | ||
ess | ||
filem | ||
grpcomm | ||
iof | ||
notifier | ||
odls | ||
oob | ||
plm | ||
ras | ||
rmaps | ||
rml | ||
routed | ||
snapc |
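
The termination ordering described in the commit message above (do not exit until the fork/exec'd launcher has itself finished) can be sketched with plain POSIX calls. This is a minimal illustration, not the actual plm code: `spawn_launcher` and `wait_for_launcher` are hypothetical names, and any command passed in merely stands in for `srun`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork/exec a launcher command (a stand-in for srun).
 * Returns the child's pid in the parent, or -1 if fork() failed. */
static pid_t spawn_launcher(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: replace this process image with the launcher. */
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp() failed */
    }
    return pid;
}

/* Hypothetical helper: block until the launcher has exited, so the parent
 * cannot terminate before the launcher finishes its own bookkeeping.
 * Returns the launcher's exit code, or -1 on error or abnormal exit. */
static int wait_for_launcher(pid_t pid)
{
    int status = 0;
    pid_t rc;

    /* Retry if a signal interrupts the wait. */
    do {
        rc = waitpid(pid, &status, 0);
    } while (rc == -1 && errno == EINTR);

    if (rc == -1) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would invoke `spawn_launcher` at startup, send its terminate command to the daemons when the job ends, and then call `wait_for_launcher` before exiting. The blocking `waitpid` is only the simplest way to show the ordering; the commit does not say how the real code waits for `srun`.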