openmpi/orte/mca/plm/base
Ralph Castain 72530f8fed Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit allows mpirun to exit cleanly when this occurs, and makes a best-effort attempt to clean up the mess. However, two unresolved issues remain that eventually need to be addressed:
1. It depends upon the ability of the native environment to alert us that the orted has died or failed to start. I have included that support for SLURM, but support for other environments still needs to be added.

2. For some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to exit on their own quite nicely when they see their HNP "lifeline" disappear.

This represents the best available fix for ticket #221, so I am closing that ticket at this time.

This commit was SVN r18536.
2008-05-29 13:38:27 +00:00
..
base.h Clean up recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message. 2008-02-28 19:58:32 +00:00
help-plm-base.txt Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit allows mpirun to exit cleanly when this occurs, and makes a best-effort attempt to clean up the mess. However, two unresolved issues remain that eventually need to be addressed: 2008-05-29 13:38:27 +00:00
Makefile.am Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
plm_base_close.c This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
plm_base_jobid.c This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
plm_base_launch_support.c Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit allows mpirun to exit cleanly when this occurs, and makes a best-effort attempt to clean up the mess. However, two unresolved issues remain that eventually need to be addressed: 2008-05-29 13:38:27 +00:00
plm_base_open.c This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
plm_base_orted_cmds.c Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit allows mpirun to exit cleanly when this occurs, and makes a best-effort attempt to clean up the mess. However, two unresolved issues remain that eventually need to be addressed: 2008-05-29 13:38:27 +00:00
plm_base_proxy.c This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
plm_base_receive.c This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
plm_base_select.c This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
plm_private.h Clean up recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message. 2008-02-28 19:58:32 +00:00