1
1
openmpi/orte
Ralph Castain b6196e8a39 When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive.
This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed.

In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean.

I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it.

This commit was SVN r16451.
2007-10-15 18:00:30 +00:00
..
class reapply r15517 and r15520, which were removed in r15527 so that I could get 2007-07-20 02:34:29 +00:00
dss Roll in the Voltaire core/socket/etc process mapping implementation. Only change I made was to cleanup some of the diagnostic output in the odls_default component so it uses the -mca odls_base_verbose parameter. 2007-07-14 15:14:07 +00:00
etc remove defunct file 2007-07-20 02:10:38 +00:00
include reapply r15517 and r15520, which were removed in r15527 so that I could get 2007-07-20 02:34:29 +00:00
mca When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. 2007-10-15 18:00:30 +00:00
orted Fixes trac:1160. There is still some other problem in the OOB, but we 2007-10-15 15:41:36 +00:00
runtime When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. 2007-10-15 18:00:30 +00:00
test These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. 2007-10-05 19:48:23 +00:00
tools When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. 2007-10-15 18:00:30 +00:00
util Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by: 2007-10-11 15:57:26 +00:00
Doxyfile Fix the broken Doxyfile so people can generate what little code base documentation we have :-) 2006-04-13 12:52:17 +00:00
Makefile.am Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. 2007-07-12 19:53:18 +00:00