openmpi/orte/tools
Ralph Castain b456fb2d42 Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. The heart rate is set via an mca param. The HNP checks for a callback every 2*heartrate and declares orted failure if none was seen in the last 2*heartrate interval.
Also detect an orted that fails to start by setting a timeout on launch. Currently only used in the TM launcher.

Neither detection is enabled by default; each is active only if the heartrate and/or launch timeout is set. The exception is SLURM, where orted failure is always detected and reported.

More info to come on devel list.

This commit was SVN r18555.
2008-06-02 21:46:34 +00:00
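
The heartbeat rule described in the commit message above can be sketched roughly as follows. This is a minimal illustration in Python, not the ORTE code: the class `HeartbeatMonitor`, its method names, and the daemon labels are hypothetical, and treating a heartrate of zero or less as "detection disabled" is an assumption based only on the statement that detection is off unless a heartrate is set.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical HNP-side tracker; not the actual ORTE implementation."""

    heartrate: float  # seconds between expected daemon heartbeats
    last_seen: Dict[str, float] = field(default_factory=dict)

    @property
    def check_interval(self) -> float:
        # The commit message says the HNP checks every 2*heartrate.
        return 2.0 * self.heartrate

    def record_heartbeat(self, daemon: str, now: float) -> None:
        # Remember the most recent time this daemon reported in.
        self.last_seen[daemon] = now

    def failed_daemons(self, now: float) -> List[str]:
        # Assumption: a non-positive heartrate means detection is disabled.
        if self.heartrate <= 0:
            return []
        window = 2.0 * self.heartrate
        # A daemon counts as failed if nothing was seen within the window.
        return sorted(d for d, t in self.last_seen.items() if now - t > window)
```

With a heartrate of 5 seconds, a daemon that last reported at time 0 is still considered alive at time 10 and is reported as failed at any later check, while a daemon that reported at time 8 is not.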
orte-checkpoint This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
orte-clean This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
orte-ps This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
orte-restart This commit represents a bunch of work on a Mercurial side branch. As 2008-05-13 20:00:55 +00:00
orted need orted_LDFLAGS as a placeholder 2008-03-25 13:41:09 +00:00
orterun Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2*heartrate, declares orted failure if not seen in last 2*heartrate time. 2008-06-02 21:46:34 +00:00
wrappers Per recent off-list discussions about the build system, I have done 2008-03-22 02:04:05 +00:00
Makefile.am Merge in tmp/jjh-scratch 2008-04-23 00:17:12 +00:00