1
1
openmpi/orte/mca/rmgr
Josh Hursey 755658694e Bring in changes to support Cray's Compute Node Linux (CNL) and
Application Level Placement Scheduler (ALPS).

This commit was tested under two Cray machines at ORNL: Jaguar (Catamount)
and Rizzo (CNL Test cage). Both machines performed as they should across
the commit.

It is likely that mor changes will follow this the work and environment
stabilizes.

Most of the infrastructure works the same for Catamount and CNL
except for a few bits. Below are the highlights:

Default IFACE Change:
 On Catamount we can use PTL_IFACE_DEFAULT, but on the CNL system we have access
 to will fail on this interface, and should be set to:
    IFACE_FROM_BRIDGE_AND_NALID(PTL_BRIDGE_UK,PTL_IFACE_SS).
 So if we detect that we are running with YOD then use the former interface
 and if we detect that we are running with ALPS then use the latter.
 We will want to pursue a more elegant solution if this interface continues to 
 change across machines.

PtlGetId and cnos_register_ptlid:
 The header suggests that these should never be called when launching with YOD.
 But in the ALPS environment the cnos_barrier() will hang forever if these 
 functions are not called after PtlNIInit(). Since these functions only need to
 be called once, and the orte rmgr/cnos component is loaded before the ompi 
 common/portals componet then just call these functions once in the rmgr/cnos
 component.

cnos_barrier_init():
 This is a noop for YOD, but critical for ALPS. So be sure to call it before
 calling the first barrier in the rmgr/cnos component.

cnos_barrier vs cnos_pm_barrier:
 It is suggested the cnos_pm_barrier only be used during finalization 
 as it will indicate to the launcher (yod or aprun) that the app is about
 to complete. It was suggested that we use the regular cnos_barrier() instead.
 I want to look into this a bit more to make sure there are not adverse
 side effects. A note has been placed in the code to indicate this reasoning.

This commit was SVN r15756.
2007-08-03 19:46:38 +00:00
..
base reapply r15517 and r15520, which were removed in r15527 so that I could get 2007-07-20 02:34:29 +00:00
cnos Bring in changes to support Cray's Compute Node Linux (CNL) and 2007-08-03 19:46:38 +00:00
proxy reapply r15517 and r15520, which were removed in r15527 so that I could get 2007-07-20 02:34:29 +00:00
urm reapply r15517 and r15520, which were removed in r15527 so that I could get 2007-07-20 02:34:29 +00:00
Makefile.am Fix the bug that caused mpirun to hang when a remote executable wasn't found using the rsh launcher. Will now test on a remote node 2006-10-11 18:43:13 +00:00
rmgr_types.h Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that *did* start. 2007-05-21 18:31:28 +00:00
rmgr.h reapply r15517 and r15520, which were removed in r15527 so that I could get 2007-07-20 02:34:29 +00:00