openmpi/orte
Josh Hursey 755658694e Bring in changes to support Cray's Compute Node Linux (CNL) and
Application Level Placement Scheduler (ALPS).

This commit was tested on two Cray machines at ORNL: Jaguar (Catamount)
and Rizzo (CNL Test cage). Both machines behaved as expected across
the commit.

It is likely that more changes will follow as this work and environment
stabilize.

Most of the infrastructure works the same for Catamount and CNL
except for a few bits. Below are the highlights:

Default IFACE Change:
 On Catamount we can use PTL_IFACE_DEFAULT, but on the CNL system we have access
 to, this interface fails; there it should be set to:
    IFACE_FROM_BRIDGE_AND_NALID(PTL_BRIDGE_UK,PTL_IFACE_SS).
 So if we detect that we are running under YOD we use the former interface,
 and if we detect that we are running under ALPS we use the latter.
 We will want to pursue a more elegant solution if this interface continues to
 change across machines.
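 A minimal sketch of the launcher-based choice, assuming the launcher (YOD or
 ALPS) has already been detected; how the rmgr/cnos component detects it is not
 described here. The enums and select_portals_iface() are hypothetical stand-ins,
 not Portals or Open MPI API; real code would return PTL_IFACE_DEFAULT or
 IFACE_FROM_BRIDGE_AND_NALID(PTL_BRIDGE_UK,PTL_IFACE_SS) from the Portals headers.

```c
/* Hypothetical launcher identifiers; detection itself is out of scope. */
typedef enum { LAUNCHER_UNKNOWN, LAUNCHER_YOD, LAUNCHER_ALPS } launcher_t;

/* Hypothetical stand-ins for the two Portals interface values. */
typedef enum {
    IFACE_CHOICE_NONE,         /* no decision: launcher not recognized */
    IFACE_CHOICE_DEFAULT,      /* would be PTL_IFACE_DEFAULT (Catamount/YOD) */
    IFACE_CHOICE_BRIDGE_UK_SS  /* would be IFACE_FROM_BRIDGE_AND_NALID(PTL_BRIDGE_UK,PTL_IFACE_SS) (CNL/ALPS) */
} iface_choice_t;

/* Map the detected launcher to the interface to pass to the NI init call. */
static iface_choice_t select_portals_iface(launcher_t launcher)
{
    switch (launcher) {
    case LAUNCHER_YOD:
        return IFACE_CHOICE_DEFAULT;
    case LAUNCHER_ALPS:
        return IFACE_CHOICE_BRIDGE_UK_SS;
    default:
        return IFACE_CHOICE_NONE;  /* let the caller report an error */
    }
}
```

 Keeping the mapping in one function leaves a single place to change if the
 interface keeps moving across machines.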

PtlGetId and cnos_register_ptlid:
 The header suggests that these should never be called when launching with YOD.
 But in the ALPS environment cnos_barrier() will hang forever if these
 functions are not called after PtlNIInit(). Since these functions only need to
 be called once, and the orte rmgr/cnos component is loaded before the ompi
 common/portals component, we call them just once, in the rmgr/cnos
 component.
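 A minimal sketch of the "call once" guard, assuming initialization is
 single-threaded (the flag below is not thread-safe). The stub functions are
 hypothetical stand-ins for PtlGetId() and cnos_register_ptlid() that only count
 calls; register_ptlid_once() is a hypothetical helper, not code from this commit.

```c
#include <stdbool.h>

/* Hypothetical stubs for PtlGetId() and cnos_register_ptlid(). */
static int get_id_calls = 0;
static int register_calls = 0;
static void stub_get_id(void)         { get_id_calls++; }
static void stub_register_ptlid(void) { register_calls++; }

/* Set once the ID has been obtained and registered. */
static bool ptlid_registered = false;

/* Called by the first component to initialize (rmgr/cnos), after the
 * network interface has been initialized. Later callers, such as the
 * common/portals component, find the flag set and skip the calls. */
static void register_ptlid_once(void)
{
    if (ptlid_registered) {
        return;
    }
    stub_get_id();
    stub_register_ptlid();
    ptlid_registered = true;
}
```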

cnos_barrier_init():
 This is a no-op for YOD, but critical for ALPS, so be sure to call it before
 calling the first barrier in the rmgr/cnos component.
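 One way to guarantee the ordering is a barrier wrapper that runs the init step
 lazily before the first barrier. This is a sketch of that alternative, not what
 the commit does (it calls cnos_barrier_init() explicitly); the stubs are
 hypothetical stand-ins for cnos_barrier_init() and cnos_barrier() that record
 calls, and component_barrier() is a hypothetical name.

```c
#include <stdbool.h>

/* Hypothetical stubs for cnos_barrier_init() and cnos_barrier(). */
static int barrier_init_calls = 0;
static int barrier_calls = 0;
static bool barrier_before_init = false;

static void stub_barrier_init(void) { barrier_init_calls++; }

static void stub_barrier(void)
{
    if (barrier_init_calls == 0) {
        barrier_before_init = true;  /* the ordering the commit warns about under ALPS */
    }
    barrier_calls++;
}

/* Set once stub_barrier_init() has run. */
static bool barrier_initialized = false;

/* Every barrier in the component goes through here, so the init step
 * always precedes the first barrier. */
static void component_barrier(void)
{
    if (!barrier_initialized) {
        stub_barrier_init();         /* no-op under YOD, critical under ALPS */
        barrier_initialized = true;
    }
    stub_barrier();
}
```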

cnos_barrier vs cnos_pm_barrier:
 It is suggested that cnos_pm_barrier() only be used during finalization,
 as it indicates to the launcher (yod or aprun) that the app is about
 to complete. It was therefore suggested that we use the regular cnos_barrier()
 instead outside of finalization. I want to look into this a bit more to make
 sure there are no adverse side effects. A note has been placed in the code to
 explain this reasoning.
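 The suggested policy can be sketched as a simple selection by run phase. The
 phase and barrier-kind enums and choose_barrier() are hypothetical; they only
 encode the rule described above (cnos_pm_barrier() during finalization,
 cnos_barrier() everywhere else) and are not what this commit implements.

```c
/* Hypothetical phases of a job, as seen by the runtime. */
typedef enum { PHASE_STARTUP, PHASE_RUNNING, PHASE_FINALIZE } run_phase_t;

/* BARRIER_REGULAR stands for cnos_barrier(), BARRIER_PM for cnos_pm_barrier(). */
typedef enum { BARRIER_REGULAR, BARRIER_PM } barrier_kind_t;

/* cnos_pm_barrier() tells the launcher (yod or aprun) that the app is
 * about to complete, so reserve it for finalization. */
static barrier_kind_t choose_barrier(run_phase_t phase)
{
    return (phase == PHASE_FINALIZE) ? BARRIER_PM : BARRIER_REGULAR;
}
```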

This commit was SVN r15756.
2007-08-03 19:46:38 +00:00
class reapply r15517 and r15520, which were removed in r15527 so that I could get 2007-07-20 02:34:29 +00:00
dss Roll in the Voltaire core/socket/etc process mapping implementation. Only change I made was to cleanup some of the diagnostic output in the odls_default component so it uses the -mca odls_base_verbose parameter. 2007-07-14 15:14:07 +00:00
etc remove defunct file 2007-07-20 02:10:38 +00:00
include reapply r15517 and r15520, which were removed in r15527 so that I could get 2007-07-20 02:34:29 +00:00
mca Bring in changes to support Cray's Compute Node Linux (CNL) and 2007-08-03 19:46:38 +00:00
orted - fix a problem that showed up with the sun thread tests. 2007-07-26 11:30:27 +00:00
runtime Mark the orte_abort function as noreturn and change the return value from 2007-07-31 16:09:52 +00:00
test Bring orte tests up to date with revised rml system. 2007-07-23 13:05:34 +00:00
tools - Remove the solution and project files, will commit them later. 2007-07-31 17:07:02 +00:00
util Bring over the extra debugging output that helped a user find his NFS mount problems. This just adds ERROR_LOG messages when the session directory creation process fails so we can see where it is happening - really helps users (and us as well) figure out what specifically went wrong. 2007-07-18 19:50:54 +00:00
Doxyfile Fix the broken Doxyfile so people can generate what little code base documentation we have :-) 2006-04-13 12:52:17 +00:00
Makefile.am Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. 2007-07-12 19:53:18 +00:00