1
1
Граф коммитов

54 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “lmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
Ralph Castain
daeb9b6c4f Some more cleanups. Remove direct references to ORTE by changing OMPI_CAST_ORTE_NAME -> OMPI_CAST_RTE_NAME. Ensure that ORTE tools (mpirun, orted, tools) set the OPAL proc structure fields so OPAL knows what is going on and uses the correct print functions (still need to fix the problem for non-MPI apps). Properly return uint32_t from the opal utilities instead of int32_t as that is what the ORTE process name fields contain.
Thanks to Gilles for pointing out some of the discrepancies.

This commit was SVN r32398.
2014-08-01 14:44:11 +00:00
Ralph Castain
6c5e592785 Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment.
Update the orte/test/mpi/coll_test.c test to show the revised example.

This commit was SVN r32234.

The following SVN revision numbers were found above:
  r32203 --> open-mpi/ompi@a523dba41d
  r32210 --> open-mpi/ompi@2ce11ed5c4
  r32222 --> open-mpi/ompi@d55f16db50
2014-07-15 03:48:00 +00:00
Ralph Castain
a523dba41d NOTE: this modifies the MPI-RTE interface
We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch. which required modification
s to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunti
ng.

This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't curren
tly have a way of tagging the collective to a specific thread. From the comment in the code:

 * In order to allow scalable
 * generation of collective id's, they are formed as:
 *
 * top 32-bits are the jobid of the procs involved in
 * the collective. For collectives across multiple jobs
 * (e.g., in a connect_accept), the daemon jobid will
 * be used as the id will be issued by mpirun. This
 * won't cause problems because daemons don't use the
 * collective_id
 *
 * bottom 32-bits are a rolling counter that recycles
 * when the max is hit. The daemon will cleanup each
 * collective upon completion, so this means a job can
 * never have more than 2**32 collectives going on at
 * a time. If someone needs more than that - they've got
 * a problem.
 *
 * Note that this means (for now) that RTE-level collectives
 * cannot be done by individual threads - they must be
 * done at the overall process level. This is required as
 * there is no guaranteed ordering for the collective id's,
 * and all the participants must agree on the id of the
 * collective they are executing. So if thread A on one
 * process asks for a collective id before thread B does,
 * but B asks before A on another process, the collectives will
 * be mixed and not result in the expected behavior. We may
 * find a way to relax this requirement in the future by
 * adding a thread context id to the jobid field (maybe taking the
 * lower 16-bits of that field).

This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives.

This commit was SVN r32203.
2014-07-10 18:53:12 +00:00
Ralph Castain
449cd8f3d7 Update a couple of fields, add a scheduler field to proc_info
This commit was SVN r30718.
2014-02-13 23:30:04 +00:00
Adrian Reber
fde1040d2f Use unique collective ids for the checkpoint/restart code
This commit was SVN r30552.
2014-02-04 14:03:05 +00:00
Ralph Castain
14bf1c9463 Some minor cleanups:
* don't return null if someone wants to print ORTE_SUCCESS

* rename some stale process types

* keep show_help local if we are in standalone operation as there is nobody to send it to

cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r30400.
2014-01-23 21:35:20 +00:00
Ralph Castain
24c811805f ****************************************************************
This change contains a non-mandatory modification
       of the MPI-RTE interface. Anyone wishing to support
       coprocessors such as the Xeon Phi may wish to add
       the required definition and underlying support
****************************************************************

Add locality support for coprocessors such as the Intel Xeon Phi.

Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.

So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:

1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board

2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions

3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.

4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.

5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored.

6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.

cmr:v1.7.4:reviewer=hjelmn

This commit was SVN r29435.
2013-10-14 16:52:58 +00:00
Ralph Castain
37db1727a2 Refs trac:3710
Simplify the whole stripping of prefix method by consolidating it into a single MCA param. Allow for multiple prefixes to be stripped, each separated in the param by a comma. If no prefix is given, or the specified prefix isn't in the nodename, then just use the hostname itself.

This commit was SVN r28974.

The following Trac tickets were found above:
  Ticket 3710 --> https://svn.open-mpi.org/trac/ompi/ticket/3710
2013-08-01 00:32:10 +00:00
Nathan Hjelm
cf377db823 MCA/base: Add new MCA variable system
Features:
 - Support for an override parameter file (openmpi-mca-param-override.conf).
   Variable values in this file can not be overridden by any file or environment
   value.
 - Support for boolean, unsigned, and unsigned long long variables.
 - Support for true/false values.
 - Support for enumerations on integer variables.
 - Support for MPIT scope, verbosity, and binding.
 - Support for command line source.
 - Support for setting variable source via the environment using
   OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
 - Cleaner API.
 - Support for variable groups (equivalent to MPIT categories).

Notes:
 - Variables must be created with a backing store (char **, int *, or bool *)
   that must live at least as long as the variable.
 - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
   mca_base_var_set_value() to change the value.
 - String values are duplicated when the variable is registered. It is up to
   the caller to free the original value if necessary. The new value will be
   freed by the mca_base_var system and must not be freed by the user.
 - Variables with constant scope may not be settable.
 - Variable groups (and all associated variables) are deregistered when the
   component is closed or the component repository item is freed. This
   prevents a segmentation fault from accessing a variable after its component
   is unloaded.
 - After some discussion we decided we should remove the automatic registration
   of component priority variables. Few component actually made use of this
   feature.
 - The enumerator interface was updated to be general enough to handle
   future uses of the interface.
 - The code to generate ompi_info output has been moved into the MCA variable
   system. See mca_base_var_dump().

opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system

This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.

This commit was SVN r28236.
2013-03-27 21:09:41 +00:00
Ralph Castain
6ee32767d4 Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument.
This commit was SVN r28219.
2013-03-26 18:27:50 +00:00
Ralph Castain
ab73d11368 Oops - push missing definitions
This commit was SVN r27688.
2012-12-18 16:43:03 +00:00
Ralph Castain
b9b41d8662 For cases where the alpha+non-zero prefix must be removed from a node name, be sure to do it everywhere we access node names - otherwise, modex methods such as pmi will fail to correctly identify procs on the same node
This commit was SVN r27022.
2012-08-13 20:44:56 +00:00
Ralph Castain
6ee35e4977 Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well
This commit was SVN r26931.
2012-07-31 21:21:50 +00:00
Ralph Castain
0dfe29b1a6 Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required.
Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework.

Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one.

This commit was SVN r26678.
2012-06-27 14:53:55 +00:00
Ralph Castain
c69a04e16b Cleanup the pidmap decoding for apps to avoid confusion
This commit was SVN r26498.
2012-05-27 16:21:38 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
6310361532 At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement

The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.

In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:

1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.

2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.

3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.

As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.

This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
Ralph Castain
b44f8d4b28 Complete implementation of the ess.proc_get_locality API. Up to this point, the API was only capable of telling if the specified proc was sharing a node with you. However, the returned value was capable of telling you much more detailed info - e.g., if the proc shares a socket, a cache, or numa node. We just didn't have the data to provide that detail.
Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times.

Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h

Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun.

This commit was SVN r25331.
2011-10-19 20:18:14 +00:00
Wesley Bland
e1ba09ad51 Add a resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
042ee3ec48 Support the option of outputting error_log messages with something other than the process name
This commit was SVN r24784.
2011-06-17 14:50:00 +00:00
Ralph Castain
33b68132cc Update the rmcast framework
This commit was SVN r24370.
2011-02-12 16:52:03 +00:00
Ralph Castain
3ca0b4138b Let the nidmap functions update a new orte_process_info field as to the number of daemons in the system
This commit was SVN r23088.
2010-05-04 02:40:09 +00:00
Ralph Castain
75e99e6118 Do a better job of selecting cm ess component, handle tool and daemon issues
This commit was SVN r22942.
2010-04-07 18:59:21 +00:00
Ralph Castain
2541aa98ab Change the app_idx type to uint32_t to support users who use large numbers of app_contexts. Set it up as a new typedef so we can change it later without as much effort.
This commit was SVN r22727.
2010-02-27 17:37:34 +00:00
Ralph Castain
9f3ccebeaa We need to barrier for orte apps when the job is initially started, but we must not do the barrier when a proc is restarted as the other procs in the job won't know to participate.
This commit was SVN r22388.
2010-01-10 02:21:30 +00:00
Ralph Castain
60edbc7220 Fix hetero operations and comm_spawn (to a point).
Remove all architecture references from ORTE and put them back in the modex using modex_send/recv calls.

Hetero operations are now fully supported again. Comm_spawn now works up to the point where it segfaults due to an error in the CID code - which now allows Edgar to dig further! :-)

This commit was SVN r21655.
2009-07-13 20:03:41 +00:00
Ralph Castain
4e0223a638 Add the ability to directly launch procs via rsh/ssh. Collect common functions in plm/base. Create a new global param to set assume_same_shell, alias'd back to plm_rsh_assume_same_shell (not deprecated).
This commit was SVN r21328.
2009-05-30 01:10:25 +00:00
Ralph Castain
c45ff0d59f Take the next step towards fully utilizing static ports for the daemons to eliminate the initial "phone home" to mpirun by modifying the orted termination procedure to eliminate the need for a full barrier-like operation. Instead, we add a "onesided" barrier to the grpcomm framework API that releases the orted once it has completed its own contribution to the barrier - i.e., the orteds now exit as the "ack" message rolls up towards mpirun instead of sending the "ack" directly to mpirun.
This causes the orteds in the routing tree to remain alive until all termination "acks" from orteds below them have passed through. Thus, if we use static ports, we no longer require a direct orted-to-mpirun connection.

Also modify the binomial routed module so it conforms to what all the other routed modules do and have all messages pass along the routing tree instead of short-circuiting between orteds. This further reduces the number of ports being opened on backend nodes.

This commit was SVN r21203.
2009-05-11 14:11:44 +00:00
Ralph Castain
4be24521aa Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans.
Minor mod to the ompi layer to use the new #define's - just one-line name replacements.

This commit was SVN r21144.
2009-05-04 11:07:40 +00:00
Rainer Keller
be66cc2279 - We're using uint16_t, uint32_t, and friends,
so #include <stdint.h> if we have it...

This commit was SVN r20835.
2009-03-21 01:26:27 +00:00
Rainer Keller
ec0ed48718 - Revert r20739
This commit was SVN r20742.

The following SVN revision numbers were found above:
  r20739 --> open-mpi/ompi@781caee0b6
2009-03-05 21:56:03 +00:00
Rainer Keller
781caee0b6 - First of two or three patches, in orte/util/proc_info.h:
Adapt orte_process_info to orte_proc_info, and
   change orte_proc_info() to orte_proc_info_init().
 - Compiled on linux-x86-64
 - Discussed with Ralph

This commit was SVN r20739.
2009-03-05 20:36:44 +00:00
Ralph Castain
6db641c86d Pass the number of nodes in a job to the process
This commit was SVN r20595.
2009-02-19 20:45:07 +00:00
Ralph Castain
2966206f58 Fix a race condition in the IOF and add some new user-requested features:
1. fix a race condition whereby a proc's output could trigger an event prior to the other outputs being setup, thus c ausing the IOF to declare the proc "terminated" too early. This was really rare, but could happen.

2. add a new "timestamp-output" option that timestamp's each line of output

3. add a new "output-filename" option that redirects each proc's output to a separate rank-named file.

4. add a new "xterm" option that redirects the output of the specified ranks to a separate xterm window.

This commit was SVN r20392.
2009-01-30 22:47:30 +00:00
Ralph Castain
5e6d3ba289 Initial implementation of static ports. Provide an mca param to specify static port ranges to the OOB - can provide an
y combination of comma-separated values and ranges. Daemons will use the first port in the range, MPI procs will use the other ports in the range assuming that they know their node rank in time and enough ports were specified.

NOTE: this capability only works under specific conditions. I will outline more about this in a note to devel as the remainder of the implementation progresses. For now, the only environment where this works is slurm. The linear routed module has also been adjusted to work with static ports so that all messaging flows strictly through the topology, including the initial daemon callback - thus limiting the number of sockets opened by mpirun.

This commit was SVN r20390.
2009-01-30 18:31:43 +00:00
Ralph Castain
ba5498cdc6 Repair the MPI-2 dynamic operations. This includes:
1. repair of the linear and direct routed modules

2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI

3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature

4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.

Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB

5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.

Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.

6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch

7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch

8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies

9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.

10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon

There is still more cleanup to be done for efficiency, but this at least works.

Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.

Fixes ticket #1256

This commit was SVN r18804.
2008-07-03 17:53:37 +00:00
Ralph Castain
3e55fe6f6d Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit.
Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs.

Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node.

This commit was SVN r18338.
2008-04-30 19:49:53 +00:00
Ralph Castain
e050f37578 Cleanup a few warnings about initializing variables.
Remove an obsolete data value.

This commit was SVN r18129.
2008-04-10 19:15:16 +00:00
Ralph Castain
6166278e18 Improve the scalability of the modex operation and fix a bug reported by Tim P
The bug was a race condition in the barrier operation that caused the barrier in MPI_Finalize to fail on very short programs.

Scalaiblity was improved by using the daemons to aggregate modex and barrier messages before sending them to the rank=0 proc. Improvement is proportional to ppn, of course, but there really wasn't a scaling problem at low ppn anyway. This modification also paves the way for better allgather operations since now all the data for each node is sitting at the daemon level, and the daemons are now aware that a collective operation on the OOB is underway (so they -can- participate in a collective of their own to support it).

Also added better diagnostics to map out the timing associated with MPI_Init - turned on by -mca orte_timing 1.

This commit was SVN r17988.
2008-03-27 15:17:53 +00:00
Ralph Castain
60d931217f Modify the routed framework to allow greater control/flexibility over response to lost routes and initial wireup of jobs as required by several soon-to-come new modules.
Specifically, add two new APIs:

1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job.

Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections.

2. get_wireup_info: returns (in a provided buffer) info required to wireup connections for the specified job. Some routed modules do not need to return any info as they can wireup via alternative means, while some need to xchg data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job.

The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module.

This commit was SVN r17969.
2008-03-26 01:00:24 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Ralph Castain
ff99aa054f In order to prevent orphaned processes when using non-unity routing methods, the procs need to realize that their local daemon is a critical connection - if that connection unexpectedly closes, they need to terminate.
This commit adds definition for a "lifeline" connection. For an HNP, there is no lifeline, so the lifeline proc is NULL. For a daemon, the lifeline is the HNP - the daemon should abort if it loses that connection.

For a proc using unity routed, the lifeline is the HNP since it connects directly to the HNP.

For a proc using tree routed, the lifeline is the local daemon.

Adjusted OOB to call abort if the lifeline (as opposed to HNP) connection is lost.

This commit was SVN r17761.
2008-03-06 15:30:44 +00:00
Ralph Castain
d70e2e8c2b Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately.
Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer

This commit was SVN r17632.
2008-02-28 01:57:57 +00:00
Ralph Castain
3dbd4d9be7 Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by:
1. taking advantage of the fact that we no longer create the launch  message via a GPR trigger. In earlier times, we had the GPR create the launch message based on a subscription. In that mode of operation, we could not guarantee the order in which the data was stored in the message - hence, we had no choice but to parse the message in a loop that checked each value against a list of possible "keys" until the corresponding value was found.

Now, however, we construct the message "by hand", so we know precisely what data is in each location in the message. Thus, we no longer need to send the character string "keys" for each data value any more. This represents a rather large savings in the message size - to give you an example, we typically would use a 30-char "key" for a 2-byte data value. As you can see, the overhead can become very large.

2. sending node-specific data only once. Again, because we used to construct the message via subscriptions that were done on a per-proc basis, the data for each node (e.g., the daemon's name, whether or not the node was oversubscribed) would be included in the data for each proc. Thus, the node-specific data was repeated for every proc.

Now that we construct the message "by hand", there is no reason to do this any more. Instead, we can insert the data for a specific node only once, and then provide the per-proc data for that node. We therefore not only save all that extra data in the message, but we also only need to parse the per-node data once.

The savings become significant at scale. Here is a comparison between the revised trunk and the trunk prior to this commit (all data was taken on odin, using openib, 64 nodes, unity message routing, tested with application consisting of mpi_init/mpi_barrier/mpi_finalize, all execution times given in seconds, all launch message sizes in bytes):

Per-node scaling, taken at 1ppn:

#nodes           original trunk                         revised trunk
             time               size                time               size
      1      0.10                819                0.09                564
      2      0.14               1070                0.14                677
      3      0.15               1321                0.14                790
      4      0.15               1572                0.15                903
      8      0.17               2576                0.20               1355
     16      0.25               4584                0.21               2259
     32      0.28               8600                0.27               4067
     64      0.50              16632                0.39               7683

Per-proc scaling, taken at 64 nodes

   ppn             original trunk                         revised trunk
              time               size                time               size
      1       0.50              16669                0.40               7720
      2       0.55              32733                0.54              11048
      3       0.87              48797                0.81              14376
      4       1.0               64861                0.85              17704


Condensing those numbers, it appears we gained:

per-node message size: 251 bytes/node -> 113 bytes/node

per-proc message size: 251 bytes/proc  -> 52 bytes/proc

per-job message size:  568 bytes/job -> 399 bytes/job 
(job-specific data such as jobid, override oversubscribe flag, total #procs in job, total slots allocated)

The fact that the two pre-commit trunk numbers are the same confirms the fact that each proc was containing the node data as well. It isn't quite the 10x message reduction I had hoped to get, but it is significant and gives much better scaling.

Note that the timing info was, as usual, pretty chaotic - the numbers cited here were typical across several runs taken after the initial one to avoid NFS file positioning influences.

Also note that this commit removes the orte_process_info.vpid_start field and the handful of places that passed that useless value. By definition, all jobs start at vpid=0, so all we were doing is passing "0" around. In fact, many places simply hardwired it to "0" anyway rather than deal with it.

This commit was SVN r16428.
2007-10-11 15:57:26 +00:00
Ralph Castain
54b2cf747e These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component.

This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:

As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.

In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.

The incoming changes revamp these procedures in three ways:

1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.

The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.

Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.


2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.

The size of this data has been reduced in three ways:

(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.

To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.

(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.

(c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.

While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.


3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.

It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.

Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.


There are a few minor additional changes in the commit that I'll just note in passing:

* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.

* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.

* cleanup of some stale header files

This commit was SVN r16364.
2007-10-05 19:48:23 +00:00
Tim Prins
c46ed1d5d4 Make it so the universe size is passed through the ODLS instead of through a gpr trigger during MPI init. This matches what is currently being done with the app number.
The default odls has been updated and works fine. The process odls has been updated, but I could not verify its operation. The bproc ODLS has not been updated yet. Ralph will look at it soon.

This commit was SVN r15257.
2007-07-02 01:33:35 +00:00
Ralph Castain
d9acc93efa Compute and pass the local_rank and local number of procs (in that proc's job) on the node.
To be precise, given this hypothetical launching pattern:

host1: vpids 0, 2, 4, 6
host2: vpids 1, 3, 5, 7

The local_rank for these procs would be:

host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3
host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3

and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid.

I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future.

This commit was SVN r14706.
2007-05-21 14:30:10 +00:00
Ralph Castain
a1153fdc8f Eliminate virtually all of the attribute_predefined data from the STG1 message. We now compute the total number of slots allocated to us and save that in the registry - the attributed_predefined then retrieves it via the STG1 message. The app_num is passed via the process_info structure, which gets the value from the ODLS in the environment.
Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available.

This commit was SVN r12788.
2006-12-07 03:11:20 +00:00
George Bosilca
6afa4c6c64 Windows friendly version. We have to split the OMPI_DECLSPEC in at least 3
different macros, one for each project. Therefore, now we have OPAL_DECLSPEC,
ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project.

This commit was SVN r11270.
2006-08-20 15:54:04 +00:00