1
1
Граф коммитов

308 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
b1bf557024 Fix the hostfile parser so it correctly ignores binding directives that are just integers. Fix the create_dmns function so we don't hang if we can't get an error before creating the job map for an application. 2014-12-05 15:47:09 -08:00
Ralph Castain
c88f181efe Fix singleton comm-spawn, yet again. The new grpcomm collectives require a complete knowledge of every active proc in the system in case they participate in a collective. So ensure we pass the required job info when we spawn new daemons, and construct the necessary connections to allow grpcomm to operate. 2014-12-03 18:11:17 -08:00
Elena
b17ea23ce0 fixes for direct routed component under mpirun 2014-11-26 13:36:49 +02:00
Gilles Gouaillardet
578fe41788 fix hangs introduced by previous commit a6744b8177 2014-11-25 17:50:44 +09:00
Gilles Gouaillardet
a6744b8177 fix misc memory leaks specific to the master 2014-11-25 13:52:10 +09:00
Ralph Castain
48f702827e First part of memory leak cleanups from Gilles 2014-11-24 16:53:33 -08:00
Ralph Castain
6fbc68c830 Update the grpcomm direct component's priority so it sits at the bottom of the list, as it should now that the other components are active. Cleanup up the signature print function a touch to make it more readable. Remove the unneeded xcast functions in brks and rcd components as we will just fall thru to using the "direct" one 2014-11-03 14:43:17 -08:00
Elena
e319c95267 fixes for grpcomm rcd/brucks algorithms 2014-10-09 06:12:26 +02:00
Ralph Castain
9e7e90265f Temporarily make the direct grpcomm component the default until we can debug the other modules
This commit was SVN r32707.
2014-09-11 14:47:54 +00:00
Ralph Castain
4eb6291334 Avoid conflicts when multiple collectives are underway in ORTE by giving each grpcomm component its own RML tag and posting persistent receives. We use the signature anyway to determine which collective the received message is addressing, so there is no need to post non-persistent receives.
This commit was SVN r32703.
2014-09-10 17:36:16 +00:00
Ralph Castain
6323b226c7 Bring over some updates from the PMIx branch - mostly just minor cleanups. Make the direct grpcomm component no longer be the default. For now, we seem to be having problems with non-blocking fence operations, so make them not be the default under any scenario (e.g., when sm is the only btl in operation).
This commit was SVN r32673.
2014-09-06 19:19:44 +00:00
Ralph Castain
fafdbeec0c Cleanup and enable the new daemon collective modules for more scalable operations. Thanks to Nadezhda Kogteva (Mellanox) for doing them.
This commit was SVN r32624.
2014-08-28 20:35:35 +00:00
Ralph Castain
731a878ff3 Add a bunch of debug to help track down the problem, and eventually find another place where comparison of signatures was incorrectly performed - use the dss compare operation to be consistent and safe
This commit was SVN r32620.
2014-08-27 19:52:20 +00:00
Ralph Castain
b87b69e977 Ensure the nodes get added to the job map on the remote nodes, add some debug to grpcomm daemon array construction
This commit was SVN r32617.
2014-08-27 16:16:46 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “lmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
Ralph Castain
6c5e592785 Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment.
Update the orte/test/mpi/coll_test.c test to show the revised example.

This commit was SVN r32234.

The following SVN revision numbers were found above:
  r32203 --> open-mpi/ompi@a523dba41d
  r32210 --> open-mpi/ompi@2ce11ed5c4
  r32222 --> open-mpi/ompi@d55f16db50
2014-07-15 03:48:00 +00:00
Ralph Castain
d55f16db50 Fix a hang in daemon collectives when run on multinode systems
This commit was SVN r32222.
2014-07-12 00:43:12 +00:00
Ralph Castain
a523dba41d NOTE: this modifies the MPI-RTE interface
We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch. which required modification
s to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunti
ng.

This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't curren
tly have a way of tagging the collective to a specific thread. From the comment in the code:

 * In order to allow scalable
 * generation of collective id's, they are formed as:
 *
 * top 32-bits are the jobid of the procs involved in
 * the collective. For collectives across multiple jobs
 * (e.g., in a connect_accept), the daemon jobid will
 * be used as the id will be issued by mpirun. This
 * won't cause problems because daemons don't use the
 * collective_id
 *
 * bottom 32-bits are a rolling counter that recycles
 * when the max is hit. The daemon will cleanup each
 * collective upon completion, so this means a job can
 * never have more than 2**32 collectives going on at
 * a time. If someone needs more than that - they've got
 * a problem.
 *
 * Note that this means (for now) that RTE-level collectives
 * cannot be done by individual threads - they must be
 * done at the overall process level. This is required as
 * there is no guaranteed ordering for the collective id's,
 * and all the participants must agree on the id of the
 * collective they are executing. So if thread A on one
 * process asks for a collective id before thread B does,
 * but B asks before A on another process, the collectives will
 * be mixed and not result in the expected behavior. We may
 * find a way to relax this requirement in the future by
 * adding a thread context id to the jobid field (maybe taking the
 * lower 16-bits of that field).

This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives.

This commit was SVN r32203.
2014-07-10 18:53:12 +00:00
Ralph Castain
5d5ae41ea5 Cleanup a memory leak in the daemons - thanks to Artem for spotting it
This commit was SVN r31970.
2014-06-09 17:14:02 +00:00
Ralph Castain
cf2c7381d0 Replace the PML barrier with an RTE barrier for now until we can come up with a better solution for connectionless BTLs.
Refs trac:4643

This commit was SVN r31915.

The following Trac tickets were found above:
  Ticket 4643 --> https://svn.open-mpi.org/trac/ompi/ticket/4643
2014-06-01 16:08:56 +00:00
Ralph Castain
1107f9099e Per the RFC issued here:
http://www.open-mpi.org/community/lists/devel/2014/05/14827.php

Refactor PMI support

This commit was SVN r31907.
2014-06-01 04:28:17 +00:00
Nathan Hjelm
59d09ad9de orte: fix several small memory leaks
grpcomm: fix memory leaks

We were leaking the caddy object used to pass data to the callback
function. This commit fixes these leaks.

oob,rml: fix memory leaks

This commit fixes several leaks:

 - Both the oob/base and oob/tcp were leaking objects on their peer
   hash tables. Iterate on the hash tables and free any objects.

 - Leaked sent messages because of missing OBJ_RELEASE. I placed the
   release in ORTE_RML_SEND_COMPLETE to catch all the possible
   paths.

ess/base: close the state framework

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31776.
2014-05-15 15:06:27 +00:00
Ralph Castain
11faab1091 The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
a8e2d6c3a6 The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature:
top_ompi_srcdir  ->  OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR

We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.

Only thing left is ompilibdir being treated similar to what we dif for srcdir/builddir. Coming soon.

This commit was SVN r31678.
2014-05-07 21:48:53 +00:00
Ralph Castain
c4c9bc1573 As per the RFC:
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php

Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM

This commit was SVN r31557.
2014-04-29 21:49:23 +00:00
Ralph Castain
9d2f5f6b1f Silence warning
cmr=v1.8:reviewer=ompi-gk1.8

This commit was SVN r31294.
2014-03-29 19:10:26 +00:00
Ralph Castain
f259d50ed7 Fully fix the PMI2 warning - turned out to be larger than originally thought due to the way the function was being handled across multiple files. Properly resolve the problem by not compiling the file if PMI2 is not desired, and then appropriately setting the visibility of the function within the module
Refs trac:4400

This commit was SVN r31084.

The following Trac tickets were found above:
  Ticket 4400 --> https://svn.open-mpi.org/trac/ompi/ticket/4400
2014-03-17 17:36:37 +00:00
Ralph Castain
e152449be4 Silence warning
cmr=v1.7.5:reviewer=ompi-gk1.7

This commit was SVN r31083.
2014-03-17 17:05:24 +00:00
Ralph Castain
1326ed704f Per the RFC discussed here:
http://www.open-mpi.org/community/lists/devel/2014/01/13789.php

add support for async modex when requested.

cmr=v1.7.5:reviewer=jsquyres:subject=Add async modex support

This commit was SVN r30565.
2014-02-05 14:39:27 +00:00
Mike Dubman
874c4e2558 PMI2: add missing file from prev commit
Refs trac:4119

This commit was SVN r30301.

The following Trac tickets were found above:
  Ticket 4119 --> https://svn.open-mpi.org/trac/ompi/ticket/4119
2014-01-16 13:17:08 +00:00
Mike Dubman
98234b5a69 SLURM/PMI2: Fix parsing of PMI2 process mapping
fixed by AlexM, reviewed by miked
cmr=v1.7.4:reviewer=rhc

This commit was SVN r30300.
2014-01-16 12:05:29 +00:00
Ralph Castain
286ff6d552 For large scale systems, we would like to avoid doing a full modex during MPI_Init so that launch will scale a little better. At the moment, our options are somewhat limited as only a few BTLs don't immediately call modex_recv on all procs during startup. However, for those situations where someone can take advantage of it, add the ability to do a "modex on demand" retrieval of data from remote procs when we launch via mpirun.
NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message!

Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place.

This behavior will *only* take effect under the following conditions:

1. launched via mpirun

2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX

3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to *want* this behavior to get it.

The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes:

1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data".

2. adjust the SM rendezvous logic to loop until the required file has been created

Creating a placeholder to bring this over to 1.7.5 when ready.

cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale

This commit was SVN r30259.
2014-01-11 17:36:06 +00:00
Brian Barrett
8b778903d8 Fix longstanding issue with our multi-project support. Rather than using
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi.  This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.

This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Ralph Castain
a8a91b374e Update component-level selection comments to match latest revisions
cmr=v1.7.4:reviewer=rhc

This commit was SVN r30087.
2013-12-25 19:12:43 +00:00
Ralph Castain
24c811805f ****************************************************************
This change contains a non-mandatory modification
       of the MPI-RTE interface. Anyone wishing to support
       coprocessors such as the Xeon Phi may wish to add
       the required definition and underlying support
****************************************************************

Add locality support for coprocessors such as the Intel Xeon Phi.

Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.

So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:

1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board

2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions

3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.

4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.

5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored.

6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.

cmr:v1.7.4:reviewer=hjelmn

This commit was SVN r29435.
2013-10-14 16:52:58 +00:00
Ralph Castain
9902748108 ***** THIS INCLUDES A SMALL CHANGE IN THE MPI-RTE INTERFACE *****
Fix two problems that surfaced when using direct launch under SLURM:

1. locally store our own data because some BTLs want to retrieve 
   it during add_procs rather than use what they have internally

2. cleanup MPI_Abort so it correctly passes the error status all
   the way down to the actual exit. When someone implemented the
   "abort_peers" API, they left out the error status. So we lost
   it at that point and *always* exited with a status of 1. This 
   forces a change to the API to include the status.

cmr:v1.7.3:reviewer=jsquyres:subject=Fix MPI_Abort and modex_recv for direct launch

This commit was SVN r29405.
2013-10-08 18:37:59 +00:00
Ralph Castain
f4f2287958 Singletons currently start out by spawning an HNP - this is required solely in the cases where the singleton subsequently calls MPI_Comm_spawn or publishes port info without support from an external orte-server. In all other cases, the HNP is of no value and can actually be a detriment by creating additional overhead on the node. This is particularly concerning for async operations where processes may begin as singletons and then dynamically wireup to perform pt2pt communications.
So we now allow singletons to start on their own, only spawning an HNP when initiating an operation that actually requires it.

cmr:v1.7.4:reviewer=jsquyres

This commit was SVN r29354.
2013-10-04 02:58:26 +00:00
Nathan Hjelm
11722457ce Fix typo in grpcomm_pmi_module.c that was giving the wrong locality for direct launched jobs. Refs trac:3824
This commit was SVN r29348.

The following Trac tickets were found above:
  Ticket 3824 --> https://svn.open-mpi.org/trac/ompi/ticket/3824
2013-10-03 14:38:45 +00:00
Ralph Castain
71a24d6e74 Add some debug
This commit was SVN r29326.
2013-10-02 01:37:02 +00:00
Ralph Castain
d565a76814 Do some cleanup of the way we handle modex data. Identify data that needs to be shared with peers in my job vs data that needs to be shared with non-peers - no point in sharing extra data. When we share data with some process(es) from another job, we cannot know in advance what info they have or lack, so we have to share everything just in case. This limits the optimization we can do for things like comm_spawn.
Create a new required key in the OMPI layer for retrieving a "node id" from the database. ALL RTE'S MUST DEFINE THIS KEY. This allows us to compute locality in the MPI layer, which is necessary when we do things like intercomm_create.

cmr:v1.7.4:reviewer=rhc:subject=Cleanup handling of modex data

This commit was SVN r29274.
2013-09-27 00:37:49 +00:00
George Bosilca
9e6c3c0646 Save the error code.
This commit was SVN r29196.
2013-09-17 23:50:11 +00:00
Ralph Castain
c71e760e6c The modex code was unfortunately written solely for PMI1 when updated to minimize calls to PMI_get - add the required PMI2 code
This commit was SVN r29084.
2013-08-28 23:52:32 +00:00
Ralph Castain
6d24b34940 Extend the dpm framework API to support persistent accept/connect operations:
* paccept - establish a persistent listening port for async connect requests

* pconnect - async connect to remote process that has posted a paccept port. Provides a timeout mechanism, and allows the underlying implementation to retry until timeout 

* pclose - shuts down a prior paccept posting

Includes example programs paccept.c and pconnect.c in orte/test/mpi. New MPI extension interfaces coming...

This commit was SVN r29063.
2013-08-23 18:02:50 +00:00
Ralph Castain
a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in  https://bitbucket.org/rhc/ompi-oob2


WHAT:    Rewrite of ORTE OOB

WHY:       Support asynchronous progress and a host of other features

WHEN:    Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC

This commit was SVN r29058.
2013-08-22 16:37:40 +00:00
Ralph Castain
611d7f9f6b When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require.
This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times.

Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:

* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally

* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test

* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).

Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it

Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.

This commit was SVN r29040.
2013-08-17 00:49:18 +00:00
Nathan Hjelm
88cadc552d Make opal/db/pmi use as few PMI keys as possible.
This commit reintroduces key compression into the pmi db. This feature
compresses the keys stored into the component into a small number of
PMI keys by serializing the data and base64 encoding the result. This
will avoid issues with Cray PMI which restricts us to ~ 3 PMI keys per
rank.

This commit was SVN r28993.
2013-08-03 01:06:59 +00:00
Ralph Castain
6c1a140e99 Per request from Nathan, add a "commit" API to the opal db framework. This allows him to aggregate keys to work around the Cray's severe PMI limitations
This commit was SVN r28917.
2013-07-22 22:57:16 +00:00
Joshua Ladd
0b5c1f2ea8 Add 'generic' support for PMI2 (previously, we checked for PMI2 only on Cray systems.) If your resource manager (e.g. SLURM) has support for PMI2, then the --with-pmi configure flag will enable its usage. If you don't have PMI2, then you will fallback to regular old PMI1. This patch was submitted by Ralph Castain and reviewed and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd
This commit was SVN r28666.
2013-06-21 15:28:14 +00:00
Nathan Hjelm
518d1fe200 Fix two typos that prevented alps direct launch from working
This commit was SVN r28628.
2013-06-13 17:04:08 +00:00
Ralph Castain
3a354c4ea3 Cleanup the verbose output channel name
This commit was SVN r28391.
2013-04-24 23:44:02 +00:00
Ralph Castain
c5e1a7dc65 fix typo
This commit was SVN r28390.
2013-04-24 23:37:59 +00:00
Ralph Castain
2e8946db0a Add some debug output
This commit was SVN r28371.
2013-04-23 23:11:22 +00:00
Ralph Castain
45af6cf59e The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem.
This commit was SVN r28308.
2013-04-08 23:34:16 +00:00
Ralph Castain
e6ae088813 Cleanup error outputs when a daemon fails to start
This commit was SVN r28261.
2013-03-28 16:51:19 +00:00
Ralph Castain
21ee48de57 Add missing static declaration
This commit was SVN r28247.
2013-03-27 21:59:17 +00:00
Nathan Hjelm
c041156f60 Update ORTE frameworks to use the MCA framework system.
This commit was SVN r28240.
2013-03-27 21:14:43 +00:00
Nathan Hjelm
cf377db823 MCA/base: Add new MCA variable system
Features:
 - Support for an override parameter file (openmpi-mca-param-override.conf).
   Variable values in this file can not be overridden by any file or environment
   value.
 - Support for boolean, unsigned, and unsigned long long variables.
 - Support for true/false values.
 - Support for enumerations on integer variables.
 - Support for MPIT scope, verbosity, and binding.
 - Support for command line source.
 - Support for setting variable source via the environment using
   OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
 - Cleaner API.
 - Support for variable groups (equivalent to MPIT categories).

Notes:
 - Variables must be created with a backing store (char **, int *, or bool *)
   that must live at least as long as the variable.
 - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
   mca_base_var_set_value() to change the value.
 - String values are duplicated when the variable is registered. It is up to
   the caller to free the original value if necessary. The new value will be
   freed by the mca_base_var system and must not be freed by the user.
 - Variables with constant scope may not be settable.
 - Variable groups (and all associated variables) are deregistered when the
   component is closed or the component repository item is freed. This
   prevents a segmentation fault from accessing a variable after its component
   is unloaded.
 - After some discussion we decided we should remove the automatic registration
   of component priority variables. Few component actually made use of this
   feature.
 - The enumerator interface was updated to be general enough to handle
   future uses of the interface.
 - The code to generate ompi_info output has been moved into the MCA variable
   system. See mca_base_var_dump().

opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system

This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.

This commit was SVN r28236.
2013-03-27 21:09:41 +00:00
Ralph Castain
6ee32767d4 Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument.
This commit was SVN r28219.
2013-03-26 18:27:50 +00:00
Ralph Castain
cf9796accd Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes
This commit was SVN r28134.
2013-02-28 01:35:55 +00:00
Ralph Castain
8d2fa3693b First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package.
This commit was SVN r28116.
2013-02-26 20:44:56 +00:00
Ralph Castain
bd9265c560 Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data.
A few changes were required to support this move:

1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution).

2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time.

3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you seperately specify the "store" and "fetch" ordering.

4. the ORTE grpcomm and ess/pmi components, and the nidmap code,  were updated to work with the new db framework and to specify internal/global storage options.

No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework.

This commit was SVN r28112.
2013-02-26 17:50:04 +00:00
Ralph Castain
744ed49b2d Begin cleanup of the thread_lock calls in ORTE. We'll ignore the ones in the rml/oob for now as that code block is being rewritten anyway.
This commit was SVN r28053.
2013-02-13 01:53:12 +00:00
Ralph Castain
cfaefb3286 Remove the only place where PMI was used outside a component, and relocate that code to common/pmi.
This commit was SVN r27944.
2013-01-28 20:14:51 +00:00
Ralph Castain
c65de32218 Cleanup the PMI subsystems to support Sam's "rml-less" shared memory wireup. Only retrieve keys that are specifically requested, and only when they are requested. Let string values be segmented across multiple keys, but don't do it for anything else.
This commit was SVN r27737.
2013-01-03 02:16:10 +00:00
Ralph Castain
c26ed7dcdd Fix comm_spawn when ORTE progress thread is enabled by ensuring that all operations on the global list of active collectives are done in events to avoid conflicts.
This commit was SVN r27658.
2012-12-09 02:53:20 +00:00
Ralph Castain
26f1cd0909 Fix compiler warnings
This commit was SVN r27588.
2012-11-12 02:50:45 +00:00
Nathan Hjelm
bdedd8b0d3 Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1.
Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the symantics of mca_base_components_open/close as they are now symetric in their functionality.

This commit was SVN r27570.
2012-11-06 19:09:26 +00:00
Brian Barrett
e61c00212d Add files found in svn but not tarball
This commit was SVN r27549.
2012-11-01 02:27:03 +00:00
Ralph Castain
c4fd3df2df Remove unused variables
This commit was SVN r27319.
2012-09-12 12:03:24 +00:00
Ralph Castain
c82cfecc1c Cleanup comm_spawn for the multi-node case where at least one new process isn't spawned on every node. Avoid the complexities of trying to execute a daemon collective across the dynamic spawn as it becomes too hard to ensure that all daemons participate or are accounted for - instead, use a less scalable but workable solution of sending the data directly between the participating procs. Ensure that singletons get their collectives properly defined at startup so the spawned "HNP" is ready for them.
As a secondary cleanup, the HNP doesn't need to update its nidmap during an xcast as it already has an up-to-date picture of the situation. So just dump that data and move along.

This commit was SVN r27318.
2012-09-12 11:31:36 +00:00
Ralph Castain
efc4a40c8a It is okay for a key not to be found
This commit was SVN r27187.
2012-08-30 15:12:23 +00:00
Ralph Castain
6dbb7a8493 Just because a peer didn't post a particular pmi key, that doesn't mean it is an immediate irrecoverable error - could be they just don't have a matching interface. Let the upper layer decide what to do about it.
This commit was SVN r27186.
2012-08-30 14:12:09 +00:00
Ralph Castain
a3b08f5800 Fix a few things relating to comm_spawn that causes new daemons to be launched. Ensure that all new daemons receive a full pidmap. Properly mark the daemon job as "updated" when daemons are added
This commit was SVN r27177.
2012-08-29 03:11:37 +00:00
Ralph Castain
98580c117b Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine.
Remove some stale configure.m4's we no longer need.

Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.

This commit was SVN r27165.
2012-08-28 21:20:17 +00:00
Ralph Castain
d310dd8c58 Fix a strange race condition by creating a separate buffer for each send - apparently, just a retain isn't enough protection on some systems
This commit was SVN r27161.
2012-08-28 17:17:34 +00:00
Ralph Castain
11c68e2299 Correct the count in the pmi key
This commit was SVN r27156.
2012-08-28 15:05:02 +00:00
Ralph Castain
64cf75cec5 Add some debug
This commit was SVN r27087.
2012-08-17 02:19:26 +00:00
Ralph Castain
c4ee297a60 Cleanup the pmi grpcomm module so it passes non-btl modex data correctly.
This commit was SVN r26992.
2012-08-10 20:35:50 +00:00
Ralph Castain
0dfe29b1a6 Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required.
Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework.

Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one.

This commit was SVN r26678.
2012-06-27 14:53:55 +00:00
Brian Barrett
b22faedd9d Remove the Portals4 SHMEM reference implementation runtime support, as we're
no longer using the runtime provided by the reference implementation.

Remove the Catamount support from ORTE, since we're no longer supporting
Catamount.  Left the Catamount timer component, because I'm not sure whether
it's used on the XTs running CNL.

This commit was SVN r26677.
2012-06-27 14:17:43 +00:00
Shiqing Fan
6f746cdb33 remove a unused file.
This commit was SVN r26645.
2012-06-25 10:17:21 +00:00
Ralph Castain
0a713cd27e Add database framework to ORTE and refactor modex code to utilize it. Create the "hash" db component from the prior modex db code. Leave the other components ignored for now - will activate them later.
Modex is still a blocking operation at this point.

This commit was SVN r26618.
2012-06-19 13:38:42 +00:00
Ralph Castain
9e0bb6ae28 Revert r26600 and r26601 for a couple of reasons:
1. they modified the OMPI-ORTE interface, which is something I promised to avoid doing unless absolutely necessary, and

2. the framework ident is already in the component name key provided to the modex db. What is missing is the project ident, but as Jeff and I discussed last week, we really need to add that field to the component struct anyway to avoid multi-project collisions on framework names. That will be done over the next couple of weeks as a separate effort.

This commit was SVN r26613.

The following SVN revision numbers were found above:
  r26600 --> open-mpi/ompi@5ba4deff07
  r26601 --> open-mpi/ompi@0e3094c318
2012-06-16 09:11:03 +00:00
Ralph Castain
96c778656a Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP.
Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build.

This commit was SVN r26606.
2012-06-15 10:15:07 +00:00
Ralph Castain
0e3094c318 Update the other grpcomm modules to new API
This commit was SVN r26601.
2012-06-14 03:28:48 +00:00
Ralph Castain
5ba4deff07 Extend the modex database to support multiple projects and frameworks that might have duplicate component names. No visible API change in the BTL's as it was executed solely in the ompi modex code.
This commit was SVN r26600.
2012-06-14 02:55:06 +00:00
Brian Barrett
7406ef1241 Make all the PMI components depend on the common pmi library and properly
install the common pmi library

This commit was SVN r26588.
2012-06-11 15:58:09 +00:00
Shiqing Fan
2abf783fa0 Remove a unnecessary definition before the real one.
This commit was SVN r26562.
2012-06-06 14:15:39 +00:00
Nathan Hjelm
b9959a95cd ack! one more
This commit was SVN r26472.
2012-05-22 20:52:52 +00:00
Nathan Hjelm
cdc3c87ba6 move pmi init/finalize into a common component
This commit was SVN r26470.
2012-05-22 15:15:39 +00:00
Ralph Castain
c4f8043064 Per Nathan, with a little cleanup by me: update the PMI support to aggregate modex info, thus reducing the number of keys required so it fits within Cray default constraints
This commit was SVN r26456.
2012-05-19 16:12:52 +00:00
Ralph Castain
b2f77bf08f Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications.
Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well.

This commit was SVN r26380.
2012-05-02 21:00:22 +00:00
Ralph Castain
7741ba47be Fix comm_spawn that spans multiple nodes
This commit was SVN r26268.
2012-04-13 01:59:07 +00:00
Ralph Castain
4d16790836 Fix collectives for jobs running across partial allocations
This commit was SVN r26267.
2012-04-13 00:38:47 +00:00
Ralph Castain
9cd4c06488 Get things to build and run when --disable-orte is specified
This commit was SVN r26263.
2012-04-10 21:50:01 +00:00
Ralph Castain
f5cd996b91 Fix the case where n=1
This commit was SVN r26258.
2012-04-09 22:44:56 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
366f9d1518 Add some missing localities to the hwloc pretty-print, fix pmi modex
This commit was SVN r26105.
2012-03-06 06:21:10 +00:00
Ralph Castain
a0edae52f2 Ensure the wrapper flags get entered in the right order, with -lpmi coming before the alps util libs
This commit was SVN r25809.
2012-01-27 20:56:21 +00:00
Ralph Castain
be3dfb6a1a Ensure that we only add -lpmi once to the wrapper compilers, no matter how many components might use it.
This commit was SVN r25753.
2012-01-20 04:56:38 +00:00