1
1

4303 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
611d7f9f6b When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require.
This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times.

Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:

* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally

* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test

* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).

Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it

Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.

This commit was SVN r29040.
2013-08-17 00:49:18 +00:00
Ralph Castain
f8a72feb25 Silence unitialized var warning
This commit was SVN r29036.
2013-08-16 21:39:28 +00:00
Ralph Castain
c74c54e18d Cleanup uninitialized warnings
This commit was SVN r29033.
2013-08-16 21:23:09 +00:00
Ralph Castain
33beab5918 Avoid segfault due to uninitialized variable
This commit was SVN r29030.
2013-08-16 21:10:38 +00:00
Ralph Castain
bebe852057 Add new info key for publish that allows user to designate that the port is to be unique - i.e., to return an error if that service has already been published. Default is to overwrite
This commit was SVN r29028.
2013-08-14 04:21:17 +00:00
Ralph Castain
318467c04f If we only have global scope, then don't fall back to looking at local scope if the lookup target wasn't found else we will hang
This commit was SVN r29025.
2013-08-13 04:45:33 +00:00
Nathan Hjelm
6c75699068 coll/ml: fix typo in assert that could cause an abort in debug builds.
cmr=v1.7.3:reviewer=manjugv

This commit was SVN r29024.
2013-08-12 14:31:44 +00:00
Jeff Squyres
c09ec204ad Change usNIC BTL to always use small fragments when there is a
non-contiguous converter.  We can't "convert on the fly" because the #
of bytes requested may not divide evenly into the convertor data type.

This commit was SVN r29014.
2013-08-11 17:04:13 +00:00
Nathan Hjelm
47320713bb coll/ml: do not register variables in open and fix a bug in the coll/ml parser
cmr=v1.7.3:reviewer=pasha

This commit was SVN r29010.
2013-08-09 17:55:30 +00:00
Rolf vandeVaart
cd72024a3c Refactor some of the initialization code.
This commit was SVN r29009.
2013-08-09 14:54:17 +00:00
Edgar Gabriel
f7391eca23 Lazy open does not work for the addproc sharedfp component since it starts by
spawning a process using MPI_Comm_spawn. For this, the first operation has to
be collective which we can not guarantuee outside of the MPI_File_open
operation.

This commit was SVN r29008.
2013-08-06 20:48:20 +00:00
Edgar Gabriel
e348f5567f add unignore for me.
This commit was SVN r29007.
2013-08-06 20:47:08 +00:00
George Bosilca
837b3363fe Silence few warnings.
This commit was SVN r29004.
2013-08-06 09:38:30 +00:00
Brian Barrett
2cc947513b * Fix some compile errors
* Need to subtract 1 off the size so that we stay in the bit length requirements

This commit was SVN r28997.
2013-08-05 18:49:48 +00:00
Jeff Squyres
87910daf51 Fix a collection of bugs found by QA and Coverity, and make some minor
improvements:

* Fix minor memory leaks during component_init
* Ensure that an initialization loop does not underflow an unsigned int
* Improve mlock limit checking
* Fix set of BTL modules created during component_init when failing to
  get QP resources or otherwise excluding some (but not all) usnic
  verbs devices
* Fix/improve error messages to be consistent with other Cisco
  documentation
* Randomize the initial sliding window sequence number so that we
  silently drop incoming frames from previous jobs that still have
  existant processes in the middle of dying (and are still
  transmitting) 
* Ensure we don't break out of add_procs too soon and create an
  asymetrical view of what interfaces are available

This commit was SVN r28975.
2013-08-01 16:56:15 +00:00
Nathan Hjelm
8429485a39 mpool/grdma: use the rcache even if not using mpi_leave_pinned or
mpi_leave_pinned_pipeline

This change should improve performance is the non-pinned case where
the same memory region is involved in multiple simultaneous transfers.

cmr=v1.7.3:reviewer=brbarret

This commit was SVN r28973.
2013-07-31 23:50:41 +00:00
Nathan Hjelm
1382d4fb53 Fix typos in _Complex ops
This commit was SVN r28959.
2013-07-26 17:02:45 +00:00
Nathan Hjelm
e4f105ffb3 revert change that shouldn't have been part of r28952
This commit was SVN r28953.

The following SVN revision numbers were found above:
  r28952 --> open-mpi/ompi@cb90a4a7fc
2013-07-25 20:23:55 +00:00
Nathan Hjelm
cb90a4a7fc Add simple algorithms to support MPI_IN_PLACE for MPI_Alltoall, MPI_Alltoallv, and MPI_Alltoallw.
Working on faster algorithms for tuned that will come at a later time.

cmr=v1.7.3:ticket=trac:2965

This commit was SVN r28952.

The following Trac tickets were found above:
  Ticket 2965 --> https://svn.open-mpi.org/trac/ompi/ticket/2965
2013-07-25 19:19:41 +00:00
Nathan Hjelm
99adeb7f6e Fix support for complex datatypes when fortran is not available but _Complex is
This commit was SVN r28951.
2013-07-25 19:08:21 +00:00
Edgar Gabriel
012f99c3b6 ompi_ignore this component until we find a solution for the dependence on
libmpi for the external process that is being spawned by this component.

This commit was SVN r28945.
2013-07-24 20:02:47 +00:00
Jeff Squyres
f7337b8f77 Correct faulty max payload and MTU computations (and update some
debugging that helped us find those).

This commit was SVN r28942.
2013-07-24 16:06:28 +00:00
Ralph Castain
db214a2321 Refs trac:3697 - use the opal_pmi_error function instead of ompi_error as the returned error codes are from PMI
This commit was SVN r28941.

The following Trac tickets were found above:
  Ticket 3697 --> https://svn.open-mpi.org/trac/ompi/ticket/3697
2013-07-24 04:05:41 +00:00
Jeff Squyres
5323051047 Use sysfs to check MPI has enough VFs, QPs, and CQs
Use the new sysfs files to check that there are enough VFs, QPs, and
CQs for all the MPI processes on this server.
    
Move the checking code into its own subroutine to make it smaller and
easier to read/grok.

This commit was SVN r28937.
2013-07-24 00:38:32 +00:00
Ralph Castain
59a71765cf Hmmm...these error outputs will never occur, which is probably not what the author intended. So do the output and THEN jump to the error exit.
This commit was SVN r28918.
2013-07-22 22:58:03 +00:00
Edgar Gabriel
8ffc1aac89 update the _component.c files in ompio to use the explicit assignment of the
mca_register_component_params element of the structure.

This commit was SVN r28914.
2013-07-22 21:11:05 +00:00
Nathan Hjelm
b17cd13c09 sharedfp: ensure sharedfp components register their parameters in mca_register_component_params not mca_component_open
This commit was SVN r28910.
2013-07-22 17:53:58 +00:00
Jeff Squyres
b437041aeb Update one more comment.
This commit was SVN r28908.
2013-07-22 17:29:00 +00:00
Jeff Squyres
4b6006402d Use the RTE framework instead of calling ORTE directly.
Brian (rightfully) hit me on the head with the
don't-use-ORTE-use-the-rte-framework clue bat; the usnic BTL now
nicely plays with the RTE framework.

This commit was SVN r28907.
2013-07-22 17:28:23 +00:00
Jeff Squyres
ca9da8a554 Fix minor typo in the comments/docs.
This commit was SVN r28905.
2013-07-22 17:24:17 +00:00
Rolf vandeVaart
67badf384c Only search SONAME of library. Expand comments.
This commit was SVN r28904.
2013-07-22 15:54:45 +00:00
Brian Barrett
e1d72409cd add missing header
This commit was SVN r28897.
2013-07-21 19:40:31 +00:00
Brian Barrett
704f1ecc18 fix non-orte builds of PSM
This commit was SVN r28893.
2013-07-21 19:12:32 +00:00
Brian Barrett
05ab9cbaa6 Need to ship pmi_internal.h
This commit was SVN r28891.
2013-07-21 19:00:50 +00:00
Brian Barrett
495384d8b7 Update documentation in rte.h to match recent changes
This commit was SVN r28887.
2013-07-20 22:14:12 +00:00
Brian Barrett
414ba3dad8 Update PMI RTE to match error handling changes that were part of r28852.
Note that the PMI RTE still doesn't listen for asynchronous errors, so
the error handler still won't ever actually do anything :).

This commit was SVN r28886.

The following SVN revision numbers were found above:
  r28852 --> open-mpi/ompi@e4e678e234
2013-07-20 22:09:02 +00:00
Brian Barrett
5bfd980968 update PMI RTE component to adapt to ORTE changes
This commit was SVN r28885.
2013-07-20 22:06:47 +00:00
Brian Barrett
d984d25da3 Remove orte header file from sharedfp components (OMPI layer should not
include ORTE layer with the RTE framework).  Thankfully, nothing used
orte_show_help, so easy fix.

This commit was SVN r28884.
2013-07-20 22:03:44 +00:00
Jeff Squyres
194b285447 First commit of the Cisco usNIC BTL.
This BTL accesses the Cisco usNIC Linux device via the Linux verbs
API via Unreliable Datagram queue pairs.  A few noteworthy points:

 * This BTL does most of its own fragmentation; it tells the PML that
   it has a very high max_send_size (much higher than the network
   MTU).
 * Since UD fragments are, by definition, unreliable, the usnic BTL
   handles all of its own reliability via a sliding window approach
   using the opal_hotel construct and many tricks stolen from the
   corpus of knowledge surrounding efficient TCP.
 * There is a fun PML latency-metric based optimization for NUMA
   awareness of short messages.
 * Note that this is ''not'' a generic UD verbs BTL; it is specific to
   the Cisco usNIC device.

This commit was SVN r28879.
2013-07-19 22:13:58 +00:00
Jeff Squyres
3546163c48 Devices that do not support RC QP's are also intentionally skipped;
don't warn about skipping them.

This commit was SVN r28874.
2013-07-19 19:05:18 +00:00
Ralph Castain
e4e678e234 Per the RFC and discussion on the devel list, update the RTE-MPI error handling interface. There are a few differences in the code from the original RFC that came out of the discussion - I've captured those in the following writeup
George and I were talking about ORTE's error handling the other day in regards to the right way to deal with errors in the updated OOB. Specifically, it seemed a bad idea for a library such as ORTE to be aborting the job on its own prerogative. If we lose a connection or cannot send a message, then we really should just report it upwards and let the application and/or upper layers decide what to do about it.

The current code base only allows a single error callback to exist, which seemed unduly limiting. So, based on the conversation, I've modified the errmgr interface to provide a mechanism for registering any number of error handlers (this replaces the current "set_fault_callback" API). When an error occurs, these handlers will be called in order until one responds that the error has been "resolved" - i.e., no further action is required - by returning OMPI_SUCCESS. The default MPI layer error handler is specified to go "last" and calls mpi_abort, so the current "abort" behavior is preserved unless other error handlers are registered.

In the register_callback function, I provide an "order" param so you can specify "this callback must come first" or "this callback must come last". Seemed to me that we will probably have different code areas registering callbacks, and one might require it go first (the default "abort" will always require it go last). So you can append and prepend, or go first. Note that only one registration can declare itself "first" or "last", and since the default "abort" callback automatically takes "last", that one isn't available. :-)

The errhandler callback function passes an opal_pointer_array of structs, each of which contains the name of the proc involved (which can be yourself for internal errors) and the error code. This is a change from the current fault callback which returned an opal_pointer_array of just process names. Rationale is that you might need to see the cause of the error to decide what action to take. I realize that isn't a requirement for remote procs, but remember that we will use the SAME interface to report RTE errors internal to the proc itself. In those cases, you really do need to see the error code. It is legal to pass a NULL for the pointer array (e.g., when reporting an internal failure without error code), so handlers must be prepared for that possibility. If people find that too burdensome, we can remove it.

Should we ever decide to create a separate callback path for internal errors vs remote process failures, or if we decide to do something different based on experience, then we can adjust this API.

This commit was SVN r28852.
2013-07-19 01:08:53 +00:00
Ralph Castain
8a8b4896be Need to protect libgen.h as some systems might not have it
This commit was SVN r28845.
2013-07-18 20:21:37 +00:00
Edgar Gabriel
185e365dad make the sm sharedfp component compile on Mac.
This commit was SVN r28844.
2013-07-18 20:17:14 +00:00
Edgar Gabriel
93cef82873 remove the ylib component from the fcoll framework. It is not used, there are
no plans to use it. We can always recover it from svn if we would ever change
our minds.

This commit was SVN r28840.
2013-07-18 16:18:06 +00:00
Pavel Shamis
68969ba6e5 Removing bogus references in iboffload code.
cmr:v1.7:reviewer=hjelmn

This commit was SVN r28834.
2013-07-17 22:35:24 +00:00
Rolf vandeVaart
49663fb802 Move CUDA-aware configurary to its own file and other minor changes due to review.
This commit was SVN r28832.
2013-07-17 22:12:29 +00:00
Edgar Gabriel
6e8522fec5 infuse life into the shared file pointer framework. For this:
- extend the framework API
 - remove the dummy component, not require anymore
 - add four components to perform the actual job.

This commit was SVN r28828.
2013-07-17 21:55:24 +00:00
Edgar Gabriel
ac694b7056 in preparation for the new shared file pointer components to be committed
soon:
 - add a new abstraction layer to be used internally for some operations
 - add a new mca parameter to control lazy intialization of shared file
 pointer structures

This commit was SVN r28826.
2013-07-17 21:30:50 +00:00
Vishwanath Venkatesan
ce8f8f0829 Changing the MPI Datatype from MPI_LONG to OMPI_OFFSET_DATATYPE for send/recv offsets
This commit was SVN r28822.
2013-07-17 19:16:53 +00:00
Nathan Hjelm
d4c6029cf3 sbgp/ibnet: set mca_sbgp_ibnet_component.mtu to IBV_MTU_1024 before registering it. cmr:v1.7:reviewer=pasha
This commit was SVN r28821.
2013-07-17 19:16:31 +00:00