1
1

20 Коммитов

Автор SHA1 Сообщение Дата
Dave Goodell
f6bb853409 usnic: properly check src iface in route queries
rtnetlink doesn't check the source address when determining whether to
return route info for a query.  So we need to check that the OIF matches
the OIF of the source interface name.  Without this check, OMPI might
pair a local interface which does not have a route to a particular
remote interface.

Fixes Cisco bug CSCup55797.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32090.
2014-06-25 22:39:02 +00:00
Jeff Squyres
baeae72370 usnic: remove "device" and "port" language from show_help() messages
Move away from verbs-specific terms "device" and "port" in the usnic
BTL help messages.  Replace them with "usNIC interface" (since usNIC
has no concept of a port).

cmr=v1.8.2:ticket=trac:4734

This commit was SVN r32029.

The following Trac tickets were found above:
  Ticket 4734 --> https://svn.open-mpi.org/trac/ompi/ticket/4734
2014-06-18 15:20:50 +00:00
Jeff Squyres
184e4fc0ca usnic: ensure that procs agree on use_udp value
Add the component use_udp value into the modex.  If my component's
use_udp value doesn't agree with the use_udp value from a peer's modex
data, print a helpful message and disqualify the usnic BTL (the usnic
BTL will not be used).  This prevents accidental customer
misconfigurations.

Reviewed by Dave Goodell

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31689.
2014-05-08 16:43:50 +00:00
Jeff Squyres
e9c3df652e usnic: reduce sizeof(ompi_btl_usnic_addr_t) to 56 bytes
Trivial struct re-ordering to eliminate holes in the middle of the
struct (although there's still a hole at the end) and reduce the
overall size of the struct from 64 to 56 bytes.  Also change mtu from
int to uint16_t; there was no need for it to be that large.

Reviewed by Dave Goodell

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31688.
2014-05-08 16:38:59 +00:00
Jeff Squyres
cdb396697c usnic: do not disqualify if a peer does not put usnic modex info
If ompi_modex_recv() fails with OPAL_ERR_DATA_VALUE_NOT_FOUND, it
simply means that the peer process did not put any usnic BTL modex
info -- it is not an error.  So have the usnic BTL simply ignore that
peer (vs. disqualifying itself / treating this like a real error).

Refs trac:4442.

This commit was SVN r31258.

The following Trac tickets were found above:
  Ticket 4442 --> https://svn.open-mpi.org/trac/ompi/ticket/4442
2014-03-27 19:37:07 +00:00
Jeff Squyres
3cbdf33b88 This is what r30852 should have been: Consolidate into a single, outter loop of ibv_create_ah() calls
Follow on to SVN trunk r30850: consolidate the ibv_create_ah() calls
into a single loop, MPI_WAITALL-style.  That is, call the (effectively
non-blocking) ibv_create_ah() for each endpoint.  If we get
NULL+EAGAIN, it means that the UDP ARP is still ongoing down in the
kernel, so just try again later.  We put these all into a single loop
because it allows us to parallelize the ARP progress in the kernel.

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30879.

The following SVN revision numbers were found above:
  r30850 --> open-mpi/ompi@3641500442
  r30852 --> open-mpi/ompi@4e282a3295

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-27 17:19:50 +00:00
Jeff Squyres
45810f0efb Revert r30852: the wrong version of this patch got committed to SVN.
This commit was SVN r30878.

The following SVN revision numbers were found above:
  r30852 --> open-mpi/ompi@4e282a3295
2014-02-27 15:02:15 +00:00
Jeff Squyres
4e282a3295 Consolidate into a single, outter loop of ibv_create_ah() calls
Follow on to SVN trunk r30850: consolidate the ibv_create_ah() calls
into a single loop, MPI_WAITALL-style.  That is, call the (effectively
non-blocking) ibv_create_ah() for each endpoint.  If we get
NULL+EAGAIN, it means that the UDP ARP is still ongoing down in the
kernel, so just try again later.  We put these all into a single loop
because it allows us to parallelize the ARP progress in the kernel.

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30852.

The following SVN revision numbers were found above:
  r30850 --> open-mpi/ompi@3641500442

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-26 11:02:12 +00:00
Dave Goodell
3641500442 usnic: Loop on the ibv_create_ah call.
ibv_create_ah() may need to effect an ARP resolution, which may take
some time.  Rather than blocking in ibv_create_ah(), the usNIC driver
may return NULL and set errno to EAGAIN indicating that we should try
again (i.e., the ARP resolution is proceeding under the covers).

So add a simple loop here to loop over ibv_create_ah() until it
returns non-(NULL+EAGAIN).  A future commit will make this a bit more
efficient.

Authored-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Dave Goodell <dgoodell@cisco.com>

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30850.

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-26 07:50:38 +00:00
Dave Goodell
c40f8879c8 usnic: improve interface matching (esp. for UDP)
Prior to this commit we matched local interfaces to remote interfaces in
order to create endpoints in a simplistic way.  If any remote interfaces
were on the same subnet as any of our local interfaces then only local
interfaces would be paired (IP-routed remote interfaces would be
ignored).

This commit introduces a more general scheme which attempts to make the
"best" pairing of local interfaces to remote interfaces.  We now cast
the problem as a graph theory problem known as the "Assignment Problem",
or finding a maximum-cardinality, minimum-weight bipartite matching.  We
solve this problem by reducing the bipartite graph of interface
connectivity to a flow network and then solving for a minimum cost flow.
This is then easily converted into back into a matching on the original
bipartite graph.

In the new scheme, interfaces on the same subnet are preferred over
interfaces requiring intermediate routing hops and higher bandwidth
links are preferred over lower bandwidth links.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30849.

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-26 07:50:26 +00:00
Dave Goodell
044a190cac usnic: consolidate orte includes into compat.h
These includes only exist in the Cisco-internal usnic-v1.6 code base,
but they should not exist anywhere except btl_usnic_compat.h in order to
minimize source differences between usnic-v1.6 and v1.7/trunk.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30844.

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-26 07:47:33 +00:00
Dave Goodell
f6036d11c8 usnic: fix sender hash comparisons for UDP
There was a duplicated subnet check in the sender hash lookup routine.
This caused receivers to always fail the sender hash lookup if the
sender was in a different subnet, so the receiver would discard the
packet as though it were coming from a different job.

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30841.

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-26 07:46:50 +00:00
Dave Goodell
a54f53f242 usnic: also match interfaces in different subnets
This functionality is required for routable UDP/IP usnic traffic.

Previously we would only setup endpoints for remote interfaces on the
same subnet as the current module's local interface.  This behavior
still holds if two processes share any common subnets.  However, if the
two processes only have no subnets in common then we assume that all
interfaces are reachable from all other interfaces and wire them up in a
1-1, randomly-matched order somewhat similarly to the "tcp" BTL's
behavior.

Only match in different subnets if we detect UDP support in the lower
layer.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30839.

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-26 07:44:49 +00:00
Dave Goodell
dbbe6a8254 usnic: fix proc structure memory leak
Valgrind showed this one, just a bit of sloppiness with the reference
counting.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30832.

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-26 07:39:34 +00:00
Dave Goodell
dd82bd3c19 usnic: fix invalid rfstart initialization
endpoint_rfstart was being initialized from a value which was not yet
set.  Also ensure that rfstart is a valid index in the range
0..WINDOW_SIZE-1, since it is used as the index into endpoint_rcvd_segs,
which has WINDOW_SIZE elements.

Without this change there is significant risk of memory corruption or
segfaults, resulting in hangs or crashes, if malloc ever returns us a
value >=WINDOW_SIZE (4096).  Right now we seem to be getting lucky that
the malloc is returning zero-pages to us when we are allocating endpoint
structures (possibly because the freelist performs a single large
allocation for all endpoints).

Fixes Cisco bug CSCui88781.

Reviewed-by: rfaucett@cisco.com
Reviewed-by: jsquyres@cisco.com

cmr=v1.7.3:reviewer=jsquyres

This commit was SVN r29075.
2013-08-27 22:43:20 +00:00
Ralph Castain
45e695928f As per the email discussion, revise the sparse handling of hostnames so that we avoid potential infinite loops while allowing large-scale users to improve their startup time:
* add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit.

* remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL"

* modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded

* removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base

* added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames

This commit was SVN r29052.
2013-08-20 18:59:36 +00:00
Ralph Castain
611d7f9f6b When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require.
This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times.

Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:

* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally

* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test

* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).

Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it

Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.

This commit was SVN r29040.
2013-08-17 00:49:18 +00:00
Jeff Squyres
87910daf51 Fix a collection of bugs found by QA and Coverity, and make some minor
improvements:

* Fix minor memory leaks during component_init
* Ensure that an initialization loop does not underflow an unsigned int
* Improve mlock limit checking
* Fix set of BTL modules created during component_init when failing to
  get QP resources or otherwise excluding some (but not all) usnic
  verbs devices
* Fix/improve error messages to be consistent with other Cisco
  documentation
* Randomize the initial sliding window sequence number so that we
  silently drop incoming frames from previous jobs that still have
  existant processes in the middle of dying (and are still
  transmitting) 
* Ensure we don't break out of add_procs too soon and create an
  asymetrical view of what interfaces are available

This commit was SVN r28975.
2013-08-01 16:56:15 +00:00
Jeff Squyres
4b6006402d Use the RTE framework instead of calling ORTE directly.
Brian (rightfully) hit me on the head with the
don't-use-ORTE-use-the-rte-framework clue bat; the usnic BTL now
nicely plays with the RTE framework.

This commit was SVN r28907.
2013-07-22 17:28:23 +00:00
Jeff Squyres
194b285447 First commit of the Cisco usNIC BTL.
This BTL accesses the Cisco usNIC Linux device via the Linux verbs
API via Unreliable Datagram queue pairs.  A few noteworthy points:

 * This BTL does most of its own fragmentation; it tells the PML that
   it has a very high max_send_size (much higher than the network
   MTU).
 * Since UD fragments are, by definition, unreliable, the usnic BTL
   handles all of its own reliability via a sliding window approach
   using the opal_hotel construct and many tricks stolen from the
   corpus of knowledge surrounding efficient TCP.
 * There is a fun PML latency-metric based optimization for NUMA
   awareness of short messages.
 * Note that this is ''not'' a generic UD verbs BTL; it is specific to
   the Cisco usNIC device.

This commit was SVN r28879.
2013-07-19 22:13:58 +00:00