If the btl_usnic_connectivity_map MCA param is set to a non-NULL
value, then each MPI process will output a file named
<prefix>-<hostname>.pid<pid>.job<jobid>.mcwrank<MCW rank>.txt. Its
contents will detail which usNIC device(s) (and therefore which
link(s)) are being used to communicate with each peer MPI process.
Here is a sample output file (named
mpi005.pid26071.job1640759297.mcwrank0.txt):
{{{
device=usnic_0,interface=eth4,ip=10.10.0.5/16,mac=24:57:20:05:20:00,mtu=9000
device=usnic_1,interface=eth5,ip=10.2.0.5/16,mac=24:57:20:05:21:00,mtu=9000
device=usnic_2,interface=eth6,ip=10.3.0.5/16,mac=24:57:20:05:50:00,mtu=9000
peer=1,hostname=mpi006,device=usnic_0@peer_ip=10.10.0.6/16@peer_mac=24:57:20:06:20:00,device=usnic_1@peer_ip=10.2.0.6/16@peer_mac=24:57:20:06:21:00,device=usnic_2@peer_ip=10.3.0.6/16@peer_mac=24:57:20:06:50:00
peer=2,hostname=mpi007,device=usnic_0@peer_ip=10.10.0.7/16@peer_mac=24:57:20:07:20:00,device=usnic_1@peer_ip=10.2.0.7/16@peer_mac=24:57:20:07:21:00,device=usnic_2@peer_ip=10.3.0.7/16@peer_mac=24:57:20:07:50:00
peer=3,hostname=mpi008,device=usnic_0@peer_ip=10.10.0.8/16@peer_mac=24:57:20:08:20:00,device=usnic_1@peer_ip=10.2.0.8/16@peer_mac=24:57:20:08:21:00,device=usnic_2@peer_ip=10.3.0.8/16@peer_mac=24:57:20:08:50:00
}}}
Reviewed by Reese Faucette
cmr=v1.8.2
This commit was SVN r32156.
When ibv_create_ah() fails due to an address resolution failure, it
really only means that we can't reach that one peer -- so we should
just ignore that one peer. If ibv_create_ah() fails for some other
reason, then give up on the entire usnic_X device.
Change the show_help() message that is displayed when ibv_create_ah()
fails due to address resolution failure; indicate that it's likely a
routing problem. Also opal_output_verbose() the same info, since
show_help() is de-duplicated (and this particular show_help() message
can be squelched).
Fixes Cisco bugs CSCup35851 and CSCup35872.
cmr=v1.8.2:ticket=trac:4734
This commit was SVN r32061.
The following Trac tickets were found above:
Ticket 4734 --> https://svn.open-mpi.org/trac/ompi/ticket/4734
We don't use this functionality any more; we use the transport_type
and device name to identify usnic devices. It's slightly easier
because we can transport_type+name from ibv_device_open() and don't
have to do an additional ibv_query_device() to get its attributes.
Reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30882.
Follow on to SVN trunk r30850: consolidate the ibv_create_ah() calls
into a single loop, MPI_WAITALL-style. That is, call the (effectively
non-blocking) ibv_create_ah() for each endpoint. If we get
NULL+EAGAIN, it means that the UDP ARP is still ongoing down in the
kernel, so just try again later. We put these all into a single loop
because it allows us to parallelize the ARP progress in the kernel.
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30879.
The following SVN revision numbers were found above:
r30850 --> open-mpi/ompi@3641500442
r30852 --> open-mpi/ompi@4e282a3295
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Basically: since usnic is a connectionless transport, we do not get
OS-provided services "for free" that connection-oriented transports
get, namely: "hey, I wasn't able to make a connection to peer X", and
"hey, your connection to peer X has died."
This connectivity-checker runs in a separate progress thread in the
usnic BTL in local rank 0 on each server. Upon first send in any
process, the connectivty-checker agent will send some UDP pings to the
peer to ensure that we can reach it. If we can't, we'll abort the job
with a nice show_help message.
There's a lengthy comment in btl_usnic_connectivity.h explains the
scheme and how it works.
Reviewed by Dave Goodell.
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30860.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Prior to this commit we matched local interfaces to remote interfaces in
order to create endpoints in a simplistic way. If any remote interfaces
were on the same subnet as any of our local interfaces then only local
interfaces would be paired (IP-routed remote interfaces would be
ignored).
This commit introduces a more general scheme which attempts to make the
"best" pairing of local interfaces to remote interfaces. We now cast
the problem as a graph theory problem known as the "Assignment Problem",
or finding a maximum-cardinality, minimum-weight bipartite matching. We
solve this problem by reducing the bipartite graph of interface
connectivity to a flow network and then solving for a minimum cost flow.
This is then easily converted into back into a matching on the original
bipartite graph.
In the new scheme, interfaces on the same subnet are preferred over
interfaces requiring intermediate routing hops and higher bandwidth
links are preferred over lower bandwidth links.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30849.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
This commit decouples OMPI deployment from the version(s) of the lower
layers of the stack by probing for UDP support.
Verbs applications assume a 40-byte header (there is no current
mechanism for querying payload offset). So to support a 42-byte UDP
header without causing existing applications like ibv_ud_pingpong or
older versions of OMPI to crash, we must inform libusnic_verbs that we
are aware of the nonstandard payload offset. We do this by overriding
the `transport_type` field of the device to be 42 before calling
`ibv_open_device`. If the library resets it to something else, then we
know the lower layers are UDP capable. Otherwise we use the older
custom-L2 format.
This necessitated some minor ugliness in common_verbs, but it's as tidy
as Jeff and I know how to make it right now.
This commit only adds support for UDP headers and connectivity over the
same L2 network, it does not touch routing or interface pairing.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30838.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Just trying to be deliberate about keeping fastpath-accessed fields
grouped together to fit into the same 64-byte cache line.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30837.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Authored-by: Reese Faucette <rfaucett@cisco.com>
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30834.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
1. Changed rng_buff_t --> opal_rng_buff_t
2. All global variables obey the prefix rule
3. Old code has been removed
4. Found a couple of unnecessary includes
Refs trac:4298
This commit was SVN r30807.
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
* Use the prefix rule for global variables
* Elimiante seed_prng() since it isn't necessary any more
These files will need to get edited again then the RNG type obeys the
prefix rule.
Refs trac:4298
This commit was SVN r30803.
The following SVN revision numbers were found above:
r30801 --> open-mpi/ompi@e39d9f4080
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
Some older versions of libibverbs do not have `ibv_event_type_str`,
leading to compilation failures on older machines, irrespective of
whether they could ever support usNIC anyway. If we encounter any other
build issues related to "old verbs" then we should just cause the usnic
BTL to disqualify itself when it encounters "old" traits.
Thanks to Paul Hargrove for reporting the issue:
http://www.open-mpi.org/community/lists/devel/2014/02/14056.php
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30674.
1. Fix ompi_info memory leak in usnic BTL: do not allocate memory in
the component register function, because ompi_info only calls the
component register function and then dlclose's the component -- it
does not call component finalize. Instead, defer parsing the MCA
param (and alloc'ing memory) until the component init function so
that any allocated memory can be freed in the component close
function.
1. Also add a new check to ensure that we actually have some part
numbers to check. Add a show_help message if we don't find any
vendor part IDs to check.
1. Add a verbose output if usnic disqualifies itself from selection
because THREAD_MULTIPLE was specified.
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r30073.
MOFED apparently has a /usr/include/infiniband/verbs.h that also
defines a (slightly different but fully compatible) container_of
macro. So put proper #ifndef protection around our definition of
container_of.
Thanks to Rolf vandeVaart for pointing out the issue.
Reviewed by Dave Goodell.
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29799.
If we need to use a convertor, go back to stashing that convertor in the
frag and populating segments "on the fly" (in
ompi_btl_usnic_module_progress_sends). Previously we would pack into a
chain of chunk segments at prepare_src time, unnecessarily consuming
additional memory.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29592.
This includes suppressing picky-mode warnings about __VA_ARGS__, which
we know are supported by any compilers we care about.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29590.
This new routine can be called in exceptional situations, either
conditionally in BTL code or from a debugger, to help with debugging in
cases where MSGDEBUG1/2 or stats logging are impractical but more detail
is needed.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
This commit was SVN r29483.
MSGDEBUG2 now means "print a one-liner for all PML calls into BTL, and
also when BTL calls PML with a recv completion (not send completions)"
MSGDEBUG1 means print more internal gory detail
MSGDEBUG is gone, replaced by MSGDEBUG1
In the process also found that PUT_DEST style fragments could
potentially be leaked in usnic_free() since send_fragment tests were
being applied to see if it was eligible to be freed.
This commit was SVN r29185.
The Cisco-maintained v1.6 port of the usnic BTL has diverged from the
upstream trunk and v1.7 branches. This commit adjusts the trunk to more
closely match the v1.6 branch to simplify future merging and
cherry-picking.
The usnic MCA parameters also need work on this side.
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)
This commit was SVN r29138.
The following Trac tickets were found above:
Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
The usnic BTL now builds cleanly under `--enable-picky` when `MSGDEBUG1`
is set.
Reviewed-by: jsquyres
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29097.
improvements:
* Fix minor memory leaks during component_init
* Ensure that an initialization loop does not underflow an unsigned int
* Improve mlock limit checking
* Fix set of BTL modules created during component_init when failing to
get QP resources or otherwise excluding some (but not all) usnic
verbs devices
* Fix/improve error messages to be consistent with other Cisco
documentation
* Randomize the initial sliding window sequence number so that we
silently drop incoming frames from previous jobs that still have
existant processes in the middle of dying (and are still
transmitting)
* Ensure we don't break out of add_procs too soon and create an
asymetrical view of what interfaces are available
This commit was SVN r28975.
Brian (rightfully) hit me on the head with the
don't-use-ORTE-use-the-rte-framework clue bat; the usnic BTL now
nicely plays with the RTE framework.
This commit was SVN r28907.
This BTL accesses the Cisco usNIC Linux device via the Linux verbs
API via Unreliable Datagram queue pairs. A few noteworthy points:
* This BTL does most of its own fragmentation; it tells the PML that
it has a very high max_send_size (much higher than the network
MTU).
* Since UD fragments are, by definition, unreliable, the usnic BTL
handles all of its own reliability via a sliding window approach
using the opal_hotel construct and many tricks stolen from the
corpus of knowledge surrounding efficient TCP.
* There is a fun PML latency-metric based optimization for NUMA
awareness of short messages.
* Note that this is ''not'' a generic UD verbs BTL; it is specific to
the Cisco usNIC device.
This commit was SVN r28879.