mca_btl_base_segment_t and replace them with des_local and des_remote
This change also updates the BTL version to 3.0.0. This commit does
not represent the final version of BTL 3.0.0. More changes are coming.
In making this change I updated all of the BTLs as well as BTL user's
to use the new structure members. Please evaluate your component to
ensure the changes are correct.
RFC text:
This is the first of several BTL interface changes I am proposing for
the 1.9/2.0 release series.
What: Change naming of btl descriptor members. I propose we change
des_src and des_dst (and their associated counts) to be des_local and
des_remote. For receive callbacks the des_local member will be used to
communicate the segment information to the callback. The proposed change
will include updating all of the doxygen in btl.h as well as updating
all BTLs and BTL users to use the new naming scheme.
Why: My btl usage makes use of both put and get operations on the same
descriptor. With the current naming scheme I need to ensure that there
is consistency beteen the segments described in des_src and des_dst
depending on whether a put or get operation is executed. Additionally,
the current naming prevents BTLs that do not require prepare/RMA matched
operations (do not set MCA_BTL_FLAGS_RDMA_MATCHED) from executing
multiple simultaneous put AND get operations. At the moment the
descriptor can only be used with one or the other. The naming change
makes it easier for BTL users to setup/modify descriptors for RMA
operations as the local segment and remote segment are always in the
same member field. The only issue I forsee with this change is that it
will require a little more work to move BTL fixes to the 1.8 release
series.
This commit was SVN r32196.
Move away from verbs-specific terms "device" and "port" in the usnic
BTL help messages. Replace them with "usNIC interface" (since usNIC
has no concept of a port).
cmr=v1.8.2:ticket=trac:4734
This commit was SVN r32029.
The following Trac tickets were found above:
Ticket 4734 --> https://svn.open-mpi.org/trac/ompi/ticket/4734
Add the component use_udp value into the modex. If my component's
use_udp value doesn't agree with the use_udp value from a peer's modex
data, print a helpful message and disqualify the usnic BTL (the usnic
BTL will not be used). This prevents accidental customer
misconfigurations.
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31689.
If they exist, call the usnic verbs extensions to both enable UDP
support and get the UD receiver header length that should be used
(rather than assume 40/struct GRH).
This commit was SVN r30912.
We don't use this functionality any more; we use the transport_type
and device name to identify usnic devices. It's slightly easier
because we can transport_type+name from ibv_device_open() and don't
have to do an additional ibv_query_device() to get its attributes.
Reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30882.
Basically: since usnic is a connectionless transport, we do not get
OS-provided services "for free" that connection-oriented transports
get, namely: "hey, I wasn't able to make a connection to peer X", and
"hey, your connection to peer X has died."
This connectivity-checker runs in a separate progress thread in the
usnic BTL in local rank 0 on each server. Upon first send in any
process, the connectivty-checker agent will send some UDP pings to the
peer to ensure that we can reach it. If we can't, we'll abort the job
with a nice show_help message.
There's a lengthy comment in btl_usnic_connectivity.h explains the
scheme and how it works.
Reviewed by Dave Goodell.
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30860.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Prior to this commit we matched local interfaces to remote interfaces in
order to create endpoints in a simplistic way. If any remote interfaces
were on the same subnet as any of our local interfaces then only local
interfaces would be paired (IP-routed remote interfaces would be
ignored).
This commit introduces a more general scheme which attempts to make the
"best" pairing of local interfaces to remote interfaces. We now cast
the problem as a graph theory problem known as the "Assignment Problem",
or finding a maximum-cardinality, minimum-weight bipartite matching. We
solve this problem by reducing the bipartite graph of interface
connectivity to a flow network and then solving for a minimum cost flow.
This is then easily converted into back into a matching on the original
bipartite graph.
In the new scheme, interfaces on the same subnet are preferred over
interfaces requiring intermediate routing hops and higher bandwidth
links are preferred over lower bandwidth links.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30849.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30846.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
This commit adds mechanisms for writing and running unit tests in the
usnic BTL. The short version of how to run the tests is:
1. Configure with `--enable-ompi-btl-usnic-unit-tests`. This will cause
the unit testing code and test runner utility to be built.
2. Run the tests by running `ompi_btl_usnic_run_tests`.
See `README.test` for full details.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30845.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
These includes only exist in the Cisco-internal usnic-v1.6 code base,
but they should not exist anywhere except btl_usnic_compat.h in order to
minimize source differences between usnic-v1.6 and v1.7/trunk.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30844.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Lower layer (hardware or software) bugs can result in a mismatch between
our BTL-layer payload size and the actual packet length. We now check
that in order to catch these cases, which otherwise can result in
MPI-layer message corruption.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30843.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
There was a duplicated subnet check in the sender hash lookup routine.
This caused receivers to always fail the sender hash lookup if the
sender was in a different subnet, so the receiver would discard the
packet as though it were coming from a different job.
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30841.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
This commit decouples OMPI deployment from the version(s) of the lower
layers of the stack by probing for UDP support.
Verbs applications assume a 40-byte header (there is no current
mechanism for querying payload offset). So to support a 42-byte UDP
header without causing existing applications like ibv_ud_pingpong or
older versions of OMPI to crash, we must inform libusnic_verbs that we
are aware of the nonstandard payload offset. We do this by overriding
the `transport_type` field of the device to be 42 before calling
`ibv_open_device`. If the library resets it to something else, then we
know the lower layers are UDP capable. Otherwise we use the older
custom-L2 format.
This necessitated some minor ugliness in common_verbs, but it's as tidy
as Jeff and I know how to make it right now.
This commit only adds support for UDP headers and connectivity over the
same L2 network, it does not touch routing or interface pairing.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30838.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Authored-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30836.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Authored-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30835.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
1. Changed rng_buff_t --> opal_rng_buff_t
2. All global variables obey the prefix rule
3. Old code has been removed
4. Found a couple of unnecessary includes
Refs trac:4298
This commit was SVN r30807.
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
We're going to be bringing a bunch of usnic code to the SVN trunk
soon, and I basically brought this commit over out of order. So I'm
reverting it for now; the same functionality will come back shortly.
This commit was SVN r30805.
The following SVN revision numbers were found above:
r30804 --> open-mpi/ompi@5bedcc15bf
These constants are now upstream (see
https://git.kernel.org/cgit/libs/infiniband/libibverbs.git/commit/?id=f57a9c67eabb9e7f19c624ac3c8c27b7be55796c),
so let's support them properly in Open MPI.
Added bonus: consolidating these checks up in
ompi_check_openfabrics.m4 allowed removing some custom checks and
AC_DEFINE's from the usnic configure.m4 script.
Also change the usnic/configure.m4 check for IBV_EVENT_GID_CHANGE to
use AC_CHECK_DECLS (vs. AC_CHECK_DECL).
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r30804.
* Use the prefix rule for global variables
* Elimiante seed_prng() since it isn't necessary any more
These files will need to get edited again then the RNG type obeys the
prefix rule.
Refs trac:4298
This commit was SVN r30803.
The following SVN revision numbers were found above:
r30801 --> open-mpi/ompi@e39d9f4080
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
Avoid compiler warning about (unnecessarily) initializing 2 variables
during instantiation at the top of a switch block (but outside of any
case statements): just declare the variables at the top of the outter
block. They're already safely initialized, so don't worry about
initializing them in the instantiation.
Reviewed by Dave Goodell.
cmr=v1.7.4:reviewer=ompi-rm1.7:subject=Don't instantiate+init variables in a switch block
This commit was SVN r30228.
Complements r30073: tighten up the string parsing of the vendor parts
ID MCA param a bit. Also fix a small memory leak: ensure to free the
array uint32_t's parsed out of the MCA param.
This commit was SVN r30128.
The following SVN revision numbers were found above:
r30073 --> open-mpi/ompi@6003702a51
The following Trac tickets were found above:
Ticket 4301 --> https://svn.open-mpi.org/trac/ompi/ticket/4301
1. Fix ompi_info memory leak in usnic BTL: do not allocate memory in
the component register function, because ompi_info only calls the
component register function and then dlclose's the component -- it
does not call component finalize. Instead, defer parsing the MCA
param (and alloc'ing memory) until the component init function so
that any allocated memory can be freed in the component close
function.
1. Also add a new check to ensure that we actually have some part
numbers to check. Add a show_help message if we don't find any
vendor part IDs to check.
1. Add a verbose output if usnic disqualifies itself from selection
because THREAD_MULTIPLE was specified.
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r30073.
This commit moves all the module stats into their own struct so that
the stats only need to appear as a single line in the module_t
definition, and then moves all the logic for reporting the stats into
btl_usnic_stats.c|h.
Further, the stats are now exported as MPI_T_BIND_NO_OBJECT entities
(i.e., not bound to any particular MPI handle), and are marked as
READONLY and CONTINUOUS. They currently all default to verbose level
5 ("Application tuner / detailed", according to
https://svn.open-mpi.org/trac/ompi/wiki/MCAParamLevels).
Most of the statistics are counters, but a small number are high
watermark values. Due to how counters are reported via MPI_T, none of
the counters are exported through MPI_T if the MCA param
btl_usnic_stats_relative=1 (i.e., the module resets the stats back to
zero at a given frequency).
When MPI_T_pvar_handle_alloc() is invoked on any of these pvars, it
will return a count that is equal to the number of active usnic BTL
modules. The values returned for any given pvar (e.g.,
num_total_sends) are an array containing one value for each active
usnic BTL module. The ordering of values in the array is both
consistent across all usnic pvars and stable throughout a single job:
array slot 0 corresponds to module X, array slot 1 corresponds to
module Y, etc.
Mapping which array slot corresponds to which underlying Linux usnic_X
device works as follows:
* The btl_usnic_devices MPI_T state pvar is associated with a
btl_usnic_device MPI_T enum, and be obtained via
MPI_T_pvar_get_info().
* If all usNIC pvars are of length N, the values [0,N) in the
btl_usnic_device enum are associated with strings of the
corresponding underlying Linux device.
For exampe, to look up which Linux device is reported in all usNIC
pvars' array slot 1, look up the int value 1 in the btl_usnic_devices
enum. Its corresponding string value is underlying Linux device name
(e.g., "usnic_1").
cmr=v1.7.4:subject="usnic BTL MPI_T pvars"
This commit was SVN r29545.
This new routine can be called in exceptional situations, either
conditionally in BTL code or from a debugger, to help with debugging in
cases where MSGDEBUG1/2 or stats logging are impractical but more detail
is needed.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
This commit was SVN r29483.
The Cisco-maintained v1.6 port of the usnic BTL has diverged from the
upstream trunk and v1.7 branches. This commit adjusts the trunk to more
closely match the v1.6 branch to simplify future merging and
cherry-picking.
The usnic MCA parameters also need work on this side.
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)
This commit was SVN r29138.
The following Trac tickets were found above:
Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
Authored-by: Reese Faucette <rfaucett@cisco.com>
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)
This commit was SVN r29136.
The following Trac tickets were found above:
Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)
This commit was SVN r29135.
The following Trac tickets were found above:
Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)
This commit was SVN r29134.
The following Trac tickets were found above:
Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
- round segment buffer allocation to cache-line
- split some routines into an inline fast section and a called
slower section
- introduce receive fastpath in component_progress that:
o returns immediately if there is a packet available on priority
queue and fastpath is enabled
o disables fastpath for 1 time after use to provide fairness to
other processing
o defers receive buffer posting
o defers bookeeping for receive until next call
to usnic_component_progress
Authored-by: Reese Faucette <rfaucett@cisco.com>
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)
This commit was SVN r29133.
The following Trac tickets were found above:
Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
improvements:
* Fix minor memory leaks during component_init
* Ensure that an initialization loop does not underflow an unsigned int
* Improve mlock limit checking
* Fix set of BTL modules created during component_init when failing to
get QP resources or otherwise excluding some (but not all) usnic
verbs devices
* Fix/improve error messages to be consistent with other Cisco
documentation
* Randomize the initial sliding window sequence number so that we
silently drop incoming frames from previous jobs that still have
existant processes in the middle of dying (and are still
transmitting)
* Ensure we don't break out of add_procs too soon and create an
asymetrical view of what interfaces are available
This commit was SVN r28975.
Use the new sysfs files to check that there are enough VFs, QPs, and
CQs for all the MPI processes on this server.
Move the checking code into its own subroutine to make it smaller and
easier to read/grok.
This commit was SVN r28937.
Brian (rightfully) hit me on the head with the
don't-use-ORTE-use-the-rte-framework clue bat; the usnic BTL now
nicely plays with the RTE framework.
This commit was SVN r28907.
This BTL accesses the Cisco usNIC Linux device via the Linux verbs
API via Unreliable Datagram queue pairs. A few noteworthy points:
* This BTL does most of its own fragmentation; it tells the PML that
it has a very high max_send_size (much higher than the network
MTU).
* Since UD fragments are, by definition, unreliable, the usnic BTL
handles all of its own reliability via a sliding window approach
using the opal_hotel construct and many tricks stolen from the
corpus of knowledge surrounding efficient TCP.
* There is a fun PML latency-metric based optimization for NUMA
awareness of short messages.
* Note that this is ''not'' a generic UD verbs BTL; it is specific to
the Cisco usNIC device.
This commit was SVN r28879.