There's no longer any need for the usnic BTL to have its own progress
thread: it can use the opal_progress_thread() infrastructure. This
commit removes the code to startup/shutdown the usnic-BTL-specific
progress thread and instead, just adds its events to the OPAL-wide
progress thread.
This necessitated a small change in the finalization step.
Previously, we would stop the progress thread and then tear down the
events. We can no longer stop the progress thread, and if we start
tearing down events, this will cause shutdown/hangups to be sent
across sockets, potentially firing some of the still-remaining events
while some (but not all) of the data structures have been torn down.
Chaos ensues.
Instead, queue up an event to tear down all the pending events. Since
the progress thread will only fire one event at a time, having a
teardown event means that it can tear down all the pending events
"atomically" and not have to worry that one of those events will get
fired in the middle of the teardown process.
In libfabric v1.0.0 (i.e., API v1.0), the usnic provider handled
FI_MSG_PREFIX inconsistently between sends and receives. This has
been fixed in libfabric v1.1.0 (i.e., API v1.1): FI_MSG_PREFIX is
handled consistently for both sends and receives.
Run-time detect which libfabric we are running with and adapt behavior
appropriately.
Followup to open-mpi/ompi@65b66ab: if we're not debugging, then #if
out an entire block so that the compiler doesn't warn about variables
that are assigned and not used.
When using an external libfabric (or really any libfabric newer than
libfabric commit 607e863), we must use fi_getname to determine the local
port of our endpoint. Without this fix, OMPI will hang endlessly
while retransmitting packets to port 0 on the remote host.
fi_av_insert() is invoked with a context containing each endpoint
USNIC_NUM_CHANNELS times. If the address on that endpoint fails to
resolve / is unreachable / has some error, we'll therefore get
USNIC_NUM_CHANNELS error completions with that same endpoint. We
therefore only want to warn about the unreachability of (and
OBJ_RELEASE) that endpoint the *first* time.
Fixes CSCut46822.
If we really get a catastrophic error from a libfabric call, don't
bother trying to continue (because data has been corrupted and there's
nothing sane left to do). Just call opal_btl_usnic_exit() (which
tries to call the PML error callback, but we're so early in the
module_init process that this likely hasn't been setup yet, so the job
will likely abort).
Nothing too substantial here, but two of the messages moved from
"libfabric API failed" to "internal error during init", just to be a
bit more descriptive.
When we get errors, the entry.data field tells us how many errors are
being reported. So decrement the loop count variable by that much.
This fixes CSCut30441.
Also include two other minor changes:
1. More C99-style member initialization in the component struct
1. Fix the BTL module member initialization to not be redundant
Add the functions that changed between BTL 2.0 and 3.0 into compat.h
and compat.c:
* module.btl_prepare_src: the signature and body of this method
changed between 2.0 and 3.0. However, the functions that this
method calls did *not* need to change, so they are copied over
wholesale (with the exception that they no longer accept the unused
`registration` parameter).
* module.btl_prepare_dst: this method does not exist in BTL 3.0.
* module.btl_put: the signature and body of this method changed
between 2.0 and 3.0.
If there are not enough resources (e.g., low VFs), we can end up
calling finalize_one_channel() on the same channel multiple times. So
ensure to NULL out fields that we have freed already so that we do not
try to free them a second time.
Fixes CSCus26648.
We had several problems in the old code:
1. We were specifying an arbitrary timeout (100 ms) and then abandoning
all remaining pending AV insert operations. We would then free the
endpoint buffer that we gave to fi_av_insert(), usually causing
libfabric's progress thread to write to a freed buffer.
2. We were claiming in a show_help message that the timeout was
controllable via an MCA parameter. This commit removes that
parameter, since there's no good method for us to specify a timeout
like this to libfabric right now.
3. We also weren't waiting for the correct number of fi_av_insert()
operations to complete. We were waiting for nprocs, which is
accidentally fine for 2 procs on separate hosts, but not for most
other proc counts.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
This commit represents the conversion of the usnic BTL from verbs to
libfabric.
For the moment, libfabric is embedded in Open MPI (currently in the
usnic BTL). This is because the libfabric API is still changing, and
also has not yet been released. Ultimately, this embedded copy of
libfabric will likely disappear and the usnic BTL will rely on an
external installation of libfabric.
New configure options:
* --with-libfabric: will cause configure to fail if libfabric support
cannot be built
* --without-libfabric: will prevent libfabric support from being built
* --with-libfabric=DIR: use an external libfabric installation
* --with-libfabric-libdir=LIBDIR: when paired with --with-libfabric=DIR,
use LIBDIR for the libfabric installation library dir
The --with-libnl3[-libdir] arguments are now gone.
structure
This structure member was originally used to specify the remote segment
for an RDMA operation. Since the new btl interface no longer uses
desriptors for RDMA this member no longer has a purpose. In addition
to removing these members the local segment information has been
renamed to des_segments/des_segment_count.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
Per #4874, code review revealed a possible race condition in the
module struct and the connectivity agent. Move the setup of the
connectivity agent listener until the module struct has been fully
setup.
This commit was SVN r32573.
Instead of waiting to destroy the connectivity agent during component
shutdown, have the module shutdown send an "unlisten" command to the
cagent that will tell it to stop listening on a given interface.
This commit was SVN r32536.
Make the del_procs, module finalize, and endpoint destructors be the
same between trunk and v1.8, with one exception: the very beginning of
v1.8 module_finalize calls del_procs for each proc to simulate/pretend
the trunk/v1.9 PML behavior of calling del_procs before module_finalize.
This commit was SVN r32437.
Fixes an assertion failure in --enable-debug builds and SEGVs in normal
builds.
I'm not 100% sure I like this model, but it at least seems to be
consistent. Some variation on this scheme will need to be adapted to
the trunk, where usnic_del_procs() is called by the PML instead of
internally in usnic_finalize().
A related bug (but with different mechanics) is #4832.
This commit was SVN r32424.
Previously, the connectivity agent was pretty dumb: it took whatever
pings it got and ACKed them. Then we added an agent check to ensured
that the ping actually came from the source interface that it said it
came from. Now we add another check such that when a ping is received
on interface X that corresponds to usnic module Y, we ensure that the
source interface of the ping is on the all_endpoints list for module Y
(i.e., module Y expects to be able to talk to that peer interface).
This detects cases where peers have come to different conclusions
about which interfaces should be used to communicate (which is bad!).
This usually reflects a network misconfiguration.
Fixes CSCuq05389.
This commit was SVN r32383.
Update compat.h to only handle compatability between v1.7/v1.8 and
v1.9/2.0 (i.e., the current trunk). Remove what seems to be the last
vestiages of OMPI/ORTE pollution in the now-OPAL-ized usnic BTL.
Currently use a hard-coded constant for the MCW size (i.e.,
MPI_COMM_WORLD size) for some initialization values in the v1.9/2.0
series; still need to figure out something better there.
This commit was SVN r32365.
Adapt to moving down to OPAL: find a PML-registered error callback,
and use that when we don't have a module context and we need to abort.
Failing that, just call exit().
This commit was SVN r32362.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.