Significantly increase the default retrans timeout. If the
retrans timeout is too soon, we can end up in a retransmission storm
where the logic will continually re-transmit the same frames during a
single run through the usNIC progress function (because the timer for
a single frame expires before we have run through re-transmitting all
the frames pending re-transmission).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
New MCA parameter: btl_usnic_ack_iteration_delay. Set this to the
number of times through the usNIC component progress function before
sending a standalone ACK (vs. piggy-backing the ACK on any other send
going to the target peer).
Use "ticks" language to clarify that we're really counting the number
of times through the usNIC component DATA_CHANNEL completion check (to
check for incoming messages) -- it has no relation to wall clock time
whatsoever.
Also slightly change the channel-checking scheme in usNIC component
progress: only check the PRIORITY channel once (vs. checking it once,
not finding anything, and then falling through the progress_2() where we
check PRIORITY again and then check the DATA channel).
As before, if our "progress" libevent fires, increment the tick
counter enough to guarantee that all endpoints that need an ACK will
get triggered to send standalone ACKs the next time through progress,
if necessary.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Rename "get_nsec()" to "get_ticks()" to more accurately reflect that
this function has no correlation to wall clock time at all.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Might as well save a few bytes when sending this struct across the
network via the __opal_attribute_packed__ attribute.
That being said, also re-order the elements in this struct so that
there's no holes to begin with. Do this so that the compiler/runtime
won't effect (slow) unaligned reads/writes because of the
__opal_attribute_packed__ attribute.
The "packed" attribute is really more about defensive programming
(e.g., if we make a mistake and have a hole, "packed" will remove it
for us).
*** Do not bring this commit back to existing/already-released release
branches: it will cause incompatibility, since it effectively changes
the usNIC BTL wire protocol.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes a crash that can occur if a transport
is usable but doesn't have zero-copy support. In this
case do not attempt to use zero-copy and set the max
send size off the bcopy limit.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
OpenUCX broke the UCT API again in v1.8. This commit updates
btl/uct to fix compilation with current OpenUCX master
(future v1.8). Further changes will likely be needed for
the final release.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
This commit changes how the single-copy emulation in the vader btl
operates. Before this change the BTL set its put and get limits
based on the max send size. After this change the limits are unset
and the put or get operation is fragmented internally.
References #6568
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Trying out to run processes via mpirun in Podman containers has shown
that the CMA btl_vader_single_copy_mechanism does not work when user
namespaces are involved.
Creating containers with Podman requires at least user namespaces to be
able to do unprivileged mounts in a container
Even if running the container with user namespace user ID mappings which
result in the same user ID on the inside and outside of all involved
containers, the check in the kernel to allow ptrace (and thus
process_vm_{read,write}v()), fails if the same IDs are not in the same
user namespace.
One workaround is to specify '--mca btl_vader_single_copy_mechanism none'
and this commit adds code to automatically skip CMA if user namespaces
are detected and fall back to MCA_BTL_VADER_EMUL.
Signed-off-by: Adrian Reber <areber@redhat.com>
Due to IF_NAMESIZE being a reused and conditionally defined macro,
issues could arise from macro mismatches. In particular, in cases where
opal/util/if.h is included, but net/if.h is not, IF_NAMESIZE will be 32.
If net/if.h is included on Linux systems, IF_NAMESIZE will be 16. This
can cause a mismatch when using the same macro on a system. Thus
different parts of the code can have differring ideas on the size of a
structure containing a char name[IF_NAMESIZE]. To avoid this error case,
we avoid reusing the IF_NAMESIZE macro and instead define our own as
OPAL_IF_NAMESIZE.
Signed-off-by: William Zhang <wilzhang@amazon.com>
After the OPAL_MODEX_RECV call, remote_addrs was not freed in the error
path. Moved the free call into cleanup to ensure we always free this
memory before leaving the function.
Signed-off-by: William Zhang <wilzhang@amazon.com>
Remove compatibility code for multiple versions of BTL_IN_OPAL,
BTL_VERSION, and RCACHE_VERSION. This stuff was really only necessary
when we were actively swapping code between multiple release branches
that had large variations in core OMPI infrastructure. These large
variations have now been around for quite a while, so the need for
this "compat" layer is significantly reduced. It hasn't been removed
simply because a few of the "compat" names a slightly more friendly
than the real names (e.g., the SEND/RECV/PUT names).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes an error in the 32-bit compare-and-swap atomic support
for Aries networks. The code was incorrectly using the non-fetching
version of cswap which was causing the routing to return
OPAL_ERR_BAD_ARG.
Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>
Avoid printing an error message about ENOTCONN return codes from
getpeername() when handling an incoming connection request. At
this point in the receive state machine, the remote process has
been verified to be a valid OMPI instance. In all-to-all startup
at 4k rank scale, we're seeing this error message when the remote
side drops the connection because it realizes it's the "loser"
in the connection race. We were already doing all the right things,
other than printing a scary error message. So skip the error
message and call it good.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
free the component mpool in mca_btl_vader_component_close()
and after freeing soem objects that depend on it such as
mca_btl_vader_component.vader_frags_user
Thanks Christoph Niethammer for reporting this.
Refs. open-mpi/ompi#6524
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Use $(AM_CPPFLAGS) in $(usnic_btl_run_tests_CPPFLAGS) so that we don't
have to replicate hard-coded values.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
do define the OMPI_LIBMPI_NAME macro via the CPPFLAGS.
The issue occurs when Open MPI is configured with
--enable-opal-btl-usnic-unit-tests
Thanks George Marselis for reporting this issue
Refs. open-mpi/ompi#6441
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Update the OPAL_CHECK_OFI configury macro:
- Make it safe to call the macro multiple times:
- The checks only execute the first time it is invoked
- Subsequent invocations, it just emits a friendly "checking..."
message so that configure output is sensible/logical
- With the goal of ultimately removing opal/mca/common/ofi, rename the
output variables from OPAL_CHECK_OFI to be
opal_ofi_{happy|CPPFLAGS|LDFLAGS|LIBS}.
- Update btl/ofi, btl/usnic, and mtl/ofi for these new conventions.
- Also, don't use AC_REQUIRE to invoke OPAL_CHECK_OFI because that
causes the macro to be invoked at a fairly random time, which makes
configure stdout confusing / hard to grok.
- Remove a little left-over kruft in OPAL_CHECK_OFI, too (which
resulted in an indenting change, making the change to
opal_check_ofi.m4 look larger than it really is).
Thanks Alastair McKinstry for the report and initial fix.
Thanks Rashika Kheria for the reminder.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
So long BTL openib! After many years of (mostly) faithful service, it
is time to remove the openib BTL. It has been fully replaced by other
components, such as the UCX PML and OFI MTL.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
It doesn't seem like the BTL was using uninitialized pointer. But simply
setting the rcache pointer to NULL after destroying it makes the valgrind
errors go away.
Fixes Issue #6345
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
This commit fixes a bug introduced in
f62d26ddbc. That commit changed how
vader allocates fragment memory from the shared memory
segment. Unfortunately, the values used for the fragment sizes did not
include space for the fragment header. This can cause an overrun of
data from one fragment to the header of the next fragment.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates btl/vader to use an mpool for handling all shared
memory allocations (frags, fboxes).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug where add_procs can incorrectly return an
error when going through the dynamic add_procs path. This doesn't
happen normally, only when pml/ob1 is not in use.
References #6201
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
remove whitespace around '=' when setting btl_uct_LIBS
Thanks Ake Sandgren for reporting this
Refs. open-mpi/ompi#6173
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Though not a recommended configuration it is possible to use Open MPI
over UCX over uGNI. This configuration had some issues related to the
connection management and tl selection. This commit fixes those
issues.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Now Open MPI requires a C99 compiler. Checking availability of
the following types is no more needed.
- `long long` (`signed` and `unsigned`)
- `long double`
- `float _Complex`
- `double _Complex`
- `long double _Complex`
Furthermore, the `#if HAVE_[TYPE]` style checking is not correct.
Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`.
`AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h`
if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]`
(instead of defining as `0`) if it is not available. So even if we
need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`.
I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac`
since someone may use `HAVE_[TYPE]` macros somewhere.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Under certain circumstances, ibv_exp_query_device was
returning an error due to uninitialized fields in the
extended attributes struct.
Fixes: #5810Fixes: #5914
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
In 457f058 I broke the TCP BTL with --enable-ipv6. This patch
fixes the compile error, so IPv6 works again.
Fixed#5996
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
AC_CHECK_DECLS take a comma separated list of macros/symbols,
so replace the whitespace separator with a comma.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code. Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure. This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.
Refs #5818
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Today, a btl tcp module is associated with exactly one IP
address (IPv4 or IPv6). There's no need to reserve space
for both an IPv4 and IPv6 address in the module structure,
since the module will only be associated with one or the
other.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads. The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.
This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT. It also
reduces the scalability of the TCP BTL by increasing start-up
time, but better than hanging.
The long term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race. THis patch is a work around until that patch can
be developed.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing an incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.
At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.
This commit also fixes some bugs in the active-message send path:
- Correctly set all fragment fields in prepare_src.
- Fix bug when using buffered-send. We were not reading the return
code correctly (which is in bytes). This resulted in a message
getting sent multiple times.
- Don't try to progress sends from the btl_send function when in an
active-message callback. It could lead to deep recursion and an
eventual crash if we get a trace like
send->progress->am_complete->ob1_callback->send->am_complete...
Closes#5820Closes#5821
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This reverts commit 6acebc40a1.
This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run. See
https://github.com/open-mpi/ompi/issues/5849 for more information.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.
Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.
Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the
priority list. This will give better out of the box performance for
multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.
This commit also fixes a number of leaks and a possible deadlock when
using RDMA.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error. However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined. Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:
_Atomic intptr x, y;
intptr_t z = 0;
x = y = z;
Will produce a compiler error of the form:
operand cannot have void type: op "="
assignment type mismatch:
long "=" void
To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.
Fixes#5814
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
On some platfoms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.
Add a more clear error message in debug mode.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.
1. On the endpoint, it is storing the moral equivalent of a
(struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).
The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr). Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).
This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module. This only happens with interfaces that have
more than one IP address. This commit does not fix that issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
In many cases, this was a simple string replace. In a few places, it
entailed:
1. Updating some comments and removing now-redundant foo[size-1]='\0'
statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
wants the entire destination buffer length).
This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Disable async receive for CUDA under OpenIB. While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit. This change has been validated
and approved by Akshay.
References #3972
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.
Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
To ensure fast box entries are complete when processed by the
receiving process the tag must be written last. This includes a zero
header for the next fast box entry (in some cases). This commit fixes
two instances where the tag was written too early. In one case, on
32-bit systems it is possible for the tag part of the header to be
written before the size. The second instance is an ordering issue. The
zero header was being written after the fastbox header.
Fixes#5375, #5638
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Since openib is on its long, slow way out the door, don't let it
complain about not being able to find any NICs at run time.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes a bug when using the UCT btl with the UCX memory
hooks disabled. We were misssing a call to
opal_mem_hooks_unregister_release to remove the btl memory hook
callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
bugfix: major: openib send credits returned correctly after a fault for pending frags to dead processes; also tweak the default IB retry timeouts tomake this happen faster
Make it compile in non-debug builds
Mark the IB endpoint as failed when invoking an error; this resolves UDCM connection deadlocks
Changing the default IB retry timeouts is not a good idea.
We'll need to find another way to speedup credit recovery in failure cases.
Remove ULFM specific cases
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
When an error is returned by the socket operations, trigger the
appropriate error path in the PML to give an opportunity for
rerouting/error handling.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
The write memory barrier was intended to precede setting a fast-box
header but instead follows it. This commit moves the memory barrier to
the intended location.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The 2 sided communication support is added for non-tagmatching provider
to take advantage of this BTL and PML OB1. The current state is
"functional" and not optimized for performance.
Two sided support is disabled by default and can be turned on by mca
parameter: "mca_btl_ofi_mode".
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
Due to decreasing support by vendors/other orgs for the OpenIB BTL,
only look for iWarp/RoCE devices by default. Allow IB HCAs
with ports configured for ethernet.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
OFI BTL uses context for completion but never ask for it in
fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT
to eliminate any potential error.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit fixes two bugs in the RMA/atomic emulation code:
1) Fix a fragment leak when using AMO emulation.
2) Always initialize the single-copy emulation code. This is required
to use the AMO support.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The descriptor flags field in a fragment were being ready after the
fragment may have been freed. This commit reads the flags before
calling the user callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for atomic operations as well as rdma for
systems without rdma support. This support is implemented using an
internal send tag.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The current cast is *functional*, but isn't really the way it should
be done. This commit makes the cast the way it should be done.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>