This commit fixes several bugs identified by @ggouaillardet and MTT:
- Fix SEGV in long send completion caused by missing update to the
request callback data.
- Add an MPI_Barrier to the fence short-cut. This fixes potential
semantic issues where messages may be received before the fence is
reached.
- Ensure fragments are flushed when using request-based RMA. This
allows MPI_Test/MPI_Wait/etc to work as expected.
- Restore the tag space back to 16-bits. It was intended that the
space be expanded to 32-bits but the required change to the
fragment headers was not committed. The tag space may be expanded
in a later commit.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Commit 0715802f52 missed that there is a call
to a common/verbs_usnic symbol in the common/verbs component. This
call needs to be compiled out when the common/verbs_usnic component is
not built.
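A minimal sketch of the kind of guard involved; both the macro and the
helper name below are hypothetical stand-ins for whatever configure and
the common/verbs_usnic component actually define. The only point is
that the call compiles away when the component is not built:

    #if defined(OPAL_COMMON_VERBS_USNIC_BUILT) && OPAL_COMMON_VERBS_USNIC_BUILT
    void opal_common_verbs_usnic_fake_driver_hook (void);   /* hypothetical */
    #define USNIC_FAKE_DRIVER_HOOK() opal_common_verbs_usnic_fake_driver_hook ()
    #else
    #define USNIC_FAKE_DRIVER_HOOK() do { } while (0)
    #endif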
This commit fixes several bugs identified by a new multi-threaded RMA
benchmarking suite. The following bugs have been identified and fixed:
- The code that signaled the actual start of an access epoch changed
the eager_send_active flag on a synchronization object without
holding the object's lock. This could cause another thread waiting
on eager sends to block indefinitely because the entirety of
ompi_osc_pt2pt_sync_expected could execute between the check of
eager_send_active and the condition wait of
ompi_osc_pt2pt_sync_wait. (See the sketch after this list.)
- The bookkeeping of fragments could become inconsistent when
performing long put/accumulate operations from different threads. This
was caused by the fragment flush code at the end of both put and
accumulate. This code was put in place to avoid sending a large
number of unexpected messages to a peer. To fix the bookkeeping
issue we now 1) wait for eager sends to be active before starting
any large isends, and 2) keep track of the number of large isends
associated with a fragment. If the number of large isends reaches
32 the active fragment is flushed.
- Use atomics to update the large receive/send tag counters. This
prevents duplicate tags from being used. The counters have also been
updated to use the entire 16-bit tag space.
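A minimal sketch of the locking pattern behind the first fix, using
plain pthreads and illustrative names instead of the actual osc/pt2pt
types and opal wrappers:

    #include <pthread.h>
    #include <stdbool.h>

    /* Illustrative stand-in for the pt2pt synchronization object. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            eager_send_active;
    } sync_t;

    static void signal_epoch_start (sync_t *sync)
    {
        pthread_mutex_lock (&sync->lock);
        /* Flip the flag and wake waiters while the lock is held so a
         * waiter cannot observe the old value and then sleep after the
         * broadcast has already happened. */
        sync->eager_send_active = true;
        pthread_cond_broadcast (&sync->cond);
        pthread_mutex_unlock (&sync->lock);
    }

    static void wait_for_eager_sends (sync_t *sync)
    {
        pthread_mutex_lock (&sync->lock);
        while (!sync->eager_send_active) {
            pthread_cond_wait (&sync->cond, &sync->lock);
        }
        pthread_mutex_unlock (&sync->lock);
    }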
These changes should also fix open-mpi/ompi#1299.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This component is a workaround to a bug in libibverbs that prints a
dire warning that usNIC devices are not supported (of course not --
usNIC devices provide functionality through libfabric, not
libibverbs). This component was written before a better workaround
was created: a "no op" libibverbs plugin for usNIC devices
(https://github.com/cisco/libusnic_verbs, which is also available in
binary form on cisco.com).
Hence, this component no longer builds by default. It's still
available if a user specifically asks for it (e.g., if they do not
want to install the "no op" libibverbs plugin), but it's not the
default. This component also has the side-effect of making
libopen-pal.so depend on libibverbs.so, which can be annoying for
packagers (which is another reason it isn't built by default any
more).
The send code in the ugni btl has an optimization that enables it to
return 1 (fragment gone) in some cases. This optimization involved
removing the btl ownership and callback flags to ensure the fragment
stuck around long enough for its completion flag to be checked. This
works fine for the single-threaded case but not in the multi-threaded
case. It is possible that a fragment will be completed by another
thread while a thread is in mca_btl_ugni_send. This race can
lead to a leaked fragment, a missed callback, or both. To fix the issue
without removing the optimization a reference count has been added to
the fragment. Callbacks and fragment release are not performed until
the fragment reference count has reached 0. The count is incremented
before sending the frag and decremented after the completion flag has
been checked. The fix has been verified to work using a multi-threaded
RMA benchmark with the osc/pt2pt component.
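A rough sketch of the reference-count scheme, using C11 atomics as a
stand-in for the opal atomics the btl actually uses (all names below
are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct frag_t {
        atomic_int  ref_count;                  /* starts at 1 (completion ref)  */
        atomic_bool send_complete;              /* set by the completion handler */
        void      (*callback)(struct frag_t *);
    } frag_t;

    /* Whoever drops the count to zero runs the callback and releases the frag. */
    static void frag_unref (frag_t *frag)
    {
        if (1 == atomic_fetch_sub (&frag->ref_count, 1)) {
            if (NULL != frag->callback) {
                frag->callback (frag);
            }
            /* ... return the fragment to its free list here ... */
        }
    }

    /* Completion handler, possibly run by another thread. */
    static void frag_send_complete (frag_t *frag)
    {
        frag->send_complete = true;
        frag_unref (frag);                      /* drop the completion reference */
    }

    static int frag_send (frag_t *frag)
    {
        atomic_fetch_add (&frag->ref_count, 1); /* taken before posting the send */
        /* ... post the uGNI send here ... */
        bool done = frag->send_complete;        /* check the completion flag     */
        frag_unref (frag);                      /* dropped only after the check  */
        return done ? 1 : 0;                    /* 1 == "fragment gone"          */
    }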
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a race condition that can cause an endpoint to be
added to the wait list multiple times. To fix the issue an additional
check has been added to ensure the endpoint is not on the wait list
after the wait list lock is held. The wait list processing code has
also been updated to keep the wait list lock until all wait listed
endpoints have been handled. This reduces the chance that an endpoint
that is being processed by the wait list code will be re-added to the
list by a competing send.
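A minimal sketch of both changes (the re-check under the lock and
holding the lock across processing), again with plain pthreads and
illustrative names rather than the real btl structures:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct endpoint_t {
        struct endpoint_t *next;
        bool               on_wait_list;
    } endpoint_t;

    static pthread_mutex_t wait_list_lock = PTHREAD_MUTEX_INITIALIZER;
    static endpoint_t     *wait_list_head = NULL;

    static void wait_list_add (endpoint_t *ep)
    {
        pthread_mutex_lock (&wait_list_lock);
        /* Re-check membership after the lock is held: another thread may
         * have added this endpoint in the meantime. */
        if (!ep->on_wait_list) {
            ep->on_wait_list = true;
            ep->next = wait_list_head;
            wait_list_head = ep;
        }
        pthread_mutex_unlock (&wait_list_lock);
    }

    /* handle() must not call wait_list_add() in this simplified sketch. */
    static void wait_list_progress (void (*handle)(endpoint_t *))
    {
        /* Hold the lock until every wait-listed endpoint has been handled
         * so a competing send cannot re-add one that is mid-processing. */
        pthread_mutex_lock (&wait_list_lock);
        endpoint_t *ep = wait_list_head;
        wait_list_head = NULL;
        while (NULL != ep) {
            endpoint_t *next = ep->next;
            ep->on_wait_list = false;
            handle (ep);
            ep = next;
        }
        pthread_mutex_unlock (&wait_list_lock);
    }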
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Three minor updates from the code review of
https://github.com/open-mpi/ompi-release/pull/933:
* Remove an extra blank line in a show_help message
* We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so
change the flag to REGINT_GE_ONE
* Change "num_blocks" definition to be in terms of block_len (not
eq_size)
A bunch of empirical testing has shown that increasing the retransmit
timeout from 1ms to 5ms doesn't adversely affect performance, yet
decreases the number of gratuitous retransmissions.
Add endpoints in blocks so that we don't overrun the
fi_av_insert() event queue. Also make the AV EQ length an MCA param,
and report it in mca_btl_base_verbose >=5 output.
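A hedged sketch of the blocked insertion, assuming an AV opened for
asynchronous insert with an event queue bound to it; block_size stands
in for the btl_usnic_av_eq_num-style limit, one completion event per
inserted address is assumed, and error handling via fi_eq_readerr is
omitted:

    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_eq.h>

    static int av_insert_blocked (struct fid_av *av, struct fid_eq *eq,
                                  void *addrs, size_t addrlen, size_t count,
                                  fi_addr_t *fi_addrs, size_t block_size)
    {
        size_t outstanding = 0;

        for (size_t i = 0; i < count; ++i) {
            int ret = fi_av_insert (av, (char *) addrs + i * addrlen, 1,
                                    fi_addrs + i, 0, NULL);
            if (ret < 0) {
                return ret;
            }
            /* Drain the EQ once block_size inserts are outstanding so the
             * queue never has to hold more than block_size entries. */
            if (++outstanding == block_size || i + 1 == count) {
                while (outstanding > 0) {
                    struct fi_eq_entry entry;
                    uint32_t event;
                    ssize_t rc = fi_eq_sread (eq, &event, &entry,
                                              sizeof (entry), -1, 0);
                    if (rc < 0) {
                        return (int) rc;
                    }
                    --outstanding;
                }
            }
        }
        return 0;
    }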
Sequence numbers will wrap around; it is not sufficient to check for
(seq-1) -- we must use the SEQ_DIFF macro to properly handle the
wraparound.
This bug wasn't serious; it just meant we might retransmit one or two
extra times when retransmits were triggered and the sequence numbers
wrapped around their sliding windows.
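For reference, a wraparound-safe comparison looks roughly like this
(16-bit sequence numbers are assumed here; the real SEQ_DIFF macro
lives in the usnic BTL sources):

    #include <stdint.h>

    typedef uint16_t seq_t;

    /* The unsigned subtraction wraps modulo 2^16 and the cast back to a
     * signed 16-bit value gives the true distance even across a wrap. */
    #define SEQ_DIFF(a, b)  ((int16_t) ((seq_t) ((a) - (b))))
    #define SEQ_GE(a, b)    (SEQ_DIFF ((a), (b)) >= 0)

    /* Example: a = 0x0001, b = 0xFFFE.  A naive (a - b) gives -65533,
     * but SEQ_DIFF(a, b) == 3, i.e. "a" is 3 slots after "b". */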
When SMT is enabled, a core must be counted as long as at least one of its hwthreads is allowed.
Thanks Ben Menadue for the report.
This fixes a regression from open-mpi/ompi@6d149554a7
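The counting rule, sketched with hwloc (an illustration of the rule
only, not the actual opal code):

    #include <hwloc.h>

    /* A core counts as usable if ANY of its hardware threads is in the
     * allowed cpuset, instead of requiring every hwthread to be allowed. */
    static unsigned count_allowed_cores (hwloc_topology_t topo,
                                         hwloc_const_cpuset_t allowed)
    {
        unsigned count = 0;
        int ncores = hwloc_get_nbobjs_by_type (topo, HWLOC_OBJ_CORE);

        for (int i = 0; i < ncores; ++i) {
            hwloc_obj_t core = hwloc_get_obj_by_type (topo, HWLOC_OBJ_CORE, i);
            /* intersects == at least one allowed hwthread under this core */
            if (NULL != core && hwloc_bitmap_intersects (core->cpuset, allowed)) {
                ++count;
            }
        }
        return count;
    }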
This commit adds code to handle large unaligned gets. There are two
possible code paths for these transactions:
1) The remote region and local region have the same alignment. In
this case the get will be broken down into at most three get
transactions: 1 transaction to get the unaligned start of the region
(buffered), 1 transaction to get the aligned portion of the region,
and 1 transaction to get the end of the region (see the sketch after
these two cases).
2) The remote and local regions do not have the same alignment. This
should be an uncommon case and is not optimized. In this case a
buffer is allocated and registered locally to hold the aligned data
from the remote region. There may be cases where this fails (low
memory, can't register memory). Those conditions are unlikely and
will be handled later.
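A sketch of the same-alignment split from case 1; the 4-byte alignment
requirement and all names below are assumed purely for illustration:

    #include <stdint.h>
    #include <stddef.h>

    #define GET_ALIGN 4

    struct get_split {
        uint64_t head_addr;  size_t head_len;   /* buffered, unaligned start */
        uint64_t mid_addr;   size_t mid_len;    /* aligned bulk transaction  */
        uint64_t tail_addr;  size_t tail_len;   /* unaligned end             */
    };

    static void split_get (uint64_t addr, size_t len, struct get_split *s)
    {
        uint64_t aligned_start = (addr + GET_ALIGN - 1) & ~(uint64_t) (GET_ALIGN - 1);
        uint64_t aligned_end   = (addr + len) & ~(uint64_t) (GET_ALIGN - 1);

        if (aligned_start >= aligned_end) {
            /* Region too small to contain an aligned chunk: single buffered get. */
            s->head_addr = addr;  s->head_len = len;
            s->mid_addr = s->tail_addr = 0;
            s->mid_len = s->tail_len = 0;
            return;
        }

        s->head_addr = addr;
        s->head_len  = (size_t) (aligned_start - addr);
        s->mid_addr  = aligned_start;
        s->mid_len   = (size_t) (aligned_end - aligned_start);
        s->tail_addr = aligned_end;
        s->tail_len  = (size_t) ((addr + len) - aligned_end);
    }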
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If the state of the request is not set to OMPI_REQUEST_ACTIVE,
then MPI_Test immediately signals the request as completed
while hcoll may still be working on it.
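A toy illustration of why the state matters; the names mirror, but are
not, the real ompi_request_t fields:

    #include <stdbool.h>

    typedef enum { REQ_INACTIVE, REQ_ACTIVE } req_state_t;

    typedef struct {
        req_state_t state;
        bool        complete;
    } request_t;

    /* An inactive request is reported complete without being checked, so a
     * request hcoll is still progressing must be marked active first. */
    static bool request_test (const request_t *req)
    {
        if (REQ_INACTIVE == req->state) {
            return true;               /* "completed" immediately */
        }
        return req->complete;          /* active: report real progress */
    }

    static void start_hcoll_request (request_t *req)
    {
        req->complete = false;
        req->state    = REQ_ACTIVE;    /* the fix: mark active before return */
    }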
Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>