One should use the correct module object when calling
c_coll.coll_allgather. Otherwise a segfault can occur, for example
when hcoll is in use: in that case c_coll.coll_allgather is
mca_coll_hcoll_allgather while c_coll.coll_gather_module points to the
tuned component.
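For illustration, a minimal fragment of the internal calling convention
(the buffer/count/datatype arguments are placeholders, not part of the
original change):

    /* The module argument must be the one registered for the collective
     * being invoked.  Passing another collective's module (for example
     * coll_gather_module) can crash inside the hcoll allgather. */
    comm->c_coll.coll_allgather(sbuf, scount, sdtype,
                                rbuf, rcount, rdtype, comm,
                                comm->c_coll.coll_allgather_module); /* correct */
    /* wrong: comm->c_coll.coll_gather_module */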
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
As we changed the ABI (forcing a major release), we can limit
the size of the predefined communicators by moving the collective
structure outside the communicator. This might have a minimal, and
likely unnoticeable, impact on performance. This approach was discussed
during the January 2017 devel meeting.
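Presumably the shape of the change is along these lines (a sketch, not
the actual diff):

    /* Before: the collective table was embedded in the communicator,
     * inflating every predefined communicator:
     *     mca_coll_base_comm_coll_t  c_coll;
     * After: the table is allocated separately and referenced by
     * pointer, so call sites become comm->c_coll->coll_barrier(...)
     * instead of comm->c_coll.coll_barrier(...). */
    mca_coll_base_comm_coll_t *c_coll;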
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Under heavy load the locking code could fail if the underlying btl
module started to return OPAL_ERR_OUT_OF_RESOURCE on atomic
operations. This commit updates the code to gracefully handle btl
errors.
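A sketch of the graceful-handling pattern, with a hypothetical helper
(try_issue_atomic) standing in for the real osc/pt2pt code:

    /* Treat a transient resource shortage as "try again later" rather
     * than as a fatal error: progress the library and re-issue the
     * atomic instead of failing the lock operation. */
    int ret;
    do {
        ret = try_issue_atomic (module, peer);   /* hypothetical helper */
        if (OPAL_ERR_OUT_OF_RESOURCE == ret) {
            opal_progress ();                    /* let the btl drain */
        }
    } while (OPAL_ERR_OUT_OF_RESOURCE == ret);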
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
In this context, AMD64 really means amd64 or em64t, so rename it to
X86_64 in order to avoid any confusion.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit implements onesided operations for noncontiguous
datatypes using two different algorithms.
* If the result and/or origin datatype is noncontiguous and the
target datatype is contiguous, then an iovec MD is created for
the result and origin. The operation is performed using a
single Portals4 call (unless it exceeds the max message size).
* If the target datatype is noncontiguous, then an algorithm
similar to the one in osc/rdma is used to loop over the
contiguous blocks of each datatype (see the sketch below). The
operation is performed using multiple Portals4 calls.
This commit ensures that individual operations do not exceed the
max atomic size or the max message size supported by the device.
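For the noncontiguous case, the loop over contiguous blocks can be
driven by the datatype engine. A rough sketch, using opal_convertor_raw
as I understand it (not the actual osc-portals4 code; buf, count and
datatype are placeholders for one side of the operation):

    opal_convertor_t convertor;
    struct iovec iov[32];
    uint32_t iov_count;
    size_t length;
    int done;

    OBJ_CONSTRUCT(&convertor, opal_convertor_t);
    opal_convertor_copy_and_prepare_for_send (ompi_mpi_local_convertor,
                                              &datatype->super, count,
                                              buf, 0, &convertor);
    do {
        iov_count = 32;
        /* Each iovec entry is one contiguous block of the datatype;
         * issue one Portals4 call per block, further splitting blocks
         * that exceed the max message/atomic size. */
        done = opal_convertor_raw (&convertor, iov, &iov_count, &length);
        /* ... one PtlPut/PtlGet/PtlAtomic per iov[i] ... */
    } while (!done);
    OBJ_DESTRUCT(&convertor);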
Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
Add padding so the memory allocated by MPI_Win_allocate_shared()
is 64-byte aligned.
Thanks to Joseph Schuchart for the bug report.
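The padding itself is simple arithmetic; a minimal sketch, assuming the
start of the shared segment is itself 64-byte aligned:

    /* Round each per-process segment up to a multiple of 64 so that
     * every base pointer handed back by MPI_Win_allocate_shared()
     * stays 64-byte aligned, regardless of the sizes requested by the
     * previous ranks. */
    size_t padded_size = (size + 63) & ~(size_t) 63;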
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit fixes a number of threading issues discovered in
osc/pt2pt. This includes:
- Lock the synchronization object, not the module, in osc_pt2pt_start.
This fixes a race between the start function and the processing of
post messages.
- Always lock before calling cond_broadcast. This fixes a race between
the waiting thread and the signaling thread (see the sketch after
this list).
- Make all atomically updated values volatile.
- Make the module lock recursive to protect against some deadlock
conditions. This will be rolled back once the locks have been
redesigned.
- Mark incoming complete *after* completing an accumulate, not
before. The old ordering was producing incorrect answers under
certain conditions.
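The lock-before-broadcast item, as a generic sketch (plain pthreads
purely for illustration, not the actual osc/pt2pt locking primitives):

    /* Signaling thread: take the lock before changing the predicate
     * and broadcasting, so the waiter cannot miss the wakeup between
     * its predicate check and its cond_wait. */
    pthread_mutex_lock (&lock);
    done = true;
    pthread_cond_broadcast (&cond);
    pthread_mutex_unlock (&lock);

    /* Waiting thread */
    pthread_mutex_lock (&lock);
    while (!done) {
        pthread_cond_wait (&cond, &lock);
    }
    pthread_mutex_unlock (&lock);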
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Using MPI_MINLOC or MPI_MAXLOC with the following data types
leads to data corruption:
* MPI_DOUBLE_INT
* MPI_LONG_INT
* MPI_SHORT_INT
* MPI_LONG_DOUBLE_INT
Detect this case, print an error message, and abort.
This workaround should be removed once the following issue is resolved:
* https://github.com/open-mpi/ompi/issues/1666
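The shape of the workaround, sketched as a fragment with the MPI-level
handles (the actual check operates on the internal op and datatype
objects):

    /* Workaround: refuse the known-broken combinations rather than
     * silently corrupting data. */
    if ((MPI_MINLOC == op || MPI_MAXLOC == op) &&
        (MPI_DOUBLE_INT == dtype || MPI_LONG_INT == dtype ||
         MPI_SHORT_INT == dtype || MPI_LONG_DOUBLE_INT == dtype)) {
        fprintf (stderr, "MPI_MINLOC/MPI_MAXLOC is not supported with "
                 "this datatype here; see open-mpi/ompi#1666\n");
        MPI_Abort (MPI_COMM_WORLD, 1);
    }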
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* When using `MPI_Put` with `MPI_Win_lock_all`, a hang is possible since
the `put` is waiting on `eager_send_active` to become `true`, but that
variable might not be reset in the case of `MPI_Win_lock_all`,
depending on other incoming events (e.g., `post` messages or ACKs of
lock requests). See the reproducer sketch below.
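A hypothetical reproducer for this class of hang (a minimal sketch, not
the actual test case; whether it hangs depends on the timing of
incoming events):

    #include <mpi.h>

    int main (int argc, char **argv)
    {
        int rank, size, value = 42, *base;
        MPI_Win win;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);
        MPI_Win_allocate (sizeof (int), sizeof (int), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &base, &win);

        MPI_Win_lock_all (0, win);
        if (0 == rank && size > 1) {
            /* Before the fix this put could spin waiting for
             * eager_send_active, which MPI_Win_lock_all did not
             * always reset. */
            MPI_Put (&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_unlock_all (win);

        MPI_Win_free (&win);
        MPI_Finalize ();
        return 0;
    }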
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* When using `MPI_Win_lock`/`MPI_Win_unlock` with `MPI_Get` and
non-contiguous datatypes, it is possible that the unlock finishes too
early, before the data is actually present in the receive buffer (see
the reproducer sketch below).
* We need to wait for the irecv to complete before unlocking the target.
This commit waits for the outgoing fragment counts to become equal
before unlocking.
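A hypothetical reproducer sketch for this case: the origin reads the
receive buffer right after the unlock, which is only valid if the irecv
really completed before `MPI_Win_unlock` returned.

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char **argv)
    {
        int rank, size, *base, buf[8] = {0};
        MPI_Datatype vec;
        MPI_Win win;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);
        MPI_Win_allocate (8 * sizeof (int), sizeof (int), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &base, &win);
        for (int i = 0; i < 8; ++i) base[i] = rank * 100 + i;
        MPI_Barrier (MPI_COMM_WORLD);

        /* Non-contiguous datatype: every other int. */
        MPI_Type_vector (4, 1, 2, MPI_INT, &vec);
        MPI_Type_commit (&vec);

        if (0 == rank && size > 1) {
            MPI_Win_lock (MPI_LOCK_SHARED, 1, 0, win);
            MPI_Get (buf, 1, vec, 1, 0, 1, vec, win);
            MPI_Win_unlock (1, win);
            /* The data must be in buf now; before the fix it could
             * still be in flight at this point. */
            printf ("buf[0] = %d (expect 100)\n", buf[0]);
        }

        MPI_Type_free (&vec);
        MPI_Win_free (&win);
        MPI_Finalize ();
        return 0;
    }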
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* If the user uses PSCW synchronization after a fence, then the previous
epoch is not reset, which can cause the PSCW epoch to transfer data
before it is ready, leading to wrong answers (see the sketch below).
* This commit resets the `eager_send_active` in the start call.
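The problematic pattern, as a fragment (win, buf, peer, and
target_group are placeholders):

    /* Fence epoch first ... */
    MPI_Win_fence (0, win);
    MPI_Put (buf, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
    MPI_Win_fence (0, win);

    /* ... then PSCW on the same window.  Before this fix the start
     * call did not reset eager_send_active, so data from the new
     * access epoch could be pushed before the matching post had
     * arrived. */
    MPI_Win_start (target_group, 0, win);
    MPI_Put (buf, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
    MPI_Win_complete (win);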
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Instead of ompi_datatype_get_extent(), use ompi_datatype_get_true_extent()
to get the local and remote lower bound. For derived types like
subarray, true_lb is the correct offset for RDMA operations.
Instead of ompi_datatype_get_extent(), use ompi_datatype_get_true_extent()
to get the origin and target lower bound. For derived types like
subarray, true_lb is the correct offset for RDMA operations. Also, use
the size of the datatype instead of the extent.
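Roughly, the change amounts to the following (a sketch of the OMPI
datatype helpers as I understand them, not the actual diff; datatype is
a placeholder):

    ptrdiff_t lb, extent, true_lb, true_extent;
    size_t dt_size;

    /* Old: lb/extent, which include artificial bounds for types such
     * as subarrays. */
    ompi_datatype_get_extent (datatype, &lb, &extent);

    /* New: true_lb is the offset of the first real byte, which is the
     * right base offset for RDMA, and the datatype size is the number
     * of bytes actually transferred. */
    ompi_datatype_get_true_extent (datatype, &true_lb, &true_extent);
    ompi_datatype_type_size (datatype, &dt_size);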
This commit fixes a typo in compare-and-swap when retrieving the
memory region associated with a displacement. It was erroneously using
8 bytes instead of the datatype size. This could cause an incorrect RMA
range error when the compare-and-swap target is less than 4 bytes from
the end of the region.
Fixes open-mpi/ompi#2080
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for using network AMOs for MPI_Accumulate,
MPI_Fetch_and_op, and MPI_Compare_and_swap. This support is only
enabled if the ompi_single_intrinsic info key is specified or the
acc_single_intrinsic MCA variable is set. This configuration
indicates to this implementation that no long accumulates will be
performed since these do not currently mix with the AMO
implementation.
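From the user side, the info key would presumably be passed at window
creation, along these lines (the "true" value string and the
bytes/base/win variables are assumptions):

    MPI_Info info;
    MPI_Info_create (&info);
    /* Promise that only single-element (intrinsic) accumulates will be
     * performed on this window, so the osc component may use network
     * AMOs directly. */
    MPI_Info_set (info, "ompi_single_intrinsic", "true");
    MPI_Win_allocate (bytes, 1, info, MPI_COMM_WORLD, &base, &win);
    MPI_Info_free (&info);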
This commit also cleans up the code somewhat. This includes removing
unnecessary struct keywords where the type is also typedef'd.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit cleans up some code in the passive target path. The code
used the buffered frag control send path but it is more appropriate to
use the unbuffered one. This avoids checking structures that should
not be in use in this path.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes an ordering bug in the code that keeps track of all
attached memory windows. The code is intended to keep the memory
regions sorted but was often inserting at the wrong index. Thanks to
Christoph Niethammer for reporting the issue. The reproducer will be
added to nightly MTT testing.
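For illustration, a correct sorted insert by base address looks roughly
like this (illustrative array and field names, not the actual osc
code):

    /* Find the first attached region whose base is larger, shift the
     * tail up one slot, and insert so the array stays sorted. */
    int i = 0;
    while (i < region_count && regions[i].base < new_region.base) {
        ++i;
    }
    memmove (regions + i + 1, regions + i,
             (region_count - i) * sizeof (*regions));
    regions[i] = new_region;
    ++region_count;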
Fixes open-mpi/ompi#2012
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
It is possible for another thread to process a lock ack before the
peer is set as locked. In this case, setting either the locked flag or
the eager active flag might clobber the other thread's update. To
address this, the flags have been made volatile and are set atomically.
Since there is no opal_atomic_or or opal_atomic_and function, just use
cmpset for now.
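The cmpset-based update has the usual shape; a generic sketch (using
the GCC builtin here purely for illustration, while the real code uses
OPAL's cmpset):

    #include <stdint.h>

    /* Emulate an atomic OR with a compare-and-swap retry loop. */
    static void atomic_or32 (volatile int32_t *addr, int32_t bits)
    {
        int32_t old;
        do {
            old = *addr;
        } while (!__sync_bool_compare_and_swap (addr, old, old | bits));
    }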
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes some bugs uncovered during thread testing of
2.0.1rc1. With these fixes the component is running cleanly with
threads.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit changes the semantics of ompi request callbacks. If a
request's callback has freed or re-posted (using start) a request
the callback must return 1 instead of OMPI_SUCCESS. This indicates
to ompi_request_complete that the request should not be modified
further. This fixes a race condition in osc/pt2pt that could lead
to the req_state being inconsistent if a request is freed between
the callback and setting the request as complete.
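Under the new contract, a callback that disposes of its request would
look roughly like this (illustrative function name, not actual osc/pt2pt
code):

    static int frag_complete_cb (ompi_request_t *request)
    {
        /* ... consume the completion ... */

        ompi_request_free (&request);

        /* Returning 1 instead of OMPI_SUCCESS tells
         * ompi_request_complete() that the request has been freed (or
         * re-posted with start) and must not be touched again. */
        return 1;
    }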
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The original lock_all algorithm in osc/pt2pt sent a lock message to
each peer in the communicator even if the peer is never the target of
an operation. Since this scales very poorly, the implementation has
been replaced by one that locks the remote peer on first communication
after a call to MPI_Win_lock_all.
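The lazy scheme, sketched with hypothetical helper and field names (not
the actual osc/pt2pt symbols):

    /* Before the first operation that targets a peer inside a lock_all
     * epoch, acquire the remote lock on demand instead of having
     * locked every rank up front in MPI_Win_lock_all. */
    if (!peer->lock_acquired) {
        send_lock_request (module, peer);   /* hypothetical helper */
        wait_for_lock_ack (module, peer);   /* hypothetical helper */
        peer->lock_acquired = true;
    }
    /* ... now issue the put/get/accumulate to this peer ... */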
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>