This commit fixes an issue with non-debug builds where adding an
attachment to the attachment list doesn't actually happen. This
causes all MPI_Win_detach calls to fail. The call was within an
assert which is optimized out in optimized builds.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
(cherry picked from commit 8ee80d885561c4349ab97725416613614d63c77f)
This commit increaes the osc_rdma_max_attach variable from 32
to 64. The new default is kept low due to the small number
of registration resources on some systems (Cray Aries). A
larger max attachement value can be set by the user on other
systems.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
(cherry picked from commit 54c8233f4f670ee43e59d95316b8dc68f8258ba0)
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
This commit addresses two issues in osc/rdma:
1) It is erroneous to attach regions that overlap. This was being
allowed but the standard does not allow overlapping attachments.
2) Overlapping registration regions (4k alignment of attachments)
appear to be allowed. Add attachment bases to the bookeeping
structure so we can keep better track of what can be detached.
It is possible that the standard did not intend to allow #2. If that
is the case then #2 should fail in the same way as #1. There should
be no technical reason to disallow #2 at this time.
References #7384
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
(cherry picked from commit 6649aef8bde750d2f1f28e52ff295919face3e0b)
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
mark the "self" peer OMPI_OSC_RDMA_PEER_LOCAL_BASE when
the window is dynamically created and use_cpu_atomics is set
in order to correctly handle communications to self.
Thanks Bart Janssens for reporting this issue.
Refs. open-mpi/ompi#6394
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(back-ported from commit open-mpi/ompi@fe05fcc11a)
gcc complains about ret possibly being used uninitialized. That will
never happen but we should still quiet the warning. This commit sets
ret to a valid value.
Fixes#5513
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The aggregation code in osc/rdma is currently broken and will likely
not be reused. This commit cleans it out.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a crash that occurs when using btl/vader as an RDMA
btl. This btl supports using CPU atomics and does not support using
the btl for self communication so we must use the local memory
optimizations in osc/rdma.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The osc/rdma module did not wait for all pending atomics to complete
before tearing down. This could lead to weird issues as the target
location may no longer be registered or allocated.
This commit also fixes an offset calculation issue in
ompi_osc_get_data_blocking ().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new MCA variable to set the location of the backing
store: osc_rdma_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug is osc/rdma that can occur if the total size
of the shared memory segment gets larger than 4 GiB. The bug was
caused by a typo. The type of my_base_offset should have been size_t
not int.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit is a large update to the osc/rdma component. Included in
this commit:
- Add support for using hardware atomics for fetch-and-op and single
count accumulate when using the accumulate lock. This will improve
the performance of these operations even when not setting the
single intrinsic info key.
- Rework how large accumulates are done. They now block on the get
operation to fix some bugs discovered by an IBM one-sided test. I
may roll back some of the changes if the underlying bug in the
original design is discovered. There appear to be no real
difference (on the hardware this was tested with) in performance so
its probably a non-issue. References #2530.
- Add support for an additional lock-all algorithm: on-demand. The
on-demand algorithm will attempt to acquire the peer lock when
starting an RMA operation. The lock algorithm default has not
changed. The algorithm can be selected by setting the
osc_rdma_locking_mode MCA variable. The valid values are two_level
and on_demand.
- Make use of the btl_flush function if available. This can improve
performance with some btls.
- When using btl_flush do not keep track of the number of put
operations. This reduces the number of atomic operations in the
critical path.
- Make the window buffers more friendly to multi-threaded
applications. This was done by dropping support for multiple
buffers per MPI window. I intend to re-add that support once the
underlying performance bug under the old buffering scheme is
fixed.
- Fix a bug in request completion in the accumulate, get, and put
paths. This also helps with #2530.
- General code cleanup and fixes.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes the following bugs:
- Allow a btl to be used for communication if it can communicate with
all non-self peers and it supports global atomic visibility. In
this case CPU atomics can be used for self and the btl for any
other peer.
- It was possible to get into a state where different threads of an
MPI process could issue conflicting accumulate operations to a
remote peer. To eliminate this race we now update the peer flags
atomically.
- Queue up and re-issue put operations that failed during a BTL
callback. This can occur during an accumulate operation. This was
an unhandled error case.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
higher priority than rdma and default to psm2.
Context: the Intel Omni-path driver (hfi1) has verbs support, so the openib
btl is available to use. However, at a bad performance. Without this
change osc rdma using btl openib is the default choice when running on Intel
Omni-path, with a lower performance than osc pt2pt over mtl psm2.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
This commit renames the atomic compare-and-swap functions to indicate
the return value. This is in preperation for adding support for a
compare-and-swap that returns the old value. At the same time the
return type has been changed to bool.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Resolves#3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la"
```
Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Update to support passing of HWLOC shmem topology to client procs
Update use of distance API per @bgoglin
Have the openib component lookup its object in the distance matrix
Bring usnic up-to-date
Restore binding for hwloc2
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This commit changes the locking code to allow the lock release to be
non-blocking. This helps with releasing the accumulate lock which may
occur in a BTL callback.
Fixes#3616
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The data endpoint was not being set correctly for local peers in some
cases. This commit fixes the bug and cleans the associated code to
simplify the logic.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
a buffer defined by (buf, count, dt)
will have data starting at buf+offset and ending len bytes later with
len = opal_datatype_span(&dt.super, count, &offset);
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
origin_datatype and target_datatype might be different and hence have different extent,
so use either origin_extent or target_extent when appropriate.
Refs open-mpi/ompi#3569
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The osc_rdma_get_remote_segment() has the 3rd and 4th args as
* target_disp
* length
which it uses to determine if the rdma falls within the bounds of
the window or not (actually it only checks the upper bound, but I'm
okay with that).
Anyway the caller previously was passing in the length argument as
target_datatype->super.size * target_count
which which doesn't really represent the number of bytes after target_disp
for which data exists. In particular I could create a datatype as
{ disp -4, len 4 } and use target_disp 4
and that would be bytes 0-3 of the window where the original code
would think it was bytes 4-7 and could abort at the range check.
Ive changed it to use the opal_datatype_span() function.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
See bug report
https://github.com/open-mpi/ompi/issues/3548
If a 1sided test is launched -host hostA:2,hostB:1 some of the ranks
call allocate_state_single() and others call allocate_state_shared().
These functions were producing different values for module->state_size
but that's used when they lookup peer info from each other in
ompi_osc_rdma_peer_setup() so they need to all have matching
module->state_offset values.
This change adds a few unused bytes in the memory allocate_state_single()
creates so it matches.
Signed-off-by: Mark Allen <markalle@us.ibm.com>