New MCA parameter: btl_usnic_ack_iteration_delay. Set this to the
number of times through the usNIC component progress function before
sending a standalone ACK (vs. piggy-backing the ACK on any other send
going to the target peer).
Use "ticks" language to clarify that we're really counting the number
of times through the usNIC component DATA_CHANNEL completion check (to
check for incoming messages) -- it has no relation to wall clock time
whatsoever.
Also slightly change the channel-checking scheme in usNIC component
progress: only check the PRIORITY channel once (vs. checking it once,
not finding anything, and then falling through the progress_2() where we
check PRIORITY again and then check the DATA channel).
As before, if our "progress" libevent fires, increment the tick
counter enough to guarantee that all endpoints that need an ACK will
get triggered to send standalone ACKs the next time through progress,
if necessary.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 968b1a51b59898877a8c7268d463d3d7d78d86a3)
Rename "get_nsec()" to "get_ticks()" to more accurately reflect that
this function has no correlation to wall clock time at all.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit ce2910a28aea61043b81324c67999f3a47cfe7ac)
Trying out to run processes via mpirun in Podman containers has shown
that the CMA btl_vader_single_copy_mechanism does not work when user
namespaces are involved.
Creating containers with Podman requires at least user namespaces to be
able to do unprivileged mounts in a container
Even if running the container with user namespace user ID mappings which
result in the same user ID on the inside and outside of all involved
containers, the check in the kernel to allow ptrace (and thus
process_vm_{read,write}v()), fails if the same IDs are not in the same
user namespace.
One workaround is to specify '--mca btl_vader_single_copy_mechanism none'
and this commit adds code to automatically skip CMA if user namespaces
are detected and fall back to MCA_BTL_VADER_EMUL.
Signed-off-by: Adrian Reber <areber@redhat.com>
(cherry picked from commit fc68d8a90fe86284e9dc730f878b55c0412f01d2)
This commit changes how the single-copy emulation in the vader btl
operates. Before this change the BTL set its put and get limits
based on the max send size. After this change the limits are unset
and the put or get operation is fragmented internally.
References #6568
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
(cherry picked from commit ae91b11de2314ab11a9842d9738cd14f8f1e393b)
Commit d7053a3 broke things for the case when Open MPI 4.0.x is built
without UCX support. Problem was it was trying to partially initialize
the btl to try and delay printing of a help message till wireup. Well
this sort of doesn't work in all cases. Rather than keep piling on
changes to support a help message for a BTL that we are deprecating, take
a keep it simple stupid approach.
So, revert most of d7053a3 and instead put the help message back in the
original location, during scan of ports of the available HCAs to check
for whether or not link layer for that port is configured for ethernet or infiniband.
If Open MPI was built with UCX support, don't emit the help message, if
UCX was not linked in, emit the help message.
Verified on a system with connectX5 HCAs configured with two ports configured
for ethernet and two for infiniband.
relates to #6785
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Per patches from @SteVwonder and @garlick
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit d4070d5f58f0c65aef89eea5910b202b8402e48b)
This commit updates the uct btl to support the v1.6.x release of
UCX. This release breaks API.
Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>
(cherry picked from commit b78066720c3e3299bd76f2e22d2c0e415db572fc)
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
This is mostly based off recent UCX additions to their patcher:
https://github.com/openucx/ucx/pull/2703
They added triggers for
* mmap when (flags & MAP_FIXED) && (addr != NULL)
* shmat when (shmflg & SHM_REMAP) && (shmaddr != NULL)
Beyond that I noticed they already had a trigger for
* madvise when (advice == MADV_FREE)
that we didn't so I added that.
And the other main thing is we didn't really have shmat/shmdt
active for some systems because we only had a path for
syscall(SYS_shmdt, ) but we needed to also have a path for
syscall(SYS_ipc, IPCOP_shmdt, ) and same for shmat.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit eb888118e83f56c131aff900b03eab34c92b7805)
- initialize memory hooks infrastructure only in case
if external memory hooks are requested
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit a0a93060668cd11a783cc94c753efb3129df9dde)
free the component mpool in mca_btl_vader_component_close()
and after freeing soem objects that depend on it such as
mca_btl_vader_component.vader_frags_user
Thanks Christoph Niethammer for reporting this.
Refs. open-mpi/ompi#6524
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@77060cad07)
The new routine transfers the data asynchronously from the source PE to all
PEs in the OpenSHMEM job. The routine returns immediately. The source and
target buffers are reusable only after the completion of the routine.
After the data is transferred to the target buffers, the counter object
is updated atomically. The counter object can be read either using atomic
operations such as shmem_atomic_fetch or can use point-to-point synchronization
routines such as shmem_wait_until and shmem_test.
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 2ef5bd8b3671f1e10caf00d06d66d120eac9c5be)
* Forcing the 'hash' gds component should not be necessary any more.
Port of PR #6498 (component names changed so a cherry-pick would not work)
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Many thanks to Sergey Oblomov for reporting this issue
and the countless traces provided when troubleshooting it.
This is a one-off commit for the v4.0.x branch since btl/openib has been removed
from master.
Refs. open-mpi/ompi#6137
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fixes an issue introduced in open-mpi/ompi@0a2ce58040
This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master.
Refs. open-mpi/ompi#6137
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@0a2ce58040)
This commit fixes a bug introduced in
f62d26ddbc8cda4d985cceee531a2ec32406d1f6. That commit changed how
vader allocates fragment memory from the shared memory
segment. Unfortunately, the values used for the fragment sizes did not
include space for the fragment header. This can cause an overrun of
data from one fragment to the header of the next fragment.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates btl/vader to use an mpool for handling all shared
memory allocations (frags, fboxes).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new mpool base module type: basic. This module can
be used with an opal_free_list_t to allocate space from a
pre-allocated block (such as a shared memory region). The new module
only supports allocation and is not meant for more dynamic use cases
at this time.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
I think the strncat() calls here need to be of the form
strncat(str, new_str_to_add, len - strlen(new_str_to_addstr) - 1);
since in the OMPI calls len is being used as total number of bytes
in str.
strncat(dest,src,n) on the other hand is documented as writing up to
n chars from the incoming string plus 1 for the null, for n+1 total
bytes it can write.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit 30d60994d258f5f0b7c432efd284d1b6b8333faf)
Conflicts:
opal/mca/hwloc/base/hwloc_base_util.c
- there was a set of UCX related issues reported which caused
by mmap API hooks conflicts. We added diagnostic of such
problems to simplify bug-resolving pipeline
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit d8e3562bae700d84873c1d5ca9c45c846d7387ed)
Use $(AM_CPPFLAGS) in $(usnic_btl_run_tests_CPPFLAGS) so that we don't
have to replicate hard-coded values.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 14563770a1d64c465ee1f205c9981de39970bb33)
do define the OMPI_LIBMPI_NAME macro via the CPPFLAGS.
The issue occurs when Open MPI is configured with
--enable-opal-btl-usnic-unit-tests
Thanks George Marselis for reporting this issue
Refs. open-mpi/ompi#6441
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@b4097626ab)
When Slurm is built against PMIx, some installations place a copy of the
PMIx library that Slurm is linking against in the Slurm PMI location.
Current configury ignores that location. The desired behavior is to look
for a PMIx lib in that location when --with-pmi is given. If the user
also specifies --with-pmix and gives a different location, then override
anything previously found and look for it where the user directed.
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit cd1b5641beca7f158360983cd31f7297548b0a3c)
It never lived up to its purpose (and has caused amorphous indirect
errors such as https://github.com/open-mpi/ompi/issues/2519), so
delete it.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit dd20174532928e0c9cdbe7b206868e6e4bea9d0b)
Update the OPAL_CHECK_OFI configury macro:
- Make it safe to call the macro multiple times:
- The checks only execute the first time it is invoked
- Subsequent invocations, it just emits a friendly "checking..."
message so that configure output is sensible/logical
- With the goal of ultimately removing opal/mca/common/ofi, rename the
output variables from OPAL_CHECK_OFI to be
opal_ofi_{happy|CPPFLAGS|LDFLAGS|LIBS}.
- Update btl/usnic and mtl/ofi for these new conventions.
- Also, don't use AC_REQUIRE to invoke OPAL_CHECK_OFI because that
causes the macro to be invoked at a fairly random time, which makes
configure stdout confusing / hard to grok.
- Remove a little left-over kruft in OPAL_CHECK_OFI, too (which
resulted in an indenting change, making the change to
opal_check_ofi.m4 look larger than it really is).
Thanks Alastair McKinstry for the report and initial fix.
Thanks Rashika Kheria for the reminder.
Updated from master cherry pick: the OFI BTL does not exist on the
v4.0.x branch. Therefore, did not include the OFI BTL changes on
master in this cherry pick.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f5e1a672ccd5db127e85e1e8f6bcfeb8a8b04527)
Some macros defined by the embedded hwloc ends up in opal_config.h
because hwloc configury m4 files are slurped into Open MPI. These
macros are not required here, and they might conflict with an external
hwloc install, so simply #undef them in hwloc/external/external.h
after including <opal_config.h> but before including the external
<hwloc.h>.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f22b7d4f46b03554add3ff2254d1c893359aff84)