The new routine transfers the data asynchronously from the source PE to all
PEs in the OpenSHMEM job. The routine returns immediately. The source and
target buffers are reusable only after the completion of the routine.
After the data is transferred to the target buffers, the counter object
is updated atomically. The counter object can be read either using atomic
operations such as shmem_atomic_fetch or can use point-to-point synchronization
routines such as shmem_wait_until and shmem_test.
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 2ef5bd8b36)
* Forcing the 'hash' gds component should not be necessary any more.
Port of PR #6498 (component names changed so a cherry-pick would not work)
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Many thanks to Sergey Oblomov for reporting this issue
and the countless traces provided when troubleshooting it.
This is a one-off commit for the v4.0.x branch since btl/openib has been removed
from master.
Refs. open-mpi/ompi#6137
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fixes an issue introduced in open-mpi/ompi@0a2ce58040
This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master.
Refs. open-mpi/ompi#6137
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@0a2ce58040)
This commit fixes a bug introduced in
f62d26ddbc. That commit changed how
vader allocates fragment memory from the shared memory
segment. Unfortunately, the values used for the fragment sizes did not
include space for the fragment header. This can cause an overrun of
data from one fragment to the header of the next fragment.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates btl/vader to use an mpool for handling all shared
memory allocations (frags, fboxes).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new mpool base module type: basic. This module can
be used with an opal_free_list_t to allocate space from a
pre-allocated block (such as a shared memory region). The new module
only supports allocation and is not meant for more dynamic use cases
at this time.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
I think the strncat() calls here need to be of the form
strncat(str, new_str_to_add, len - strlen(new_str_to_addstr) - 1);
since in the OMPI calls len is being used as total number of bytes
in str.
strncat(dest,src,n) on the other hand is documented as writing up to
n chars from the incoming string plus 1 for the null, for n+1 total
bytes it can write.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit 30d60994d2)
Conflicts:
opal/mca/hwloc/base/hwloc_base_util.c
- there was a set of UCX related issues reported which caused
by mmap API hooks conflicts. We added diagnostic of such
problems to simplify bug-resolving pipeline
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit d8e3562bae)
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.
Thanks Ben Menadue for the initial bug report
Refs open-mpi/ompi#6016
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Use $(AM_CPPFLAGS) in $(usnic_btl_run_tests_CPPFLAGS) so that we don't
have to replicate hard-coded values.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 14563770a1)
do define the OMPI_LIBMPI_NAME macro via the CPPFLAGS.
The issue occurs when Open MPI is configured with
--enable-opal-btl-usnic-unit-tests
Thanks George Marselis for reporting this issue
Refs. open-mpi/ompi#6441
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@b4097626ab)
When Slurm is built against PMIx, some installations place a copy of the
PMIx library that Slurm is linking against in the Slurm PMI location.
Current configury ignores that location. The desired behavior is to look
for a PMIx lib in that location when --with-pmi is given. If the user
also specifies --with-pmix and gives a different location, then override
anything previously found and look for it where the user directed.
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit cd1b5641be)
It never lived up to its purpose (and has caused amorphous indirect
errors such as https://github.com/open-mpi/ompi/issues/2519), so
delete it.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit dd20174532)
Update the OPAL_CHECK_OFI configury macro:
- Make it safe to call the macro multiple times:
- The checks only execute the first time it is invoked
- Subsequent invocations, it just emits a friendly "checking..."
message so that configure output is sensible/logical
- With the goal of ultimately removing opal/mca/common/ofi, rename the
output variables from OPAL_CHECK_OFI to be
opal_ofi_{happy|CPPFLAGS|LDFLAGS|LIBS}.
- Update btl/usnic and mtl/ofi for these new conventions.
- Also, don't use AC_REQUIRE to invoke OPAL_CHECK_OFI because that
causes the macro to be invoked at a fairly random time, which makes
configure stdout confusing / hard to grok.
- Remove a little left-over kruft in OPAL_CHECK_OFI, too (which
resulted in an indenting change, making the change to
opal_check_ofi.m4 look larger than it really is).
Thanks Alastair McKinstry for the report and initial fix.
Thanks Rashika Kheria for the reminder.
Updated from master cherry pick: the OFI BTL does not exist on the
v4.0.x branch. Therefore, did not include the OFI BTL changes on
master in this cherry pick.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f5e1a672cc)
Reset ptypes when cloning a datatype in order to prevent
a double free() in the opal_datatype_t destructor.
This fixes a bug introduced in open-mpi/ompi@7c938f070fFixesopen-mpi/ompi#6346
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@b395342c9f)
The issue was a little complicated due to the internal stack used in the
convertor. The main issue was that in the case where we run out of iov
space to save the raw description of the data while hanbdling a
repetition (loop), instead of saving the current position and bailing out
directly we reading of the next predefined type element. It worked in
most cases, except the one identified by the HDF5 test. However, the
biggest issue here was the drop in performance for all ensuing calls to
the convertor pack/unpack, as instead of handling contiguous loops as a
whole (and minimizing the number of memory copies) we copied data
description by data description.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(back-ported from commit open-mpi/ompi@5a82c4fd07)
correctly handle the case in which iovec is full and the
last accessed element of the datatype is the beginning of a loop
Refs. open-mpi/ompi#6285
Thanks Axel Huebl for reporting this
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(back-ported from commit open-mpi/ompi@0832ab5acc)
correctly free ptypes if the datatype is not pre-defined.
Thanks Axel Huebl for reporting this.
Refs. open-mpi/ompi#6291
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 7c938f070f)
opal_config_bottom.h can only be #include'd in opal_config.h,
so there is no need to #include "opal_config.h" inside.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit c8790d29de)
Some macros defined by the embedded hwloc ends up in opal_config.h
because hwloc configury m4 files are slurped into Open MPI. These
macros are not required here, and they might conflict with an external
hwloc install, so simply #undef them in hwloc/external/external.h
after including <opal_config.h> but before including the external
<hwloc.h>.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f22b7d4f46)
Update the OPAL glue configure code to correctly link the opal/pmix3
component to the hwloc used by OMPI instead of defaulting to the
system-level hwloc. Required a corresponding update to the PMIx hwloc
configure code so we treat hwloc the same way we handle libevent in
embedded scenarios. Roll to PMIx v3.1.2 for plugging of memory leaks and
addition of faster PMIx_Get response
Signed-off-by: Ralph Castain <rhc@pmix.org>
This commit fixes a bug where add_procs can incorrectly return an
error when going through the dynamic add_procs path. This doesn't
happen normally, only when pml/ob1 is not in use.
References #6201
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 30b8336cb4)
The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard.
Open MPI does not really need them (they are only used to be reported as not supported),
so smply #ifdef protect them to support an external PMIx v3.1
The change only need to be done in ext3x/ext3x.c.
But since this file is automatically generated from pmix3x/pmix3x.c, we have to update
the latter file.
Refs. open-mpi/ompi#6247
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(back-ported from commit open-mpi/ompi@950ba16aa1)
remove whitespace around '=' when setting btl_uct_LIBS
Thanks Ake Sandgren for reporting this
Refs. open-mpi/ompi#6173
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@b89deeb1bb)
Use the PRIsize_t macro to correctly print a size_t
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@78aa6fdd1d)
AC_CHECK_DECLS take a comma separated list of macros/symbols,
so replace the whitespace separator with a comma.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@b715dd2657)
Use PMIX_* macros instead of OPAL_* macros
master does things differently, so this is a one-off commit
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Though not a recommended configuration it is possible to use Open MPI
over UCX over uGNI. This configuration had some issues related to the
connection management and tl selection. This commit fixes those
issues.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit e07a64c52d)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
AC_CHECK_DECLS take a comma separated list of macros/symbols,
so replace the whitespace separator with a comma.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit b715dd2657)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Needed to properly read PMIx job data like the following
- `OPAL_PMIX_LOCALLDR`
- `OPAL_PMIX_RANK`
- `OPAL_PMIX_GLOBAL_RANK`
- `OPAL_PMIX_APPLDR`
- `OPAL_PMIX_APP_RANK`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit a557c4130c)
It seems in some cases (gcc older than v6.0.0) the __atomic_thread_fence is a
no-op with __ATOMIC_ACQUIRE. This appears to be the case with X86_64 so go
ahead and use __ATOMIC_SEQ_CST for the x86_64 read memory barrier. This should
not cause any performance issues as it is equivalent to the memory barrier
in the hand-written atomics.
References #6014
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 30119ee339)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Under certain circumstances, ibv_exp_query_device was
returning an error due to uninitialized fields in the
extended attributes struct.
Fixes: #5810Fixes: #5914
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 8126779a35)
Only default to the external component if its version is
greater or equal than the internal libevent (2.0.22)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit b205039205)
- Always use the external component when configure'd with --with-libevent=external
- Fix the external libevent library version detection
by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros
- Use the event2/event.h header (event.h is deprecated since libevent 2.0
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 35e77a286c)
This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing an incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.
At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.
This commit also fixes some bugs in the active-message send path:
- Correctly set all fragment fields in prepare_src.
- Fix bug when using buffered-send. We were not reading the return
code correctly (which is in bytes). This resulted in a message
getting sent multiple times.
- Don't try to progress sends from the btl_send function when in an
active-message callback. It could lead to deep recursion and an
eventual crash if we get a trace like
send->progress->am_complete->ob1_callback->send->am_complete...
Closes#5820Closes#5821
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 707d35deeb)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:
```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
opal_free_list_item_t *item =
(opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);
if (OPAL_UNLIKELY(NULL == item)) {
opal_mutex_lock (&flist->fl_lock);
opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
opal_mutex_unlock (&flist->fl_lock);
item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
}
return item;
}
```
The problem is in a multithreaded environment is *is* possible for the
free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. The happens if
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic other threads pop all the items
added to the free list.
This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.
Fixes#2921
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 5c770a7bec)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 6ed68da870)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the
priority list. This will give better out of the box performance for
multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.
This commit also fixes a number of leaks and a possible deadlock when
using RDMA.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 39be6ec15c)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Ignore with-hwloc=internal or external as those are meaningless to pmix
(will upstream)
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit c498a7e77a)
If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit f379ba9c8e)
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:
_Atomic intptr x, y;
intptr_t z = 0;
x = y = z;
Will produce a compiler error of the form:
operand cannot have void type: op "="
assignment type mismatch:
long "=" void
To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.
Fixes#5814
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit dfa8d3a81a)
On some platfoms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 66a7dc4c72)
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.
Add a more clear error message in debug mode.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 9164e26e2f)
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.
1. On the endpoint, it is storing the moral equivalent of a
(struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).
The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr). Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).
This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module. This only happens with interfaces that have
more than one IP address. This commit does not fix that issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 5dae086f7e)
- used __func__ macro instead of __FUNCTION__ to unify
macro usage with other components
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 9a51e257d1)
Fix the test that determined whether we output "writeable" or
"read-only" for MCA vars (it was checking the wrong flag).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 176da51aec)
The important part of this fix is a couple places 5 was hard-coded that needed to be
strlen(OPAL_INFO_SAVE_PREFIX).
But also this contains a fix for a gcc 7.3.0 compiler warning about snprintf(). There
was an "if" statement making sure all the arguments had appropriate strlen(), but gcc
still complained about the following snprintf() because the size of the struct element
is iterator->ie_key[OPAL_MAX_INFO_KEY + 1].
Signed-off-by: Mark Allen <markalle@us.ibm.com>
libevent does not support multiple threads calling the event loop on
the same event base. This causes external libevent's to print out
re-entrant warning messages. This commit fixes the issue by protecting
the call to the event loop with an atomic swap check.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Remove this component pending re-architecture of the overall OFI
components. We have had similar issues before when multiple components
use the same library - typical issues are race conditions, initialize
and finalize errors, etc. We are seeing similar problems here as we get
broader exposure to different library version and environment
combinations.
The correct fix in the past has been to centralize the library
interactions in a "common" component. We will pursue that here by moving
some additional functions (e.g., endpoint creation) into the existing
opal/mca/common/ofi component. We can't do that and thoroughly test it
in time for the v4.0.0 release, so we'll simply remove this component
from the release.
Once we have things correctly fixed, we'll submit a PR to restore the
component plus the related fixes to some future v4.x release.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Disable async receive for CUDA under OpenIB. While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit. This change has been validated
and approved by Akshay.
References #3972
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 9344afd485)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
KNC is effectively dead. Remove corresponding SCIF
support in Open MPI.
cherry pick of PR #5737
+
news update
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit b9ac3d8931)
Adds device ids of different Broadcom adapters from
BCM57XXX and BCM58XXX family of HCAs.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
(cherry-picked from a53a6f7650)