Blocking fence is used in yalla del proc. Native pmix exposes this functionality.
We need to expose it for SLURM's s1/s2 components as well.
Also this commit fixes uninitialized `rc` in fencenb's of both
components.
Thanks Jeff for the guidance
Fixesopen-mpi/ompi#1683
note:
in order to keep this commit easy to review, some AS_IF([...]) were replaced with
AS_IF([false], ...) or AS_IF_([true], ...)
these will be removed and re-idented in a subsequent commit
pmix cannot be built on alpine linux because of some missing includes.
uid_t and gid_t are defined in unistd.h or sys/types.h, and unistd.h
is not indirectly pulled under alpine linux, so do it manually.
Thanks N.L.K Nguyen for the report
(back-ported from upstream pmix/master@c8d55350a9)
This commit fixes a long standing bug in rdmacm. It is required that
the thread that calls mca_btl_openib_endpoint_cpc_complete holds the
endpoint lock. This was not the case for rdmacm. This causes debug
builds to abort. This change also required changing
mca_btl_openib_endpoint_send_cts to require the endpoint lock to be
held when calling.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit is an attempt to fix a hang in finalize of rdmacm. This fixes
a path where no rdmacm client is found for an endpoint.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a segmentation fault that occurs if a device can be
initialized but not used. In this case the devices_count is not equal
to the number of usable devices in the devices pointer array.
Thanks to @artpol84 for tracking this down.
Fixesopen-mpi/ompi#1823
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a race condition discovered by @artpol84. The race
happens when a signalling thread decrements the sync count to 0 then
goes to sleep. If the waiting thread runs and detects the count == 0
before going to sleep on the condition variable it will destroy the
condition variable while the signalling thread is potentially still
processing the completion. The fix is to add a non-atomic member to
the sync structure that indicates another process is handling
completion. Since the member will only be set to false by the
initiating thread and the completing thread the variable does not need
to be protected. When destoying a condition variable the waiting
thread needs to wait until the singalling thread is finished.
Thanks to @artpol84 for tracking this down.
Fixes#1813
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds another check to the low-priority callback
conditional that short-circuits the atomic-add if there are no
low-priority callbacks. This should improve performance in the common
case.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The OPAL_ENABLE_MULTI_THREADS macro is always defined as 1. This was
causing us to always use the multi-thread path for synchronization
objects. The code has been updated to use the opal_using_threads()
function. When MPI_THREAD_MULTIPLE support is disabled at build time
(2.x only) this function is a macro evaluating to false so the
compiler will optimize out the MT-path in this case. The
OPAL_ATOMIC_ADD_32 macro has been removed and replaced by the existing
OPAL_THREAD_ADD32 macro.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The way the gni btl is currently coded,
it will run completely out of gas on KNL at
123 processes/node. Since there are bound to be
those who try to run a MPI process/hyperthread
on KNL nodes, the fma sharing mode needs to be requested.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
This commit fixes a compile error on 32-bit platforms. The
low-priority call counter was always using 64-bit atomics which will
not work if 64-bit atomic math is not available. Updated to use 32-bit
instead.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
Per discussion on https://github.com/open-mpi/ompi/pull/1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default. Specifically: unless
the environment variables `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.
This may be a bit *surprising*, but is not a *problem*, per se. The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).
This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale). As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.
Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers. This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).
This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present. Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).
This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):
* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1
If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>