This commit fixes two issues that can occur during a connection:
- Re-entry to connection progress from modex lookup. Added an
additional endpoint state that will keep the code from re-entering
the common endpoint create.
- Fixed a race between a process posting a directed datagram through
a send and a connection being progressed through opal_progress().
The progress code was not obtaining the endpoint lock before
attempting to update the endpoint. To limit the amount of code
changed for 2.0.1 this commit makes the endpoint lock recursive. In
a future update this may be changed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fixes PR https://github.com/open-mpi/ompi/pull/1687
The code that sets OPAL_HAVE_WORKING_EVENTOPS for internal libevent
was executed even if the external libevent component was configured.
As a result, libevent progress was not called in opal_progress, which
for example caused ring_c to hang when pml/ob1 was used.
Architecture is set by the ompi layer *after* job startup. Since the
optimizations in open-mpi/ompi@01a653d50a, the key cannot have the "pmix"
prefix; otherwise the architecture cannot be retrieved.
This commit changes rand(3) and family in libevent to use the internal
random function provided in opal, to avoid perturbing the user's random seed.
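
The idea can be sketched as follows. This is not the opal random
function: private_rng_t and its xorshift step are hypothetical
stand-ins, chosen only to show that a generator with its own state
leaves the sequence the user gets from srand()/rand() untouched.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical library-private generator with its own state. */
typedef struct {
    uint32_t state;
} private_rng_t;

static void private_rng_seed(private_rng_t *rng, uint32_t seed)
{
    /* xorshift state must be non-zero */
    rng->state = (0 != seed) ? seed : 1u;
}

static uint32_t private_rng_next(private_rng_t *rng)
{
    uint32_t x = rng->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng->state = x;
    return x;
}

/* Returns 1 if library draws leave the user's rand() sequence unchanged. */
static int user_sequence_unperturbed(void)
{
    srand(42);
    int expected = rand();

    srand(42);
    private_rng_t rng;
    private_rng_seed(&rng, 12345u);
    (void) private_rng_next(&rng);   /* library-internal draws */
    (void) private_rng_next(&rng);

    return rand() == expected;
}
```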
Fixes open-mpi/ompi#1877
On Cray, PR #1846 introduced a double free
situation which led to all kinds of random memory
corruption problems.
This commit fixes this problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
The max inline send size on a queue pair is not available until after
the endpoint is connected. Before this commit the send flags
(including the inline flag) were set before this value was
initialized. This commit moves setting the send_flags down to
mca_btl_openib_put_internal which is only called after the endpoint is
connected. This fixes a bug when using osc/rdma.
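
A minimal sketch of the ordering issue, assuming a hypothetical
qp_endpoint_t and made-up flag values (these are not the verbs or
openib definitions): the flags are computed at put time, once the
inline limit is known, instead of being cached before the connection
is up.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up flag bits for illustration only. */
#define FLAG_SIGNALED 0x1u
#define FLAG_INLINE   0x2u

/* Hypothetical endpoint: max_inline is only valid once connected. */
typedef struct {
    int connected;
    size_t max_inline;
} qp_endpoint_t;

/* Compute the send flags when the put is issued, not earlier. */
static unsigned int put_send_flags(const qp_endpoint_t *ep, size_t size)
{
    unsigned int flags = FLAG_SIGNALED;
    if (ep->connected && size <= ep->max_inline) {
        flags |= FLAG_INLINE;
    }
    return flags;
}

/* Flags for a put of the given size on a connected endpoint (limit 64). */
static unsigned int demo_connected_flags(size_t size)
{
    qp_endpoint_t ep = { 1, 64 };
    return put_send_flags(&ep, size);
}

/* Flags before the connection completes: the limit is not known yet. */
static unsigned int demo_unconnected_flags(size_t size)
{
    qp_endpoint_t ep = { 0, 0 };
    return put_send_flags(&ep, size);
}
```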
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug in the pmix2x client code where a loop
variable is not correctly incremented. This was leading to hangs and
crashes when creating intercommunicators. Also fixed two double
increments in other loops.
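
For illustration only (this is not the pmix2x code; node_t and
copy_values are hypothetical), the pattern at stake is an output index
that must advance exactly once per stored element. Skipping the
increment overwrites slot 0 forever, and incrementing twice leaves
holes and can run past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node. */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Copy up to max values from the list into out; returns the count. */
static size_t copy_values(const node_t *head, int *out, size_t max)
{
    size_t n = 0;
    for (const node_t *cur = head; NULL != cur && n < max; cur = cur->next) {
        out[n] = cur->value;
        ++n;    /* exactly one increment per stored element */
    }
    return n;
}

/* Returns 1 if a three-element list is copied in order. */
static int demo_copy(void)
{
    node_t c = { 3, NULL };
    node_t b = { 2, &c };
    node_t a = { 1, &b };
    int out[3] = { 0, 0, 0 };
    size_t n = copy_values(&a, out, 3);
    return 3 == n && 1 == out[0] && 2 == out[1] && 3 == out[2];
}
```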
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
A blocking fence is used in yalla del proc. Native pmix exposes this
functionality; we need to expose it for SLURM's s1/s2 components as well.
This commit also fixes an uninitialized `rc` in the fencenb functions of
both components.
Thanks Jeff for the guidance
Fixes open-mpi/ompi#1683
Note:
in order to keep this commit easy to review, some AS_IF([...]) were replaced with
AS_IF([false], ...) or AS_IF([true], ...)
these will be removed and re-indented in a subsequent commit
pmix cannot be built on Alpine Linux because of some missing includes.
uid_t and gid_t are defined in unistd.h or sys/types.h, and unistd.h
is not pulled in indirectly on Alpine Linux, so include it explicitly.
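
A minimal sketch of the explicit includes (proc_ids_t and current_ids
are hypothetical names, not pmix code): any file that uses uid_t or
gid_t includes the defining headers itself rather than relying on
another header to pull them in.

```c
#include <assert.h>
#include <sys/types.h>  /* uid_t, gid_t */
#include <unistd.h>     /* getuid(), getgid() */

/* Hypothetical struct holding the ids of the calling process. */
typedef struct {
    uid_t uid;
    gid_t gid;
} proc_ids_t;

static proc_ids_t current_ids(void)
{
    proc_ids_t ids;
    ids.uid = getuid();
    ids.gid = getgid();
    return ids;
}
```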
Thanks N.L.K Nguyen for the report
(back-ported from upstream pmix/master@c8d55350a9)
This commit fixes a long standing bug in rdmacm. It is required that
the thread that calls mca_btl_openib_endpoint_cpc_complete holds the
endpoint lock. This was not the case for rdmacm. This causes debug
builds to abort. This change also required changing
mca_btl_openib_endpoint_send_cts to require that the endpoint lock be
held when it is called.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit is an attempt to fix a hang in finalize of rdmacm. This fixes
a path where no rdmacm client is found for an endpoint.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a segmentation fault that occurs if a device can be
initialized but not used. In this case the devices_count is not equal
to the number of usable devices in the devices pointer array.
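
One way to picture the mismatch, as a sketch under assumptions (device_t
and compact_devices are hypothetical, and this is not necessarily how
the fix is implemented): if some slots of the pointer array are empty,
code that walks the array up to the raw count dereferences NULL.
Compacting the array and using the returned count keeps the two in
agreement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device record. */
typedef struct {
    int id;
} device_t;

/* Move usable (non-NULL) entries to the front, clear the tail, and
 * return how many usable devices remain. */
static size_t compact_devices(device_t **devices, size_t count)
{
    size_t usable = 0;
    for (size_t i = 0; i < count; ++i) {
        if (NULL != devices[i]) {
            devices[usable++] = devices[i];
        }
    }
    for (size_t i = usable; i < count; ++i) {
        devices[i] = NULL;
    }
    return usable;
}

/* Returns 1 if an array with one unusable slot compacts to two devices. */
static int demo_compact(void)
{
    device_t d0 = { 0 };
    device_t d1 = { 1 };
    device_t *devs[3] = { &d0, NULL, &d1 };
    size_t n = compact_devices(devs, 3);
    return 2 == n && &d0 == devs[0] && &d1 == devs[1] && NULL == devs[2];
}
```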
Thanks to @artpol84 for tracking this down.
Fixes open-mpi/ompi#1823
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>