The request code was setting the request as pml_complete before
calling MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE. This was causing
MCA_PML_OB1_SEND_REQUEST_RETURN to be called twice in some cases. The
code now mirrors the recvreq code and only sets the request as pml
complete if the request has not already been freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
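The completion ordering described in the commit above can be sketched with a toy model. This is not the ob1 code: every type and function name below is hypothetical, and the model is single-threaded, so it only illustrates the guard (return the request if it was already freed, otherwise mark it pml complete and then MPI complete), not the race itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a PML send request; every name here is hypothetical. */
typedef struct {
    bool free_called;   /* the user has already freed the request */
    bool pml_complete;  /* the PML no longer needs the request */
    bool mpi_complete;  /* completion is visible to the MPI layer */
    int  return_count;  /* times the request went back to the free list */
} toy_send_request_t;

/* Models returning the request to its free list; must happen exactly once. */
static void toy_request_return(toy_send_request_t *req)
{
    assert(0 == req->return_count);
    req->return_count++;
}

/* Models the user freeing the request: return it now only if the PML
 * has already finished with it. */
static void toy_request_free(toy_send_request_t *req)
{
    req->free_called = true;
    if (req->pml_complete) {
        toy_request_return(req);
    }
}

/* Models PML completion with the guard: a request that was already freed
 * is returned directly and is never marked pml complete. */
static void toy_request_pml_complete(toy_send_request_t *req)
{
    if (req->free_called) {
        toy_request_return(req);
    } else {
        req->pml_complete = true;
        req->mpi_complete = true;
    }
}
```

In either ordering (free before completion, or completion before free) the toy request is returned exactly once.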
This fixes https://github.com/open-mpi/ompi/issues/1732: i.e., the
case where the outer project has its own check for
<valgrind/valgrind.h>, but also supplements CPPFLAGS (to find
Valgrind's header files) before doing that check.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Ideally, we would tell OMPI to disable autoconf's caching of our
valgrind check result so that its check gets the right result after
adding CPPFLAGS. Not sure if we can do that.
For now, just disable our Valgrind code in embedded mode.
This will keep the x86 backend enabled under Valgrind but
it will auto-disable itself when finding identical APIC ids anyway
(because CPUID returns same outputs for all PUs).
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Fixes open-mpi/ompi#1732
(cherry picked from commit open-mpi/hwloc@8b44fb1c81)
Add the following to zsh shell completion:
* --get-stack-traces
* --report-state-on-timeout
* --timeout
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Add descriptions for the new --report-state-on-timeout and
--get-stack-traces options.
Also add --timeout, and cross-reference MPIEXEC_TIMEOUT with it.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Remove all @open-mpi-git-mirror entries; those are no longer necessary
since the official migration to Git/GitHub.
Add aliases for @users.noreply.github.com addresses.
Add fixes for what look like accidental name misspellings /
common-name-isms.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes a programming error when using an Aries NIC. The
documentation of uGNI shows that only the local alignment restriction
for get was lifted on aries. There is still a remote address alignment
restriction.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
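A minimal sketch of the alignment rule described in the commit above. The 4-byte alignment value and all names are assumptions made for illustration; they are not taken from the uGNI headers or the ugni btl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed alignment requirement for get operations (hypothetical value). */
#define TOY_GET_ALIGNMENT 4u

/* The mask trick below only works for power-of-two alignments. */
static_assert((TOY_GET_ALIGNMENT & (TOY_GET_ALIGNMENT - 1u)) == 0,
              "alignment must be a power of two");

static bool toy_is_aligned(uint64_t address)
{
    return 0 == (address & (TOY_GET_ALIGNMENT - 1u));
}

/* Returns true if a get may be issued directly. The remote address must
 * always be aligned; the local address must be aligned only when the
 * local restriction has not been lifted (is_aries == false). */
static bool toy_get_allowed(uint64_t local_address, uint64_t remote_address,
                            bool is_aries)
{
    if (!toy_is_aligned(remote_address)) {
        return false;
    }
    return is_aries || toy_is_aligned(local_address);
}
```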
Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.
If requested, obtain stack traces for each application process and report them to stderr upon timeout
stack traces: minor improvements
- Also include the hostname and PID of each process for which
we're sending the stack traces (vs. just including the ORTE process
name)
- Send a specific error message if we couldn't find "gstack" in the
$PATH (e.g., on OS X)
- Send a specific error message if gstack fails to run
- Print a message that obtaining the stack traces may take a few
seconds so that users don't wonder what's happening
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
help-orterun.txt: minor tweaks
Trivial update: show "--timeout" (instead of "-timeout") in the help
message, just to encourage the use of double-dash options.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
trivial: stacktrace -> stack trace
Trivial wordsmithing.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes two bugs in MPI_Waitany:
- If all requests are inactive then the sync wait would hang forever
because no requests are attached to the sync.
- The request pointer was pointing to the request before the completed
request which caused the wrong request to be freed or marked inactive.
MPI_Waitsome had a similar issue if all the requests were pending.
These issues were identified by MTT.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
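The two fixes in the commit above can be illustrated with a toy scan over request states. This is a sketch, not the Open MPI implementation: the state enum, return codes, and function name are hypothetical. It shows returning immediately when every request is inactive (rather than waiting on a sync with nothing attached), and deactivating the request at the completed index rather than the one before it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request states for the toy model. */
typedef enum { TOY_INACTIVE, TOY_PENDING, TOY_COMPLETE } toy_state_t;

#define TOY_UNDEFINED (-1)  /* plays the role of MPI_UNDEFINED */
#define TOY_MUST_WAIT (-2)  /* some request is pending; caller would block */

/* Scan once. Returns the index of a completed request (and marks that same
 * request inactive), TOY_UNDEFINED if all requests are inactive, or
 * TOY_MUST_WAIT if waiting is actually required. */
static int toy_waitany_scan(toy_state_t *requests, size_t count)
{
    size_t inactive = 0;

    assert(NULL != requests || 0 == count);

    for (size_t i = 0; i < count; ++i) {
        if (TOY_COMPLETE == requests[i]) {
            requests[i] = TOY_INACTIVE;  /* index i itself, not i - 1 */
            return (int) i;
        }
        if (TOY_INACTIVE == requests[i]) {
            ++inactive;
        }
    }

    /* Nothing to wait for: return right away instead of hanging. */
    return (inactive == count) ? TOY_UNDEFINED : TOY_MUST_WAIT;
}
```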
This commit adds support for Cray Aries atomic operations. This
includes 32-bit and floating point support.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for more atomic operations and types. The
operations added are logical and, logical or, logical xor, swap, min,
and max. New types are 32-bit int by using the
MCA_BTL_ATOMIC_FLAG_32BIT flag, 64-bit float by using the
MCA_BTL_ATOMIC_FLAG_FLOAT flag, and 32-bit float by using both
flags. Floating point numbers are supported by packing the number in
as an int64_t or int32_t. We will update the btl interface in the
future to make this less confusing.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
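The packing scheme described in the commit above (carrying a floating point operand in an integer of the same width) might look roughly like the following. The helper names are hypothetical and are not part of the btl interface; the sketch assumes IEEE 754 float and double, and uses memcpy so the bits are copied without violating aliasing rules.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The packing only makes sense if the widths match. */
static_assert(sizeof(double) == sizeof(int64_t), "double must be 64 bits");
static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");

/* Copy the bit pattern of a 64-bit float into an int64_t operand. */
static int64_t toy_pack_double(double value)
{
    int64_t bits;
    memcpy(&bits, &value, sizeof(bits));
    return bits;
}

/* Recover the 64-bit float from a packed int64_t operand. */
static double toy_unpack_double(int64_t bits)
{
    double value;
    memcpy(&value, &bits, sizeof(value));
    return value;
}

/* Copy the bit pattern of a 32-bit float into an int32_t operand. */
static int32_t toy_pack_float(float value)
{
    int32_t bits;
    memcpy(&bits, &value, sizeof(bits));
    return bits;
}

/* Recover the 32-bit float from a packed int32_t operand. */
static float toy_unpack_float(int32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof(value));
    return value;
}
```

A caller would presumably pass the packed value together with MCA_BTL_ATOMIC_FLAG_FLOAT (plus MCA_BTL_ATOMIC_FLAG_32BIT for a 32-bit float) so the receiving side knows how to interpret the bits.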
Before dynamic add_procs support was committed to master we called
add_procs with every proc in the job. The XRC code in the openib btl
was taking advantage of this and setting the number of work queue
entries (WQE) based on all the procs on a remote node. Since that is
no longer the case we cannot simply increment the sd_wqe field on the
queue pair. To fix the issue, a new field has been added to the xrc
queue pair structure to keep track of how many wqes there are total on
the queue pair. If a new endpoint is added that increases the number
of wqes and the xrc queue pair is already connected, the code will
attempt to modify the number of wqes on the queue pair. A failure is
ignored because all that will happen is the number of active send work
requests on an XRC queue pair will be more limited.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
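A rough sketch of the accounting described in the commit above. The struct, field, and function names are hypothetical and do not match the openib btl; the resize callback stands in for whatever call modifies the queue pair. The point illustrated is that the running total is always updated, while a failed resize on an already connected queue pair is simply ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy XRC queue pair; all names are illustrative only. */
typedef struct {
    bool connected;
    int  total_wqe;   /* wqes requested across all endpoints so far */
    int  active_wqe;  /* wqes the connected queue pair actually provides */
} toy_xrc_qp_t;

/* Stand-in for the call that resizes a connected queue pair. */
typedef bool (*toy_resize_fn)(int new_total);

static bool toy_resize_ok(int new_total)   { (void) new_total; return true; }
static bool toy_resize_fail(int new_total) { (void) new_total; return false; }

/* On connect, the queue pair is sized from the running total. */
static void toy_xrc_connect(toy_xrc_qp_t *qp)
{
    qp->connected = true;
    qp->active_wqe = qp->total_wqe;
}

/* Adding an endpoint always grows the total. If the queue pair is already
 * connected, try to resize it; a failure is deliberately ignored, which
 * only leaves fewer active send work requests available. */
static void toy_xrc_add_endpoint(toy_xrc_qp_t *qp, int endpoint_wqe,
                                 toy_resize_fn resize)
{
    assert(endpoint_wqe >= 0);
    qp->total_wqe += endpoint_wqe;
    if (qp->connected && resize(qp->total_wqe)) {
        qp->active_wqe = qp->total_wqe;
    }
}
```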
The original VADER_MAX_ADDRESS was tuned for x86_64 platforms only.
For non-x86_64 platforms we can use XPMEM_MAXADDR_SIZE.
Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>