This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing an incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.
At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.
This commit also fixes some bugs in the active-message send path:
- Correctly set all fragment fields in prepare_src.
- Fix bug when using buffered-send. We were not reading the return
code correctly (which is in bytes). This resulted in a message
getting sent multiple times.
- Don't try to progress sends from the btl_send function when in an
active-message callback. It could lead to deep recursion and an
eventual crash if we get a trace like
send->progress->am_complete->ob1_callback->send->am_complete...
Closes#5820Closes#5821
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Correctly aggregate slots across -H entries from each app. Take into
account any -H entry when computing nprocs when no value was given.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
While we require C99 to build Open MPI, we do not require C99 to build
user MPI applications. As such, we shouldn't have C99-style comments
(i.e., "//"-style) in mpi.h.in.
Thanks to @AdamSimpson for reporting the issue.
This commit simply converts a //-style comment to a /**/-style
comment. No code or logic changes.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Make sure all pending communications are done on all ranks before
closing the window. This way it will be safe to close the endpoints when
closing the component.
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.
Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.
Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the
priority list. This will give better out of the box performance for
multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.
This commit also fixes a number of leaks and a possible deadlock when
using RDMA.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
OpenJDK 11 changed the default javadoc output HTML version to HTML 5
from HTML 4.01. It causes an error on building Open MPI configured
with `--enable-mpi-java` (default: disable). This fix is compatible
with older OpenJDK.
I don't know whether this problem exists with other vender's JDKs.
But this fix should be compatible with other JDKs because the new
syntax is used in other places in the same file.
Thanks to Siegmar Gross for the bug report.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error. However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined. Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Open MPI's developers like to assume that asprintf() always sets
the ptr to NULL on error, but the standard (and Linux glibc) do
not guarantee this. As a result, we're making opal_asprintf()
always available for developers, which will guarantee that
ptr is set to NULL on error.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Do not reorder the available host list as this causes the head node process assignment to differ from those computed on the other nodes
Signed-off-by: Ralph H Castain <rhc@open-mpi.org>