After the OPAL_MODEX_RECV call, remote_addrs was not freed in the error
path. Moved the free call into cleanup to ensure we always free this
memory before leaving the function.
Signed-off-by: William Zhang <wilzhang@amazon.com>
Avoid printing an error message about ENOTCONN return codes from
getpeername() when handling an incoming connection request. At
this point in the receive state machine, the remote process has
been verified to be a valid OMPI instance. In all-to-all startup
at 4k rank scale, we're seeing this error message when the remote
side drops the connection because it realizes it's the "loser"
in the connection race. We were already doing all the right things,
other than printing a scary error message. So skip the error
message and call it good.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
In 457f058 I broke the TCP BTL with --enable-ipv6. This patch
fixes the compile error, so IPv6 works again.
Fixed#5996
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code. Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure. This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.
Refs #5818
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Today, a btl tcp module is associated with exactly one IP
address (IPv4 or IPv6). There's no need to reserve space
for both an IPv4 and IPv6 address in the module structure,
since the module will only be associated with one or the
other.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads. The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.
This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT. It also
reduces the scalability of the TCP BTL by increasing start-up
time, but better than hanging.
The long term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race. THis patch is a work around until that patch can
be developed.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This reverts commit 6acebc40a194c92ab38a28553c2c8b04eb391820.
This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run. See
https://github.com/open-mpi/ompi/issues/5849 for more information.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.
Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.
Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error. However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined. Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.
Add a more clear error message in debug mode.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.
1. On the endpoint, it is storing the moral equivalent of a
(struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).
The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr). Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).
This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module. This only happens with interfaces that have
more than one IP address. This commit does not fix that issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.
Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
When an error is returned by the socket operations, trigger the
appropriate error path in the PML to give an opportunity for
rerouting/error handling.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
The current cast is *functional*, but isn't really the way it should
be done. This commit makes the cast the way it should be done.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix two facepalms:
1. The "uint32" in the hash map functions refer to the *key* size, not
the *value* size. The values are always 64 bits.
2. Pass the straight value to the "set" functions -- not the pointer
to the value.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The giant size of the TCP proc struct is causing a problem in some
environments (because it is allocated on the stack), and it was too
big, anyway.
Instead, use a hash map. That way, it starts small and can grow if it
needs to. It also makes no assumptions about the values of the kernel
interface indexes.
Fixes#5292.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior). Roll
back these messages to be opal_output_verbose() kinds of messages.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long. Don't copy beyond the end of that source buffer.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Three issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all
connections for a given address being vectored to a single btl
* Multiple addresses in the same subnet caused connections to be
stalled, as the receiver would always use the same (first) address
found. Binding the outgoing connection solves this issue
* Lastly fix race condition created by connections being started at the exact same time
by accpeting connections not in the closed state, allowing endpoint_accept to resolve
dispute
Signed-off-by: Jordan Cherry <cherryj@amazon.com>
Their is racing condition in TCP connection establishment
during simultaneous handshake. This PR handles the fix for
it.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Some OSes have hardcoded limits to prevent overflowing over an int32_t.
We can either detect this at configure (which might be a nicer but
incomplete solution), or always force the pipelined protocol over TCP.
As it only covers data larger than 1GB, no performance penalty is to be
expected.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
as the writev and readv support a sum larger than a uint32_t
this version will work. For the other OSes a different patch
is required. This patch is a slight modification of the one
proposed by @ggouaillardet.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Resolves#3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la"
```
Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
This commit has two changes
1. Adding magic string during handshake can cause
issue when used with older version of MPI. Hence set
RCVTIMEO paramter to 2 second
2. Using single call during handshake instead of
two calls
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improvement towards tcp debugging
we are moving few BTL_ERROR to show_help and also
update the function behaviour of
mca_btl_tcp_endpoint_complete_connect to return
SUCCESS and ERROR cases.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improvement towards handling failure case
in btl tcp we are using magic string to verify mpi
connection. In case if there is mismatch or missing
magic string we can identify that we are trying to
connect with someother process.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Moving non-blocking send/receive function to btl_tcp
will help reusing these function where ever needed.
In this case we plan to reuse receive function to
retrive magic string to validate established connection
is from mpi process.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Based on an idea from Brian move the libevent trigger update to a later
stage instead of the generic add/del procs. So, we are doing the
increment/decrement when we register the recv handler for an endpoint,
so basically when we create and connect a socket to a peer. The benefit
is that as long as TCP is not used, there should be no impact on the
performance of other BTLs. The drawback is that the first TCP connection
will be slightly slower, but then once we have a peer connected over
TCP things go back to normal.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Set the default send and receive socket buffer size to 0,
which means Open MPI will not try to set a buffer size during
startup.
The default behavior since near day one of the TCP BTL has been
to set the send and receive socket buffer sizes to 128 KiB. A
number that works great on 1 GbE, but not so great on 10 GbE
fabrics of any real size. Modern TCP stacks, particularly on
Linux, have gotten much smarter about buffer sizes and are much
less efficient if a buffer size is set (even if set to something
large).
Signed-off-by: Brian Barrett <bbarrett@amazon.com>