Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.
Add a clearer error message in debug mode.
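A minimal sketch of the size selection described above, assuming the address is held in a struct sockaddr_storage. The helper names expected_addr_len and bind_with_family_len are hypothetical and are not taken from the patch:

```c
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Return the address length bind() expects for the family stored in ss,
 * or 0 for a family this sketch does not handle. */
static socklen_t expected_addr_len(const struct sockaddr_storage *ss)
{
    switch (ss->ss_family) {
    case AF_INET:
        return (socklen_t) sizeof(struct sockaddr_in);
    case AF_INET6:
        return (socklen_t) sizeof(struct sockaddr_in6);
    default:
        return 0;
    }
}

/* Hypothetical wrapper: bind sd to the address held in ss, passing the
 * family-specific size rather than sizeof(struct sockaddr_storage). */
static int bind_with_family_len(int sd, const struct sockaddr_storage *ss)
{
    socklen_t len = expected_addr_len(ss);
    if (0 == len) {
        return -1;  /* unsupported family: do not call bind() at all */
    }
    return bind(sd, (const struct sockaddr *) ss, len);
}
```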
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 9164e26e2f)
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.
1. On the endpoint, it is storing the moral equivalent of a
(struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).
The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr). Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).
This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module. This only happens with interfaces that have
more than one IP address. This commit does not fix that issue.
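For illustration, converting a bare (struct sockaddr_in.sin_addr) to a string with inet_ntop() might look like the sketch below. The wrapper name ipv4_to_string is hypothetical; only inet_ntop() itself comes from the message above:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Format a bare IPv4 address (the sin_addr member, not a full sockaddr)
 * into buf. Returns buf on success, NULL on failure. */
static const char *ipv4_to_string(const struct in_addr *addr,
                                  char *buf, socklen_t buflen)
{
    /* inet_ntop() takes a pointer to the in_addr itself, so there is no
     * need to wrap it in a struct sockaddr first. */
    return inet_ntop(AF_INET, addr, buf, buflen);
}
```

A buffer of INET_ADDRSTRLEN bytes is large enough for any IPv4 address.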
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 5dae086f7e)
The current cast is *functional*, but isn't really the way it should
be done. This commit makes the cast the way it should be done.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix two facepalms:
1. The "uint32" in the hash map functions refers to the *key* size, not
the *value* size. The values are always 64 bits.
2. Pass the straight value to the "set" functions -- not the pointer
to the value.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The giant size of the TCP proc struct is causing a problem in some
environments (because it is allocated on the stack), and it was too
big, anyway.
Instead, use a hash map. That way, it starts small and can grow if it
needs to. It also makes no assumptions about the values of the kernel
interface indexes.
Fixes #5292.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior). Roll
back these messages to be opal_output_verbose() kinds of messages.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long. Don't copy beyond the end of that source buffer.
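A sketch of the bounded copy, assuming a 16-byte destination field. The struct addr_entry and the function store_ipv4 are hypothetical stand-ins, not the real TCP BTL types:

```c
#include <string.h>
#include <netinet/in.h>

/* Hypothetical stand-in for the address entry: 16 bytes so that it can
 * also hold an IPv6 address. */
struct addr_entry {
    unsigned char addr_inet[16];
};

static void store_ipv4(struct addr_entry *dst, const struct sockaddr_in *src)
{
    /* Clear the whole field so the unused 12 bytes are well defined. */
    memset(dst->addr_inet, 0, sizeof(dst->addr_inet));
    /* Copy the size of the source (4 bytes), not the size of the
     * destination (16 bytes), so we never read past src->sin_addr. */
    memcpy(dst->addr_inet, &src->sin_addr, sizeof(src->sin_addr));
}
```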
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Three issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all
connections for a given address being vectored to a single btl
* Multiple addresses in the same subnet caused connections to be
stalled, as the receiver would always use the same (first) address
found. Binding the outgoing connection solves this issue
* Lastly, fix a race condition created by connections being started at the exact same time
by accepting connections that are not in the closed state, allowing endpoint_accept to
resolve the dispute
Signed-off-by: Jordan Cherry <cherryj@amazon.com>
There is a race condition in TCP connection establishment
during a simultaneous handshake. This PR fixes it.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Some OSes have hardcoded limits to prevent overflowing over an int32_t.
We can either detect this at configure (which might be a nicer but
incomplete solution), or always force the pipelined protocol over TCP.
As it only covers data larger than 1GB, no performance penalty is to be
expected.
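The general idea of never handing the OS more than a bounded number of bytes at once can be sketched as follows. This is not the pipelined protocol itself, only an illustration of splitting a transfer into bounded fragments. The function names and the choice of fragment limit are assumptions:

```c
#include <stddef.h>

/* Length of the next fragment: the remaining bytes, capped at
 * max_fragment (a hypothetical per-call limit, e.g. 1 GB). */
static size_t next_fragment_len(size_t remaining, size_t max_fragment)
{
    return remaining < max_fragment ? remaining : max_fragment;
}

/* Count how many fragments a transfer of total bytes is split into.
 * Returns 0 if max_fragment is 0, since no progress could be made. */
static size_t count_fragments(size_t total, size_t max_fragment)
{
    size_t n = 0;
    if (0 == max_fragment) {
        return 0;
    }
    while (total > 0) {
        total -= next_fragment_len(total, max_fragment);
        ++n;
    }
    return n;
}
```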
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
as the writev and readv support a sum larger than a uint32_t
this version will work. For the other OSes a different patch
is required. This patch is a slight modification of the one
proposed by @ggouaillardet.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Resolves #3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la
```
Note: The changes in this commit were automated by the
`libadd_mca_comp_update.py` script, which is in the commit that
precedes this one. Some components were not included in this change
because they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
This commit has two changes:
1. Adding the magic string during the handshake can cause
issues when used with older versions of MPI. Hence, set the
RCVTIMEO parameter to 2 seconds.
2. Use a single call during the handshake instead of
two calls.
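A minimal sketch of setting a receive timeout on a socket, assuming the RCVTIMEO parameter above refers to the SO_RCVTIMEO socket option. The helper name set_recv_timeout is hypothetical:

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Set the receive timeout on sd so that a blocking recv() gives up after
 * the given number of seconds. Returns 0 on success, -1 on error. */
static int set_recv_timeout(int sd, long seconds)
{
    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

With this helper, the 2 second value from the message would be applied as set_recv_timeout(sd, 2) before waiting for the peer's handshake data.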
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improving TCP debugging, we are moving a few
BTL_ERROR messages to show_help and also updating
mca_btl_tcp_endpoint_complete_connect to return
SUCCESS and ERROR cases.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improving failure handling in the TCP BTL,
we are using a magic string to verify the MPI
connection. If the magic string is mismatched or
missing, we can identify that we are trying to
connect with some other process.
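The check can be sketched as below. The string value example_magic and the function handshake_magic_ok are hypothetical; the actual magic string and handshake layout used by the TCP BTL are not given in this message:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical magic string, for illustration only. */
static const char example_magic[] = "example-magic";

/* Return 1 if the received handshake bytes start with the expected magic
 * string (including its terminating NUL), 0 otherwise. */
static int handshake_magic_ok(const char *received, size_t received_len)
{
    const size_t expected_len = sizeof(example_magic);
    if (received_len < expected_len) {
        return 0;  /* too short: magic string missing or truncated */
    }
    return 0 == memcmp(received, example_magic, expected_len);
}
```

A peer that is not an MPI process (for example, a port scanner or an HTTP client) would fail this check, and the connection can be closed with a clear diagnostic.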
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Moving the non-blocking send/receive functions to btl_tcp
makes them reusable wherever they are needed.
In this case, we plan to reuse the receive function to
retrieve the magic string and validate that the established
connection is from an MPI process.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Based on an idea from Brian, move the libevent trigger update to a later
stage instead of the generic add/del procs. So, we are doing the
increment/decrement when we register the recv handler for an endpoint,
so basically when we create and connect a socket to a peer. The benefit
is that as long as TCP is not used, there should be no impact on the
performance of other BTLs. The drawback is that the first TCP connection
will be slightly slower, but then once we have a peer connected over
TCP things go back to normal.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Set the default send and receive socket buffer size to 0,
which means Open MPI will not try to set a buffer size during
startup.
The default behavior since near day one of the TCP BTL has been
to set the send and receive socket buffer sizes to 128 KiB, a
number that works great on 1 GbE, but not so great on 10 GbE
fabrics of any real size. Modern TCP stacks, particularly on
Linux, have gotten much smarter about buffer sizes and are much
less efficient if a buffer size is set (even if set to something
large).
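The "0 means do not set" convention can be sketched as follows. The helper name apply_sockbuf is hypothetical, and the real TCP BTL code may be structured differently:

```c
#include <sys/socket.h>

/* Apply a configured socket buffer size. optname is SO_SNDBUF or
 * SO_RCVBUF. A size of 0 (or less) means "leave the kernel default
 * alone", so no setsockopt() call is made and 0 is returned.
 * Otherwise returns the result of setsockopt(). */
static int apply_sockbuf(int sd, int optname, int size)
{
    if (size <= 0) {
        return 0;  /* let the TCP stack autotune the buffer */
    }
    return setsockopt(sd, SOL_SOCKET, optname, &size, sizeof(size));
}
```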
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This is mostly for error cases, where we need to release the
newly created proc. Currently the code deadlocks because the endpoint
lock is held at the release and the lock is not recursive.
Also added some code to print the IP addresses that don't match during
the TCP connection step.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Add verbose output to show all the failed attempts to match the
remote interfaces based on the modex info.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Due to the conversion from ssize_t to int we were losing bytes, and
ended up writing outside the receiver buffer. Similarly on the send,
due to the conversion to a lesser type, we could misinterpret the
end of the fragment.
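A sketch of keeping the full width of the byte count when advancing through a buffer. The function consume is hypothetical; it only illustrates holding the result of a read or write style call in an ssize_t rather than narrowing it to int:

```c
#include <stddef.h>
#include <sys/types.h>

/* Advance *offset by n bytes, where n is the ssize_t result of a
 * recv()/readv()-style call and total is the buffer size.
 * Returns 0 on success, or -1 (leaving *offset unchanged) if n is
 * negative or would move the offset past the end of the buffer. */
static int consume(size_t *offset, size_t total, ssize_t n)
{
    if (n < 0 || (size_t) n > total - *offset) {
        return -1;
    }
    *offset += (size_t) n;  /* no narrowing to int anywhere */
    return 0;
}
```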
Since pthreads are now mandatory, MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD
is always true and hence can be safely removed.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This patch is based on the "RFC: Reenabling the TCP BTL over local
interfaces (when specifically requested)". It removes the hardcoded
exception for the local devices that has been enforced by the
TCP BTL. Instead, we exclude the local interface only via the
exclude MCA (both IPv4 and IPv6 local addresses are already in the
default if_exclude), which is also the behavior currently described in
our README file.
This commit fixes an abort during finalize because pending events were
removed from the list twice.
References #2030
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
We commonly see messages on the users list where a peer has hung up
because it has crashed. Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>