openmpi

Author	SHA1	Message	Date
Geoff Paulsen	593d652077	Merge pull request #5823 from jsquyres/pr/v4.0.x/fix-tcp-btl-show-help-ip-address v4.0.x: btl/tcp: output the IP address correctly	2018-10-03 08:29:37 -05:00
George Bosilca	b63bee5da4	Small pedantic fixes. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit `a3a492b42c`)	2018-10-02 14:37:31 -04:00
George Bosilca	d450e460d6	Provide the correct socklen to bind. Get Brian's patch from #5825 and his log message: Fix a failure in binding the initiating side of a connection on MacOS. MacOS doesn't like passing the size of the storage structure (sockaddr_storage) instead of the expected size of the structure (sockaddr_in or sockaddr_in6), which was causing bind() failures. This patch simply changes the structure size to the expected size. Add a more clear error message in debug mode. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit `9164e26e2f`)	2018-10-02 14:37:31 -04:00
Jeff Squyres	81f2f19398	btl/tcp: output the IP address correctly Per https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673, it looks like the IP address for a given interface is being stashed in two places: on the endpoint and on the module. 1. On the endpoint, it is storing the moral equivalent of a (struct sockaddr_in.sin_addr). 2. On the module, it is storing a full (struct sockaddr_storage). The call to opal_net_get_hostname() expects a full (struct sockaddr*) -- not just the stripped-down (struct sockaddr_in.sin_addr). Hence, when the original code was passing in the endpoint's (struct sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it like a (struct sockaddr), hilarity ensued (i.e., we got the wrong output). This commit eliminates the call to opal_net_get_hostname() and just calls inet_ntop() directly to convert the (struct sockaddr_in.sin_addr) to a string. NOTE: Per the github comment cited above, there can be a disparity between the IP address cached on the endpoint vs. the IP address cached on the module. This only happens with interfaces that have more than one IP address. This commit does not fix that issue. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `5dae086f7e`)	2018-10-02 10:49:51 -04:00
Jeff Squyres	2e37f97a38	Miscellaneous compiler warning stomps. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `fe0852bcb4`)	2018-09-21 14:35:51 -05:00
Jeff Squyres	7b0dd03e92	tcp/btl: fix a cast The current cast is functional, but isn't really the way it should be done. This commit makes the cast the way it should be done. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-29 07:25:46 -07:00
Jeff Squyres	57bc657e7f	btl/tcp: fix hash map usage Fix two facepalms: 1. The "uint32" in the hash map functions refer to the key size, not the value size. The values are always 64 bits. 2. Pass the straight value to the "set" functions -- not the pointer to the value. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-28 15:29:41 -07:00
Ralph Castain	0ddbc75ce5	Merge pull request #4930 from kizill/fix-ipv6 fixed ipv6 OOB connection problems (fix issue #1585)	2018-06-26 09:13:53 -07:00
Jeff Squyres	3767ce27c0	btl/tcp: trivial whitespace clean No code/logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 08:04:12 -07:00
Jeff Squyres	9034717876	btl/tcp: use a hash map for kernel IP interface indexes The giant size of the TCP proc struct is causing a problem in some environments (because it is allocated on the stack), and it was too big, anyway. Instead, use a hash map. That way, it starts small and can grow if it needs to. It also makes no assumptions about the values of the kernel interface indexes. Fixes #5292. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 08:03:30 -07:00
George Bosilca	6ff11267fb	Remove warnings identified by clang. Plus minor spacing and indentation issues. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-04-14 17:14:12 -04:00
Jeff Squyres	f200b866df	btl/tcp: roll back parts of `40afd525f8` Some of the show_help() messages that were added in `40afd525f8` were really normal / expected behavior (e.g., if 2 peers connect in the TCP BTL more-or-less simultaneously, one of them will drop the connection -- no need to show_help() about this; it's expected behavior). Roll back these messages to be opal_output_verbose() kinds of messages. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-04-07 12:28:10 -07:00
Jeff Squyres	8c419294a8	btl/tcp: fix CID 710596 sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses), but the memory that we are copying from (my_ss->sin_addr) is only 4 bytes long. Don't copy beyond the end of that source buffer. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-26 14:21:22 -07:00
Jeff Squyres	a17f4afdc7	btl/tcp: fix CID 1416634 Fix resource leak in the TCP BTL. Also add a little defensive programming. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-26 14:21:21 -07:00
Jeff Squyres	40afd525f8	btl/tcp: make error messages more specific Convert some verbose messages to opal_show_help() messages. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-21 19:34:03 -07:00
Stanislav Kirillov	c2bfca19ba	fix ipv6 missing copypaste Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>	2018-03-21 02:25:04 +03:00
Jordan Cherry	d7e7e3acb7	tcp btl: Fix multiple-link connection establishment. Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers. Three issues were resulting in hangs during large message transfer: * The 2nd..btl_tcp_link connections were dropped during establishment because the per-process address check was binary, rather than a count * The accept handler would not skip a btl module that was already in use, resulting in all connections for a given address being vectored to a single btl * Multiple addresses in the same subnet caused connections to be stalled, as the receiver would always use the same (first) address found. Binding the outgoing connection solves this issue * Lastly fix race condition created by connections being started at the exact same time by accepting connections not in the closed state, allowing endpoint_accept to resolve dispute Signed-off-by: Jordan Cherry <cherryj@amazon.com>	2018-02-27 16:36:44 +00:00
Mohan Gandhi	6d642e8d94	Btl tcp: Fix racing condition on simultaneous handshake There is a race condition in TCP connection establishment during simultaneous handshake. This PR handles the fix for it. Signed-off-by: Mohan Gandhi <mohgan@amazon.com>	2017-10-03 13:13:43 -07:00
bosilca	ab68aced23	Merge pull request #3738 from bosilca/topic/tcp_event_count Fix the TCP performance impact when BTL not used	2017-09-19 23:08:58 -04:00
George Bosilca	d10522a01c	Set a hard limit on the TCP max fragment size. Some OSes have hardcoded limits to prevent overflowing over an int32_t. We can either detect this at configure (which might be a nicer but incomplete solution), or always force the pipelined protocol over TCP. As it only covers data larger than 1GB, no performance penalty is to be expected. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-09-01 18:52:48 -04:00
George Bosilca	c340da2586	A first cut at the large data problem with TCP. As long as the writev and readv support a sum larger than a uint32_t this version will work. For the other OSes a different patch is required. This patch is a slight modification of the one proposed by @ggouaillardet. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-09-01 18:52:48 -04:00
Josh Hursey	ad87aa2674	Merge pull request #4121 from jjhursey/explore/dlopen-local mca: Dynamic components link against project lib	2017-08-25 13:15:51 -05:00
Joshua Hursey	e1d079544b	mca: Dynamic components link against project lib * Resolves #3705 * Components should link against the project level library to better support `dlopen` with `RTLD_LOCAL`. * Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am` with the appropriate project level library: ``` MCA components in ompi/ $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la MCA components in orte/ $(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la MCA components in opal/ $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la MCA components in oshmem/ $(top_builddir)/oshmem/liboshmem.la ``` Note: The changes in this commit were automated by the `libadd_mca_comp_update.py` script from the commit that precedes it. Some components were not included in this change because they are statically built only. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-08-24 11:56:16 -04:00
George Bosilca	50f471e31e	Cleanup a set of warnings reported by Ralph. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-08-22 23:00:18 -04:00
Mohan	fc32ae401e	Btl Tcp: Updated tcp handshake methods This commit has two changes 1. Adding a magic string during handshake can cause issues when used with older versions of MPI. Hence set the RCVTIMEO parameter to 2 seconds 2. Using a single call during handshake instead of two calls Signed-off-by: Mohan Gandhi <mohgan@amazon.com>	2017-08-18 10:06:52 -07:00
Mohan	e3dfe11da9	Btl tcp: Improving verbose around tcp As part of improvement towards tcp btl we are improving verbose in general Signed-off-by: Mohan Gandhi <mohgan@amazon.com>	2017-08-17 17:22:16 -07:00
Mohan	4bc7b214dc	Btl tcp: Improving verbose around IPV6 As part of improvement around tcp btl debugging & verbose, we are improving verbose around IPV6 Signed-off-by: Mohan Gandhi <mohgan@amazon.com>	2017-08-17 16:45:14 -07:00
Mohan	0741fad479	Btl tcp: BTL_ERROR to show_help & update func behaviour As part of improvement towards tcp debugging we are moving few BTL_ERROR to show_help and also update the function behaviour of mca_btl_tcp_endpoint_complete_connect to return SUCCESS and ERROR cases. Signed-off-by: Mohan Gandhi <mohgan@amazon.com>	2017-08-17 16:45:14 -07:00
Mohan	368f9f0dfc	Btl tcp: Using magic string to verify mpi connection As part of improvement towards handling failure cases in btl tcp we are using a magic string to verify the mpi connection. If there is a mismatched or missing magic string we can identify that we are trying to connect with some other process. Signed-off-by: Mohan Gandhi <mohgan@amazon.com>	2017-08-17 16:45:13 -07:00
Mohan	c30a42917c	Btl tcp: Refactoring non-blocking send/receive function Moving the non-blocking send/receive functions to btl_tcp will help reuse these functions wherever needed. In this case we plan to reuse the receive function to retrieve the magic string to validate that the established connection is from an mpi process. Signed-off-by: Mohan Gandhi <mohgan@amazon.com>	2017-08-17 16:45:13 -07:00
Gilles Gouaillardet	32606ad476	btl/tcp: fix heterogeneous support for put / large messages Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-07-12 10:27:45 +09:00
George Bosilca	bd5650d680	Fix the TCP performance impact when not used Based on an idea from Brian move the libevent trigger update to a later stage instead of the generic add/del procs. So, we are doing the increment/decrement when we register the recv handler for an endpoint, so basically when we create and connect a socket to a peer. The benefit is that as long as TCP is not used, there should be no impact on the performance of other BTLs. The drawback is that the first TCP connection will be slightly slower, but then once we have a peer connected over TCP things go back to normal. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-06-23 11:15:45 +02:00
Ralph Castain	a737d0f963	Merge pull request #3430 from bosilca/topic/tcp_hostname Use the OPAL function to get the hostname.	2017-05-03 06:42:02 -07:00
Brian Barrett	3b991498be	btl tcp: Don't set socket buffer size by default Set the default send and receive socket buffer size to 0, which means Open MPI will not try to set a buffer size during startup. The default behavior since near day one of the TCP BTL has been to set the send and receive socket buffer sizes to 128 KiB. A number that works great on 1 GbE, but not so great on 10 GbE fabrics of any real size. Modern TCP stacks, particularly on Linux, have gotten much smarter about buffer sizes and are much less efficient if a buffer size is set (even if set to something large). Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-04-28 14:14:49 -07:00
George Bosilca	2d8943d920	Use the OPAL function to get the hostname.	2017-04-28 02:48:15 -04:00
Ralph Castain	470452cba0	Correctly check the sa_family and cast the data correctly before passing it to inet_ntop, and don't be quite as fancy with the pointer arithmetic as the combination was causing us to segfault every time this debug message was called. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 11:42:57 -07:00
Jeff Squyres	5b484c91f4	btl/tcp: use show_help to print the dropped-TCP warning Make the message more friendly / more detailed, and de-duplicate it (just in case it happens a lot). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-03-01 16:31:29 -08:00
George Bosilca	ec4a235e6a	Allow a TCP proc release during the create. This is mostly for error cases, where we need to release the newly created proc. Currently the code deadlocks because the endpoint lock is held at the release and the lock is not recursive. Also added some code to print the IP addresses that don't match during the TCP connection step. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-03-01 13:17:54 -05:00
George Bosilca	bc2890ed11	Upon a new connection go over all available ifaces. Add a verbose to show all the failed attempts to match the remote interfaces based on the modex info. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-02-07 19:15:49 -05:00
Gilles Gouaillardet	c62498ab3d	btl/tcp: remove reference to just removed tcp_local Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-07 09:32:09 +09:00
Jeff Squyres	368ab4d9a5	Merge pull request #2684 from bosilca/topic/tcp_fixes Remove the tcp_local field from the TCP component.	2017-02-06 16:32:06 -05:00
George Bosilca	999d4973a9	Fix an issue with extremely large data identified by tjb900. Due to the conversion from ssize_t to int we were losing bytes, and ended up writing outside the receiver buffer. Similarly on the send, due to the conversion to a lesser type, we could misinterpret the end of the fragment.	2017-01-18 10:33:12 -05:00
George Bosilca	cfeeecd381	Remove the tcp_local field from the TCP component. Instead use the OPAL process name to get the name of the local process. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-01-07 13:24:18 -05:00
Gilles Gouaillardet	7e5da7382e	btl/tcp: plug leaks when closing component remove tcp_local from the tcp_procs table, and release it Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:59 +09:00
George Bosilca	d0dddef53d	Protect the tcp_endpoints list from concurrent accesses. Thanks Gilles for your help. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2016-11-11 00:06:03 -05:00
Gilles Gouaillardet	a49422fe84	btl/tcp: get rid of the MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD macro since pthreads are now mandatory, the MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD is always true and hence can be safely removed Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-08 14:00:05 +09:00
George Bosilca	93fa94f96f	Re-enable support for local addresses. This patch is based on the "RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)". It removes the hardcoded exception for the local devices that has been enforced by the TCP BTL. Instead, we exclude the local interface only via the exclude MCA (both IPv4 and IPv6 local addresses are already in the default if_exclude), which is also the behavior currently described in our README file.	2016-09-23 13:04:33 -04:00
Nathan Hjelm	a681837ba8	btl/tcp: fix double list remove This commit fixes an abort during finalize because pending events were removed from the list twice. References #2030 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-09-13 09:23:12 -06:00
Jeff Squyres	527efec4fb	Merge pull request #2050 from jsquyres/pr/btl-tcp-help-messages Add a show_help message to TCP BTL when peer unexpectedly disconnects	2016-09-06 09:40:31 -04:00
Jeff Squyres	1953e3406f	btl/tcp: add show_help message when peer hangs up We commonly see messages on the users list where a peer has hung up because it has crashed. Instead of having just a BTL_ERROR message, make this a real opal_show_help() message that tells the user that the peer unexpectedly hung up, and they should look into why that peer hung up. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-09-06 09:40:03 -04:00
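
Commit d450e460d6 in the log above describes bind() failing on MacOS when it is passed the size of the storage structure (sockaddr_storage) rather than the size of the concrete address structure (sockaddr_in or sockaddr_in6). The idea can be sketched roughly as follows. This is an illustration only, not the actual patch: the helper name `sockaddr_len_for_family` is hypothetical and does not come from the Open MPI source.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper (not from Open MPI): pick the address length that
 * bind() expects, based on the family stored in a sockaddr_storage.
 * Returns 0 for families this sketch does not handle. */
static socklen_t sockaddr_len_for_family(const struct sockaddr_storage *ss)
{
    assert(ss != NULL);
    switch (ss->ss_family) {
    case AF_INET:
        return (socklen_t) sizeof(struct sockaddr_in);
    case AF_INET6:
        return (socklen_t) sizeof(struct sockaddr_in6);
    default:
        return 0;
    }
}
```

A caller would then pass `sockaddr_len_for_family(&ss)` as the third argument to bind() instead of `sizeof(ss)`, and treat a return value of 0 as an unsupported address family.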

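
Commit 81f2f19398 in the log above replaces a call that expected a full (struct sockaddr*) with a direct inet_ntop() call on the bare (struct sockaddr_in.sin_addr) value. A minimal sketch of that conversion, assuming an IPv4 address, might look like the following. The wrapper name `ipv4_to_string` is hypothetical and is not the code from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical wrapper (not from Open MPI): format a bare IPv4 address,
 * i.e. the sin_addr member rather than a whole sockaddr, as a string.
 * Returns buf on success, or NULL if inet_ntop() fails (for example when
 * buflen is smaller than INET_ADDRSTRLEN requires). */
static const char *ipv4_to_string(const struct in_addr *addr,
                                  char *buf, socklen_t buflen)
{
    assert(addr != NULL && buf != NULL);
    return inet_ntop(AF_INET, addr, buf, buflen);
}
```

The point of the sketch is that inet_ntop() takes a pointer to the address bytes themselves, so no (struct sockaddr) wrapper is needed; a buffer of INET_ADDRSTRLEN bytes is sufficient for any IPv4 address.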

126 commits