Add a framework to support different types of threading models, including
user space thread packages such as Qthreads and Argobots:
https://github.com/Qthreads/qthreads
https://github.com/pmodels/argobots
The default threading model is pthreads. Alternate threading models are
specified at configure time using the --with-threads=X option.
The framework is static: the threading model to use is selected at
Open MPI configure/build time.
mca/threads: implement Argobots threading layer
config: fix thread configury
- Add double quotation marks
- Change Argobot to Argobots
config: implement Argobots check
If the poll time is too long, MPI hangs.
This quick fix just sets it to 0, but that is not good for the
Pthreads version; we need to find a good way to abstract it.
Note that even a value of 1 (= 1 millisecond) causes disastrous
performance degradation.
rework threads MCA framework configury
It now works more like the ompi/mca/rte configury,
modulo some edge items that are special for threading package
linking, etc.
qthreads module
some argobots cleanup
Signed-off-by: Noah Evans <noah.evans@gmail.com>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
ORTE will be replaced by PRRTE. Ensure that the OMPI and OPAL layers build
without reference to ORTE. Set up the opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for the
"external" pmix component as well as the internal v4 one.
remove orte: misc fixes
- UCX fixes
- VPATH issue
- oshmem fixes
- remove useless definition
- Add PRRTE submodule
- Get autogen.pl to traverse PRRTE submodule
- Remove stale orcm reference
- Configure embedded PRRTE
- Correctly pass the prefix to PRRTE
- Correctly set the OMPI_WANT_PRRTE am_conditional
- Move prrte configuration to the end of OMPI's configure.ac
- Make mpirun a symlink to prun, when available
- Fix makedist with --no-orte/--no-prrte option
- Add a `--no-prrte` option which is the same as the legacy
`--no-orte` option.
- Remove embedded PMIx tarball. Replace it with new submodule
pointing to OpenPMIx master repo's master branch
- Some cleanup in PRRTE integration and add config summary entry
- Correctly set the hostname
- Fix locality
- Fix singleton operations
- Fix support for "tune" and "am" options
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Previously we used a fairly simple algorithm in
mca_btl_tcp_proc_insert() to pair local and remote modules. This was a
point-in-time solution rather than a solution to the global optimization
problem (where global means all modules between two peers). The selection
logic would
often fail due to pairing interfaces that are not routable for traffic.
The complexity of the selection logic was Θ(n^n), which was expensive.
Due to poor scalability, this logic was only used when the number of
interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More
details can be found in this ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates
in the ticket do not match what I calculated from the function)
As a fallback, when the number of interfaces surpassed this threshold, a
brute-force O(n^2) double for loop was used to match interfaces.
This commit solves two problems. First, the point-in-time solution is
turned into a global optimization solution. Second, the reachability
framework was used to create a more realistic reachability map. We
switched from using IP/netmask to using the reachability framework,
which supports route lookup. This will help many corner cases as well as
utilize any future development of the reachability framework.
The solution implemented in this commit has a complexity mainly derived
from the bipartite assignment solver. If the local and remote peer both
have the same number of interfaces (n), the complexity of matching will
be O(n^5).
With the decrease in complexity to O(n^5), I calculated and tested
that initialization costs would be 5000 microseconds with 30 interfaces
per node (likely close to the maximum realistic number of interfaces we
will encounter). For additional data points, interface counts up to 300 (a
very unrealistic number) were simulated. Up to 150
interfaces, the matching costs will be less than 1 second, climbing to
10 seconds with 300 interfaces. Reflecting on these results, I removed
the suboptimal O(n^2) fallback logic, as it no longer seems necessary.
Data was gathered comparing the scaling of initialization costs with
ranks. For a low number of interfaces, the impact of initialization is
negligible. At an interface count of 7-8, the new code has slightly
faster initialization costs. At an interface count of 15, the new code
has slower initialization costs. However, all initialization costs
scale linearly with the number of ranks.
In order to use the reachable function, we populate local and remote
lists of interfaces. We then convert the interface matching problem
into a graph problem. We create a bipartite graph with the local and
remote interfaces as vertices and use negative reachability weights as
costs. Using the bipartite assignment solver, we generate the matches
for the graph. To ensure that both the local and remote process have
the same output, we ensure we mirror their respective inputs for the
graphs. Finally, we store the endpoint matches that we created earlier
in a hash table. This is stored with the btl_index as the key and a
struct mca_btl_tcp_addr_t* as the value. This is then retrieved during
insertion time to set the endpoint address.
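The following is a minimal C sketch of that flow, not the actual Open MPI
code; assignment_solve(), pair_interfaces(), and the flat cost-matrix
layout are hypothetical stand-ins for the real opal graph and reachability
APIs.

    /* Hypothetical sketch of the matching step described above; none of
     * these symbols are the real Open MPI APIs. */
    #include <stdint.h>
    #include <stdlib.h>

    /* Assume a bipartite assignment solver that, given an n_local x
     * n_remote cost matrix, fills match[i] with the remote interface
     * paired to local interface i (or -1) while minimizing total cost. */
    int assignment_solve(int n_local, int n_remote,
                         const int64_t *cost, int *match);

    static int pair_interfaces(int n_local, int n_remote,
                               const uint32_t *reach_weight, int *match)
    {
        int64_t *cost = malloc(sizeof(*cost) * n_local * n_remote);
        if (NULL == cost) {
            return -1;
        }
        for (int i = 0; i < n_local; i++) {
            for (int j = 0; j < n_remote; j++) {
                /* Negative reachability weight as the edge cost: better
                 * reachability becomes a lower (more negative) cost, so
                 * the minimizing solver prefers that pairing. */
                cost[i * n_remote + j] =
                    -(int64_t) reach_weight[i * n_remote + j];
            }
        }
        int rc = assignment_solve(n_local, n_remote, cost, match);
        free(cost);
        return rc;
    }

Both peers run this computation on mirrored inputs, so they compute the
same pairing; the result is then stored in the hash table keyed by
btl_index as described above.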
Signed-off-by: William Zhang <wilzhang@amazon.com>
Due to IF_NAMESIZE being a reused and conditionally defined macro,
issues could arise from macro mismatches. In particular, in cases where
opal/util/if.h is included, but net/if.h is not, IF_NAMESIZE will be 32.
If net/if.h is included on Linux systems, IF_NAMESIZE will be 16. This
can cause a mismatch when using the same macro on a system. Thus
different parts of the code can have differing ideas on the size of a
structure containing a char name[IF_NAMESIZE]. To avoid this error case,
we avoid reusing the IF_NAMESIZE macro and instead define our own as
OPAL_IF_NAMESIZE.
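A minimal sketch of the shape of the fix (the value and the struct are
illustrative, not the exact contents of opal/util/if.h):

    /* Illustrative only: one project-owned constant, so every translation
     * unit sees the same array size regardless of whether net/if.h was
     * included first. The actual value may differ. */
    #define OPAL_IF_NAMESIZE 32

    typedef struct {
        char if_name[OPAL_IF_NAMESIZE];   /* same size everywhere */
        int  if_index;
    } example_if_t;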
Signed-off-by: William Zhang <wilzhang@amazon.com>
Avoid printing an error message about ENOTCONN return codes from
getpeername() when handling an incoming connection request. At
this point in the receive state machine, the remote process has
been verified to be a valid OMPI instance. In all-to-all startup
at 4k rank scale, we're seeing this error message when the remote
side drops the connection because it realizes it's the "loser"
in the connection race. We were already doing all the right things,
other than printing a scary error message. So skip the error
message and call it good.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code. Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure. This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.
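Roughly, the exchange code now does something like the sketch below; the
type, field, and function names are hypothetical, not the real
mca_btl_tcp structures.

    #include <stddef.h>
    #include <sys/socket.h>

    /* Hypothetical module type: the module already records the single
     * address it was created for. */
    typedef struct {
        struct sockaddr_storage if_addr;
    } example_tcp_module_t;

    /* Publish one address per module, taken straight from the module,
     * instead of re-walking every IP address on the node and matching
     * kernel interface indexes. publish_address() stands in for the
     * modex packing code. */
    static void
    publish_module_addrs(example_tcp_module_t *modules, size_t n_modules,
                         void (*publish_address)(const struct sockaddr_storage *))
    {
        for (size_t i = 0; i < n_modules; i++) {
            publish_address(&modules[i].if_addr);
        }
    }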
Refs #5818
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Today, a btl tcp module is associated with exactly one IP
address (IPv4 or IPv6). There's no need to reserve space
for both an IPv4 and IPv6 address in the module structure,
since the module will only be associated with one or the
other.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads. The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.
This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT. It also
reduces the scalability of the TCP BTL by increasing start-up
time, but that is better than hanging.
The long-term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race. This patch is a workaround until that fix can
be developed.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.
Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.
Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error. However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined. Rather than fix all the
usage in the code, we use the opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
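For illustration, a hedged example of the difference (assuming the wrapper
is declared in opal/util/printf.h):

    #include <stdio.h>
    #include <stdlib.h>
    #include "opal/util/printf.h"   /* assumed header for opal_asprintf() */

    static void report(int rank)
    {
        char *msg = NULL;

        /* With plain asprintf(), msg is undefined after a failure on
         * Linux, so checking the pointer is not portable. opal_asprintf()
         * guarantees msg == NULL on error, so either the return code or
         * the pointer can be checked safely. */
        if (opal_asprintf(&msg, "rank %d ready", rank) < 0) {
            return;
        }
        printf("%s\n", msg);
        free(msg);
    }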
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.
Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior). Roll
back these messages to be opal_output_verbose() kinds of messages.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long. Don't copy beyond the end of that source buffer.
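The essence of the fix, as a sketch with illustrative names (not the
actual modex code):

    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>

    /* Copy exactly the 4 bytes of the IPv4 source address into the
     * 16-byte destination, sizing the memcpy by the source rather than
     * the destination. */
    static void copy_v4_addr(uint8_t addr_inet[16],
                             const struct sockaddr_in *my_ss)
    {
        memcpy(addr_inet, &my_ss->sin_addr,
               sizeof(my_ss->sin_addr));   /* 4 bytes */
    }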
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Three issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all
connections for a given address being vectored to a single btl
* Multiple addresses in the same subnet caused connections to be
stalled, as the receiver would always use the same (first) address
found. Binding the outgoing connection solves this issue (see the
sketch after this list)
* Lastly, fix a race condition created by connections being started at the exact
same time, by accepting connections not in the closed state and allowing
endpoint_accept to resolve the dispute
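The "bind the outgoing connection" part looks roughly like the sketch
below; connect_from() and its arguments are illustrative, not the actual
endpoint code.

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Bind the outgoing socket to the local address of the chosen module
     * (with an ephemeral port) before connect(), so the receiver sees the
     * connection arrive from the intended interface rather than whichever
     * same-subnet address the kernel happens to pick. */
    static int connect_from(struct sockaddr_in local,
                            const struct sockaddr_in *remote)
    {
        int sd = socket(AF_INET, SOCK_STREAM, 0);
        if (sd < 0) {
            return -1;
        }
        local.sin_port = 0;   /* ephemeral source port */
        if (bind(sd, (const struct sockaddr *) &local, sizeof(local)) < 0 ||
            connect(sd, (const struct sockaddr *) remote, sizeof(*remote)) < 0) {
            close(sd);
            return -1;
        }
        return sd;
    }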
Signed-off-by: Jordan Cherry <cherryj@amazon.com>
There is a race condition in TCP connection establishment
during a simultaneous handshake. This PR fixes it.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Some OSes have hardcoded limits to prevent overflowing an int32_t.
We can either detect this at configure time (which might be a nicer but
incomplete solution), or always force the pipelined protocol over TCP.
As it only covers data larger than 1GB, no performance penalty is to be
expected.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit has two changes:
1. Adding a magic string during the handshake can cause
issues when used with an older version of MPI. Hence set the
RCVTIMEO parameter to 2 seconds.
2. Use a single call during the handshake instead of
two calls.
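A sketch of the timeout part (function name illustrative; the real
handshake code differs):

    #include <sys/socket.h>
    #include <sys/time.h>

    /* Give the handshake recv() a 2-second receive timeout so a peer that
     * never sends the expected data (e.g. an older version that does not
     * know about the magic string) cannot stall the connection forever. */
    static int set_handshake_timeout(int sd)
    {
        struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
        return setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    }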
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improving TCP debugging, we are moving a few
BTL_ERROR calls to show_help, and also updating the
behavior of mca_btl_tcp_endpoint_complete_connect to
return SUCCESS and ERROR cases.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improving failure handling in the TCP BTL,
we use a magic string to verify that the peer is an MPI
connection. If the magic string is mismatched or missing,
we can identify that we are trying to connect to some
other process.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Set the default send and receive socket buffer size to 0,
which means Open MPI will not try to set a buffer size during
startup.
The default behavior since near day one of the TCP BTL has been
to set the send and receive socket buffer sizes to 128 KiB. A
number that works great on 1 GbE, but not so great on 10 GbE
fabrics of any real size. Modern TCP stacks, particularly on
Linux, have gotten much smarter about buffer sizes and are much
less efficient if a buffer size is set (even if set to something
large).
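Conceptually the change amounts to the sketch below (names hypothetical):
with the new default of 0, the setsockopt() calls are simply skipped and
the kernel's autotuning stays in charge.

    #include <sys/socket.h>

    static void maybe_set_socket_buffers(int sd, int sndbuf, int rcvbuf)
    {
        /* 0 means "leave it alone": no SO_SNDBUF/SO_RCVBUF calls. */
        if (sndbuf > 0) {
            (void) setsockopt(sd, SOL_SOCKET, SO_SNDBUF,
                              &sndbuf, sizeof(sndbuf));
        }
        if (rcvbuf > 0) {
            (void) setsockopt(sd, SOL_SOCKET, SO_RCVBUF,
                              &rcvbuf, sizeof(rcvbuf));
        }
    }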
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Since pthreads are now mandatory, MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD
is always true and hence can be safely removed.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit fixes an abort during finalize because pending events were
removed from the list twice.
References #2030
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Added mca parameter to turn progress thread on/off
Add a flag to check if we have btl progress thread.
Added macro for ob1 matching lock.
Update the AUTHORS file.
All BTL-only operations (basically all data movements
with the exception of the matching operation) can now
be handled for the TCP BTL by a progress thread.
This commit updates each non-compliant btl to set the
MCA_BTL_FLAGS_SEND flag in the btl_flags field if send is
supported. This fixes a problem identified after the latest bml/r2
update, which explicitly checks for the send flag.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration