We currently save the hostname of a proc when we create the ompi_proc_t for it. This was originally done because the only method we had for discovering the host of a proc was to include that info in the modex, and we had to therefore store it somewhere proc-local. Obviously, this carried a memory penalty for storing all those strings, and so we added a "cutoff" parameter so that we wouldn't collect hostnames above a certain number of procs.
Unfortunately, this still results in an 8-byte/proc memory cost as we have a char* pointer in the opal_proc_t that is contained in the ompi_proc_t so that we can store the hostname of the other procs if we fall below the cutoff. At scale, this can consume a fair amount of memory.
With the switch to relying on PMIx, there is no longer a need to cache the proc hostnames. Using the "optional" feature of PMIx_Get, we restrict the retrieval to be purely proc-local - i.e., we retrieve the info either via shared memory or from within the proc-internal hash storage (depending upon the active PMIx components). Thus, the retrieval of a hostname is purely a local operation involving no communication.
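A minimal sketch of such a purely local retrieval using the standard PMIx client API (the namespace and rank values are illustrative; the real call sits inside OPAL's wrapper):

    #include <pmix.h>

    pmix_proc_t peer;
    pmix_info_t info;
    pmix_value_t *val = NULL;
    bool optional = true;

    PMIX_LOAD_PROCID(&peer, "myjob", 3);
    /* PMIX_OPTIONAL restricts the lookup to local storage (shared
     * memory or the proc-internal hash) - the server is never asked,
     * so no communication takes place */
    PMIX_INFO_LOAD(&info, PMIX_OPTIONAL, &optional, PMIX_BOOL);

    if (PMIX_SUCCESS == PMIx_Get(&peer, PMIX_HOSTNAME, &info, 1, &val)) {
        /* val->data.string holds the hostname */
        PMIX_VALUE_RELEASE(val);
    }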
All RMs are required to provide a complete hostname map of all procs at startup. Thus, we have full access to all hostnames without including them in a modex or having to cache them on each proc. This allows us to remove the char* pointer from the opal_proc_t, saving 8 bytes/proc.
Unfortunately, PMIx_Get does not currently support the return of a static pointer to memory. Thus, even though PMIx has the hostname in its memory, it can only return a malloc'd version of it. I have therefore ensured that the return from opal_get_proc_hostname is consistently malloc'd and free'd wherever used. This shouldn't be a burden as the hostname is only used in one of two circumstances:
(a) in an error message
(b) in a verbose output for debugging purposes
Thus, there should be no performance penalty associated with the malloc/free requirement. PMIx will eventually be returning static pointers, and so we can eventually simplify this method and return a "const char*" - but as noted, this really isn't an issue even today.
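A hedged sketch of the resulting caller-side pattern (the argument and the output call are illustrative):

    char *hostname = opal_get_proc_hostname(proc);
    opal_output(0, "btl: connection to peer on %s failed",
                (NULL == hostname) ? "unknown" : hostname);
    free(hostname);   /* the return is always malloc'd, so always free it */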
Signed-off-by: Ralph Castain <rhc@pmix.org>
ORTE will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.
remove orte: misc fixes
- UCX fixes
- VPATH issue
- oshmem fixes
- remove useless definition
- Add PRRTE submodule
- Get autogen.pl to traverse PRRTE submodule
- Remove stale orcm reference
- Configure embedded PRRTE
- Correctly pass the prefix to PRRTE
- Correctly set the OMPI_WANT_PRRTE am_conditional
- Move prrte configuration to the end of OMPI's configure.ac
- Make mpirun a symlink to prun, when available
- Fix makedist with --no-orte/--no-prrte option
- Add a `--no-prrte` option which is the same as the legacy
`--no-orte` option.
- Remove embedded PMIx tarball. Replace it with new submodule
pointing to OpenPMIx master repo's master branch
- Some cleanup in PRRTE integration and add config summary entry
- Correctly set the hostname
- Fix locality
- Fix singleton operations
- Fix support for "tune" and "am" options
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Previously we used a fairly simple algorithm in
mca_btl_tcp_proc_insert() to pair local and remote modules. It treated
the pairing as a point-in-time decision rather than as a global
optimization problem (where global means all modules between two
peers). The selection logic would often fail by pairing interfaces that
are not routable for traffic.
The complexity of the selection logic was Θ(n^n), which was expensive.
Due to poor scalability, this logic was only used when the number of
interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More
details can be found in this ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates
in the ticket do not match what I calculated from the function)
As a fallback, when interfaces surpassed this threshold, a brute force
O(n^2) double for loop was used to match interfaces.
This commit solves two problems. First, the point-in-time solution is
turned into a global optimization solution. Second, the reachability
framework was used to create a more realistic reachability map. We
switched from using IP/netmask to using the reachability framework,
which supports route lookup. This will help many corner cases as well as
utilize any future development of the reachability framework.
The solution implemented in this commit has a complexity mainly derived
from the bipartite assignment solver. If the local and remote peer both
have the same number of interfaces (n), the complexity of matching will
be O(n^5).
With the decrease in complexity to O(n^5), I calculated and tested
that initialization costs would be 5000 microseconds with 30 interfaces
per node (likely close to the maximum realistic number of interfaces we
will encounter). For additional data points, up to 300 interfaces (a
very unrealistic number) were simulated. Up until 150
interfaces, the matching costs will be less than 1 second, climbing to
10 seconds with 300 interfaces. Reflecting on these results, I removed
the suboptimal O(n^2) fallback logic, as it no longer seems necessary.
Data was gathered comparing the scaling of initialization costs with
ranks. For a low number of interfaces, the impact of initialization is
negligible. At an interface count of 7-8, the new code has slightly
faster initialization costs. At an interface count of 15, the new code
has slower initialization costs. However, all initialization costs
scale linearly with the number of ranks.
In order to use the reachable function, we populate local and remote
lists of interfaces. We then convert the interface matching problem
into a graph problem. We create a bipartite graph with the local and
remote interfaces as vertices and use negative reachability weights as
costs. Using the bipartite assignment solver, we generate the matches
for the graph. To ensure that both the local and remote process have
the same output, we ensure we mirror their respective inputs for the
graphs. Finally, we store the endpoint matches that we created earlier
in a hash table. This is stored with the btl_index as the key and a
struct mca_btl_tcp_addr_t* as the value. This is then retrieved during
insertion time to set the endpoint address.
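A hedged sketch of that storage scheme using OPAL's hash table API (the table field name is hypothetical):

    /* store: the btl_index is the 32-bit key, the matched address the value */
    opal_hash_table_set_value_uint32(&proc->btl_index_to_endpoint_addr,
                                     btl_index, (void *)matched_addr);

    /* retrieve at insertion time to set the endpoint address */
    struct mca_btl_tcp_addr_t *addr = NULL;
    opal_hash_table_get_value_uint32(&proc->btl_index_to_endpoint_addr,
                                     btl_index, (void **)&addr);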
Signed-off-by: William Zhang <wilzhang@amazon.com>
Due to IF_NAMESIZE being a reused and conditionally defined macro,
issues could arise from macro mismatches. In particular, in cases where
opal/util/if.h is included, but net/if.h is not, IF_NAMESIZE will be 32.
If net/if.h is included on Linux systems, IF_NAMESIZE will be 16. This
can cause a mismatch when using the same macro on a system. Thus
different parts of the code can have differing ideas on the size of a
structure containing a char name[IF_NAMESIZE]. To avoid this error case,
we avoid reusing the IF_NAMESIZE macro and instead define our own as
OPAL_IF_NAMESIZE.
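A minimal sketch of the change (the concrete value is an assumption based on the 32-byte case described above):

    /* opal/util/if.h: one unconditional, OPAL-owned constant instead of
     * the conditionally defined IF_NAMESIZE from net/if.h */
    #define OPAL_IF_NAMESIZE 32

    typedef struct {
        char if_name[OPAL_IF_NAMESIZE];   /* now the same size everywhere */
        /* ... */
    } opal_if_t;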
Signed-off-by: William Zhang <wilzhang@amazon.com>
After the OPAL_MODEX_RECV call, remote_addrs was not freed in the error
path. Moved the free call into cleanup to ensure we always free this
memory before leaving the function.
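The shape of the fix, sketched (the helper names are hypothetical):

    static int recv_remote_addrs(mca_btl_tcp_proc_t *proc)
    {
        mca_btl_tcp_addr_t *remote_addrs = NULL;
        int rc = lookup_modex_addrs(proc, &remote_addrs);  /* hypothetical */
        if (OPAL_SUCCESS != rc) {
            goto cleanup;     /* previously: an early return that leaked */
        }
        rc = store_addrs(proc, remote_addrs);              /* hypothetical */
    cleanup:
        free(remote_addrs);   /* free(NULL) is safe: covers every exit path */
        return rc;
    }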
Signed-off-by: William Zhang <wilzhang@amazon.com>
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code. Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure. This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.
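In outline (component and module fields as in the tcp btl, with a hypothetical publish helper):

    /* publish each module's own stored address instead of re-scanning
     * all of the node's IP addresses for a matching kindex */
    for (uint32_t i = 0; i < mca_btl_tcp_component.tcp_num_btls; i++) {
        mca_btl_tcp_module_t *btl = mca_btl_tcp_component.tcp_btls[i];
        publish_address((struct sockaddr *)&btl->tcp_ifaddr);  /* hypothetical */
    }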
Refs #5818
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
While trying to debug #3035, it is not clear whether there is
an issue with the modex data or with printing the address list.
Print the number of endpoints in the error message, which will help
determine which case is happening to Cisco.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error. However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined. Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
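Usage is identical to asprintf(3), with the BSD guarantee made explicit:

    #include "opal/util/printf.h"

    char *msg = NULL;
    if (0 > opal_asprintf(&msg, "rank %d of %d", 3, 16)) {
        /* on error, msg is guaranteed to be NULL, so the usual
         * free-on-cleanup path below stays safe */
    }
    free(msg);   /* free(NULL) is a no-op */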
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.
1. On the endpoint, it is storing the moral equivalent of a
(struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).
The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr). Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).
This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
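The direct conversion, sketched (the source variable stands in for the endpoint's cached address):

    #include <arpa/inet.h>

    /* the endpoint stores the moral equivalent of a struct in_addr,
     * so convert it directly instead of treating it as a sockaddr */
    struct in_addr peer_addr;        /* illustrative: the endpoint's cached addr */
    char buf[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &peer_addr, buf, sizeof(buf));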
NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module. This only happens with interfaces that have
more than one IP address. This commit does not fix that issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The current cast is *functional*, but isn't really the way it should
be done. This commit makes the cast the way it should be done.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix two facepalms:
1. The "uint32" in the hash map functions refer to the *key* size, not
the *value* size. The values are always 64 bits.
2. Pass the straight value to the "set" functions -- not the pointer
to the value.
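Spelled out (key and value names are illustrative):

    /* right: "uint32" names the 32-bit *key*; the value slot is always
     * a 64-bit payload, passed by value */
    opal_hash_table_set_value_uint32(&ht, if_kindex, (void *)addr);

    /* wrong (the earlier facepalm): passing a pointer *to* the value */
    /* opal_hash_table_set_value_uint32(&ht, if_kindex, (void *)&addr); */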
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The giant size of the TCP proc struct is causing a problem in some
environments (because it is allocated on the stack), and it was too
big, anyway.
Instead, use a hash map. That way, it starts small and can grow if it
needs to. It also makes no assumptions about the values of the kernel
interface indexes.
Fixes #5292.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Three issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all
connections for a given address being vectored to a single btl
* Multiple addresses in the same subnet caused connections to be
stalled, as the receiver would always use the same (first) address
found. Binding the outgoing connection solves this issue
* Lastly, fix a race condition created by connections being started at the exact same time
  by accepting connections not in the closed state, allowing endpoint_accept to resolve
  the dispute
Signed-off-by: Jordan Cherry <cherryj@amazon.com>
There is a race condition in TCP connection establishment
during a simultaneous handshake. This PR provides the fix for
it.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
This is mostly for error cases, where we need to release the
newly created proc. Currently the code deadlocks because the endpoint
lock is held at the release and the lock is not recursive.
Also added some code to print the IP addresses that don't match during
the TCP connection step.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Add a verbose to show all the failed attempts to match the
remote interfaces based on the modex info.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This patch is based on the "RFC: Reenabling the TCP BTL over local
interfaces (when specifically requested)". It removes the hardcoded
exception for the local devices that had been enforced by the
TCP BTL. Instead, we exclude the local interfaces only via the
exclude MCA (both IPv4 and IPv6 local addresses are already in the
default if_exclude), which is also the behavior currently described in
our README file.
This commit fixes a race between a thread calling the tcp btl's
add_procs and a thread processing an incoming connection. The race
occurred because the add_procs thread adds a newly created proc object
to the hash table *before* the object is fully initialized. The
connection thread then attempts to use the object before the endpoints
array on the object has been allocated. The fix is to only add the
proc to the hash table after it has been completely initialized.
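The ordering fix, sketched (table and helper names approximate the tcp btl; treat exact signatures as assumptions):

    mca_btl_tcp_proc_t *proc = OBJ_NEW(mca_btl_tcp_proc_t);

    /* finish every field the connection thread may touch, including
     * the endpoints array... */
    proc->proc_endpoints = (mca_btl_base_endpoint_t **)
        calloc(num_endpoints, sizeof(mca_btl_base_endpoint_t *));

    /* ...and only then publish the proc where other threads can see it */
    opal_proc_table_set_value(&mca_btl_tcp_component.tcp_procs,
                              proc_name, proc);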
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit makes two changes to the tcp btl:
- If a tcp proc does not exist when handling a new connection, create
a new proc and use it. The current implementation uses the
opal_proc_by_name() function to get the opal_proc_t then calls
add_procs on all btl modules. It may be sufficient to just call
add_procs until an endpoint is created so this may change somewhat.
- In add_procs add a check for an existing endpoint before creating
one.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
The missing "continue" statement allowed IPv4 "private network"
matches to fall through and allow IPv6 matches to be made -- thereby
overriding the IPv4 match that was already made. Also add another
(superfluous but symmetric) continue statement.
Fixes #585 (although several of the other issues identified on #585
still exist, the primary / initial bug that was reported there is now
fixed).
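Schematically (the match helpers are illustrative, not the actual code):

    for (int i = 0; i < num_candidates; i++) {
        if (ipv4_private_network_match(i)) {
            record_match(i);
            continue;   /* the missing statement: without it, control fell
                         * through and a later IPv6 match silently overrode
                         * this IPv4 match */
        }
        if (ipv6_match(i)) {
            record_match(i);
            continue;   /* the "superfluous but symmetric" twin */
        }
    }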
Fix a race where one thread was cleaning up after receiving a zero
byte on the connect socket (a locally started connection), while
another was trying to accept a new connection from the same peer.
Create a zero-timed event and delocalize the accept into a timer
event.
Add support for registering an error callback that can be used when a
connection is discovered as failed during the initialization process.
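As a purely illustrative shape of such a registration (every name here is hypothetical):

    /* invoked when a connection is discovered as failed during the
     * initialization process */
    typedef void (*tcp_conn_error_cb_fn_t)(mca_btl_base_endpoint_t *endpoint,
                                           int error);

    /* hypothetical registration entry point */
    void mca_btl_tcp_register_conn_error_cb(tcp_conn_error_cb_fn_t cb);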
Each proc is inserted in the ompi_proc_list as soon as it is created
and is removed only upon the call to the destructor. In
ompi_proc_finalize we loop over all procs and release them once.
However, as a proc is not removed from this list right away, we
decrease the ref count for each proc until it reaches zero and the
proc is finally removed. Thus, we cannot clean the BML/BTL after
the call to ompi_proc_finalize.
A quick fix is to delay the call to ompi_proc_finalize until all
other frameworks have been finalized, and then the behavior
depicted above will give the expected outcome.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “pmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs (see the sketch after this list). The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
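A hedged sketch of the send/recv pair (the argument lists follow later tcp btl usage and are assumptions for this commit's era; the authoritative definitions live in the opal/mca/pmix headers):

    int rc;
    mca_btl_tcp_addr_t *addrs = NULL, *remote_addrs = NULL;
    size_t size = 0;
    opal_process_name_t proc_name;

    /* send: publish this module's address blob under a key scoped by
     * the component version (addrs/size computed elsewhere) */
    OPAL_MODEX_SEND(rc, PMIX_GLOBAL,
                    &mca_btl_tcp_component.super.btl_version,
                    addrs, size);

    /* recv: pull the peer's blob, on demand */
    OPAL_MODEX_RECV(rc, &mca_btl_tcp_component.super.btl_version,
                    &proc_name, (void **)&remote_addrs, &size);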
This commit was SVN r32570.