openmpi

Author	SHA1	Message	Date
Ralph Castain	d672fad849	Repair rsh/ssh tree spawn Repair rsh/ssh tree spawn by unpacking and updating the nidmap in remote_spawn. Add more specific error messages so the cause of a messaging problem is a little clearer. Remove some stale code. Ensure we stop trying to send a message after a few times. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-27 11:35:00 -08:00
Ralph Castain	466cbd4d29	Rework the threading in oob/tcp so that daemons (including mpirun) use multiple progress threads to get messages out to their children, and so that the oob/base uses a separate one to setup sends. This allows the daemon cmd processor to execute in parallel with relay of messages, which significantly reduces launch times at scale Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-21 13:26:19 -08:00
Gilles Gouaillardet	24c61b0625	oob/tcp: plug a memory leak in mca_oob_tcp_component_lost_connection() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:59 +09:00
Ralph Castain	188880be3f	Since static ports are only used by ORTE if the runtime option is given, there is no need for a configure option as well - so remove the --enable-orte-static-ports configure option. When decoding the daemon nidmap, mark new daemons as ALIVE by default - we will discover dead ones as we go. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-04 05:01:42 -07:00
Gilles Gouaillardet	30298cc83c	oob/tcp: remove debug that should have never been committed Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-31 16:41:14 +09:00
Gilles Gouaillardet	75e96004a4	oob/tcp: fix a typo in mca_oob_tcp_component_no_route() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-31 16:30:24 +09:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Ralph Castain	a2919174d0	Bring the RML modifications across. This is the first step in a revamp of the ORTE messaging subsystem to support fabric-based communications during launch and wireup phases. When completed, the grpcomm and plm frameworks will each have their own "conduit" for communication - each conduit corresponds to a particular RML messaging transport. This can be the active OOB-based component, or a provider from within the RML/OFI component. Messages sent down the conduit will flow across the associated transport. Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted. While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress	2016-10-11 16:01:02 -07:00
Gilles Gouaillardet	e84b35217f	oob/tcp: plug a memory leak as reported by Coverity with CID 1196711	2016-09-08 18:50:18 +09:00
Ralph Castain	f85dcaee2a	Fixes CID 1369067 and CID 1196684 Fixes CID 1369648 Fixes CID 1372409	2016-09-06 08:43:15 -07:00
Ralph Castain	a4c8e8c28a	Cleanup the proposed change: * qos framework is moving to the scon layer and is no longer required in ORTE * remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought * no need for separating the "base" from "API" module definition. The two are identical * move the "stub" functions into their own file for cleanliness * general cleanup to meet coding standards * cleanup some logic in the stubs	2016-03-10 13:14:17 -08:00
Ralph Castain	0a6b8d2c14	Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system	2015-12-30 07:16:43 -08:00
Ralph Castain	1cdc1c121c	Revert "Standardize the handling of shutdown in the OOB TCP component" This reverts commit open-mpi/ompi@12dccaa911.	2015-12-30 07:05:40 -08:00
Ralph Castain	12dccaa911	Standardize the handling of shutdown in the OOB TCP component	2015-12-29 07:57:22 -08:00
Ralph Castain	89c80b2294	Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener.	2015-08-27 16:41:00 -07:00
Nathan Hjelm	156ce6af21	periodic whitespace purge Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 09:32:33 -06:00
Gilles Gouaillardet	429bdf1af7	oob/tcp: fix a race condition when finalizing the oob/tcp component	2015-07-28 09:16:13 +09:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Jeff Squyres	e43c8dc291	oob tcp: label a few #endif's Only bother labeling the ones that are a little far away from their corresponding #if statements.	2015-05-20 21:10:11 -04:00
Jeff Squyres	4b2f0d4827	oob tcp: reset MCA params from level 9 Set various MCA param levels	2015-05-20 21:10:11 -04:00
Jeff Squyres	1a4c9960e1	oob tcp: set KEEPALIVE timeout 60s, retry interval 5s The timeout is frequency at which to send keepalive pings; the retry interval is how often to send successive pings once a keepalive has not replied. Also update comments and MCA param help strings. 60 seconds -- squashme	2015-05-20 21:08:37 -04:00
Jeff Squyres	32d81af35f	oob tcp: re-enable keepalive option for Mac Plus very minor #if/#endif reduction.	2015-05-20 17:28:45 -04:00
Ralph Castain	d3d3e73099	Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket	2015-05-15 07:13:42 -06:00
Ralph Castain	8e30579e6e	The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac	2015-05-14 18:09:13 -06:00
Ralph Castain	b5382c9bf9	Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel. A few other minor cleanups included.	2015-05-08 11:15:21 -07:00
Ralph Castain	1f8de276de	Consolidate all the QOS changes into one clean commit	2015-05-06 19:48:42 -07:00
Nathan Hjelm	45e053dbce	orte: use C99 subobject naming for component initialization This commit helps future-proof orte components by initializing each component member by name. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-18 10:29:58 -06:00
Ralph Castain	0c043dbdc9	Fix typo in var name	2015-04-02 02:32:42 -07:00
Ralph Castain	a4b466efc4	Support attempts to connect async processes by allowing the oob/tcp connection to retry the attempt to connect to a peer. Off by default, operates if someone specifies how long to wait between retry attempts.	2015-04-01 20:21:23 -07:00
rhc54	2ff7575dde	Merge pull request #497 from rhc54/topic/sec Allow for different security domains.	2015-03-25 21:01:29 -07:00
Ralph Castain	10cf455080	Tools need to use the TCP OOB component	2015-03-25 19:56:49 -07:00
Ralph Castain	1b24536941	Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail.	2015-03-25 13:22:01 -07:00
Ralph Castain	a0487e014c	Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used.	2015-03-16 19:42:05 -07:00
Ralph Castain	64d11f170a	Adjust the default keepalive interval. Refactor the code when setting keepalive options	2015-03-16 12:32:58 -07:00
Ralph Castain	4ded049cbc	Modify MCA param description	2015-03-16 11:57:32 -07:00
Ralph Castain	019bba5caf	Cleanup a bit - don't need to lookup the protocol number if we just use the right define	2015-03-16 11:54:51 -07:00
Ralph Castain	69ac25bf55	Add support for TCP keepalive on inter-node sockets	2015-03-16 09:59:44 -07:00
Gilles Gouaillardet	a69d935d55	oob/tcp: fix misc issues as reported by Coverity with CIDs 70726, 710564, 1196630, 1269805, 1269803, 1269932	2015-03-10 19:32:01 +09:00
Jeff Squyres	71ae0ad5ec	oob_tcp_component: add #if OPAL_ENABLE_IPV6 around IPv6-specific code This was CID 1196629	2015-02-24 15:24:11 -05:00
Ralph Castain	d2938a144f	Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch	2015-01-12 05:31:02 -08:00
Ralph Castain	4d186e6402	Properly protect the MCA parameters being registered by the OOB/TCP component when IPv6 is enabled cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32662.	2014-09-02 14:53:00 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “pmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs.
This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Ralph Castain	5db717f090	Some small leak cleanups cmr=v1.8.3:reviewer=artpol This commit was SVN r32358.	2014-07-30 15:46:02 +00:00
Adrian Reber	4aca7095dc	fix a syntax error in the FT code This commit was SVN r32087.	2014-06-25 20:35:50 +00:00
Ralph Castain	e21bfeadcd	Now that the BTLs are moving down to OPAL and becoming available to ORTE, there no longer is a need/desire to push performance in the OOB/TCP component. So we don't need multiple modules driving NICs in parallel, and can drop all the complicated distribution logic. Fall back to the simplified single module model, but retain the ability to run that module in its own progress thread if so directed. This should eliminate the connectivity issues that have been reported, and will make maintenance of this component much easier. cmr=v1.8.2:reviewer=jsquyres:subject=simplify the OOB/TCP component This commit was SVN r31956.	2014-06-06 02:24:17 +00:00
Ralph Castain	7df500ecf5	Break the loop caused by retrying to send a message to a hop that is unknown by the TCP oob component. We attempt to provide a way for other components to try, but need to mark that the TCP component is not able to reach that process so the OOB base will know to give up. This commit was SVN r31928.	2014-06-02 15:00:33 +00:00
Nathan Hjelm	59d09ad9de	orte: fix several small memory leaks grpcomm: fix memory leaks We were leaking the caddy object used to pass data to the callback function. This commit fixes these leaks. oob,rml: fix memory leaks This commit fixes several leaks: - Both the oob/base and oob/tcp were leaking objects on their peer hash tables. Iterate on the hash tables and free any objects. - Leaked sent messages because of missing OBJ_RELEASE. I placed the release in ORTE_RML_SEND_COMPLETE to catch all the possible paths. ess/base: close the state framework cmr=v1.8.2:reviewer=rhc This commit was SVN r31776.	2014-05-15 15:06:27 +00:00
Ralph Castain	445b552d3a	Try again to get an error message printed when a daemon fails to successfully report back to mpirun. In this case, there is no guaranteed way for the daemon to output the error report itself - we don't have a connection back to the HNP, and we have tied stderr off to /dev/null (for good reasons). So the HNP has to detect the failure itself and report it. The HNP can't know the precise reason, of course - all it knows is that the daemon failed. So output a generic error message that provides guidance on probable causes. Refs trac:4571 This commit was SVN r31589. The following Trac tickets were found above: Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571	2014-05-01 19:48:21 +00:00
Ralph Castain	3723b39f30	Ensure we don't silently fail when unable to make a connection - bark pleasantly first. Refs trac:4571 This commit was SVN r31537. The following Trac tickets were found above: Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571	2014-04-28 19:16:32 +00:00
Ralph Castain	d642babff6	Derived from patch provided by Artem, cleanup the "abnormal" code path for selecting TCP OOB modules to connect to a remote process. If we can't find a direct interface-to-address match, then assign all the provided addresses to the first available TCP module and let the normal failure process determine if the remote proc is truly reachable. cmr=v1.8.2:reviewer=artpol:subject=fix abnormal code connection path in tcp oob This commit was SVN r31536.	2014-04-28 19:05:14 +00:00

76 Commits