openmpi

Author	SHA1	Message	Date
Gilles Gouaillardet	da0c873e14	oob/tcp: enhance debugging output display the hop node used to send a message (if the message is sent directly, then the hop is the destination) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-04 14:16:06 +09:00
Gilles Gouaillardet	30298cc83c	oob/tcp: remove debug that should have never been commited Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-31 16:41:14 +09:00
Gilles Gouaillardet	75e96004a4	oob/tcp: fix a typo in mca_oob_tcp_component_no_route() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-31 16:30:24 +09:00
Gilles Gouaillardet	3d4285b04d	oob/tcp: silence valgrind warning fully initialize allocated memory to keep valgrind happy Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-27 17:12:46 +09:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Ralph Castain	a2919174d0	Bring the RML modifications across. This is the first step in a revamp of the ORTE messaging subsystem to support fabric-based communications during launch and wireup phases. When completed, the grpcomm and plm frameworks will each have their own "conduit" for communication - each conduit corresponds to a particular RML messaging transport. This can be the active OOB-based component, or a provider from within the RML/OFI component. Messages sent down the conduit will flow across the associated transport. Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted. While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress	2016-10-11 16:01:02 -07:00
Gilles Gouaillardet	c92e9a5406	use the new OPAL_HASH_TABLE_FOREACH convenience macro	2016-10-08 16:58:20 +09:00
Ralph Castain	de7b1494d9	Clean out old cruft from the ORCM project	2016-09-21 00:13:30 -07:00
Gilles Gouaillardet	e84b35217f	oob/tcp: plug a memory leak as reported by Coverity with CID 1196711	2016-09-08 18:50:18 +09:00
Artem Polyakov	9eba1b0b75	Merge pull request #2042 from artpol84/pmix_sdirs Several fixes related to session directories:	2016-09-07 14:15:47 +07:00
Ralph Castain	f85dcaee2a	Fixes CID 1369067 and CID 1196684 Fixes CID 1369648 Fixes CID 1372409	2016-09-06 08:43:15 -07:00
Artem Polyakov	81195ab724	Several fixes related to session directories: * enable OMPI to retrieve paths from RM through PMIx * cleanups related to tempdirs.	2016-09-05 07:48:44 +03:00
Gilles Gouaillardet	0b8c58298d	oob/usock: fix handling of orte_process_name_t * orte_process_name_t is aligned on 32 bits, so it cannot simply be casted into an int64_t. use memcpy() instead Thanks Paul Hargrove for the report	2016-09-01 13:18:02 +09:00
Ralph Castain	ae2af61ee3	Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.	2016-08-15 22:46:46 -05:00
Karol Mroz	5c11bdb251	orte: fixup hostname max length usage Also removes orte specific max hostname value. Signed-off-by: Karol Mroz <mroz.karol@gmail.com>	2016-04-25 07:08:23 +02:00
Gilles Gouaillardet	d757fbba5d	oob/usock: drop message to be sent in process_send()	2016-04-04 16:04:54 +09:00
Gilles Gouaillardet	170734182b	oob/usock: mca_oob_usock_peer_close() sets peer->sd = -1 after close() so usock_peer_create_socket know it must re-create the socket /* assuming it is ever supposed to occur */ also fix a typo (peer->sd >= 0) in usock_peer_create_socket	2016-04-04 16:02:05 +09:00
Ralph Castain	a3fea58d1c	Minor cleanups to prior PR commit	2016-03-24 15:55:14 -07:00
rhc54	6756e19aa2	Merge pull request #1457 from anandhis/master rml changes	2016-03-24 15:17:29 -07:00
Ralph Castain	6e6bbfda91	Very minor typo	2016-03-23 08:31:47 -07:00
Ralph Castain	c146c4969b	Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation. Fixes #1467 Restore debugger attach operations Fixes #1225	2016-03-18 21:49:04 -07:00
Anandhi S Jayakumar	a31292abc7	fixes to ud for removing qos channel	2016-03-10 18:03:17 -08:00
Ralph Castain	a4c8e8c28a	Cleanup the proposed change: * qos framework is moving to the scon layer and is no longer required in ORTE * remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought * no need for separating the "base" from "API" module definition. The two are identical * move the "stub" functions into their own file for cleanliness * general cleanup to meet coding standards * cleanup some logic in the stubs	2016-03-10 13:14:17 -08:00
Nysal Jan K.A	cc9b1316a4	Make UD OOB memory registrations a multiple of page size If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK. If we only partially use a page, any data allocated on the remainder of the page will be inaccessible to the child process. Fixes open-mpi/ompi#1363	2016-02-17 22:19:49 -05:00
Ralph Castain	351070659e	Correct ordering when checking for privileged ports	2016-02-14 09:43:01 -08:00
Ralph Castain	233bd085ca	Protect against a non-privileged port connecting to us when we are running as root Don't close the listener socket upon error unless we are giving up Cleanup the incoming socket	2016-02-13 08:07:27 -08:00
Igor Ivanov	34d861dfe9	orte/oob: Fix issue #1301 Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>	2016-01-20 12:08:00 +02:00
Ralph Castain	0a6b8d2c14	Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system	2015-12-30 07:16:43 -08:00
Ralph Castain	1cdc1c121c	Revert "Standardize the handling of shutdown in the OOB TCP component" This reverts commit open-mpi/ompi@12dccaa911.	2015-12-30 07:05:40 -08:00
Ralph Castain	12dccaa911	Standardize the handling of shutdown in the OOB TCP component	2015-12-29 07:57:22 -08:00
Federico Reghenzani	6536a6a9f5	oob_tcp: fix peer->state wrong check	2015-10-29 16:43:58 +01:00
John Westlund	044fea8df7	re-order != comparison, OBJ_RELEASE mca_oob_tcp_addr_t on failure	2015-10-02 15:59:48 -07:00
John Westlund	6bfaa925ec	simplify use of sockaddr* structs to work around buffer overflow warning	2015-10-02 14:26:52 -07:00
Howard Pritchard	8d7e759b85	oob/alps: swat compiler warning swat some alps related compiler warnings when using --enable-picky Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-09-21 14:24:26 -07:00
Ralph Castain	1b7930ad52	Silence some warnings and address Coverity issues	2015-09-16 07:58:22 -07:00
Ralph Castain	c1bbbb5e2f	Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds	2015-09-15 13:08:35 -07:00
Ralph Castain	dc5796b8a1	Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"" Fix the locality computation by correctly computing the vpid of the local peer This reverts commit open-mpi/ompi@6a8fad49e5.	2015-09-11 08:29:51 -07:00
Ralph Castain	6a8fad49e5	Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local" This reverts commit `f94f3cda21`.	2015-09-11 02:01:25 -07:00
Ralph Castain	f94f3cda21	Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local	2015-09-10 10:25:30 -07:00
Ralph Castain	0d5814b5ca	Cleanup Coverity issues	2015-08-29 21:19:27 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Ralph Castain	89c80b2294	Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener.	2015-08-27 16:41:00 -07:00
Nathan Hjelm	156ce6af21	periodic whitespace purge Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 09:32:33 -06:00
Ralph Castain	023936e84b	Silence coverity warnings	2015-07-29 07:28:08 -07:00
Gilles Gouaillardet	429bdf1af7	oob/tcp: fix a race condition when finalizing the oob/tcp component	2015-07-28 09:16:13 +09:00
Ralph Castain	4352123c26	Protect the oob/tcp component from port scanners	2015-06-26 01:40:57 -07:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Ralph Castain	6b93db6a9a	Grrr...not sure how this slipped thru	2015-05-29 19:37:24 -07:00
Ralph Castain	bac308b184	Remove stale header	2015-05-29 19:24:51 -07:00
Ralph Castain	ea35e47228	Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail. Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time. We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later. This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.	2015-05-29 14:37:14 -07:00
Ralph Castain	bc7a7f3de5	Fix abnormal shutdown when a node dies	2015-05-22 17:29:06 -07:00
Jeff Squyres	3069daa015	oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK Have only a single level of "if" conditionals. Also, slightly change the logic such that we only die/break out of the loop if we get EMFILE -- all other errors are ok to go on to the next fd. Finally, use a real show_help() message to warn when other errors occur.	2015-05-20 21:10:11 -04:00
Jeff Squyres	e43c8dc291	oob tcp: label a few #endif's Only bother labeling the ones that are a little far away from their corresponding #if statements.	2015-05-20 21:10:11 -04:00
Jeff Squyres	4b2f0d4827	oob tcp: reset MCA params from level 9 Set various MCA param levels	2015-05-20 21:10:11 -04:00
Jeff Squyres	1a4c9960e1	oob tcp: set KEEPALIVE timeout 60s, retry interval 5s The timeout is frequency at which to send keepalive pings; the retry interval is how often to send successive pings once a keepalive has not replied. Also update comments and MCA param help strings. 60 seconds -- squashme	2015-05-20 21:08:37 -04:00
Jeff Squyres	c95215dfc2	oob_tcp: do not set KEEPALIVE on listening sockets	2015-05-20 17:28:45 -04:00
Jeff Squyres	32d81af35f	oob tcp: re-enable keepalive option for Mac Plus very minor #if/#endif reduction.	2015-05-20 17:28:45 -04:00
rhc54	95c40e64b9	Merge pull request #584 from nkogteva/oob_ud_stress_test oob ud: fixed a bug that prevented the work with QoS framework	2015-05-20 09:56:08 -06:00
Ralph Castain	d3d3e73099	Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket	2015-05-15 07:13:42 -06:00
Ralph Castain	0a345d34e6	Plug the memory leak identified by George	2015-05-14 21:33:48 -06:00
Howard Pritchard	578430c36d	oob/alps: remove comment with personal reference Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-05-14 20:06:21 -07:00
Ralph Castain	8e30579e6e	The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac	2015-05-14 18:09:13 -06:00
Nadezhda Kogteva	d9dcf8352e	oob ud: fixed a bug that prevented the work with QoS framework (oob_stress_channel test)	2015-05-13 11:40:01 +03:00
Jeff Squyres	8e8d104520	oob ud: ibv_get_device_list()==NULL can mean no devices present ...which is not an error. Don't complain about it.	2015-05-12 10:54:39 -07:00
Jeff Squyres	8f941a6613	oob ud: better error msgs, tolerate systems without UD devices It is perfectly ok to be on a system without UD devices. Also, make some of the error messages better -- so that the user has a clue about where the error messages are coming from, and what they should do.	2015-05-11 13:11:51 -07:00
Mike Dubman	894ba28390	Merge pull request #559 from nkogteva/oob_ud oob ud: made component more user adaptive; opal outputs were replaced by...	2015-05-11 21:09:28 +03:00
Ralph Castain	3cee4152fc	Fix the intercommunictor issue reported by Gilles. Instead of directly checking the reachability bitmap, ask the component if the proc is reachable when doing a send as the component is the final arbiter in such cases. Recirculate any messages that a daemon is trying to send to void race conditions. Cleanup listener sockets so we don't leak them	2015-05-11 09:16:25 -07:00
Ralph Castain	b5382c9bf9	Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel. A few other minor cleanups included.	2015-05-08 11:15:21 -07:00
Ralph Castain	6e95bcd583	Fix typo in oob_tcp.c when IPV6 enabled. Cleanup a few other warnings, including a type in coll_sm that prevented that component from registering its MCA params!	2015-05-07 21:05:08 -07:00
Gilles Gouaillardet	2e384a3b65	initialize common symbols from orte A few uninitialized common symbols are remaining (generated by flex) : * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text	2015-05-08 10:11:58 +09:00
Ralph Castain	01a9bdf4cf	Cleanup of ud/oob component	2015-05-06 19:48:42 -07:00
Ralph Castain	1f8de276de	Consolidate all the QOS changes into one clean commit	2015-05-06 19:48:42 -07:00
Nadezhda Kogteva	01ce58391e	oob ud: made component more user adaptive; opal outputs were replaced by help messages.	2015-04-28 15:36:32 +03:00
Jeff Squyres	8fbf34b196	oob ud: put call to ibv_fork_init() before all ibv calls Move the call to opal_common_verbs_fork_test() to up before the call to ibv_get_device_list() (just curious -- why not use opal_ibv_get_device_list()?). This ensures that the call to ibv_fork_init() is before all other ibv_* calls.	2015-04-24 14:19:06 -07:00
Nathan Hjelm	45e053dbce	orte: use C99 subobject naming for component initialization This commit helps future-proof orte components by initializing each component member by name. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-18 10:29:58 -06:00
Nadezhda Kogteva - nadezhda.kogteva@itseez.com	c2678b0cc9	oob ud: fixes and parameter adjustment	2015-04-17 16:22:43 +03:00
Nathan Hjelm	3436f2917d	Merge pull request #449 from hjelmn/mca_base_update mca/base update	2015-04-16 08:41:48 -06:00
Howard Pritchard	283ef4c05d	oob/config: if --with-verbs=no, no ud The oob/ud configure was not honoring the case if the ompi is configured with --with-verbs=no. This fixes that problems. Fixes #522 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-04-14 06:31:18 -07:00
Ralph Castain	3e44d3c9e3	Enable singletons to run without any active OOB module until they attempt to comm_spawn	2015-04-10 14:06:42 -07:00
Ralph Castain	0c043dbdc9	Fix typo in var name	2015-04-02 02:32:42 -07:00
Ralph Castain	a4b466efc4	Support attempts to connect async processes by allowing the oob/tcp connection to retry the attempt to connect to a peer. Off by default, operates if someone specifies how long to wait between retry attempts.	2015-04-01 20:21:23 -07:00
Ralph Castain	d07dc362d5	Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common.	2015-03-28 20:34:26 -07:00
Ralph Castain	d2d02a1642	ckpt	2015-03-28 07:59:20 -07:00
Nathan Hjelm	b68d66bb9b	MCA: Add the project/project version to the MCA base component This commit adds support for project_framework_component_* parameter matching. This is the first step in allowing the same framework name in multiple projects. This change also bumps the MCA component version to 2.1.0. All master frameworks have been updated to use the new component versioning macro. An mca.h has been added to each project to add a project specific versioning macro of the form PROJECT_MCA_VERSION_2_1_0. Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-03-27 10:59:04 -06:00
rhc54	2ff7575dde	Merge pull request #497 from rhc54/topic/sec Allow for different security domains.	2015-03-25 21:01:29 -07:00
Ralph Castain	10cf455080	Tools need to use the TCP OOB component	2015-03-25 19:56:49 -07:00
Ralph Castain	1b24536941	Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail.	2015-03-25 13:22:01 -07:00
Ralph Castain	095a8fa684	We don't need to know about non-fatal errors from setting socket options	2015-03-20 07:16:31 -07:00
Howard Pritchard	6054975913	oob/alps: add configure file for alps oob Have to have alps rpms installed on a system for alps component to build, even if separated by a level of indirection. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-03-19 15:38:14 -07:00
Howard Pritchard	b1f31a4364	orte/oob: implement alps oob component Implement an almost-do-nothing alps oob component. When using aprun to launch a job on Cray system, there is no reason to need an oob system, since ompi relies on Cray PMI for oob communication. Fixes #484	2015-03-19 14:11:40 -07:00
Ralph Castain	a0487e014c	Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used.	2015-03-16 19:42:05 -07:00
Ralph Castain	64d11f170a	Adjust the default keepalive interval. Refactor the code when setting keepalive options	2015-03-16 12:32:58 -07:00
Ralph Castain	4ded049cbc	Modify MCA param description	2015-03-16 11:57:32 -07:00
Ralph Castain	019bba5caf	Cleanup a bit - don't need to lookup the protocol number if we just use the right define	2015-03-16 11:54:51 -07:00
Ralph Castain	69ac25bf55	Add support for TCP keepalive on inter-node sockets	2015-03-16 09:59:44 -07:00
Nathan Hjelm	695dcd5a28	oob/ud: fix compiler warning	2015-03-11 10:53:32 -06:00
Gilles Gouaillardet	a69d935d55	oob/tcp: fix misc issues as reported by Coverity with CIDs 70726, 710564, 1196630, 1269805, 1269803, 1269932	2015-03-10 19:32:01 +09:00
Gilles Gouaillardet	d1b2f043ff	fix misc memory leaks as already reported by Coverity with CIDs 71818, 71819, 72250, 715767, 1196749 and 1274002	2015-03-05 13:58:05 +09:00
Gilles Gouaillardet	d8f3b378b3	orte/oob: fix misc memory leaks as reported by Coverity as CIDs 1196748, 1196749 and 1269895	2015-03-02 15:31:11 +09:00
Mike Dubman	dbc15009b6	Merge pull request #415 from alinask/topic/fix_fork_support_flow Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.	2015-02-26 21:50:11 +02:00

610 commits