openmpi

Author	SHA1	Message	Date
Karol Mroz	5c11bdb251	orte: fixup hostname max length usage Also removes orte specific max hostname value. Signed-off-by: Karol Mroz <mroz.karol@gmail.com>	2016-04-25 07:08:23 +02:00
Gilles Gouaillardet	d757fbba5d	oob/usock: drop message to be sent in process_send()	2016-04-04 16:04:54 +09:00
Gilles Gouaillardet	170734182b	oob/usock: mca_oob_usock_peer_close() sets peer->sd = -1 after close() so usock_peer_create_socket know it must re-create the socket /* assuming it is ever supposed to occur */ also fix a typo (peer->sd >= 0) in usock_peer_create_socket	2016-04-04 16:02:05 +09:00
Ralph Castain	a3fea58d1c	Minor cleanups to prior PR commit	2016-03-24 15:55:14 -07:00
rhc54	6756e19aa2	Merge pull request #1457 from anandhis/master rml changes	2016-03-24 15:17:29 -07:00
Ralph Castain	6e6bbfda91	Very minor typo	2016-03-23 08:31:47 -07:00
Ralph Castain	c146c4969b	Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation. Fixes #1467 Restore debugger attach operations Fixes #1225	2016-03-18 21:49:04 -07:00
Anandhi S Jayakumar	a31292abc7	fixes to ud for removing qos channel	2016-03-10 18:03:17 -08:00
Ralph Castain	a4c8e8c28a	Cleanup the proposed change: * qos framework is moving to the scon layer and is no longer required in ORTE * remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought * no need for separating the "base" from "API" module definition. The two are identical * move the "stub" functions into their own file for cleanliness * general cleanup to meet coding standards * cleanup some logic in the stubs	2016-03-10 13:14:17 -08:00
Nysal Jan K.A	cc9b1316a4	Make UD OOB memory registrations a multiple of page size If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK. If we only partially use a page, any data allocated on the remainder of the page will be inaccessible to the child process. Fixes open-mpi/ompi#1363	2016-02-17 22:19:49 -05:00
Ralph Castain	351070659e	Correct ordering when checking for privileged ports	2016-02-14 09:43:01 -08:00
Ralph Castain	233bd085ca	Protect against a non-privileged port connecting to us when we are running as root Don't close the listener socket upon error unless we are giving up Cleanup the incoming socket	2016-02-13 08:07:27 -08:00
Igor Ivanov	34d861dfe9	orte/oob: Fix issue #1301 Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>	2016-01-20 12:08:00 +02:00
Ralph Castain	0a6b8d2c14	Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system	2015-12-30 07:16:43 -08:00
Ralph Castain	1cdc1c121c	Revert "Standardize the handling of shutdown in the OOB TCP component" This reverts commit open-mpi/ompi@12dccaa911.	2015-12-30 07:05:40 -08:00
Ralph Castain	12dccaa911	Standardize the handling of shutdown in the OOB TCP component	2015-12-29 07:57:22 -08:00
Federico Reghenzani	6536a6a9f5	oob_tcp: fix peer->state wrong check	2015-10-29 16:43:58 +01:00
John Westlund	044fea8df7	re-order != comparison, OBJ_RELEASE mca_oob_tcp_addr_t on failure	2015-10-02 15:59:48 -07:00
John Westlund	6bfaa925ec	simplify use of sockaddr* structs to work around buffer overflow warning	2015-10-02 14:26:52 -07:00
Howard Pritchard	8d7e759b85	oob/alps: swat compiler warning swat some alps related compiler warnings when using --enable-picky Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-09-21 14:24:26 -07:00
Ralph Castain	1b7930ad52	Silence some warnings and address Coverity issues	2015-09-16 07:58:22 -07:00
Ralph Castain	c1bbbb5e2f	Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds	2015-09-15 13:08:35 -07:00
Ralph Castain	dc5796b8a1	Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"" Fix the locality computation by correctly computing the vpid of the local peer This reverts commit open-mpi/ompi@6a8fad49e5.	2015-09-11 08:29:51 -07:00
Ralph Castain	6a8fad49e5	Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local" This reverts commit f94f3cda214ab937c46802896fb53b84bec6cc3a.	2015-09-11 02:01:25 -07:00
Ralph Castain	f94f3cda21	Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local	2015-09-10 10:25:30 -07:00
Ralph Castain	0d5814b5ca	Cleanup Coverity issues	2015-08-29 21:19:27 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Ralph Castain	89c80b2294	Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener.	2015-08-27 16:41:00 -07:00
Nathan Hjelm	156ce6af21	periodic whitespace purge Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 09:32:33 -06:00
Ralph Castain	023936e84b	Silence coverity warnings	2015-07-29 07:28:08 -07:00
Gilles Gouaillardet	429bdf1af7	oob/tcp: fix a race condition when finalizing the oob/tcp component	2015-07-28 09:16:13 +09:00
Ralph Castain	4352123c26	Protect the oob/tcp component from port scanners	2015-06-26 01:40:57 -07:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Ralph Castain	6b93db6a9a	Grrr...not sure how this slipped thru	2015-05-29 19:37:24 -07:00
Ralph Castain	bac308b184	Remove stale header	2015-05-29 19:24:51 -07:00
Ralph Castain	ea35e47228	Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail. Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time. We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later. This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.	2015-05-29 14:37:14 -07:00
Ralph Castain	bc7a7f3de5	Fix abnormal shutdown when a node dies	2015-05-22 17:29:06 -07:00
Jeff Squyres	3069daa015	oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK Have only a single level of "if" conditionals. Also, slightly change the logic such that we only die/break out of the loop if we get EMFILE -- all other errors are ok to go on to the next fd. Finally, use a real show_help() message to warn when other errors occur.	2015-05-20 21:10:11 -04:00
Jeff Squyres	e43c8dc291	oob tcp: label a few #endif's Only bother labeling the ones that are a little far away from their corresponding #if statements.	2015-05-20 21:10:11 -04:00
Jeff Squyres	4b2f0d4827	oob tcp: reset MCA params from level 9 Set various MCA param levels	2015-05-20 21:10:11 -04:00
Jeff Squyres	1a4c9960e1	oob tcp: set KEEPALIVE timeout 60s, retry interval 5s The timeout is frequency at which to send keepalive pings; the retry interval is how often to send successive pings once a keepalive has not replied. Also update comments and MCA param help strings. 60 seconds -- squashme	2015-05-20 21:08:37 -04:00
Jeff Squyres	c95215dfc2	oob_tcp: do not set KEEPALIVE on listening sockets	2015-05-20 17:28:45 -04:00
Jeff Squyres	32d81af35f	oob tcp: re-enable keepalive option for Mac Plus very minor #if/#endif reduction.	2015-05-20 17:28:45 -04:00
rhc54	95c40e64b9	Merge pull request #584 from nkogteva/oob_ud_stress_test oob ud: fixed a bug that prevented the work with QoS framework	2015-05-20 09:56:08 -06:00
Ralph Castain	d3d3e73099	Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket	2015-05-15 07:13:42 -06:00
Ralph Castain	0a345d34e6	Plug the memory leak identified by George	2015-05-14 21:33:48 -06:00
Howard Pritchard	578430c36d	oob/alps: remove comment with personal reference Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-05-14 20:06:21 -07:00
Ralph Castain	8e30579e6e	The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac	2015-05-14 18:09:13 -06:00
Nadezhda Kogteva	d9dcf8352e	oob ud: fixed a bug that prevented the work with QoS framework (oob_stress_channel test)	2015-05-13 11:40:01 +03:00
Jeff Squyres	8e8d104520	oob ud: ibv_get_device_list()==NULL can mean no devices present ...which is not an error. Don't complain about it.	2015-05-12 10:54:39 -07:00

546 commits