1
1

546 Коммитов

Автор SHA1 Сообщение Дата
Karol Mroz
5c11bdb251 orte: fixup hostname max length usage
Also removes orte specific max hostname value.

Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-04-25 07:08:23 +02:00
Gilles Gouaillardet
d757fbba5d oob/usock: drop message to be sent in process_send() 2016-04-04 16:04:54 +09:00
Gilles Gouaillardet
170734182b oob/usock: mca_oob_usock_peer_close() sets peer->sd = -1 after close()
so usock_peer_create_socket know it must re-create the socket
/* assuming it is ever supposed to occur */
also fix a typo (peer->sd >= 0) in usock_peer_create_socket
2016-04-04 16:02:05 +09:00
Ralph Castain
a3fea58d1c Minor cleanups to prior PR commit 2016-03-24 15:55:14 -07:00
rhc54
6756e19aa2 Merge pull request #1457 from anandhis/master
rml changes
2016-03-24 15:17:29 -07:00
Ralph Castain
6e6bbfda91 Very minor typo 2016-03-23 08:31:47 -07:00
Ralph Castain
c146c4969b Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
Fixes #1467

Restore debugger attach operations

Fixes #1225
2016-03-18 21:49:04 -07:00
Anandhi S Jayakumar
a31292abc7 fixes to ud for removing qos channel 2016-03-10 18:03:17 -08:00
Ralph Castain
a4c8e8c28a Cleanup the proposed change:
* qos framework is moving to the scon layer and is no longer required in ORTE

* remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought

* no need for separating the "base" from "API" module definition. The two are identical

* move the "stub" functions into their own file for cleanliness

* general cleanup to meet coding standards

* cleanup some logic in the stubs
2016-03-10 13:14:17 -08:00
Nysal Jan K.A
cc9b1316a4 Make UD OOB memory registrations a multiple of page size
If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK.
If we only partially use a page, any data allocated on the remainder of
the page will be inaccessible to the child process.

Fixes open-mpi/ompi#1363
2016-02-17 22:19:49 -05:00
Ralph Castain
351070659e Correct ordering when checking for privileged ports 2016-02-14 09:43:01 -08:00
Ralph Castain
233bd085ca Protect against a non-privileged port connecting to us when we are running as root
Don't close the listener socket upon error unless we are giving up

Cleanup the incoming socket
2016-02-13 08:07:27 -08:00
Igor Ivanov
34d861dfe9 orte/oob: Fix issue #1301
Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>
2016-01-20 12:08:00 +02:00
Ralph Castain
0a6b8d2c14 Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system 2015-12-30 07:16:43 -08:00
Ralph Castain
1cdc1c121c Revert "Standardize the handling of shutdown in the OOB TCP component"
This reverts commit open-mpi/ompi@12dccaa911.
2015-12-30 07:05:40 -08:00
Ralph Castain
12dccaa911 Standardize the handling of shutdown in the OOB TCP component 2015-12-29 07:57:22 -08:00
Federico Reghenzani
6536a6a9f5 oob_tcp: fix peer->state wrong check 2015-10-29 16:43:58 +01:00
John Westlund
044fea8df7 re-order != comparison, OBJ_RELEASE mca_oob_tcp_addr_t on failure 2015-10-02 15:59:48 -07:00
John Westlund
6bfaa925ec simplify use of sockaddr* structs to work around buffer overflow warning 2015-10-02 14:26:52 -07:00
Howard Pritchard
8d7e759b85 oob/alps: swat compiler warning
swat some alps related compiler warnings when using --enable-picky

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-21 14:24:26 -07:00
Ralph Castain
1b7930ad52 Silence some warnings and address Coverity issues 2015-09-16 07:58:22 -07:00
Ralph Castain
c1bbbb5e2f Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds 2015-09-15 13:08:35 -07:00
Ralph Castain
dc5796b8a1 Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local""
Fix the locality computation by correctly computing the vpid of the local peer

This reverts commit open-mpi/ompi@6a8fad49e5.
2015-09-11 08:29:51 -07:00
Ralph Castain
6a8fad49e5 Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"
This reverts commit f94f3cda214ab937c46802896fb53b84bec6cc3a.
2015-09-11 02:01:25 -07:00
Ralph Castain
f94f3cda21 Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local 2015-09-10 10:25:30 -07:00
Ralph Castain
0d5814b5ca Cleanup Coverity issues 2015-08-29 21:19:27 -07:00
Ralph Castain
cf6137b530 Integrate PMIx 1.0 with OMPI.
Bring Slurm PMI-1 component online
Bring the s2 component online

Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.

Bring the OMPI pubsub/pmi component online

Get comm_spawn working again

Ensure we always provide a cpuset, even if it is NULL

pmix/cray: adjust cray pmix component for pmix

Make changes so cray pmix can work within the integrated
ompi/pmix framework.

Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet

Cleanup comm_spawn - procs now starting, error in connect_accept

Complete integration
2015-08-29 16:04:10 -07:00
Ralph Castain
89c80b2294 Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener. 2015-08-27 16:41:00 -07:00
Nathan Hjelm
156ce6af21 periodic whitespace purge
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 09:32:33 -06:00
Ralph Castain
023936e84b Silence coverity warnings 2015-07-29 07:28:08 -07:00
Gilles Gouaillardet
429bdf1af7 oob/tcp: fix a race condition when finalizing the oob/tcp component 2015-07-28 09:16:13 +09:00
Ralph Castain
4352123c26 Protect the oob/tcp component from port scanners 2015-06-26 01:40:57 -07:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
6b93db6a9a Grrr...not sure how this slipped thru 2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184 Remove stale header 2015-05-29 19:24:51 -07:00
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Ralph Castain
bc7a7f3de5 Fix abnormal shutdown when a node dies 2015-05-22 17:29:06 -07:00
Jeff Squyres
3069daa015 oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK
Have only a single level of "if" conditionals.  Also, slightly change
the logic such that we only die/break out of the loop if we get EMFILE
-- all other errors are ok to go on to the next fd.

Finally, use a real show_help() message to warn when other errors occur.
2015-05-20 21:10:11 -04:00
Jeff Squyres
e43c8dc291 oob tcp: label a few #endif's
Only bother labeling the ones that are a little far away from their
corresponding #if statements.
2015-05-20 21:10:11 -04:00
Jeff Squyres
4b2f0d4827 oob tcp: reset MCA params from level 9
Set various MCA param levels
2015-05-20 21:10:11 -04:00
Jeff Squyres
1a4c9960e1 oob tcp: set KEEPALIVE timeout 60s, retry interval 5s
The timeout is frequency at which to send keepalive pings; the retry
interval is how often to send successive pings once a keepalive has
not replied.

Also update comments and MCA param help strings.

60 seconds -- squashme
2015-05-20 21:08:37 -04:00
Jeff Squyres
c95215dfc2 oob_tcp: do not set KEEPALIVE on listening sockets 2015-05-20 17:28:45 -04:00
Jeff Squyres
32d81af35f oob tcp: re-enable keepalive option for Mac
Plus very minor #if/#endif reduction.
2015-05-20 17:28:45 -04:00
rhc54
95c40e64b9 Merge pull request #584 from nkogteva/oob_ud_stress_test
oob ud: fixed a bug that prevented the work with QoS framework
2015-05-20 09:56:08 -06:00
Ralph Castain
d3d3e73099 Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket 2015-05-15 07:13:42 -06:00
Ralph Castain
0a345d34e6 Plug the memory leak identified by George 2015-05-14 21:33:48 -06:00
Howard Pritchard
578430c36d oob/alps: remove comment with personal reference
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-14 20:06:21 -07:00
Ralph Castain
8e30579e6e The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac 2015-05-14 18:09:13 -06:00
Nadezhda Kogteva
d9dcf8352e oob ud: fixed a bug that prevented the work with QoS framework (oob_stress_channel test) 2015-05-13 11:40:01 +03:00
Jeff Squyres
8e8d104520 oob ud: ibv_get_device_list()==NULL can mean no devices present
...which is not an error.  Don't complain about it.
2015-05-12 10:54:39 -07:00