1
1

402 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
156ce6af21 periodic whitespace purge
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 09:32:33 -06:00
Ralph Castain
023936e84b Silence coverity warnings 2015-07-29 07:28:08 -07:00
Gilles Gouaillardet
429bdf1af7 oob/tcp: fix a race condition when finalizing the oob/tcp component 2015-07-28 09:16:13 +09:00
Ralph Castain
4352123c26 Protect the oob/tcp component from port scanners 2015-06-26 01:40:57 -07:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
bc7a7f3de5 Fix abnormal shutdown when a node dies 2015-05-22 17:29:06 -07:00
Jeff Squyres
3069daa015 oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK
Have only a single level of "if" conditionals.  Also, slightly change
the logic such that we only die/break out of the loop if we get EMFILE
-- all other errors are ok to go on to the next fd.

Finally, use a real show_help() message to warn when other errors occur.
2015-05-20 21:10:11 -04:00
Jeff Squyres
e43c8dc291 oob tcp: label a few #endif's
Only bother labeling the ones that are a little far away from their
corresponding #if statements.
2015-05-20 21:10:11 -04:00
Jeff Squyres
4b2f0d4827 oob tcp: reset MCA params from level 9
Set various MCA param levels
2015-05-20 21:10:11 -04:00
Jeff Squyres
1a4c9960e1 oob tcp: set KEEPALIVE timeout 60s, retry interval 5s
The timeout is frequency at which to send keepalive pings; the retry
interval is how often to send successive pings once a keepalive has
not replied.

Also update comments and MCA param help strings.

60 seconds -- squashme
2015-05-20 21:08:37 -04:00
Jeff Squyres
c95215dfc2 oob_tcp: do not set KEEPALIVE on listening sockets 2015-05-20 17:28:45 -04:00
Jeff Squyres
32d81af35f oob tcp: re-enable keepalive option for Mac
Plus very minor #if/#endif reduction.
2015-05-20 17:28:45 -04:00
Ralph Castain
d3d3e73099 Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket 2015-05-15 07:13:42 -06:00
Ralph Castain
0a345d34e6 Plug the memory leak identified by George 2015-05-14 21:33:48 -06:00
Ralph Castain
8e30579e6e The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac 2015-05-14 18:09:13 -06:00
Ralph Castain
3cee4152fc Fix the intercommunictor issue reported by Gilles. Instead of directly checking the reachability bitmap, ask the component if the proc is reachable when doing a send as the component is the final arbiter in such cases. Recirculate any messages that a daemon is trying to send to void race conditions. Cleanup listener sockets so we don't leak them 2015-05-11 09:16:25 -07:00
Ralph Castain
b5382c9bf9 Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel.
A few other minor cleanups included.
2015-05-08 11:15:21 -07:00
Ralph Castain
6e95bcd583 Fix typo in oob_tcp.c when IPV6 enabled. Cleanup a few other warnings, including a type in coll_sm that prevented that component from registering its MCA params! 2015-05-07 21:05:08 -07:00
Ralph Castain
1f8de276de Consolidate all the QOS changes into one clean commit 2015-05-06 19:48:42 -07:00
Nathan Hjelm
45e053dbce orte: use C99 subobject naming for component initialization
This commit helps future-proof orte components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
Ralph Castain
0c043dbdc9 Fix typo in var name 2015-04-02 02:32:42 -07:00
Ralph Castain
a4b466efc4 Support attempts to connect async processes by allowing the oob/tcp connection to retry the attempt to connect to a peer. Off by default, operates if someone specifies how long to wait between retry attempts. 2015-04-01 20:21:23 -07:00
Ralph Castain
d07dc362d5 Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common. 2015-03-28 20:34:26 -07:00
Ralph Castain
d2d02a1642 ckpt 2015-03-28 07:59:20 -07:00
rhc54
2ff7575dde Merge pull request #497 from rhc54/topic/sec
Allow for different security domains.
2015-03-25 21:01:29 -07:00
Ralph Castain
10cf455080 Tools need to use the TCP OOB component 2015-03-25 19:56:49 -07:00
Ralph Castain
1b24536941 Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail. 2015-03-25 13:22:01 -07:00
Ralph Castain
095a8fa684 We don't need to know about non-fatal errors from setting socket options 2015-03-20 07:16:31 -07:00
Ralph Castain
a0487e014c Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used. 2015-03-16 19:42:05 -07:00
Ralph Castain
64d11f170a Adjust the default keepalive interval. Refactor the code when setting keepalive options 2015-03-16 12:32:58 -07:00
Ralph Castain
4ded049cbc Modify MCA param description 2015-03-16 11:57:32 -07:00
Ralph Castain
019bba5caf Cleanup a bit - don't need to lookup the protocol number if we just use the right define 2015-03-16 11:54:51 -07:00
Ralph Castain
69ac25bf55 Add support for TCP keepalive on inter-node sockets 2015-03-16 09:59:44 -07:00
Gilles Gouaillardet
a69d935d55 oob/tcp: fix misc issues
as reported by Coverity with CIDs 70726, 710564,
1196630, 1269805, 1269803, 1269932
2015-03-10 19:32:01 +09:00
Gilles Gouaillardet
d1b2f043ff fix misc memory leaks
as already reported by Coverity with CIDs
71818, 71819, 72250, 715767, 1196749 and 1274002
2015-03-05 13:58:05 +09:00
Gilles Gouaillardet
d8f3b378b3 orte/oob: fix misc memory leaks
as reported by Coverity as CIDs 1196748, 1196749 and 1269895
2015-03-02 15:31:11 +09:00
Jeff Squyres
71ae0ad5ec oob_tcp_component: add #if OPAL_ENABLE_IPV6 around IPv6-specific code
This was CID 1196629
2015-02-24 15:24:11 -05:00
Howard Pritchard
bf89131f9e add owner files to opa/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Ralph Castain
d2938a144f Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch 2015-01-12 05:31:02 -08:00
Ralph Castain
6c4d5a51c4 Close tcp sockets upon exec 2014-12-13 20:23:53 -08:00
Ralph Castain
780c93ee57 Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
2014-11-11 17:00:42 -08:00
Gilles Gouaillardet
652ecdb888 oob/tcp: always include a missing header file
improve open-mpi/ompi@c9d1e16a9e
2014-10-29 13:39:23 +09:00
Gilles Gouaillardet
c9d1e16a9e oob/tcp: include a missing header file
warning can be seen under cygwin without the missing header file
2014-10-28 13:56:25 +09:00
Gilles Gouaillardet
9661e4537f oob/tcp: fix a race condition
Mimick the btl/tcp protocol to solve the race condition that happens
when two peers try to connect to each other at the same time

cmr=v1.8.4:reviewer=rhc

This commit was SVN r32799.
2014-09-26 06:54:30 +00:00
Gilles Gouaillardet
5fa2b6c59c oob/tcp: fix a race condition
Refs trac:4909

This commit was SVN r32754.

The following Trac tickets were found above:
  Ticket 4909 --> https://svn.open-mpi.org/trac/ompi/ticket/4909
2014-09-18 08:17:25 +00:00
Ralph Castain
3a437cbdb3 Silence set-but-not-used warning when timing isn't enabled
This commit was SVN r32749.
2014-09-17 00:40:10 +00:00
Ralph Castain
414f4e9783 Try to provide a real hostname for the remote host to aid in debugging
Refs trac:4908

This commit was SVN r32748.

The following Trac tickets were found above:
  Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908
2014-09-17 00:39:49 +00:00
Jeff Squyres
9dc49c5f92 oob_tcp_connection: print "<unknown>" instead of "NULL"
"NULL" doesn't meany anything to the user, and is somewhat confusing
to see in an error message.  "<unknown>" at least indicates that
there's an error, and we know who the peer is.

This commit was SVN r32747.
2014-09-16 22:47:57 +00:00
Ralph Castain
09aecea55a Can't use show_help as the RML has already been enabled, but we haven't successfully connected back to the HNP. So use opal_output instead and hardwire the message.
Refs trac:4908

This commit was SVN r32746.

The following Trac tickets were found above:
  Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908
2014-09-16 22:21:02 +00:00
Ralph Castain
4bbc9a28d6 Try to resolve the simultaneous connection problem by being a little more careful about the choice of returned status when a connection is refused. As before, have the higher vpid of the two peers retry the connection, while the lower one waits. This can happen in a couple of places, so try to hit them all. Since this is hard to test, will ask Gilles to give it a try since he's the one who is seeing it.
cmr=v1.8.3:reviewer=rhc

This commit was SVN r32744.
2014-09-16 18:59:36 +00:00