Ralph Castain
bc7a7f3de5
Fix abnormal shutdown when a node dies
2015-05-22 17:29:06 -07:00
Jeff Squyres
3069daa015
oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK
...
Have only a single level of "if" conditionals. Also, slightly change
the logic such that we only die/break out of the loop if we get EMFILE
-- all other errors are ok to go on to the next fd.
Finally, use a real show_help() message to warn when other errors occur.
2015-05-20 21:10:11 -04:00
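The error handling described in the commit above can be sketched as a single flat decision on the errno left by a failed accept(). This is only an illustration of the stated logic, not the code in oob_tcp_listener.c: the enum, its values, and the function name are hypothetical, and treating EAGAIN/EWOULDBLOCK as "no pending connection" is an assumption based on the commit subject.

```c
#include <errno.h>

/* Hypothetical names for illustration; none come from the Open MPI sources. */
typedef enum {
    ACCEPT_NO_PENDING,    /* nothing to accept right now; not an error (assumed) */
    ACCEPT_STOP,          /* out of file descriptors: break out of the loop */
    ACCEPT_WARN_CONTINUE  /* warn the user, then go on to the next fd */
} accept_action_t;

/* Map the errno from a failed accept() to an action using one level of "if". */
static accept_action_t classify_accept_errno(int err)
{
    /* EAGAIN and EWOULDBLOCK may or may not share a value, so check both. */
    if (EAGAIN == err || EWOULDBLOCK == err) {
        return ACCEPT_NO_PENDING;
    } else if (EMFILE == err) {
        return ACCEPT_STOP;
    } else {
        return ACCEPT_WARN_CONTINUE;
    }
}
```

In a listener loop, only ACCEPT_STOP would end the loop; ACCEPT_WARN_CONTINUE is where a help message would be shown before moving to the next listening fd.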
Jeff Squyres
e43c8dc291
oob tcp: label a few #endif's
...
Only bother labeling the ones that are a little far away from their
corresponding #if statements.
2015-05-20 21:10:11 -04:00
Jeff Squyres
4b2f0d4827
oob tcp: reset MCA params from level 9
...
Set various MCA param levels
2015-05-20 21:10:11 -04:00
Jeff Squyres
1a4c9960e1
oob tcp: set KEEPALIVE timeout 60s, retry interval 5s
...
The timeout is the frequency at which to send keepalive pings; the retry
interval is how often to send successive pings once a keepalive has
not replied.
Also update comments and MCA param help strings.
60 seconds -- squashme
2015-05-20 21:08:37 -04:00
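For readers unfamiliar with the two settings named in the commit above, the following is a minimal sketch of how a keepalive timeout and retry interval are commonly applied to a TCP socket with setsockopt(). It is not the oob/tcp implementation: the function name is hypothetical, the probe count is an assumed extra parameter that the commit does not mention, and the option names are guarded because they differ between platforms (TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Hypothetical helper: enable keepalive on a connected (not listening) socket.
 * idle_secs  - idle time before the first keepalive ping (e.g. 60)
 * intvl_secs - interval between successive pings once one goes unanswered (e.g. 5)
 * max_probes - unanswered pings before the connection is dropped (assumed)
 * Returns 0 on success, -1 on failure. */
static int sketch_set_keepalive(int sd, int idle_secs, int intvl_secs, int max_probes)
{
    int on = 1;

    /* Never operate on a negative socket descriptor. */
    if (sd < 0) {
        return -1;
    }
    if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }
#if defined(TCP_KEEPIDLE)
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) < 0) {
        return -1;
    }
#elif defined(TCP_KEEPALIVE)
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPALIVE, &idle_secs, sizeof(idle_secs)) < 0) {
        return -1;
    }
#else
    (void)idle_secs;
#endif
#if defined(TCP_KEEPINTVL)
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs, sizeof(intvl_secs)) < 0) {
        return -1;
    }
#else
    (void)intvl_secs;
#endif
#if defined(TCP_KEEPCNT)
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPCNT, &max_probes, sizeof(max_probes)) < 0) {
        return -1;
    }
#else
    (void)max_probes;
#endif
    return 0;
}
```

With the values from the commit subject, a call would look like sketch_set_keepalive(sd, 60, 5, 3), where the final 3 is an arbitrary illustrative probe count.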
Jeff Squyres
c95215dfc2
oob_tcp: do not set KEEPALIVE on listening sockets
2015-05-20 17:28:45 -04:00
Jeff Squyres
32d81af35f
oob tcp: re-enable keepalive option for Mac
...
Plus very minor #if/#endif reduction.
2015-05-20 17:28:45 -04:00
rhc54
95c40e64b9
Merge pull request #584 from nkogteva/oob_ud_stress_test
...
oob ud: fixed a bug that prevented the work with QoS framework
2015-05-20 09:56:08 -06:00
Ralph Castain
d3d3e73099
Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket
2015-05-15 07:13:42 -06:00
Ralph Castain
0a345d34e6
Plug the memory leak identified by George
2015-05-14 21:33:48 -06:00
Howard Pritchard
578430c36d
oob/alps: remove comment with personal reference
...
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-14 20:06:21 -07:00
Ralph Castain
8e30579e6e
The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac
2015-05-14 18:09:13 -06:00
Nadezhda Kogteva
d9dcf8352e
oob ud: fixed a bug that prevented the work with QoS framework (oob_stress_channel test)
2015-05-13 11:40:01 +03:00
Jeff Squyres
8e8d104520
oob ud: ibv_get_device_list()==NULL can mean no devices present
...
...which is not an error. Don't complain about it.
2015-05-12 10:54:39 -07:00
Jeff Squyres
8f941a6613
oob ud: better error msgs, tolerate systems without UD devices
...
It is perfectly ok to be on a system without UD devices.
Also, make some of the error messages better -- so that the user has a
clue about where the error messages are coming from, and what they
should do.
2015-05-11 13:11:51 -07:00
Mike Dubman
894ba28390
Merge pull request #559 from nkogteva/oob_ud
...
oob ud: made component more user adaptive; opal outputs were replaced by...
2015-05-11 21:09:28 +03:00
Ralph Castain
3cee4152fc
Fix the intercommunicator issue reported by Gilles. Instead of directly checking the reachability bitmap, ask the component if the proc is reachable when doing a send as the component is the final arbiter in such cases. Recirculate any messages that a daemon is trying to send to avoid race conditions. Cleanup listener sockets so we don't leak them
2015-05-11 09:16:25 -07:00
Ralph Castain
b5382c9bf9
Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel.
...
A few other minor cleanups included.
2015-05-08 11:15:21 -07:00
Ralph Castain
6e95bcd583
Fix typo in oob_tcp.c when IPV6 enabled. Cleanup a few other warnings, including a typo in coll_sm that prevented that component from registering its MCA params!
2015-05-07 21:05:08 -07:00
Gilles Gouaillardet
2e384a3b65
initialize common symbols from orte
...
A few uninitialized common symbols are remaining (generated by flex) :
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
01a9bdf4cf
Cleanup of ud/oob component
2015-05-06 19:48:42 -07:00
Ralph Castain
1f8de276de
Consolidate all the QOS changes into one clean commit
2015-05-06 19:48:42 -07:00
Nadezhda Kogteva
01ce58391e
oob ud: made component more user adaptive; opal outputs were replaced by help messages.
2015-04-28 15:36:32 +03:00
Jeff Squyres
8fbf34b196
oob ud: put call to ibv_fork_init() before *all* ibv calls
...
Move the call to opal_common_verbs_fork_test() up to before the call
to ibv_get_device_list() (just curious -- why not use
opal_ibv_get_device_list()?). This ensures that the call to
ibv_fork_init() is before *all* other ibv_* calls.
2015-04-24 14:19:06 -07:00
Nathan Hjelm
45e053dbce
orte: use C99 subobject naming for component initialization
...
This commit helps future-proof orte components by initializing each
component member by name.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
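The initialization style referred to in the commit above is C99 designated initializers, where each struct member is set by name instead of by position. The sketch below shows the technique on a made-up struct; the type, its fields, and the values are hypothetical and do not reflect the actual ORTE component structures.

```c
#include <stddef.h>

/* Hypothetical component descriptor, for illustration only. */
typedef struct {
    const char *name;
    int major_version;
    int minor_version;
    int (*init)(void);
    int (*finalize)(void);
} sketch_component_t;

static int sketch_init(void)
{
    return 0;
}

/* Members are initialized by name, so adding or reordering fields in the
 * struct later does not silently shift values into the wrong slots. */
static const sketch_component_t sketch_component = {
    .name = "sketch",
    .major_version = 2,
    .minor_version = 1,
    .init = sketch_init,
    /* .finalize is not named, so C99 initializes it to NULL. */
};
```

This is what makes the style robust to future changes: a positional initializer would need every field listed in declaration order, while the named form leaves unlisted members zero-initialized.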
Nadezhda Kogteva - nadezhda.kogteva@itseez.com
c2678b0cc9
oob ud: fixes and parameter adjustment
2015-04-17 16:22:43 +03:00
Nathan Hjelm
3436f2917d
Merge pull request #449 from hjelmn/mca_base_update
...
mca/base update
2015-04-16 08:41:48 -06:00
Howard Pritchard
283ef4c05d
oob/config: if --with-verbs=no, no ud
...
The oob/ud configure was not honoring the case
where ompi is configured with --with-verbs=no.
This fixes that problem.
Fixes #522
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-04-14 06:31:18 -07:00
Ralph Castain
3e44d3c9e3
Enable singletons to run without any active OOB module until they attempt to comm_spawn
2015-04-10 14:06:42 -07:00
Ralph Castain
0c043dbdc9
Fix typo in var name
2015-04-02 02:32:42 -07:00
Ralph Castain
a4b466efc4
Support attempts to connect async processes by allowing the oob/tcp connection to retry the attempt to connect to a peer. Off by default, operates if someone specifies how long to wait between retry attempts.
2015-04-01 20:21:23 -07:00
Ralph Castain
d07dc362d5
Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common.
2015-03-28 20:34:26 -07:00
Ralph Castain
d2d02a1642
ckpt
2015-03-28 07:59:20 -07:00
Nathan Hjelm
b68d66bb9b
MCA: Add the project/project version to the MCA base component
...
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.
All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
rhc54
2ff7575dde
Merge pull request #497 from rhc54/topic/sec
...
Allow for different security domains.
2015-03-25 21:01:29 -07:00
Ralph Castain
10cf455080
Tools need to use the TCP OOB component
2015-03-25 19:56:49 -07:00
Ralph Castain
1b24536941
Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail.
2015-03-25 13:22:01 -07:00
Ralph Castain
095a8fa684
We don't need to know about non-fatal errors from setting socket options
2015-03-20 07:16:31 -07:00
Howard Pritchard
6054975913
oob/alps: add configure file for alps oob
...
Have to have alps rpms installed on a system
for alps component to build, even if separated
by a level of indirection.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-03-19 15:38:14 -07:00
Howard Pritchard
b1f31a4364
orte/oob: implement alps oob component
...
Implement an almost-do-nothing alps oob component.
When using aprun to launch a job on Cray system,
there is no reason to need an oob system, since ompi
relies on Cray PMI for oob communication.
Fixes #484
2015-03-19 14:11:40 -07:00
Ralph Castain
a0487e014c
Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used.
2015-03-16 19:42:05 -07:00
Ralph Castain
64d11f170a
Adjust the default keepalive interval. Refactor the code when setting keepalive options
2015-03-16 12:32:58 -07:00
Ralph Castain
4ded049cbc
Modify MCA param description
2015-03-16 11:57:32 -07:00
Ralph Castain
019bba5caf
Cleanup a bit - don't need to lookup the protocol number if we just use the right define
2015-03-16 11:54:51 -07:00
Ralph Castain
69ac25bf55
Add support for TCP keepalive on inter-node sockets
2015-03-16 09:59:44 -07:00
Nathan Hjelm
695dcd5a28
oob/ud: fix compiler warning
2015-03-11 10:53:32 -06:00
Gilles Gouaillardet
a69d935d55
oob/tcp: fix misc issues
...
as reported by Coverity with CIDs 70726, 710564,
1196630, 1269805, 1269803, 1269932
2015-03-10 19:32:01 +09:00
Gilles Gouaillardet
d1b2f043ff
fix misc memory leaks
...
as already reported by Coverity with CIDs
71818, 71819, 72250, 715767, 1196749 and 1274002
2015-03-05 13:58:05 +09:00
Gilles Gouaillardet
d8f3b378b3
orte/oob: fix misc memory leaks
...
as reported by Coverity as CIDs 1196748, 1196749 and 1269895
2015-03-02 15:31:11 +09:00
Mike Dubman
dbc15009b6
Merge pull request #415 from alinask/topic/fix_fork_support_flow
...
Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
2015-02-26 21:50:11 +02:00