rhc54
c6bb227073
Merge pull request #692 from rhc54/topic/mapper
...
Fix hetero operations. An error in the hwloc utilities only allocated…
2015-07-07 13:33:42 -07:00
Ralph Castain
ed93154e43
Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line.
2015-07-07 12:52:16 -07:00
rhc54
a4aff5e3d9
Merge pull request #691 from rhc54/topic/mapper
...
Add a bunch of debug, and correct an error that caused us to use the …
2015-07-07 11:08:01 -07:00
Ralph Castain
7455802a36
Add a bunch of debug, and correct an error that caused us to use the wrong mapping policy when determining the default binding policy
2015-07-07 10:13:10 -07:00
Gilles Gouaillardet
409874eb47
remove trigraph '??)' from comment
...
Fujitsu compilers issue way too many warnings because of this trigraph
2015-07-07 11:00:13 +09:00
Ralph Castain
836f49597d
There is no reason for tools to have an async progress thread as they can loop the event library themselves. This has the added benefit of causing the tool to "block" while waiting for events so they don't use cpu.
...
Also, fix orte-submit so it appropriately handles --help option
2015-07-05 10:45:28 -07:00
Ralph Castain
6829e192ad
Okay, that's it - trash it
2015-07-01 05:27:30 -05:00
Ralph Castain
6cd3ccd305
Update the OMP support per request from IBM and LLNL
2015-06-30 10:24:34 -05:00
Ralph Castain
a58171a974
Add some debug
2015-06-29 14:51:41 -05:00
Ralph Castain
a4557d4ed2
Add new component to support OpenMP envars per request from IBM and LLNL
2015-06-27 17:57:04 -07:00
Ralph Castain
4352123c26
Protect the oob/tcp component from port scanners
2015-06-26 01:40:57 -07:00
Nathan Hjelm
ee36d813dc
Merge pull request #657 from hjelmn/c99
...
more c99 updates
2015-06-25 11:21:09 -06:00
Nathan Hjelm
4d92c9989e
more c99 updates
...
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Howard Pritchard
e49a37c034
ownership: update ownership files
...
per discussions at OMPI devel workshop
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-25 10:04:42 -06:00
Ralph Castain
014a6a5969
Initialize variable to make clang happy
2015-06-24 22:01:09 -07:00
Ralph Castain
869041f770
Purge whitespace from the repo
2015-06-23 20:59:57 -07:00
Ralph Castain
db3c59b943
Silence a warning by converting the bitmap to a string prior to printing the error
2015-06-23 11:49:11 -07:00
Ralph Castain
706884652f
Silence Coverity warning about failing to check return code
2015-06-17 19:24:51 -07:00
Ralph Castain
869b2891c4
When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary
2015-06-17 09:20:08 -07:00
Gilles Gouaillardet
b72e9288bc
rmaps: fix a misc memory leak
...
as reported by Coverity with CID 1269887
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
a43abceb88
fix dfs misc memory leaks
...
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709
2015-06-17 11:17:54 +09:00
rhc54
adbff46a13
Merge pull request #642 from rhc54/topic/hwloc
...
Update hwloc to 1.11.0
2015-06-13 12:09:58 -07:00
Ralph Castain
ff92781ec4
Replace hwloc191 with hwloc1110
...
Fix hwloc compile. Ignore LAMA mapper due to deprecated hwloc functions
2015-06-13 10:11:45 -07:00
Ralph Castain
cebdf0b7c0
Add missing include
2015-06-09 22:08:05 -07:00
Howard Pritchard
05325b113e
odls/alps: fix busted build for cray.
...
This commit fixes things broken by commit ea35e47.
Fixes #616
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Ralph Castain
6b93db6a9a
Grrr...not sure how this slipped thru
2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184
Remove stale header
2015-05-29 19:24:51 -07:00
Ralph Castain
ea35e47228
Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail.
...
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.
We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.
This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Ralph Castain
c21cd1c91e
Ensure the ssh session is dead
2015-05-23 08:14:29 -07:00
Ralph Castain
920562d9b4
Ensure that all ssh sessions are terminated when abnormally terminating the job
2015-05-23 08:14:29 -07:00
Jeff Squyres
5e52ce26b5
help-errmgr-base.txt: remove trailing newline
...
Removed spurious newline at end of file so that the emitted help
message doesn't contain a blank line before the final "-----" output.
2015-05-23 03:33:23 -07:00
Ralph Castain
55cd2a07f6
Update exit code
2015-05-22 21:06:43 -07:00
Ralph Castain
3510bb4ced
Set the exit code when a daemon fails
2015-05-22 21:05:23 -07:00
Ralph Castain
bc7a7f3de5
Fix abnormal shutdown when a node dies
2015-05-22 17:29:06 -07:00
Ralph Castain
96cd42699e
Cleanup warnings for uninitialized vars and convert bare debug output to verbose
2015-05-21 07:41:26 -07:00
Jeff Squyres
3069daa015
oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK
...
Have only a single level of "if" conditionals. Also, slightly change
the logic such that we only die/break out of the loop if we get EMFILE
-- all other errors are ok to go on to the next fd.
Finally, use a real show_help() message to warn when other errors occur.
2015-05-20 21:10:11 -04:00
Jeff Squyres
e43c8dc291
oob tcp: label a few #endif's
...
Only bother labeling the ones that are a little far away from their
corresponding #if statements.
2015-05-20 21:10:11 -04:00
Jeff Squyres
4b2f0d4827
oob tcp: reset MCA params from level 9
...
Set various MCA param levels
2015-05-20 21:10:11 -04:00
Jeff Squyres
1a4c9960e1
oob tcp: set KEEPALIVE timeout 60s, retry interval 5s
...
The timeout is the frequency at which to send keepalive pings; the retry
interval is how often to send successive pings once a keepalive has
not been answered.
Also update comments and MCA param help strings.
60 seconds -- squashme
2015-05-20 21:08:37 -04:00
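The keepalive settings described in the commit above can be sketched with plain socket options. This is a minimal sketch, not the oob/tcp code: `enable_keepalive` is a hypothetical helper, and mapping the commit's "timeout" to `TCP_KEEPIDLE` and its "retry interval" to `TCP_KEEPINTVL` is an assumption. Both options are platform-specific, so each is guarded by a preprocessor check.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: enable TCP keepalive on a connected
 * (non-listening) socket.
 *   idle_secs:  idle time before the first probe is sent
 *   intvl_secs: gap between successive probes when none is answered
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int enable_keepalive(int fd, int idle_secs, int intvl_secs)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }
#if defined(TCP_KEEPIDLE)
    /* Assumed to correspond to the commit's 60 s "timeout". */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0) {
        return -1;
    }
#endif
#if defined(TCP_KEEPINTVL)
    /* Assumed to correspond to the commit's 5 s "retry interval". */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl_secs, sizeof(intvl_secs)) < 0) {
        return -1;
    }
#endif
    return 0;
}
```

With the values from the commit subject, a caller would use `enable_keepalive(fd, 60, 5)` on each connected peer socket and skip listening sockets.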
Jeff Squyres
c95215dfc2
oob_tcp: do not set KEEPALIVE on listening sockets
2015-05-20 17:28:45 -04:00
Jeff Squyres
32d81af35f
oob tcp: re-enable keepalive option for Mac
...
Plus very minor #if/#endif reduction.
2015-05-20 17:28:45 -04:00
rhc54
95c40e64b9
Merge pull request #584 from nkogteva/oob_ud_stress_test
...
oob ud: fixed a bug that prevented the work with QoS framework
2015-05-20 09:56:08 -06:00
Gilles Gouaillardet
dd28b1f680
orted/dfs: fix misc memory leaks
...
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709 and 1269849
2015-05-20 13:09:46 +09:00
Ralph Castain
d3d3e73099
Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket
2015-05-15 07:13:42 -06:00
Ralph Castain
0a345d34e6
Plug the memory leak identified by George
2015-05-14 21:33:48 -06:00
Howard Pritchard
578430c36d
oob/alps: remove comment with personal reference
...
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-14 20:06:21 -07:00
Ralph Castain
8e30579e6e
The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac
2015-05-14 18:09:13 -06:00
Nadezhda Kogteva
d9dcf8352e
oob ud: fixed a bug that prevented the work with QoS framework (oob_stress_channel test)
2015-05-13 11:40:01 +03:00
Jeff Squyres
8e8d104520
oob ud: ibv_get_device_list()==NULL can mean no devices present
...
...which is not an error. Don't complain about it.
2015-05-12 10:54:39 -07:00
Jeff Squyres
8f941a6613
oob ud: better error msgs, tolerate systems without UD devices
...
It is perfectly ok to be on a system without UD devices.
Also, make some of the error messages better -- so that the user has a
clue about where the error messages are coming from, and what they
should do.
2015-05-11 13:11:51 -07:00
Mike Dubman
894ba28390
Merge pull request #559 from nkogteva/oob_ud
...
oob ud: made component more user adaptive; opal outputs were replaced by...
2015-05-11 21:09:28 +03:00
Ralph Castain
3cee4152fc
Fix the intercommunicator issue reported by Gilles. Instead of directly checking the reachability bitmap, ask the component if the proc is reachable when doing a send as the component is the final arbiter in such cases. Recirculate any messages that a daemon is trying to send to avoid race conditions. Clean up listener sockets so we don't leak them.
2015-05-11 09:16:25 -07:00
Howard Pritchard
3382d3ce61
ess/alps: remove unnecessary vpid calc
...
There was a redundant computation of the vpid
for orted's happening in ess/alps rte_init
method. Keep the more efficient alps based
method.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-09 20:07:38 -07:00
Ralph Castain
b5382c9bf9
Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel.
...
A few other minor cleanups included.
2015-05-08 11:15:21 -07:00
Ralph Castain
6e95bcd583
Fix typo in oob_tcp.c when IPV6 enabled. Cleanup a few other warnings, including a type in coll_sm that prevented that component from registering its MCA params!
2015-05-07 21:05:08 -07:00
Gilles Gouaillardet
2e384a3b65
initialize common symbols from orte
...
A few uninitialized common symbols are remaining (generated by flex) :
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
9cb2fcfa5c
Cleanup the qos code when --enable-timings is given
2015-05-06 20:24:27 -07:00
Ralph Castain
01a9bdf4cf
Cleanup of ud/oob component
2015-05-06 19:48:42 -07:00
Ralph Castain
1f8de276de
Consolidate all the QOS changes into one clean commit
2015-05-06 19:48:42 -07:00
Ralph Castain
8e3f0b1d33
Ensure the --tree-spawn option is inside any parens from the sh and ksh shell support
2015-05-06 15:18:15 -07:00
Ralph Castain
0bb73645f0
Silence Coverity warning
2015-04-30 20:49:28 -07:00
Ralph Castain
7d1980ba83
Add the ability to specify the number of desired slots in the --host option. Just giving a host name => one slot (multiple copies of the name yield one slot per copy). Giving "foo:3" indicates you want three slots - a shorthand notation for saying "foo" three times. Giving "foo:*" indicates you want the topology to set the number of slots based on the orte_set_slots param.
2015-04-30 20:35:23 -07:00
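The `--host` slot notation described in the commit above can be illustrated with a small parser. This is a sketch under stated assumptions, not the ORTE implementation: `parse_host_spec`, `SLOTS_FROM_TOPOLOGY`, and `SLOTS_INVALID` are hypothetical names, and the upper bound on the count is an arbitrary sanity limit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result codes; not ORTE identifiers. */
#define SLOTS_FROM_TOPOLOGY (-1)  /* "host:*": let the topology decide */
#define SLOTS_INVALID       (-2)  /* malformed entry */

/*
 * Split "name[:count]" into a host name and a slot count.
 * Copies the name into `name` (capacity `len`) and returns:
 *   1                    for a bare host name,
 *   N                    for "name:N" with N > 0,
 *   SLOTS_FROM_TOPOLOGY  for "name:*",
 *   SLOTS_INVALID        for anything else.
 */
int parse_host_spec(const char *spec, char *name, size_t len)
{
    assert(spec != NULL && name != NULL && len > 0);

    const char *colon = strchr(spec, ':');
    size_t n = colon ? (size_t)(colon - spec) : strlen(spec);
    if (n == 0 || n >= len) {
        return SLOTS_INVALID;
    }
    memcpy(name, spec, n);
    name[n] = '\0';

    if (colon == NULL) {
        return 1;                       /* bare name: one slot */
    }
    if (strcmp(colon + 1, "*") == 0) {
        return SLOTS_FROM_TOPOLOGY;
    }

    char *end = NULL;
    long count = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || *end != '\0' || count <= 0 || count > 1000000) {
        return SLOTS_INVALID;
    }
    return (int)count;
}
```

A caller that sees the same bare name several times would add one slot per occurrence, which matches the "one slot per copy" rule in the commit message.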
Ralph Castain
e26e7ad736
Better support automated tests for map, rank, and bind options
2015-04-30 14:01:13 -07:00
Ralph Castain
7d4f9970d8
Minor cleanup
2015-04-29 17:49:35 -07:00
Nadezhda Kogteva
01ce58391e
oob ud: made component more user adaptive; opal outputs were replaced by help messages.
2015-04-28 15:36:32 +03:00
Jeff Squyres
8fbf34b196
oob ud: put call to ibv_fork_init() before *all* ibv calls
...
Move the call to opal_common_verbs_fork_test() up, before the call
to ibv_get_device_list() (just curious -- why not use
opal_ibv_get_device_list()?). This ensures that the call to
ibv_fork_init() is before *all* other ibv_* calls.
2015-04-24 14:19:06 -07:00
Ralph Castain
9104e81958
When --map-by node, we should be unbound. Also remove dead code due to copy/paste error.
2015-04-23 20:35:54 -07:00
Ralph Castain
5003be5c5c
If the user specifies a --map-by <foo> option, then default to bind-to <foo> unless they specify a bind-to option. If they map-by slot/node, then use the default policy based on num_procs.
2015-04-23 13:30:21 -07:00
Ralph Castain
43229d056e
Protect one more place from a NULL object
2015-04-20 18:45:57 -07:00
Jeff Squyres
11e8c2096b
plm rsh: assign some levels to the rsh PLM MCA params
2015-04-20 16:18:57 -07:00
Nathan Hjelm
359a282e7d
ess/singleton: MCA variable synonyms can not currently have NULL for both framework and component
...
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-20 16:50:52 -06:00
Nathan Hjelm
45e053dbce
orte: use C99 subobject naming for component initialization
...
This commit helps future-proof orte components by initializing each
component member by name.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
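Initializing each member by name, as the commit above describes, uses C99 designated initializers. The sketch below shows the idea on a made-up structure; `example_component_t` and its members are hypothetical and do not reflect the actual MCA component layout.

```c
#include <stddef.h>

/* Hypothetical component descriptor; not an actual ORTE/MCA type. */
typedef struct {
    const char *name;
    int major_version;
    int minor_version;
    int (*open)(void);
    int (*close)(void);
} example_component_t;

static int example_open(void)  { return 0; }
static int example_close(void) { return 0; }

/*
 * Each member is set by name, so reordering the struct or adding
 * new members cannot silently shift values into the wrong slot.
 * Members that are not named (minor_version here) are zero-initialized.
 */
const example_component_t example_component = {
    .name          = "example",
    .major_version = 1,
    .open          = example_open,
    .close         = example_close,
};
```

With positional initialization, inserting a new member in the middle of the struct would require touching every component; with named members, existing initializers keep compiling and keep their meaning.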
Ralph Castain
34b53ac3dc
Silence Coverity warnings
2015-04-18 07:48:22 -07:00
Ralph Castain
12bfb27161
Redo in cleaner form: Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command
2015-04-17 16:11:37 -07:00
Nadezhda Kogteva - nadezhda.kogteva@itseez.com
c2678b0cc9
oob ud: fixes and parameter adjustment
2015-04-17 16:22:43 +03:00
Nathan Hjelm
3436f2917d
Merge pull request #449 from hjelmn/mca_base_update
...
mca/base update
2015-04-16 08:41:48 -06:00
Ralph Castain
d9c555b547
Revert "Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command"
...
This reverts commit open-mpi/ompi@278324c52a .
Revert "Add the ability to pass args to the rsh/ssh command line"
This reverts commit open-mpi/ompi@6f227f8564 .
2015-04-16 08:03:14 -06:00
Ralph Castain
278324c52a
Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command
2015-04-15 20:30:04 -06:00
Ralph Castain
6f227f8564
Add the ability to pass args to the rsh/ssh command line
2015-04-15 20:07:13 -06:00
Howard Pritchard
283ef4c05d
oob/config: if --with-verbs=no, no ud
...
The oob/ud configure was not honoring the case
if the ompi is configured with --with-verbs=no.
This fixes that problem.
Fixes #522
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-04-14 06:31:18 -07:00
Ralph Castain
9c6d452d6b
If we are using HT cpus and have <= 2 procs, then map-by hwthread by default
2015-04-11 21:18:05 -07:00
Ralph Castain
cd686057f6
If the HNP is on a coprocessor, record it so we don't get an error log later
2015-04-11 15:30:15 -07:00
Ralph Castain
91e1cbf284
Init variable
2015-04-11 07:44:57 -07:00
Ralph Castain
033418f62a
Correct a typo that reversed the default binding pattern. Ensure we default bind to hwthread if user specified --use-hwthread-cpus if nprocs <= 2, and bind to hwthread if told to do so.
2015-04-10 15:58:35 -07:00
Ralph Castain
3e44d3c9e3
Enable singletons to run without any active OOB module until they attempt to comm_spawn
2015-04-10 14:06:42 -07:00
Ralph Castain
e4f6f83b9d
Attempt to silence new Coverity complaint by ensuring the string read from file is NULL terminated.
2015-04-10 07:54:37 -07:00
Ralph Castain
396700ad8b
Protect the notifier macros against NULL job objects
2015-04-09 16:04:43 -07:00
Nathan Hjelm
c416c423bb
ess/singleton: do not put component strings into the environment
...
putenv requires that any string put into the environment is not
changed or freed. That is not the case with constant strings as they
will go away when dlclose is called on the component. Instead, just
use opal_setenv which does not have this restriction.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-09 11:00:47 -06:00
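The difference the commit above relies on can be shown with POSIX `setenv()`, which copies its arguments, whereas `putenv()` keeps a pointer to the caller's string. This sketch uses `setenv()` as a stand-in; it does not reproduce the signature or behavior of `opal_setenv`, and `export_copy` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: export name=value using setenv(), which
 * copies both strings into the environment. The caller may then
 * overwrite or free its buffers, or the memory holding them may be
 * unmapped, without corrupting the environment. With putenv(), the
 * environment would keep pointing at the caller's memory instead.
 */
int export_copy(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    return setenv(name, value, 1);   /* 1 = overwrite an existing value */
}
```

After `export_copy("EXAMPLE_VAR", buf)` returns, changing `buf` has no effect on what `getenv("EXAMPLE_VAR")` reports.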
Ralph Castain
0c043dbdc9
Fix typo in var name
2015-04-02 02:32:42 -07:00
Ralph Castain
a4b466efc4
Support attempts to connect async processes by allowing the oob/tcp connection to retry the attempt to connect to a peer. Off by default, operates if someone specifies how long to wait between retry attempts.
2015-04-01 20:21:23 -07:00
Ralph Castain
9f8ae59162
Properly enclose the different && clauses
2015-04-01 18:48:25 -07:00
Ralph Castain
57c21d5209
Ensure the DVM flows thru the "daemons reported" state
2015-04-01 16:47:34 -07:00
Mike Dubman
8914a9c070
Merge pull request #494 from elenash/modifiers
...
changed mindist mapping policy specifier
2015-04-01 16:31:46 +03:00
Elena
1e913c76c4
changed mindist mapping policy specifier from --map-by dist:device,modifiers to --map-by dist:modifiers -mca rmaps_dist_device device
2015-04-01 15:07:35 +03:00
Nadezhda Kogteva
2d49d9bd45
grpcomm rcd: remove unnecessary malloc warning for case when number of daemons == 1
2015-04-01 11:07:44 +03:00
Mike Dubman
58d002098b
Merge pull request #474 from elenash/master
...
Introduce -tune command line option to set env vars and mca params from ...
2015-04-01 08:23:34 +03:00
Ralph Castain
6f9140a341
Add a little more debug to launch
2015-03-31 20:10:21 -07:00
Ralph Castain
b209c9efa5
Move the "dvm ready" message to stdout so it is easier to trap
2015-03-30 20:12:56 -07:00
Ralph Castain
6d205a3c80
Ensure that singletons pickup the oob/tcp component
2015-03-30 18:10:08 -07:00
rhc54
bc016617a0
Merge pull request #501 from rhc54/topic/sec2
...
Support authentication across security domains
2015-03-30 09:59:43 -07:00