Ralph Castain
219c4dfba5
Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one.
2015-07-12 08:23:34 -07:00
rhc54
bd91225cb5
Merge pull request #716 from rhc54/topic/alloc
...
Default allocated nodes to the UP state
2015-07-11 12:30:32 -07:00
Ralph Castain
2c896c5a2d
Default allocated nodes to the UP state
2015-07-11 10:43:11 -07:00
Ralph Castain
683efcb850
Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.
2015-07-11 10:08:19 -07:00
rhc54
053d9b2a7c
Merge pull request #713 from rhc54/topic/errhandler
...
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:58:57 -07:00
Ralph Castain
a2243dcddd
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:09:11 -07:00
Ralph Castain
61fb067f14
Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base
2015-07-11 06:42:23 -07:00
rhc54
c6bb227073
Merge pull request #692 from rhc54/topic/mapper
...
Fix hetero operations. An error in the hwloc utilities only allocated…
2015-07-07 13:33:42 -07:00
Ralph Castain
ed93154e43
Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line.
2015-07-07 12:52:16 -07:00
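The sizing error described in ed93154e43 can be illustrated with a minimal sketch. It is not the hwloc utility code: the function name, its arguments, and the "B"/"." output format are hypothetical. The point is only that the output buffer is sized from the core count of the node being displayed on every call, rather than once from whichever node happened to be displayed first.

```python
def render_binding_map(bound_cores, num_cores):
    """Render one node's binding map, e.g. '[B/./.]' (hypothetical format)."""
    # Size the buffer from THIS node's core count on every call. Sizing it
    # once, from the first node displayed, is the class of bug described
    # in the commit message above.
    slots = ["."] * num_cores
    for core in bound_cores:
        if not 0 <= core < num_cores:
            raise ValueError(f"core {core} is outside a {num_cores}-core node")
        slots[core] = "B"
    return "[" + "/".join(slots) + "]"
```

With this shape, displaying a 2-core node before a 4-core node is harmless, because no state carries over between calls.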
rhc54
a4aff5e3d9
Merge pull request #691 from rhc54/topic/mapper
...
Add a bunch of debug, and correct an error that caused us to use the …
2015-07-07 11:08:01 -07:00
Ralph Castain
7455802a36
Add a bunch of debug, and correct an error that caused us to use the wrong mapping policy when determining the default binding policy
2015-07-07 10:13:10 -07:00
Gilles Gouaillardet
409874eb47
remove trigraph '??)' from comment
...
Fujitsu compilers issue way too many warnings because of this trigraph
2015-07-07 11:00:13 +09:00
Ralph Castain
eb582b8276
Minor whitespace cleanups
2015-07-06 09:38:33 -07:00
Ralph Castain
836f49597d
There is no reason for tools to have an async progress thread, as they can loop the event library themselves. This has the added benefit of causing the tool to "block" while waiting for events so it doesn't use CPU.
...
Also, fix orte-submit so it appropriately handles the --help option
2015-07-05 10:45:28 -07:00
Ralph Castain
6829e192ad
Okay, that's it - trash it
2015-07-01 05:27:30 -05:00
Ralph Castain
6cd3ccd305
Update the OMP support per request from IBM and LLNL
2015-06-30 10:24:34 -05:00
Ralph Castain
a58171a974
Add some debug
2015-06-29 14:51:41 -05:00
Ralph Castain
a4557d4ed2
Add new component to support OpenMP envars per request from IBM and LLNL
2015-06-27 17:57:04 -07:00
Ralph Castain
4352123c26
Protect the oob/tcp component from port scanners
2015-06-26 01:40:57 -07:00
Nathan Hjelm
ee36d813dc
Merge pull request #657 from hjelmn/c99
...
more c99 updates
2015-06-25 11:21:09 -06:00
Nathan Hjelm
4d92c9989e
more c99 updates
...
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Howard Pritchard
e49a37c034
ownership: update ownership files
...
per discussions at OMPI devel workshop
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-25 10:04:42 -06:00
Ralph Castain
014a6a5969
Initialize variable to make clang happy
2015-06-24 22:01:09 -07:00
Ralph Castain
869041f770
Purge whitespace from the repo
2015-06-23 20:59:57 -07:00
Ralph Castain
db3c59b943
Silence a warning by converting the bitmap to a string prior to printing the error
2015-06-23 11:49:11 -07:00
Ralph Castain
706884652f
Silence Coverity warning about failing to check return code
2015-06-17 19:24:51 -07:00
Ralph Castain
869b2891c4
When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary
2015-06-17 09:20:08 -07:00
Gilles Gouaillardet
ec679b3fc2
orte/orted: fix misc memory leaks
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
b72e9288bc
rmaps: fix a misc memory leak
...
as reported by Coverity with CID 1269887
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
27b4727fcf
orte/orted: fix misc memory leak
...
as reported by Coverity with CID 743448
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
ac5921d7da
orte/util: fix misc memory leak
...
as reported by Coverity with CID 1196738-1196739
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
e77d3057d6
orte-submit: fix a misc memory leak
...
as reported by Coverity with CID 710651
2015-06-17 11:17:54 +09:00
Gilles Gouaillardet
67638690ea
orte/util: fix a misc memory leak
...
as reported by Coverity with CID 710652
2015-06-17 11:17:54 +09:00
Gilles Gouaillardet
a43abceb88
fix dfs misc memory leaks
...
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709
2015-06-17 11:17:54 +09:00
rhc54
adbff46a13
Merge pull request #642 from rhc54/topic/hwloc
...
Update hwloc to 1.11.0
2015-06-13 12:09:58 -07:00
Ralph Castain
ff92781ec4
Replace hwloc191 with hwloc1110
...
Fix hwloc compile. Ignore LAMA mapper due to deprecated hwloc functions
2015-06-13 10:11:45 -07:00
Ralph Castain
cebdf0b7c0
Add missing include
2015-06-09 22:08:05 -07:00
Howard Pritchard
05325b113e
odls/alps: fix busted build for cray.
...
This commit fixes things broken by commit
ea35e47.
Fixes #616
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Ralph Castain
6b93db6a9a
Grrr...not sure how this slipped thru
2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184
Remove stale header
2015-05-29 19:24:51 -07:00
Ralph Castain
ea35e47228
Fat SMPs (i.e., systems with nodes containing large numbers of CPUs) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail.
...
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.
We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.
This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
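The centralized listener described in ea35e47228 can be sketched roughly as follows. This is an illustration in Python, not the ORTE implementation: the class name, the register/start/stop methods, and the hand-off callback are all hypothetical. It shows one thread that only reaps accepts from several registered listening sockets and hands each new connection to a callback, so that any handshake runs elsewhere and the accept backlog keeps draining.

```python
import queue
import selectors
import socket
import threading


class ListenerThread:
    """One thread that reaps accepts for every registered listening socket."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._started = False

    def register(self, listen_sock, handoff):
        """Register a listening socket and the callback that receives its
        accepted connections. In this sketch, registration must happen
        before start()."""
        if self._started:
            raise RuntimeError("register() must be called before start()")
        listen_sock.setblocking(False)
        self._selector.register(listen_sock, selectors.EVENT_READ, handoff)

    def start(self):
        self._started = True
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        self._selector.close()

    def _run(self):
        while not self._stop.is_set():
            # Short timeout so the stop flag is checked regularly.
            for key, _ in self._selector.select(timeout=0.1):
                try:
                    conn, _addr = key.fileobj.accept()
                except (BlockingIOError, InterruptedError):
                    continue  # nothing to accept after all
                # Hand off immediately; no handshake is done in this thread.
                key.data(conn)
```

A usage sketch: create two listening sockets, call register(sock, events.put) for each with a shared queue.Queue named events, call start(), and let the event-processing side pull accepted connections from the queue.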
Nathan Hjelm
7db48c581d
orte_quit: Remove logically dead code
...
CID 71993 Logically dead code (DEADCODE)
As indicated by Coverity, proc cannot be NULL at any point after the
continue. Removed dead code.
CID 1269682 Unchecked return value (CHECKED_RETURN)
Check the return code of orte_get_attribute. I assume we still need to
check for a NULL proc in case the aborted proc attribute is set to
NULL. This might be better as an assert ().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-26 12:16:12 -06:00
Ralph Castain
c21cd1c91e
Ensure the ssh session is dead
2015-05-23 08:14:29 -07:00
Ralph Castain
920562d9b4
Ensure that all ssh sessions are terminated when abnormally terminating the job
2015-05-23 08:14:29 -07:00
Jeff Squyres
5e52ce26b5
help-errmgr-base.txt: remove trailing newline
...
Removed spurious newline at end of file so that the emitted help
message doesn't contain a blank line before the final "-----" output.
2015-05-23 03:33:23 -07:00
Ralph Castain
55cd2a07f6
Update exit code
2015-05-22 21:06:43 -07:00
Ralph Castain
3510bb4ced
Set the exit code when a daemon fails
2015-05-22 21:05:23 -07:00
Ralph Castain
bc7a7f3de5
Fix abnormal shutdown when a node dies
2015-05-22 17:29:06 -07:00
Ralph Castain
96cd42699e
Cleanup warnings for uninitialized vars and convert bare debug output to verbose
2015-05-21 07:41:26 -07:00
Jeff Squyres
3069daa015
oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK
...
Have only a single level of "if" conditionals. Also, slightly change
the logic such that we only die/break out of the loop if we get EMFILE
-- all other errors are ok to go on to the next fd.
Finally, use a real show_help() message to warn when other errors occur.
2015-05-20 21:10:11 -04:00
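The error handling described in 3069daa015 can be sketched as a single level of conditionals. This is an illustration, not the oob_tcp_listener code: the function name and the returned action strings are hypothetical, and treating EAGAIN/EWOULDBLOCK as a silent skip is an assumption drawn from the commit title.

```python
import errno


def classify_accept_error(err):
    """Map an accept() errno to the action the accept loop should take."""
    if err in (errno.EAGAIN, errno.EWOULDBLOCK):
        # Assumed: nothing is pending on this fd, move on quietly.
        return "skip"
    if err == errno.EMFILE:
        # Out of file descriptors: the only case that breaks out of the loop.
        return "abort"
    # Any other error: emit a warning, then go on to the next fd.
    return "warn"
```

A caller would loop over its listening fds, act on "abort" by leaving the loop, and on "warn" by printing a help message before continuing.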