Ralph Castain
0523f60479
Remove debug from orte-submit help output
2015-11-01 09:19:07 -08:00
rhc54
1fe27bf1dd
Merge pull request #1084 from rhc54/topic/dashhost
...
Fix relative node syntax for dash-host option
2015-10-31 21:24:39 -07:00
Ralph Castain
8bfbe7f16c
Add a new MCA parameter for default_dash_host to offer a mirror of the default_hostfile
2015-10-31 19:09:54 -07:00
Ralph Castain
24419b6523
Fix relative node syntax for dash-host option
2015-10-31 19:00:46 -07:00
rhc54
b23f1f3578
Merge pull request #1080 from federeghe/bugfixes
...
oob_tcp: fix peer->state wrong check
2015-10-31 16:09:23 -07:00
Ralph Castain
22dc05194e
Minor cleanup - explicitly NULL the last member of a function pointer module. Should default to that anyway, but this is cosmetically nicer.
2015-10-30 08:19:55 -07:00
Federico Reghenzani
6536a6a9f5
oob_tcp: fix peer->state wrong check
2015-10-29 16:43:58 +01:00
Ralph Castain
267ca8fcd3
Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now
...
to continue current default behavior.
Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_modex is used.
2015-10-27 17:31:56 -07:00
rhc54
3ffbf08283
Merge pull request #1068 from marksantcroos/master
...
Make odls debug message consistent.
2015-10-24 08:11:11 -07:00
Mark Santcroos
30aab75b86
Make message consistent.
2015-10-24 13:40:03 +02:00
Ralph Castain
6506b0a5e5
Resolve a race condition that prevented the sigchild callback from being registered before short-lived apps terminated
...
Thanks to Mark Santcroos for the assistance in tracking it down.
2015-10-23 21:02:31 -07:00
Nathan Hjelm
9602484568
Merge pull request #1040 from hjelmn/mtl_priority
...
Change how cm's priority is calculated
2015-10-19 14:18:36 -06:00
Nathan Hjelm
8b5810f7f7
mca/base: add priority output to mca_base_select
...
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
Ralph Castain
363f62a506
Fix singleton operations when running under a SLURM allocation. Sadly, SLURM's PMI will return success even if the PMI server isn't actually available. This leads to erroneous selection of pmix and ess components. So add a further requirement (namely, that we see a job_step envar) to the SLURM pmix components along with some modification of ess selection code to avoid the problem
2015-10-17 20:24:03 -07:00
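The extra requirement described in the SLURM singleton fix above can be sketched as a simple environment check. This is an illustrative sketch only, not the ORTE code: the commit does not name the variable it tests, so `SLURM_STEP_ID` (set by `srun` for a job step) and `SLURM_JOB_ID` (set inside an allocation) are assumptions, and the function names are hypothetical.

```python
import os


def in_slurm_job_step(env=None):
    """Return True only when both an allocation and a job step are visible.

    Assumption: SLURM_STEP_ID stands in for the "job_step envar" mentioned
    in the commit message; the actual variable checked is not stated there.
    """
    env = os.environ if env is None else env
    in_allocation = "SLURM_JOB_ID" in env
    in_job_step = "SLURM_STEP_ID" in env
    return in_allocation and in_job_step


def pick_startup_mode(env=None):
    """Hypothetical selector: treat the process as a singleton unless it
    was clearly started as a job step inside an allocation."""
    return "direct-launch" if in_slurm_job_step(env) else "singleton"
```

A process started by hand inside an allocation sees the allocation variable but no step variable, so this check falls back to singleton mode instead of trusting a PMI call that reports success without a server behind it.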
Jeff Squyres
62351f442a
help: remove stale help messages and files
...
Found by contrib/check-help-strings.pl.
2015-10-13 16:50:20 -04:00
Jeff Squyres
f9e9b69d93
Merge pull request #1001 from igor-ivanov/master
...
orte/mca/rmaps: Improve orte_rmaps_dist_device help message
2015-10-09 14:07:47 -04:00
Igor Ivanov
489f27f8e9
orte/mca/rmaps: Improve orte_rmaps_dist_device help message
...
See: https://github.com/open-mpi/ompi/issues/953
2015-10-09 17:58:07 +03:00
rhc54
232f97a80c
Merge pull request #968 from JohnWestlund/master
...
simplify use of sockaddr* structs to work around buffer overflow warning
2015-10-07 17:42:19 -07:00
Howard Pritchard
d899320574
odls/alps: close the directory
...
Close the /proc/self/fd dir after checking for open fds.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-10-06 11:13:44 -07:00
Igor Ivanov
d379873443
oshmem: Add man.1 pages for oshmem tools
...
This change adds man pages for oshrun, oshcc and oshfort as well as
deprecated shmemrun, shmemcc and shmemfort.
2015-10-05 15:41:28 +03:00
John Westlund
044fea8df7
re-order != comparison, OBJ_RELEASE mca_oob_tcp_addr_t on failure
2015-10-02 15:59:48 -07:00
John Westlund
6bfaa925ec
simplify use of sockaddr* structs to work around buffer overflow warning
2015-10-02 14:26:52 -07:00
Ralph Castain
8f6855459d
Cleanup some coverity warnings
2015-09-30 10:33:53 -07:00
Gilles Gouaillardet
0445484820
ras: remove orte_ras_proc_t and associated code
2015-09-30 08:52:52 +09:00
Gilles Gouaillardet
7cc14ee6f6
orte/rmaps: silence warning
2015-09-29 16:05:52 +09:00
Ralph Castain
fad5638596
Resolve the naming issue when direct-launched by PMIx-enabled RMs using a minimal-impact approach. Detect if we were launched via ORTE - if so, then use our standard methods for computing the jobid. If not, then just hash the nspace to create the jobid, and track the jobid <-> nspace correspondence down in the opal/mca/pmix/pmix1xx component. We then do the translation any time a function that passes process names is invoked.
2015-09-27 09:57:59 -07:00
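The hash-and-track approach described in the commit above can be sketched roughly as follows. The commit does not say which hash function the pmix1xx component uses, so `zlib.crc32` is only a stand-in, and the class and method names are hypothetical.

```python
import zlib


class NamespaceTable:
    """Two-way mapping between string namespaces and 32-bit job ids."""

    def __init__(self):
        self._by_jobid = {}
        self._by_nspace = {}

    def jobid_for(self, nspace):
        """Hash the nspace to a jobid and remember the correspondence."""
        if nspace in self._by_nspace:
            return self._by_nspace[nspace]
        # Assumption: CRC32 is a placeholder for the real hash function.
        jobid = zlib.crc32(nspace.encode("utf-8")) & 0xFFFFFFFF
        owner = self._by_jobid.setdefault(jobid, nspace)
        if owner != nspace:
            raise ValueError("hash collision between %r and %r" % (owner, nspace))
        self._by_nspace[nspace] = jobid
        return jobid

    def nspace_for(self, jobid):
        """Translate back when a call needs the original namespace."""
        return self._by_jobid[jobid]
```

Keeping both directions in one table is what allows the translation to happen at every call that passes a process name, without the caller knowing how the jobid was derived.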
Ralph Castain
0140ff048d
Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.
...
This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.
Also do a little cleanup to avoid bombarding the user with multiple error messages.
Thanks to Patrick Begou for reporting the problem
2015-09-24 07:16:48 -07:00
Ralph Castain
749bd4e6fe
Plug a few memory leaks identified by valgrind
2015-09-23 15:21:04 -07:00
Ralph Castain
f28448702a
Eliminate malloc by utilizing /proc/self/fd - optimization
2015-09-22 07:24:54 -07:00
Ralph Castain
f872e99315
Fix orte-submit so it allows application procs to select the correct ess component. Protect orte_data_server from multiple calls to finalize.
2015-09-21 20:31:57 -07:00
Howard Pritchard
ef6cf50687
Merge pull request #917 from hppritcha/topic/alps_warning_swat
...
oob/alps: swat compiler warning
2015-09-21 16:17:30 -06:00
Howard Pritchard
8d7e759b85
oob/alps: swat compiler warning
...
swat some alps related compiler warnings when using --enable-picky
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-21 14:24:26 -07:00
Ralph Castain
92ae386a34
As Jeff proposed, change the check to looking for the filename's first character to be a digit
2015-09-21 08:22:58 -07:00
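The fd-closing commits in this stretch of the log describe enumerating /proc/self/fd and accepting only entries whose name begins with a digit. A minimal sketch of that idea, assuming a Linux-style /proc and keeping stdin, stdout and stderr open, might look like this; the function names are hypothetical and this is not the odls code itself.

```python
import os


def fds_from_names(names, keep=(0, 1, 2)):
    """Turn directory entry names into a sorted list of fds to close."""
    fds = []
    for name in names:
        # Same spirit as the commit: only names starting with a digit count.
        if not name[:1].isdigit():
            continue
        try:
            fd = int(name)
        except ValueError:
            continue  # starts with a digit but is not a plain number
        if fd not in keep:
            fds.append(fd)
    return sorted(fds)


def close_inherited_fds(fd_dir="/proc/self/fd", keep=(0, 1, 2)):
    """Close every open descriptor listed in fd_dir except those in keep.

    Intended for a child process between fork and exec. Returns the fds
    that were actually closed; returns [] if fd_dir cannot be read.
    """
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return []
    closed = []
    for fd in fds_from_names(names, keep):
        try:
            os.close(fd)
        except OSError:
            continue  # e.g. the fd used to read the directory is already gone
        closed.append(fd)
    return closed
```

Compared with looping over every possible descriptor up to the process limit, this touches only descriptors that are actually open, which matters when the limit is large.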
rhc54
13def2a69b
Merge pull request #911 from rhc54/topic/cleanup
...
Cleanup the odls "close file descriptor" commit to conform to OMPI co…
2015-09-20 07:01:39 -07:00
Howard Pritchard
1367a442b6
Merge pull request #910 from hppritcha/topic/odls_alps_use_907_stuff
...
odls/alps: do smarter close of fds in child
2015-09-20 07:37:55 -06:00
Ralph Castain
c167acc5a7
Cleanup the odls "close file descriptor" commit to conform to OMPI coding standards and remove memory leaks
2015-09-19 20:46:36 -07:00
Howard Pritchard
a31cc21bea
odls/alps: do smarter close of fds in child
...
Use a modified variant of #907. Thanks to plesn
for noticing this.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-19 14:17:05 -07:00
Piotr Lesnicki
1dd5487fae
odls: close only used file descriptors at fork/exec
2015-09-18 16:44:57 +02:00
Ralph Castain
1b7930ad52
Silence some warnings and address Coverity issues
2015-09-16 07:58:22 -07:00
Ralph Castain
8b88ea9b13
Fix singletons by removing stale code
2015-09-16 00:58:05 -07:00
Ralph Castain
c1bbbb5e2f
Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds
2015-09-15 13:08:35 -07:00
Ralph Castain
22d7c0081a
Fix the no-disconnect test by resolving a segfault on free - opal_dss.unload will return the remaining unpacked portion of a buffer. As such, it cannot return the pointer to that info as it might be partway inside of a malloc'd region. So copy the data out of the buffer.
2015-09-11 13:01:35 -07:00
Ralph Castain
dc5796b8a1
Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local""
...
Fix the locality computation by correctly computing the vpid of the local peer
This reverts commit open-mpi/ompi@6a8fad49e5.
2015-09-11 08:29:51 -07:00
Ralph Castain
6a8fad49e5
Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"
...
This reverts commit f94f3cda21.
2015-09-11 02:01:25 -07:00
Ralph Castain
f94f3cda21
Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local
2015-09-10 10:25:30 -07:00
rhc54
f6b6b9a9ca
Merge pull request #877 from rhc54/topic/s1s2
...
Cleanup s1 and s2 components
2015-09-08 19:20:59 -07:00
Ralph Castain
1cdb86b8c7
Cleanup s1 and s2 components, and ensure that mpirun and orteds only use non-direct-launch pmix components.
2015-09-08 18:37:09 -07:00
Ralph Castain
459f169e06
Fix segfault upon job error
...
Silence some unnecessary error-logs
2015-09-08 14:03:06 -07:00
Jeff Squyres
bc9e5652ff
whitespace: purge whitespace at end of lines
...
Generated by running "./contrib/whitespace-purge.sh".
2015-09-08 09:47:17 -07:00
Ralph Castain
e6add86e4f
Deal with connect/accept between two jobs from different mpirun's. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues.
2015-09-07 09:19:24 -07:00
Ralph Castain
37c3ed68e7
Cleanup connect/disconnect and bring comm_spawn back online!
2015-09-06 10:27:39 -07:00
rhc54
665b30376a
Merge pull request #868 from rhc54/topic/hwloc
...
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 17:58:07 -07:00
Ralph Castain
d97bc29102
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 16:54:40 -07:00
Ralph Castain
f6948c2bb4
Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working
2015-09-04 10:07:17 -07:00
Ralph Castain
a772b46c15
Bring the MPI_Publish and friends online
2015-09-02 12:04:07 -07:00
Ralph Castain
38ba54366c
Fix shared memory operations by resolving local peers
2015-08-30 12:07:14 -07:00
Ralph Castain
0d5814b5ca
Cleanup Coverity issues
2015-08-29 21:19:27 -07:00
Ralph Castain
cf6137b530
Integrate PMIx 1.0 with OMPI.
...
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
2015-08-29 16:04:10 -07:00
Ralph Castain
89c80b2294
Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener.
2015-08-27 16:41:00 -07:00
Nathan Hjelm
156ce6af21
periodic whitespace purge
...
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 09:32:33 -06:00
Ralph Castain
bc7815e178
Adjust the process type flags to remove confusion between orted and dvm state machines
2015-08-21 07:50:08 -07:00
Ralph Castain
5040f47ef3
Use the correct verbosity in an output_verbose
2015-08-13 22:33:25 -07:00
Ralph Castain
a2a049a612
Update test to match the one in MTT
2015-08-13 11:12:34 -07:00
Ralph Castain
0b1d4b62be
Cleanup some cruft and update to coordinate with CM operations:
...
* don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway
* ensure state/orted component disqualifies itself from CM operations
* clarify the DVM proc_type definitions
* ensure we stop littering the tmp dir with session directories
2015-08-12 10:32:14 -07:00
Jeff Squyres
31b329e585
odls default: ensure to initialize opts
...
This fixes CID 71127.
2015-08-12 05:27:37 -07:00
Howard Pritchard
8e7e4ca7f4
Merge pull request #780 from hppritcha/topic/plm_alps_minor_cleanup
...
plm/alps: remove unneeded env. variable setting
2015-08-07 15:03:45 -06:00
Jeff Squyres
09f7434491
ORTE: update for the new opal_progress_thread API
2015-08-07 10:13:40 -07:00
Howard Pritchard
1b55d14dff
plm/alps: remove unneeded env. variable setting
...
In order to address issue #741, the orted's now are
always launched with the Cray PMI environment variables
PMI_NO_FORK
PMI_NO_PREINITIALIZE
set to disable running of the library's ctor.
So there's no longer a need to set these for the
application(s) being launched by the orted's.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-08-05 13:27:18 -07:00
Ralph Castain
9bc384282a
Fix an annoying segfault caused by incorrect indentation in a loop that causes the buffer to not be created prior to packing.
2015-08-01 10:01:47 -07:00
Ralph Castain
023936e84b
Silence coverity warnings
2015-07-29 07:28:08 -07:00
Gilles Gouaillardet
429bdf1af7
oob/tcp: fix a race condition when finalizing the oob/tcp component
2015-07-28 09:16:13 +09:00
Ralph Castain
93f7a51275
Update the orte/system/opal_hotel test
2015-07-24 07:34:59 -07:00
Howard Pritchard
70096d3753
plm/alps: fix orted based launch failures.
...
Turns out that when one builds Open MPI with --disable-dlopen
for Cray, a whole bunch of cray specific libraries get linked
in to the orted executable. One of these is Cray PMI. The
Cray PMI has a ctor which, if run, causes job launches using
mpirun to fail. This commit suppresses the running of the
ctor and thus prevents failure to launch.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-07-23 15:07:57 -07:00
Jeff Squyres
60609cbb79
orte/test/system: fix compiler warnings
...
Note that the opal_hotel test still doesn't compile; it looks like it
needs to be updated to the new requirement to pass an event base.
2015-07-23 06:19:33 -07:00
Ralph Castain
4853457b93
The RML posted recvs are controlled by the async progress thread when in an application process. The call to finalize and close the RML is done from the main thread, and so we need to shift the actual destruct of the posted recv list to the async thread for handling or else we encounter a race condition when accessing the posted recvs.
...
Thanks to Gilles for providing the required debug info
2015-07-21 08:44:23 -07:00
Ralph Castain
219c4dfba5
Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one.
2015-07-12 08:23:34 -07:00
rhc54
bd91225cb5
Merge pull request #716 from rhc54/topic/alloc
...
Default allocated nodes to the UP state
2015-07-11 12:30:32 -07:00
Ralph Castain
2c896c5a2d
Default allocated nodes to the UP state
2015-07-11 10:43:11 -07:00
Ralph Castain
683efcb850
Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.
2015-07-11 10:08:19 -07:00
rhc54
053d9b2a7c
Merge pull request #713 from rhc54/topic/errhandler
...
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:58:57 -07:00
Ralph Castain
a2243dcddd
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:09:11 -07:00
Ralph Castain
61fb067f14
Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base
2015-07-11 06:42:23 -07:00
rhc54
c6bb227073
Merge pull request #692 from rhc54/topic/mapper
...
Fix hetero operations. An error in the hwloc utilities only allocated…
2015-07-07 13:33:42 -07:00
Ralph Castain
ed93154e43
Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line.
2015-07-07 12:52:16 -07:00
rhc54
a4aff5e3d9
Merge pull request #691 from rhc54/topic/mapper
...
Add a bunch of debug, and correct an error that caused us to use the …
2015-07-07 11:08:01 -07:00
Ralph Castain
7455802a36
Add a bunch of debug, and correct an error that caused us to use the wrong mapping policy when determining the default binding policy
2015-07-07 10:13:10 -07:00
Gilles Gouaillardet
409874eb47
remove trigraph '??)' from comment
...
Fujitsu compilers issue way too many warnings because of this trigraph
2015-07-07 11:00:13 +09:00
Ralph Castain
eb582b8276
Minor whitespace cleanups
2015-07-06 09:38:33 -07:00
Ralph Castain
836f49597d
There is no reason for tools to have an async progress thread as they can loop the event library themselves. This has the added benefit of causing the tool to "block" while waiting for events so they don't use cpu.
...
Also, fix orte-submit so it appropriately handles --help option
2015-07-05 10:45:28 -07:00
Ralph Castain
6829e192ad
Okay, that's it - trash it
2015-07-01 05:27:30 -05:00
Ralph Castain
6cd3ccd305
Update the OMP support per request from IBM and LLNL
2015-06-30 10:24:34 -05:00
Ralph Castain
a58171a974
Add some debug
2015-06-29 14:51:41 -05:00
Ralph Castain
a4557d4ed2
Add new component to support OpenMP envars per request from IBM and LLNL
2015-06-27 17:57:04 -07:00
Ralph Castain
4352123c26
Protect the oob/tcp component from port scanners
2015-06-26 01:40:57 -07:00
Nathan Hjelm
ee36d813dc
Merge pull request #657 from hjelmn/c99
...
more c99 updates
2015-06-25 11:21:09 -06:00
Nathan Hjelm
4d92c9989e
more c99 updates
...
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Howard Pritchard
e49a37c034
ownership: update ownership files
...
per discussions at OMPI devel workshop
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-25 10:04:42 -06:00
Ralph Castain
014a6a5969
Initialize variable to make clang happy
2015-06-24 22:01:09 -07:00
Ralph Castain
869041f770
Purge whitespace from the repo
2015-06-23 20:59:57 -07:00
Ralph Castain
db3c59b943
Silence a warning by converting the bitmap to a string prior to printing the error
2015-06-23 11:49:11 -07:00
Ralph Castain
706884652f
Silence Coverity warning about failing to check return code
2015-06-17 19:24:51 -07:00
Ralph Castain
869b2891c4
When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary
2015-06-17 09:20:08 -07:00
Gilles Gouaillardet
ec679b3fc2
orte/orted: fix misc memory leaks
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
b72e9288bc
rmaps: fix a misc memory leak
...
as reported by Coverity with CID 1269887
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
27b4727fcf
orte/orted: fix misc memory leak
...
as reported by Coverity with CID 743448
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
ac5921d7da
orte/util: fix misc memory leak
...
as reported by Coverity with CID 1196738-1196739
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
e77d3057d6
orte-submit: fix a misc memory leak
...
as reported by Coverity with CID 710651
2015-06-17 11:17:54 +09:00
Gilles Gouaillardet
67638690ea
orte/util: fix a misc memory leak
...
as reported by Coverity with CID 710652
2015-06-17 11:17:54 +09:00
Gilles Gouaillardet
a43abceb88
fix dfs misc memory leaks
...
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709
2015-06-17 11:17:54 +09:00
rhc54
adbff46a13
Merge pull request #642 from rhc54/topic/hwloc
...
Update hwloc to 1.11.0
2015-06-13 12:09:58 -07:00
Ralph Castain
ff92781ec4
Replace hwloc191 with hwloc1110
...
Fix hwloc compile. Ignore LAMA mapper due to deprecated hwloc functions
2015-06-13 10:11:45 -07:00
Ralph Castain
cebdf0b7c0
Add missing include
2015-06-09 22:08:05 -07:00
Howard Pritchard
05325b113e
odls/alps: fix busted build for cray.
...
This commit fixes things broken by commit ea35e47.
Fixes #616
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Ralph Castain
6b93db6a9a
Grrr...not sure how this slipped thru
2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184
Remove stale header
2015-05-29 19:24:51 -07:00
Ralph Castain
ea35e47228
Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail.
...
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.
We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.
This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
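The centralized listening thread described in the commit above can be illustrated with a small sketch: one thread waits on several listening sockets, does nothing but accept, and hands each new connection to a queue so the handshake happens elsewhere. This is an illustration of the pattern only, not the ORTE implementation; the class name, the tags and the use of a queue in place of the event library are all assumptions.

```python
import queue
import selectors
import threading


class ListenerThread(threading.Thread):
    """Single thread that reaps accepts for several registered listeners."""

    def __init__(self):
        super().__init__(daemon=True)
        self.accepted = queue.Queue()  # (tag, connection, address) tuples
        self._selector = selectors.DefaultSelector()
        self._quit = threading.Event()

    def register(self, listen_sock, tag):
        """Register a listening socket; call this before start()."""
        listen_sock.setblocking(False)
        self._selector.register(listen_sock, selectors.EVENT_READ, tag)

    def run(self):
        while not self._quit.is_set():
            for key, _events in self._selector.select(timeout=0.1):
                try:
                    conn, addr = key.fileobj.accept()
                except OSError:
                    continue  # connection vanished before we got to it
                # No handshake here: just hand off so the backlog drains.
                self.accepted.put((key.data, conn, addr))

    def stop(self):
        self._quit.set()
        self.join()
        self._selector.close()
```

Because the thread never blocks on protocol work, the kernel's connection backlog is emptied about as fast as clients connect, and each consumer (tagged here by a string such as "pmix" or "usock") pulls its own connections from the queue at its own pace.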
Nathan Hjelm
7db48c581d
orte_quit: Remove logically dead code
...
CID 71993 Logically dead code (DEADCODE)
As indicated by coverity proc can not be NULL at any point after the
continue. Removed dead code.
CID 1269682 Unchecked return value (CHECKED_RETURN)
Check the return code of orte_get_attribute. I assume we still need to
check for a NULL proc in case the aborted proc attribute is set to
NULL. This might be better as an assert ().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-26 12:16:12 -06:00
Ralph Castain
c21cd1c91e
Ensure the ssh session is dead
2015-05-23 08:14:29 -07:00
Ralph Castain
920562d9b4
Ensure that all ssh sessions are terminated when abnormally terminating the job
2015-05-23 08:14:29 -07:00
Jeff Squyres
5e52ce26b5
help-errmgr-base.txt: remove trailing newline
...
Removed spurious newline at end of file so that the emitted help
message doesn't contain a blank line before the final "-----" output.
2015-05-23 03:33:23 -07:00
Ralph Castain
55cd2a07f6
Update exit code
2015-05-22 21:06:43 -07:00
Ralph Castain
3510bb4ced
Set the exit code when a daemon fails
2015-05-22 21:05:23 -07:00
Ralph Castain
bc7a7f3de5
Fix abnormal shutdown when a node dies
2015-05-22 17:29:06 -07:00
Ralph Castain
96cd42699e
Cleanup warnings for uninitialized vars and convert bare debug output to verbose
2015-05-21 07:41:26 -07:00
Jeff Squyres
3069daa015
oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK
...
Have only a single level of "if" conditionals. Also, slightly change
the logic such that we only die/break out of the loop if we get EMFILE
-- all other errors are ok to go on to the next fd.
Finally, use a real show_help() message to warn when other errors occur.
2015-05-20 21:10:11 -04:00
Jeff Squyres
e43c8dc291
oob tcp: label a few #endif's
...
Only bother labeling the ones that are a little far away from their
corresponding #if statements.
2015-05-20 21:10:11 -04:00
Jeff Squyres
4b2f0d4827
oob tcp: reset MCA params from level 9
...
Set various MCA param levels
2015-05-20 21:10:11 -04:00
Jeff Squyres
1a4c9960e1
oob tcp: set KEEPALIVE timeout 60s, retry interval 5s
...
The timeout is the frequency at which to send keepalive pings; the retry
interval is how often to send successive pings once a keepalive has
not replied.
Also update comments and MCA param help strings.
60 seconds -- squashme
2015-05-20 21:08:37 -04:00
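As a rough illustration of the 60 second timeout and 5 second retry interval named in the commit above, the sketch below sets the corresponding socket options on a connected (not listening) socket. The mapping to `TCP_KEEPIDLE` and `TCP_KEEPINTVL` on Linux, and to `TCP_KEEPALIVE` on macOS, is an assumption on my part; the commit does not list the options it sets, and the function name is hypothetical.

```python
import socket


def enable_keepalive(sock, idle_secs=60, interval_secs=5):
    """Turn on TCP keepalive with the given idle time and probe interval.

    Platform-specific options are set only where the socket module
    exposes them, so the call degrades to plain SO_KEEPALIVE elsewhere.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # Linux
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    elif hasattr(socket, "TCP_KEEPALIVE"):  # macOS, newer Python versions
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, idle_secs)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
```

With these values an idle connection is probed after a minute of silence and then every five seconds until the peer answers or the probe count configured on the system runs out.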
Jeff Squyres
c95215dfc2
oob_tcp: do not set KEEPALIVE on listening sockets
2015-05-20 17:28:45 -04:00
Jeff Squyres
32d81af35f
oob tcp: re-enable keepalive option for Mac
...
Plus very minor #if/#endif reduction.
2015-05-20 17:28:45 -04:00
rhc54
95c40e64b9
Merge pull request #584 from nkogteva/oob_ud_stress_test
...
oob ud: fixed a bug that prevented the work with QoS framework
2015-05-20 09:56:08 -06:00
Gilles Gouaillardet
dd28b1f680
orted/dfs: fix misc memory leaks
...
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709 and 1269849
2015-05-20 13:09:46 +09:00
Ralph Castain
d3d3e73099
Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket
2015-05-15 07:13:42 -06:00
Ralph Castain
0a345d34e6
Plug the memory leak identified by George
2015-05-14 21:33:48 -06:00
Howard Pritchard
578430c36d
oob/alps: remove comment with personal reference
...
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-14 20:06:21 -07:00
Ralph Castain
8e30579e6e
The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac
2015-05-14 18:09:13 -06:00
Nadezhda Kogteva
d9dcf8352e
oob ud: fixed a bug that prevented the work with QoS framework (oob_stress_channel test)
2015-05-13 11:40:01 +03:00
Jeff Squyres
8e8d104520
oob ud: ibv_get_device_list()==NULL can mean no devices present
...
...which is not an error. Don't complain about it.
2015-05-12 10:54:39 -07:00
Jeff Squyres
8f941a6613
oob ud: better error msgs, tolerate systems without UD devices
...
It is perfectly ok to be on a system without UD devices.
Also, make some of the error messages better -- so that the user has a
clue about where the error messages are coming from, and what they
should do.
2015-05-11 13:11:51 -07:00
Mike Dubman
894ba28390
Merge pull request #559 from nkogteva/oob_ud
...
oob ud: made component more user adaptive; opal outputs were replaced by...
2015-05-11 21:09:28 +03:00
Ralph Castain
3cee4152fc
Fix the intercommunicator issue reported by Gilles. Instead of directly checking the reachability bitmap, ask the component if the proc is reachable when doing a send as the component is the final arbiter in such cases. Recirculate any messages that a daemon is trying to send to avoid race conditions. Cleanup listener sockets so we don't leak them
2015-05-11 09:16:25 -07:00
Howard Pritchard
3382d3ce61
ess/alps: remove unnecessary vpid calc
...
There was a redundant computation of the vpid
for orted's happening in ess/alps rte_init
method. Keep the more efficient alps based
method.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-09 20:07:38 -07:00
Ralph Castain
b5382c9bf9
Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel.
...
A few other minor cleanups included.
2015-05-08 11:15:21 -07:00
Ralph Castain
6e95bcd583
Fix typo in oob_tcp.c when IPV6 enabled. Cleanup a few other warnings, including a type in coll_sm that prevented that component from registering its MCA params!
2015-05-07 21:05:08 -07:00
Gilles Gouaillardet
a80fda25d8
orte: rename the global variable component_map into orte_component_map
...
Thanks @goodell for pointing this out!
2015-05-08 10:11:59 +09:00
Gilles Gouaillardet
2e384a3b65
initialize common symbols from orte
...
A few uninitialized common symbols are remaining (generated by flex) :
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
9cb2fcfa5c
Cleanup the qos code when --enable-timings is given
2015-05-06 20:24:27 -07:00
Ralph Castain
01a9bdf4cf
Cleanup of ud/oob component
2015-05-06 19:48:42 -07:00
Ralph Castain
1f8de276de
Consolidate all the QOS changes into one clean commit
2015-05-06 19:48:42 -07:00
Ralph Castain
8e3f0b1d33
Ensure the --tree-spawn option is inside any parens from the sh and ksh shell support
2015-05-06 15:18:15 -07:00