Commit graph

5073 commits

Author SHA1 Message Date
Howard Pritchard
8d7e759b85 oob/alps: swat compiler warning
swat some alps related compiler warnings when using --enable-picky

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-21 14:24:26 -07:00
Ralph Castain
92ae386a34 As Jeff proposed, change the check to looking for the filename's first character to be a digit 2015-09-21 08:22:58 -07:00
rhc54
13def2a69b Merge pull request #911 from rhc54/topic/cleanup
Cleanup the odls "close file descriptor" commit to conform to OMPI co…
2015-09-20 07:01:39 -07:00
Howard Pritchard
1367a442b6 Merge pull request #910 from hppritcha/topic/odls_alps_use_907_stuff
odls/alps: do smarter close of fds in child
2015-09-20 07:37:55 -06:00
Ralph Castain
c167acc5a7 Cleanup the odls "close file descriptor" commit to conform to OMPI coding standards and remove memory leaks 2015-09-19 20:46:36 -07:00
Howard Pritchard
a31cc21bea odls/alps: do smarter close of fds in child
Use a modified variant of #907.  Thanks to plesn
for noticing this.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-19 14:17:05 -07:00
Piotr Lesnicki
1dd5487fae odls: close only used file descriptors at fork/exec 2015-09-18 16:44:57 +02:00
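The descriptor cleanup named in 1dd5487fae, together with the "first character is a digit" check from 92ae386a34, can be sketched roughly as below. This is an illustration rather than the code from either commit: the function name `close_fds_from`, the use of the Linux-specific `/proc/self/fd` directory, and the capped fallback loop are all assumptions.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: close every open descriptor numbered >= lowest.
 * Returns the number of descriptors that were closed. */
int close_fds_from(int lowest)
{
    int closed = 0;
    DIR *dir = opendir("/proc/self/fd");   /* assumed: Linux-style fd listing */

    if (NULL == dir) {
        /* Fallback: no listing available, so probe every possible descriptor.
         * The 65536 cap is an arbitrary choice for this sketch. */
        long max = sysconf(_SC_OPEN_MAX);
        if (max < 0 || max > 65536) {
            max = 65536;
        }
        for (long fd = lowest; fd < max; fd++) {
            if (0 == close((int)fd)) {
                closed++;
            }
        }
        return closed;
    }

    int self = dirfd(dir);                 /* never close the directory's own fd */
    struct dirent *entry;
    while (NULL != (entry = readdir(dir))) {
        /* Entries that do not start with a digit ("." and "..") are skipped. */
        if (!isdigit((unsigned char)entry->d_name[0])) {
            continue;
        }
        int fd = (int)strtol(entry->d_name, NULL, 10);
        if (fd >= lowest && fd != self && 0 == close(fd)) {
            closed++;
        }
    }
    closedir(dir);
    return closed;
}
```

A forked child would typically call something like `close_fds_from(3)` just before `exec`, so that only the descriptors actually in use are visited instead of every value up to the process limit.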
Ralph Castain
1b7930ad52 Silence some warnings and address Coverity issues 2015-09-16 07:58:22 -07:00
Ralph Castain
8b88ea9b13 Fix singletons by removing stale code 2015-09-16 00:58:05 -07:00
Ralph Castain
c1bbbb5e2f Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds 2015-09-15 13:08:35 -07:00
Ralph Castain
22d7c0081a Fix the no-disconnect test by resolving a segfault on free - opal_dss.unload will return the remaining unpacked portion of a buffer. As such, it cannot return the pointer to that info as it might be partway inside of a malloc'd region. So copy the data out of the buffer. 2015-09-11 13:01:35 -07:00
Ralph Castain
dc5796b8a1 Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local""
Fix the locality computation by correctly computing the vpid of the local peer

This reverts commit open-mpi/ompi@6a8fad49e5.
2015-09-11 08:29:51 -07:00
Ralph Castain
6a8fad49e5 Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"
This reverts commit f94f3cda21.
2015-09-11 02:01:25 -07:00
Ralph Castain
f94f3cda21 Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local 2015-09-10 10:25:30 -07:00
rhc54
f6b6b9a9ca Merge pull request #877 from rhc54/topic/s1s2
Cleanup s1 and s2 components
2015-09-08 19:20:59 -07:00
Ralph Castain
1cdb86b8c7 Cleanup s1 and s2 components, and ensure that mpirun and orteds only use non-direct-launch pmix components. 2015-09-08 18:37:09 -07:00
Ralph Castain
459f169e06 Fix segfault upon job error
Silence some unnecessary error-logs
2015-09-08 14:03:06 -07:00
Jeff Squyres
bc9e5652ff whitespace: purge whitespace at end of lines
Generated by running "./contrib/whitespace-purge.sh".
2015-09-08 09:47:17 -07:00
Ralph Castain
e6add86e4f Deal with connect/accept between two jobs from different mpirun's. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues. 2015-09-07 09:19:24 -07:00
Ralph Castain
37c3ed68e7 Cleanup connect/disconnect and bring comm_spawn back online! 2015-09-06 10:27:39 -07:00
rhc54
665b30376a Merge pull request #868 from rhc54/topic/hwloc
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 17:58:07 -07:00
Ralph Castain
d97bc29102 Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given 2015-09-04 16:54:40 -07:00
Ralph Castain
f6948c2bb4 Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working 2015-09-04 10:07:17 -07:00
Ralph Castain
a772b46c15 Bring the MPI_Publish and friends online 2015-09-02 12:04:07 -07:00
Ralph Castain
38ba54366c Fix shared memory operations by resolving local peers 2015-08-30 12:07:14 -07:00
Ralph Castain
0d5814b5ca Cleanup Coverity issues 2015-08-29 21:19:27 -07:00
Ralph Castain
cf6137b530 Integrate PMIx 1.0 with OMPI.
Bring Slurm PMI-1 component online
Bring the s2 component online

Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.

Bring the OMPI pubsub/pmi component online

Get comm_spawn working again

Ensure we always provide a cpuset, even if it is NULL

pmix/cray: adjust cray pmix component for pmix

Make changes so cray pmix can work within the integrated
ompi/pmix framework.

Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet

Cleanup comm_spawn - procs now starting, error in connect_accept

Complete integration
2015-08-29 16:04:10 -07:00
Ralph Castain
89c80b2294 Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener. 2015-08-27 16:41:00 -07:00
Nathan Hjelm
156ce6af21 periodic whitespace purge
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 09:32:33 -06:00
Ralph Castain
bc7815e178 Adjust the process type flags to remove confusion between orted and dvm state machines 2015-08-21 07:50:08 -07:00
Ralph Castain
5040f47ef3 Use the correct verbosity in an output_verbose 2015-08-13 22:33:25 -07:00
Ralph Castain
a2a049a612 Update test to match the one in MTT 2015-08-13 11:12:34 -07:00
Ralph Castain
0b1d4b62be Cleanup some cruft and update to coordinate with CM operations:
* don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway
* ensure state/orted component disqualifies itself from CM operations
* clarify the DVM proc_type definitions
* ensure we stop littering the tmp dir with session directories
2015-08-12 10:32:14 -07:00
Jeff Squyres
31b329e585 odls default: ensure to initialize opts
This fixes CID 71127.
2015-08-12 05:27:37 -07:00
Howard Pritchard
8e7e4ca7f4 Merge pull request #780 from hppritcha/topic/plm_alps_minor_cleanup
plm/alps: remove unneeded env. variable setting
2015-08-07 15:03:45 -06:00
Jeff Squyres
09f7434491 ORTE: update for the new opal_progress_thread API 2015-08-07 10:13:40 -07:00
Howard Pritchard
1b55d14dff plm/alps: remove unneeded env. variable setting
In order to address issue #741, the orted's now are
always launched with the Cray PMI environment variables

PMI_NO_FORK
PMI_NO_PREINITIALIZE

set to disable running of the library's ctor.
So there's no longer a need to set these for the
application(s) being launched by the orted's.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-08-05 13:27:18 -07:00
Ralph Castain
9bc384282a Fix an annoying segfault caused by incorrect indentation in a loop that causes the buffer to not be created prior to packing. 2015-08-01 10:01:47 -07:00
Ralph Castain
023936e84b Silence coverity warnings 2015-07-29 07:28:08 -07:00
Gilles Gouaillardet
429bdf1af7 oob/tcp: fix a race condition when finalizing the oob/tcp component 2015-07-28 09:16:13 +09:00
Ralph Castain
93f7a51275 Update the orte/system/opal_hotel test 2015-07-24 07:34:59 -07:00
Howard Pritchard
70096d3753 plm/alps: fix orted based launch failures.
Turns out that when one builds Open MPI with --disable-dlopen
for Cray, a whole bunch of cray specific libraries get linked
in to the orted executable.  One of these is Cray PMI.  The
Cray PMI has a ctor which, if run, causes job launches using
mpirun to fail.  This commit suppresses the running of the
ctor and thus prevents failure to launch.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-07-23 15:07:57 -07:00
Jeff Squyres
60609cbb79 orte/test/system: fix compiler warnings
Note that the opal_hotel test still doesn't compile; it looks like it
needs to be updated to the new requirement to pass an event base.
2015-07-23 06:19:33 -07:00
Ralph Castain
4853457b93 The RML posted recvs are controlled by the async progress thread when in an application process. The call to finalize and close the RML is done from the main thread, and so we need to shift the actual destruct of the posted recv list to the async thread for handling or else we encounter a race condition when accessing the posted recvs.
Thanks to Gilles for providing the required debug info
2015-07-21 08:44:23 -07:00
Ralph Castain
219c4dfba5 Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one. 2015-07-12 08:23:34 -07:00
rhc54
bd91225cb5 Merge pull request #716 from rhc54/topic/alloc
Default allocated nodes to the UP state
2015-07-11 12:30:32 -07:00
Ralph Castain
2c896c5a2d Default allocated nodes to the UP state 2015-07-11 10:43:11 -07:00
Ralph Castain
683efcb850 Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename. 2015-07-11 10:08:19 -07:00
rhc54
053d9b2a7c Merge pull request #713 from rhc54/topic/errhandler
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:58:57 -07:00
Ralph Castain
a2243dcddd Add an opal/errhandler so opal-level errors can be up-leveled 2015-07-11 07:09:11 -07:00
Ralph Castain
61fb067f14 Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base 2015-07-11 06:42:23 -07:00
rhc54
c6bb227073 Merge pull request #692 from rhc54/topic/mapper
Fix hetero operations. An error in the hwloc utilities only allocated…
2015-07-07 13:33:42 -07:00
Ralph Castain
ed93154e43 Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line. 2015-07-07 12:52:16 -07:00
rhc54
a4aff5e3d9 Merge pull request #691 from rhc54/topic/mapper
Add a bunch of debug, and correct an error that caused us to use the …
2015-07-07 11:08:01 -07:00
Ralph Castain
7455802a36 Add a bunch of debug, and correct an error that caused us to use the wrong mapping policy when determining the default binding policy 2015-07-07 10:13:10 -07:00
Gilles Gouaillardet
409874eb47 remove trigraph '??)' from comment
Fujitsu compilers issue way too many warnings because of this trigraph
2015-07-07 11:00:13 +09:00
Ralph Castain
eb582b8276 Minor whitespace cleanups 2015-07-06 09:38:33 -07:00
Ralph Castain
836f49597d There is no reason for tools to have an async progress thread as they can loop the event library themselves. This has the added benefit of causing the tool to "block" while waiting for events so they don't use cpu.
Also, fix orte-submit so it appropriately handles --help option
2015-07-05 10:45:28 -07:00
Ralph Castain
6829e192ad Okay, that's it - trash it 2015-07-01 05:27:30 -05:00
Ralph Castain
6cd3ccd305 Update the OMP support per request from IBM and LLNL 2015-06-30 10:24:34 -05:00
Ralph Castain
a58171a974 Add some debug 2015-06-29 14:51:41 -05:00
Ralph Castain
a4557d4ed2 Add new component to support OpenMP envars per request from IBM and LLNL 2015-06-27 17:57:04 -07:00
Ralph Castain
4352123c26 Protect the oob/tcp component from port scanners 2015-06-26 01:40:57 -07:00
Nathan Hjelm
ee36d813dc Merge pull request #657 from hjelmn/c99
more c99 updates
2015-06-25 11:21:09 -06:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Howard Pritchard
e49a37c034 ownership: update ownership files
per discussions at OMPI devel workshop

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-25 10:04:42 -06:00
Ralph Castain
014a6a5969 Initialize variable to make clang happy 2015-06-24 22:01:09 -07:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
db3c59b943 Silence a warning by converting the bitmap to a string prior to printing the error 2015-06-23 11:49:11 -07:00
Ralph Castain
706884652f Silence Coverity warning about failing to check return code 2015-06-17 19:24:51 -07:00
Ralph Castain
869b2891c4 When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary 2015-06-17 09:20:08 -07:00
Gilles Gouaillardet
ec679b3fc2 orte/orted: fix misc memory leaks 2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
b72e9288bc rmaps: fix a misc memory leak
as reported by Coverity with CID 1269887
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
27b4727fcf orte/orted: fix misc memory leak
as reported by Coverity with CID 743448
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
ac5921d7da orte/util: fix misc memory leak
as reported by Coverity with CID 1196738-1196739
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
e77d3057d6 orte-submit: fix a misc memory leak
as reported by Coverity with CID 710651
2015-06-17 11:17:54 +09:00
Gilles Gouaillardet
67638690ea orte/util: fix a misc memory leak
as reported by Coverity with CID 710652
2015-06-17 11:17:54 +09:00
Gilles Gouaillardet
a43abceb88 fix dfs misc memory leaks
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709
2015-06-17 11:17:54 +09:00
rhc54
adbff46a13 Merge pull request #642 from rhc54/topic/hwloc
Update hwloc to 1.11.0
2015-06-13 12:09:58 -07:00
Ralph Castain
ff92781ec4 Replace hwloc191 with hwloc1110
Fix hwloc compile. Ignore LAMA mapper due to deprecated hwloc functions
2015-06-13 10:11:45 -07:00
Ralph Castain
cebdf0b7c0 Add missing include 2015-06-09 22:08:05 -07:00
Howard Pritchard
05325b113e odls/alps: fix busted build for cray.
This commit fixes things broken by commit
ea35e47.

Fixes #616

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Ralph Castain
6b93db6a9a Grrr...not sure how this slipped thru 2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184 Remove stale header 2015-05-29 19:24:51 -07:00
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Nathan Hjelm
7db48c581d orte_quit: Remove logically dead code
CID 71993 Logically dead code (DEADCODE)

As indicated by coverity proc can not be NULL at any point after the
continue. Removed dead code.

CID 1269682 Unchecked return value (CHECKED_RETURN)

Check the return code of orte_get_attribute. I assume we still need to
check for a NULL proc in case the aborted proc attribute is set to
NULL. This might be better as an assert ().

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-26 12:16:12 -06:00
Ralph Castain
c21cd1c91e Ensure the ssh session is dead 2015-05-23 08:14:29 -07:00
Ralph Castain
920562d9b4 Ensure that all ssh sessions are terminated when abnormally terminating the job 2015-05-23 08:14:29 -07:00
Jeff Squyres
5e52ce26b5 help-errmgr-base.txt: remove trailing newline
Removed spurious newline at end of file so that the emitted help
message doesn't contain a blank line before the final "-----" output.
2015-05-23 03:33:23 -07:00
Ralph Castain
55cd2a07f6 Update exit code 2015-05-22 21:06:43 -07:00
Ralph Castain
3510bb4ced Set the exit code when a daemon fails 2015-05-22 21:05:23 -07:00
Ralph Castain
bc7a7f3de5 Fix abnormal shutdown when a node dies 2015-05-22 17:29:06 -07:00
Ralph Castain
96cd42699e Cleanup warnings for uninitialized vars and convert bare debug output to verbose 2015-05-21 07:41:26 -07:00
Jeff Squyres
3069daa015 oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK
Have only a single level of "if" conditionals.  Also, slightly change
the logic such that we only die/break out of the loop if we get EMFILE
-- all other errors are ok to go on to the next fd.

Finally, use a real show_help() message to warn when other errors occur.
2015-05-20 21:10:11 -04:00
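The error handling described in 3069daa015 (one flat level of conditionals, with only EMFILE ending the accept loop) might look something like the classifier below. The enum and function names are hypothetical, and treating EAGAIN/EWOULDBLOCK as "nothing left to accept" is an assumption about the surrounding loop, not something the commit message states.

```c
#include <errno.h>

/* Hypothetical outcomes for a failed accept() on a listening socket. */
enum accept_action {
    ACCEPT_DRAINED,   /* no more pending connections right now */
    ACCEPT_ABORT,     /* out of file descriptors: leave the loop */
    ACCEPT_NEXT_FD    /* report the error and move on to the next fd */
};

enum accept_action classify_accept_errno(int err)
{
    /* EAGAIN and EWOULDBLOCK may share one value, so they are compared
     * with || instead of appearing as two labels in a switch. */
    if (EAGAIN == err || EWOULDBLOCK == err) {
        return ACCEPT_DRAINED;
    }
    if (EMFILE == err) {
        return ACCEPT_ABORT;
    }
    return ACCEPT_NEXT_FD;
}
```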
Jeff Squyres
e43c8dc291 oob tcp: label a few #endif's
Only bother labeling the ones that are a little far away from their
corresponding #if statements.
2015-05-20 21:10:11 -04:00
Jeff Squyres
4b2f0d4827 oob tcp: reset MCA params from level 9
Set various MCA param levels
2015-05-20 21:10:11 -04:00
Jeff Squyres
1a4c9960e1 oob tcp: set KEEPALIVE timeout 60s, retry interval 5s
The timeout is the frequency at which to send keepalive pings; the retry
interval is how often to send successive pings once a keepalive has
not replied.

Also update comments and MCA param help strings.

60 seconds -- squashme
2015-05-20 21:08:37 -04:00
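As a rough sketch of the settings named in 1a4c9960e1, the helper below enables keepalive on a connected socket with a 60 s timeout and a 5 s retry interval. The function name is hypothetical, and mapping the commit's "timeout" to `TCP_KEEPIDLE` (or `TCP_KEEPALIVE` on macOS) and its "retry interval" to `TCP_KEEPINTVL` is an assumption; the commit message does not name the socket options.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turn on TCP keepalive for socket sd.
 * Returns 0 on success, -1 if any setsockopt() call fails. */
int enable_keepalive(int sd, int idle_secs, int interval_secs)
{
    int on = 1;

    if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }
#if defined(TCP_KEEPIDLE)
    /* Linux: idle time before the first keepalive probe is sent. */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) < 0) {
        return -1;
    }
#elif defined(TCP_KEEPALIVE)
    /* macOS spells the same setting TCP_KEEPALIVE. */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPALIVE, &idle_secs, sizeof(idle_secs)) < 0) {
        return -1;
    }
#else
    (void)idle_secs;
#endif
#if defined(TCP_KEEPINTVL)
    /* Interval between successive probes once one goes unanswered. */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs)) < 0) {
        return -1;
    }
#else
    (void)interval_secs;
#endif
    return 0;
}
```

With the values from the commit subject the call would be `enable_keepalive(sd, 60, 5)`. Per c95215dfc2, it would be applied to connected sockets only, not to listening sockets.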
Jeff Squyres
c95215dfc2 oob_tcp: do not set KEEPALIVE on listening sockets 2015-05-20 17:28:45 -04:00
Jeff Squyres
32d81af35f oob tcp: re-enable keepalive option for Mac
Plus very minor #if/#endif reduction.
2015-05-20 17:28:45 -04:00
rhc54
95c40e64b9 Merge pull request #584 from nkogteva/oob_ud_stress_test
oob ud: fixed a bug that prevented the work with QoS framework
2015-05-20 09:56:08 -06:00
Gilles Gouaillardet
dd28b1f680 orted/dfs: fix misc memory leaks
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709 and 1269849
2015-05-20 13:09:46 +09:00
Ralph Castain
d3d3e73099 Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket 2015-05-15 07:13:42 -06:00
Ralph Castain
0a345d34e6 Plug the memory leak identified by George 2015-05-14 21:33:48 -06:00
Howard Pritchard
578430c36d oob/alps: remove comment with personal reference
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-14 20:06:21 -07:00
Ralph Castain
8e30579e6e The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac 2015-05-14 18:09:13 -06:00
Nadezhda Kogteva
d9dcf8352e oob ud: fixed a bug that prevented the work with QoS framework (oob_stress_channel test) 2015-05-13 11:40:01 +03:00
Jeff Squyres
8e8d104520 oob ud: ibv_get_device_list()==NULL can mean no devices present
...which is not an error.  Don't complain about it.
2015-05-12 10:54:39 -07:00
Jeff Squyres
8f941a6613 oob ud: better error msgs, tolerate systems without UD devices
It is perfectly ok to be on a system without UD devices.

Also, make some of the error messages better -- so that the user has a
clue about where the error messages are coming from, and what they
should do.
2015-05-11 13:11:51 -07:00
Mike Dubman
894ba28390 Merge pull request #559 from nkogteva/oob_ud
oob ud: made component more user adaptive; opal outputs were replaced by...
2015-05-11 21:09:28 +03:00
Ralph Castain
3cee4152fc Fix the intercommunicator issue reported by Gilles. Instead of directly checking the reachability bitmap, ask the component if the proc is reachable when doing a send as the component is the final arbiter in such cases. Recirculate any messages that a daemon is trying to send to avoid race conditions. Cleanup listener sockets so we don't leak them 2015-05-11 09:16:25 -07:00
Howard Pritchard
3382d3ce61 ess/alps: remove unnecessary vpid calc
There was a redundant computation of the vpid
for orted's happening in ess/alps rte_init
method.  Keep the more efficient alps based
method.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-09 20:07:38 -07:00
Ralph Castain
b5382c9bf9 Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel.
A few other minor cleanups included.
2015-05-08 11:15:21 -07:00
Ralph Castain
6e95bcd583 Fix typo in oob_tcp.c when IPV6 enabled. Cleanup a few other warnings, including a typo in coll_sm that prevented that component from registering its MCA params! 2015-05-07 21:05:08 -07:00
Gilles Gouaillardet
a80fda25d8 orte: rename the global variable component_map into orte_component_map
Thanks @goodell for pointing this out!
2015-05-08 10:11:59 +09:00
Gilles Gouaillardet
2e384a3b65 initialize common symbols from orte
A few uninitialized common symbols are remaining (generated by flex) :
 * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
 * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
 * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
 * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
9cb2fcfa5c Cleanup the qos code when --enable-timings is given 2015-05-06 20:24:27 -07:00
Ralph Castain
01a9bdf4cf Cleanup of ud/oob component 2015-05-06 19:48:42 -07:00
Ralph Castain
1f8de276de Consolidate all the QOS changes into one clean commit 2015-05-06 19:48:42 -07:00
Ralph Castain
8e3f0b1d33 Ensure the --tree-spawn option is inside any parens from the sh and ksh shell support 2015-05-06 15:18:15 -07:00
Ralph Castain
0bb73645f0 Silence Coverity warning 2015-04-30 20:49:28 -07:00
Ralph Castain
7d1980ba83 Add the ability to specify the number of desired slots in the --host option. Just giving a host name => one slot (multiple copies of the name yield one slot per copy). Giving "foo:3" indicates you want three slots - a shorthand notation for saying "foo" three times. Giving "foo:*" indicates you want the topology to set the number of slots based on the orte_set_slots param. 2015-04-30 20:35:23 -07:00
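The `--host` slot syntax described in 7d1980ba83 can be illustrated with a small parser. This is only a sketch of the notation, not the ORTE implementation: the function name, the `SLOTS_FROM_TOPOLOGY` sentinel, and the rejection of zero or negative counts are assumptions, and hosts given as IPv6 literals (which contain colons) are not handled.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS_FROM_TOPOLOGY (-1)   /* hypothetical sentinel for "foo:*" */

/* Hypothetical parser for one --host entry.
 * "foo" yields 1 slot, "foo:3" yields 3, "foo:*" yields SLOTS_FROM_TOPOLOGY.
 * Returns 0 on success, -1 on a malformed entry. */
int parse_host_spec(const char *spec, char *name, size_t name_len, int *slots)
{
    const char *colon = strchr(spec, ':');
    size_t n = (NULL == colon) ? strlen(spec) : (size_t)(colon - spec);

    if (0 == n || n >= name_len) {
        return -1;                      /* empty or overlong host name */
    }
    memcpy(name, spec, n);
    name[n] = '\0';

    if (NULL == colon) {
        *slots = 1;                     /* bare name: one slot */
        return 0;
    }
    if (0 == strcmp(colon + 1, "*")) {
        *slots = SLOTS_FROM_TOPOLOGY;   /* let the topology decide */
        return 0;
    }

    char *end = NULL;
    long v = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || '\0' != *end || v <= 0 || v > INT_MAX) {
        return -1;                      /* not a positive integer */
    }
    *slots = (int)v;
    return 0;
}
```

Because repeating a bare name yields one slot per copy, a caller would add up the results for entries that share a host name.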
Ralph Castain
e26e7ad736 Better support automated tests for map, rank, and bind options 2015-04-30 14:01:13 -07:00
Ralph Castain
7d4f9970d8 Minor cleanup 2015-04-29 17:49:35 -07:00
Nadezhda Kogteva
01ce58391e oob ud: made component more user adaptive; opal outputs were replaced by help messages. 2015-04-28 15:36:32 +03:00
Jeff Squyres
8fbf34b196 oob ud: put call to ibv_fork_init() before *all* ibv calls
Move the call to opal_common_verbs_fork_test() to up before the call
to ibv_get_device_list() (just curious -- why not use
opal_ibv_get_device_list()?).  This ensures that the call to
ibv_fork_init() is before *all* other ibv_* calls.
2015-04-24 14:19:06 -07:00
Ralph Castain
9104e81958 When --map-by node, we should be unbound. Also remove dead code due to copy/paste error. 2015-04-23 20:35:54 -07:00
Ralph Castain
5003be5c5c If the user specifies a --map-by <foo> option, then default to bind-to <foo> unless they specify a bind-to option. If they map-by slot/node, then use the default policy based on num_procs. 2015-04-23 13:30:21 -07:00
Ralph Castain
d5e4fd059f Ensure the binding and locale strings are always defined 2015-04-23 07:43:37 -07:00
Ralph Castain
cb7330a543 Get the output to lineup properly 2015-04-23 07:38:51 -07:00
Jeff Squyres
79243aca4e display-devel-map: minor output tweak
hwloc output can get fairly long, especially on machines with lots of
cores and/or hyperthreads.  So put the Locale and Binding output on
separate lines.
2015-04-23 06:14:57 -07:00
Ralph Castain
58e646ccfd Reduce confusion by having the devel-map display in the same format as report-bindings 2015-04-23 04:30:00 -07:00
Ralph Castain
43229d056e Protect one more place from a NULL object 2015-04-20 18:45:57 -07:00
Jeff Squyres
11e8c2096b plm rsh: assign some levels to the rsh PLM MCA params 2015-04-20 16:18:57 -07:00
Nathan Hjelm
359a282e7d ess/singleton: MCA variable synonyms can not currently have NULL for both framework and component
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-20 16:50:52 -06:00
Ralph Castain
e8387fcf88 Protect tools that can never run in distributed mode from getting confused by PMI. 2015-04-20 15:42:57 -07:00
Nathan Hjelm
45e053dbce orte: use C99 subobject naming for component initialization
This commit helps future-proof orte components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
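For readers unfamiliar with the C99 feature mentioned in 45e053dbce, the sketch below initializes a structure member by member using designated initializers. The `demo_component` type and its fields are hypothetical stand-ins, not an actual ORTE component structure.

```c
#include <stddef.h>

/* Hypothetical component descriptor, for illustration only. */
struct demo_component {
    const char *name;
    int major_version;
    int minor_version;
    int (*open)(void);
    int (*close)(void);
};

static int demo_open(void)
{
    return 0;
}

/* Each member is set by name, so reordering or adding fields in the
 * struct definition does not silently shift the values. Members that
 * are not listed (.minor_version, .close) are zero-initialized. */
struct demo_component demo = {
    .name = "demo",
    .major_version = 1,
    .open = demo_open,
};
```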
Ralph Castain
34b53ac3dc Silence Coverity warnings 2015-04-18 07:48:22 -07:00
Ralph Castain
12bfb27161 Redo in cleaner form: Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command 2015-04-17 16:11:37 -07:00
Nadezhda Kogteva - nadezhda.kogteva@itseez.com
c2678b0cc9 oob ud: fixes and parameter adjustment 2015-04-17 16:22:43 +03:00
Nathan Hjelm
3436f2917d Merge pull request #449 from hjelmn/mca_base_update
mca/base update
2015-04-16 08:41:48 -06:00
Ralph Castain
d9c555b547 Revert "Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command"
This reverts commit open-mpi/ompi@278324c52a.

Revert "Add the ability to pass args to the rsh/ssh command line"

This reverts commit open-mpi/ompi@6f227f8564.
2015-04-16 08:03:14 -06:00
rhc54
79b9c50717 Merge pull request #535 from rhc54/topic/rsh
Add the ability to pass args to the rsh/ssh command line
2015-04-15 21:11:46 -06:00
Ralph Castain
278324c52a Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command 2015-04-15 20:30:04 -06:00
Ralph Castain
0e23f76eee Fix comment 2015-04-15 20:09:14 -06:00
Ralph Castain
6f227f8564 Add the ability to pass args to the rsh/ssh command line 2015-04-15 20:07:13 -06:00
Howard Pritchard
283ef4c05d oob/config: if --with-verbs=no, no ud
The oob/ud configure was not honoring the case
if the ompi is configured with --with-verbs=no.
This fixes that problem.

Fixes #522

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-04-14 06:31:18 -07:00
Nathan Hjelm
113c890ccf Merge pull request #520 from hjelmn/valgrind_cleanness
fix memory leaks and valgrind errors
2015-04-13 10:09:34 -06:00
Ralph Castain
9c6d452d6b If we are using HT cpus and have <= 2 procs, then map-by hwthread by default 2015-04-11 21:18:05 -07:00
Ralph Castain
cd686057f6 If the HNP is on a coprocessor, record it so we don't get an error log later 2015-04-11 15:30:15 -07:00
Nathan Hjelm
a7b0c00ab6 fix memory leaks and valgrind errors
This commit fixes several valgrind errors. Included:

 - installdirs did not correctly reinitialize all pointers to NULL
   at close. This causes valgrind errors on a subsequent call to
   opal_init_tool.

 - several opal strings were leaked by opal_deregister_params which
   was setting them to NULL instead of letting them be freed by the
   MCA variable system.

 - move opal_net_init to AFTER the variable system is initialized and
   opal's MCA variables have been registered. opal_net_init uses a
   variable registered by opal_register_params!

 - do not leak ompi_mpi_main_thread when it is allocated by
   MPI_T_init_thread.

 - do not overwrite ompi_mpi_main_thread if it is already set (by
   MPI_T_init_thread).

 - mca_base_var: read_files was overwriting mca_base_var_file_list
   even if it was non-NULL.

 - mca_base_var: set all file global variables to initial states on
   finalize.

 - btl/vader: decrement enumerator reference count to ensure that it
   is freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-11 09:28:35 -06:00
Ralph Castain
91e1cbf284 Init variable 2015-04-11 07:44:57 -07:00
Ralph Castain
033418f62a Correct a typo that reversed the default binding pattern. Ensure we default bind to hwthread if user specified --use-hwthread-cpus if nprocs <= 2, and bind to hwthread if told to do so. 2015-04-10 15:58:35 -07:00
Ralph Castain
3e44d3c9e3 Enable singletons to run without any active OOB module until they attempt to comm_spawn 2015-04-10 14:06:42 -07:00
Ralph Castain
e4f6f83b9d Attempt to silence new Coverity complaint by ensuring the string read from file is NULL terminated. 2015-04-10 07:54:37 -07:00
Ralph Castain
396700ad8b Protect the notifier macros against NULL job objects 2015-04-09 16:04:43 -07:00
Nathan Hjelm
c416c423bb ess/singleton: do not put component strings into the environment
putenv requires that any string put into the environment is not
changed or freed. That is not the case with constant strings as they
will go away when dlclose is called on the component. Instead, just
use opal_setenv which does not have this restriction.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-09 11:00:47 -06:00
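The difference that c416c423bb relies on can be shown with the POSIX calls directly. In this sketch `setenv()` stands in for `opal_setenv`, whose exact signature is not given in the log, and both wrapper names are hypothetical. The behavior shown for `putenv()` (the caller's buffer becomes part of the environment) is what POSIX describes and what glibc does.

```c
#include <stdlib.h>

/* Copying variant: setenv() duplicates the name and value, so the
 * caller's storage may change or disappear afterwards. */
int env_set_copy(const char *name, const char *value)
{
    return setenv(name, value, 1);
}

/* Aliasing variant: putenv() places the caller's "NAME=value" buffer
 * itself into the environment. If that buffer lives in a component
 * that is later dlclose()d, the environment is left pointing at
 * unmapped memory. */
int env_set_alias(char *name_equals_value)
{
    return putenv(name_equals_value);
}
```

Modifying a buffer after `env_set_alias()` changes what `getenv()` returns, while modifying the value passed to `env_set_copy()` has no effect on the environment.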
Nathan Hjelm
9cd955badf opal: fix multiple bugs in MCA and opal
This commit fixes the following bugs:

 - opal_output_finalize did not properly set internal state. This
   caused problems when calling the sequence opal_output_init (),
   opal_output_finalize (), opal_output_init ().

 - opal_info support called mca_base_open () but never called the
   matching mca_base_close (). mca_base_open () and mca_base_close ()
   have been updated to use an open count instead of an open flag to
   allow mca_base_open to be called through multiple paths (as may be
   the case when MPI_T is in use).

 - orte_info support did not register opal variables. This can cause
   orte-info to not return opal variables.

 - opal_info, orte_info, and ompi_info support have been updated to
   use a register count.

 - When opening the dl framework the reference count was added to
   ensure the framework stuck around. The framework being closed
   prematurely was a bug in the MCA base that has since been
   corrected. The increment (and associated decrement) have been
   removed.

 - dl/dlopen did not set the value of
   mca_dl_dlopen_component.filename_suffixes_mca_storage on each call
   to register. Instead the value was set in the component
   structure. This caused the value to be lost when re-loading the
   component. Fixed by setting the default value in register.

 - Reset shmem framework state on close to avoid returning a stale
   component after reloading opal/shmem.

 - MCA base parameters were not properly deregistered when the MCA
   base was closed.

This commit may fix #374.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-07 19:13:20 -06:00
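The "open count instead of an open flag" change described above can be pictured as follows. This is only a sketch under assumptions: the `demo_base_*` names and the `resources_ready` variable are hypothetical and do not reproduce the real `mca_base_open ()` / `mca_base_close ()` code. The point is that with a counter, only the first open initializes and only the last matching close tears down, so several callers can open the base independently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state standing in for whatever the base sets up. */
static int open_count = 0;
static bool resources_ready = false;

/* First caller initializes; later callers only bump the count. */
int demo_base_open(void)
{
    if (open_count++ > 0) {
        return 0;
    }
    resources_ready = true;
    return 0;
}

/* Only the close that matches the first open tears things down. */
int demo_base_close(void)
{
    assert(open_count > 0);
    if (--open_count > 0) {
        return 0;
    }
    resources_ready = false;
    return 0;
}

bool demo_base_is_open(void)
{
    return resources_ready;
}
```

A plain boolean flag would have torn everything down on the first close, even while a second caller was still using the base. The counter also lets an open, close, open sequence reinitialize cleanly.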
Ralph Castain
0c043dbdc9 Fix typo in var name 2015-04-02 02:32:42 -07:00
Ralph Castain
a4b466efc4 Support attempts to connect async processes by allowing the oob/tcp connection to retry the attempt to connect to a peer. Off by default, operates if someone specifies how long to wait between retry attempts. 2015-04-01 20:21:23 -07:00
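The retry behavior described in this commit (off by default, active only once a wait between attempts is given) could look roughly like the sketch below. All names here are hypothetical, and the `max_retries` cap is an assumption added for the illustration; the commit message only mentions the wait interval.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical connection attempt: returns 0 on success, nonzero on failure. */
typedef int (*connect_fn)(void *ctx);

/* Try once; retry only if the caller supplied a positive delay.
 * retry_delay_ms <= 0 means "retries disabled", mirroring "off by default". */
int connect_with_retry(connect_fn try_connect, void *ctx,
                       long retry_delay_ms, int max_retries)
{
    int rc;
    int i;

    assert(try_connect != NULL);
    rc = try_connect(ctx);
    if (rc == 0 || retry_delay_ms <= 0) {
        return rc;
    }
    for (i = 0; i < max_retries && rc != 0; i++) {
        struct timespec ts;
        ts.tv_sec = retry_delay_ms / 1000;
        ts.tv_nsec = (retry_delay_ms % 1000) * 1000000L;
        nanosleep(&ts, NULL);   /* wait before the next attempt */
        rc = try_connect(ctx);
    }
    return rc;
}

/* Demo peer that refuses a fixed number of attempts before accepting,
 * like an asynchronously started process that is not yet listening. */
int flaky_peer(void *ctx)
{
    int *remaining_failures = (int *)ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return -1;
    }
    return 0;
}
```

Calling `connect_with_retry(flaky_peer, &n, 0, 5)` makes a single attempt, while a positive delay keeps trying until the peer accepts or the cap is reached.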
Ralph Castain
9f8ae59162 Properly enclose the different && clauses 2015-04-01 18:48:25 -07:00
Ralph Castain
57c21d5209 Ensure the DVM flows thru the "daemons reported" state 2015-04-01 16:47:34 -07:00
Jeff Squyres
99754afd25 orterun.c: re-justify the output message text
The type-A personality / English lit major in me compels me to
re-justify the text.  :-)
2015-04-01 10:57:23 -07:00
Mike Dubman
8914a9c070 Merge pull request #494 from elenash/modifiers
changed mindist mapping policy specifier
2015-04-01 16:31:46 +03:00
Elena
1e913c76c4 changed mindist mapping policy specifier from --map-by dist:device,modifiers to --map-by dist:modifiers -mca rmaps_dist_device device 2015-04-01 15:07:35 +03:00
Nadezhda Kogteva
2d49d9bd45 grpcomm rcd: remove unnecessary malloc warning for case when number of daemons == 1 2015-04-01 11:07:44 +03:00
Mike Dubman
58d002098b Merge pull request #474 from elenash/master
Introduce -tune command line option to set env vars and mca params from ...
2015-04-01 08:23:34 +03:00
Ralph Castain
b468f6a503 Okay, Jeff - use opal_setenv 2015-03-31 20:34:02 -07:00
Ralph Castain
6f9140a341 Add a little more debug to launch 2015-03-31 20:10:21 -07:00
Ralph Castain
e5d96417e7 Update warnings for run-as-root 2015-03-31 17:55:28 -07:00
Ralph Castain
41dd65d6cd Per Jeff's request, tone down the comments and "standardize" the warning 2015-03-31 17:54:54 -07:00
Ralph Castain
f04eb6a9c0 Extend the root-user protection to some more ORTE tools 2015-03-31 10:34:35 -07:00
Ralph Castain
f863147b05 Per the telecon and chat with Jeff, let root only do the version option without warning. Otherwise, require that the user specifically indicate allow-use-as-root 2015-03-31 10:34:35 -07:00
Ralph Castain
b209c9efa5 Move the "dvm ready" message to stdout so it is easier to trap 2015-03-30 20:12:56 -07:00
Ralph Castain
6d205a3c80 Ensure that singletons pick up the oob/tcp component 2015-03-30 18:10:08 -07:00
Ralph Castain
2fa56fb329 Ensure that orte-submit picks the correct ess module as it is -never- allowed to be used as a distributed tool
Thanks to Mark Santcroos for diagnosing this one.
2015-03-30 18:08:34 -07:00
rhc54
bc016617a0 Merge pull request #501 from rhc54/topic/sec2
Support authentication across security domains
2015-03-30 09:59:43 -07:00
Nadezhda Kogteva
a828eada98 sm dstore: set pmix segment size to proper value 2015-03-30 13:34:25 +03:00
Ralph Castain
d07dc362d5 Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common. 2015-03-28 20:34:26 -07:00
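The selection rule in this commit (the sender includes every credential it has, and the receiver uses the highest-priority one both sides share) can be sketched as a simple search. The function name and the method names used in the usage note are illustrative assumptions, not taken from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Walk the receiver's methods from highest to lowest priority and return
 * the first one that also appears among the credentials the sender offered.
 * Returns NULL when the two sides have no method in common. */
const char *pick_credential(const char *const receiver_by_priority[], size_t nrecv,
                            const char *const offered[], size_t noffered)
{
    size_t i, j;

    assert(receiver_by_priority != NULL && offered != NULL);
    for (i = 0; i < nrecv; i++) {
        for (j = 0; j < noffered; j++) {
            if (strcmp(receiver_by_priority[i], offered[j]) == 0) {
                return receiver_by_priority[i];
            }
        }
    }
    return NULL;
}
```

For example, a receiver preferring `{"munge", "basic"}` that is offered `{"basic", "munge"}` ends up with `"munge"`, since the receiver's ordering decides and the order of the offer does not matter.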
Ralph Castain
b67b3619fc If we are using the default bindings, and one or more nodes are not set up to support binding, then don't error out - just don't bind.
Thanks to Annu Desari for pointing out the problem.
2015-03-28 08:20:24 -07:00
Ralph Castain
2f365720b0 Allow root to request the version and help from mpirun without having to override the run-as-root protection.
Thanks to Robert McLay for pointing this out
2015-03-28 08:17:44 -07:00
Ralph Castain
d2d02a1642 ckpt 2015-03-28 07:59:20 -07:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Elena
90f5b2bb84 Introduce -tune command line option to set env vars and mca params from file 2015-03-26 18:33:53 +02:00
rhc54
2ff7575dde Merge pull request #497 from rhc54/topic/sec
Allow for different security domains.
2015-03-25 21:01:29 -07:00
Ralph Castain
6aa33deafb Remove debug 2015-03-25 19:58:51 -07:00
Ralph Castain
10cf455080 Tools need to use the TCP OOB component 2015-03-25 19:56:49 -07:00
Ralph Castain
1b24536941 Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail. 2015-03-25 13:22:01 -07:00
Ralph Castain
6ba76ed8d8 Per user request, we allow -host to specify a host that is not included in a hostfile (however, we reject it if we were given an allocation by a resource manager). Since we cannot know if an IP addr form references the same node that was previously given as a string name, we have no choice but to assume they are different. Get the topology from the right place in that situation so mpirun can succeed. 2015-03-25 06:16:01 -07:00
rhc54
df24816d64 Merge pull request #488 from lrrajesh/master
Notification msg add severity to the message header.
2015-03-20 09:45:46 -07:00
Ralph Castain
095a8fa684 We don't need to know about non-fatal errors from setting socket options 2015-03-20 07:16:31 -07:00
Ralph Castain
a013f3059f For scalability reasons, and to make life easier for the poor Cray-ites, don't bang on the system for the username - we'll just use the uid. 2015-03-19 21:24:13 -07:00
Howard Pritchard
990e9b47e0 Merge pull request #486 from hppritcha/topic/issue_484
orte/oob: implement alps oob component
2015-03-19 19:40:40 -06:00
Ralph Castain
43a3baad5e Ensure we use the first compute node's topology for mapping
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.

Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.

Correctly count the number of available PUs under each object when given a cpuset

Fix the default binding settings, and correctly count PUs when no cpuset is given

Ensure the binding policy gets set in all cases
2015-03-19 16:30:36 -07:00
Howard Pritchard
6054975913 oob/alps: add configure file for alps oob
Have to have alps rpms installed on a system
for alps component to build, even if separated
by a level of indirection.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-03-19 15:38:14 -07:00
Howard Pritchard
b1f31a4364 orte/oob: implement alps oob component
Implement an almost-do-nothing alps oob component.
When using aprun to launch a job on Cray system,
there is no reason to need an oob system, since ompi
relies on Cray PMI for oob communication.

Fixes #484
2015-03-19 14:11:40 -07:00
lrrajesh
4dc75687e2 Notification msg add severity to the output 2015-03-18 13:55:03 -07:00
Nadezhda Kogteva
7c25b4cea6 grpcomm: fixed brks and rcd algorithms - added enough space for masks so that they work at large scale. 2015-03-18 14:33:04 +02:00
Ralph Castain
50277fec76 Adjust MCA param 2015-03-17 19:46:31 -07:00
rhc54
b41d2ad6c4 Merge pull request #481 from rhc54/topic/slurm
Add new MCA parameter to support edge case with debugger at LLNL
2015-03-17 07:40:55 -07:00
Ralph Castain
b01e8c1063 Include the FQDN version and non-stripped version of the hostname in our list of aliases as these (plus localhost) are the most common aliases we see. 2015-03-17 06:26:26 -07:00