Ralph Castain
e6c72bfd53
Ensure we can forcibly exit even when we are stuck inside of an event by replacing the libevent signal handler with a POSIX one that (a) attempts to trip a libevent termination event and (b) if another ctrl-c hits within 5 seconds, just calls exit.
...
This commit was SVN r26943.
2012-08-02 21:15:35 +00:00
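The two-stage ctrl-c behavior described in r26943 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names `should_force_exit`, `on_sigint`, `install_sigint_handler`, and the wakeup pipe descriptor are all hypothetical, and the sketch calls `_exit` (which is async-signal-safe) where the commit message says "exit".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FORCE_EXIT_WINDOW_SECS 5   /* the 5-second window from the commit message */

static volatile sig_atomic_t sigint_seen = 0;
static time_t first_sigint_time = 0;   /* only touched inside the handler */
static int wakeup_fd = -1;             /* hypothetical pipe watched by the event loop */

/* Decide whether this interrupt is the "second ctrl-c within the window". */
int should_force_exit(int seen_before, time_t first, time_t now)
{
    return seen_before && (now - first) <= FORCE_EXIT_WINDOW_SECS;
}

/* POSIX handler: the first ctrl-c asks the event loop to terminate,
 * a second one inside the window bails out immediately. */
static void on_sigint(int signo)
{
    time_t now = time(NULL);
    (void)signo;

    if (should_force_exit(sigint_seen, first_sigint_time, now)) {
        _exit(1);
    }
    sigint_seen = 1;
    first_sigint_time = now;

    if (wakeup_fd >= 0) {
        /* Stand-in for "trip a termination event": poke the event loop. */
        char byte = 'x';
        ssize_t rc = write(wakeup_fd, &byte, 1);
        (void)rc;
    }
}

/* Install the handler; pass -1 as fd to skip the wakeup write. */
int install_sigint_handler(int fd)
{
    struct sigaction sa;

    assert(fd >= -1);
    wakeup_fd = fd;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, NULL);
}
```

Keeping the decision in a small pure function makes the timing rule easy to check without delivering real signals.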
Ralph Castain
d818c9d407
Includes a patch from Jeff and Josh: update the simulator module to allow specification of multiple slot and max_slot counts for each node group (but don't require it). Remove the requirement that each node group provide its own topology. Adjust verbosities to allow showing some light debug output to see what nodes have been added without getting a bunch of other stuff.
...
This commit was SVN r26936.
2012-08-02 04:57:13 +00:00
Jeff Squyres
62c2ff7ee7
It's actually ''not'' an error to exit if all routes and children are
gone. So exit with 0, not ORTE_ERROR_DEFAULT_EXIT_CODE (which is 1).
This fixes a race condition in the rsh launcher upon termination,
where ORTE would sometimes think that a daemon failed to launch.
This commit was SVN r26935.
2012-08-01 19:49:19 +00:00
Nathan Hjelm
4557e15c18
oob/ud fix compile error
...
This commit was SVN r26933.
2012-07-31 21:50:34 +00:00
Ralph Castain
6ee35e4977
Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well
...
This commit was SVN r26931.
2012-07-31 21:21:50 +00:00
Jeff Squyres
88cbe9c780
.ompi_ignore this component until it can be fixed.
...
This commit was SVN r26930.
2012-07-31 21:02:06 +00:00
Nathan Hjelm
980692804d
oob/ud: don't start listening for ud requests unless we have one usable port
...
This commit was SVN r26929.
2012-07-31 19:00:18 +00:00
Ralph Castain
23c2a315a9
Add missing line to set flag indicating at least one port found
...
This commit was SVN r26914.
2012-07-30 17:54:38 +00:00
Ralph Castain
6285f7d8c0
Per request of Shiqing, restore the ccp components
...
This commit was SVN r26904.
2012-07-29 23:49:59 +00:00
Ralph Castain
94d11e04fd
Add an intermediate state when the VM is ready so that third party tools can take action prior to mapping/launching apps
...
This commit was SVN r26902.
2012-07-28 15:33:09 +00:00
Ralph Castain
8bc6694a62
Ensure the daemons don't incorrectly declare a failed launch
...
This commit was SVN r26875.
2012-07-26 19:05:06 +00:00
Ralph Castain
07846f12ae
Reconnect the rsh/ssh error reporting code for remote spawns to report failure to launch. Ensure the HNP correctly reports non-zero exit status when ssh encounters a problem.
...
Thanks to Terry for spotting it!
This commit was SVN r26868.
2012-07-25 21:46:45 +00:00
Jeff Squyres
e5cfad0c1a
This variable is only used in FT builds.
...
This commit was SVN r26854.
2012-07-24 12:48:47 +00:00
Shiqing Fan
12d99a9ebb
Update the hwloc build on Windows and related files.
...
This commit was SVN r26818.
2012-07-20 12:14:28 +00:00
Abhishek Kulkarni
1ce378b5c6
Make C/R work with nodes > 1. This fix makes sure that the app coordinators send
the "ready-to-checkpoint" signal to the global coordinator only after ORTE has
initialized.
This commit was SVN r26795.
2012-07-13 23:37:29 +00:00
Abhishek Kulkarni
1878f276cd
Replace the pattern while(flag) { opal_progress() }; in the C/R code
with the ORTE_WAIT_FOR_COMPLETION macro.
This commit was SVN r26794.
2012-07-13 23:31:56 +00:00
George Bosilca
772ec212eb
Fix another compiler warning.
...
This commit was SVN r26775.
2012-07-10 15:57:42 +00:00
Abhishek Kulkarni
eec5a28aa4
More C/R fixes.
...
* Fix a typo introduced by the removal of the notifier framework
* Fix to flush the modex cached data correctly using the orte DB API.
This commit was SVN r26773.
2012-07-10 01:19:46 +00:00
Abhishek Kulkarni
5c58a1c9c1
Fix C/R support in the trunk.
...
Among other things, this patch deals with the following issues:
* fix ompi-checkpoint argument parsing
* ompi-restart -showme prints an extraneous "Restarted child with PID"
message. Move around the debug statement to avoid this.
* fixes for the state machine changes
This commit was SVN r26770.
2012-07-09 23:34:13 +00:00
George Bosilca
ec760454a6
Cleaning ...
...
This commit was SVN r26747.
2012-07-04 21:22:13 +00:00
Ralph Castain
6ae5776904
Cleanup IPV6 build
...
This commit was SVN r26738.
2012-07-04 00:03:50 +00:00
Ralph Castain
1a90471374
Drat - missed the other one
...
This commit was SVN r26718.
2012-07-02 22:18:31 +00:00
Ralph Castain
9a6a969f60
Remove debug
...
This commit was SVN r26717.
2012-07-02 22:18:08 +00:00
Ralph Castain
b83fc41d54
Add a state that allows mpirun or other tools to be notified of a job completion prior to terminating so that alternative actions can be performed.
...
This commit was SVN r26716.
2012-07-02 22:16:32 +00:00
Ralph Castain
8bebf2fa47
Ensure we don't build the MR iof components unless hadoop support is enabled
...
This commit was SVN r26694.
2012-06-28 18:20:15 +00:00
Ralph Castain
9aa821d8b4
Add missing file to tarball
...
This commit was SVN r26688.
2012-06-28 02:57:10 +00:00
Ralph Castain
0dfe29b1a6
Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required.
...
Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework.
Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one.
This commit was SVN r26678.
2012-06-27 14:53:55 +00:00
Brian Barrett
b22faedd9d
Remove the Portals4 SHMEM reference implementation runtime support, as we're
no longer using the runtime provided by the reference implementation.
Remove the Catamount support from ORTE, since we're no longer supporting
Catamount. Left the Catamount timer component, because I'm not sure whether
it's used on the XTs running CNL.
This commit was SVN r26677.
2012-06-27 14:17:43 +00:00
Josh Hursey
28681deffa
Back out the ORCA commit. :(
...
There is a linking issue on Mac OSX that needs to be addressed before this is able to come back into the trunk.
This commit was SVN r26676.
2012-06-27 01:28:28 +00:00
Josh Hursey
542330e3a7
Commit of ORCA: Open MPI Runtime Collaborative Abstraction
...
This is a runtime interposition project that sits between the OMPI and ORTE layers in Open MPI.
The project is described on the wiki:
https://svn.open-mpi.org/trac/ompi/wiki/Runtime_Interposition
And on this email thread:
http://www.open-mpi.org/community/lists/devel/2012/06/11109.php
This commit was SVN r26670.
2012-06-26 21:42:16 +00:00
Ralph Castain
a34f09e67a
Ensure common port is off when not being used
...
This commit was SVN r26666.
2012-06-26 16:09:58 +00:00
Ralph Castain
92527da4e3
Remove unused component
...
This commit was SVN r26660.
2012-06-26 00:49:28 +00:00
Ralph Castain
0103f82918
Turn off the common port for slurm for now
...
This commit was SVN r26656.
2012-06-25 21:55:51 +00:00
Shiqing Fan
6f746cdb33
Remove an unused file.
...
This commit was SVN r26645.
2012-06-25 10:17:21 +00:00
Jeff Squyres
148ae6d6e3
This commit unifies the configury of some verbs-lovin' components.
...
* Add new configure command line options and deprecate some old ones:
* --with-verbs replaces --with-openib
* --with-verbs-libdir replaces --with-openib-libdir
* If you specify --with-openib[-libdir] without
--with-verbs[-libdir], you'll get a "these options have been
deprecated!" warning, but then they'll act just like
--with-verbs[-libdir].
'''Sidenote:''' Note that we are not renaming any components at this
time, nor are we renaming the top-level OMPI_CHECK_OPENIB m4 macro
(which is pretty strongly tied to the openib BTL and is bastardized
by the ofud BTL). Note that there will likely be more changes in
this area coming soon (next week?) when some long-standing changes
move to the SVN trunk: some openib BTL infrastructure will move to
ompi/mca/common, and its configury gets split up / refactored.
We extend the philosophy of our other --with-<foo> configure options
to --with-verbs, applying it to ''all'' verbs-lovin' components:
* If you specify --with-verbs, then all verbs-lovin' components must
configure successfully (or abort). This currently means: OOB ud,
BTL ofud, BTL openib.
* If you specify --with-verbs=DIR, then all verbs-lovin' components
must configure successfully (or abort), and will use DIR to find
verbs headers and libraries.
* If you specify --without-verbs, then all verbs-lovin' components
will be ignored.
This commit also fixes a problem where the --with-openib=DIR form
would not use DIR for ''all'' verbs-lovin' components (I think only
BTL openib and BTL ofud used that DIR). Now all of them do, as does
hwloc (because hwloc has some !OpenFabrics helper functions that
require ibv types from verbs.h).
There's a little new m4 infrastructure worth mentioning:
* If you create a new verbs-lovin' component (i.e., a component that
needs verbs), your configure.m4 should
AC_REQUIRE([OPAL_CHECK_VERBS_DIR]).
* You can then use three global shell variables: $opal_want_verbs,
$opal_verbs_dir, $opal_verbs_libdir, which will be set as follows:
* opal_want_verbs will be "yes" and opal_verbs_dir and
opal_verbs_libdir will both be set to directory values, '''OR'''
* opal_want_verbs will be "no" and opal_verbs_dir and
opal_verbs_libdir will both be set empty
This commit was SVN r26640.
2012-06-22 19:53:56 +00:00
Ralph Castain
e6f3586415
Remove the orte notifier framework, per discussion at the devel meeting and follow-up with Jeff (who took the action item)
...
This commit was SVN r26637.
2012-06-22 18:09:23 +00:00
Ralph Castain
60758faa55
Fix data type
...
This commit was SVN r26633.
2012-06-21 23:48:55 +00:00
Ralph Castain
e9591f2563
Fix tree spawn in the rsh/qrsh environment
...
This commit was SVN r26631.
2012-06-21 21:29:28 +00:00
Ralph Castain
0a713cd27e
Add database framework to ORTE and refactor modex code to utilize it. Create the "hash" db component from the prior modex db code. Leave the other components ignored for now - will activate them later.
...
Modex is still a blocking operation at this point.
This commit was SVN r26618.
2012-06-19 13:38:42 +00:00
Ralph Castain
9e0bb6ae28
Revert r26600 and r26601 for a couple of reasons:
...
1. they modified the OMPI-ORTE interface, which is something I promised to avoid doing unless absolutely necessary, and
2. the framework ident is already in the component name key provided to the modex db. What is missing is the project ident, but as Jeff and I discussed last week, we really need to add that field to the component struct anyway to avoid multi-project collisions on framework names. That will be done over the next couple of weeks as a separate effort.
This commit was SVN r26613.
The following SVN revision numbers were found above:
r26600 --> open-mpi/ompi@5ba4deff07
r26601 --> open-mpi/ompi@0e3094c318
2012-06-16 09:11:03 +00:00
Ralph Castain
3c2a03b16d
Update the other routed components to use common ports. Per conversation with Josh, remove the "cm" component.
...
This commit was SVN r26608.
2012-06-15 15:36:08 +00:00
Ralph Castain
96c778656a
Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP.
...
Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build.
This commit was SVN r26606.
2012-06-15 10:15:07 +00:00
Ralph Castain
0e3094c318
Update the other grpcomm modules to new API
...
This commit was SVN r26601.
2012-06-14 03:28:48 +00:00
Ralph Castain
5ba4deff07
Extend the modex database to support multiple projects and frameworks that might have duplicate component names. No visible API change in the BTL's as it was executed solely in the ompi modex code.
...
This commit was SVN r26600.
2012-06-14 02:55:06 +00:00
Ralph Castain
ecc51d8583
Add missing endif
...
This commit was SVN r26596.
2012-06-12 15:07:09 +00:00
Ralph Castain
078a4667e4
Some more cleanup on direct routed when daemons are involved
...
This commit was SVN r26594.
2012-06-11 23:46:22 +00:00
Ralph Castain
269cb2b8d9
Some cleanup to remove calls to opal_progress when running with orte progress threads, and to ensure that all orte-related events are in the orte event base.
...
This commit was SVN r26591.
2012-06-11 19:59:53 +00:00
Ralph Castain
75e66ad51e
Restore the direct routed component
...
This commit was SVN r26590.
2012-06-11 17:16:02 +00:00
Brian Barrett
7406ef1241
Make all the PMI components depend on the common pmi library and properly
install the common pmi library
This commit was SVN r26588.
2012-06-11 15:58:09 +00:00
Ralph Castain
2812579246
Just because we find an IB device does not mean we can get a QP on it. Check to see if we can before we select the UD OOB module for use.
...
This commit was SVN r26587.
2012-06-10 01:42:51 +00:00
Ralph Castain
0442a807c0
Default the OOB to the "ud" component IFF the HNP finds itself on a node with a supported InfiniBand device. Ensure that the daemons all pick the matching component by dictating the selection via mca param on the orted cmd line.
...
This commit was SVN r26582.
2012-06-08 01:23:08 +00:00
Ralph Castain
05122a2f93
Make debruijn the default routed component. Update the radix component to "short-circuit" the tree when the job size permits
...
This commit was SVN r26580.
2012-06-08 00:35:36 +00:00
Ralph Castain
ffcca0185a
Remove no longer needed component
...
This commit was SVN r26578.
2012-06-08 00:18:59 +00:00
Ralph Castain
980768965f
Remove unused and unsupported component
...
This commit was SVN r26577.
2012-06-07 23:48:06 +00:00
Ralph Castain
350900f70e
Remove unused and unsupported component
...
This commit was SVN r26576.
2012-06-07 23:47:35 +00:00
Nathan Hjelm
625c8078c3
oob/ud: fix typo
...
This commit was SVN r26569.
2012-06-07 19:21:23 +00:00
Ralph Castain
7a94a52420
No reason not to build this
...
This commit was SVN r26568.
2012-06-07 19:11:44 +00:00
Shiqing Fan
2abf783fa0
Remove an unnecessary definition before the real one.
...
This commit was SVN r26562.
2012-06-06 14:15:39 +00:00
Ralph Castain
166d254d4e
Add new routed component
...
This commit was SVN r26557.
2012-06-06 11:53:12 +00:00
Ralph Castain
d6279fc971
Fix the debugger daemon launch support to fit the new state machine. Treat debugger daemons just like any other job, except that we map them only to nodes where an app process currently exists (as opposed to every node in the system). Trigger breakpoint and rank0 release only after the debugger daemons are in position.
...
This commit was SVN r26556.
2012-06-06 02:01:23 +00:00
Jeff Squyres
0b8849e2c4
Make "mpirun --report-bindings" have a user-friendly output (i.e.,
readable by normal human beings, vs. having a bitmap of physical
PU's). Use the new hwloc base prettyprint functions to generate the
output.
This commit was SVN r26533.
2012-06-01 16:35:31 +00:00
Jeff Squyres
99c5afb397
Remove clang compiler warnings.
...
This commit was SVN r26523.
2012-05-29 23:36:06 +00:00
Ralph Castain
b0938a254e
Don't use mutex where it isn't needed
...
This commit was SVN r26521.
2012-05-29 20:21:11 +00:00
Ralph Castain
32b66c166b
Missed one blasted spot
...
This commit was SVN r26520.
2012-05-29 20:20:10 +00:00
Ralph Castain
9bedb25dda
Cleanup some compiler warnings, some of which are actual logic errors
...
This commit was SVN r26519.
2012-05-29 20:11:51 +00:00
Ralph Castain
d7ac424d8d
Silence optimized build warnings
...
This commit was SVN r26518.
2012-05-29 19:55:47 +00:00
Ralph Castain
bf5ec1ac0c
Silence optimized build warnings
...
This commit was SVN r26517.
2012-05-29 19:55:31 +00:00
Ralph Castain
be6ed9c2df
Allow partial use of allocations by specifying the max number of daemons (i.e., max VM size) for the job
...
This commit was SVN r26499.
2012-05-27 16:48:19 +00:00
Ralph Castain
7fb49b1559
Silence warning
...
This commit was SVN r26480.
2012-05-23 13:59:41 +00:00
Ralph Castain
da28a4b0e6
Silence warning
...
This commit was SVN r26479.
2012-05-23 13:59:22 +00:00
Nathan Hjelm
b9959a95cd
ack! one more
...
This commit was SVN r26472.
2012-05-22 20:52:52 +00:00
Nathan Hjelm
f2d4e95429
doh! add missing include
...
This commit was SVN r26471.
2012-05-22 20:49:13 +00:00
Nathan Hjelm
cdc3c87ba6
move pmi init/finalize into a common component
...
This commit was SVN r26470.
2012-05-22 15:15:39 +00:00
Nathan Hjelm
78b8b3cf76
bug fix: actually close ess components
...
This commit was SVN r26469.
2012-05-22 15:09:18 +00:00
Nathan Hjelm
6eeca66475
add an option to enable static ports. disabled by default
...
This commit was SVN r26462.
2012-05-21 19:56:15 +00:00
Ralph Castain
83d69b6c95
Enable the ORTE progress thread for apps (not needed in the tools as they already continuously loop in the event lib). This appears to be working, at least for MPI apps that only use shared memory (a simple "hello"). More testing is required to identify where problems will occur - this is only intended to allow further development.
...
In order to use the progress thread, you must configure with:
--enable-orte-progress-threads --enable-event-thread-support
This commit was SVN r26457.
2012-05-20 15:14:43 +00:00
Ralph Castain
c4f8043064
Per Nathan, with a little cleanup by me: update the PMI support to aggregate modex info, thus reducing the number of keys required so it fits within Cray default constraints
...
This commit was SVN r26456.
2012-05-19 16:12:52 +00:00
Jeff Squyres
cab31eafce
Revert r26413: it was causing too much confusion. When an MPI proc
exits with status 77, the whole job will be killed, but mpirun will
still return an exit status of 77, so MTT will report it as a skip
anyway.
This commit was SVN r26445.
The following SVN revision numbers were found above:
r26413 --> open-mpi/ompi@02aa36f2e5
2012-05-16 14:45:58 +00:00
Jeff Squyres
02aa36f2e5
ORTE defaults to killing the entire job when any process exits with a
nonzero status (we polled other MPI implementations since no one in
the OMPI community had a concrete opinion on what behavior to do here
-- all other MPI's seem to adhere to this behavior, too).
This commit adds an MCA parameter that allows us to tell ORTE to
''not'' kill jobs when a process exits with a status of 77, meaning
the GNU testing standard of "this test was skipped". In all the OMPI
tests, all procs will either return 77 or not. So if they all return
77, mpirun won't consider it an error, but will still return an exit
status of 77 (so that MTT can know that the test was cleanly skipped).
This commit was SVN r26413.
2012-05-08 21:49:05 +00:00
Ralph Castain
70a106fa71
Fix binding on remote nodes - need to pass the binding bitmap!
...
This commit was SVN r26403.
2012-05-08 03:52:39 +00:00
Jeff Squyres
2ba10c37fe
Per RFC, bring in the following changes:
...
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
2012-05-07 14:52:54 +00:00
Ralph Castain
44b8608f0a
Convert debug to verbose
...
This commit was SVN r26384.
2012-05-05 17:46:10 +00:00
Ralph Castain
96bfeb591c
Ensure flag is passed to remote daemons
...
This commit was SVN r26383.
2012-05-03 22:31:25 +00:00
Ralph Castain
45fee2b491
Resolve the case where only the HNP is in the system (i.e., single-node operation)
...
This commit was SVN r26382.
2012-05-03 18:00:01 +00:00
Ralph Castain
c352ca36c2
Minor cleanup
...
This commit was SVN r26381.
2012-05-02 21:23:37 +00:00
Ralph Castain
b2f77bf08f
Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications.
...
Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well.
This commit was SVN r26380.
2012-05-02 21:00:22 +00:00
Ralph Castain
c5da4f24d7
Fix stupid singletons - get the pidmap message correct
...
This commit was SVN r26378.
2012-05-02 17:48:02 +00:00
Ralph Castain
9f724db182
Remove duplicate event assignment
...
This commit was SVN r26360.
2012-04-30 16:06:20 +00:00
Ralph Castain
289f9f41ec
From long-term discussions, have the daemons use the node_t and proc_t structs and arrays instead of the pidmap and nidmap arrays. Sets the stage for future work.
...
This commit was SVN r26359.
2012-04-29 00:10:01 +00:00
Ralph Castain
f3e3704c9e
Per request from Brian, enable mapping of stddiag output (output from opal_output calls) to stderr of the local process. This allows you to obtain that output in a local window (for example, when using xterm for each process) instead of having it automatically forwarded to mpirun. This is turned on automatically whenever someone uses the -xterm option, and can be set manually using the orte_map_stddiag_to_stderr mca param.
...
This commit was SVN r26352.
2012-04-27 14:39:34 +00:00
Jeff Squyres
46f47e08b6
Remove typo/extra brackets and parens.
...
This commit was SVN r26351.
2012-04-27 13:48:43 +00:00
Jeff Squyres
9d0df5a9a6
Update configury in the new oob ud component: actually check to see if
it succeeds and run $1 or $2, accordingly. This allows "make dist" to
run properly on machines that do not have OpenFabrics stuff installed
(e.g., the nightly tarball build machine).
There's still more to be done here -- it doesn't check for non-uniform
directories where the OpenFabrics headers/libraries might be
installed. We might need to re-tool/combine
ompi/config/ompi_check_openib.m4 (which checks for way more than
oob/ud needs) and move it up to config/ompi_check_ofa.m4, or
something...?
This commit was SVN r26350.
2012-04-27 11:32:56 +00:00
Jeff Squyres
9829d2279f
System-level includes should be at the top of the file, before most
OPAL/ORTE/OMPI includes.
This commit was SVN r26349.
2012-04-27 11:29:22 +00:00
Ralph Castain
38af7db183
Ensure the progress message comes out right away. Otherwise, on a large system where proc state messages are arriving frequently, the message doesn't get printed until the launch is done!
...
This commit was SVN r26346.
2012-04-26 23:41:03 +00:00
Nathan Hjelm
e1e0d466e5
Merge ssh://ct-fe1/usr/projects/hpctools/hjelmn/ompi-trunk-git into HEAD
...
This commit was SVN r26344.
2012-04-26 22:06:12 +00:00
Ralph Castain
3461809341
Fix reporting of launch progress so the numbers are correct and appear when they should
...
This commit was SVN r26342.
2012-04-26 00:10:09 +00:00
Ralph Castain
71805bf7e4
Clear out the startup_timeout event if the job did in fact start. Have ORTE_TERMINATE use the job state macro so debug will show where it was called
...
This commit was SVN r26334.
2012-04-25 01:05:17 +00:00
Jeff Squyres
708b497968
Ensure that we unset the iof "active" flag after the libevent read callback
fires (it's already reset once we queue up the read event again). Failure
to unset the active flag would cause other logic to not queue up the
read event again, because it thought the read event was still active.
This commit was SVN r26311.
2012-04-23 15:58:12 +00:00
Ralph Castain
7999266f99
Silence warning by removing unused var
...
This commit was SVN r26275.
2012-04-17 22:34:48 +00:00
Ralph Castain
f68487016c
Add test code from Terry. Properly terminate if we don't abort on non-zero exit
...
This commit was SVN r26271.
2012-04-16 16:44:23 +00:00
Ralph Castain
ddfbde587f
Change the default to "abort" the job when any process exits with a non-zero status. Add the required code to ensure the orted tells the HNP about the problem.
...
This commit was SVN r26270.
2012-04-13 21:19:46 +00:00
Ralph Castain
7741ba47be
Fix comm_spawn that spans multiple nodes
...
This commit was SVN r26268.
2012-04-13 01:59:07 +00:00
Ralph Castain
4d16790836
Fix collectives for jobs running across partial allocations
...
This commit was SVN r26267.
2012-04-13 00:38:47 +00:00
Ralph Castain
5d14fa7546
Fix mpi_abort, minimize error output.
...
This commit was SVN r26266.
2012-04-11 14:37:08 +00:00
Ralph Castain
d3dfba3872
Fix the scenario where an MPI error handler causes a proc to exit after finalize, but with non-zero status to indicate an error occurred.
...
This commit was SVN r26265.
2012-04-11 02:23:46 +00:00
Ralph Castain
9cd4c06488
Get things to build and run when --disable-orte is specified
...
This commit was SVN r26263.
2012-04-10 21:50:01 +00:00
Ralph Castain
14d5525fb1
Some minor cleanups. Get singletons working. Cleanup abort handling so it gets properly identified.
...
This commit was SVN r26261.
2012-04-10 19:08:54 +00:00
Ralph Castain
53bbcf4b5b
Plug slot allocation leak
...
This commit was SVN r26260.
2012-04-10 14:56:24 +00:00
Ralph Castain
f5cd996b91
Fix the case where n=1
...
This commit was SVN r26258.
2012-04-09 22:44:56 +00:00
Ralph Castain
a34be856aa
Now that we have PMI support, this is no longer needed
...
This commit was SVN r26254.
2012-04-07 13:36:24 +00:00
Ralph Castain
71f9e69c62
Remove stale code
...
This commit was SVN r26253.
2012-04-07 13:34:12 +00:00
Ralph Castain
19630ca28d
Remove stale code
...
This commit was SVN r26252.
2012-04-07 13:33:40 +00:00
Ralph Castain
93bbeabc55
Remove stale code
...
This commit was SVN r26251.
2012-04-07 13:33:30 +00:00
Ralph Castain
b6cde9a8d1
Remove stale code
...
This commit was SVN r26250.
2012-04-07 13:33:18 +00:00
George Bosilca
319f76d66a
Low hanging fruit. Remove a declared but not defined function.
...
This commit was SVN r26245.
2012-04-06 15:43:28 +00:00
Ralph Castain
ed197acaa2
Eliminate stale code
...
This commit was SVN r26244.
2012-04-06 15:31:13 +00:00
Ralph Castain
bd8b4f7f1e
Sorry for mid-day commit, but I had promised on the call to do this upon my return.
...
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
ca3ff58c76
Ensure we get a non-zero exit status when we can't find the specified fork agent. Output a better error message, and ensure we don't multiply report the problem.
...
This commit was SVN r26191.
2012-03-24 00:49:38 +00:00
Ralph Castain
46b040c79f
Fix typo
...
This commit was SVN r26189.
2012-03-24 00:31:05 +00:00
Ralph Castain
2bd75ec7e3
Fix Cray XE builds - the priority here needs to equal that of the HNP component so that both build. Otherwise, mpirun tries to use PMI for its basis, and that doesn't work!
...
This commit was SVN r26188.
2012-03-23 20:06:34 +00:00
Ralph Castain
811413e9bc
Correctly handle multiple cpu-set ranges. Correctly support optional binding directives combined with cpu-set.
...
This commit was SVN r26187.
2012-03-23 14:50:41 +00:00
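As an illustration of what "multiple cpu-set ranges" involves, the sketch below parses a comma-separated list of CPU ids and ranges into a bitmask. The syntax ("0-3,8-11"), the 64-CPU limit, and the name `parse_cpu_set` are assumptions made for this example; ORTE itself works with hwloc bitmaps and its parser is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CPUS 64   /* illustrative limit so the set fits in one uint64_t */

/* Parse a spec such as "0-3,8,10-11" into a bitmask.
 * Returns 0 on success, -1 on malformed input. */
int parse_cpu_set(const char *spec, uint64_t *mask_out)
{
    uint64_t mask = 0;
    const char *p = spec;

    assert(spec != NULL && mask_out != NULL);
    if (*p == '\0') {
        return -1;
    }
    for (;;) {
        char *end;
        long lo, hi, cpu;

        lo = strtol(p, &end, 10);
        if (end == p || lo < 0 || lo >= MAX_CPUS) {
            return -1;
        }
        hi = lo;
        p = end;
        if (*p == '-') {            /* a range "lo-hi" */
            p++;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo || hi >= MAX_CPUS) {
                return -1;
            }
            p = end;
        }
        for (cpu = lo; cpu <= hi; cpu++) {
            mask |= (uint64_t)1 << cpu;
        }
        if (*p == '\0') {
            break;                  /* consumed the whole spec */
        }
        if (*p != ',') {
            return -1;              /* unexpected separator */
        }
        p++;
    }
    *mask_out = mask;
    return 0;
}
```

With this sketch, "0-3,8-11" yields the mask 0xF0F, while inputs such as "3-1" or "0,,1" are rejected.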
Ralph Castain
ce0caf7567
Support -cpu-set by binding to the specified cpus in the absence of any other binding directive. Allows users to subdivide nodes for multiple parallel mpirun invocations.
...
This commit was SVN r26186.
2012-03-23 14:05:52 +00:00
Ralph Castain
33ed3cda07
Update the gridengine allocator to support data from multiple queues by checking for duplicate node entries
...
This commit was SVN r26148.
2012-03-15 17:45:50 +00:00
Josh Hursey
4dd9f89a99
Create an MCA parameter (ess_base_stream_buffering) that allows the user to override the system default for buffering of stdout/stderr streams. See 'man setvbuf' for more information.
...
Note: I am working on a system that buffered all output until the application finished due to a default of 'fully buffered.' This makes debugging painful. This switch fixed the problem by allowing me to adjust the buffering.
This commit was SVN r26119.
2012-03-08 22:02:28 +00:00
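For readers unfamiliar with `setvbuf`, the override described in r26119 amounts to something like the sketch below. The integer-to-mode mapping (-1 leave the system default, 0 unbuffered, 1 line buffered, 2 fully buffered) and the function name `apply_stream_buffering` are assumptions for illustration; check the parameter's help text for the values it actually accepts.

```c
#include <assert.h>
#include <stdio.h>

/* Apply a buffering mode to a stream. Must be called before any other
 * operation on the stream. Returns 0 on success, non-zero on failure. */
int apply_stream_buffering(FILE *stream, int mode)
{
    assert(stream != NULL);
    switch (mode) {
    case -1:
        return 0;                                        /* keep the system default */
    case 0:
        return setvbuf(stream, NULL, _IONBF, 0);         /* unbuffered */
    case 1:
        return setvbuf(stream, NULL, _IOLBF, BUFSIZ);    /* line buffered */
    case 2:
        return setvbuf(stream, NULL, _IOFBF, BUFSIZ);    /* fully buffered */
    default:
        return -1;                                       /* unknown mode */
    }
}
```

On a system whose default is fully buffered, applying mode 0 or 1 to stdout and stderr early in startup is what makes output appear while the application is still running.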
Ralph Castain
e71e871bae
Initialize sink location when stdin is forwarded to all ranks
...
This commit was SVN r26107.
2012-03-06 15:47:04 +00:00
Ralph Castain
366f9d1518
Add some missing localities to the hwloc pretty-print, fix pmi modex
...
This commit was SVN r26105.
2012-03-06 06:21:10 +00:00
Ralph Castain
834a86420b
Ensure we use the slurm module for slurm environments, and correct init order in pmi module when used by daemons
...
This commit was SVN r26089.
2012-03-02 23:10:48 +00:00
Ralph Castain
ceb34ed0c9
Fix typo
...
This commit was SVN r26079.
2012-03-02 09:58:09 +00:00
Ralph Castain
b2f1bade37
Fix the -H localhost issue
...
This commit was SVN r26071.
2012-02-29 16:56:00 +00:00
Jeff Squyres
81dc6a11ee
Fix typo in copyright notice, found by Paul Hargrove
...
This commit was SVN r26070.
2012-02-29 02:02:54 +00:00
Ralph Castain
a83da303c5
When using PMI, we know the ranks that share our node and their relative local/node ranks. Save that info in the pidmap array so that BTLs that require early knowledge of local ranks can access it.
...
This commit was SVN r25992.
2012-02-21 16:43:17 +00:00
Jeff Squyres
b6a90434e4
Fix some include file header ordering issues for some BSDs, suggested
by Paul Hargrove.
This commit was SVN r25984.
2012-02-21 13:32:14 +00:00
Jeff Squyres
b295a01d8e
Fix another configury error found by Paul Hargrove. Thanks, Paul!
...
This commit was SVN r25971.
2012-02-20 21:38:27 +00:00
Jeff Squyres
cdc783925e
(Re-)Add oob_tcp_if_(in|ex)clude functionality to allow CIDR notation,
just like the btl_tcp_if_(in|ex)clude MCA param.
This commit was SVN r25953.
2012-02-17 15:38:42 +00:00
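CIDR notation in an include/exclude list means an interface is selected when its address falls inside a network such as 10.1.0.0/16. A minimal IPv4-only sketch of that test follows; the function name `ipv4_matches_cidr` is hypothetical and this is not the code used by the OOB TCP component.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if addr lies inside cidr ("a.b.c.d/len"), 0 if it does not,
 * and -1 if either argument is malformed. */
int ipv4_matches_cidr(const char *addr, const char *cidr)
{
    char net[INET_ADDRSTRLEN];
    const char *slash;
    struct in_addr a, n;
    char *end;
    long bits;
    size_t len;
    uint32_t mask;

    assert(addr != NULL && cidr != NULL);
    slash = strchr(cidr, '/');
    if (slash == NULL) {
        return -1;
    }
    len = (size_t)(slash - cidr);
    if (len == 0 || len >= sizeof(net)) {
        return -1;
    }
    memcpy(net, cidr, len);
    net[len] = '\0';

    bits = strtol(slash + 1, &end, 10);
    if (end == slash + 1 || *end != '\0' || bits < 0 || bits > 32) {
        return -1;
    }
    if (inet_pton(AF_INET, addr, &a) != 1 || inet_pton(AF_INET, net, &n) != 1) {
        return -1;
    }

    /* Build the netmask in network byte order; /0 matches everything. */
    mask = (bits == 0) ? 0 : htonl((uint32_t)0xFFFFFFFFu << (32 - (int)bits));
    return (a.s_addr & mask) == (n.s_addr & mask);
}
```

For example, 10.1.2.3 matches 10.1.0.0/16 while 10.2.2.3 does not.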
Jeff Squyres
3e22450345
Fix the oob_tcp_verbose MCA param; make it actually apply to the OOB
TCP verbose handle (not the generic/0 handle).
This commit was SVN r25942.
2012-02-16 22:28:11 +00:00
Ralph Castain
b3aabf1565
Cleanup the --without-hwloc build. Thanks to Paul Hargrove for reporting it broken.
...
This commit was SVN r25931.
2012-02-15 11:08:57 +00:00
Ralph Castain
91977444af
Silence warnings
...
This commit was SVN r25929.
2012-02-15 03:42:27 +00:00
Ralph Castain
bba6508b4b
Handle the default hostfile case a little better...
...
This commit was SVN r25928.
2012-02-15 03:33:49 +00:00
Ralph Castain
f14c4be580
Correct the ordering logic so the list gets correctly built in daemon vpid order
...
This commit was SVN r25818.
2012-01-30 16:25:07 +00:00
Shiqing Fan
bfbd3c67a5
Add a windows file into the tarball.
...
This commit was SVN r25811.
2012-01-29 10:12:02 +00:00
Ralph Castain
a0edae52f2
Ensure the wrapper flags get entered in the right order, with -lpmi coming before the alps util libs
...
This commit was SVN r25809.
2012-01-27 20:56:21 +00:00
Ralph Castain
3f31feee6f
Handle the case where a user's rankfile specifies only cpus, and not socket:cpu pairs.
...
This commit was SVN r25803.
2012-01-27 12:21:45 +00:00
Ralph Castain
07f3a91075
Okay, get srun to play nice. Problem was that everything worked fine so long as the user did "salloc" with an argument requesting a specific number of nodes. However, if the user specified instead a number of processes, then we launched that number of daemons - resulting in multiple daemons/node. Not good.
...
So force things to behave correctly either way.
This commit was SVN r25792.
2012-01-26 19:58:57 +00:00
Ralph Castain
ef94e606c7
Add some debug
...
This commit was SVN r25791.
2012-01-26 19:23:32 +00:00
Ralph Castain
1449b27e9f
Ensure that slurm only launches one orted/node, regardless of how the allocation was obtained.
...
This commit was SVN r25790.
2012-01-26 19:23:15 +00:00
Jeff Squyres
64165ce758
r25775 removed the .windows from this directory, but left it in the
...
Makefile.am.
This commit was SVN r25782.
The following SVN revision numbers were found above:
r25775 --> open-mpi/ompi@2c9a4beffd
2012-01-26 10:45:06 +00:00
Jeff Squyres
3751495443
Add missing arguments for the new DYLD_LIBRARY_PATH stuff.
...
This commit was SVN r25780.
2012-01-26 00:35:48 +00:00
Ralph Castain
079e4d9156
Per George's comment, just duplicate the lib path envars to provide both Linux and Mac compatible values
...
This commit was SVN r25776.
2012-01-25 14:37:36 +00:00
Shiqing Fan
2c9a4beffd
Add and remove a few components for windows build.
...
This commit was SVN r25775.
2012-01-25 09:01:27 +00:00
Ralph Castain
8b115754e6
Fix typo
...
This commit was SVN r25763.
2012-01-21 23:50:39 +00:00
Ralph Castain
469e40ace2
Expand the coverage a little when looking at remote shells for rsh. Prior patch (r25758) works only if both ends of the rsh/ssh connection are Mac. What we really want is to use the Mac version of ld_library_path when the remote end is Mac, regardless of the OS where mpirun is executing. So add a test for system type to the remote_shell test, and set the ld_library_path name to match the remote system type.
...
This commit was SVN r25762.
The following SVN revision numbers were found above:
r25758 --> open-mpi/ompi@1afb77e603
2012-01-21 23:48:42 +00:00
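The selection described in r25762 can be sketched in Python as follows. This is only an illustration: the function name `lib_path_var` is hypothetical, and it assumes the remote system type is available as the string reported by `uname -s` (which is "Darwin" on a Mac). How the rsh launcher actually probes the remote end is not shown here.

```python
def lib_path_var(remote_system):
    """Pick the library path variable name for the REMOTE system type.

    remote_system is assumed to be the output of `uname -s` on the remote
    end; the system running mpirun plays no part in the choice.
    """
    if remote_system.strip() == "Darwin":
        return "DYLD_LIBRARY_PATH"
    return "LD_LIBRARY_PATH"


def remote_env_assignment(remote_system, lib_dir):
    """Build a 'NAME=value' string to pass to the remote shell."""
    return lib_path_var(remote_system) + "=" + lib_dir
```

The point of keying on the remote type is that a Linux mpirun launching onto a Mac still produces the Mac variable name.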
Ralph Castain
1afb77e603
Mac requires setting DYLD_LIBRARY_PATH instead of the Linux standard LD_LIBRARY_PATH, so ensure we set that when using rsh to launch in Mac environments.
...
Thanks to Teng Lin for the patch!
This commit was SVN r25758.
2012-01-20 19:14:32 +00:00
Ralph Castain
be3dfb6a1a
Ensure that we only add -lpmi once to the wrapper compilers, no matter how many components might use it.
...
This commit was SVN r25753.
2012-01-20 04:56:38 +00:00
Ralph Castain
d7fe1615b6
Add missing dollar sign on variable
...
This commit was SVN r25745.
2012-01-19 20:45:22 +00:00
Ralph Castain
0d20f745e2
Remove stale function def
...
This commit was SVN r25744.
2012-01-19 20:40:48 +00:00
Nathan Hjelm
6d0e7a0a0e
don't enable ess/alps unless cnos is available
...
This commit was SVN r25743.
2012-01-19 19:36:00 +00:00
Ralph Castain
9d556e2f17
Allow daemons to use PMI to get their name where PMI support is available while using the standard grpcomm and other capabilities. Remove the GNI code from the alps ess component as that component should only be for alps/cnos installations.
...
This commit was SVN r25737.
2012-01-18 20:56:53 +00:00
Ralph Castain
6235a355de
Correctly handle co-spawning of daemons when attaching to a running job. We cannot use the general process mappers as we only want debugger daemons spawned on nodes where application procs already exist. So custom build the map for the debugger daemon job, and have the plm just launch that job without doing its usual vm-spawn step.
...
This commit was SVN r25736.
2012-01-18 00:19:49 +00:00
Nathan Hjelm
a2437feba7
removed debug message
...
This commit was SVN r25722.
2012-01-12 20:23:59 +00:00
Nathan Hjelm
5ab1674138
fixed de bruijn copyrights
...
This commit was SVN r25720.
2012-01-12 17:18:08 +00:00
Nathan Hjelm
c57f18999d
added Debruijn routed component
...
This commit was SVN r25717.
2012-01-12 17:11:03 +00:00
Ralph Castain
477582abef
Grrrr....fix ALL the cases where the membind warning occurs.
...
This commit was SVN r25715.
2012-01-11 23:51:18 +00:00
Ralph Castain
bf103de66c
My apologies for doing this outside of the usual time restrictions, but we need to get this in so we can make progress.
...
Move the ORTE-level debugger code back into orterun and out of the ORTE library to resolve symbol conflicts.
This commit was SVN r25713.
2012-01-11 15:53:09 +00:00
Ralph Castain
167ad944c4
Surprise, surprise - hwloc treats memory binding as at the thread, not process, level. Thus, hwloc always sets the membind proc-level support flag to false, and indicates actual memory binding support via the thread-level flag. So...just to be safe, test -both- flags and issue the "no support" warning ONLY if both are false.
...
This commit was SVN r25709.
2012-01-11 01:12:57 +00:00
Shiqing Fan
e3dfc49ced
make correct use of the newly updated structures in the Windows module.
...
This commit was SVN r25699.
2012-01-09 11:08:34 +00:00
Ralph Castain
840841bb8f
Missed a couple
...
This commit was SVN r25686.
2011-12-29 23:30:19 +00:00
Ralph Castain
af7fb68cfb
If we forward envars in rsh, then we have to be very careful about both duplicate entries and disallowed characters on the cmd line. To aid with detecting duplicates, make all cmd line options be given in their mca variant. Check anything we might add for semi-colons and protect those values with quotes.
...
This commit was SVN r25685.
2011-12-29 23:25:25 +00:00
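A rough Python sketch of the duplicate and semicolon handling described in r25685. The function name `forward_mca_params` is hypothetical, and the sketch assumes that forwarded parameters arrive as `OMPI_MCA_<name>` environment variables and are emitted in the `-mca <name> <value>` form; it is not the launcher's actual code.

```python
def forward_mca_params(argv, environ, prefix="OMPI_MCA_"):
    """Append '-mca name value' triples for MCA params found in environ.

    Params already on the command line in their -mca form are skipped, and
    values containing a semicolon are wrapped in double quotes so the remote
    shell does not treat the semicolon as a command separator.
    """
    # Names already given on the command line as: -mca <name> <value>
    seen = {argv[i + 1] for i in range(len(argv) - 2) if argv[i] == "-mca"}
    out = list(argv)
    for key in sorted(environ):
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        if name in seen:
            continue  # duplicate entry: keep the command-line value
        value = environ[key]
        if ";" in value:
            value = '"' + value + '"'
        out += ["-mca", name, value]
        seen.add(name)
    return out
```

Using a single canonical form for every option is what makes the duplicate check a simple set lookup.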
Jeff Squyres
a4c8bb27fa
Pull in the MPIR_Breakpoint symbol via a dummy function in
...
debuggers_base_fns.c: orte_debugger_base_pull_mpir_breakpoint().
This commit was SVN r25660.
2011-12-15 18:39:34 +00:00
Ralph Castain
2dd2694f25
Fix comm_spawn in oversubscribed conditions. If oversubscription is allowed, let nodes flow into the mapper even if they are oversubscribed, constrained by the slots_max absolute ceiling. Clean up error messages when comm_spawn fails so it correctly and succinctly reports the error.
...
This commit was SVN r25659.
2011-12-15 18:04:48 +00:00
Ralph Castain
1adefcc176
When routing is not enabled, all routes must go direct
...
This commit was SVN r25656.
2011-12-15 15:32:09 +00:00
Ralph Castain
e683b2f9c7
Minor touchup - reset the pointer to the end of the list each time to ensure we get the nodes in correct daemon order
...
This commit was SVN r25651.
2011-12-14 22:16:52 +00:00
Ralph Castain
912abe8a6c
Catch one more use-case
...
This commit was SVN r25649.
2011-12-14 21:03:19 +00:00
Ralph Castain
f531b09a8d
Correctly handle -host and -hostfile options. Ensure the initial vm launch constrains itself to the union of specified hosts if those options are given. Get oversubscribe set correctly for that case.
...
This commit was SVN r25648.
2011-12-14 20:01:15 +00:00
George Bosilca
ac26f58bd7
I guess this wasn't yet ready for prime time.
...
This commit was SVN r25624.
2011-12-12 23:55:11 +00:00
Nathan Hjelm
885d5cbcf8
enable ptmalloc when using uGNI
...
This commit was SVN r25621.
2011-12-12 20:52:51 +00:00
Nathan Hjelm
be11acf727
bug fix. don't add node to allocated_nodes twice
...
This commit was SVN r25619.
2011-12-12 19:14:41 +00:00
Ralph Castain
3f1ae5d89b
No longer need this include
...
This commit was SVN r25606.
2011-12-09 00:40:07 +00:00
Ralph Castain
44094cd5b3
Remove compiler warning
...
This commit was SVN r25601.
2011-12-08 16:35:41 +00:00
Samuel Gutierrez
0a922dcb3e
fixes XE6 build.
...
This commit was SVN r25600.
2011-12-08 16:13:58 +00:00
Samuel Gutierrez
0588e9ba36
add Cray XK6 support to ras alps. The configuration file is in a different format and in a different place.
...
This commit was SVN r25599.
2011-12-08 14:05:02 +00:00
Ralph Castain
7180ad40ad
Fix a couple of minor buglets
...
This commit was SVN r25589.
2011-12-07 21:08:35 +00:00
Ralph Castain
3e7ab1212a
Since this has come up a number of times, have the rsh launcher add MCA params from the environment by default. If it finds that the cmd line is too long, error out with a message directing the user to set a param to ignore the environmental MCA params.
...
This commit was SVN r25581.
2011-12-07 01:24:36 +00:00
Ralph Castain
7510339725
Remove stale orte_vm_launch param. Add a param that allows users to specify envars to forward/set so they can do it in the MCA param file instead of only via mpirun cmd line.
...
This commit was SVN r25580.
2011-12-06 21:31:22 +00:00
Ralph Castain
15facc4ba6
Fix comm_spawn yet again...add another test
...
This commit was SVN r25579.
2011-12-06 20:15:40 +00:00
Ralph Castain
90b7f2a7bf
The rest of the multi app_context fix. Remove the restriction on number of app_contexts that can have zero np specified as multiple mappers now support that use-case. Update the ranking algorithms to respect and track bookmarks. Ensure we properly set the oversubscribed flag on a per-node basis.
...
This commit was SVN r25578.
2011-12-06 17:28:29 +00:00
Ralph Castain
d9c7764e9b
Remove some debug
...
This commit was SVN r25575.
2011-12-05 22:04:50 +00:00
Ralph Castain
df2f594aa8
Some cleanup associated with multiple app_contexts. Ensure nodes only get entered once into the map. Correctly handle bookmarks. Cleanup tracking of slots_inuse and correct detection of oversubscription.
...
Still need to resolve the ranking issue so it starts at the bookmark, but that will come next.
This commit was SVN r25574.
2011-12-05 22:01:08 +00:00
Abhishek Kulkarni
0b7c51fae2
Correct an invalid reference to a missing help file.
...
This commit was SVN r25573.
2011-12-05 21:29:07 +00:00
Josh Hursey
b5ac320826
* If not able to checkpoint at this time (say because we are already checkpointing or restarting) then make sure to re-set the listener so that we can checkpoint later.
...
* Work around duplicate node names in the map. This should not happen normally, but if the rmaps component gets it wrong, provide a workaround. Ralph is working on an rmaps fix for this, so we will likely remove/comment out the workaround later.
This commit was SVN r25572.
2011-12-05 19:29:26 +00:00
Josh Hursey
cc57840b53
Fix ess/tool so that it does not segv when using the rsh PLM. Just have it use the base function directly to avoid similar problems with finalizing other components.
...
This commit was SVN r25571.
2011-12-05 15:40:46 +00:00
Ralph Castain
6cbd8fa6c9
Keep everyone in sync with new job state
...
This commit was SVN r25563.
2011-12-02 14:12:40 +00:00
Ralph Castain
07655e2945
Handle the case where the allocator "fibs" to us about the node names. In some cases (ahem...you know who you are!), the allocator will tell us a node number (e.g., "16"). However, the daemon will return a node name (e.g., "nid0016") - leaving us not recognizing its location.
...
So provide a new parameter (can't have too many!) that handles this situation by stripping the prefix from the returned node name. Also do a little cleanup to ensure we cleanly exit from errors, without generating too many annoying messages.
This commit was SVN r25562.
2011-12-02 14:10:08 +00:00
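The mismatch in r25562 ("16" from the allocator versus "nid0016" from the daemon) can be illustrated with a short Python sketch. The function name `strip_prefix` is hypothetical. Dropping the zero padding after the prefix is an assumption made here so that the two example values line up; the commit message only mentions stripping the prefix.

```python
def strip_prefix(node_name, prefix):
    """Remove a configured prefix from a daemon-reported node name.

    Assumption (not stated in the commit message): when what remains is
    purely numeric, zero padding is dropped as well, so that "nid0016"
    becomes "16" and can match the allocator's node number.
    """
    if not prefix or not node_name.startswith(prefix):
        return node_name  # nothing to strip; leave the name untouched
    rest = node_name[len(prefix):]
    if rest.isdigit():
        return str(int(rest))
    return rest
```

Names that do not carry the prefix pass through unchanged, so the parameter has no effect in environments where the allocator and daemon already agree.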
Jeff Squyres
ecf6ba910c
Silence a few icc warnings about mixing enums with other types.
...
This commit was SVN r25560.
2011-12-02 13:18:54 +00:00
Ralph Castain
641e17f26c
A better way of handling fqdn allocations. Prior method was wrong as it equated "node1" with "node10", which definitely caused problems.
...
Detect the addition of fqdn nodes in the allocation. If not found, then strip all incoming hostnames from daemons of any domain info when matching those names against the names in the node pool.
Leave some protection and "live" diagnostic output in place so we can continue to detect problems across all environments.
This commit was SVN r25557.
2011-12-01 14:24:43 +00:00
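The "node1" versus "node10" confusion in r25557 is what a prefix-style comparison produces. A minimal Python sketch of the approach the message describes, with hypothetical helper names (`short_name`, `pool_has_fqdn`, `find_node`) that are not ORTE functions: strip domain info from the daemon's hostname only when the allocation contains no FQDN entries, then compare whole names.

```python
def short_name(hostname):
    """Return the hostname with any domain info removed."""
    return hostname.split(".", 1)[0]


def pool_has_fqdn(pool):
    """True if any node in the pool was allocated with a domain part.

    Simplification: names that are raw IP addresses are not treated
    specially in this sketch.
    """
    return any("." in name for name in pool)


def find_node(daemon_host, pool):
    """Return the pool entry matching the daemon's hostname, or None."""
    candidate = daemon_host if pool_has_fqdn(pool) else short_name(daemon_host)
    for name in pool:
        # Whole-name equality: "node1" must never match "node10".
        if name == candidate:
            return name
    return None
```

For example, `find_node("node10.example.com", ["node1", "node10"])` resolves to the `node10` entry rather than stopping at `node1`.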
Ralph Castain
512aea79bc
Print the right nodename value, fix the strange case
...
This commit was SVN r25556.
2011-12-01 02:31:56 +00:00
Ralph Castain
44394c6b34
Add a little more protection
...
This commit was SVN r25555.
2011-12-01 00:30:56 +00:00
Ralph Castain
c4ea7a252a
Add a little protection against badly formed node names so we don't segfault if they are encountered
...
This commit was SVN r25554.
2011-11-30 23:33:59 +00:00
Ralph Castain
fa9e99454a
Don't divide by cpus-per-task - we'll deal with that at binding time.
...
This commit was SVN r25552.
2011-11-30 21:35:25 +00:00
Ralph Castain
c56acf60ca
Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order.
...
Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use.
Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact.
Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code.
This commit was SVN r25551.
2011-11-30 19:58:24 +00:00
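The reordering described in r25551 can be sketched in Python. The names here (`order_nodes_by_daemon`, the `(vpid, nodename)` callback tuples) are hypothetical stand-ins for the real ORTE data structures: each daemon reports the node it landed on, and the mapping list is then built in daemon-vpid order rather than allocation order.

```python
def order_nodes_by_daemon(allocation, callbacks):
    """Return node names ordered by the vpid of the daemon running on each.

    allocation: node names in the order the allocator listed them.
    callbacks:  (vpid, nodename) pairs, one per daemon, in arrival order.
    """
    known = set(allocation)
    assigned = {}
    for vpid, node in callbacks:
        if node not in known:
            raise ValueError("daemon %d reported unknown node %r" % (vpid, node))
        assigned[vpid] = node
    # Build the list in daemon-vpid order, not allocation order.
    return [assigned[vpid] for vpid in sorted(assigned)]
```

When the launcher does place daemons in allocation order, as with rsh, the result equals the allocation order and nothing changes.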
George Bosilca
7a238933b6
Silence a compiler warning.
...
This commit was SVN r25543.
2011-11-29 20:53:08 +00:00