Ralph Castain
f7df960198
Silence warning
...
This commit was SVN r31139.
2014-03-18 23:15:29 +00:00
Jeff Squyres
3da579139b
More corrections w.r.t. process groups
...
To accompany r31092 and r310924, also create a new process group in
the child right after the orted forks. Add trivial configury to check
for setpgid, and only call setpgid/getpgid if setpgid is available.
Without this commit, killing the entire process group can do
unexpected things (e.g., kill the orted, mpirun, and even mpirun's
parent!).
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31132.
The following SVN revision numbers were found above:
r31092 --> open-mpi/ompi@99c9ecaed0
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r310924
2014-03-18 21:31:01 +00:00
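A minimal sketch of the idea in the commit above, not the orted code itself: the child moves into its own process group right after fork(), so a later group-wide signal reaches only the child and its descendants. HAVE_SETPGID stands in for whatever macro the configure check defines, and spawn_and_signal_group is a hypothetical helper name. The parent-side setpgid call is an addition of this sketch to close the fork race; the commit message mentions only the child.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose POSIX declarations under strict -std modes */
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: stand-in for the macro produced by the configure check. */
#ifndef HAVE_SETPGID
#define HAVE_SETPGID 1
#endif

/* Fork a child that waits in its own process group, signal that group,
 * and return the signal that terminated the child (-1 on error).
 * sig must be a signal whose default action terminates the process. */
int spawn_and_signal_group(int sig)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
#if HAVE_SETPGID
        setpgid(0, 0);      /* child becomes leader of a new process group */
#endif
        for (;;) {
            pause();        /* wait until a signal terminates us */
        }
    }

#if HAVE_SETPGID
    /* Sketch-only addition: set the group from the parent as well, so the
     * group exists even if the child has not run yet. */
    setpgid(pid, pid);
    if (kill(-pid, sig) != 0) {     /* negative pid: signal the whole group */
        return -1;
    }
#else
    if (kill(pid, sig) != 0) {      /* no process groups: signal the child only */
        return -1;
    }
#endif

    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling spawn_and_signal_group(SIGKILL) should return SIGKILL while leaving the caller and its parent untouched, because the signal is addressed to the child's group rather than the caller's.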
Ralph Castain
518ba55cf4
Ensure MPIEXEC_TIMEOUT calls the correct state to exit
...
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r31125.
2014-03-18 20:12:02 +00:00
Ralph Castain
554da83865
Set the locality for remote procs even after a comm_spawn. Ensure we store our own local cpuset upon launch so it will be shared during comm_join.
...
This provides full locality - i.e., not just node-level, but all the way down to whatever common binding level exists between the procs.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31106.
2014-03-18 14:51:07 +00:00
Ralph Castain
0aa23cdc35
Clean up copy/paste errors to ensure we progress the launch
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31102.
2014-03-18 01:24:49 +00:00
Ralph Castain
545ac7dc58
Remove the job_control_forwarding logic as we want *any* signal to go to all members of the process group
...
Refs trac:4404
This commit was SVN r31094.
The following Trac tickets were found above:
Ticket 4404 --> https://svn.open-mpi.org/trac/ompi/ticket/4404
2014-03-17 22:45:33 +00:00
Ralph Castain
5a868028a8
Revert r31091 - the functionality didn't disappear, but moved into the MPI layer :-(
...
This commit was SVN r31093.
The following SVN revision numbers were found above:
r31091 --> open-mpi/ompi@edf680855e
2014-03-17 22:30:03 +00:00
Ralph Castain
99c9ecaed0
Ensure that we send the specified signal to the entire process group of each pid provided to us. This ensures that any children spawned by our children also see the signal.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31092.
2014-03-17 22:12:15 +00:00
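As a rough illustration of signalling a whole process group for a given pid, the sketch below looks up the group with getpgid() and passes the negated group id to kill(). The function name signal_group_of is hypothetical, and the guard against signalling the caller's own group is an assumption added here for safety; it is not taken from the commit.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose getpgid() under strict -std modes */
#endif
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Send sig to every process in the group that pid belongs to.
 * Returns 0 on success, -1 on error (as kill() does). */
int signal_group_of(pid_t pid, int sig)
{
    pid_t pgid = getpgid(pid);

    if (pgid < 0) {
        return -1;                  /* no such process, or not permitted */
    }
    if (pgid == getpgrp()) {
        /* Sketch-only guard: the target shares our group, so signal just
         * that one process instead of ourselves and our parent. */
        return kill(pid, sig);
    }
    return kill(-pgid, sig);        /* negative pid addresses the group */
}
```

With sig set to 0 no signal is delivered, so signal_group_of(getpid(), 0) is a harmless way to exercise the lookup path.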
Ralph Castain
edf680855e
Restore locality computation to the nidmap code - don't know how/when it was removed, but that was not good
...
cmr=v1.7.5:reviewer=hjelmn
This commit was SVN r31091.
2014-03-17 21:59:25 +00:00
Ralph Castain
45196d222b
Minor cleanup of the node state definitions - using the enum allows the debuggers to pretty-print the value
...
This commit was SVN r31090.
2014-03-17 21:27:58 +00:00
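To illustrate why an enum helps here: a debugger can show the enumerator name for a variable of enum type, whereas a value built from #define constants is shown as a bare integer. The type and state names below are hypothetical and are not the actual ORTE node state definitions.

```c
/* Hypothetical node states, declared as an enum so that a debugger can
 * print the symbolic name instead of a raw number. */
typedef enum {
    NODE_STATE_UNKNOWN = 0,
    NODE_STATE_DOWN,
    NODE_STATE_UP
} node_state_t;

/* Companion helper for log output, where no debugger is attached. */
const char *node_state_to_string(node_state_t state)
{
    switch (state) {
    case NODE_STATE_DOWN:
        return "DOWN";
    case NODE_STATE_UP:
        return "UP";
    case NODE_STATE_UNKNOWN:
    default:
        return "UNKNOWN";
    }
}
```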
Ralph Castain
796dfe5ada
Do a little cleanup - only resusage needs the node/proc info, so remove it from the sensor base
...
This commit was SVN r31089.
2014-03-17 21:26:46 +00:00
Ralph Castain
7bb8dbade6
Extend the regular expression parsing support
...
This commit was SVN r31088.
2014-03-17 21:25:05 +00:00
Ralph Castain
0257d32eeb
There is no OOB component object - it is a simple struct with an opal_list_item_t element at the beginning
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31087.
2014-03-17 21:23:59 +00:00
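The "plain struct with a list item at the beginning" layout can be sketched as follows. Because the list item is the first member, a pointer to it can be cast back to the enclosing struct. The types list_item_t and component_t are simplified stand-ins invented for this sketch; they are not the real opal_list_item_t or the OOB component struct.

```c
#include <stddef.h>

/* Simplified stand-in for a list item: just a forward link. */
typedef struct list_item {
    struct list_item *next;
} list_item_t;

/* A plain struct, not an object class. The list item must stay first so
 * that a list_item_t pointer can be cast to component_t. */
typedef struct {
    list_item_t super;
    const char *name;
    int priority;
} component_t;

/* Walk the list and return the name of the highest-priority component,
 * or NULL if the list is empty. */
const char *highest_priority_name(list_item_t *head)
{
    const component_t *best = NULL;
    list_item_t *item;

    for (item = head; item != NULL; item = item->next) {
        const component_t *c = (const component_t *)item;
        if (best == NULL || c->priority > best->priority) {
            best = c;
        }
    }
    return best != NULL ? best->name : NULL;
}
```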
Ralph Castain
38e02890aa
ORTE doesn't care about cxx flags
...
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31086.
2014-03-17 21:21:54 +00:00
Ralph Castain
f259d50ed7
Fully fix the PMI2 warning - turned out to be larger than originally thought due to the way the function was being handled across multiple files. Properly resolve the problem by not compiling the file if PMI2 is not desired, and then appropriately setting the visibility of the function within the module
...
Refs trac:4400
This commit was SVN r31084.
The following Trac tickets were found above:
Ticket 4400 --> https://svn.open-mpi.org/trac/ompi/ticket/4400
2014-03-17 17:36:37 +00:00
Ralph Castain
e152449be4
Silence warning
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r31083.
2014-03-17 17:05:24 +00:00
Ralph Castain
b248b27637
Remove a check that prevented mpirun from exiting when it should in the single-node case
...
Refs trac:4393
This commit was SVN r31080.
The following Trac tickets were found above:
Ticket 4393 --> https://svn.open-mpi.org/trac/ompi/ticket/4393
2014-03-15 15:25:44 +00:00
Ralph Castain
fbc5e3b773
Deal with the corner case where we encounter an error when attempting to launch a daemon. In this case, we will order abnormal termination before the daemons call back to us, and thus any attempt to send them a "die" message will fail. Ensure that mpirun at least exits cleanly in this scenario, thereby allowing the remote daemons that did get launched to commit suicide when comm fails.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31068.
2014-03-14 15:32:30 +00:00
Ralph Castain
2abed09d7c
Continue to resolve priority issues. Clean up the case of forced termination in mpirun during launch processing by ensuring we can respond to socket closures, and ensuring that the remote daemons correctly close their sockets when terminating.
...
Jeff: please test a variety of conditions to ensure we get this right
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31058.
2014-03-13 04:02:24 +00:00
Ralph Castain
ac421c931d
The random number generator changes were incomplete (typos) in some places and were missing the required declspecs for visibility.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31053.
2014-03-12 22:37:27 +00:00
Ralph Castain
f56f37d364
Shifting to an event-driven RTE raises some interesting issues during shutdown. We want the last messages to get through, but also need to correctly shut down the virtual machine. This requires a delicate balancing act across event priorities, and the need to check for termination conditions in places where related events get processed.
...
Change the priority of comm_failure and job_termination events to ensure we process final messages prior to terminating. Check for termination conditions when processing proc termination events as we may order proc termination when the daemon gets an exit command, but we can't see the proc actually terminate until we get out of that message event.
Jeff: probably easiest to review this by testing. I tested it under both Slurm and rsh on v1.7.5 as well as trunk
cmr=v1.7.5:reviewer=jsquyres:subject=resolve event priorities during VM shutdown
This commit was SVN r31042.
2014-03-12 16:49:58 +00:00
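The priority change described above can be pictured with a toy dispatcher: if termination events carry a lower priority than message events, every pending message is handled before the shutdown event, even when the shutdown event was queued first. This is a sketch under stated assumptions, not the ORTE state machine. The event kinds, the numeric priorities, and the "smaller value runs first" convention are all invented for illustration.

```c
#include <stddef.h>

typedef enum { EV_MESSAGE, EV_JOB_TERMINATED } ev_kind_t;

typedef struct {
    ev_kind_t kind;
    int priority;   /* assumption: smaller value means "runs earlier" */
    int done;       /* nonzero once the event has been processed */
} event_t;

/* Pick the pending event with the smallest priority value; ties go to the
 * event that was queued first. Returns NULL when nothing is pending. */
event_t *next_event(event_t *queue, size_t n)
{
    event_t *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (queue[i].done) {
            continue;
        }
        if (best == NULL || queue[i].priority < best->priority) {
            best = &queue[i];
        }
    }
    return best;
}

/* Process every event, recording the order in which kinds were handled.
 * order must have room for n entries. Returns the number processed. */
size_t drain(event_t *queue, size_t n, ev_kind_t *order)
{
    size_t count = 0;
    event_t *ev;

    while ((ev = next_event(queue, n)) != NULL) {
        order[count++] = ev->kind;
        ev->done = 1;
    }
    return count;
}
```

With a termination event queued first at priority 5 and two messages queued after it at priority 4, drain() handles both messages before the termination event.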
Ralph Castain
a254d2db34
Silence warning when CR is not enabled
...
This commit was SVN r31025.
2014-03-12 13:47:03 +00:00
Adrian Reber
4512b3375e
OOB/TCP: wire up the existing ft_event() function
...
This commit was SVN r31022.
2014-03-12 12:47:20 +00:00
Adrian Reber
34625b360b
use the newly created JOB_STATE_FT_* events
...
This commit was SVN r31021.
2014-03-12 12:37:14 +00:00
Adrian Reber
8d40cd53ae
use the existing pretty-print function for information about the job state
...
This commit was SVN r31020.
2014-03-12 12:34:25 +00:00
Ralph Castain
7869402f5f
Sigh - looks like I did too good a job of turning things off. Back some of it out in favor of trying again when more time is available
...
Refs trac:4368
This commit was SVN r31017.
The following Trac tickets were found above:
Ticket 4368 --> https://svn.open-mpi.org/trac/ompi/ticket/4368
2014-03-12 02:10:35 +00:00
Ralph Castain
dc28015bcb
Something funny is going on when --without-orte, so revert the orte/Makefile.am for now while we try to figure it out
...
Refs trac:4368
This commit was SVN r31011.
The following Trac tickets were found above:
Ticket 4368 --> https://svn.open-mpi.org/trac/ompi/ticket/4368
2014-03-11 23:07:21 +00:00
Ralph Castain
9c66c4f439
Correctly implement --disable-oshmem and --without-orte so we don't build the disabled section of code. Fix a bunch of code rot in the PMI rte component, and add several missing headers when building --without-orte.
...
NOTE: I transferred the oshmem-disabled-by-default from the 1.7 branch to the trunk to minimize future disruption if/when we change that option.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31006.
2014-03-11 22:02:40 +00:00
Adrian Reber
49173ccd61
add debug output for the ft_event handler
...
This commit was SVN r30990.
2014-03-11 15:39:16 +00:00
Adrian Reber
7304b700e1
Fix the newly added FT event state when compiling --with-ft
...
This commit was SVN r30988.
2014-03-11 13:20:08 +00:00
Ralph Castain
8e080fb95e
Need a slightly different header
...
This commit was SVN r30986.
2014-03-11 03:03:12 +00:00
Ralph Castain
2cd1cfc7fe
Remove this ignore for now
...
This commit was SVN r30985.
2014-03-11 03:02:13 +00:00
Ralph Castain
103a5c6df1
Output the bindings if ess verbosity is high enough
...
Refs trac:4356
This commit was SVN r30982.
The following Trac tickets were found above:
Ticket 4356 --> https://svn.open-mpi.org/trac/ompi/ticket/4356
2014-03-11 01:21:14 +00:00
Ralph Castain
176b326c27
Add a comment to make Jeff happier...
...
Refs trac:4340
This commit was SVN r30980.
The following Trac tickets were found above:
Ticket 4340 --> https://svn.open-mpi.org/trac/ompi/ticket/4340
2014-03-10 23:02:04 +00:00
Ralph Castain
081669b440
When pretty-printing binding info, we need to pass the topology down to the routine as the mapper isn't always working with the local topology - otherwise, we get an erroneous help message. Thanks to Tetsuya Mishima for reporting it
...
cmr=v1.7.5:reviewer=rhc:subject=fix pretty-print of bindings
This commit was SVN r30968.
2014-03-10 15:53:07 +00:00
Adrian Reber
b51733c456
fix "warning: 'sstore_stage_select' defined but not used"
...
In the function sstore_stage_select() the local variables
were set up and defined. Unfortunately this function was
never called. This patch moves variable set up to the
sstore_stage_register() function and checks the return
values of the variable initialization.
This commit was SVN r30958.
2014-03-06 16:53:27 +00:00
Ralph Castain
7a44af375c
Add an FT event state and set the state machine to callback to the OOB base ft event when activated
...
This commit was SVN r30950.
2014-03-06 02:44:29 +00:00
Ralph Castain
9793909988
Correct the constant we check for an error. Thanks to George for noticing it.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30949.
2014-03-06 02:21:27 +00:00
Ralph Castain
fc2dd6ac48
Per Jeff's request, add a more detailed comment as to why we are turning off the warning at this time.
...
Refs trac:4339
This commit was SVN r30948.
The following Trac tickets were found above:
Ticket 4339 --> https://svn.open-mpi.org/trac/ompi/ticket/4339
2014-03-06 02:17:25 +00:00
Ralph Castain
c9465d97b4
Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command and reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test.
...
Thanks to Paul Kapinos for reporting the problem.
cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition
This commit was SVN r30942.
2014-03-05 04:38:17 +00:00
Ralph Castain
a2b539c763
Per the telecon, silence the warning for 1.7.5 to give us time to consider a better permanent solution
...
Refs trac:4339
This commit was SVN r30941.
The following Trac tickets were found above:
Ticket 4339 --> https://svn.open-mpi.org/trac/ompi/ticket/4339
2014-03-05 03:02:29 +00:00
Ralph Castain
50c30d62ca
Repair builds without hwloc
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30940.
2014-03-05 02:48:15 +00:00
Adrian Reber
e5bef82ee1
OPAL_ENABLE_FT_CR: remove compiler warnings
...
When compiling --with-ft there are a few compiler warnings about
unused variables. This patch fixes those compiler warnings.
This commit was SVN r30927.
2014-03-04 15:28:07 +00:00
Ralph Castain
da4cb39683
If we can't find a route to communicate, emit an error message rather than just exiting with a non-zero status
...
cmr=v1.7.5:reviewer=jsquyres:subject=print error if cannot communicate
This commit was SVN r30922.
2014-03-04 04:57:53 +00:00
Ralph Castain
0ac97761cc
Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named.
...
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= parameter. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.
In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.
Also clean up a couple of issues in the mapping/binding system:
* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed
* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch
* minor cleanup to the warning message when oversubscribed and binding was requested
cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system
This commit was SVN r30909.
2014-03-03 16:46:37 +00:00
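The new default for hostfile entries can be summarized in a few lines. In this sketch, an explicit slots= value always wins; otherwise the slot count falls back to the sensed core count, or to the hardware-thread count when use_hwthreads_as_cpus is set. The struct and function names are hypothetical and do not come from the ORTE sources.

```c
/* Hypothetical view of one hostfile entry plus the sensed hardware. */
typedef struct {
    int slots_given;    /* nonzero if the hostfile line carried slots=N */
    int slots;          /* the N from slots=N, if given */
    int num_cores;      /* cores detected on the node */
    int num_hwthreads;  /* hardware threads detected on the node */
} host_entry_t;

/* Number of slots to assign to the node. */
int effective_slots(const host_entry_t *host, int use_hwthreads_as_cpus)
{
    if (host->slots_given) {
        return host->slots;                 /* the user's value is kept */
    }
    return use_hwthreads_as_cpus ? host->num_hwthreads : host->num_cores;
}
```

For a node with 8 cores and 16 hardware threads listed without slots=, this yields 8 slots by default and 16 when hardware threads are treated as cpus.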
Ralph Castain
88b0e0cc6d
Allow the user to turn off the oversubscribed-binding warning if overload-allowed has been provided
...
Refs trac:4317
This commit was SVN r30892.
The following Trac tickets were found above:
Ticket 4317 --> https://svn.open-mpi.org/trac/ompi/ticket/4317
2014-02-28 17:55:53 +00:00
Ralph Castain
4a645f0342
Add detection of oversubscription with binding requested - if binding requested to core or hwt, warn and do not bind or else we will hurt performance. Also, if no binding directive was given, turn off the default binding
...
Refs trac:4317
This commit was SVN r30888.
The following Trac tickets were found above:
Ticket 4317 --> https://svn.open-mpi.org/trac/ompi/ticket/4317
2014-02-28 16:08:52 +00:00
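The decision described in this commit can be sketched as a small function: when more processes are mapped than there are slots, a request to bind to core or hardware thread is dropped with a warning, and the default binding is turned off if no directive was given. The enum, the function name, and the use of "processes greater than slots" as the oversubscription test are assumptions of this sketch, not the mapper's actual code.

```c
typedef enum {
    BIND_UNSET,     /* no directive given; the default policy would apply */
    BIND_NONE,
    BIND_HWTHREAD,
    BIND_CORE,
    BIND_SOCKET
} bind_policy_t;

/* Decide the binding policy to use for a node. Sets *warn to 1 when an
 * explicit core/hwthread request is dropped due to oversubscription. */
bind_policy_t resolve_binding(bind_policy_t requested, int nprocs, int slots,
                              int *warn)
{
    *warn = 0;
    if (nprocs <= slots) {
        return requested;       /* not oversubscribed: leave it alone */
    }
    if (requested == BIND_CORE || requested == BIND_HWTHREAD) {
        *warn = 1;              /* binding here would hurt performance */
        return BIND_NONE;
    }
    if (requested == BIND_UNSET) {
        return BIND_NONE;       /* turn off the default binding */
    }
    return requested;
}
```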
Ralph Castain
8500247c7b
Fix the by-obj mapper in the case where slots are not specified, and so we are in a perpetual oversubscribed state
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30887.
2014-02-28 05:21:46 +00:00
Ralph Castain
a4c3d0a5a0
Add some more debug to the by-obj mapper
...
This commit was SVN r30884.
2014-02-28 02:52:53 +00:00
Ralph Castain
d109c523b9
Per patch from Tetsuya Mishima, complete the overhaul of the round-robin mappers
...
Refs trac:4296
This commit was SVN r30861.
The following Trac tickets were found above:
Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296
2014-02-27 00:43:53 +00:00