Jeff Squyres
ea4c916096
plm_slurm_module.c: don't leave the extra fd to /dev/null open
...
Prior to r29058, this same logic was in place (i.e., ensure that the
extra fd to /dev/null is closed). It looks like it was accidentally
removed in the ORTE conversion to the state machine in r29058.
This *might* have something to do with many hangs that we're seeing
in Cisco MTT with jobs that exhibit failure (e.g., call MPI_ABORT)...?
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31469.
The following SVN revision numbers were found above:
r29058 --> open-mpi/ompi@a200e4f865
2014-04-21 20:09:15 +00:00
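The fd handling described in the entry above can be sketched roughly as follows. This is a minimal illustration, not the code in plm_slurm_module.c: the helper name `redirect_to_devnull` is hypothetical, and the sketch assumes the usual pattern of opening /dev/null, duplicating it onto the target descriptor with dup2(), and then closing the original descriptor so it is not leaked into the launched process.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: point target_fd (e.g. STDIN_FILENO) at /dev/null.
 * Returns 0 on success, -1 on failure. */
static int redirect_to_devnull(int target_fd)
{
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0) {
        return -1;
    }
    if (fd != target_fd) {
        if (dup2(fd, target_fd) < 0) {
            close(fd);
            return -1;
        }
        /* dup2() made target_fd a second reference to /dev/null.
         * Close the extra descriptor so it does not stay open. */
        close(fd);
    }
    return 0;
}
```

Forgetting the final close() leaves one stray descriptor open in the process, which is the kind of leak the commit subject refers to.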
Ralph Castain
8594f5d738
Correctly set a non-zero exit status when mpirun is terminated by signal
...
Fixes trac:4537
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31423.
The following Trac tickets were found above:
Ticket 4537 --> https://svn.open-mpi.org/trac/ompi/ticket/4537
2014-04-18 16:39:08 +00:00
Ralph Castain
8d72633acf
Ensure that the session directory fields of orte_process_info have been initialized prior to cleaning up those directories as part of the initialization process that deals with stale session directory trees.
...
Fixes trac:4534
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31421.
The following Trac tickets were found above:
Ticket 4534 --> https://svn.open-mpi.org/trac/ompi/ticket/4534
2014-04-18 14:25:48 +00:00
Ralph Castain
a368e84e70
Per the RFC, remove the sensor framework from the ORTE code area, relocating it offsite to the ORCM code area. Also update some ignores to ensure we don't pick up crosstalk in components
...
This commit was SVN r31403.
2014-04-15 21:48:24 +00:00
Ralph Castain
bbdbc5f8a8
Per suggestion from George, use a pipe for terminating the thread.
...
Refs trac:4510
This commit was SVN r31381.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-14 01:02:46 +00:00
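The "use a pipe for terminating the thread" idea in the entry above can be sketched as below. This is an assumption-laden illustration, not the ORTE listener code: the `struct listener` type and the `listener_start` / `listener_stop` names are hypothetical. The thread blocks in select() with no timeout on both its socket and the read end of a pipe; writing one byte to the pipe wakes it immediately, so shutdown does not have to wait for a select() timeout.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

struct listener {
    int listen_fd;     /* socket to watch, or -1 for none */
    int stop_pipe[2];  /* [0]: read end polled by the thread; [1]: write end */
    pthread_t thread;
};

static void *listener_main(void *arg)
{
    struct listener *l = arg;

    for (;;) {
        fd_set rfds;
        int maxfd = l->stop_pipe[0];

        FD_ZERO(&rfds);
        FD_SET(l->stop_pipe[0], &rfds);
        if (l->listen_fd >= 0) {
            FD_SET(l->listen_fd, &rfds);
            if (l->listen_fd > maxfd) {
                maxfd = l->listen_fd;
            }
        }

        /* NULL timeout: sleep until a descriptor becomes readable. */
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0) {
            if (errno == EINTR) {
                continue;
            }
            break;
        }
        if (FD_ISSET(l->stop_pipe[0], &rfds)) {
            break;  /* the owner asked us to stop */
        }
        /* A real listener would accept() on listen_fd here. */
    }
    return NULL;
}

static int listener_start(struct listener *l, int listen_fd)
{
    l->listen_fd = listen_fd;
    if (pipe(l->stop_pipe) != 0) {
        return -1;
    }
    if (pthread_create(&l->thread, NULL, listener_main, l) != 0) {
        close(l->stop_pipe[0]);
        close(l->stop_pipe[1]);
        return -1;
    }
    return 0;
}

static int listener_stop(struct listener *l)
{
    char byte = 0;
    ssize_t n = write(l->stop_pipe[1], &byte, 1);
    int rc;

    /* Closing the write end also makes the read end readable (EOF),
     * so the thread wakes up even if the write failed. */
    close(l->stop_pipe[1]);
    rc = pthread_join(l->thread, NULL);
    close(l->stop_pipe[0]);
    return (n == 1 && rc == 0) ? 0 : -1;
}
```

A caller would pair `listener_start(&l, fd)` with `listener_stop(&l)` at shutdown; because the join returns as soon as the pipe becomes readable, the caller can terminate the thread properly without paying for a timeout.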
Ralph Castain
deff85ffc3
Prevent a segfault if we encounter an error while parsing a hostfile. Don't issue an error_log output as the hostfile code already prints an error message
...
Thanks to Tetsuya Mishima for the patch. Reviewed ok by rhc.
RM-approved
cmr=v1.8.1:reviewer=ompi-gk1.8
This commit was SVN r31377.
2014-04-12 21:32:10 +00:00
Ralph Castain
2d8dff837c
Ensure we properly terminate the listening thread prior to exiting, but do so in a way that doesn't make us wait for select to timeout.
...
Refs trac:4510
This commit was SVN r31376.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-12 15:01:24 +00:00
Ralph Castain
9b30b2b783
Shave some time off of mpirun's operation by not waiting for the listener thread to terminate before exiting
...
cmr=v1.8.1:reviewer=rhc
This commit was SVN r31368.
2014-04-11 04:16:28 +00:00
Nathan Hjelm
9df795d1dd
plm/alps: silence annoying warning message when using Cray PMI 3.x or newer
...
This commit adds a workaround for messages printed by the Cray PMI library
when launching using mpirun. We are still talking with Cray to find a
better fix but this will silence the warnings for now.
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31352.
2014-04-08 21:54:10 +00:00
Dave Goodell
19efa09540
plm/slurm: tweak /dev/null usage ( #4489 )
...
See the ticket for more details.
cmr=v1.8.1:reviewer=rhc:ticket=4489
This commit was SVN r31351.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 21:46:07 +00:00
Ralph Castain
957c9ecf53
Okay, silence the anality by simplifying the already irrelevant code, thus allowing us to turn our attention to things that actually matter
...
Refs trac:4489
This commit was SVN r31348.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 19:51:11 +00:00
Ralph Castain
92ca647d3d
Fix copy error in file name
...
This commit was SVN r31344.
2014-04-08 15:31:55 +00:00
Ralph Castain
61d94fcee2
Fix the sequential mapper - it was out-of-sync with the hostfile changes, and we missed the "seq" policy when parsing the --map-by option. Thanks to Bill Chen for reporting it
...
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31333.
2014-04-08 03:38:25 +00:00
Nathan Hjelm
155130fbfc
ras/alps: fix alps support for CLE 5.x
...
Cray moved the apstat command on CLE 5.x to /opt/cray/alps/../bin and
moved a configuration file. This commit adds support for both of these
changes.
cmr=v1.8.1
This commit was SVN r31329.
2014-04-07 22:51:21 +00:00
Jeff Squyres
82e104719a
hwloc/rmaps base: Add missing help message.
...
Also, add missing ORTE_ERROR_LOG in the other case where this error
message is used (i.e., ORTE_ERROR_LOG was used in the one place, so
let's also use it in the other place).
This commit was SVN r31321.
2014-04-07 15:39:54 +00:00
Ralph Castain
8ce98ccc8d
Not sure when this got messed up, but correct the stdout/stderr redirection on the srun command so we don't get all those slurm warnings
...
cmr=v1.8.1:reviewer=dgoodell:subject=silence srun warning output
This commit was SVN r31308.
2014-04-04 04:23:31 +00:00
Ralph Castain
3fdcaeab97
Fix a problem where we need to abort due to a mapping failure, but we are in a managed environment and thus the orteds have not wired up. Thus, if we send the exit message across the routed network, the remote daemons won't have a way to relay the message along - and we won't exit.
...
If we are aborting, then set the flags so the HNP directly sends an exit command to each daemon. Make it the halt_vm command so the remote daemon doesn't try to relay it, but instead just exits without waiting for its routed children to exit first.
cmr=v1.8.1:reviewer=jsquyres:subject=fix hangs due to abort prior to daemon wireup
This commit was SVN r31304.
2014-04-02 04:17:55 +00:00
Ralph Castain
9d2f5f6b1f
Silence warning
...
cmr=v1.8:reviewer=ompi-gk1.8
This commit was SVN r31294.
2014-03-29 19:10:26 +00:00
Ralph Castain
70ee3fb000
Ensure that orteds are not bound to single processors if the TaskAffinity option is set by default. Thanks to Artem Polyakov for the patch, and for his patience in explaining the situation.
...
Reviewed with Moe Jette to ensure this was correct, and confirmed by me.
RM-approved
cmr=v1.8:reviewer=ompi-gk1.8
This commit was SVN r31288.
2014-03-29 18:30:38 +00:00
Jeff Squyres
173c046617
build: add Automake-like silent/verbose macros for "ln -s ..." operations
...
Also, since I put some of the macros for these silent/verbose rules up
in the top-level Makefile.man-page-rules file, I renamed it to
Makefile.ompi-rules.
I've had this sitting around for a while; now seems like as good a
time as any to commit it.
This commit was SVN r31271.
2014-03-28 18:24:32 +00:00
Ralph Castain
714cb8f573
Silence warnings
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31248.
2014-03-27 14:16:54 +00:00
Ralph Castain
390645ac2a
Per patch from Tetsuya Mishima, do a nicer job of warning the user that we need to map to a higher level to get the number of requested cpus/rank. Also, change the mapping policy to "byslot" when falling back to that option.
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31196.
2014-03-24 15:47:29 +00:00
Ralph Castain
bd9bd2ff16
Be consistent in our handling of the "only HNP in allocation" case when setting up the VM. Thanks to Tetsuya Mishima for the suggestion.
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31195.
2014-03-24 15:28:09 +00:00
Dave Goodell
5f3b81e291
oob: delete events when destroying a peer
...
Without this patch running ring_c with the usnic BTL under valgrind will
cause the orteds to segfault.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Ralph Castain <rhc@open-mpi.org>
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31161.
2014-03-19 22:15:49 +00:00
Ralph Castain
d17f811ff5
Surrender to the tyranny of C++ and give up on enum for node states, as nice as that would be, in favor of retaining memory footprint constraints.
...
This commit was SVN r31149.
2014-03-19 16:15:24 +00:00
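The trade-off named in the entry above can be illustrated with a small sketch. The type and constant names here are hypothetical, not the ORTE definitions, and the commit does not say which C++ restriction was the obstacle. The sketch only shows the footprint side: a plain enum member typically occupies the size of an int, while a `uint8_t` typedef with `#define`d values occupies one byte, at the cost of debuggers no longer printing symbolic names.

```c
#include <stdint.h>

/* Enum form: debuggers can print the symbolic name, but the member
 * is usually as wide as an int. */
enum node_state_enum {
    NODE_STATE_ENUM_UNKNOWN,
    NODE_STATE_ENUM_DOWN,
    NODE_STATE_ENUM_UP
};

/* Compact form: one byte per state, values defined as plain constants. */
typedef uint8_t node_state_t;
#define NODE_STATE_UNKNOWN 0
#define NODE_STATE_DOWN    1
#define NODE_STATE_UP      2

/* Two hypothetical node records that differ only in the state field. */
struct node_with_enum {
    enum node_state_enum state;
    uint8_t slots;
};

struct node_compact {
    node_state_t state;
    uint8_t slots;
};
```

Comparing `sizeof(struct node_with_enum)` with `sizeof(struct node_compact)` on a given platform shows how much each node record grows when the enum is used.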
Jeff Squyres
3da579139b
More corrections w.r.t. process groups
...
To accompany r31092 and r310924, also ensure to create a new process
group in the child right after the orted forks. Add trivial configury
to ensure that we have setpgid, and only do the setpgid/getpgid if we
have setpgid.
Without this commit, killing the entire process group can do
unexpected things (e.g., kill the orted, mpirun, and even mpirun's
parent!).
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31132.
The following SVN revision numbers were found above:
r31092 --> open-mpi/ompi@99c9ecaed0
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r310924
2014-03-18 21:31:01 +00:00
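The process-group handling in the entry above can be sketched as follows. The function names are hypothetical and this is not the orted code; the commit additionally guards the calls behind a configure check for setpgid, which is omitted here. The child puts itself into a new process group right after fork(), and signals are then sent to the negated group id so that any grandchildren receive them too, without reaching the parent's own group.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Example child body: block until a signal arrives. */
static void child_wait_forever(void)
{
    for (;;) {
        pause();
    }
}

/* Fork a child that becomes the leader of its own process group.
 * Returns the child's pid in the parent, or -1 on failure. */
static pid_t spawn_in_new_group(void (*child_fn)(void))
{
    pid_t pid = fork();

    if (pid == 0) {
        setpgid(0, 0);      /* new group whose id equals our pid */
        child_fn();
        _exit(0);
    }
    if (pid > 0) {
        /* Set it from the parent as well, so the group exists by the
         * time this function returns regardless of scheduling. */
        setpgid(pid, pid);
    }
    return pid;
}

/* Send sig to every process in the group that pid belongs to. */
static int signal_group(pid_t pid, int sig)
{
    pid_t pgid = getpgid(pid);

    if (pgid < 0) {
        return -1;
    }
    return kill(-pgid, sig);  /* negative id targets the whole group */
}
```

If the child had stayed in the parent's group, `kill(-pgid, sig)` would also hit the parent and anything else sharing that group, which matches the unexpected behavior the commit message warns about.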
Ralph Castain
554da83865
Set the locality for remote procs even after a comm_spawn. Ensure we store our own local cpuset upon launch so it will be shared during comm_join.
...
This provides full locality - i.e., not just node-level, but all the way down to whatever common binding level exists between the procs.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31106.
2014-03-18 14:51:07 +00:00
Ralph Castain
0aa23cdc35
Clean up copy/paste errors to ensure we progress the launch
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31102.
2014-03-18 01:24:49 +00:00
Ralph Castain
545ac7dc58
Remove the job_control_forwarding logic as we want *any* signal to go to all members of the process group
...
Refs trac:4404
This commit was SVN r31094.
The following Trac tickets were found above:
Ticket 4404 --> https://svn.open-mpi.org/trac/ompi/ticket/4404
2014-03-17 22:45:33 +00:00
Ralph Castain
99c9ecaed0
Ensure that we send the specified signal to the entire process group of each member of the pid provided to us. This ensures that any children spawned by our children also see the signal
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31092.
2014-03-17 22:12:15 +00:00
Ralph Castain
45196d222b
Minor cleanup of the node state definitions - using the enum allows the debuggers to pretty-print the value
...
This commit was SVN r31090.
2014-03-17 21:27:58 +00:00
Ralph Castain
796dfe5ada
Do a little cleanup - only resusage needs the node/proc info, so remove it from the sensor base
...
This commit was SVN r31089.
2014-03-17 21:26:46 +00:00
Ralph Castain
0257d32eeb
There is no OOB component object - it is a simple struct with an opal_list_item_t element at the beginning
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31087.
2014-03-17 21:23:59 +00:00
Ralph Castain
f259d50ed7
Fully fix the PMI2 warning - turned out to be larger than originally thought due to the way the function was being handled across multiple files. Properly resolve the problem by not compiling the file if PMI2 is not desired, and then appropriately setting the visibility of the function within the module
...
Refs trac:4400
This commit was SVN r31084.
The following Trac tickets were found above:
Ticket 4400 --> https://svn.open-mpi.org/trac/ompi/ticket/4400
2014-03-17 17:36:37 +00:00
Ralph Castain
e152449be4
Silence warning
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r31083.
2014-03-17 17:05:24 +00:00
Ralph Castain
b248b27637
Remove a check that prevented mpirun from exiting when it should in the single-node case
...
Refs trac:4393
This commit was SVN r31080.
The following Trac tickets were found above:
Ticket 4393 --> https://svn.open-mpi.org/trac/ompi/ticket/4393
2014-03-15 15:25:44 +00:00
Ralph Castain
fbc5e3b773
Deal with the corner case where we encounter an error when attempting to launch a daemon. In this case, we will order abnormal termination before daemons call back to us, and thus any attempt to send them a "die" message will fail. Ensure that mpirun at least exits cleanly in this scenario, thereby allowing the remote daemons that did get launched to commit suicide when comm fails.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31068.
2014-03-14 15:32:30 +00:00
Ralph Castain
2abed09d7c
Continue to resolve priority issues. Clean up the case of forced termination in mpirun during launch processing by ensuring we can respond to socket closures, and ensuring that the remote daemons correctly close their sockets when terminating.
...
Jeff: please test a variety of conditions to ensure we get this right
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31058.
2014-03-13 04:02:24 +00:00
Ralph Castain
ac421c931d
The random number generator changes were incomplete (typo errors) in some places, and were missing the required declspecs for visibility.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31053.
2014-03-12 22:37:27 +00:00
Ralph Castain
f56f37d364
Shifting to an event-driven RTE raises some interesting issues during shutdown. We want the last messages to get through, but also need to correctly shut down the virtual machine. This requires a delicate balancing act across event priorities, and the need to check for termination conditions in places where related events get processed.
...
Change the priority of comm_failure and job_termination events to ensure we process final messages prior to terminating. Check for termination conditions when processing proc termination events as we may order proc termination when the daemon gets an exit command, but we can't see the proc actually terminate until we get out of that message event.
Jeff: probably easiest to review this by testing. I tested it under both Slurm and rsh on v1.7.5 as well as trunk
cmr=v1.7.5:reviewer=jsquyres:subject=resolve event priorities during VM shutdown
This commit was SVN r31042.
2014-03-12 16:49:58 +00:00
Ralph Castain
a254d2db34
Silence warning when CR is not enabled
...
This commit was SVN r31025.
2014-03-12 13:47:03 +00:00
Adrian Reber
4512b3375e
OOB/TCP: wire up the existing ft_event() function
...
This commit was SVN r31022.
2014-03-12 12:47:20 +00:00
Adrian Reber
34625b360b
use the newly created JOB_STATE_FT_* events
...
This commit was SVN r31021.
2014-03-12 12:37:14 +00:00
Adrian Reber
8d40cd53ae
use the existing pretty-print function for information about the job state
...
This commit was SVN r31020.
2014-03-12 12:34:25 +00:00
Adrian Reber
49173ccd61
add debug output for the ft_event handler
...
This commit was SVN r30990.
2014-03-11 15:39:16 +00:00
Adrian Reber
7304b700e1
Fix the newly added FT event state when compiling --with-ft
...
This commit was SVN r30988.
2014-03-11 13:20:08 +00:00
Ralph Castain
8e080fb95e
Need a slightly different header
...
This commit was SVN r30986.
2014-03-11 03:03:12 +00:00
Ralph Castain
2cd1cfc7fe
Remove this ignore for now
...
This commit was SVN r30985.
2014-03-11 03:02:13 +00:00
Ralph Castain
103a5c6df1
Output the bindings if ess verbosity is high enough
...
Refs trac:4356
This commit was SVN r30982.
The following Trac tickets were found above:
Ticket 4356 --> https://svn.open-mpi.org/trac/ompi/ticket/4356
2014-03-11 01:21:14 +00:00
Ralph Castain
081669b440
When pretty-printing binding info, we need to pass the topology down to the routine as the mapper isn't always working with the local topology - otherwise, we get an erroneous help message. Thanks to Tetsuya Mishima for reporting it
...
cmr=v1.7.5:reviewer=rhc:subject=fix pretty-print of bindings
This commit was SVN r30968.
2014-03-10 15:53:07 +00:00