Jeff Squyres
482b465c05
Trivial format change: use the same length of lines and \n offsets as opal_show_help().
...
Refs trac:4536
This commit was SVN r31437.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:14:45 +00:00
Jeff Squyres
530f22c403
proc_info.c: uncomment C99 struct member initialization usage
...
The C99 usage to initialize via struct member names was already there,
but commented out. This commit doesn't fix any known problem; it
simply uncomments the C99 code, because it's safer/better.
This commit was SVN r31425.
2014-04-18 17:26:07 +00:00
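For readers unfamiliar with the C99 feature named in the proc_info.c entry above, here is a minimal sketch of initializing a struct by member name. The struct, its fields, and the function name are hypothetical stand-ins for illustration and are not taken from proc_info.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for a process-info struct; the member names are
 * illustrative assumptions, not the real orte_process_info fields. */
struct demo_proc_info {
    int         pid;
    const char *nodename;
    int         num_procs;
    const char *session_dir;
};

/* Positional initialization: correct only while the member order in the
 * struct definition never changes. */
static const struct demo_proc_info positional = { 0, NULL, 1, NULL };

/* C99 designated initialization: each value is tied to a member name, the
 * order is free, and members that are not listed are zero-initialized. */
static const struct demo_proc_info designated = {
    .num_procs = 1,
    .pid       = 0,
};

/* Returns 1 when both initializers produced the same member values. */
int demo_initializers_match(void)
{
    return positional.pid == designated.pid
        && positional.nodename == designated.nodename
        && positional.num_procs == designated.num_procs
        && positional.session_dir == designated.session_dir;
}
```

The usual argument for the named form is that inserting or reordering members later cannot silently shift a value into the wrong field, which is presumably what "safer/better" refers to.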
Ralph Castain
8594f5d738
Correctly set a non-zero exit status when mpirun is terminated by signal
...
Fixes trac:4537
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31423.
The following Trac tickets were found above:
Ticket 4537 --> https://svn.open-mpi.org/trac/ompi/ticket/4537
2014-04-18 16:39:08 +00:00
Ralph Castain
12094eb7b2
Add some further protections after discussion with Jeff
...
Refs trac:4536
This commit was SVN r31422.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 16:21:55 +00:00
Ralph Castain
8d72633acf
Ensure that the session directory fields of orte_process_info have been initialized prior to cleaning up those directories as part of the initialization process that deals with stale session directory trees.
...
Fixes trac:4534
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31421.
The following Trac tickets were found above:
Ticket 4534 --> https://svn.open-mpi.org/trac/ompi/ticket/4534
2014-04-18 14:25:48 +00:00
Ralph Castain
a368e84e70
Per the RFC, remove the sensor framework from the ORTE code area, relocating it offsite to the ORCM code area. Also update some ignores to ensure we don't pick up crosstalk in components
...
This commit was SVN r31403.
2014-04-15 21:48:24 +00:00
Ralph Castain
bbdbc5f8a8
Per suggestion from George, use a pipe for terminating the thread.
...
Refs trac:4510
This commit was SVN r31381.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-14 01:02:46 +00:00
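The idea of terminating a thread through a pipe can be sketched as follows: the thread includes the read end of a pipe in its select() set, so writing one byte wakes it immediately instead of leaving the caller to wait for the select timeout. This is a generic illustration of the technique under assumed names (demo_listener and its functions are hypothetical), not the actual ORTE listener code.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical listener state; all names here are illustrative. */
struct demo_listener {
    int       stop_pipe[2];  /* [0] is watched by select, [1] is written to stop */
    int       stopped;       /* set by the thread once it has seen the wakeup */
    pthread_t thread;
};

static void *demo_listener_main(void *arg)
{
    struct demo_listener *l = arg;
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(l->stop_pipe[0], &readfds);
        /* A real listener would also add its listening sockets here. */
        struct timeval timeout = { .tv_sec = 60, .tv_usec = 0 };
        int rc = select(l->stop_pipe[0] + 1, &readfds, NULL, NULL, &timeout);
        if (rc < 0 && errno != EINTR) {
            return NULL;                 /* unrecoverable select error */
        }
        if (rc > 0 && FD_ISSET(l->stop_pipe[0], &readfds)) {
            l->stopped = 1;              /* woken by the stop byte */
            return NULL;
        }
        /* Timeout or EINTR: loop and wait again. */
    }
}

/* Create the pipe and start the thread. Returns 0 on success, -1 on error. */
int demo_listener_start(struct demo_listener *l)
{
    l->stopped = 0;
    if (pipe(l->stop_pipe) != 0) {
        return -1;
    }
    if (pthread_create(&l->thread, NULL, demo_listener_main, l) != 0) {
        close(l->stop_pipe[0]);
        close(l->stop_pipe[1]);
        return -1;
    }
    return 0;
}

/* Wake the thread through the pipe, join it, and release the descriptors.
 * Returns 0 if the thread exited because of the wakeup, -1 otherwise. */
int demo_listener_stop(struct demo_listener *l)
{
    char byte = 'x';
    if (write(l->stop_pipe[1], &byte, 1) != 1) {
        return -1;
    }
    pthread_join(l->thread, NULL);
    close(l->stop_pipe[0]);
    close(l->stop_pipe[1]);
    return l->stopped ? 0 : -1;
}
```

Calling demo_listener_start() followed by demo_listener_stop() returns promptly even though the select timeout in the sketch is 60 seconds, because the stop byte makes the read end readable right away.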
Ralph Castain
deff85ffc3
Prevent a segfault if we encounter an error while parsing a hostfile. Don't issue an error_log output, as the hostfile code already prints an error message
...
Thanks to Tetsuya Mishima for the patch. Reviewed ok by rhc.
RM-approved
cmr=v1.8.1:reviewer=ompi-gk1.8
This commit was SVN r31377.
2014-04-12 21:32:10 +00:00
Ralph Castain
2d8dff837c
Ensure we properly terminate the listening thread prior to exiting, but do so in a way that doesn't make us wait for select to timeout.
...
Refs trac:4510
This commit was SVN r31376.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-12 15:01:24 +00:00
Ralph Castain
9b30b2b783
Shave some time off of mpirun's operation by not waiting for the listener thread to terminate before exiting
...
cmr=v1.8.1:reviewer=rhc
This commit was SVN r31368.
2014-04-11 04:16:28 +00:00
Nathan Hjelm
9df795d1dd
plm/alps: silence annoying warning message when using Cray PMI 3.x or newer
...
This commit adds a workaround for messages printed by the Cray PMI library
when launching using mpirun. We are still talking with Cray to find a
better fix but this will silence the warnings for now.
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31352.
2014-04-08 21:54:10 +00:00
Dave Goodell
19efa09540
plm/slurm: tweak /dev/null usage ( #4489 )
...
See the ticket for more details.
cmr=v1.8.1:reviewer=rhc:ticket=4489
This commit was SVN r31351.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 21:46:07 +00:00
Ralph Castain
957c9ecf53
Okay, silence the anality by simplifying the already irrelevant code, thus allowing us to turn our attention to things that actually matter
...
Refs trac:4489
This commit was SVN r31348.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 19:51:11 +00:00
Ralph Castain
7c4fa3446c
Per the telecon, revert r31302 for now pending an RFC review on the idea of setting app proc envars using an MCA param
...
This commit was SVN r31345.
The following SVN revision numbers were found above:
r31302 --> open-mpi/ompi@6a1b78e26b
2014-04-08 15:47:12 +00:00
Ralph Castain
92ca647d3d
Fix copy error in file name
...
This commit was SVN r31344.
2014-04-08 15:31:55 +00:00
Ralph Castain
61d94fcee2
Fix the sequential mapper - it was out-of-sync with the hostfile changes, and we missed the "seq" policy when parsing the --map-by option. Thanks to Bill Chen for reporting it
...
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31333.
2014-04-08 03:38:25 +00:00
Nathan Hjelm
155130fbfc
ras/alps: fix alps support for CLE 5.x
...
Cray moved the apstat command on CLE 5.x to /opt/cray/alps/../bin and
moved a configuration file. This commit adds support for both of these
changes.
cmr=v1.8.1
This commit was SVN r31329.
2014-04-07 22:51:21 +00:00
Jeff Squyres
82e104719a
hwloc/rmaps base: Add missing help message.
...
Also, add missing ORTE_ERROR_LOG in the other case where this error
message is used (i.e., ORTE_ERROR_LOG was used in the one place, so
let's also use it in the other place).
This commit was SVN r31321.
2014-04-07 15:39:54 +00:00
Ralph Castain
8ce98ccc8d
Not sure when this got messed up, but correct the stdout/stderr redirection on the srun command so we don't get all those slurm warnings
...
cmr=v1.8.1:reviewer=dgoodell:subject=silence srun warning output
This commit was SVN r31308.
2014-04-04 04:23:31 +00:00
Ralph Castain
3fdcaeab97
Fix a problem where we need to abort due to a mapping failure, but we are in a managed environment and thus the orteds have not wired up. Thus, if we send the exit message across the routed network, the remote daemons won't have a way to relay the message along - and we won't exit.
...
If we are aborting, then set the flags so the HNP directly sends an exit command to each daemon. Make it the halt_vm command so the remote daemon doesn't try to relay it, but instead just exits without waiting for its routed children to exit first.
cmr=v1.8.1:reviewer=jsquyres:subject=fix hangs due to abort prior to daemon wireup
This commit was SVN r31304.
2014-04-02 04:17:55 +00:00
Mike Dubman
6a1b78e26b
opal: add mca param to control ranks env variables
...
add -mca base_env_list "var1=val1 var2=val2 ..." mca parameter that can be used in mca param files
or with -am app.conf mpirun commandline to set rank env variables with mca mechanism
fixed by Elena, reviewed by Miked
cmr=v1.8.1:reviewer=ompi-rm1.8
This commit was SVN r31302.
2014-04-01 21:14:31 +00:00
Ralph Castain
9d2f5f6b1f
Silence warning
...
cmr=v1.8:reviewer=ompi-gk1.8
This commit was SVN r31294.
2014-03-29 19:10:26 +00:00
Ralph Castain
70ee3fb000
Ensure that orted's are not bound to single processors if the TaskAffinity option is set by default. Thanks to Artem Polyakov for the patch, and for his patience in explaining the situation.
...
Reviewed with Moe Jette to ensure this was correct, and confirmed by me.
RM-approved
cmr=v1.8:reviewer=ompi-gk1.8
This commit was SVN r31288.
2014-03-29 18:30:38 +00:00
Jeff Squyres
173c046617
build: add Automake-like silent/verbose macros for "ln -s ..." operations
...
Also, since I put some of the macros for these silent/verbose rules up
in the top-level Makefile.man-page-rules file, I renamed it to
Makefile.ompi-rules.
I've had this sitting around for a while; now seems like as good a
time as any to commit it.
This commit was SVN r31271.
2014-03-28 18:24:32 +00:00
Ralph Castain
714cb8f573
Silence warnings
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31248.
2014-03-27 14:16:54 +00:00
Ralph Castain
390645ac2a
Per patch from Tetsuya Mishima, do a nicer job of warning the user that we need to map to a higher level to get the number of requested cpus/rank. Also, change the mapping policy to "byslot" when falling back to that option.
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31196.
2014-03-24 15:47:29 +00:00
Ralph Castain
bd9bd2ff16
Be consistent in our handling of the "only HNP in allocation" case when setting up the VM. Thanks to Tetsuya Mishima for the suggestion.
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31195.
2014-03-24 15:28:09 +00:00
Dave Goodell
5f3b81e291
oob: delete events when destroying a peer
...
Without this patch running ring_c with the usnic BTL under valgrind will
cause the orteds to segfault.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Ralph Castain <rhc@open-mpi.org>
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31161.
2014-03-19 22:15:49 +00:00
Ralph Castain
d17f811ff5
Surrender to the tyranny of C++ and give up on enum for node states, as nice as that would be, in favor of retaining memory footprint constraints.
...
This commit was SVN r31149.
2014-03-19 16:15:24 +00:00
Ralph Castain
f7df960198
Silence warning
...
This commit was SVN r31139.
2014-03-18 23:15:29 +00:00
Jeff Squyres
3da579139b
More corrections w.r.t. process groups
...
To accompany r31092 and r310924, also ensure to create a new process
group in the child right after the orted forks. Add trivial configury
to ensure that we have setpgid, and only do the setpgid/getpgid if we
have setpgid.
Without this commit, killing the entire process group can do
unexpected things (e.g., kill the orted, mpirun, and even mpirun's
parent!).
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31132.
The following SVN revision numbers were found above:
r31092 --> open-mpi/ompi@99c9ecaed0
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r310924
2014-03-18 21:31:01 +00:00
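The process-group behavior described in the entry above can be illustrated with a small POSIX sketch: the child calls setpgid() right after fork() so that it leads its own group, and the parent then signals that group with a negative pid. The function name is hypothetical, and the sketch does not reproduce the orted code or its configure-time check for setpgid.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that becomes the leader of a new process group, deliver
 * signum to that whole group, and reap the child.
 * Returns the signal that terminated the child, or -1 on error. */
int demo_signal_child_group(int signum)
{
    pid_t child = fork();
    if (child < 0) {
        return -1;
    }
    if (child == 0) {
        /* Child: leave the parent's group right away, so that a later
         * group-wide signal cannot reach the parent or its ancestors. */
        setpgid(0, 0);
        for (;;) {
            pause();   /* stand-in for the real workload */
        }
    }

    /* Parent: make the same call to close the race where the child has not
     * yet run setpgid() when the signal is sent. */
    if (setpgid(child, child) != 0) {
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return -1;
    }

    /* A negative pid addresses every process in the group, which includes
     * anything the child may have forked itself. */
    if (kill(-child, signum) != 0) {
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return -1;
    }

    int status = 0;
    if (waitpid(child, &status, 0) != child) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Without the setpgid() calls, the child would share the parent's process group, and kill(-pgid, ...) would hit the parent as well. That is the kind of unexpected collateral the commit message warns about.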
Ralph Castain
518ba55cf4
Ensure MPIEXEC_TIMEOUT calls the correct state to exit
...
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r31125.
2014-03-18 20:12:02 +00:00
Ralph Castain
554da83865
Set the locality for remote procs even after a comm_spawn. Ensure we store our own local cpuset upon launch so it will be shared during comm_join.
...
This provides full locality - i.e., not just node-level, but all the way down to whatever common binding level exists between the procs.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31106.
2014-03-18 14:51:07 +00:00
Ralph Castain
0aa23cdc35
Cleanup copy/paste errors to ensure we progress the launch
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31102.
2014-03-18 01:24:49 +00:00
Ralph Castain
545ac7dc58
Remove the job_control_forwarding logic as we want *any* signal to go to all members of the process group
...
Refs trac:4404
This commit was SVN r31094.
The following Trac tickets were found above:
Ticket 4404 --> https://svn.open-mpi.org/trac/ompi/ticket/4404
2014-03-17 22:45:33 +00:00
Ralph Castain
5a868028a8
Revert r31091 - the functionality didn't disappear, but moved into the MPI layer :-(
...
This commit was SVN r31093.
The following SVN revision numbers were found above:
r31091 --> open-mpi/ompi@edf680855e
2014-03-17 22:30:03 +00:00
Ralph Castain
99c9ecaed0
Ensure that we send the specified signal to the entire process group of each member of the pid provided to us. This ensures that any children spawned by our children also see the signal
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31092.
2014-03-17 22:12:15 +00:00
Ralph Castain
edf680855e
Restore locality computation to the nidmap code - don't know how/when it was removed, but that was not good
...
cmr=v1.7.5:reviewer=hjelmn
This commit was SVN r31091.
2014-03-17 21:59:25 +00:00
Ralph Castain
45196d222b
Minor cleanup of the node state definitions - using the enum allows the debuggers to pretty-print the value
...
This commit was SVN r31090.
2014-03-17 21:27:58 +00:00
Ralph Castain
796dfe5ada
Do a little cleanup - only resusage needs the node/proc info, so remove it from the sensor base
...
This commit was SVN r31089.
2014-03-17 21:26:46 +00:00
Ralph Castain
7bb8dbade6
Extend the regular expression parsing support
...
This commit was SVN r31088.
2014-03-17 21:25:05 +00:00
Ralph Castain
0257d32eeb
There is no OOB component object - it is a simple struct with an opal_list_item_t element at the beginning
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31087.
2014-03-17 21:23:59 +00:00
Ralph Castain
38e02890aa
ORTE doesn't care about cxx flags
...
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31086.
2014-03-17 21:21:54 +00:00
Ralph Castain
f259d50ed7
Fully fix the PMI2 warning - turned out to be larger than originally thought due to the way the function was being handled across multiple files. Properly resolve the problem by not compiling the file if PMI2 is not desired, and then appropriately setting the visibility of the function within the module
...
Refs trac:4400
This commit was SVN r31084.
The following Trac tickets were found above:
Ticket 4400 --> https://svn.open-mpi.org/trac/ompi/ticket/4400
2014-03-17 17:36:37 +00:00
Ralph Castain
e152449be4
Silence warning
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r31083.
2014-03-17 17:05:24 +00:00
Ralph Castain
b248b27637
Remove a check that prevented mpirun from exiting when it should in the single-node case
...
Refs trac:4393
This commit was SVN r31080.
The following Trac tickets were found above:
Ticket 4393 --> https://svn.open-mpi.org/trac/ompi/ticket/4393
2014-03-15 15:25:44 +00:00
Ralph Castain
fbc5e3b773
Deal with the corner case where we encounter an error when attempting to launch a daemon. In this case, we will order abnormal termination before daemons call back to us, and thus any attempt to send them a "die" message will fail. Ensure that mpirun at least exits cleanly in this scenario, thereby allowing the remote daemons that did get launched to commit suicide when comm fails.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31068.
2014-03-14 15:32:30 +00:00
Ralph Castain
2abed09d7c
Continue to resolve priority issues. Cleanup the case of forced termination in mpirun during launch processing by ensuring we can respond to socket closures, and ensuring that the remote daemons correctly close their sockets when terminating.
...
Jeff: please test a variety of conditions to ensure we get this right
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31058.
2014-03-13 04:02:24 +00:00
Ralph Castain
ac421c931d
The random number generator changes were incomplete (typo errors) in some places, and were missing the required declspecs for visibility.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31053.
2014-03-12 22:37:27 +00:00
Ralph Castain
f56f37d364
Shifting to an event-driven RTE raises some interesting issues during shutdown. We want the last messages to get through, but also need to correctly shut down the virtual machine. This requires a delicate balancing act across event priorities, and the need to check for termination conditions in places where related events get processed.
...
Change the priority of comm_failure and job_termination events to ensure we process final messages prior to terminating. Check for termination conditions when processing proc termination events as we may order proc termination when the daemon gets an exit command, but we can't see the proc actually terminate until we get out of that message event.
Jeff: probably easiest to review this by testing. I tested it under both Slurm and rsh on v1.7.5 as well as trunk
cmr=v1.7.5:reviewer=jsquyres:subject=resolve event priorities during VM shutdown
This commit was SVN r31042.
2014-03-12 16:49:58 +00:00