Ralph Castain
4def94900a
Per RFC: OMPI_INSTALL_BINARIES -> OPAL_INSTALL_BINARIES
...
This commit was SVN r31634.
2014-05-05 21:43:05 +00:00
Ralph Castain
87d809eefe
Add a new "run-time controls" framework for setting controls on processes. Initially, just move the process binding code there under a new "hwloc" component. Additional components to support cgroups, power settings, etc. to follow
...
This commit was SVN r31633.
2014-05-05 19:22:06 +00:00
Ralph Castain
fae39a658d
Add third flag for open when using O_CREAT. Thanks to "robi" for reporting it and providing a patch.
...
Fixes trac:4596
Reviewed by rhc, RM-approved
cmr=v1.8.2:reviewer=ompi-gk1.8
This commit was SVN r31626.
The following Trac tickets were found above:
Ticket 4596 --> https://svn.open-mpi.org/trac/ompi/ticket/4596
2014-05-02 21:58:38 +00:00
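A minimal sketch of the issue fixed in fae39a658d, assuming a POSIX system. When `O_CREAT` is passed, `open()` reads a third `mode_t` argument; if it is omitted, the permission bits of the new file are indeterminate. The helper name `create_output_file` is hypothetical and does not come from the Open MPI tree.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper for illustration only: create (or truncate) a file
 * for writing. Because O_CREAT is set, the third argument supplies the
 * permission bits; here owner read/write (0600), further masked by the
 * process umask. */
static int create_output_file(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
}
```

The caller still has to check for a negative return value and `close()` the descriptor when done.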
Ralph Castain
60c554e097
Ugh - protect that --display-devel print with some NULL checks
...
This commit was SVN r31604.
2014-05-02 14:28:45 +00:00
Ralph Castain
c7f55be387
Per a user request, add binding info to the simple --display-map option
...
This commit was SVN r31603.
2014-05-02 14:25:59 +00:00
Ralph Castain
ccd33a17b8
Since we cannot block when calling abort, and we want to ensure any "show_help" message at least has a chance to get out before we exit, introduce a slight delay into the abort procedure.
...
Refs trac:4576
This commit was SVN r31601.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:46:25 +00:00
Ralph Castain
c1383ca1f3
Protect against NULL cpuset when not bound
...
This commit was SVN r31600.
2014-05-02 10:45:11 +00:00
Ralph Castain
0209cddb5b
Revert r31596 and r31595 as they recreate the "abort" problem - all they did was move the blocking send to another point in the code. An alternative solution to the "show_help and abort" problem will come in another commit.
...
Refs trac:4576
This commit was SVN r31599.
The following SVN revision numbers were found above:
r31595 --> open-mpi/ompi@2b61f22973
r31596 --> open-mpi/ompi@712634efd3
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:38:30 +00:00
Ralph Castain
6545e6e9a8
Add one more check for failed mapping that rarely occurs, but results in a hang when it does
...
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31598.
2014-05-02 10:35:14 +00:00
Ralph Castain
712634efd3
Silence warning
...
Refs trac:4576
This commit was SVN r31596.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-01 23:58:03 +00:00
Ralph Castain
2b61f22973
Now that the abort code no longer involves a blocking rml send section, apps that call show_help followed by abort are not printing their error message. So block them in show_help until that message gets out.
...
This commit was SVN r31595.
2014-05-01 22:57:17 +00:00
Ralph Castain
445b552d3a
Try again to get an error message printed when a daemon fails to successfully report back to mpirun. In this case, there is no guaranteed way for the daemon to output the error report itself - we don't have a connection back to the HNP, and we have tied stderr off to /dev/null (for good reasons). So the HNP has to detect the failure itself and report it.
...
The HNP can't know the precise reason, of course - all it knows is that the daemon failed. So output a generic error message that provides guidance on probable causes.
Refs trac:4571
This commit was SVN r31589.
The following Trac tickets were found above:
Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-05-01 19:48:21 +00:00
Ralph Castain
567ed25938
As per the earlier RFC, move the DB framework to orcm, thus removing it from the OMPI code repo
...
This commit was SVN r31586.
2014-05-01 15:43:32 +00:00
Ralph Castain
3b64c603b4
First stage of RFC to rename OMPI_foo build system support: change OMPI_CHECK_PACKAGE -> OPAL_CHECK_PACKAGE
...
This commit was SVN r31582.
2014-05-01 14:24:56 +00:00
Ralph Castain
238ecea311
When we comm_spawn, we really want to respect the original -host directives and not expand the daemon virtual machine unless directed to do so in the comm_spawn command. Otherwise, we will automatically launch daemons on every node in the allocation.
...
cmr=v1.8.2:reviewer=rhc:subject=respect vm boundaries during comm_spawn
This commit was SVN r31578.
2014-04-30 22:26:18 +00:00
Ralph Castain
d04a102ab8
Silence warnings
...
This commit was SVN r31573.
2014-04-30 20:55:46 +00:00
Ralph Castain
087b84b0ef
Add some further debug to the dstore framework. When doing comm_spawn, we have to exchange any provided cpu bitmaps to ensure both sides compute the same locality, else various mpi frameworks can go bonkers.
...
This commit was SVN r31572.
2014-04-30 19:29:00 +00:00
Ralph Castain
8cda1b3dc6
Don't store cpu_bitmap unless it is non-NULL
...
This commit was SVN r31570.
2014-04-30 18:12:48 +00:00
Ralph Castain
7a79b25577
Ensure we cleanup some files so session dirs can be rolled up
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31569.
2014-04-30 17:52:10 +00:00
Ralph Castain
34988ba2a2
Cleanup the MPI_Abort detection
...
Refs trac:4576
This commit was SVN r31561.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-04-30 00:51:59 +00:00
Ralph Castain
3c9d877c1b
Remove debug
...
This commit was SVN r31560.
2014-04-30 00:08:43 +00:00
Ralph Castain
9402380e1f
Fix some errors in transition
...
This commit was SVN r31559.
2014-04-30 00:07:53 +00:00
Ralph Castain
c4c9bc1573
As per the RFC:
...
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM
This commit was SVN r31557.
2014-04-29 21:49:23 +00:00
Ralph Castain
1f0efe62a4
Minor cleanup - remove unused RML tag
...
Refs trac:4576
This commit was SVN r31545.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-04-29 17:34:17 +00:00
Ralph Castain
e05b88fd18
Take another stab at resolving the "called-abort" requirement without getting stuck. Return to "drop a turd" mode, perhaps with a little more intelligence behind it. Don't worry about catching it if session dirs weren't created
...
cmr=v1.8.2:reviewer=jsquyres:subject=cleanup MPI_Abort hangs
This commit was SVN r31543.
2014-04-29 17:29:46 +00:00
Ralph Castain
2c6234698e
Fix the tarball build - need to include the orte_config.h header
...
This commit was SVN r31540.
2014-04-29 00:05:19 +00:00
Ralph Castain
3723b39f30
Ensure we don't silently fail when unable to make a connection - bark pleasantly first.
...
Refs trac:4571
This commit was SVN r31537.
The following Trac tickets were found above:
Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-04-28 19:16:32 +00:00
Ralph Castain
d642babff6
Derived from patch provided by Artem, cleanup the "abnormal" code path for selecting TCP OOB modules to connect to a remote process. If we can't find a direct interface-to-address match, then assign all the provided addresses to the first available TCP module and let the normal failure process determine if the remote proc is truly reachable.
...
cmr=v1.8.2:reviewer=artpol:subject=fix abnormal code connection path in tcp oob
This commit was SVN r31536.
2014-04-28 19:05:14 +00:00
Ralph Castain
fb61a94804
Follow the lead set by Jeff: no need to run AC_CONFIG_HEADERS on orte_config.h. However, unlike the MPI layer, we don't run that macro on another file in orte/include, so ensure we add that -I path back!
...
This commit was SVN r31534.
2014-04-28 17:12:15 +00:00
Jeff Squyres
d8715f1e3a
Close 3 more fd's that were leaking into child processes.
...
Child processes now look clean; I can't find any more fd's that are
leaking from the parent to children.
Refs trac:4550
This commit was SVN r31515.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 15:36:24 +00:00
Jeff Squyres
e1655ae68d
opal/util/fd.c: add new convenience function for setting FD_CLOEXEC
...
Paul Hargrove pointed out that Stevens tells us that we should
F_GETFD before F_SETFD. And so we shall.
Make a new convenience function to do this (opal_fd_set_cloexec()),
just so that we don't have to litter this 2-step process throughout
the code.
Refs trac:4550
This commit was SVN r31513.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 13:04:49 +00:00
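The two-step process described in e1655ae68d can be sketched as follows, using the POSIX `fcntl()` commands for descriptor flags (`F_GETFD` and `F_SETFD`). The function `set_cloexec` is an illustrative stand-in; the real signature and return convention of `opal_fd_set_cloexec()` are not shown in the log and are assumed here.

```c
#include <fcntl.h>

/* Illustrative stand-in for a helper such as opal_fd_set_cloexec().
 * Returns 0 on success, -1 on failure (assumed convention). */
static int set_cloexec(int fd)
{
    /* Step 1: read the current descriptor flags so they are preserved */
    int flags = fcntl(fd, F_GETFD, 0);
    if (flags < 0) {
        return -1;
    }
    /* Step 2: write them back with FD_CLOEXEC added */
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        return -1;
    }
    return 0;
}
```

Reading the flags first keeps any descriptor flags that are already set, rather than overwriting them with `FD_CLOEXEC` alone.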
Jeff Squyres
410f5bfb91
oob_tcp_listener.c: set both ends of this pipe to be close-on-exec
...
This pipe is used to communicate between threads in this process.
Mark both fd as close-on-exec so that children don't inherit this
pipe.
Refs trac:4550
This commit was SVN r31512.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-23 21:46:41 +00:00
Jeff Squyres
87e6232e67
orterun.c: set an fd to be close-on-exec
...
Make sure the debugger attach fifo is marked as close-on-exec so that
children procs don't inherit it. For example, if you salloc a SLURM
allocation and run "mpirun ..." in there (i.e., mpirun is running on
the head node, and launching on to back-end nodes), the forked srun's
will inherit this fd if it is still open.
Refs trac:4550
This commit was SVN r31499.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-22 21:55:09 +00:00
Jeff Squyres
63b7ef4103
orterun.1in: Document --allow-run-as-root option
...
Add some verbiage about how mpirun now defaults to disallowing running
as root, but you can use the --allow-run-as-root option to override
this default behavior.
Refs trac:4536
This commit was SVN r31477.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-22 14:34:32 +00:00
Jeff Squyres
ea4c916096
plm_slurm_module.c: don't leave the extra fd to /dev/null open
...
Prior to r29058, this same logic was in place (i.e., ensure that the
extra fd to /dev/null is closed). It looks like it was accidentally
removed in the ORTE conversion to the state machine in r29058.
This might have something to do with many hangs that we're seeing
in Cisco MTT with jobs that exhibit failure (e.g., call MPI_ABORT)...?
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31469.
The following SVN revision numbers were found above:
r29058 --> open-mpi/ompi@a200e4f865
2014-04-21 20:09:15 +00:00
Jeff Squyres
38a27b858d
Protect for the CLEANUP case where tmp hasn't been set yet
...
Refs trac:4536
This commit was SVN r31438.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:34:53 +00:00
Jeff Squyres
482b465c05
Trivial format change: use the same length of lines and \n offsets as opal_show_help().
...
Refs trac:4536
This commit was SVN r31437.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:14:45 +00:00
Jeff Squyres
530f22c403
proc_info.c: uncomment C99 struct member initialization usage
...
The C99 usage to initialize via struct member names was already there,
but commented out. This commit doesn't fix any known problem; it
simply uncomments the C99 code, because it's safer/better.
This commit was SVN r31425.
2014-04-18 17:26:07 +00:00
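For readers unfamiliar with the C99 feature mentioned in 530f22c403, here is a small sketch of initialization by struct member name. The struct `proc_info_example` and its fields are hypothetical and do not reflect the actual layout used in proc_info.c.

```c
#include <stddef.h>

/* Hypothetical struct for illustration; not the real orte_process_info. */
struct proc_info_example {
    const char *nodename;
    int pid;
    int num_procs;
    const char *tmpdir;
};

static struct proc_info_example make_default_info(void)
{
    /* C99 designated initializers: each value is tied to a member name,
     * so reordering or adding members cannot silently shift values.
     * Members that are not named (tmpdir here) are zero-initialized. */
    struct proc_info_example info = {
        .nodename = "localhost",
        .pid = 0,
        .num_procs = 1
    };
    return info;
}
```

With positional initialization, inserting a new member in the middle of the struct would silently assign values to the wrong fields, which is why the named form is generally considered safer.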
Ralph Castain
8594f5d738
Correctly set a non-zero exit status when mpirun is terminated by signal
...
Fixes trac:4537
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31423.
The following Trac tickets were found above:
Ticket 4537 --> https://svn.open-mpi.org/trac/ompi/ticket/4537
2014-04-18 16:39:08 +00:00
Ralph Castain
12094eb7b2
Add some further protections after discussion with Jeff
...
Refs trac:4536
This commit was SVN r31422.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 16:21:55 +00:00
Ralph Castain
8d72633acf
Ensure that the session directory fields of orte_process_info have been initialized prior to cleaning up those directories as part of the initialization process that deals with stale session directory trees.
...
Fixes trac:4534
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31421.
The following Trac tickets were found above:
Ticket 4534 --> https://svn.open-mpi.org/trac/ompi/ticket/4534
2014-04-18 14:25:48 +00:00
Ralph Castain
a368e84e70
Per the RFC, remove the sensor framework from the ORTE code area, relocating it offsite to the ORCM code area. Also update some ignores to ensure we don't pickup crosstalk in components
...
This commit was SVN r31403.
2014-04-15 21:48:24 +00:00
Ralph Castain
bbdbc5f8a8
Per suggestion from George, use a pipe for terminating the thread.
...
Refs trac:4510
This commit was SVN r31381.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-14 01:02:46 +00:00
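The pipe-based termination mentioned in bbdbc5f8a8 can be sketched as below. This is a generic illustration of the technique, not the code in oob_tcp_listener.c: the listener blocks in `select()` on both its listening socket and the read end of a pipe, and another thread writes one byte to the pipe to wake it immediately instead of waiting for a timeout. The names `wait_for_event` and `request_stop` are hypothetical.

```c
#include <sys/select.h>
#include <unistd.h>

/* Block until either descriptor is readable.
 * Returns 1 if a stop was requested via stop_fd, 0 if listen_fd is
 * ready, and -1 on error (including interruption by a signal). */
static int wait_for_event(int listen_fd, int stop_fd)
{
    fd_set readfds;
    int maxfd = (listen_fd > stop_fd) ? listen_fd : stop_fd;

    FD_ZERO(&readfds);
    FD_SET(listen_fd, &readfds);
    FD_SET(stop_fd, &readfds);

    /* NULL timeout: no periodic wakeups are needed, the pipe wakes us */
    if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0) {
        return -1;
    }
    /* Check the stop pipe first so shutdown wins over pending work */
    if (FD_ISSET(stop_fd, &readfds)) {
        return 1;
    }
    return 0;
}

/* Called by the thread that wants the listener to exit. */
static int request_stop(int stop_write_fd)
{
    char byte = 0;
    return (write(stop_write_fd, &byte, 1) == 1) ? 0 : -1;
}
```

In a real listener the loop would call `wait_for_event()` repeatedly, accept connections when it returns 0, and break out and clean up when it returns 1.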
Ralph Castain
deff85ffc3
Prevent a segfault if we encounter an error while parsing a hostfile. Don't issue an error_log output as the hostfile code already prints an error message
...
Thanks to Tetsuya Mishima for the patch. Reviewed ok by rhc.
RM-approved
cmr=v1.8.1:reviewer=ompi-gk1.8
This commit was SVN r31377.
2014-04-12 21:32:10 +00:00
Ralph Castain
2d8dff837c
Ensure we properly terminate the listening thread prior to exiting, but do so in a way that doesn't make us wait for select to timeout.
...
Refs trac:4510
This commit was SVN r31376.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-12 15:01:24 +00:00
Ralph Castain
9b30b2b783
Shave some time off of mpirun's operation by not waiting for the listener thread to terminate before exiting
...
cmr=v1.8.1:reviewer=rhc
This commit was SVN r31368.
2014-04-11 04:16:28 +00:00
Nathan Hjelm
9df795d1dd
plm/alps: silence annoying warning message when using Cray PMI 3.x or newer
...
This commit adds a workaround for messages printed by the Cray PMI library
when launching using mpirun. We are still talking with Cray to find a
better fix but this will silence the warnings for now.
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31352.
2014-04-08 21:54:10 +00:00
Dave Goodell
19efa09540
plm/slurm: tweak /dev/null usage ( #4489 )
...
See the ticket for more details.
cmr=v1.8.1:reviewer=rhc:ticket=4489
This commit was SVN r31351.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 21:46:07 +00:00
Ralph Castain
957c9ecf53
Okay, silence the anality by simplifying the already irrelevant code, thus allowing us to turn our attention to things that actually matter
...
Refs trac:4489
This commit was SVN r31348.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 19:51:11 +00:00
Ralph Castain
7c4fa3446c
Per the telecon, revert r31302 for now pending an RFC review on the idea of setting app proc envars using an MCA param
...
This commit was SVN r31345.
The following SVN revision numbers were found above:
r31302 --> open-mpi/ompi@6a1b78e26b
2014-04-08 15:47:12 +00:00