Ralph Castain
2c3d07db24
Cleanup the test so it is MPI correct
...
This commit was SVN r31919.
2014-06-01 17:57:36 +00:00
Ralph Castain
8736a1c138
Per RFC:
...
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php
Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has set up the system to support it (or you run as root).
This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Ralph Castain
cf2c7381d0
Replace the PML barrier with an RTE barrier for now until we can come up with a better solution for connectionless BTLs.
...
Refs trac:4643
This commit was SVN r31915.
The following Trac tickets were found above:
Ticket 4643 --> https://svn.open-mpi.org/trac/ompi/ticket/4643
2014-06-01 16:08:56 +00:00
Ralph Castain
1107f9099e
Per the RFC issued here:
...
http://www.open-mpi.org/community/lists/devel/2014/05/14827.php
Refactor PMI support
This commit was SVN r31907.
2014-06-01 04:28:17 +00:00
Nathan Hjelm
041b72b0cc
plm/alps: better workaround for the noisy cray pmi implementation
...
This commit is a slightly better workaround to prevent messages of
the form:
[unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
[unset]:_pmi_alps_get_appLayout:pmi_alps_get_apid returned with error: Bad file descriptor
It works by completely disabling PMI in the application process when using
mpirun. This should not be an issue for any apps.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31882.
2014-05-22 16:04:36 +00:00
Oscar Vega-Gisbert
83bdebbf81
Java bindings for OSHMEM.
...
This commit was SVN r31810.
2014-05-18 21:48:09 +00:00
Nathan Hjelm
73bfecd650
More leak fixes.
...
Two leaks are fixed in this commit:
- Do not leak btl component list items.
- Do not leak the nodename when decoding the pidmap.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31779.
2014-05-15 16:38:13 +00:00
Nathan Hjelm
59d09ad9de
orte: fix several small memory leaks
...
grpcomm: fix memory leaks
We were leaking the caddy object used to pass data to the callback
function. This commit fixes these leaks.
oob,rml: fix memory leaks
This commit fixes several leaks:
- Both the oob/base and oob/tcp were leaking objects on their peer
hash tables. Iterate on the hash tables and free any objects.
- Leaked sent messages because of missing OBJ_RELEASE. I placed the
release in ORTE_RML_SEND_COMPLETE to catch all the possible
paths.
ess/base: close the state framework
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31776.
2014-05-15 15:06:27 +00:00
Gilles Gouaillardet
5b9364fc12
Fix a memory leak in orte_register_params()
...
mca_base_var_register (..., MCA_BASE_VAR_TYPE_STRING, ...)
will dup() the orte_set_slots string, so there is no need
to do this in the first place.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31773.
2014-05-15 10:31:19 +00:00
Gilles Gouaillardet
5f82c391a6
Fix memory leaks in orte/util/nidmap.c
...
This patch fixes four memory leaks in orte/util/nidmap.c :
- hwloc_get_root_obj(opal_hwloc_topology)->userdata was never freed
- even if bo->bytes is freed in the decode, bo was not freed
- a job list is populated but never used nor freed
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31770.
2014-05-15 08:28:53 +00:00
Ralph Castain
ad0e8f841d
Just pick a module to handle the incoming connection if no direct interface is identified. Siegmar hit it because his IP/netmask is disjoint, but a router was able to make the connection.
...
Refs trac:4627
This commit was SVN r31763.
The following Trac tickets were found above:
Ticket 4627 --> https://svn.open-mpi.org/trac/ompi/ticket/4627
2014-05-14 19:23:02 +00:00
Ralph Castain
e605e73379
Close the incoming socket if we aren't going to accept it
...
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31759.
2014-05-14 16:51:59 +00:00
Ralph Castain
3a1c2fff3e
Correct a misplaced bracket - daemons shouldn't be doing app-related operations
...
This may need a patch for 1.8.2, but we can try to directly apply it
cmr=v1.8.2:reviewer=hjelmn
This commit was SVN r31754.
2014-05-14 15:23:30 +00:00
Nathan Hjelm
2a57e71a47
plm/alps: fix typo introduced in r31589
...
This commit was SVN r31747.
The following SVN revision numbers were found above:
r31589 --> open-mpi/ompi@445b552d3a
2014-05-13 22:36:54 +00:00
Ralph Castain
f55c587a74
Per patch from Tetsuya Mishima, ensure the rank_file mapper accurately tracks number of nodes in the map
...
Refs trac:4594
This commit was SVN r31725.
The following Trac tickets were found above:
Ticket 4594 --> https://svn.open-mpi.org/trac/ompi/ticket/4594
2014-05-13 14:36:25 +00:00
Ralph Castain
5388347511
Per Jeff's suggestion, remove function that has duplicate functionality and just use one to check if session_dir directory should be removed.
...
Refs trac:4584
This commit was SVN r31691.
The following Trac tickets were found above:
Ticket 4584 --> https://svn.open-mpi.org/trac/ompi/ticket/4584
2014-05-08 17:22:43 +00:00
Ralph Castain
aaae4841e9
Flush the show_help system on our way out - this also restores the opal_show_help function pointer to the OPAL layer for any subsequent processing.
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31685.
2014-05-08 14:37:47 +00:00
Ralph Castain
5602156a1c
Use the correct abstraction layer name for the data dirs
...
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091
The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
...
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
a8e2d6c3a6
The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature:
...
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
Only thing left is ompilibdir being treated similarly to what we did for srcdir/builddir. Coming soon.
This commit was SVN r31678.
2014-05-07 21:48:53 +00:00
Ralph Castain
05590b6a8c
Correct the datastore containing the coprocessor info
...
This commit was SVN r31677.
2014-05-07 19:29:12 +00:00
Ralph Castain
4def94900a
Per RFC: OMPI_INSTALL_BINARIES -> OPAL_INSTALL_BINARIES
...
This commit was SVN r31634.
2014-05-05 21:43:05 +00:00
Ralph Castain
87d809eefe
Add a new "run-time controls" framework for setting controls on processes. Initially, just move the process binding code there under a new "hwloc" component. Additional components to support cgroups, power settings, etc. to follow
...
This commit was SVN r31633.
2014-05-05 19:22:06 +00:00
Ralph Castain
fae39a658d
Add third flag for open when using O_CREAT. Thanks to "robi" for reporting it and providing a patch.
...
Fixes trac:4596
Reviewed by rhc, RM-approved
cmr=v1.8.2:reviewer=ompi-gk1.8
This commit was SVN r31626.
The following Trac tickets were found above:
Ticket 4596 --> https://svn.open-mpi.org/trac/ompi/ticket/4596
2014-05-02 21:58:38 +00:00
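The entry above concerns open() with O_CREAT, which takes a third argument giving the permission bits of the newly created file. The following is a minimal Python sketch of that idea, not the patched Open MPI code. The helper name create_with_mode and the 0o600 mode are assumptions chosen for illustration.

```python
import os
import stat
import tempfile


def create_with_mode(path, mode=0o600):
    """Create `path` with an explicit mode and return its permission bits."""
    # The third argument is only consulted because O_CREAT is set; it gives
    # the permissions of the new file (before the process umask is applied).
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, mode)
    os.close(fd)
    return stat.S_IMODE(os.stat(path).st_mode)


def demo():
    """Create a file in a scratch directory and report its permission bits."""
    with tempfile.TemporaryDirectory() as scratch:
        return create_with_mode(os.path.join(scratch, "example.txt"))
```

In C the mode argument is not optional when O_CREAT is used; leaving it out gives the new file unpredictable permissions, which is presumably what the reported patch addressed.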
Ralph Castain
60c554e097
Ugh - protect that --display-devel print with some NULL checks
...
This commit was SVN r31604.
2014-05-02 14:28:45 +00:00
Ralph Castain
c7f55be387
Per a user request, add binding info to the simple --display-map option
...
This commit was SVN r31603.
2014-05-02 14:25:59 +00:00
Ralph Castain
ccd33a17b8
Since we cannot block when calling abort, and we want to ensure any "show_help" message at least has a chance to get out before we exit, introduce a slight delay into the abort procedure.
...
Refs trac:4576
This commit was SVN r31601.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:46:25 +00:00
Ralph Castain
c1383ca1f3
Protect against NULL cpuset when not bound
...
This commit was SVN r31600.
2014-05-02 10:45:11 +00:00
Ralph Castain
0209cddb5b
Revert r31596 and r31595 as they recreate the "abort" problem - all they did was move the blocking send to another point in the code. An alternative solution to the "show_help and abort" problem will come in another commit
...
Refs trac:4576
This commit was SVN r31599.
The following SVN revision numbers were found above:
r31595 --> open-mpi/ompi@2b61f22973
r31596 --> open-mpi/ompi@712634efd3
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:38:30 +00:00
Ralph Castain
6545e6e9a8
Add one more check for failed mapping that rarely occurs, but results in a hang when it does
...
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31598.
2014-05-02 10:35:14 +00:00
Ralph Castain
712634efd3
Silence warning
...
Refs trac:4576
This commit was SVN r31596.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-01 23:58:03 +00:00
Ralph Castain
2b61f22973
Now that the abort code no longer involves a blocking rml send section, apps that call show_help followed by abort are not printing their error message. So block them in show_help until that message gets out.
...
This commit was SVN r31595.
2014-05-01 22:57:17 +00:00
Ralph Castain
445b552d3a
Try again to get an error message printed when a daemon fails to successfully report back to mpirun. In this case, there is no guaranteed way for the daemon to output the error report itself - we don't have a connection back to the HNP, and we have tied stderr off to /dev/null (for good reasons). So the HNP has to detect the failure itself and report it.
...
The HNP can't know the precise reason, of course - all it knows is that the daemon failed. So output a generic error message that provides guidance on probable causes.
Refs trac:4571
This commit was SVN r31589.
The following Trac tickets were found above:
Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-05-01 19:48:21 +00:00
Ralph Castain
567ed25938
As per the earlier RFC, move the DB framework to orcm, thus removing it from the OMPI code repo
...
This commit was SVN r31586.
2014-05-01 15:43:32 +00:00
Ralph Castain
3b64c603b4
First stage of RFC to rename OMPI_foo build system support: change OMPI_CHECK_PACKAGE -> OPAL_CHECK_PACKAGE
...
This commit was SVN r31582.
2014-05-01 14:24:56 +00:00
Ralph Castain
238ecea311
When we comm_spawn, we really want to respect the original -host directives and not expand the daemon virtual machine unless directed to do so in the comm_spawn command. Otherwise, we will automatically launch daemons on every node in the allocation.
...
cmr=v1.8.2:reviewer=rhc:subject=respect vm boundaries during comm_spawn
This commit was SVN r31578.
2014-04-30 22:26:18 +00:00
Ralph Castain
d04a102ab8
Silence warnings
...
This commit was SVN r31573.
2014-04-30 20:55:46 +00:00
Ralph Castain
087b84b0ef
Add some further debug to the dstore framework. When doing comm_spawn, we have to exchange any provided cpu bitmaps to ensure both sides compute the same locality, else various mpi frameworks can go bonkers.
...
This commit was SVN r31572.
2014-04-30 19:29:00 +00:00
Ralph Castain
8cda1b3dc6
Don't store cpu_bitmap unless it is non-NULL
...
This commit was SVN r31570.
2014-04-30 18:12:48 +00:00
Ralph Castain
7a79b25577
Ensure we cleanup some files so session dirs can be rolled up
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31569.
2014-04-30 17:52:10 +00:00
Ralph Castain
34988ba2a2
Cleanup the MPI_Abort detection
...
Refs trac:4576
This commit was SVN r31561.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-04-30 00:51:59 +00:00
Ralph Castain
3c9d877c1b
Remove debug
...
This commit was SVN r31560.
2014-04-30 00:08:43 +00:00
Ralph Castain
9402380e1f
Fix some errors in transition
...
This commit was SVN r31559.
2014-04-30 00:07:53 +00:00
Ralph Castain
c4c9bc1573
As per the RFC:
...
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM
This commit was SVN r31557.
2014-04-29 21:49:23 +00:00
Ralph Castain
1f0efe62a4
Minor cleanup - remove unused RML tag
...
Refs trac:4576
This commit was SVN r31545.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-04-29 17:34:17 +00:00
Ralph Castain
e05b88fd18
Take another stab at resolving the "called-abort" requirement without getting stuck. Return to "drop a turd" mode, perhaps with a little more intelligence behind it. Don't worry about catching it if session dirs weren't created
...
cmr=v1.8.2:reviewer=jsquyres:subject=cleanup MPI_Abort hangs
This commit was SVN r31543.
2014-04-29 17:29:46 +00:00
Ralph Castain
2c6234698e
Fix the tarball build - need to include the orte_config.h header
...
This commit was SVN r31540.
2014-04-29 00:05:19 +00:00
Ralph Castain
3723b39f30
Ensure we don't silently fail when unable to make a connection - bark pleasantly first.
...
Refs trac:4571
This commit was SVN r31537.
The following Trac tickets were found above:
Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-04-28 19:16:32 +00:00
Ralph Castain
d642babff6
Derived from patch provided by Artem, cleanup the "abnormal" code path for selecting TCP OOB modules to connect to a remote process. If we can't find a direct interface-to-address match, then assign all the provided addresses to the first available TCP module and let the normal failure process determine if the remote proc is truly reachable.
...
cmr=v1.8.2:reviewer=artpol:subject=fix abnormal code connection path in tcp oob
This commit was SVN r31536.
2014-04-28 19:05:14 +00:00
Ralph Castain
fb61a94804
Follow the lead set by Jeff: no need to run AC_CONFIG_HEADERS on orte_config.h. However, unlike the MPI layer, we don't run that macro on another file in orte/include, so ensure we add that -I path back!
...
This commit was SVN r31534.
2014-04-28 17:12:15 +00:00
Jeff Squyres
d8715f1e3a
Close 3 more fd's that were leaking into child processes.
...
Child processes now look clean; I can't find any more fd's that are
leaking from the parent to children.
Refs trac:4550
This commit was SVN r31515.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 15:36:24 +00:00
Jeff Squyres
e1655ae68d
opal/util/fd.c: add new convenience function for setting FD_CLOEXEC
...
Paul Hargrove pointed out that Stevens tells us that we should
FD_GETFL before FD_SETFL. And so we shall.
Make a new convenience function to do this (opal_fd_set_cloexec()),
just so that we don't have to litter this 2-step process throughout
the code.
Refs trac:4550
This commit was SVN r31513.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 13:04:49 +00:00
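The two-step process described in the entry above (read the current descriptor flags, then write them back with the close-on-exec bit added) can be sketched in Python as follows. This is an illustration of the pattern only, not the body of opal_fd_set_cloexec(); the function names set_cloexec and is_cloexec are hypothetical. The sketch uses F_GETFD and F_SETFD, which are the fcntl commands that operate on descriptor flags such as FD_CLOEXEC.

```python
import fcntl
import os


def set_cloexec(fd):
    """Mark `fd` close-on-exec without disturbing its other descriptor flags."""
    # Step 1: read the current descriptor flags.
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    # Step 2: write them back with FD_CLOEXEC added.
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_cloexec(fd):
    """Return True if `fd` will be closed automatically across exec()."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

Reading the flags first means any other descriptor flag already set on the fd is preserved rather than overwritten.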
Jeff Squyres
410f5bfb91
oob_tcp_listener.c: set both ends of this thread to be close-on-exec
...
This pipe is used to communicate between threads in this process.
Mark both fd as close-on-exec so that children don't inherit this
pipe.
Refs trac:4550
This commit was SVN r31512.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-23 21:46:41 +00:00
Jeff Squyres
87e6232e67
orterun.c: set an fd to be close-on-exec
...
Make sure the debugger attach fifo is marked as close-on-exec so that
children procs don't inherit it. For example, if you salloc a SLURM
allocation and run "mpirun ..." in there (i.e., mpirun is running on
the head node, and launching on to back-end nodes), the forked srun's
will inherit this fd if it is still open.
Refs trac:4550
This commit was SVN r31499.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-22 21:55:09 +00:00
Jeff Squyres
63b7ef4103
orterun.1in: Document --allow-run-as-root option
...
Add some verbiage about how mpirun now defaults to disallowing running
as root, but you can use the --allow-run-as-root option to override
this default behavior.
Refs trac:4536
This commit was SVN r31477.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-22 14:34:32 +00:00
Jeff Squyres
ea4c916096
plm_slurm_module.c: don't leave the extra fd to /dev/null open
...
Prior to r29058, this same logic was in place (i.e., ensure that the
extra fd to /dev/null is closed). It looks like it was accidentally
removed in the ORTE conversion to the state machine in r29058.
This ''might'' have something to do with many hangs that we're seeing
in Cisco MTT with jobs that exhibit failure (e.g., call MPI_ABORT)...?
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31469.
The following SVN revision numbers were found above:
r29058 --> open-mpi/ompi@a200e4f865
2014-04-21 20:09:15 +00:00
Jeff Squyres
38a27b858d
Protect for the CLEANUP case where tmp hasn't been set yet
...
Refs trac:4536
This commit was SVN r31438.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:34:53 +00:00
Jeff Squyres
482b465c05
Trivial format change: use the same length of lines and \n offsets as
...
opal_show_help().
Refs trac:4536
This commit was SVN r31437.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:14:45 +00:00
Jeff Squyres
530f22c403
proc_info.c: uncomment C99 struct member initialization usage
...
The C99 usage to initialize via struct member names was already there,
but commented out. This commit doesn't fix any known problem; it
simply uncomments the C99 code, because it's safer/better.
This commit was SVN r31425.
2014-04-18 17:26:07 +00:00
Ralph Castain
8594f5d738
Correctly set a non-zero exit status when mpirun is terminated by signal
...
Fixes trac:4537
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31423.
The following Trac tickets were found above:
Ticket 4537 --> https://svn.open-mpi.org/trac/ompi/ticket/4537
2014-04-18 16:39:08 +00:00
Ralph Castain
12094eb7b2
Add some further protections after discussion with Jeff
...
Refs trac:4536
This commit was SVN r31422.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 16:21:55 +00:00
Ralph Castain
8d72633acf
Ensure that the session directory fields of orte_process_info have been initialized prior to cleaning up those directories as part of the initialization process that deals with stale session directory trees.
...
Fixes trac:4534
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31421.
The following Trac tickets were found above:
Ticket 4534 --> https://svn.open-mpi.org/trac/ompi/ticket/4534
2014-04-18 14:25:48 +00:00
Ralph Castain
a368e84e70
Per the RFC, remove the sensor framework from the ORTE code area, relocating it offsite to the ORCM code area. Also update some ignores to ensure we don't pickup crosstalk in components
...
This commit was SVN r31403.
2014-04-15 21:48:24 +00:00
Ralph Castain
bbdbc5f8a8
Per suggestion from George, use a pipe for terminating the thread.
...
Refs trac:4510
This commit was SVN r31381.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-14 01:02:46 +00:00
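Using a pipe to terminate a thread that is blocked in select() can be sketched as below: the thread watches the pipe's read end alongside its normal descriptor, and writing one byte to the pipe wakes it immediately instead of waiting for a timeout. This is a hedged Python illustration, not the ORTE listener code; the names listener and run_and_stop are hypothetical, and a socketpair stands in for the listening socket.

```python
import os
import select
import socket
import threading


def listener(sock, stop_r, received):
    """Service `sock` until a byte arrives on the stop pipe."""
    while True:
        # Block with no timeout; the stop pipe is what wakes us for shutdown.
        readable, _, _ = select.select([sock, stop_r], [], [])
        if stop_r in readable:
            return
        if sock in readable:
            data = sock.recv(1024)
            if not data:
                return
            received.append(data)


def run_and_stop():
    """Start the listener, ask it to stop via the pipe, report whether it did."""
    ours, theirs = socket.socketpair()  # stand-in for a listening socket
    stop_r, stop_w = os.pipe()
    received = []
    thread = threading.Thread(target=listener, args=(theirs, stop_r, received))
    thread.start()
    os.write(stop_w, b"x")  # wake select() right away
    thread.join(timeout=5)
    stopped = not thread.is_alive()
    os.close(stop_r)
    os.close(stop_w)
    ours.close()
    theirs.close()
    return stopped
```

Because select() returns as soon as the pipe becomes readable, the caller can join the thread promptly and does not need a short select timeout just to poll a shutdown flag.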
Ralph Castain
deff85ffc3
Prevent a segfault if we encounter an error while parsing a hostfile. Don't issue an error_log output as the hostfile code already prints an error message
...
Thanks to Tetsuya Mishima for the patch. Reviewed ok by rhc.
RM-approved
cmr=v1.8.1:reviewer=ompi-gk1.8
This commit was SVN r31377.
2014-04-12 21:32:10 +00:00
Ralph Castain
2d8dff837c
Ensure we properly terminate the listening thread prior to exiting, but do so in a way that doesn't make us wait for select to timeout.
...
Refs trac:4510
This commit was SVN r31376.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-12 15:01:24 +00:00
Ralph Castain
9b30b2b783
Shave some time off of mpirun's operation by not waiting for the listener thread to terminate before exiting
...
cmr=v1.8.1:reviewer=rhc
This commit was SVN r31368.
2014-04-11 04:16:28 +00:00
Nathan Hjelm
9df795d1dd
plm/alps: silence annoying warning message when using Cray PMI 3.x or
...
newer
This commit adds a workaround for messages printed by the Cray PMI library
when launching using mpirun. We are still talking with Cray to find a
better fix but this will silence the warnings for now.
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31352.
2014-04-08 21:54:10 +00:00
Dave Goodell
19efa09540
plm/slurm: tweak /dev/null usage ( #4489 )
...
See the ticket for more details.
cmr=v1.8.1:reviewer=rhc:ticket=4489
This commit was SVN r31351.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 21:46:07 +00:00
Ralph Castain
957c9ecf53
Okay, silence the anality by simplifying the already irrelevant code, thus allowing us to turn our attention to things that actually matter
...
Refs trac:4489
This commit was SVN r31348.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 19:51:11 +00:00
Ralph Castain
7c4fa3446c
Per the telecon, revert r31302 for now pending an RFC review on the idea of setting app proc envar's using an MCA param
...
This commit was SVN r31345.
The following SVN revision numbers were found above:
r31302 --> open-mpi/ompi@6a1b78e26b
2014-04-08 15:47:12 +00:00
Ralph Castain
92ca647d3d
Fix copy error in file name
...
This commit was SVN r31344.
2014-04-08 15:31:55 +00:00
Ralph Castain
61d94fcee2
Fix the sequential mapper - it was out-of-sync with the hostfile changes, and we missed the "seq" policy when parsing the --map-by option. Thanks to Bill Chen for reporting it
...
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31333.
2014-04-08 03:38:25 +00:00
Nathan Hjelm
155130fbfc
ras/alps: fix alps support for CLE 5.x
...
Cray moved the apstat command on CLE 5.x to /opt/cray/alps/../bin and
moved a configuration file. This commit adds support for both of these
changes.
cmr=v1.8.1
This commit was SVN r31329.
2014-04-07 22:51:21 +00:00
Jeff Squyres
82e104719a
hwloc/rmaps base: Add missing help message.
...
Also, add missing ORTE_ERROR_LOG in the other case where this error
message is used (i.e., ORTE_ERROR_LOG was used in the one place, so
let's also use it in the other place).
This commit was SVN r31321.
2014-04-07 15:39:54 +00:00
Ralph Castain
8ce98ccc8d
Not sure when this got messed up, but correct the stdout/stderr redirection on the srun command so we don't get all those slurm warnings
...
cmr=v1.8.1:reviewer=dgoodell:subject=silence srun warning output
This commit was SVN r31308.
2014-04-04 04:23:31 +00:00
Ralph Castain
3fdcaeab97
Fix a problem where we need to abort due to a mapping failure, but we are in a managed environment and thus the orteds have not wired up. Thus, if we send the exit message across the routed network, the remote daemons won't have a way to relay the message along - and we won't exit.
...
If we are aborting, then set the flags so the HNP directly sends an exit command to each daemon. Make it the halt_vm command so the remote daemon doesn't try to relay it, but instead just exits without waiting for its routed children to exit first.
cmr=v1.8.1:reviewer=jsquyres:subject=fix hangs due to abort prior to daemon wireup
This commit was SVN r31304.
2014-04-02 04:17:55 +00:00
Mike Dubman
6a1b78e26b
opal: add mca param to control ranks env variables
...
add -mca base_env_list "var1=val1 var2=val2 ..." mca parameter that can be used in mca param files
or with -am app.conf mpirun commandline to set rank env variables with mca mechanism
fixed by Elena, reviewed by Miked
cmr=v1.8.1:reviewer=ompi-rm1.8
This commit was SVN r31302.
2014-04-01 21:14:31 +00:00
Ralph Castain
9d2f5f6b1f
Silence warning
...
cmr=v1.8:reviewer=ompi-gk1.8
This commit was SVN r31294.
2014-03-29 19:10:26 +00:00
Ralph Castain
70ee3fb000
Ensure that orted's are not bound to single processors if the TaskAffinity option is set by default. Thanks to Artem Polyakov for the patch, and for his patience in explaining the situation.
...
Reviewed with Moe Jette to ensure this was correct, and confirmed by me.
RM-approved
cmr=v1.8:reviewer=ompi-gk1.8
This commit was SVN r31288.
2014-03-29 18:30:38 +00:00
Jeff Squyres
173c046617
build: add Automake-like silent/verbose macros for "ln -s ..." operations
...
Also, since I put some of the macros for these silent/verbose rules up
in the top-level Makefile.man-page-rules file, I renamed it to
Makefile.ompi-rules.
I've had this sitting around for a while; now seems like as good a
time as any to commit it.
This commit was SVN r31271.
2014-03-28 18:24:32 +00:00
Ralph Castain
714cb8f573
Silence warnings
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31248.
2014-03-27 14:16:54 +00:00
Ralph Castain
390645ac2a
Per patch from Tetsuya Mishima, do a nicer job of warning the user that we need to map to a higher level to get the number of requested cpus/rank. Also, change the mapping policy to "byslot" when falling back to that option.
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31196.
2014-03-24 15:47:29 +00:00
Ralph Castain
bd9bd2ff16
Be consistent in our handling of the "only HNP in allocation" case when setting up the VM. Thanks to Tetsuya Mishima for the suggestion.
...
cmr=v1.8:reviewer=rhc
This commit was SVN r31195.
2014-03-24 15:28:09 +00:00
Dave Goodell
5f3b81e291
oob: delete events when destroying a peer
...
Without this patch running ring_c with the usnic BTL under valgrind will
cause the orteds to segfault.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Ralph Castain <rhc@open-mpi.org>
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31161.
2014-03-19 22:15:49 +00:00
Ralph Castain
d17f811ff5
Surrender to the tyranny of C++ and give up on enum for node states, as nice as that would be, in favor of retaining memory footprint constraints.
...
This commit was SVN r31149.
2014-03-19 16:15:24 +00:00
Ralph Castain
f7df960198
Silence warning
...
This commit was SVN r31139.
2014-03-18 23:15:29 +00:00
Jeff Squyres
3da579139b
More corrections w.r.t. process groups
...
To accompany r31092 and r310924, also ensure to create a new process
group in the child right after the orted forks. Add trivial configury
to ensure that we have setpgid, and only do the setpgid/getpgid if we
have setpgid.
Without this commit, killing the entire process group can do
unexpected things (e.g., kill the orted, mpirun, and even mpirun's
parent!).
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31132.
The following SVN revision numbers were found above:
r31092 --> open-mpi/ompi@99c9ecaed0
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r310924
2014-03-18 21:31:01 +00:00
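The process-group behaviour discussed in the entry above can be illustrated with a short Python sketch: the child is placed in a new process group right after the fork, so that a signal sent to that group reaches the child and its descendants but not the launcher or the launcher's parent. This is an assumption-laden illustration, not the orted code; spawn_in_own_group and signal_group are hypothetical names.

```python
import os
import signal
import subprocess
import sys


def spawn_in_own_group(argv):
    """Launch `argv` with the child made leader of a new process group."""
    # preexec_fn runs in the child after fork() and before exec().
    return subprocess.Popen(argv, preexec_fn=lambda: os.setpgid(0, 0))


def signal_group(pid, signum):
    """Send `signum` to every process in the process group of `pid`."""
    os.killpg(os.getpgid(pid), signum)


def demo():
    """Spawn a sleeper in its own group, signal the group, return the exit code."""
    child = spawn_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    separate = os.getpgid(child.pid) != os.getpgrp()
    signal_group(child.pid, signal.SIGTERM)
    return separate, child.wait(timeout=10)
```

If the child stayed in the launcher's group, signalling the group would also hit the launcher itself, which matches the unexpected kills the commit message warns about.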
Ralph Castain
518ba55cf4
Ensure MPIEXEC_TIMEOUT calls the correct state to exit
...
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r31125.
2014-03-18 20:12:02 +00:00
Ralph Castain
554da83865
Set the locality for remote procs even after a comm_spawn. Ensure we store our own local cpuset upon launch so it will be shared during comm_join.
...
This provides full locality - i.e., not just node-level, but all the way down to whatever common binding level exists between the procs.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31106.
2014-03-18 14:51:07 +00:00
Ralph Castain
0aa23cdc35
Cleanup copy/paste errors to ensure we progress the launch
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31102.
2014-03-18 01:24:49 +00:00
Ralph Castain
545ac7dc58
Remove the job_control_forwarding logic as we want *any* signal to go to all members of the process group
...
Refs trac:4404
This commit was SVN r31094.
The following Trac tickets were found above:
Ticket 4404 --> https://svn.open-mpi.org/trac/ompi/ticket/4404
2014-03-17 22:45:33 +00:00
Ralph Castain
5a868028a8
Revert r31091 - the functionality didn't disappear, but moved into the MPI layer :-(
...
This commit was SVN r31093.
The following SVN revision numbers were found above:
r31091 --> open-mpi/ompi@edf680855e
2014-03-17 22:30:03 +00:00
Ralph Castain
99c9ecaed0
Ensure that we send the specified signal to the entire process group of each member of the pid provided to us. This ensures that any children spawned by our children also see the signal
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31092.
2014-03-17 22:12:15 +00:00
Ralph Castain
edf680855e
Restore locality computation to the nidmap code - don't know how/when it was removed, but that was not good
...
cmr=v1.7.5:reviewer=hjelmn
This commit was SVN r31091.
2014-03-17 21:59:25 +00:00
Ralph Castain
45196d222b
Minor cleanup of the node state definitions - using the enum allows the debuggers to pretty-print the value
...
This commit was SVN r31090.
2014-03-17 21:27:58 +00:00
Ralph Castain
796dfe5ada
Do a little cleanup - only resusage needs the node/proc info, so remove it from the sensor base
...
This commit was SVN r31089.
2014-03-17 21:26:46 +00:00
Ralph Castain
7bb8dbade6
Extend the regular expression parsing support
...
This commit was SVN r31088.
2014-03-17 21:25:05 +00:00
Ralph Castain
0257d32eeb
There is no OOB component object - it is a simple struct with an opal_list_item_t element at the beginning
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31087.
2014-03-17 21:23:59 +00:00
Ralph Castain
38e02890aa
ORTE doesn't care about cxx flags
...
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31086.
2014-03-17 21:21:54 +00:00
Ralph Castain
f259d50ed7
Fully fix the PMI2 warning - turned out to be larger than originally thought due to the way the function was being handled across multiple files. Properly resolve the problem by not compiling the file if PMI2 is not desired, and then appropriately setting the visibility of the function within the module
...
Refs trac:4400
This commit was SVN r31084.
The following Trac tickets were found above:
Ticket 4400 --> https://svn.open-mpi.org/trac/ompi/ticket/4400
2014-03-17 17:36:37 +00:00
Ralph Castain
e152449be4
Silence warning
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r31083.
2014-03-17 17:05:24 +00:00
Ralph Castain
b248b27637
Remove a check that prevented mpirun from exiting when it should in the single-node case
...
Refs trac:4393
This commit was SVN r31080.
The following Trac tickets were found above:
Ticket 4393 --> https://svn.open-mpi.org/trac/ompi/ticket/4393
2014-03-15 15:25:44 +00:00
Ralph Castain
fbc5e3b773
Deal with the corner case where we encounter an error when attempting to launch a daemon. In this case, we will order abnormal termination before daemons callback to us, and thus any attempt to send them a "die" message will fail. Ensure that mpirun at least exits cleanly in this scenario, thereby allowing the remote daemons that did get launched to commit suicide when comm fails.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31068.
2014-03-14 15:32:30 +00:00
Ralph Castain
2abed09d7c
Continue to resolve priority issues. Cleanup the case of forced termination in mpirun during launch processing by ensuring we can respond to socket closures, and ensuring that the remote daemons correctly close their sockets when terminating.
...
Jeff: please test a variety of conditions to ensure we get this right
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31058.
2014-03-13 04:02:24 +00:00
Ralph Castain
ac421c931d
The random number generator changes were incomplete (typo errors) in some places, and were missing the required declspec's for visibility.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31053.
2014-03-12 22:37:27 +00:00
Ralph Castain
f56f37d364
Shifting to an event-driven RTE raises some interesting issues during shutdown. We want the last messages to get thru, but also need to correctly shutdown the virtual machine. This requires a delicate balancing act across event priorities, and the need to check for termination conditions in places where related events get processed.
...
Change the priority of comm_failure and job_termination events to ensure we process final messages prior to terminating. Check for termination conditions when processing proc termination events as we may order proc termination when the daemon gets an exit command, but we can't see the proc actually terminate until we get out of that message event.
Jeff: probably easiest to review this by testing. I tested it under both Slurm and rsh on v1.7.5 as well as trunk
cmr=v1.7.5:reviewer=jsquyres:subject=resolve event priorities during VM shutdown
This commit was SVN r31042.
2014-03-12 16:49:58 +00:00
Ralph Castain
a254d2db34
Silence warning when CR is not enabled
...
This commit was SVN r31025.
2014-03-12 13:47:03 +00:00
Adrian Reber
4512b3375e
OOB/TCP: wire up the existing ft_event() function
...
This commit was SVN r31022.
2014-03-12 12:47:20 +00:00
Adrian Reber
34625b360b
use the newly created JOB_STATE_FT_* events
...
This commit was SVN r31021.
2014-03-12 12:37:14 +00:00
Adrian Reber
8d40cd53ae
use the existing pretty-print function for information about the job state
...
This commit was SVN r31020.
2014-03-12 12:34:25 +00:00
Ralph Castain
7869402f5f
Sigh - looks like I did too good a job of turning things off. Back some of it out in favor of trying again when more time is available
...
Refs trac:4368
This commit was SVN r31017.
The following Trac tickets were found above:
Ticket 4368 --> https://svn.open-mpi.org/trac/ompi/ticket/4368
2014-03-12 02:10:35 +00:00
Ralph Castain
dc28015bcb
Something funny is going on when --without-orte, so revert the orte/Makefile.am for now while we try to figure it out
...
Refs trac:4368
This commit was SVN r31011.
The following Trac tickets were found above:
Ticket 4368 --> https://svn.open-mpi.org/trac/ompi/ticket/4368
2014-03-11 23:07:21 +00:00
Ralph Castain
9c66c4f439
Correctly implement --disable-oshmem and --without-orte so we don't build the disabled section of code. Fix a bunch of code rot in the PMI rte component, and add several missing headers when building --without-orte.
...
NOTE: I transferred the oshmem-disabled-by-default from the 1.7 branch to the trunk to minimize future disruption if/when we change that option.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31006.
2014-03-11 22:02:40 +00:00
Adrian Reber
49173ccd61
add debug output for the ft_event handler
...
This commit was SVN r30990.
2014-03-11 15:39:16 +00:00
Adrian Reber
7304b700e1
Fix the newly added FT event state when compiling --with-ft
...
This commit was SVN r30988.
2014-03-11 13:20:08 +00:00
Ralph Castain
8e080fb95e
Need a slightly different header
...
This commit was SVN r30986.
2014-03-11 03:03:12 +00:00
Ralph Castain
2cd1cfc7fe
Remove this ignore for now
...
This commit was SVN r30985.
2014-03-11 03:02:13 +00:00
Ralph Castain
103a5c6df1
Output the bindings if ess verbosity is high enough
...
Refs trac:4356
This commit was SVN r30982.
The following Trac tickets were found above:
Ticket 4356 --> https://svn.open-mpi.org/trac/ompi/ticket/4356
2014-03-11 01:21:14 +00:00
Ralph Castain
176b326c27
Add a comment to make Jeff happier...
...
Refs trac:4340
This commit was SVN r30980.
The following Trac tickets were found above:
Ticket 4340 --> https://svn.open-mpi.org/trac/ompi/ticket/4340
2014-03-10 23:02:04 +00:00
Ralph Castain
081669b440
When pretty-printing binding info, we need to pass the topology down to the routine as the mapper isn't always working with the local topology - otherwise, we get an erroneous help message. Thanks to Tetsuya Mishima for reporting it
...
cmr=v1.7.5:reviewer=rhc:subject=fix pretty-print of bindings
This commit was SVN r30968.
2014-03-10 15:53:07 +00:00
Adrian Reber
b51733c456
fix "warning: 'sstore_stage_select' defined but not used"
...
In the function sstore_stage_select() the local variables
were set up and defined. Unfortunately this function was
never called. This patch moves variable set up to the
sstore_stage_register() function and checks the return
values of the variable initialization.
This commit was SVN r30958.
2014-03-06 16:53:27 +00:00
Ralph Castain
7a44af375c
Add an FT event state and set the state machine to callback to the OOB base ft event when activated
...
This commit was SVN r30950.
2014-03-06 02:44:29 +00:00
Ralph Castain
9793909988
Correct the constant we check for an error. Thanks to George for noticing it.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30949.
2014-03-06 02:21:27 +00:00
Ralph Castain
fc2dd6ac48
Per Jeff's request, add a more detailed comment as to why we are turning off the warning at this time.
...
Refs trac:4339
This commit was SVN r30948.
The following Trac tickets were found above:
Ticket 4339 --> https://svn.open-mpi.org/trac/ompi/ticket/4339
2014-03-06 02:17:25 +00:00
Ralph Castain
c9465d97b4
Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command, reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test.
...
Thanks to Paul Kapinos for reporting the problem.
cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition
This commit was SVN r30942.
2014-03-05 04:38:17 +00:00
Ralph Castain
a2b539c763
Per the telecon, silence the warning for 1.7.5 to give us time to consider a better permanent solution
...
Refs trac:4339
This commit was SVN r30941.
The following Trac tickets were found above:
Ticket 4339 --> https://svn.open-mpi.org/trac/ompi/ticket/4339
2014-03-05 03:02:29 +00:00
Ralph Castain
50c30d62ca
Repair builds without hwloc
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30940.
2014-03-05 02:48:15 +00:00
Adrian Reber
e5bef82ee1
OPAL_ENABLE_FT_CR: remove compiler warnings
...
When compiling --with-ft there are a few compiler warnings about
unused variables. This patch fixes those compiler warnings.
This commit was SVN r30927.
2014-03-04 15:28:07 +00:00
Ralph Castain
da4cb39683
If we can't find a route to communicate, emit an error message rather than just exiting with a non-zero status
...
cmr=v1.7.5:reviewer=jsquyres:subject=print error if cannot communicate
This commit was SVN r30922.
2014-03-04 04:57:53 +00:00
Ralph Castain
0ac97761cc
Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named.
...
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= parameter. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.
In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.
Also clean up a couple of issues in the mapping/binding system:
* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed
* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch
* minor cleanup to the warning message when oversubscribed and binding was requested
cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system
This commit was SVN r30909.
2014-03-03 16:46:37 +00:00
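The new hostfile default described in r30909 above can be sketched as follows. This is a minimal Python illustration, not ORTE code: resolve_slots and its arguments are hypothetical names, and the per-node core and hwthread counts are passed in directly instead of being sensed from the hardware.

```python
def resolve_slots(hostfile_entries, cores, hwthreads, use_hwthreads_as_cpus=False):
    """Return {host: slots} for a list of (host, slots_or_None) hostfile entries.

    An explicit slots= value is always kept. Only entries without one
    fall back to the sensed core count (or hwthread count when
    use_hwthreads_as_cpus is set).
    """
    resolved = {}
    for host, slots in hostfile_entries:
        if slots is None:
            # No slots= given: use the sensed hardware instead of slots=1.
            slots = hwthreads[host] if use_hwthreads_as_cpus else cores[host]
        resolved[host] = slots
    return resolved
```

For example, a hostfile naming node0 without slots= and node1 with slots=2 yields the core count for node0 and leaves node1 at 2.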
Ralph Castain
88b0e0cc6d
Allow the user to turn off the oversubscribed-binding warning if overload-allowed has been provided
...
Refs trac:4317
This commit was SVN r30892.
The following Trac tickets were found above:
Ticket 4317 --> https://svn.open-mpi.org/trac/ompi/ticket/4317
2014-02-28 17:55:53 +00:00
Ralph Castain
4a645f0342
Add detection of oversubscription with binding requested - if binding requested to core or hwt, warn and do not bind or else we will hurt performance. Also, if no binding directive was given, turn off the default binding
...
Refs trac:4317
This commit was SVN r30888.
The following Trac tickets were found above:
Ticket 4317 --> https://svn.open-mpi.org/trac/ompi/ticket/4317
2014-02-28 16:08:52 +00:00
Ralph Castain
8500247c7b
Fix the by-obj mapper in the case where slots are not specified, and so we are in a perpetual oversubscribed state
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30887.
2014-02-28 05:21:46 +00:00
Ralph Castain
a4c3d0a5a0
Add some more debug to the by-obj mapper
...
This commit was SVN r30884.
2014-02-28 02:52:53 +00:00
Ralph Castain
d109c523b9
Per patch from Tetsuya Mishima, complete the overhaul of the round-robin mappers
...
Refs trac:4296
This commit was SVN r30861.
The following Trac tickets were found above:
Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296
2014-02-27 00:43:53 +00:00
Ralph Castain
61a21e4f31
Based on Tetsuya's patch, with some changes, correct the case of map-by node where multiple cpus/rank are requested and result in a non-integer match with num slots. Also correct tests for binding policy given to use the proper macro.
...
Refs trac:4296
This commit was SVN r30857.
The following Trac tickets were found above:
Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296
2014-02-26 18:12:23 +00:00
Ralph Castain
b880aa46bd
Update the map-by obj and map-by obj:span mappers to correct for errors in computing carryover across the nodes. Be a little less complex in the algorithm so it is easier to follow and debug.
...
Refs trac:4296
This commit was SVN r30826.
The following Trac tickets were found above:
Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296
2014-02-25 23:32:43 +00:00
Joshua Ladd
9ea9bec4ad
Addressing Jeff's comments:
...
1. Changed rng_buff_t --> opal_rng_buff_t
2. All global variables obey the prefix rule
3. Old code has been removed
4. Found a couple of unnecessary includes
Refs trac:4298
This commit was SVN r30807.
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
2014-02-24 23:18:35 +00:00
Joshua Ladd
e39d9f4080
Per the RFC schedule, add an additive lagged Fibonacci parallel random number generator to OPAL. To use it, include the header opal/util/alfg.h in your code. See ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c for an example of how to seed with opal_srand and invoke the generator with opal_rand. This should be added to
...
cmr=v1.7.5:reviewer=rhc:subject=Add an OPAL RNG
This commit was SVN r30801.
2014-02-23 21:41:38 +00:00
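For readers unfamiliar with the generator type named in r30801 above, an additive lagged Fibonacci generator computes x[n] = (x[n-j] + x[n-k]) mod 2^m for fixed lags 0 < j < k. The Python sketch below shows only that recurrence. The class name, the lags (24, 55), the 32-bit word size, and the LCG used to fill the initial state are all assumptions made for illustration, and it does not mirror the opal_srand/opal_rand interface.

```python
from collections import deque

class LaggedFibonacci:
    """Additive lagged Fibonacci generator (illustrative parameters only)."""

    def __init__(self, seed, j=24, k=55, bits=32):
        if not 0 < j < k:
            raise ValueError("lags must satisfy 0 < j < k")
        self.j = j
        self.mask = (1 << bits) - 1
        state = []
        x = seed & self.mask
        for _ in range(k):
            # Simple LCG, used here only to fill the initial k words.
            x = (1664525 * x + 1013904223) & self.mask
            state.append(x)
        state[0] |= 1  # make sure at least one initial word is odd
        # The deque always holds the last k outputs, oldest first.
        self.state = deque(state, maxlen=k)

    def next(self):
        # state[-j] is x[n-j]; state[0] is x[n-k].
        value = (self.state[-self.j] + self.state[0]) & self.mask
        self.state.append(value)  # maxlen drops the oldest word
        return value
```

Two generators built from the same seed produce the same sequence, which is the property that matters when a run has to be reproduced.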
Ralph Castain
c8112c1086
Load balancing across nodes (i.e., map-by node) wasn't working correctly - the algorithm relied on the nodes being defined in descending order of slots, or the number of slots remaining to be assigned being only one/node.
...
Regardless, it didn't work for the case where nodes were defined in ascending order of slots.
Tetsuya's proposed patch didn't solve the problem for me, but it did correct the case where cpus/proc > 1. The final patch requires that we loop over the assignment algo until all procs are assigned or all nodes are filled - any remaining procs are then handled in the cleanup loop.
cmr=v1.7.5:reviewer=rhc:subject=fix map-by node for different cases
This commit was SVN r30798.
2014-02-22 16:39:41 +00:00
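The loop structure described in r30798 above (repeat the assignment pass until all procs are placed or all nodes are full, then hand leftovers to a cleanup loop) can be sketched like this. It is a simplified Python model, not the ORTE mapper: map_by_node is a hypothetical name, each pass places at most one proc per node, and the cleanup loop simply oversubscribes round-robin, which is an assumption about what "handled" means.

```python
def map_by_node(slots, nprocs):
    """Return the number of procs placed on each node.

    slots lists the slot count of each node in the order the nodes
    were defined (ascending, descending, or mixed).
    """
    assigned = [0] * len(slots)
    remaining = nprocs

    # Main loop: sweep the nodes repeatedly, one proc per node per
    # sweep, until every proc is placed or every node is full.
    while remaining > 0:
        placed = False
        for i, capacity in enumerate(slots):
            if remaining == 0:
                break
            if assigned[i] < capacity:
                assigned[i] += 1
                remaining -= 1
                placed = True
        if not placed:
            break  # all nodes are full

    # Cleanup loop: any procs still unplaced oversubscribe the nodes.
    i = 0
    while remaining > 0 and slots:
        assigned[i % len(slots)] += 1
        remaining -= 1
        i += 1
    return assigned
```

Because the sweep never depends on node ordering, ascending and descending slot lists are treated alike.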
Adrian Reber
f17ec1ab10
ESS/BASE: orte-restart needs sstore
...
Running orte-restart requires an initialized sstore.
This opens the sstore component for FT builds just like
the snapc component.
This commit was SVN r30796.
2014-02-21 21:23:26 +00:00
Ralph Castain
0319d5fb19
Seeing some errors coming out of MTT on this component, so turn it off for now and will debug later
...
This commit was SVN r30789.
2014-02-21 16:31:52 +00:00
Mike Dubman
8d4592a94b
rmaps/mindist: better error message
...
better error message when there is only one socket available
fixed by Elena, reviewed by Miked
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30787.
2014-02-21 11:38:35 +00:00
Ralph Castain
5520d6971b
We do have to track the origin of messages sent over usock as the daemon does route them back down, and we need to get the "sender" info correct. Also do a better job of dealing with simultaneous connections to avoid binding to a used socket.
...
Refs trac:4280
This commit was SVN r30781.
The following Trac tickets were found above:
Ticket 4280 --> https://svn.open-mpi.org/trac/ompi/ticket/4280
2014-02-20 17:27:05 +00:00
Ralph Castain
63803f5e61
Fix the leader data for PMI direct-launch as well
...
This commit was SVN r30778.
2014-02-20 01:41:19 +00:00
Ralph Castain
418ca60776
Since we don't know the name of the local leader, store that info under our own name :-)
...
This commit was SVN r30777.
2014-02-20 01:39:52 +00:00
Ralph Castain
262c927778
Define a new key and store the process name of the local_rank=0 process on each node so that the MPI layer can retrieve it as desired.
...
This commit was SVN r30759.
2014-02-18 00:32:58 +00:00
Adrian Reber
6b45d475e9
Fix compiler warnings when compiling with --with-ft
...
With enabled fault tolerance code different functions
are selected during compilation. Most of the ft
code is #ifdef'd out. This #ifdef's more code out
so that compiler warnings like
warning: unused variable 'item' [-Wunused-variable]
opal_list_item_t *item;
are removed.
This commit was SVN r30747.
2014-02-17 10:53:44 +00:00
Ralph Castain
c3df744a3b
Shift the orte_db_localrank key to the opal level. Add the job and proc-level session directory names to the database using opal_db keys.
...
This commit was SVN r30746.
2014-02-17 01:40:56 +00:00
Ralph Castain
ea0217c337
Remove unused file and minimize the usock uri contribution (add explanation as to why)
...
Refs trac:4280
This commit was SVN r30744.
The following Trac tickets were found above:
Ticket 4280 --> https://svn.open-mpi.org/trac/ompi/ticket/4280
2014-02-16 22:37:30 +00:00
Ralph Castain
a91d358c48
Add/modify a couple of tests
...
This commit was SVN r30743.
2014-02-16 20:54:34 +00:00
Ralph Castain
d42f4be8a4
Add unix socket component to OOB - no longer require active network for local operations. Demonstrate inter-transport crossover.
...
VERY tentatively schedule this for 1.7.5 - only to be applied if we see no troubles AND the branch is ready in advance.
cmr=v1.7.5:reviewer=rhc:subject=Add unix socket component to OOB
This commit was SVN r30742.
2014-02-16 20:54:12 +00:00
Ralph Castain
14bb7a117c
Fix bugs in the oob base - ensure we get the components in high-to-low priority, and that we correctly track reachability via all components. Adjust the priority of the tcp component to leave headroom for others
...
Refs trac:267
This commit was SVN r30740.
The following Trac tickets were found above:
Ticket 267 --> https://svn.open-mpi.org/trac/ompi/ticket/267
2014-02-16 03:19:08 +00:00
Ralph Castain
509d5d82b0
Add some verbiage requested by Jeff, change the param level to something...?
...
Refs trac:4275
This commit was SVN r30736.
The following Trac tickets were found above:
Ticket 4275 --> https://svn.open-mpi.org/trac/ompi/ticket/4275
2014-02-15 15:11:05 +00:00
Ralph Castain
3f9db36e0d
Make Jeff smile - pretty-up the indentation
...
Refs trac:4267
This commit was SVN r30733.
The following Trac tickets were found above:
Ticket 4267 --> https://svn.open-mpi.org/trac/ompi/ticket/4267
2014-02-14 23:25:48 +00:00
Ralph Castain
91f90058ce
Add missing options and cleanup the code a bit. Default to by-slot ranking if a non-hardware option isn't given. Thanks to Tetsuya Mishima for the assist.
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r30725.
2014-02-14 10:23:16 +00:00
Ralph Castain
fd9b301a8b
Check equality instead of bit-mask - thanks to Tetsuya Mishima for reporting it
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r30722.
2014-02-14 02:34:42 +00:00
Ralph Castain
4e1c07cbf2
If we are given a TCP oob address that doesn't match any active module, it is still possible that we could route to the address if a router is in the system. No harm in trying, so arbitrarily pick the first connection in the active module list and assign the peer to it. If that module can't reach it, we'll follow the usual failover mechanism until finally concluding that nobody can get there.
...
cmr=v1.7.5:reviewer=jsquyres:subject=handle non-matching addresses
This commit was SVN r30719.
2014-02-13 23:37:22 +00:00
Ralph Castain
449cd8f3d7
Update a couple of fields, add a scheduler field to proc_info
...
This commit was SVN r30718.
2014-02-13 23:30:04 +00:00
Ralph Castain
fc6101b508
Handle "localhost" better
...
Refs trac:4263
This commit was SVN r30702.
The following Trac tickets were found above:
Ticket 4263 --> https://svn.open-mpi.org/trac/ompi/ticket/4263
2014-02-12 20:30:39 +00:00
Ralph Castain
a8a9801a0b
Ensure an orted exits with non-zero status if it is unable to send a message. Add more diagnostic messages to the OOB set_addr code
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30701.
2014-02-12 19:44:01 +00:00
Ralph Castain
1473dde6ea
Okay, once again caught by the blasted hwloc inability to cleanly handle caches. Protect the calls to get_depth by first checking to see if it is a "cache", then use a cache-specific function to get the stupid data. Very, very irritating.
...
cmr=v1.7.5:reviewer=jsquyres:subject=treat caches as something different yet again
This commit was SVN r30693.
2014-02-12 01:45:06 +00:00
Ralph Castain
1565816988
Do a little better job of cleaning up the session directory left by mpirun by ensuring we delete the event associated with debugger attachment and unlinking the pipe used for that purpose. Also, we no longer leave "abort" files around, so remove that check when deleting session directory trees
...
cmr=v1.7.5:reviewer=jsquyres:subject=cleanup session directories better
This commit was SVN r30689.
2014-02-11 22:16:17 +00:00
Ralph Castain
fa7b686ccc
Provide better messages when we don't find any included interfaces, and/or don't find any interfaces for use by OOB.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30675.
2014-02-11 19:29:03 +00:00
Ralph Castain
b566cd5e30
Protect against no modifiers
...
Refs trac:4117
This commit was SVN r30672.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 17:34:37 +00:00
Ralph Castain
6fa34407bf
Handle modifiers to the --map-by dist option
...
Refs trac:4117
This commit was SVN r30671.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 17:19:05 +00:00
Ralph Castain
4781ea71b6
Correct the handling of various map/bind combinations when pe=N is given. Thanks to Elena Elkina for reporting it.
...
Refs trac:4117
This commit was SVN r30663.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 03:05:26 +00:00
Ralph Castain
707e51d786
Check for --cpus-per-proc earlier, before the correct option can be processed. Thanks to Tetsuya Mishima for reporting it.
...
Refs trac:4117
This commit was SVN r30662.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 02:53:53 +00:00
Ralph Castain
d66d2f5fb3
It is just fine to map by node or slot and bind, so ensure the switch statement includes those options. Thanks to Tetsuya Mishima for pointing it out.
...
Refs trac:4240
This commit was SVN r30661.
The following Trac tickets were found above:
Ticket 4240 --> https://svn.open-mpi.org/trac/ompi/ticket/4240
2014-02-11 02:52:01 +00:00
Ralph Castain
a49e0db8dd
We haven't supported a c++ wrapper for ORTE in quite some time
...
cmr=v1.7.5:reviewer=ompi-gk1.7:subject=remove c++ cruft
This commit was SVN r30653.
2014-02-10 17:16:30 +00:00
Ralph Castain
1a12325094
Rats - need to include bydist in the mapping list
...
Refs trac:4117
This commit was SVN r30649.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-09 16:17:05 +00:00
Ralph Castain
0dc5f50d27
Add a plm component for local-only operation that doesn't require rsh/ssh to be installed. Requested by Fedora packagers for testing purposes.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Add a plm component for local-only operation
This commit was SVN r30645.
2014-02-09 15:53:10 +00:00
Ralph Castain
ca0c806662
Resolve the problem of binding in inverted topologies - check the relative depth of the map and bind objects in the topology, and let that determine whether we bind downward or upwards.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Resolve the problem of binding in inverted topologies
This commit was SVN r30643.
2014-02-09 05:30:17 +00:00
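The depth comparison mentioned in r30643 above can be illustrated with a few lines of Python. This is a sketch under stated assumptions: bind_direction is a hypothetical helper, depths follow the hwloc convention that the root is depth 0 and deeper objects have larger depths, and the "same-level" case is added here for completeness and is not described in the commit message.

```python
def bind_direction(map_depth, bind_depth):
    """Pick a binding direction from the depths of the map and bind objects."""
    if bind_depth > map_depth:
        # Bind object sits below the map object (e.g. map by socket, bind to core).
        return "downward"
    if bind_depth < map_depth:
        # Inverted case: the bind object sits above the map object.
        return "upward"
    return "same-level"
```

The depth values passed in are whatever the topology reports for the two object types, so the same directive can resolve differently on machines with different layouts.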
Ralph Castain
0ee38353ba
In case there are stale session directories around, do a purge of the relevant session directory tree when an orted, HNP, or singleton starts. This won't help in the case of direct-launched apps, but it's the best we can do.
...
cmr=v1.7.5:reviewer=jsquyres:subject=purge stale session dirs at startup
This commit was SVN r30642.
2014-02-09 02:10:31 +00:00
Ralph Castain
1d8c061687
Fix a race condition that could result in assert failures during finalize. Ensure we shutdown the orte progress thread prior to finalizing the rml/oob frameworks so that no async operations are executing during destruct of the base-level lists and objects.
...
cmr=v1.7.5:reviewer=jsquyres:subject=fix race condition in finalize
This commit was SVN r30641.
2014-02-08 22:04:19 +00:00
Ralph Castain
5b8e1180cf
Update a test
...
This commit was SVN r30640.
2014-02-08 22:00:12 +00:00
Ralph Castain
a94920276d
Fix singleton MPI_Abort. Singletons no longer immediately start an HNP, but only launch one when they need it for comm_spawn. So there isn't anyone to send the "abort" report to, and thus we just exit after emitting our message.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Fix singleton MPI_Abort
This commit was SVN r30635.
2014-02-08 18:15:07 +00:00
Ralph Castain
bc7cc09749
After a lot of pain, I've managed to resolve the problem of conflicting mapping directives caused by mismatched MCA params - i.e., where someone has one variant of an MCA param (e.g., rmaps_base_mapping_policy) in their default MCA param file, and then specifies another variant (e.g., --npernode) on the command line. I can't fully resolve the problem as there is no way to know precisely what the user meant - we can only guess which param was really intended since the MCA param system
...
can't apply its normal precedence rules.
So...print a big "deprecated" warning for the old params and error out if a conflict is detected. I know that isn't what people really wanted, but it's the best we can do. If only the old style param is given, then process it after the warning.
Extend the current map-by param to add support for ppr and cpus-per-proc, adding the latter to the list of allowed modifiers using "pe=n" for processing elements/proc. Thus, you can map-by socket:pe=2,oversubscribe to map by socket, binding 2 processing elements/process, with oversubscription allowed. Or you can map-by ppr:2:socket:pe=4 to map two processes to every socket in the allocation, binding each process to 4 processing elements.
For those wondering, a processing element is defined as a hwthread if --use-hwthreads-as-cpus is given, or else as a core.
Refs trac:4117
This commit was SVN r30620.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-07 21:25:40 +00:00
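The two map-by strings quoted in r30620 above can be taken apart with a small parser. This is a minimal Python sketch, not ORTE's parser: parse_map_by and the layout of the returned dictionary are hypothetical, and only the forms shown in the commit message (policy[:modifiers] and ppr:N:object[:modifiers]) are assumed to be valid.

```python
def parse_map_by(directive):
    """Split a map-by string into policy, ppr details, and modifiers."""
    fields = directive.split(":")
    result = {"policy": fields[0], "ppr": None, "pe": None, "flags": []}
    rest = fields[1:]

    if result["policy"] == "ppr":
        # ppr:<count>:<object>[:modifiers]
        if len(rest) < 2:
            raise ValueError("ppr needs a count and an object")
        result["ppr"] = (int(rest[0]), rest[1])
        rest = rest[2:]

    if len(rest) > 1:
        raise ValueError("unexpected extra fields: %r" % (rest[1:],))

    if rest:
        # Modifiers are comma-separated; pe=N carries a value, others are flags.
        for modifier in rest[0].split(","):
            if modifier.startswith("pe="):
                result["pe"] = int(modifier[3:])
            else:
                result["flags"].append(modifier)
    return result
```

With this sketch, "socket:pe=2,oversubscribe" gives policy socket with pe 2 and the oversubscribe flag, and "ppr:2:socket:pe=4" gives two procs per socket with pe 4.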
Ralph Castain
c617d66d98
Paul Hargrove has pointed out that some big SMP systems (e.g., from SGI) configure Torque differently - instead of listing each node name once/slot in the nodefile, they list the node only once and set an envar to indicate the number of procs/node being allocated. Add an MCA param users can set to indicate we are in such an environment, and then use the envar to set the slots. Error out if the mode flag is given, but (a) we don't find the PBS_PPN envar, or (b) we find a node actually listed more than once in the PBS_Nodefile.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Support SMP mode in Torque
This commit was SVN r30568.
2014-02-05 15:51:17 +00:00
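The two Torque nodefile layouts contrasted in r30568 above, and the two error conditions, can be modelled as follows. This is a Python sketch with hypothetical names (torque_slots, smp_mode), and the MCA param that enables the mode is not named in the commit message, so a plain boolean stands in for it. PBS_PPN is the envar named in the message.

```python
from collections import Counter

def torque_slots(nodefile_lines, smp_mode=False, environ=None):
    """Return {node: slots} from the lines of a PBS nodefile."""
    environ = environ or {}
    names = [line.strip() for line in nodefile_lines if line.strip()]
    counts = Counter(names)

    if not smp_mode:
        # Usual layout: each node is listed once per slot.
        return dict(counts)

    # SMP layout: each node is listed once and PBS_PPN gives procs/node.
    if "PBS_PPN" not in environ:
        raise RuntimeError("SMP mode requested but PBS_PPN is not set")
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise RuntimeError("SMP mode requested but nodes are listed more than once: "
                           + ", ".join(duplicates))
    ppn = int(environ["PBS_PPN"])
    return {name: ppn for name in counts}
```

Passing the environment as an argument keeps the sketch easy to test. A caller reading the real environment would pass os.environ.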
Ralph Castain
1326ed704f
Per the RFC discussed here:
...
http://www.open-mpi.org/community/lists/devel/2014/01/13789.php
add support for async modex when requested.
cmr=v1.7.5:reviewer=jsquyres:subject=Add async modex support
This commit was SVN r30565.
2014-02-05 14:39:27 +00:00
Ralph Castain
230336b6a8
Upgrade the security framework to avoid multiple hits against the global security server. Add support for future case where mpirun assigns a global security credential for a given run, though we need to work out how to handle connect-accept from other mpirun's in that case. Remove a bunch of duplicate code in the OOB by consolidating the connection handshake code.
...
Refs trac:4221
This commit was SVN r30554.
The following Trac tickets were found above:
Ticket 4221 --> https://svn.open-mpi.org/trac/ompi/ticket/4221
2014-02-04 14:47:04 +00:00
Adrian Reber
fde1040d2f
Use unique collective ids for the checkpoint/restart code
...
This commit was SVN r30552.
2014-02-04 14:03:05 +00:00
Ralph Castain
5980b7e042
Add a security framework for authenticating connections - we will add LDAP, Kerberos, and Keystone support in the next month. For now, just put a placeholder "basic" module that does the minimum.
...
Wire the security check into ORTE's OOB handshake, and add a "version" check to ensure that both ends are from the same ORTE version. If not, report the mismatch and refuse the connection
Fixes trac:4171
cmr=v1.7.5:reviewer=jsquyres:subject=Add a security framework for authenticating connections
This commit was SVN r30551.
The following Trac tickets were found above:
Ticket 4171 --> https://svn.open-mpi.org/trac/ompi/ticket/4171
2014-02-04 01:38:45 +00:00
Ralph Castain
e43589ed84
Fix warning - thanks to Paul Hargrove for reporting it
...
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30548.
2014-02-03 23:51:45 +00:00
Ralph Castain
993198cfba
Fix lost message problem - if multiple messages are queued before the connection is formed, we lost all but the first one. Ensure that all messages get properly queued prior to completing the connection
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix lost message problem
This commit was SVN r30516.
2014-01-31 05:30:51 +00:00
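The fix described in r30516 above amounts to queueing every message that arrives before the connection completes, then flushing the queue in order. A minimal Python sketch of that pattern follows. The Peer class and its method names are hypothetical and stand in for the OOB peer state, and "sending" is modelled as appending to a list.

```python
from collections import deque

class Peer:
    """Toy model of a peer whose connection completes asynchronously."""

    def __init__(self):
        self.connected = False
        self.pending = deque()  # messages queued before the connection forms
        self.sent = []          # stands in for bytes written to the socket

    def send(self, msg):
        if not self.connected:
            # Keep every message, not just the first one.
            self.pending.append(msg)
            return
        self.sent.append(msg)

    def on_connected(self):
        # Flush the queue in arrival order once the connection is up.
        self.connected = True
        while self.pending:
            self.sent.append(self.pending.popleft())
```

Messages sent after the connection completes bypass the queue, so ordering is preserved across the transition.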
Ralph Castain
2bc9fd30ee
Orcm sends heartbeats to its daemons, but ORTE needs to continue sending them to the HNP
...
This commit was SVN r30514.
2014-01-31 01:56:01 +00:00
Ralph Castain
193cceb483
Okay, since a certain other RM out there made a fuss about being able to lock their daemons to specified cores, offer the same option here. The MCA param orte_daemon_cores can be used to specify which core(s) you want the orte daemons to use. This will have no bearing on the application procs - unbound will remain unbound, and binding directives will be applied to the apps.
...
Yippee skippee...
This commit was SVN r30513.
2014-01-30 23:50:14 +00:00
Rolf vandeVaart
f7055de78e
Stop listening thread and wait for it to terminate.
...
This commit was SVN r30507.
2014-01-30 20:37:15 +00:00
Ralph Castain
83e32aadb7
Add a variant of opal_init/finalize for running unit tests
...
This commit was SVN r30497.
2014-01-30 11:14:36 +00:00
Ralph Castain
db92ac3ce1
Cleanup role of aggregator relative to daemons
...
Refs trac:4176
This commit was SVN r30495.
The following Trac tickets were found above:
Ticket 4176 --> https://svn.open-mpi.org/trac/ompi/ticket/4176
2014-01-30 00:53:30 +00:00
Ralph Castain
ed3da20672
Add unit test for opal_db
...
This commit was SVN r30494.
2014-01-30 00:51:44 +00:00
Adrian Reber
af934fc6e8
removed trailing whitespaces in snapc
...
This commit was SVN r30489.
2014-01-29 21:27:13 +00:00
Adrian Reber
7de34ea201
SNAPC/CRCP/SSTORE: remove compiler warnings
...
This commit was SVN r30488.
2014-01-29 20:52:00 +00:00
Adrian Reber
5f95db3902
SNAPC: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
...
During the commits to make the C/R code compile again the
blocking receive calls in snapc_full_app.c were
replaced by non-blocking receive calls.
This commit adds ORTE_WAIT_FOR_COMPLETION()
after each non-blocking receive to wait for the data.
This commit was SVN r30487.
2014-01-29 20:46:14 +00:00
Adrian Reber
fa1036f38c
SSTORE/CRCP: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
...
During the commits to make the C/R code compile again the
blocking receive calls were replaced by non-blocking
which broke the code. This patch uses ORTE_WAIT_FOR_COMPLETION()
to wait until the non-blocking calls have finished.
This commit was SVN r30486.
2014-01-29 20:30:35 +00:00
Adrian Reber
d5c1e33900
SSTORE: use dynamic buffers for rml.send and rml.recv
...
The sstore component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;
This commit was SVN r30485.
2014-01-29 20:06:23 +00:00
Adrian Reber
2900f24b67
SNAPC: use dynamic buffers for rml.send and rml.recv
...
The snapc component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;
This commit was SVN r30484.
2014-01-29 19:58:33 +00:00
Ralph Castain
4e3d12d9c1
Fix suicide operation when MPI app loses connection to its local daemon. In that scenario, we correctly callback up to the MPI layer notifying it of the lost connection. However, when the MPI layer calls back down to tell the RTE to abort, it is passing back a flag indicating we should report that error to our local daemon - which is dead. This leads to an infinite loop. Break it by checking the flag indicating an abnormal term was ordered by the RTE and thus don't attempt to send the message.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30475.
2014-01-29 16:56:54 +00:00
Ralph Castain
410a3afa7b
Fix --without-hwloc operations - must default to map-by slot in that scenario
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30474.
2014-01-29 16:54:05 +00:00