Ralph Castain
42bf7466fc
This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #includes due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces *very* long delays at scale. Just kill the procs and move on.
...
Refs trac:4717
This commit was SVN r32019.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 17:57:51 +00:00
Gilles Gouaillardet
d26ac02b4a
Protect access to orte_proc_info_t.cpuset with #if OPAL_HAVE_HWLOC
...
Fix a bug when trunk is configured with --without-hwloc
v1.8 is safe so no cmr
This commit was SVN r31957.
2014-06-06 07:25:39 +00:00
Ralph Castain
34cb137314
Add another attribute to the orte_proc_t area
...
This commit was SVN r31953.
2014-06-05 14:48:19 +00:00
Ralph Castain
b771388fa7
We really need to send *all* the daemon info whenever the daemon job has changed as new daemons need a full nidmap
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31948.
2014-06-04 03:38:54 +00:00
Ralph Castain
f1978fba7c
Cleanup a set of typos on the orte_get_attribute call
...
This commit was SVN r31942.
2014-06-03 20:36:38 +00:00
Ralph Castain
8736a1c138
Per RFC:
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php
Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).
This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Nathan Hjelm
73bfecd650
More leak fixes.
...
Two leaks are fixed in this commit:
- Do not leak btl component list items.
- Do not leak the nodename when decoding the pidmap.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31779.
2014-05-15 16:38:13 +00:00
Gilles Gouaillardet
5f82c391a6
Fix memory leaks in orte/util/nidmap.c
...
This patch fixes four memory leaks in orte/util/nidmap.c:
- hwloc_get_root_obj(opal_hwloc_topology)->userdata was never freed
- even if bo->bytes is freed in the decode, bo was not freed
- a job list is populated but never used nor freed
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31770.
2014-05-15 08:28:53 +00:00
Ralph Castain
5388347511
Per Jeff's suggestion, remove function that has duplicate functionality and just use one to check if session_dir directory should be removed.
...
Refs trac:4584
This commit was SVN r31691.
The following Trac tickets were found above:
Ticket 4584 --> https://svn.open-mpi.org/trac/ompi/ticket/4584
2014-05-08 17:22:43 +00:00
Ralph Castain
5602156a1c
Use the correct abstraction layer name for the data dirs
...
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
05590b6a8c
Correct the datastore containing the coprocessor info
...
This commit was SVN r31677.
2014-05-07 19:29:12 +00:00
Ralph Castain
0209cddb5b
Revert r31596 and r31595 as they recreate the "abort" problem - all they did was move the blocking send to another point in the code. An alternative solution to the "show_help and abort" problem will come in another commit.
...
Refs trac:4576
This commit was SVN r31599.
The following SVN revision numbers were found above:
r31595 --> open-mpi/ompi@2b61f22973
r31596 --> open-mpi/ompi@712634efd3
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:38:30 +00:00
Ralph Castain
712634efd3
Silence warning
...
Refs trac:4576
This commit was SVN r31596.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-01 23:58:03 +00:00
Ralph Castain
2b61f22973
Now that the abort code no longer involves a blocking rml send section, apps that call show_help followed by abort are not printing their error message. So block them in show_help until that message gets out.
...
This commit was SVN r31595.
2014-05-01 22:57:17 +00:00
Ralph Castain
238ecea311
When we comm_spawn, we really want to respect the original -host directives and not expand the daemon virtual machine unless directed to do so in the comm_spawn command. Otherwise, we will automatically launch daemons on every node in the allocation.
...
cmr=v1.8.2:reviewer=rhc:subject=respect vm boundaries during comm_spawn
This commit was SVN r31578.
2014-04-30 22:26:18 +00:00
Ralph Castain
087b84b0ef
Add some further debug to the dstore framework. When doing comm_spawn, we have to exchange any provided cpu bitmaps to ensure both sides compute the same locality, else various mpi frameworks can go bonkers.
...
This commit was SVN r31572.
2014-04-30 19:29:00 +00:00
Ralph Castain
8cda1b3dc6
Don't store cpu_bitmap unless it is non-NULL
...
This commit was SVN r31570.
2014-04-30 18:12:48 +00:00
Ralph Castain
7a79b25577
Ensure we cleanup some files so session dirs can be rolled up
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31569.
2014-04-30 17:52:10 +00:00
Ralph Castain
c4c9bc1573
As per the RFC:
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM
This commit was SVN r31557.
2014-04-29 21:49:23 +00:00
Jeff Squyres
38a27b858d
Protect for the CLEANUP case where tmp hasn't been set yet
...
Refs trac:4536
This commit was SVN r31438.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:34:53 +00:00
Jeff Squyres
530f22c403
proc_info.c: uncomment C99 struct member initialization usage
...
The C99 usage to initialize via struct member names was already there,
but commented out. This commit doesn't fix any known problem; it
simply uncomments the C99 code, because it's safer/better.
This commit was SVN r31425.
2014-04-18 17:26:07 +00:00
Ralph Castain
12094eb7b2
Add some further protections after discussion with Jeff
...
Refs trac:4536
This commit was SVN r31422.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 16:21:55 +00:00
Ralph Castain
8d72633acf
Ensure that the session directory fields of orte_process_info have been initialized prior to cleaning up those directories as part of the initialization process that deals with stale session directory trees.
...
Fixes trac:4534
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31421.
The following Trac tickets were found above:
Ticket 4534 --> https://svn.open-mpi.org/trac/ompi/ticket/4534
2014-04-18 14:25:48 +00:00
Ralph Castain
deff85ffc3
Prevent a segfault if we encounter an error while parsing a hostfile. Don't issue an error_log output as the hostfile code already prints an error message
...
Thanks to Tetsuya Mishima for the patch. Reviewed ok by rhc.
RM-approved
cmr=v1.8.1:reviewer=ompi-gk1.8
This commit was SVN r31377.
2014-04-12 21:32:10 +00:00
Ralph Castain
61d94fcee2
Fix the sequential mapper - it was out-of-sync with the hostfile changes, and we missed the "seq" policy when parsing the --map-by option. Thanks to Bill Chen for reporting it
...
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31333.
2014-04-08 03:38:25 +00:00
Ralph Castain
3fdcaeab97
Fix a problem where we need to abort due to a mapping failure, but we are in a managed environment and thus the orteds have not wired up. Thus, if we send the exit message across the routed network, the remote daemons won't have a way to relay the message along - and we won't exit.
...
If we are aborting, then set the flags so the HNP directly sends an exit command to each daemon. Make it the halt_vm command so the remote daemon doesn't try to relay it, but instead just exits without waiting for its routed children to exit first.
cmr=v1.8.1:reviewer=jsquyres:subject=fix hangs due to abort prior to daemon wireup
This commit was SVN r31304.
2014-04-02 04:17:55 +00:00
Jeff Squyres
173c046617
build: add Automake-like silent/verbose macros for "ln -s ..." operations
...
Also, since I put some of the macros for these silent/verbose rules up
in the top-level Makefile.man-page-rules file, I renamed it to
Makefile.ompi-rules.
I've had this sitting around for a while; now seems like as good a
time as any to commit it.
This commit was SVN r31271.
2014-03-28 18:24:32 +00:00
Ralph Castain
5a868028a8
Revert r31091 - the functionality didn't disappear, but moved into the MPI layer :-(
...
This commit was SVN r31093.
The following SVN revision numbers were found above:
r31091 --> open-mpi/ompi@edf680855e
2014-03-17 22:30:03 +00:00
Ralph Castain
edf680855e
Restore locality computation to the nidmap code - don't know how/when it was removed, but that was not good
...
cmr=v1.7.5:reviewer=hjelmn
This commit was SVN r31091.
2014-03-17 21:59:25 +00:00
Ralph Castain
7bb8dbade6
Extend the regular expression parsing support
...
This commit was SVN r31088.
2014-03-17 21:25:05 +00:00
Adrian Reber
8d40cd53ae
use the existing pretty-print function for information about the job state
...
This commit was SVN r31020.
2014-03-12 12:34:25 +00:00
Joshua Ladd
9ea9bec4ad
Addressing Jeff's comments:
1. Changed rng_buff_t --> opal_rng_buff_t
2. All global variables obey the prefix rule
3. Old code has been removed
4. Found a couple of unnecessary includes
Refs trac:4298
This commit was SVN r30807.
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
2014-02-24 23:18:35 +00:00
Joshua Ladd
e39d9f4080
Per the RFC schedule, add an additive lagged Fibonacci parallel random number generator to OPAL. To use it, include the following header in your code: opal/util/alfg.h. See ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c for an example of how to seed with opal_srand and invoke the generator with opal_rand. This should be added to
...
cmr=v1.7.5:reviewer=rhc:subject=Add an OPAL RNG
This commit was SVN r30801.
2014-02-23 21:41:38 +00:00
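For readers unfamiliar with the technique named in the commit above, here is a minimal Python sketch of an additive lagged Fibonacci generator, x[n] = (x[n-j] + x[n-k]) mod 2^bits. It is not the code in opal/util/alfg.h: the class name, the lags (24, 55), the 32-bit word size, and the LCG-based seeding are all assumptions chosen for illustration.

```python
from collections import deque


class LaggedFibonacci:
    """Additive lagged Fibonacci generator (illustrative, not OPAL's code)."""

    def __init__(self, seed, j=24, k=55, bits=32):
        if not 0 < j < k:
            raise ValueError("lags must satisfy 0 < j < k")
        self.j = j
        self.k = k
        self.mask = (1 << bits) - 1
        # Hypothetical seeding scheme: fill the k-word history with a
        # simple linear congruential generator.
        state = seed & self.mask
        history = []
        for _ in range(k):
            state = (1664525 * state + 1013904223) & self.mask
            history.append(state)
        # Force at least one odd word so the low bit is not stuck at zero.
        history[0] |= 1
        # history[-1] is x[n-1]; history[0] is the oldest word, x[n-k].
        self.history = deque(history, maxlen=k)

    def next(self):
        # x[n] = (x[n-j] + x[n-k]) mod 2**bits
        value = (self.history[-self.j] + self.history[0]) & self.mask
        # Appending to a full deque discards the oldest word automatically.
        self.history.append(value)
        return value
```

Two generators built from the same seed produce the same stream, which is the property a caller relies on when seeding once and then drawing repeatedly.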
Ralph Castain
418ca60776
Since we don't know the name of the local leader, store that info under our own name :-)
...
This commit was SVN r30777.
2014-02-20 01:39:52 +00:00
Ralph Castain
262c927778
Define a new key and store the process name of the local_rank=0 process on each node so that the MPI layer can retrieve it as desired.
...
This commit was SVN r30759.
2014-02-18 00:32:58 +00:00
Ralph Castain
c3df744a3b
Shift the orte_db_localrank key to the opal level. Add the job and proc-level session directory names to the database using opal_db keys.
...
This commit was SVN r30746.
2014-02-17 01:40:56 +00:00
Ralph Castain
449cd8f3d7
Update a couple of fields, add a scheduler field to proc_info
...
This commit was SVN r30718.
2014-02-13 23:30:04 +00:00
Ralph Castain
1565816988
Do a little better job of cleaning up the session directory left by mpirun by ensuring we delete the event associated with debugger attachment and unlinking the pipe used for that purpose. Also, we no longer leave "abort" files around, so remove that check when deleting session directory trees
...
cmr=v1.7.5:reviewer=jsquyres:subject=cleanup session directories better
This commit was SVN r30689.
2014-02-11 22:16:17 +00:00
Adrian Reber
fde1040d2f
Use unique collective ids for the checkpoint/restart code
...
This commit was SVN r30552.
2014-02-04 14:03:05 +00:00
Ralph Castain
e3cb4b4a5b
Grant Nathan his wish - add a --disable-getpwuid option to configure and protect all users of that code so it disappears if disabled.
...
cmr=v1.7.5:reviewer=hjelmn:subject=disable getpwuid if requested
This commit was SVN r30413.
2014-01-24 19:18:37 +00:00
Ralph Castain
14bf1c9463
Some minor cleanups:
* don't return null if someone wants to print ORTE_SUCCESS
* rename some stale process types
* keep show_help local if we are in standalone operation as there is nobody to send it to
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30400.
2014-01-23 21:35:20 +00:00
Ralph Castain
a01470190d
Allow a little more flexibility - if getpwuid fails, just use the return from getuid to define the session directory
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30388.
2014-01-23 05:00:05 +00:00
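The fallback described in the commit above can be sketched in a few lines of Python using the standard pwd and os modules. This is only an illustration of the idea, not the ORTE implementation: the function name session_user_label and the injectable lookup parameter are hypothetical, and how the resulting label is combined into a session directory path is deliberately left out.

```python
import os
import pwd


def session_user_label(uid=None, lookup=pwd.getpwuid):
    """Return the user name for uid, or the numeric uid as a string
    when the password database has no entry for it."""
    if uid is None:
        uid = os.getuid()
    try:
        # pwd.getpwuid raises KeyError when no entry exists for uid.
        return lookup(uid).pw_name
    except KeyError:
        # Fall back to the numeric id instead of failing outright.
        return str(uid)
```

Passing the lookup function as a parameter is purely a testing convenience here, so the failure path can be exercised without a uid that is actually missing from the system.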
Ralph Castain
3e9c8497e0
Shift the verbose output a bit
...
Refs trac:4136
This commit was SVN r30332.
The following Trac tickets were found above:
Ticket 4136 --> https://svn.open-mpi.org/trac/ompi/ticket/4136
2014-01-20 14:41:37 +00:00
Ralph Castain
5ad9795bd8
Cleanup some potential memory overruns
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30331.
2014-01-19 16:31:26 +00:00
Ralph Castain
9f6fd7b98d
A few corrections to hostfile parsing - thanks to Tetsuya Mishima for the review
...
Refs trac:4136
This commit was SVN r30330.
The following Trac tickets were found above:
Ticket 4136 --> https://svn.open-mpi.org/trac/ompi/ticket/4136
2014-01-19 16:26:12 +00:00
Ralph Castain
fcdd904af4
Simplify and update hostfile handling to correctly support hostfiles that list nodes multiple times, once for each slot, and those that list a host once and include an explicit slot count. Eliminate support for mixing those two modes as this logic became just too complex when attempting to handle all the corner cases.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30325.
2014-01-18 16:08:40 +00:00
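As a rough illustration of the two hostfile styles mentioned in the commit above (a host repeated once per slot, or a host listed once with an explicit slots=N), the following Python sketch counts slots per host and rejects a host that mixes both styles. The function name count_slots and the exact rule used to detect mixing are assumptions made for this example; the real ORTE parser handles more keywords and may draw the line differently.

```python
def count_slots(hostfile_text):
    """Return {host: slots} for a hostfile given as a string."""
    slots = {}
    explicit = set()  # hosts that were given an explicit slots=N
    for raw in hostfile_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        host = fields[0]
        count = None
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                count = int(value)
        if count is None:
            # Style 1: each bare occurrence of the host adds one slot.
            if host in explicit:
                raise ValueError("host %s mixes both styles" % host)
            slots[host] = slots.get(host, 0) + 1
        else:
            # Style 2: the host appears once with an explicit count.
            if host in slots:
                raise ValueError("host %s mixes both styles" % host)
            slots[host] = count
            explicit.add(host)
    return slots
```

For example, a file containing `node1` on two lines yields two slots for node1, and a single line `node1 slots=4` yields four.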
Ralph Castain
d5647394d8
Initialize variable so dash-host option gets correctly parsed
...
cmr=v1.7.4:reviewer=rolfv
This commit was SVN r30159.
2014-01-08 15:17:16 +00:00
Brian Barrett
8b778903d8
Fix longstanding issue with our multi-project support. Rather than using
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Ralph Castain
3f2b3c53ea
Ensure that rankfile-provided allocations are correctly handled
...
Fixes trac:4043
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that rankfile-provided allocations are correctly handled
This commit was SVN r30106.
The following Trac tickets were found above:
Ticket 4043 --> https://svn.open-mpi.org/trac/ompi/ticket/4043
2014-01-02 16:07:16 +00:00
Ralph Castain
bb80625a8a
Add missing var initialization
...
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30063.
2013-12-24 00:02:22 +00:00