Commit graph

277 Commits

Author SHA1 Message Date
Ralph Castain
18cb5c9762 Complete modifications for failed-to-start of applications. Modifications for failed-to-start of orteds coming next.
This completes the minor changes required to the PLS components. Basically, there is a small change required to the parameter list of the orted cmd functions. I caught and did it for xcpu and poe, in addition to the components listed in my email - so I think that only leaves xgrid unconverted.

The orted fail-to-start mods will also make changes in the PLS components, but those can be localized so they come in one at a time.

This commit was SVN r14499.
2007-04-24 20:53:54 +00:00
Ralph Castain
c774f641fb Modify orterun to provide more user-friendly reporting on jobs that fail to start
This commit was SVN r14496.
2007-04-24 19:19:14 +00:00
Ralph Castain
18b2dca51c Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly.
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
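
As a rough illustration of the difference between the two selectable modes, the sketch below lists the hops one xcast message would take. It is not the ORTE code: the function name `xcast_hops`, the `procs_by_node` layout, and the `orted@node` labels are all hypothetical, and one orted per node is assumed.

```python
def xcast_hops(mode, procs_by_node):
    """Return the (sender, receiver) hops for one xcast message.

    procs_by_node maps a node name to the MPI process names on that node
    (a hypothetical layout; one orted per node is assumed).
    """
    hops = []
    if mode == "direct":
        # Old standard method: the HNP sends to every MPI process itself.
        for procs in procs_by_node.values():
            for proc in procs:
                hops.append(("HNP", proc))
    elif mode == "linear":
        # The HNP sends to each orted directly; each orted then delivers
        # the message to the processes on its own node.
        for node, procs in procs_by_node.items():
            orted = f"orted@{node}"
            hops.append(("HNP", orted))
            for proc in procs:
                hops.append((orted, proc))
    else:
        raise ValueError(f"unsupported xcast mode: {mode!r}")
    return hops


layout = {"node0": ["rank0", "rank1"], "node1": ["rank2", "rank3"]}
direct = xcast_hops("direct", layout)
linear = xcast_hops("linear", layout)
# Number of messages the HNP itself has to send in each mode.
hnp_sends_direct = sum(1 for sender, _ in direct if sender == "HNP")
hnp_sends_linear = sum(1 for sender, _ in linear if sender == "HNP")
```

In this toy layout the HNP's own send count scales with the number of processes in direct mode and with the number of nodes in linear mode.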

This also involved a slight change to the oob.xcast API, so propagated that as required.

Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)

Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.

This commit was SVN r14475.
2007-04-23 18:41:04 +00:00
Ralph Castain
009be1c1b5 Reorganize the orted code for easier maintenance. Add ability to deliver xcast messages to local procs (not used at this point).
This commit was SVN r14474.
2007-04-23 18:28:20 +00:00
Brian Barrett
0a8af62c64 Fix broken build on OS X with static compiles. Everything that uses
anything in OPAL *MUST* call either opal_init() or opal_init_util().

This commit was SVN r14468.
2007-04-23 15:45:39 +00:00
Josh Hursey
27a42f48d3 Make sure to call opal_init_util before mca_base_open().
This bug(?) became apparent due to the installdirs commit since these tools
were not finding the proper libraries since the paths were wonky.

It all looks good now. :)

This commit was SVN r14461.
2007-04-21 22:38:15 +00:00
Jeff Squyres
5bebd24250 Bring over Brian's installdirs fixes from this afternoon (r14445).
This commit was SVN r14450.

The following SVN revision numbers were found above:
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:16:31 +00:00
Jeff Squyres
0ba47105ed Merge the /tmp/jms-installdirs-trunk branch into the trunk. This
finally brings in functionality that is already on the 1.2 branch, and
was developed and tested in the v1.2ofed branch (and other places).

Short version of new features:

 * Support for ibv_fork_init() 
 * Automatically fill in the openib BTL bandwidth value by 
   querying the HCA port 
 * Installdirs functionality 
 * Fixes to always use -I in the Fortran wrapper compilers (#924) 
 * Gleb's mpool updates 
 * Remove some cruft in btl/openib/configure.m4, therefore 
   fixing the harmless warnings noted in #665 
 * Bunches of updates to the Linux RPM spec file 

I.e., effectively the same thing that r14411 brought to the v1.2
branch.

Also effectively brought in r14432 and r14433 (some fixes on top of
the original r14411 commit to v1.2).  Still need to bring in the moral
equivalent of r14445 after this commit (fixes to installdirs).

This commit was SVN r14449.

The following SVN revision numbers were found above:
  r14411 --> open-mpi/ompi@83b31314ae
  r14432 --> open-mpi/ompi@a48f160595
  r14433 --> open-mpi/ompi@68f346d2bc
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:15:05 +00:00
Josh Hursey
6ee0c641fd Cleanup the output from orte-checkpoint so it is a bit more clear and references
the sequence number.

Before:
[...] Finished - Global Snapshot Reference: ompi_global_snapshot_1234.ckpt

After:
Snashot Ref.:   1 ompi_global_snapshot_1234.ckpt

This commit was SVN r14381.
2007-04-15 14:28:56 +00:00
Josh Hursey
8fd6d4ba09 add a newline so output is cleaner/clearer
This commit was SVN r14229.
2007-04-05 17:45:03 +00:00
George Bosilca
f2a6b9394f Deal with the include spree. Protect "environ" on Windows.
Some other minor modifications in order to make it
compile [again] on Windows.

This commit was SVN r14188.
2007-04-01 16:16:54 +00:00
Ralph Castain
0d98264097 Fix the nolocal option on the OMPI trunk
This commit was SVN r14138.
2007-03-24 16:16:16 +00:00
Jeff Squyres
bcdfbacaa4 Oops -- typo from previous commit. :-(
This commit was SVN r14130.
2007-03-23 00:51:50 +00:00
Jeff Squyres
a3dd0f2e08 Connect --nolocal up to the MCA param rmaps_base_schedule_local, as it
should be (it's a mistake that it got left out).

This commit was SVN r14127.
2007-03-22 19:29:47 +00:00
Josh Hursey
101a2abd09 - Be more careful with parens
- Run the destructor *before* shutting things down.

This commit was SVN r14064.
2007-03-19 17:33:20 +00:00
Josh Hursey
a181c987cc Remove some old references to ft_enable parameter that no longer exists.
This was replaced by the "-am ft-enable-cr" AMCA parameter.

This commit was SVN r14055.
2007-03-17 20:02:42 +00:00
Josh Hursey
dadca7da88 Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD).
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.

This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.

This commit closes trac:158

More details to follow.

This commit was SVN r14051.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r13912

The following Trac tickets were found above:
  Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
2007-03-16 23:11:45 +00:00
Josh Hursey
0404444dbe * Added 2 new MCA parameters
  - mca_base_param_file_prefix
     (Default: NULL)
     This is the full name of the "-am" mpirun option. Used to specify a ':'
     separated list of AMCA parameter set files.
  - mca_base_param_file_path
     (Default: $SYSCONFDIR/amca-param-sets/:$CWD)
     The path to search for AMCA files with relative paths. A warning will be
     printed if the AMCA file cannot be found.

* Added a new function "mca_base_param_recache_files" that re-reads the file
configurations. This is used internally to help bootstrap the MCA system.

* Added a new orterun/mpirun command line option '-am' that aliases for the
mca_base_param_file_prefix MCA parameter

* Exposed the opal_path_access function as it is generally useful in other
places in the code.

* New function "opal_cmd_line_make_opt_mca" which will allow you to append a
new command line option with MCA parameter identifiers to set at the same
time. Previously this could only be done at command line declaration time.

* Added a new directory under the $pkgdatadir named "amca-param-sets" where all
the 'shipped with' Open MPI AMCA parameter sets are placed. This is the first
place to search for AMCA sets with relative paths.

* An example.conf AMCA parameter set file is located in
contrib/amca-param-sets/.

* Jeff Squyres contributed an OpenIB AMCA set for benchmarking.
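
The lookup described for the two new parameters could be sketched roughly as follows. This is an illustration only, not the mca_base code: `resolve_amca_files` and its return shape are hypothetical, and it assumes absolute names bypass the search path while relative names are tried against each ':'-separated directory in order.

```python
import os


def resolve_amca_files(file_list, search_path):
    """Resolve a ':'-separated list of AMCA files against a ':'-separated path.

    Returns (found, missing): the paths that exist, and the names that could
    not be located (a caller would print a warning for each of those).
    """
    found, missing = [], []
    for name in filter(None, file_list.split(":")):
        if os.path.isabs(name):
            # Absolute names are checked as given (assumption).
            candidates = [name]
        else:
            # Relative names are tried in each search directory, in order.
            candidates = [os.path.join(d, name)
                          for d in filter(None, search_path.split(":"))]
        hit = next((c for c in candidates if os.path.isfile(c)), None)
        if hit is not None:
            found.append(hit)
        else:
            missing.append(name)
    return found, missing
```

For example, `resolve_amca_files("example.conf", "/etc/amca-param-sets:.")` would return the first directory in which `example.conf` exists, or report it as missing.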

Note: You will need to autogen with this commit as it adds a configure param.
  Sorry :(

This commit was SVN r13867.
2007-03-01 13:39:20 +00:00
Rainer Keller
0889ebd59f - Eliminate warnings, that PGI-6.2.5 issues with -Minform=inform
This commit was SVN r13840.
2007-02-28 08:36:34 +00:00
George Bosilca
d29423b1f7 orted_globals_t should be global.
This commit was SVN r13684.
2007-02-16 18:16:06 +00:00
Pak Lui
2d6b3776bf * fix the SEGV described in trac #892, where an exit_status in the 200 range
causes strsignal to return NULL. Still trying to determine
  why exit_status is in that range.
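
The actual fix is in the C code of orterun. As an analogous sketch only, the same kind of guard looks like this in Python, where `signal.strsignal` returns None for an unrecognized signal and raises ValueError for a number outside the platform's range. The helper name `describe_signal` and the fallback text are hypothetical.

```python
import signal


def describe_signal(signum):
    """Return a printable description for signum, never None."""
    try:
        text = signal.strsignal(signum)
    except ValueError:
        # Raised when signum is outside the platform's valid signal range.
        text = None
    # Fall back to a generic label instead of handing None to a formatter.
    return text if text is not None else f"unknown signal {signum}"
```

The point of the guard is that the caller can always format the result into a message without checking for a missing description first.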

This commit was SVN r13583.
2007-02-09 16:39:30 +00:00
Pak Lui
ccff0a6e65 * minor fix to correct the pid that always shows up as 0 in the abort
error message. e.g: 

  mpirun noticed that job rank 2 with PID 0 on node burl-ct-v440-4
  exited on signal 15 (Terminated).

This commit was SVN r13537.
2007-02-07 17:46:19 +00:00
Rolf vandeVaart
dcce8c739c Fix compiler warning. I am not sure how this got
past us, but thanks to Jeff Squyres for pointing it out.

This commit was SVN r13501.
2007-02-05 22:03:58 +00:00
Rolf vandeVaart
74e3b68ce8 Better document orte-clean's behavior.
This commit was SVN r13498.
2007-02-05 20:01:15 +00:00
Jeff Squyres
4e506e69e5 Add missing <sys/param.h>
This commit was SVN r13478.
2007-02-03 01:11:35 +00:00
Rolf vandeVaart
bf5113198d Update to orte-clean so it will remove files on local and
remote nodes.  It will also kill off rogue orteds and orterun
processes.  The killing of processes is ifdef'ed out for Windows
since I do not know how to do it there.  Note that this change
will require an autogen.  

This commit was SVN r13477.
2007-02-03 00:25:42 +00:00
Ralph Castain
3daf8b341b Fix the sched_yield problem for generic environments. We now determine and set sched_yield during mpi_init based on the following logical sequence:
1. if the user has specified sched_yield, we simply do what we are told

2. if they didn't specify anything, try to get the number of processors on this node. Note that we already now get the number of local procs in our job that are sharing this node - that now comes in through the proc callback and is stored in the ompi_proc_t structures.

3. if we can get the number of processors, compare that to the number of local procs from my job that are sharing my node. If the number of local procs exceeds the number of processors, then set sched_yield to true. If not, then be a hog and set sched_yield to false

4. if we can't get the number of processors, default to conservative behavior and set sched_yield to true.
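
The four steps above can be sketched as a single decision function. This is a minimal illustration, not the mpi_init code: the name `decide_sched_yield` and the use of None for "not specified" or "could not be determined" are assumptions.

```python
def decide_sched_yield(user_setting, num_processors, num_local_procs):
    """Return True if the process should yield the CPU while waiting.

    user_setting: True/False if the user specified it, None otherwise.
    num_processors: processors on this node, or None if it cannot be found.
    num_local_procs: procs from this job that share this node.
    """
    # Step 1: an explicit user choice always wins.
    if user_setting is not None:
        return user_setting
    # Step 4: without a processor count, default to conservative behavior.
    if num_processors is None:
        return True
    # Steps 2 and 3: yield only when the node is oversubscribed by this job.
    return num_local_procs > num_processors
```

For instance, `decide_sched_yield(None, 4, 8)` yields, while `decide_sched_yield(None, 4, 4)` does not.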

Note that I have not yet dealt with the need to dynamically adjust this setting as more processes are added via comm_spawn. So far, we are *only* looking within our own job. Given that we have now moved this logic to mpi_init (and away from the orteds), it isn't yet clear to me how a process will be informed about the number of procs in *other* jobs that are also sharing this node.

Something to continue to ponder.

This commit was SVN r13430.
2007-02-01 19:31:44 +00:00
George Bosilca
9f73335bdb Silence the compiler.
This commit was SVN r13381.
2007-01-31 04:24:56 +00:00
Jeff Squyres
8d872b195a Refs trac:726
Tested this functionality quite a bit more and made some fixes:

 * Print far fewer help messages
 * Fix one additional deadlock upon error
 * Change some ORTE_LOG messages to silent (because they're not
   errors)
 * Some code got re-indented, sorry...

Discussed and reviewed with Ralph.

This commit was SVN r13375.

The following Trac tickets were found above:
  Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
2007-01-30 23:03:13 +00:00
Rainer Keller
3669e8921e - Fix further compiler warnings regarding initialization
and shadowing variables.

This commit was SVN r13358.
2007-01-30 06:34:38 +00:00
Ralph Castain
ab5ea61100 Bring over the rest of the ctrl-c fixes. This commit includes:
1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned.

2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we print out an appropriate error message and just give up.

3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded

4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond
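
The idea behind a bounded wait such as orte_abort_timeout can be sketched as a polling loop with a deadline. This is an illustration under stated assumptions, not the orterun implementation: `wait_for_orteds`, `poll_response`, and the polling interval are hypothetical names and values.

```python
import time


def wait_for_orteds(poll_response, timeout, interval=0.01):
    """Poll until poll_response() is true or timeout seconds have passed.

    Returns True if the daemons responded, False if we gave up waiting.
    """
    deadline = time.monotonic() + timeout
    while True:
        if poll_response():
            return True
        if time.monotonic() >= deadline:
            # The caller would print an error message and abandon the wait.
            return False
        time.sleep(interval)
```

Because the wait returns instead of blocking indefinitely, the caller stays free to react to further input such as another ctrl-c.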

This needs more testing before migrating to 1.2.

This commit was SVN r13304.
2007-01-25 14:17:44 +00:00
Ralph Castain
455e4ada9a Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes.
This commit was SVN r13191.
2007-01-18 17:15:19 +00:00
Ralph Castain
cc905290e4 Fix the pernode and npernode options - the mca parameters weren't being set to correspond to the command line options
This commit was SVN r13151.
2007-01-17 14:56:22 +00:00
Ralph Castain
5d698dc55b Turn "off" an unimplemented command line option - we do not currently support execution without mpirun waiting for job completion.
This commit was SVN r13127.
2007-01-16 16:10:31 +00:00
Jeff Squyres
e5205657cf A much better fix for #739. No configure test -- just do a simple
memcpy() instead of assigning the struct's by value.

Fixes trac:739.

This commit was SVN r13081.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 14:30:32 +00:00
Jeff Squyres
add3909096 Back out 13076 and 13077 in favor of a much simpler approach.
Sorry for the configure change -- hopefully it's early enough in the
morning that it won't affect people... (new approach won't have a
configure change).

Refs trac:739.

This commit was SVN r13080.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 14:07:15 +00:00
George Bosilca
24a91fad1d OPAL_BOOL_STRUCT_COPY or OMPI_BOOL_STRUCT_COPY that's the question!
Let's minimize the disturbances and say that the configure system is right.
From now on it's OPAL_BOOL_STRUCT_COPY. This one is related to r13076 and
has to follow when r13076 goes in the 1.2.

This commit was SVN r13077.

The following SVN revision numbers were found above:
  r13076 --> open-mpi/ompi@f0932a0701
2007-01-11 05:44:48 +00:00
Jeff Squyres
f0932a0701 A workaround for a bug in the PGI 6.2 compiler series. This bug has
been fixed in the 7.0 PGI series, but is unlikely to be fixed in the
6.2 series:

 * Add a configure test looking for the bad behavior (the PGI compiler
   chokes on C code where structs containing bool's are copied by
   value)
 * Set OMPI_BOOL_STRUCT_COPY to 1 if it's ok, 0 if it's not (i.e., PGI
   6.2 series will have this value set to 0)
 * In two places in the code base -- orte-clean and btl_openib_ini.h,
   we have a struct that contains a bool that is copied by value.  In
   these two places, check OMPI_BOOL_STRUCT_COPY and if it's 0, use
   the "int" type instead of "bool".

Fixes trac:739

This commit was SVN r13076.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 02:21:26 +00:00
Jeff Squyres
8a289cf1cb Part 1 of the fix for ticket #726. This commit adds logic to orterun
to effect the following:

 * The first time the user hits ctrl-c, we go into the process of
   killing the ORTE job (this is not new).
 * While waiting for the job to actually terminate, if the user hits
   ctrl-c a second time, we print a warning saying "Hey, I'm still
   trying to kill the job.  If you *really* want me to die
   immediately, hit ctrl-c again within 1 second."
 * If the user hits ctrl-c again within 1 second, orterun quits with a
   warning about how the job may not have actually been killed.
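
The three-stage behavior above can be sketched as a small state tracker. This is not the orterun signal handler: the class `CtrlCHandler` and its action strings are hypothetical, timestamps are passed in so the logic can be exercised without real signals, and it assumes that a press arriving after the 1-second window simply repeats the warning.

```python
class CtrlCHandler:
    """Track ctrl-c presses and decide what to do for each one."""

    FORCE_WINDOW = 1.0  # seconds, as stated in the commit message

    def __init__(self):
        self.presses = 0
        self.warned_at = None  # time of the most recent warning, if any

    def on_ctrl_c(self, now):
        self.presses += 1
        if self.presses == 1:
            # First press: start killing the job.
            return "terminate_job"
        if self.warned_at is not None and now - self.warned_at <= self.FORCE_WINDOW:
            # Another press soon after the warning: quit immediately.
            return "force_quit"
        # Otherwise warn (again) and restart the 1-second window.
        self.warned_at = now
        return "warn"
```

A sequence of presses at 0.0, 5.0, and 5.4 seconds would give `terminate_job`, `warn`, and `force_quit` in turn.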

Note that none of this logic will really work until the second part
of the fix for #726 is also committed (i.e., make pls.terminate_job()
non-blocking).  So I'm now throwing the ticket over to Ralph for the
second part of the fix...

Refs trac:726

This commit was SVN r13040.

The following Trac tickets were found above:
  Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
2007-01-08 20:25:26 +00:00
Rolf vandeVaart
fdf44cc4ab Add the ability to not only report broken files and directories,
but remove them also.  This current set of changes will affect
nothing as no one is making use of this ability.  However, orte-clean
will be changed soon to utilize this new feature.

This commit was SVN r12996.
2007-01-04 21:48:34 +00:00
Brian Barrett
bc6cec346f Print out the description of the signal from mpirun when a proc was aborted
by a signal if we have strsignal()

This commit was SVN r12888.
2006-12-17 20:01:11 +00:00
Ralph Castain
7b8f445e13 Modify the "--display-map-at-launch" option to just "--display-map". Now that we have a "--do-not-launch" option, the "-at-launch" part of the display-map option was confusing. "--display-map" displays the resulting process map before we launch anyway, so this is clearer.
This commit was SVN r12840.
2006-12-13 13:49:15 +00:00
Ralph Castain
82946cb220 Add a new option to orterun: "--do-not-launch" directs the system to do the allocation, map, job setup, etc., but not actually launch the job. This lets us test all the setup portions of the code.
Also, take the first step in updating how we handle mca params in ORTE - bring it closer to how it is done in the other two layers. Much more work to be done here.

This commit was SVN r12838.
2006-12-13 04:51:38 +00:00
Ralph Castain
28ce8e5e5e Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior).
In "--npernode" operation, the "-np" command line parameter is ignored.
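
A rough sketch of the "--npernode N" placement described in this commit follows. It is not the rmaps code: `map_npernode`, the `slots_by_node` input, and the contiguous per-node rank numbering are assumptions made for illustration, and "-np" plays no part, as in this commit.

```python
def map_npernode(slots_by_node, n, allow_oversubscribe=True):
    """Place n procs on every node; return (mapping, oversubscribed).

    slots_by_node maps node name -> allocated slots (hypothetical input).
    mapping maps node name -> list of ranks placed on that node.
    """
    mapping = {}
    oversubscribed = False
    rank = 0
    for node, slots in slots_by_node.items():
        if n > slots:
            if not allow_oversubscribe:
                # The usual oversubscription logic: error out.
                raise RuntimeError(f"{node}: {n} procs exceed {slots} slots")
            # Otherwise run, with sched_yield set to non-aggressive behavior.
            oversubscribed = True
        mapping[node] = list(range(rank, rank + n))
        rank += n
    return mapping, oversubscribed
```

With two nodes of two slots each, `map_npernode({"a": 2, "b": 2}, 3)` places three procs on each node and flags the job as oversubscribed.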

This commit was SVN r12826.
2006-12-12 00:54:05 +00:00
Brian Barrett
6f8b366acb Rename liborte to libopen-rte and libopal to libopen-pal per telecon today
and bug #632.

Refs trac:632

This commit was SVN r12762.

The following Trac tickets were found above:
  Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632
2006-12-05 18:27:24 +00:00
Rainer Keller
e61dd8722e - Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx
- No functional changes, just indentation and corrections to error
   output.

This commit was SVN r12734.
2006-12-03 13:59:23 +00:00
George Bosilca
a0ed53d70b Make the compilers happy.
This commit was SVN r12729.
2006-12-03 00:19:11 +00:00
Ralph Castain
652b91ee26 Remove some compiler warnings
This commit was SVN r12678.
2006-11-27 23:47:36 +00:00
Brian Barrett
32833deff0 since orteboot, ortehalt, and ortekill were all added today (including to
configure.ac), we need to add them to SUBDIRS to make them end up in the
tarball as well...

This commit was SVN r12658.
2006-11-23 03:10:57 +00:00
Ralph Castain
7f95b27141 Correctly "hide" the new orte tools - they shouldn't get compiled or seen unless you specifically go into those subdirectories and manually do a "make".
This commit was SVN r12650.
2006-11-22 14:35:16 +00:00