Commit graph

1686 commits

Author SHA1 Message Date
Jeff Squyres
75a97ebbf0 Many thanks to Ralf W. for finding a subtle bug in these Makefile.am's
that can *sometimes* cause problems with "make -j [N>1] install".
Ensure to make the target directory before we copy stuff into it --
read the thread starting here for more details:

    http://www.open-mpi.org/community/lists/devel/2008/06/4080.php

This commit was SVN r18570.
2008-06-04 01:28:03 +00:00
Ralph Castain
8ce4b64b5a Ensure we don't go past the end of the array
This commit was SVN r18569.
2008-06-03 21:31:02 +00:00
George Bosilca
25ae9c12e6 Silence a few warnings.
This commit was SVN r18568.
2008-06-03 19:58:40 +00:00
George Bosilca
fa89d299bf Silence the Obj-C compiler.
This commit was SVN r18567.
2008-06-03 19:24:17 +00:00
Jeff Squyres
a4db97c213 More man pages fixes from our Debian Open MPI package maintainer
friends.  Woo hoo!

This commit was SVN r18559.
2008-06-03 16:44:40 +00:00
Ralph Castain
c992e99035 Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface
This commit was SVN r18557.
2008-06-03 14:24:01 +00:00
Ralph Castain
95578b0528 Fix single-node operations so that the HNP correctly exits when the job completes
This commit was SVN r18556.
2008-06-03 14:23:04 +00:00
Ralph Castain
b456fb2d42 Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2*heartrate, declares orted failure if not seen in last 2*heartrate time.
Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher.

Neither detection is enabled by default: each is active only if the heartrate and/or the launch timeout is set. The exception is SLURM, where orted failure is always detected and reported.

More info to come on devel list.

This commit was SVN r18555.
2008-06-02 21:46:34 +00:00
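The heartbeat rule described in commit b456fb2d42 (check every 2*heartrate, declare a daemon failed if nothing was heard within the last 2*heartrate) can be sketched roughly as follows. This is only an illustration under stated assumptions: `daemon_t`, `record_heartbeat`, and `check_heartbeats` are hypothetical names, not the actual ORTE code, and time is passed in as a plain number of seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-daemon record; not the actual ORTE data structure. */
typedef struct {
    double last_beat;   /* time (seconds) of the most recent heartbeat */
    bool   failed;      /* set once the daemon has been declared failed */
} daemon_t;

/* Record a heartbeat received from a daemon at time `now`. */
void record_heartbeat(daemon_t *d, double now)
{
    assert(d != NULL);
    d->last_beat = now;
}

/*
 * Periodic check, intended to run every 2*heartrate seconds.
 * Marks every daemon whose last heartbeat is older than 2*heartrate
 * as failed and returns how many were newly marked.
 * A heartrate <= 0 means detection is disabled (the default).
 */
size_t check_heartbeats(daemon_t *daemons, size_t n,
                        double now, double heartrate)
{
    size_t newly_failed = 0;

    if (heartrate <= 0.0) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        if (!daemons[i].failed &&
            now - daemons[i].last_beat > 2.0 * heartrate) {
            daemons[i].failed = true;
            newly_failed++;
        }
    }
    return newly_failed;
}
```

Treating a non-positive heartrate as "disabled" mirrors the commit's statement that detection is off unless the parameter is set; how the real code encodes that is not shown in the log.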
Shiqing Fan
af656b2b3d Fix some typing mistakes, make the sources compile again for Windows Visual Studio.
This commit was SVN r18542.
2008-05-29 15:27:43 +00:00
Ralph Castain
2772221ae0 Add the -xml option to mpirun to indicate that xml output is desired
This commit was SVN r18539.
2008-05-29 14:11:31 +00:00
Ralph Castain
2b28bef15a Provide a "nicer" indication that we don't know the pid of the failed orted
This commit was SVN r18538.
2008-05-29 14:10:58 +00:00
Ralph Castain
72530f8fed Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed:
1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done.

2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear.

This represents the best-available fix for ticket #221 so I am closing that ticket at this time.

This commit was SVN r18536.
2008-05-29 13:38:27 +00:00
Ralph Castain
52fb773c6c Tell slurm to kill the job if an orted abnormally exits
This commit was SVN r18535.
2008-05-29 12:26:58 +00:00
Ralph Castain
b2a566b610 Break an infinite loop in orte_output caused by debugging of ORTE comm subsystems
This commit was SVN r18534.
2008-05-29 12:21:17 +00:00
Ralph Castain
e5e542ddcf Clarify an error message
This commit was SVN r18533.
2008-05-29 12:20:24 +00:00
Josh Hursey
4ac7016200 Make sure to check "opal_list_get_last" instead of "opal_list_get_end".
The former will return a valid item in the list, the latter will return an invalid item that marks the end of the list.

It was happening that when oversubscribing by way of an appfile we would cause a segv because we tried to interpret the invalid item returned by "opal_list_get_end" instead of a valid item. We would then try to write to unallocated memory.

This commit fixes trac:1279

This commit was SVN r18529.

The following Trac tickets were found above:
  Ticket 1279 --> https://svn.open-mpi.org/trac/ompi/ticket/1279
2008-05-28 19:37:20 +00:00
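The distinction behind commit 4ac7016200 (the "last" item is real data, the "end" item is only a marker) can be illustrated with a small sentinel-based list. This is a sketch of the general technique, assuming a circular list with one sentinel node; `list_t`, `item_t`, and the `list_*` functions are hypothetical and are not the `opal_list` API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel-based list; it mimics the idea, not opal_list_t. */
typedef struct item {
    struct item *prev, *next;
    int value;                 /* payload; the sentinel's value is meaningless */
} item_t;

typedef struct {
    item_t sentinel;           /* marks both ends of the list, never holds data */
} list_t;

void list_init(list_t *l)
{
    assert(l != NULL);
    l->sentinel.prev = l->sentinel.next = &l->sentinel;
    l->sentinel.value = 0;
}

void list_append(list_t *l, item_t *it)
{
    assert(l != NULL && it != NULL);
    it->prev = l->sentinel.prev;
    it->next = &l->sentinel;
    l->sentinel.prev->next = it;
    l->sentinel.prev = it;
}

/* One-past-the-end marker: valid only for comparison, never for access. */
item_t *list_get_end(list_t *l)  { return &l->sentinel; }

/* Last real item, or the end marker if the list is empty. */
item_t *list_get_last(list_t *l) { return l->sentinel.prev; }

/* Read the last payload safely; returns 0 on success, -1 if empty. */
int list_last_value(list_t *l, int *out)
{
    item_t *last = list_get_last(l);

    if (last == list_get_end(l)) {
        return -1;             /* empty list: nothing to read */
    }
    *out = last->value;
    return 0;
}
```

Reading or writing through the pointer returned by `list_get_end` would touch the sentinel rather than a list element, which is the class of bug the commit describes.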
Ralph Castain
f76240e7cc Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algorithms, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch.
This commit was SVN r18528.
2008-05-28 18:38:47 +00:00
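The problem addressed by commit f76240e7cc (a daemon's vpid stops matching its node's array index once a node is skipped) can be shown with a toy mapping. This is a minimal sketch, not the nidmap code: `assign_daemon_vpids`, `node_for_vpid`, and `VPID_NONE` are hypothetical, and the numbering from 0 is purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VPID_NONE (-1)   /* hypothetical marker: no daemon on this node */

/*
 * Give consecutive vpids to the nodes that are actually used and store
 * the vpid for each node explicitly, instead of assuming vpid == index.
 * Returns the number of daemons assigned.
 */
size_t assign_daemon_vpids(const bool *used, int *vpid, size_t nnodes)
{
    size_t next = 0;

    assert(used != NULL && vpid != NULL);
    for (size_t i = 0; i < nnodes; i++) {
        vpid[i] = used[i] ? (int)next++ : VPID_NONE;
    }
    return next;
}

/* Find the node index hosting a given daemon vpid, or -1 if none. */
int node_for_vpid(const int *vpid, size_t nnodes, int wanted)
{
    for (size_t i = 0; i < nnodes; i++) {
        if (vpid[i] == wanted) {
            return (int)i;
        }
    }
    return -1;
}
```

With nodes 0 and 2 used and node 1 skipped, the second daemon gets vpid 1 but lives on node index 2, so a lookup by index alone would pick the wrong node.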
Ralph Castain
347752e40b One more corner case (caught by Jeff) occurs when someone requests usage help - e.g., with mpirun --help. In this case, we do a show_help prior to orte_init, but it is okay to do so.
To allow this, we let show_help just operate correctly without any warning about pre-orte_output_init.

This commit was SVN r18525.
2008-05-28 13:58:03 +00:00
Ralph Castain
828ae26d90 ORTE-level MCA params are defined in several places. Ompi_info cannot call orte_init due to an issue with the memory allocator, thus making it impossible for ompi_info to display all of the ORTE-level MCA params.
By consolidating them all into one function, ompi_info can call that function and register the desired variables. This also requires, however, that ompi_info call orte_output_init to avoid generating tons of error messages, so make that adjustment too. 

Fixes ticket #1314

In addition, orte_output has a race condition issue whereby calls to orte_output/verbose can occur prior to either the RML being defined/setup, or the HNP being defined. This latter occurs during the initialization of the orte_process_info structure. In both cases, there is no way orte_output can send the output to the HNP. Hence, the message must be simply output locally.

Fixes ticket #1315

This commit was SVN r18524.
2008-05-28 13:29:58 +00:00
Ralph Castain
5a2992dea2 After some discussion with Jeff, we have determined that the only time orte_output functions can be accessed prior to calling orte_output_init is if someone calls them prior to calling orte_init. This is clearly an error, so we now report that fact to the caller so it can be fixed.
It is still possible that someone can call an orte_output function during orte_finalize - this is not an error. Prior commits ensured that this is correctly handled. This commit only deals with improper calls prior to calling orte_init.

This commit was SVN r18513.
2008-05-27 20:13:04 +00:00
Ralph Castain
be193ead83 Ensure that any use of orte_output prior to calling orte_output_init gets a properly initiated stream tracking object to avoid later segfaults
This commit was SVN r18512.
2008-05-27 19:07:31 +00:00
George Bosilca
1eb1742225 Remove this left over dependency.
This commit was SVN r18508.
2008-05-27 16:57:40 +00:00
Ralph Castain
93d932aa0c Ensure that the display-map and display-allocation outputs get processed through the new OPAL filter framework by passing them through orte_output instead of using the opal_dss.dump function.
This commit was SVN r18507.
2008-05-27 15:46:21 +00:00
Ralph Castain
e190a990ba Do not re-init the orte_output system if we have already finalized it as part of orte_finalize. Instead, default to routing the output to the std opal_output system so that the message still gets out. Of course, such messages cannot be filtered, but they are only for debug purposes by ORTE developers, so this should be a minimal issue.
This commit was SVN r18506.
2008-05-27 15:15:55 +00:00
Ralph Castain
0b2b655de5 Initialize a variable so it can correctly be dealt with at shutdown - fixes trac:1312
This commit was SVN r18505.

The following Trac tickets were found above:
  Ticket 1312 --> https://svn.open-mpi.org/trac/ompi/ticket/1312
2008-05-27 14:53:24 +00:00
Pak Lui
695c158192 silence some intel and pgcc compiler warnings.
This commit was SVN r18501.
2008-05-26 20:35:13 +00:00
Pak Lui
7b3d7dcac4 This commit closes trac:1300.
This commit was SVN r18473.

The following Trac tickets were found above:
  Ticket 1300 --> https://svn.open-mpi.org/trac/ompi/ticket/1300
2008-05-21 22:35:04 +00:00
Terry Dontje
ef7ac86929 created opal_version_string and orte_version_string to match the ompi changes
made in r18345 for ompi_version_string.  This was done per request from Jeff 
Squyres to maintain consistency and to remove some warnings caused by the 
non-use of some static const char.

This commit was SVN r18461.

The following SVN revision numbers were found above:
  r18345 --> open-mpi/ompi@8dd0421015
2008-05-20 12:13:19 +00:00
Jeff Squyres
c546d7bda9 Ensure that the last few duplicates can be shown -- don't shut down
everything until those duplicates are shown

This commit was SVN r18460.
2008-05-20 01:34:55 +00:00
Jeff Squyres
af1a9290cb Ensure to check that opal_init_util() didn't fail.
This commit was SVN r18453.
2008-05-19 11:58:48 +00:00
Jeff Squyres
e88ac13e53 Fixes trac:1290: ensure that we setup the orte_init subsystem before using
it.

This commit was SVN r18448.

The following Trac tickets were found above:
  Ticket 1290 --> https://svn.open-mpi.org/trac/ompi/ticket/1290
2008-05-16 14:32:42 +00:00
Jeff Squyres
28d5f762ca Fixes trac:1289: ensure that if we haven't initialized the orte_output
system, we don't try to use it (e.g., if orte_output or orte_show_help
is called before orte_init).

This commit was SVN r18442.

The following Trac tickets were found above:
  Ticket 1289 --> https://svn.open-mpi.org/trac/ompi/ticket/1289
2008-05-16 02:03:42 +00:00
Josh Hursey
7e8cd20a0a a fix for C/R support
This commit was SVN r18438.
2008-05-14 16:57:37 +00:00
Jeff Squyres
671f0c379d Remove a whole pile of orte/util/show_help.h's that I missed. :-(
This commit was SVN r18437.
2008-05-14 11:32:33 +00:00
Pak Lui
4c8d79d907 Silence the compiler warnings/errors. There is no orte/util/show_help.h
This commit was SVN r18436.
2008-05-13 22:07:38 +00:00
Jeff Squyres
e7ecd56bd2 This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.

= ORTE Job-Level Output Messages =

Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):

 * orte_output(): (and corresponding friends ORTE_OUTPUT,
   orte_output_verbose, etc.)  This function sends the output directly
   to the HNP for processing as part of a job-specific output
   channel.  It supports all the same outputs as opal_output()
   (syslog, file, stdout, stderr), but for stdout/stderr, the output
   is sent to the HNP for processing and output.  More on this below.
 * orte_show_help(): This function is a drop-in-replacement for
   opal_show_help(), with two differences in functionality:
   1. the rendered text help message output is sent to the HNP for
      display (rather than outputting directly into the process' stderr
      stream)
   1. the HNP detects duplicate help messages and does not display them
      (so that you don't see the same error message N times, once from
      each of your N MPI processes); instead, it counts "new" instances
      of the help message and displays a message every ~5 seconds when
      there are new ones ("I got X new copies of the help message...")
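
The duplicate suppression described for orte_show_help() can be sketched as a small table of message keys with a pending-duplicate counter. This is an illustration under assumptions, not the HNP implementation: `help_table_t`, `help_seen`, and `help_flush` are hypothetical names, the key is an arbitrary string, and the roughly 5-second timer that would call `help_flush` is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 32
#define TOPIC_LEN  64

typedef struct {
    char key[TOPIC_LEN];   /* identifies one help message */
    int  pending;          /* duplicates seen since the last summary */
} help_entry_t;

typedef struct {
    help_entry_t entries[MAX_TOPICS];
    int count;
} help_table_t;

/*
 * Returns 1 if the message should be displayed now (first time seen,
 * or the table is full), 0 if it was only counted as a duplicate.
 * Keys are compared on their first TOPIC_LEN-1 characters.
 */
int help_seen(help_table_t *t, const char *key)
{
    assert(t != NULL && key != NULL);
    for (int i = 0; i < t->count; i++) {
        if (strncmp(t->entries[i].key, key, TOPIC_LEN - 1) == 0) {
            t->entries[i].pending++;
            return 0;
        }
    }
    if (t->count < MAX_TOPICS) {
        help_entry_t *e = &t->entries[t->count++];
        snprintf(e->key, sizeof e->key, "%s", key);
        e->pending = 0;
    }
    return 1;
}

/*
 * Meant to be called from a periodic timer. Prints one summary line per
 * message that gained duplicates, resets the counters, and returns the
 * total number of duplicates reported.
 */
int help_flush(help_table_t *t, FILE *out)
{
    int total = 0;

    assert(t != NULL && out != NULL);
    for (int i = 0; i < t->count; i++) {
        if (t->entries[i].pending > 0) {
            fprintf(out, "%d more copies of help message \"%s\"\n",
                    t->entries[i].pending, t->entries[i].key);
            total += t->entries[i].pending;
            t->entries[i].pending = 0;
        }
    }
    return total;
}
```

A fixed-size table keeps the sketch short; displaying the message anyway when the table is full errs on the side of showing output rather than dropping it.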

opal_show_help and opal_output still exist, but they only output in
the current process.  The intent for the new orte_* functions is that
they can apply job-level intelligence to the output.  As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.

=== New code ===

For ORTE and OMPI programmers, here's what you need to do differently
in new code:

 * Do not include opal/util/show_help.h or opal/util/output.h.
   Instead, include orte/util/output.h (this one header file has
   declarations for both the orte_output() series of functions and
   orte_show_help()).
 * Effectively s/opal_output/orte_output/gi throughout your code.
   Note that orte_output_open() takes a slightly different argument
   list (as a way to pass data to the filtering stream -- see below),
   so if you explicitly call opal_output_open(), you'll need to
   slightly adapt to the new signature of orte_output_open().
 * Literally s/opal_show_help/orte_show_help/.  The function signature
   is identical.

=== Notes ===

 * orte_output'ing to stream 0 will do similar to what
   opal_output'ing did, so leaving a hard-coded "0" as the first
   argument is safe.
 * For systems that do not use ORTE's RML or the HNP, the effect of
   orte_output_* and orte_show_help will be identical to their opal
   counterparts (the additional information passed to
   orte_output_open() will be lost!).  Indeed, the orte_* functions
   simply become trivial wrappers to their opal_* counterparts.  Note
   that we have not tested this; the code is simple but it is quite
   possible that we mucked something up.

= Filter Framework =

Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr.  The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations.  The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc.  This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).

Filtering is not active by default.  Filter components must be
specifically requested, such as:

{{{
$ mpirun --mca filter xml ...
}}}

There can only be one filter component active.
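
To make the idea of an XML-style filter concrete, the sketch below wraps a message in a tag that names its origin and escapes the characters that would otherwise break the markup. It is only an assumption-laden illustration: `xml_wrap` is a hypothetical helper, the tag names are invented for the example, and none of this reflects the interface or output of the actual filter component.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append s to out at *pos; returns -1 if it would not fit (with NUL). */
static int append(char *out, size_t outsz, size_t *pos, const char *s)
{
    size_t n = strlen(s);

    if (*pos + n >= outsz) {
        return -1;
    }
    memcpy(out + *pos, s, n);
    *pos += n;
    out[*pos] = '\0';
    return 0;
}

/*
 * Write "<tag>escaped msg</tag>" into out, escaping '&', '<' and '>'.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if the buffer is too small.
 */
int xml_wrap(const char *tag, const char *msg, char *out, size_t outsz)
{
    size_t pos = 0;

    assert(tag != NULL && msg != NULL && out != NULL && outsz > 0);
    out[0] = '\0';
    if (append(out, outsz, &pos, "<") || append(out, outsz, &pos, tag) ||
        append(out, outsz, &pos, ">")) {
        return -1;
    }
    for (const char *p = msg; *p != '\0'; p++) {
        char one[2] = { *p, '\0' };
        const char *piece = one;

        if (*p == '&')      piece = "&amp;";
        else if (*p == '<') piece = "&lt;";
        else if (*p == '>') piece = "&gt;";
        if (append(out, outsz, &pos, piece)) {
            return -1;
        }
    }
    if (append(out, outsz, &pos, "</") || append(out, outsz, &pos, tag) ||
        append(out, outsz, &pos, ">")) {
        return -1;
    }
    return (int)pos;
}
```

A third-party tool reading such a stream could then tell a help message from user stdout by the enclosing tag alone, which is the use case the commit message describes.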

= New MCA Parameters =

The new functionality described above introduces two new MCA
parameters:

 * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
   help messages will be aggregated, as described above.  If set to 0,
   all help messages will be displayed, even if they are duplicates
   (i.e., the original behavior).
 * '''orte_base_show_output_recursions''': An MCA parameter to help
   debug one of the known issues, described below.  It is likely that
   this MCA parameter will disappear before v1.3 final.

= Known Issues =

 * The XML filter component is not complete.  The current output from
   this component is preliminary and not real XML.  A bit more work
   needs to be done in configure.m4 to search for an appropriate XML
   library, link it in, and use it at run time.
 * There are possible recursion loops in the orte_output() and
   orte_show_help() functions -- e.g., if RML send calls orte_output()
   or orte_show_help().  We have some ideas how to fix these, but
   figured that it was ok to commit before feature freeze with known
   issues.  The code currently contains sub-optimal workarounds so
   that this will not be a problem, but it would be good to actually
   solve the problem rather than have hackish workarounds before v1.3 final.

This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
Ralph Castain
313240f2b6 Pass the pointer to a string pointer to the packing function
This commit was SVN r18429.
2008-05-13 01:31:49 +00:00
Shiqing Fan
7ff440f628 Add quotation marks for windows path.
This commit was SVN r18420.
2008-05-09 14:12:09 +00:00
Josh Hursey
da2f1c58e2 Some checkpoint/restart cleanup.
 * Remove the opal_only option. This was suffering from bit rot, and no one uses it. It can be added back fairly easily if wanted.
 * Cleanup metadata interactions at the local level.
 * Touch up some of the INC functionality (fix typos and a minor ordering issue)

This commit was SVN r18416.
2008-05-08 18:47:47 +00:00
Ralph Castain
64ef4102c4 Add the topo mapper module - requires some work in carto for completion.
Little cleanup in round-robin mapper.

This commit was SVN r18412.
2008-05-08 05:09:13 +00:00
Ralph Castain
ac5263613c Fix stupid singletons yet again
This commit was SVN r18408.
2008-05-07 20:26:31 +00:00
George Bosilca
dbea3e070e Correct some copy/paste errors.
This commit was SVN r18396.
2008-05-07 04:04:42 +00:00
Ralph Castain
ff70636024 Allgather_list needs its own tag to avoid conflicting with the allgather modex operation.
All spawned procs must decode the port of the spawning process so they can communicate in direct routed mode.

This fixes comm_spawn for all routing modes.

This commit was SVN r18395.
2008-05-07 03:03:56 +00:00
Josh Hursey
bc67f40936 whoops typo
This commit was SVN r18390.
2008-05-06 22:00:24 +00:00
Josh Hursey
50c909a23d Fix a bit of selection logic. Filem should not fail select if the user decided not to build with any filem components. This matches the logic before the mca_base_select() change.
This commit was SVN r18389.
2008-05-06 21:57:45 +00:00
Pak Lui
108921c020 typo
This commit was SVN r18387.
2008-05-06 21:37:35 +00:00
Pak Lui
0302c098be minor typo
This commit was SVN r18386.
2008-05-06 21:26:17 +00:00
Ralph Castain
d97a4f880d Shift the daemon collective operation to the ODLS framework. Ensure we track the collectives per job to avoid race conditions. Take advantage of the new capabilities of the routed framework to define aggregating trees for the daemon collective, and to track which daemons are participating to handle the case of sparse participation.
Make it all work with comm_spawn in the case of all procs on previously occupied nodes, some new procs on new nodes, and mixtures of the two.

Note: comm_spawn now works with both binomial and linear routed modules. There remains a problem of spawned procs not properly getting updated contact info for the parent proc when run in the direct routed mode...but that's for another day.

This commit was SVN r18385.
2008-05-06 20:16:17 +00:00
Josh Hursey
c47406810e Fix AMCA orted command line.
If no AMCA parameters are passed then do not send across the path information. Only place it on the command line if the AMCA parameter is set.

This commit was SVN r18382.
2008-05-06 18:27:31 +00:00
Josh Hursey
9971bc9d95 Merge in the mca_base_select changes per RFC:
http://www.open-mpi.org/community/lists/devel/2008/04/3779.php

{{{
svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play .
}}}

Any components not in the trunk, but in one of the affected frameworks *must* be
updated. Contact the list, look at the RFC, or look at the diff for how to do this.

Sorry for the early commit of this, but I wanted to get it in today (per RFC) and
didn't know if I would have a chance later today.

This commit was SVN r18381.
2008-05-06 18:08:45 +00:00