1
1

682 Коммитов

Автор SHA1 Сообщение Дата
Jeff Squyres
671f0c379d Remove a whole pile of orte/util/show_help.h's that I missed. :-(
This commit was SVN r18437.
2008-05-14 11:32:33 +00:00
Jeff Squyres
e7ecd56bd2 This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.

= ORTE Job-Level Output Messages =

Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):

 * orte_output(): (and corresponding friends ORTE_OUTPUT,
   orte_output_verbose, etc.)  This function sends the output directly
   to the HNP for processing as part of a job-specific output
   channel.  It supports all the same outputs as opal_output()
   (syslog, file, stdout, stderr), but for stdout/stderr, the output
   is sent to the HNP for processing and output.  More on this below.
 * orte_show_help(): This function is a drop-in-replacement for
   opal_show_help(), with two differences in functionality:
   1. the rendered text help message output is sent to the HNP for
      display (rather than outputting directly into the process' stderr
      stream)
   1. the HNP detects duplicate help messages and does not display them
      (so that you don't see the same error message N times, once from
      each of your N MPI processes); instead, it counts "new" instances
      of the help message and displays a message every ~5 seconds when
      there are new ones ("I got X new copies of the help message...")

opal_show_help and opal_output still exist, but they only output in
the current process.  The intent for the new orte_* functions is that
they can apply job-level intelligence to the output.  As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.

=== New code ===

For ORTE and OMPI programmers, here's what you need to do differently
in new code:

 * Do not include opal/util/show_help.h or opal/util/output.h.
   Instead, include orte/util/output.h (this one header file has
   declarations for both the orte_output() series of functions and
   orte_show_help()).
 * Effectively s/opal_output/orte_output/gi throughout your code.
   Note that orte_output_open() takes a slightly different argument
   list (as a way to pass data to the filtering stream -- see below),
   so you if explicitly call opal_output_open(), you'll need to
   slightly adapt to the new signature of orte_output_open().
 * Literally s/opal_show_help/orte_show_help/.  The function signature
   is identical.

=== Notes ===

 * orte_output'ing to stream 0 will do similar to what
   opal_output'ing did, so leaving a hard-coded "0" as the first
   argument is safe.
 * For systems that do not use ORTE's RML or the HNP, the effect of
   orte_output_* and orte_show_help will be identical to their opal
   counterparts (the additional information passed to
   orte_output_open() will be lost!).  Indeed, the orte_* functions
   simply become trivial wrappers to their opal_* counterparts.  Note
   that we have not tested this; the code is simple but it is quite
   possible that we mucked something up.

= Filter Framework =

Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr.  The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations.  The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc.  This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).

Filtering is not active by default.  Filter components must be
specifically requested, such as:

{{{
$ mpirun --mca filter xml ...
}}}

There can only be one filter component active.

= New MCA Parameters =

The new functionality described above introduces two new MCA
parameters:

 * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
   help messages will be aggregated, as described above.  If set to 0,
   all help messages will be displayed, even if they are duplicates
   (i.e., the original behavior).
 * '''orte_base_show_output_recursions''': An MCA parameter to help
   debug one of the known issues, described below.  It is likely that
   this MCA parameter will disappear before v1.3 final.

= Known Issues =

 * The XML filter component is not complete.  The current output from
   this component is preliminary and not real XML.  A bit more work
   needs to be done to configure.m4 search for an appropriate XML
   library/link it in/use it at run time.
 * There are possible recursion loops in the orte_output() and
   orte_show_help() functions -- e.g., if RML send calls orte_output()
   or orte_show_help().  We have some ideas how to fix these, but
   figured that it was ok to commit before feature freeze with known
   issues.  The code currently contains sub-optimal workarounds so
   that this will not be a problem, but it would be good to actually
   solve the problem rather than have hackish workarounds before v1.3 final.

This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
Rolf vandeVaart
0e32dd1022 Add MPI_Alltoallv to tuned collectives and add a pairwise implementation of MPI_Alltoallv. However, do not change the default behavior for now. The only way to use new pairwise implementation is via mca parameters.
This commit was SVN r18394.
2008-05-07 02:31:24 +00:00
Rich Graham
4d1ae7b05f accidentally made a change in the wrong place.
This commit was SVN r18262.
2008-04-23 17:32:05 +00:00
Rich Graham
293dd6ad4e add myself to list of people building this module.
This commit was SVN r18261.
2008-04-23 17:25:36 +00:00
Rich Graham
7658cc79e4 Pass in the correct module to the reduction call.
This commit was SVN r18260.
2008-04-23 17:23:30 +00:00
Tim Mattox
0215474cb8 Fix two bugs in coll_sm_module.c from bit-rot:
Fixed a selection bug, and removed a bogus "free(proc)" call
which ultimately caused MPI_Finalize to crash.

This commit was SVN r18235.
2008-04-22 18:41:21 +00:00
Rich Graham
df35223603 add selection logic for barrier and reduce.
This commit was SVN r18215.
2008-04-19 22:40:04 +00:00
Rich Graham
bee8b42f29 remove debug code that would not let people run.
Add infrastructure for blocking-barrier.

This commit was SVN r18214.
2008-04-19 01:34:04 +00:00
Rich Graham
6c77fa4921 add a blocking shared memory algorithm.
This commit was SVN r18185.
2008-04-16 22:10:23 +00:00
Rich Graham
249445d61f added reduce-scatter followed by gather to root.
This commit was SVN r18133.
2008-04-11 13:49:08 +00:00
Rich Graham
a6bdbfab97 implement allreduce as reduce-scatter, followed by an allgather.
This commit was SVN r18132.
2008-04-11 04:06:29 +00:00
Rich Graham
70f3aab5f2 remove some code that is not needed.
This commit was SVN r18128.
2008-04-10 17:32:04 +00:00
Rich Graham
5c7db1e315 remove 2 race conditions in the buffer recycling logic.
This commit was SVN r18127.
2008-04-10 17:20:52 +00:00
Edgar Gabriel
4964434205 reverting commit 18122, since the commit was executed accidentally in the
wring directory. The UH copyrights do belong into this file (i.e. because of
the fix which is in the 1.2 branch, the UH copyright notes are in the header
there alreary), but I want to have the proper log for that.  

This commit was SVN r18124.
2008-04-10 15:09:31 +00:00
Edgar Gabriel
f87830767a the verification of recvcount==0 and rank = root was braking
inter-communicator scatter, since the root (root==MPI_ROOT) might very well
have recvcount=0. The same fix has been applied to gather.c just the other way
round. 
 
Fixes the bug reported on the mainling list by Martin Audet. If there is a
1.2.7 this fix might be worthwhile porting it over.

Please note, that while the test works now for basic and for inter, we get a
0byte malloc warning from the inter module, which we still have to fix in a
separate patch.

This commit was SVN r18122.
2008-04-10 14:58:51 +00:00
Rich Graham
c6783549ef getting old
This commit was SVN r18110.
2008-04-09 16:55:16 +00:00
Rich Graham
1a20c3ce51 more debug.
This commit was SVN r18109.
2008-04-09 16:19:52 +00:00
Rich Graham
e7e18303f6 more debug.
This commit was SVN r18108.
2008-04-09 15:10:58 +00:00
Rich Graham
b14c6b17d5 adding debug output.
This commit was SVN r18107.
2008-04-09 13:32:01 +00:00
Rich Graham
10434fb2f1 add barrier synchorinzation at the end of the module init, to
avoid initializing shared memory variables in use.

This commit was SVN r18105.
2008-04-09 03:44:40 +00:00
Rich Graham
19bb1a2e86 fix initialization bug.
This commit was SVN r18104.
2008-04-08 23:34:06 +00:00
Rich Graham
a69a8d9626 initialize the flags.
This commit was SVN r18102.
2008-04-08 22:16:39 +00:00
Rich Graham
8765a2bbdd more debug code.
This commit was SVN r18101.
2008-04-08 20:38:20 +00:00
Rich Graham
08becf33b5 add more debugging.
This commit was SVN r18100.
2008-04-08 18:44:50 +00:00
Rich Graham
aa1b7dd406 more debug
This commit was SVN r18099.
2008-04-08 03:56:47 +00:00
Rich Graham
0c18bdeff7 more debug code.
This commit was SVN r18098.
2008-04-08 03:04:20 +00:00
Rich Graham
9d5a7238df Add some debugging code.
This commit was SVN r18097.
2008-04-07 23:20:15 +00:00
Rich Graham
fa696734d5 add some debug code.
This commit was SVN r18096.
2008-04-07 21:03:23 +00:00
Rich Graham
1b54e8b76e fix buffer management for nb-barrier.
This commit was SVN r18081.
2008-04-05 21:59:04 +00:00
Rich Graham
94f8fd365c a few reduction optimizations. Add bcast.
This commit was SVN r18075.
2008-04-02 19:02:33 +00:00
George Bosilca
a00ca20446 More cleanups.
This commit was SVN r18069.
2008-04-02 06:38:33 +00:00
Rich Graham
eb5d6096f1 add reduction routine - fix buffer recycling logic which was totally
broken.

This commit was SVN r18065.
2008-04-01 22:56:18 +00:00
Rich Graham
90e53ca9ee debug the pipeline algorithm.
This commit was SVN r18008.
2008-03-28 15:10:07 +00:00
Rich Graham
e2ad9c4be2 adjust to change in orte_process_info.
This commit was SVN r17986.
2008-03-27 01:25:28 +00:00
Rich Graham
441fb9fb9e checkpoint.
This commit was SVN r17985.
2008-03-27 01:16:32 +00:00
Ralph Castain
cca449e379 Move an OMPI RML tag to the OMPI layer
This commit was SVN r17950.
2008-03-25 13:30:48 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Rich Graham
a7c836a2b0 fix location of the restrict key word.
Make the tag in the fan-in/fan-out algorithm be fragment based.

This commit was SVN r17903.
2008-03-21 01:40:36 +00:00
Rich Graham
2c66d396b7 take care of some bit-rot with the fanin-fanout method.
This commit was SVN r17902.
2008-03-21 01:08:49 +00:00
Rich Graham
b9520e61dc get the sm optimized allreduce working for all but user defined
operations.  Added to the reduction operations a set of reduction
functions that take 2 input buffers and one output buffer to avoid
some extra memory copies.  These can't be used with user defined
operations.  The intel c collective suite passes both original, and
new (new, not the user defined operations).

This commit was SVN r17901.
2008-03-20 23:51:16 +00:00
Edgar Gabriel
570bbea5e0 fixing the allgather problem reported on the mailing list. The problem was
that at one locatin we had the local-size instead of the remote size as a
receive argument.

This commit was SVN r17849.
2008-03-17 19:42:18 +00:00
Rich Graham
27182afb67 get the timers in correctly.
This commit was SVN r17832.
2008-03-16 03:25:16 +00:00
Rich Graham
afcd1016fd move temp buffer allocation out of the iteration loop - i.e. always use the
same temp loop.  The algorithm is rather synchronous already...

This commit was SVN r17831.
2008-03-16 03:20:46 +00:00
Rich Graham
a1766b29f6 fix some barrier addressing errors.
This commit was SVN r17830.
2008-03-15 22:46:19 +00:00
Rich Graham
0453e7d2f4 bug in management memory allocation - too much memory allocated.
This commit was SVN r17829.
2008-03-15 18:12:20 +00:00
Rich Graham
3c2f1eb8bf reduce the number of temp buffers used.
This commit was SVN r17828.
2008-03-15 17:23:04 +00:00
Rich Graham
0f9d642d51 temp buffer pointers are computed when they are set up. A bit more
efficient, but more important, it is much easier to play around with
memory layout now.

This commit was SVN r17827.
2008-03-15 16:36:35 +00:00
Rich Graham
e3e336b5ab check point
This commit was SVN r17826.
2008-03-15 13:31:21 +00:00
Rich Graham
ebcf928c24 add some diagnostics.
This commit was SVN r17789.
2008-03-07 22:27:41 +00:00