The rdmacm event handler has no way of reporting fatal errors to the upper
layers. By calling mca_btl_openib_endpoint_invoke_error in the rdmacm event
handler for the errors encountered, these errors can now be handled
appropriately.
Closes out Ticket #1283
This commit was SVN r18980.
environment, file, or API override).
Refs trac:1397
This commit was SVN r18943.
The following Trac tickets were found above:
Ticket 1397 --> https://svn.open-mpi.org/trac/ompi/ticket/1397
Short version: remove opal_paffinity_alone and restore
mpi_paffinity_alone. ORTE makes various information available for the
MPI layer to decide what it wants to do in terms of processor
affinity.
Details:
* remove opal_paffinity_alone MCA param; restore mpi_paffinity_alone
MCA param
* move opal_paffinity_slot_list param registration to paffinity base
* ompi_mpi_init() calls opal_paffinity_base_slot_list_set(); if that
succeeds use that. If no slot list was set, see if
mpi_paffinity_alone was set. If so, bind this process to its Node
Local Rank (NLR). The NLR is the ORTE-maintained slot ID; if you
COMM_SPAWN to a host in this ORTE universe that already has procs
on it, the NLR for the new job will start at N (not 0). So this is
slightly better than mpi_paffinity_alone in the v1.2 series.
* If a slot list is specified *and* mpi_paffinity_alone is set, we
display an error and abort.
* Remove calls from rmaps/rank_file component to register and lookup
opal_paffinity mca params.
* Remove code in orte/odls that set affinities - instead, have them
just pass a slot_list if it exists.
* Cleanup the orte/odls code that determined
oversubscribed/want_processor as these were just opposites of each
other.
This commit was SVN r18874.
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
If IBCM was explicitly specified with exclude/include parameter,
OpenIB BTL will enable verbose report for "/dev/infiniband/ucm" error,
other way the error will not be reported.
This commit was SVN r18868.
Lenny and I went back and forth on whether we should simply register
another "mpi_paffinity_alone" MCA param and then try to figure out
which one was set in ompi_mpi_init, but there was difficulty in
figuring out what to do. So it seemed like the Right Thing to do was
to implement what was committed in r18770; then we could tell where
MCA parameters were set from and you could do Better Things (this is
also useful in the openib BTL, where parameters can be set either via
MCA parameter or via an INI file).
But after that was done, it seemed only a few steps further to
actually implement two new features in the MCA params area:
* Synonyms (where one MCA param name is a synonym for another)
* Allow MCA params and/or their synonyms to be marked as "deprecated"
(printing out warnings if they are used)
These features have actually long been discussed/desired, and I had
some time in airports and airplanes recently where I could work in
this stuff on a standalone laptop. So I did it. :-)
This commit introduces these two new features, and then uses them to
register mpi_paffinity_alone as a non-deprecated synonym for
opal_paffinity_alone. A few other random points in this commit:
* Add a few error checks for conditions that were not checked before
* Correct some comments in mca_base_params.h
* Add a few comments in strategic places
* ompi_info now prints additional information:
* for any MCA parameter that has synonyms, it lists all the
synonyms
* synonyms are also output as 1st-class MCA params, but with an
additional attribute indicating that they have a "parent"
* all MCA param name (both "real" or "synonym") will output an
attribute indicating whether it is deprecated or not. A synonym
is deprecated if it iself is marked as deprecated (via the
mca_base_param_regist_syn() or mca_base_param_register_syn_name()
functions) or if its "parent" MCA parameter is deprecated
This commit was SVN r18859.
The following SVN revision numbers were found above:
r18770 --> open-mpi/ompi@8efe67e08c
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
first to the trunk. So, here is the trunk checkin:
The call to orte_show_help() to notify truncation of the max_inline value
was missing the want_error_header boolean, which eventually results in
a SEGV. This change corrects the call with the bool set to true.
This commit was SVN r18839.
These are mostly long additions to comments to document what is going on and why, and how/where it may be revised in the future. Just a couple of small, but important, changes to the code itself.
This commit was SVN r18827.
see how the next gen panasas stuff does in terms of warnings; we can
always re-merge this later if we want to. It's just easier if we have
as little OMPI-specific code as possible (particularly when we know
that the panasas code has some big changes coming).
This commit was SVN r18823.
The following SVN revision numbers were found above:
r17543 --> open-mpi/ompi@b4ec81a9fd
already, and we're just about to do a ROMIO version refresh -- so the
less OMPI-specific code we have (e.g., indenting and whatnot), the
better.
Refs trac:1370.
This commit was SVN r18821.
The following SVN revision numbers were found above:
r16691 --> open-mpi/ompi@8dca19cb3b
r16693 --> open-mpi/ompi@037a533752
The following Trac tickets were found above:
Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
ROMIO in Open MPI (the new version of ROMIO will make this patch
defunct, and David Daniel has confirmed that no one at LANL is using
this functionality, anyway).
Refs trac:1370.
This commit was SVN r18819.
The following Trac tickets were found above:
Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
hierarch disables itself now if the pml module used is *not* ob1. The reason
is, that the multi-level hierarchy detection algorithm checks the names of the
btl modules used. In case there are no btl's, we would segfault.
Furthermore, three minor changes:
- the 2-level hierarchy detection is now the default (sm vs. everything else
in the world).
- add udapl to the list of protocols checked for by the multi-level hierarch detection
- some of the verbose statements of hierarch were inaccurate. Fixed those comments/messages.
This commit was SVN r18817.
With help from Brian, modify the ompi/proc/proc.c code to be more thread-safe. Remove the list operations from the ompi_proc_t constructor and destructor. Insert list appends to ompi_proc_init and ompi_proc_find_and_add as required, and protect those with thread locks. Let only the ompi_proc_finalize function actually remove objects from the ompi_proc_list.
Cleanup a few places where functions might return without unlocking a thread. Ensure the ompi_proc_world also does an OBJ_RETAIN so that the reference count on any subsequently released object is correct.
This commit was SVN r18816.
1. repair of the linear and direct routed modules
2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI
3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature
4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.
Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB
5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.
Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.
6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch
7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch
8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies
9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.
10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon
There is still more cleanup to be done for efficiency, but this at least works.
Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.
Fixes ticket #1256
This commit was SVN r18804.
* Move the passive side QP move to RTS to before we send the reply
(vs. sending it after we get the RTU). A lengthy comment explains
the need for this.
* Add some timers to the code for analyzing where time is spent.
* Clarify a few error messages.
* Currently have a loop around ib_cm_listen() because sometimes it
fails for seeminly no reason. Have pending e-mails in to Sean
Hefty to see if we can figure out why this is happening. Note that
the more MPI processes you add, the more likely this error is to
occur (e.g., ran 720 processes and it happens at least 50% of the
time). This makes IBCM somewhat unattractive for general use;
hopefully we can get a fix...
This commit was SVN r18766.
endpoint.c) because it's almost identical in all the CPC's. OOB,
XOOB, and IBCM now all invoke the btl error handler properly if
there's an error during wireup. RDMACM still needs to be done.
This commit was SVN r18764.
The following SVN revision numbers were found above:
r18762 --> open-mpi/ompi@3eda04578f
upper level btl (and therefore the PML) when something goes wrong
during wireup.
Refs trac:1283.
This commit was SVN r18762.
The following Trac tickets were found above:
Ticket 1283 --> https://svn.open-mpi.org/trac/ompi/ticket/1283
* Properly handle non-symmetric subnet ID's
* Be a bit more stringent when checking for the GID
* Add lots of BTL_VERBOSE's for diagnostics
This commit was SVN r18754.
The issue is that the field mca_topo_base_comm_t->mtc_periods_or_edges
has a different length, depending on whether the communicator is a
graph or a cart. One of the comm dup functions always assumed that it
was the length required by graph comms, which could lead to badness in
some cases. This commit makes the legnth of that field on a comm dup
be the proper length and copies the data over appropriately.
I also changed the syntax of the ompi_comm_copy_topo() function to use
shorter pointer notation; it made the code much easier to read and
fix.
This commit was SVN r18752.
The following Trac tickets were found above:
Ticket 1345 --> https://svn.open-mpi.org/trac/ompi/ticket/1345
as the send and the put share the same execution path and in the case of the
put the MCA_BTL_DES_SEND_ALWAYS_CALLBACK was not set.
This commit was SVN r18735.
ompi_info will crash due to a reference to a unalloced struct. Add a check for
this struct to prevent this crash.
This commit was SVN r18727.
The following SVN revision numbers were found above:
r18725 --> open-mpi/ompi@c2351d39f1
lists of flags for exec. ' ' is a magic value that means match all
white space, and trim leading / trailing whitespace. Will prevent
many spurious arguments to underlying compiler.
This commit was SVN r18726.
If rdmacm cpc is disabled and an iwarp device is forced to be used, then
mca_btl_openib_get_iwarp_subnet_id will attempt to reference an uninitalized
list. Modify the code to check for the uninitalized list and return zero.
Also, fix a memory leak caused by not freeing alloced addresses structs during
shutdown.
This commit was SVN r18725.
the fragments that failed to be send, there is no need to replicate the same
mechanism in the BTL.
Force the SM BTL to empty all ack fragments in the component progress function.
This commit was SVN r18724.
INI file use_eager_rdma value (fixes trac:1169)
* fixed a typo in a MCA param help message
* made the check for enabling short/eager RDMA more robust in the
presence of progress threads; it now emits a show_help warning
This commit was SVN r18723.
The following Trac tickets were found above:
Ticket 1169 --> https://svn.open-mpi.org/trac/ompi/ticket/1169
specified, probe for max value supported by device.
This commit was SVN r18720.
The following Trac tickets were found above:
Ticket 1355 --> https://svn.open-mpi.org/trac/ompi/ticket/1355
of strings. We mostly did the Right Things already; I simplified the
code a bit and also had us not write to more characters in the C
bindings than we're supposed to (per language in the MPI-2.1 spec).
Fixes trac:1238.
This commit was SVN r18705.
The following Trac tickets were found above:
Ticket 1238 --> https://svn.open-mpi.org/trac/ompi/ticket/1238
* Ensure to destroy the correct QP (local->id[num]->qp will always
have a valid pointer in it, even if we setup a dummy qp)
* Note two notable places where we need to figure out how to
propagate errors up from the CPC to the main BTL / PML when errors
occur. Probably have the same issue in IBCM, too.
This commit was SVN r18700.
two processes on the same server (!). So for today, we'll simply mark
all local processes that use iWARP adapters as "unreachable".
More details in #1352.
This commit was SVN r18699.
(e.g., if you're excluding some devices, their destructors will be
invoked before the async event thread was setup for them).
This commit was SVN r18698.
multiple adapters (eg., Chelsio T3).
But we need to figure out how to determine a good value for the
resident adapter(s) at runtime. It's problematic because, for
example, Mellanox ConnectX and Chelsio T3 report max_inline values
differently at run-time. If you ibv_create_qp with a max_inline value
of 0, ConnectX reports back a value that is a formular based on a few
other values (e.g., max_send_sge and max_recv_sge). But T3 always
reports back "64".
We're looking into this to figure out the best way -- reducing the
default right now should allow other adapters to run while we figure
it out.
This commit was SVN r18697.
* Put the variable in the MPI namespace; keeps it safely segregated
from user apps
* Need to actually "extern" the variable to make the compiler not
complain that the variable is never referenced
This commit was SVN r18693.
the shared object when it is compiled with the Sun Studio C compiler.
Depends on where the extern variables that are included in the headers were
initialized, there can be instances wheter there is no storage allocated
for the variables and therefore the symbols may or may not be defined
when the debugger tries to dlopen this message queue dll.
This commit was SVN r18685.
the char string ident in the C++ library be non-static so that other
places can see it. This makes the C++ library version string
analogous to all the other version strings.
This commit was SVN r18679.
Update the ESS API so we can update the stored arch's should the modex include that info. Update ompi/proc to check/set the arch for remote procs, and add that function call to mpi_init right after the modex is done.
Setup to allow other grpcomm modules to decide whether or not to add the arch to the modex, and to detect if other entries have been made. If not, then the modex can just fall through. Begin setting up some logic in the "basic" module to handle different arch situations.
For now, default to the "bad" module so we will work in all situations, even though we may be sending around more info than we really require.
This fixes ticket #1340
This commit was SVN r18673.
(still need the real yod environment variable name), and add in some
lengthy comments explaining the two different methods of debugger
attach that we're using.
MPI handle debugging doesn't seem to be working at the moment; still
checking into it...
This commit was SVN r18672.
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.
This commit was SVN r18664.
is still to use the C based wrapper compilers (which have many more features
and are more well tested). The Perl compilers are enabled with the option
--enable-script-wrapper-compilers, which also ignores the option
--disable-binaries (ie --enable-script-wrapper-compilers --disable-binaries
will result in perl-based wrapper compilers being installed, but no other
binaries being installed).
This commit was SVN r18655.
a standalone library named libopenmpi-malloc. Users wanting to
use leave_pinned with ptmalloc2 will now need to link the library
into their application explicitly. All other users will use the
libc-provided allocator instead of Open MPI's ptmalloc2. This change
may be overriden with the configure option enable-ptmalloc2-internal
- The leave_pinned options will now default to using mallopt on
Linux in the cases where ptmalloc2 was not linked in. mallopt
will also only be available if munmap can be intercepted (the
default whenever Open MPI is not compiled with --without-memory-
manager.
- Open MPI will now complain and refuse to use leave_pinned if
no memory intercept / mallopt option is available.
This commit was SVN r18654.
This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something.
So the ticket can be checked by Jeff when he returns from vacation... :-)
This commit was SVN r18625.
The following Trac tickets were found above:
Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.
I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.
This commit was SVN r18619.
* Add specific comments about why we're not setting MPI_ERROR here
This commit was SVN r18616.
The following SVN revision numbers were found above:
r18067 --> open-mpi/ompi@58e31d767e
that can *sometimes* cause problems with "make -j [N>1] install".
Ensure to make the target directory before we copy stuff into it --
read the thread starting here for more details:
http://www.open-mpi.org/community/lists/devel/2008/06/4080.php
This commit was SVN r18570.
We already show_help when we fail to create queues, so I just made the
message a little more verbose such that it may be that OMPI is trying
to use a feature that is not supported on the hardware.
This commit was SVN r18553.
The following Trac tickets were found above:
Ticket 1121 --> https://svn.open-mpi.org/trac/ompi/ticket/1121
1. The send path get shorter. The BTL is allowed to return > 0 to specify that the
descriptor was pushed to the networks, and that the memory attached to it is
available again for the upper layer. The MCA_BTL_DES_SEND_ALWAYS_CALLBACK flag
can be used by the PML to force the BTL to always trigger the callback.
Unmodified BTL will continue to work as expected, as they will return OMPI_SUCCESS
which force the PML to have exactly the same behavior as before. Some BTLs have
been modified: self, sm, tcp, mx.
2. Add send immediate interface to BTL.
The idea is to have a mechanism of allowing the BTL to take advantage of
send optimizations such as the ability to deliver data "inline". Some
network APIs such as Portals allow data to be sent using a "thin" event
without packing data into a memory descriptor. This interface change
allows the BTL to use such capabilities and allows for other optimizations
in the future. All existing BTLs except for Portals and sm have this interface
set to NULL.
This commit was SVN r18551.
than using a single receive callback followed by a switch on the header.
Also fast pathed the matching for small fragments.
This commit was SVN r18549.
that it's a directory. That's good enough to know that the
OpenFabrics kernel drivers have been loaded. If you have no RDMA
devices and don't want to see the OMPI warning about not finding any
devices, then don't start the OpenFabrics kernel drivers.
This commit was SVN r18540.
non-empty. If not, then exit the openib btl silently. This addresses
the case where libibverbs is installed (which is getting more common)
and therefore the openib BTL was built/installed, but the kernel
drivers are not loaded (assumedly because there is no RDMA hardware
present). In this case, "mpirun a.out" will not issue a warning.
There appears to be no good way to definitely tell if there are no
RDMA hardware devices present. For example, if libibverbs/the openib
BTL is installed, there are no RDMA devices present, but the RDMA
hardware kernel drivers ''are'' loaded, OMPI will warn that it was
unable to find suitable devices. This warning is easily eliminated by
unloading the kernel drivers.
This commit was SVN r18530.
The following Trac tickets were found above:
Ticket 1305 --> https://svn.open-mpi.org/trac/ompi/ticket/1305
By consolidating them all into one function, ompi_info can call that function and register the desired variables. This also requires, however, that ompi_info call orte_output_init to avoid generating tons of error messages, so make that adjustment too.
Fixes ticket #1314
In addition, orte_output has a race condition issue whereby calls to orte_output/verbose can occur prior to either the RML being defined/setup, or the HNP being defined. This latter occurs during the initialization of the orte_process_info structure. In both cases, there is no way orte_output can send the output to the HNP. Hence, the message must be simply output locally.
Fixes ticket #1315
This commit was SVN r18524.
* s/port/tcp_port/g where relevant to disambiguate TCP port from
device port
* Rework ipaddrcheck to make it work in the LMC>0 case
This commit was SVN r18482.
The following Trac tickets were found above:
Ticket 1281 --> https://svn.open-mpi.org/trac/ompi/ticket/1281
* Ensure _iwarp.h is always included, or you'll get warnings on
platforms that don't have the RDMACM
* Add skeleton for function descriptions in comments in iwarp.h
This commit was SVN r18477.
opal_ifnext() return -1 upon completion); don't check it against
opal_ifcount() -- the interface indexes aren't necessarily related to
how many interfaces were found.
This commit was SVN r18476.
with something with a different size ... well we segfault. The reason was
that the logic in the PML OB1 call the convertor based on the length
of he data on the wire and not the length of the data that the receiver
expects.
In other words, this is only half a patch :) It fix the problem, but we
still have to make sure the unpack is not called at all when the receiver
expect ZERO bytes.
This commit was SVN r18474.
This commit has the same commit message as r18450, but without the
extra bonus memory corruption that was introduced.
This commit was SVN r18467.
The following SVN revision numbers were found above:
r18450 --> open-mpi/ompi@5295902ebe
The following Trac tickets were found above:
Ticket 1285 --> https://svn.open-mpi.org/trac/ompi/ticket/1285
However, no decision logic is changed by this commit so default behavior has not changed. This
is only selectable by runtime parameters.
This commit was SVN r18464.
Need to release the items and the item list after selecting the collective
modules that are being used. Reviewed by Jeff Squyres.
This commit was SVN r18457.
1. We can't use orte_output in the CPC service thread because orte is
not thread safe
1. Use the macro version sso that they're compiled out of production
builds
This commit was SVN r18455.
* allow receive_queues to be specified in the INI file
* detect when multiple different receive_queues are specified and
gracefully abort
However, accomplishing these goals ran into multiple difficulties. By
putting receive_queues in the INI file:
1. we may not find the value until we've already traversed multiple HCAs
1. we may find multiple different receive_queues values
But since the openib btl initializes as it discovers each HCA/port/LID
(including the BSRQ data), if we find a new receive_queues value late
in the discovery process, then all the BSRQ data that was previously
initialized will likely be invalid. So I had to pull all the BSRQ
initialization out until after the rest of the discovery /
initialization process.
Additionally, note that if the user specifies the MCA parameter
btl_openib_receive_queues, it trumps whatever was in the INI file. So
in this case, there can never be a receive_queues conflict. This
commit does the following (Jon wrote part of this, too):
* adapt _ini.c to accept the "receive_queues" field in the file
* move 90% of _setup_qps() from _ini.c to _component.c
* move what was left of _setup_qps() into the main
_register_mca_params() function
* adapt init_one_hca() to detect conflicting receive_queues values
from the INI file
* after the _component.c loop calling init_one_hca():
* call setup_qps() to parse the final receive_queues string value
* traverse all resulting btls and initialize their HCAs (if they
weren't already): setup some lists and call prepare_hca_for_use()
I tested this code on a dual-HCA system where I artificially put in
differing receive_queues values in the INI file for the two different
types of HCAs that I have and it all seemed to work.
This commit was SVN r18450.
The following Trac tickets were found above:
Ticket 1285 --> https://svn.open-mpi.org/trac/ompi/ticket/1285
There are more details in the code regarding how to use this feature.
Also shift a few of the orte_output back to opal_output. I'm experiencing an odd problem with locks in the oob/tcp when using orte_output. I haven't had time to track it down yet.
This commit was SVN r18439.
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.