* New "op" MPI layer framework
* Addition of the MPI_REDUCE_LOCAL proposed function (for MPI-2.2)
= Op framework =
Add new "op" framework in the ompi layer. This framework replaces the
hard-coded MPI_Op back-end functions for (MPI_Op, MPI_Datatype) tuples
for pre-defined MPI_Ops, allowing components and modules to provide
the back-end functions. The intent is that components can be written
to take advantage of hardware acceleration (GPU, FPGA, specialized CPU
instructions, etc.). Similar to other frameworks, components are
intended to be able to discover at run-time if they can be used, and
if so, elect themselves to be selected (or disqualify themselves from
selection if they cannot run). If specialized hardware is not
available, there is a default set of functions that will automatically
be used.
This framework is ''not'' used for user-defined MPI_Ops.
The new op framework is similar to the existing coll framework, in
that the final set of function pointers that are used on any given
intrinsic MPI_Op can be a mixed bag of function pointers, potentially
coming from multiple different op modules. This allows for hardware
that only supports some of the operations, not all of them (e.g., a
GPU that only supports single-precision operations).
All the hard-coded back-end MPI_Op functions for (MPI_Op,
MPI_Datatype) tuples still exist, but unlike coll, they're in the
framework base (vs. being in a separate "basic" component) and are
automatically used if no component is found at runtime that provides a
module with the necessary function pointers.
There is an "example" op component that will hopefully be useful to
those writing meaningful op components. It is currently
.ompi_ignore'd so that it doesn't impinge on other developers (it's
somewhat chatty in terms of opal_output() so that you can tell when
its functions have been invoked). See the README file in the example
op component directory. Developers of new op components are
encouraged to look at the following wiki pages:
https://svn.open-mpi.org/trac/ompi/wiki/devel/Autogenhttps://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponenthttps://svn.open-mpi.org/trac/ompi/wiki/devel/CreateFramework
= MPI_REDUCE_LOCAL =
Part of the MPI-2.2 proposal listed here:
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/24
is to add a new function named MPI_REDUCE_LOCAL. It is very easy to
implement, so I added it (also because it makes testing the op
framework pretty easy -- you can do it in serial rather than via
parallel reductions). There's even a man page!
This commit was SVN r20280.
corrections to non-windows files (but within ifdef __WINDOWS__)
type casts, event library for windows use win32.
in orte runtime, add windows sockets handling and object construction.
This commit was SVN r20110.
is >= 1. The default value of the MCA param is now -1, which means
"let someone else turn it on if they want to." So we should default
to ''off'' (false), and let the openib BTL (etc.) turn it on if it
can/wants to.
Failure to do this will default _pipeline to true because
-1(int)==true(bool). This causes a problem if the user tries to set
mpi_leave_pinned_pipeline to 1: they'll get a warning that you can't
set both _pinned and _pinned_pipeline to 1. This happens because
_pinned will get the bool-ified value of of the MCA parameter (-1),
and then the user sets the value of _pinned_pipeline to 1/true.
Hence, both of them are set to true. Bzzt!
This commit was SVN r19942.
1. register "mpi_preconnect_all" as a deprecated synonym for "mpi_preconnect_mpi"
2. remove "mpi_preconnect_oob" and "mpi_preconnect_oob_simultaneous" as these are no longer valid.
3. remove the routed framework's "warmup_routes" API. With the removal of the direct routed component, this function at best only wasted communications. The daemon routes are completely "warmed up" during launch, so having MPI procs order the sending of additional messages is simply wasteful.
4. remove the call to orte_routed.warmup_routes from MPI_Init. This was the only place it was used anyway.
The FAQs will be updated to reflect this changed situation, and a CMR filed to move this to the 1.3 branch.
This commit was SVN r19933.
THREAD_MULTIPLE. There's a new (hidden) MCA parameter to re-enable
these BTLs in the presence of THREAD_MULTIPLE:
btl_base_thread_multiple_override. This MCA parameter should ''only''
be used by developers who are working on make their BTLs thread safe;
it should ''not'' be used by end-users!
This commit was SVN r19826.
The following Trac tickets were found above:
Ticket 1588 --> https://svn.open-mpi.org/trac/ompi/ticket/1588
There is still a problem with OpenIB and threads (external to C/R functionality). It has been reported in Ticket #1539
Additionally:
* Fix a file cleanup bug in CRS Base.
* Fix a possible deadlock in the TCP ft_event function
* Add a mca_base_param_deregister() function to MCA base
* Add whole process checkpoint timers
* Add support for BTL: OpenIB, MX, Shared Memory
* Add support Mpool: rdma, sm
* Sundry bounds checking an cleanup in some scattered functions
This commit was SVN r19756.
figure it out at runtime (really meaning: we'll still default to "0"
unless something explicitly overrides to 1, such as the openib BTL).
This way, ompi_info doesn't confusingly report mpi_leave_pinned==0 for
mpi_leave_pinned, but we end up running with mpi_leave_pinned==1.
Fixes trac:1502.
This commit was SVN r19571.
The following Trac tickets were found above:
Ticket 1502 --> https://svn.open-mpi.org/trac/ompi/ticket/1502
need to clean them up during ompi_mpi_finalize(), though...
This commit was SVN r19462.
The following SVN revision numbers were found above:
r19458 --> open-mpi/ompi@697dc524c1
for the F90 type create functions to the requirements of MPI 2.1 standard.
Advice to implementors. An application may often repeat a call to
MPI_TYPE_CREATE_F90_xxxx with the same combination of (xxxx,p,r).
The application is not allowed to free the returned predefined, unnamed
datatype handles. To prevent the creation of a potentially huge amount of
handles, the MPI implementation should return the same datatype handle for
the same (REAL/COMPLEX/INTEGER,p,r) combination. Checking for the
combination (p,r) in the preceding call to MPI_TYPE_CREATE_F90_xxxx and
using a hash-table to find formerly generated handles should limit the
overhead of finding a previously generated datatype with same combination
of (xxxx,p,r). (End of advice to implementors.)
This commit fixes trac:1239, and #712.
This commit was SVN r19458.
The following Trac tickets were found above:
Ticket 1239 --> https://svn.open-mpi.org/trac/ompi/ticket/1239
See opal/mca/paffinity/paffinity.h for explanation as to the physical vs logical nature of the params used in the API.
Fixes trac:1435
This commit was SVN r19391.
The following Trac tickets were found above:
Ticket 1435 --> https://svn.open-mpi.org/trac/ompi/ticket/1435
* use "warn_on_fork" instead of "do_not_warn_on_fork" -- i.e.,
use positive logic instead of negative logic
* ensure that pthread_atfork() is only called once
* amended the error message to include the hostname, PID, and
MPI_COMM_WORLD rank of the offender
* ensure that the warn_fork_cb() function is only defined if
HAVE_PTHREAD_H so that we don't get a compiler warning if it isn't
used
This commit was SVN r19204.
The following SVN revision numbers were found above:
r19196 --> open-mpi/ompi@277e4ac292
This is setup so that it only is issued once (as opposed to every time they do it), and goes through orte_show_help so the user doesn't get hammered by #procs copies of the warning. In addition, there is a new MCA param (can't have too many!) to shut the warning off altogether.
This closes ticket #1244
This commit was SVN r19196.
set when it launches under debuggers using the --debug option.
This commit was SVN r19116.
The following Trac tickets were found above:
Ticket 1361 --> https://svn.open-mpi.org/trac/ompi/ticket/1361
put the name of the file that set them if they were set by file. This is of great assistance to support personnel trying to understand why a user is having pro
blems.
Coordinated with Jeff.
This commit was SVN r19111.
* add "register" function to mca_base_component_t
* converted coll:basic and paffinity:linux and paffinity:solaris to
use this function
* we'll convert the rest over time (I'll file a ticket once all
this is committed)
* add 32 bytes of "reserved" space to the end of mca_base_component_t
and mca_base_component_data_2_0_0_t to make future upgrades
[slightly] easier
* new mca_base_component_t size: 196 bytes
* new mca_base_component_data_2_0_0_t size: 36 bytes
* MCA base version bumped to v2.0
* '''We now refuse to load components that are not MCA v2.0.x'''
* all MCA frameworks versions bumped to v2.0
* be a little more explicit about version numbers in the MCA base
* add big comment in mca.h about versioning philosophy
This commit was SVN r19073.
The following Trac tickets were found above:
Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392
Short version: remove opal_paffinity_alone and restore
mpi_paffinity_alone. ORTE makes various information available for the
MPI layer to decide what it wants to do in terms of processor
affinity.
Details:
* remove opal_paffinity_alone MCA param; restore mpi_paffinity_alone
MCA param
* move opal_paffinity_slot_list param registration to paffinity base
* ompi_mpi_init() calls opal_paffinity_base_slot_list_set(); if that
succeeds use that. If no slot list was set, see if
mpi_paffinity_alone was set. If so, bind this process to its Node
Local Rank (NLR). The NLR is the ORTE-maintained slot ID; if you
COMM_SPAWN to a host in this ORTE universe that already has procs
on it, the NLR for the new job will start at N (not 0). So this is
slightly better than mpi_paffinity_alone in the v1.2 series.
* If a slot list is specified *and* mpi_paffinity_alone is set, we
display an error and abort.
* Remove calls from rmaps/rank_file component to register and lookup
opal_paffinity mca params.
* Remove code in orte/odls that set affinities - instead, have them
just pass a slot_list if it exists.
* Cleanup the orte/odls code that determined
oversubscribed/want_processor as these were just opposites of each
other.
This commit was SVN r18874.
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
Lenny and I went back and forth on whether we should simply register
another "mpi_paffinity_alone" MCA param and then try to figure out
which one was set in ompi_mpi_init, but there was difficulty in
figuring out what to do. So it seemed like the Right Thing to do was
to implement what was committed in r18770; then we could tell where
MCA parameters were set from and you could do Better Things (this is
also useful in the openib BTL, where parameters can be set either via
MCA parameter or via an INI file).
But after that was done, it seemed only a few steps further to
actually implement two new features in the MCA params area:
* Synonyms (where one MCA param name is a synonym for another)
* Allow MCA params and/or their synonyms to be marked as "deprecated"
(printing out warnings if they are used)
These features have actually long been discussed/desired, and I had
some time in airports and airplanes recently where I could work in
this stuff on a standalone laptop. So I did it. :-)
This commit introduces these two new features, and then uses them to
register mpi_paffinity_alone as a non-deprecated synonym for
opal_paffinity_alone. A few other random points in this commit:
* Add a few error checks for conditions that were not checked before
* Correct some comments in mca_base_params.h
* Add a few comments in strategic places
* ompi_info now prints additional information:
* for any MCA parameter that has synonyms, it lists all the
synonyms
* synonyms are also output as 1st-class MCA params, but with an
additional attribute indicating that they have a "parent"
* all MCA param name (both "real" or "synonym") will output an
attribute indicating whether it is deprecated or not. A synonym
is deprecated if it iself is marked as deprecated (via the
mca_base_param_regist_syn() or mca_base_param_register_syn_name()
functions) or if its "parent" MCA parameter is deprecated
This commit was SVN r18859.
The following SVN revision numbers were found above:
r18770 --> open-mpi/ompi@8efe67e08c
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
Update the ESS API so we can update the stored arch's should the modex include that info. Update ompi/proc to check/set the arch for remote procs, and add that function call to mpi_init right after the modex is done.
Setup to allow other grpcomm modules to decide whether or not to add the arch to the modex, and to detect if other entries have been made. If not, then the modex can just fall through. Begin setting up some logic in the "basic" module to handle different arch situations.
For now, default to the "bad" module so we will work in all situations, even though we may be sending around more info than we really require.
This fixes ticket #1340
This commit was SVN r18673.
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.
This commit was SVN r18664.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.
I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.
This commit was SVN r18619.
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs.
Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node.
This commit was SVN r18338.
The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge.
The fix is to have the OMPI level shutdown tcp connections, allow the ORTE level to restart, and then allow the OMPi level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.
Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.
* Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level.
* Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
* Update ft_event functions in PML and BML to handle the new restart state.
* Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.
This commit was SVN r18276.
selected, but before we check whether we have been spawned. This is necessary
in order for the hierarch collective component to work. This component might
create new communicators already in MPI_Init(), which then have to execute the
dpm.mark_dyncomm function. If dpm is not initialized at that point, we
segfault.
This commit was SVN r18045.
1. applied prefix rule to functions and variables of RMAPS rank_file component
2. cleaned ompi_mpi_init.c from paffinity code
3. paffinity code moved to new opal/mca/paffinity/base/paffinity_base_service.c file
4. added opal_paffinity_slot_list mca parameter
This commit was SVN r18019.
The bug was a race condition in the barrier operation that caused the barrier in MPI_Finalize to fail on very short programs.
Scalaiblity was improved by using the daemons to aggregate modex and barrier messages before sending them to the rank=0 proc. Improvement is proportional to ppn, of course, but there really wasn't a scaling problem at low ppn anyway. This modification also paves the way for better allgather operations since now all the data for each node is sitting at the daemon level, and the daemons are now aware that a collective operation on the OOB is underway (so they -can- participate in a collective of their own to support it).
Also added better diagnostics to map out the timing associated with MPI_Init - turned on by -mca orte_timing 1.
This commit was SVN r17988.
"all", not just the first 3 chars (i.e., if someone sets the value
"allfoo", we should still error).
This commit was SVN r17981.
The following SVN revision numbers were found above:
r17980 --> open-mpi/ompi@b3ef774d46