* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.
supposed to. I.e., half-baked/not complete stuff.
This commit backs out all of r25545. Sorry folks!
This commit was SVN r25546.
The following SVN revision numbers were found above:
r25545 --> open-mpi/ompi@7f9ae11faf
to make MPI_IN_PLACE (and other sentinel Fortran constants) work on OS
X, we need to use the following compiler (linker) flag:
-Wl,-commons,use_dylibs
So if we're compiling on OS X, test to see if that flag works with the
compiler. If so, add it to the wrapper FFLAGS and FCFLAGS (note that
per a future update, we'll only have one Fortran compiler anyway).
Fixes trac:1982.
This commit was SVN r25545.
The following Trac tickets were found above:
Ticket 1982 --> https://svn.open-mpi.org/trac/ompi/ticket/1982
After talking with Brian, we're pretty sure that this is only because
really, really old libevent didn't allow bitwise or-ing of the other
loop types, because what we really need is (EVLOOP_ONCE |
EVLOOP_NONBLOCK). And that's what EVLOOP_ONELOOP did (i.e., we
changed the logic of libevent's event.c to let ONELOOP do both ONCE
and NONBLOCK things).
In the new libevent version, we didn't implement EVLOOP_ONELOOP
properly. As a result, and we got hangs in the SM BTL add_procs
function. Note that the SM BTL wasn't to blame -- it was purely a
side-effect of bad ONELOOP integration (i.e., if you got past the SM
BTL add_procs, you may well have hung somewhere else).
This commit removes all ONELOOP customizations from event.c and
returns it to (almost) its original state from the libevent 2.0.7-rc
distribution. Everwhere in the code base where we used ONELOOP, we
now use (ONCE | NONBLOCK).
This commit was SVN r23957.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.
Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.
Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.
I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:
1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)
2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.
There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.
This commit was SVN r23925.
MPI_INIT and start of MPI_FINALIZE.
* Clean up MPI Extensions build system to acknowledge that OMPI's the only
project with extensions, as well as remove some build artifacts necessary
for more general components.
This commit was SVN r23616.
Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.
The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.
This commit was SVN r22824.
1. replacing mpi_paffinity_alone with opal_paffinity_alone - for back-compatibility, I have aliased mpi_paffinity_alone to the new param name. This caus
es a mild abstraction break in the opal/mca/paffinity framework - per the devel discussion...live with it. :-) I also moved the ompi_xxx global variable
that tracked maffinity setup so it could be properly closed in MPI_Finalize to the opal/mca/maffinity framework to avoid an abstraction break.
2. Added code to the odls/default module to perform paffinity binding and maffinity init between process fork and exec. This has been tested on IU's odi
n cluster and works for both MPI and non-MPI apps.
3. Revise MPI_Init to detect if affinity has already been set, and to attempt to set it if not already done. I have *not* tested this as I haven't yet f
igured out a way to do so - I couldn't get slurm to perform cpu bindings, even though it supposedly does do so.
This has only been lightly tested and would definitely benefit from a wider range of evaluation...
This commit was SVN r21209.
OMPI_* to OPAL_*. This allows opal layer to be used more independent
from the whole of ompi.
NOTE: 9 "svn mv" operations immediately follow this commit.
This commit was SVN r21180.
- Delete unnecessary header files using
contrib/check_unnecessary_headers.sh after applying
patches, that include headers, being "lost" due to
inclusion in one of the now deleted headers...
In total 817 files are touched.
In ompi/mpi/c/ header files are moved up into the actual c-file,
where necessary (these are the only additional #include),
otherwise it is only deletions of #include (apart from the above
additions required due to notifier...)
- To get different MCAs (OpenIB, TM, ALPS), an earlier version was
successfully compiled (yesterday) on:
Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled
Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled
Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled
This commit was SVN r21096.
only show up if a notifier component is selected, of course). These
can be disabled by setting the MCA parameter mpi_notify_init_finalize
to 0.
These messages are both intended as "hey, does the community like
this?" and as a way to get some real-world testing of the notify
system. The default is currently to send these messages if a notify
component is selected; we can change the default later if desired.
This commit was SVN r21078.
get bitten by header depending on having already included
the corresponding [opal|orte|ompi]_config.h header.
When separating, things like [OPAL|ORTE|OMPI]_DECLSPEC
are missed.
Script to add the corresponding header in front of all following
(taking care of possible #ifdef HAVE_...)
- Including some minor cleanups to
- ompi/group/group.h -- include _after_ #ifndef OMPI_GROUP_H
- ompi/mca/btl/btl.h -- nclude _after_ #ifndef MCA_BTL_H
- ompi/mca/crcp/bkmrk/crcp_bkmrk_btl.c -- still no need for
orte/util/output.h
- ompi/mca/pml/dr/pml_dr_recvreq.c -- no need for mpool.h
- ompi/mca/btl/btl.h -- reorder to fit
- ompi/mca/bml/bml.h -- reorder to fit
- ompi/runtime/ompi_mpi_finalize.c -- reorder to fit
- ompi/request/request.h -- additionally need ompi/constants.h
- Tested on linux/x86-64
This commit was SVN r20720.
anyhow -- if oob functionality is neededm then orte/mca/oob/oob.h
Nevertheless compiles fine with -Wimplicit-function-declaration
This commit was SVN r20641.
Only proc_info.h-internal include file is opal/dss/dss_types.h
- In one case (orte/util/hnp_contact.c) had to add proc_info.h again.
- Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration
works fine, no errors.
Again, let's have MTT the last word.
This commit was SVN r20631.
Often, orte/util/show_help.h is included, although no functionality
is required -- instead, most often opal_output.h, or
orte/mca/rml/rml_types.h
Please see orte_show_help_replacement.sh commited next.
- Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration
actually showed two *missing* #include "orte/util/show_help.h"
in orte/mca/odls/base/odls_base_default_fns.c and
in orte/tools/orte-top/orte-top.c
Manually added these.
Let's have MTT the last word.
This commit was SVN r20557.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.
I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.
This commit was SVN r18619.
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
This commit brings over all the work from the /tmp-public/datarep
branch. See commits r16855, r16859, r16860 for the highlights of what
was done.
This commit was SVN r16891.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r16855
r16859
r16860
The following Trac tickets were found above:
Ticket 1029 --> https://svn.open-mpi.org/trac/ompi/ticket/1029
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
mpi_show_mpi_alloc_mem_leaks
When activated, MPI_FINALIZE displays a list of memory allocations
from MPI_ALLOC_MEM that were not freed by MPI_FREE_MEM (in each MPI
process).
* If set to a positive integer, display only that many leaks.
* If set to a negative integer, display all leaks.
* If set to 0, do not show any leaks.
This commit was SVN r15736.
* General TCP cleanup for OPAL / ORTE
* Simplifying the OOB by moving much of the logic into the RML
* Allowing the OOB RML component to do routing of messages
* Adding a component framework for handling routing tables
* Moving the xcast functionality from the OOB base to its own framework
Includes merge from tmp/bwb-oob-rml-merge revisions:
r15506, r15507, r15508, r15510, r15511, r15512, r15513
This commit was SVN r15528.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15506
r15507
r15508
r15510
r15511
r15512
r15513
than just the PML/BTLs these days. Also clean up the code so that it
handles the situation where not all nodes register information for a given
node (rather than just spinning until that node sends information, like
we do today).
Includes r15234 and r15265 from the /tmp/bwb-modex branch.
This commit was SVN r15310.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15234
r15265
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
This commit was SVN r14711.