This "feature" is disabled by default, so it should not affect current performance.
When the message is large and the segment size is smaller than the eager size for a particular interface,
the leaf nodes in the generalized reduce function can flood their parent nodes by sending all segments
without any synchronization. This can leave the parent with a very high number of unexpected messages
(think of a 16MB message with 1KB segments, for example). In the binomial algorithm the root node always
has at least one child that is a leaf, so this can significantly affect the root's performance, especially
in large communicators where the root may have quite a few children (binomial tree, for example).
When the segment size is bigger than the eager size, the rendezvous protocol ensures that this does
not happen, so the limit is not necessary.
Originally, the problem showed up as a seemingly "infinite" bucket allocator cleanup time for "small"
segment sizes (which may explain some "deadlocks" on Thunderbird tests).
To prevent this, we allow the user to specify the mca parameter
"--mca coll_tuned_reduce_algorithm_max_requests NUM", which limits the number of outstanding messages
from a leaf node to its parent in generalized reduce to NUM.
Messages are sent as non-blocking synchronous messages, so synchronization happens at "wait" time.
The synchronization actually improved performance of the pipeline and binomial algorithms for large message
sizes with 1KB segments over MX, but I need to test it some more to make sure it is consistent.
Since there is no easy way to find out what "the eager" size is for a particular btl, I set the limit to 4000B.
If the message/individual segment size is greater than 4000B, we will not use this feature. This variable may
or may not be exposed as an mca parameter later...
I did not have any problems running it, and both the "default" and "synchronous" variants passed the
Intel Reduce* tests up to 80 processes (over MX).
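
The limiting logic described above can be sketched roughly as follows. This is a plain C model with no MPI calls, not the coll_tuned code itself: the function names, the treatment of a zero limit as "disabled", and the use of a simple counter for in-flight sends are assumptions made for illustration. Only the 4000B cutoff and the idea of capping outstanding sends at NUM come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Cutoff from the commit message: larger segments are left to rendezvous. */
#define SKETCH_EAGER_LIMIT_BYTES 4000

/* Number of segments needed to carry a message (ceiling division). */
size_t sketch_segment_count(size_t msg_bytes, size_t seg_bytes)
{
    assert(seg_bytes > 0);
    return (msg_bytes + seg_bytes - 1) / seg_bytes;
}

/* Assumption: max_requests == 0 stands for "feature disabled" (the default). */
int sketch_use_sync_sends(size_t seg_bytes, size_t max_requests)
{
    return max_requests > 0 && seg_bytes <= SKETCH_EAGER_LIMIT_BYTES;
}

/* Models a leaf's send loop: each iteration posts one non-blocking
 * synchronous send; once max_requests sends are in flight, the loop
 * first "waits" on the oldest one. Returns the peak number in flight. */
size_t sketch_peak_outstanding(size_t num_segments, size_t max_requests)
{
    size_t in_flight = 0, peak = 0;
    assert(max_requests > 0);
    for (size_t seg = 0; seg < num_segments; seg++) {
        if (in_flight == max_requests) {
            in_flight--;            /* stands in for waiting on the oldest send */
        }
        in_flight++;                /* stands in for posting the next send */
        if (in_flight > peak) {
            peak = in_flight;
        }
    }
    return peak;
}
```

Taking 1MB as 2^20 bytes, sketch_segment_count(16 * 1024 * 1024, 1024) gives 16384 segments, and with a limit of 4 the modeled peak stays at 4 instead of growing with the segment count.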
This commit was SVN r14518.
assumptions in the FT restart code for the ORTE layer.
This fixes those problems by having the RML completely shut down and
restart the OOB framework (instead of just the module as before).
This makes it much easier to manage and maintain as the OOB
changes in the future.
The SDS now does communication as part of its startup procedure, so
we need to make sure we restart the RML before the SDS so that it can
communicate properly.
OOB base [close|open] used a static bool to determine if they have
been called previously or not. I needed to expose this boolean so
that I can close() then open() the oob base in the restart procedure.
The functionality has not changed; we just now have the ability to
open/close the framework as many times as we need to, as long as we
always call them in that order. (So, as before, calling open twice in a
row is not allowed; it is only allowed if you open(), close(), then open() again.)
Things seem to be working now.
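
A minimal sketch of the open/close ordering rule, assuming a single exposed flag guards both calls. The names are hypothetical stand-ins rather than the OOB base symbols, and returning -1 on misuse is an assumption; the real framework may handle a repeated call differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the boolean that is now exposed. */
bool sketch_fw_opened = false;

/* Open succeeds only if the framework is currently closed. */
int sketch_fw_open(void)
{
    if (sketch_fw_opened) {
        return -1;              /* open() twice in a row is rejected */
    }
    sketch_fw_opened = true;
    return 0;
}

/* Close succeeds only if the framework is currently open. */
int sketch_fw_close(void)
{
    if (!sketch_fw_opened) {
        return -1;
    }
    sketch_fw_opened = false;
    return 0;
}

/* Restart path: close() then open(), in that order. */
int sketch_fw_restart(void)
{
    if (sketch_fw_close() != 0) {
        return -1;
    }
    if (sketch_fw_open() != 0) {
        return -1;
    }
    assert(sketch_fw_opened);
    return 0;
}
```

Because the flag is reset on close, the close-then-open pair can be repeated as often as the restart procedure needs.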
This commit was SVN r14515.
- Make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6
so that it works for IPv4 and IPv6 addresses, and remove a whole bunch
of #ifs in the OOB code.
- Fix a compiler warning in the TCP BTL due to run-time determined
array size by making it a dynamically allocated array.
- Fix the unpacking code of IPv4 addresses when using IPv6 support, so
that the address is in the correct location (instead of in an IPv6
structure, use an IPv4 structure). Refs trac:1005.
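
To illustrate why a sockaddr_storage argument removes the need for per-family #ifs, here is a small sketch that dispatches on ss_family. The helper name and signature are hypothetical; this is not the actual opal_sockaddr2str(), just one way such a function could be written with the standard inet_ntop() call.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Formats the address stored in ss into buf. One code path serves both
 * families because sockaddr_storage can hold either structure.
 * Returns buf on success, NULL for an unknown family or on failure. */
const char *sketch_addr_to_str(const struct sockaddr_storage *ss,
                               char *buf, size_t len)
{
    assert(ss != NULL && buf != NULL);
    switch (ss->ss_family) {
    case AF_INET:
        return inet_ntop(AF_INET,
                         &((const struct sockaddr_in *)ss)->sin_addr,
                         buf, (socklen_t)len);
    case AF_INET6:
        return inet_ntop(AF_INET6,
                         &((const struct sockaddr_in6 *)ss)->sin6_addr,
                         buf, (socklen_t)len);
    default:
        return NULL;
    }
}
```

A caller would fill the storage as a sockaddr_in or sockaddr_in6, set the family field, and pass a buffer of at least INET6_ADDRSTRLEN bytes.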
This commit was SVN r14514.
The following Trac tickets were found above:
Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005
This completes the minor changes required to the PLS components. Basically, there is a small change required to the parameter list of the orted cmd functions. I caught and did it for xcpu and poe, in addition to the components listed in my email - so I think that only leaves xgrid unconverted.
The orted fail-to-start mods will also make changes in the PLS components, but those can be localized so they come in one at a time.
This commit was SVN r14499.
Test for system limits (where known) prior to doing things like fork and pipe since some systems aren't very nice about it when we try to exceed such limits.
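
As a rough illustration of checking a limit before acting, the sketch below asks getrlimit() for the soft descriptor limit before a pipe would be created. The helper name and the idea of the caller tracking how many descriptors are in use are assumptions for this example; the commit itself does not say which limits it tests or how. A similar check before fork could use RLIMIT_NPROC on systems that provide it (not shown).

```c
#include <assert.h>
#include <sys/resource.h>

/* Returns 1 if `needed` more descriptors fit under the soft
 * RLIMIT_NOFILE limit given `in_use` already open, 0 if they do not,
 * and -1 if the limit cannot be queried. */
int sketch_fds_available(unsigned long in_use, unsigned long needed)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;                      /* limit unknown on this system */
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        return 1;                       /* no soft limit to exceed */
    }
    if ((rlim_t)in_use > rl.rlim_cur) {
        return 0;                       /* already over the limit */
    }
    return (rl.rlim_cur - (rlim_t)in_use) >= (rlim_t)needed ? 1 : 0;
}
```

A pipe needs two descriptors, so a caller would check sketch_fds_available(in_use, 2) and report an error instead of calling pipe() when the result is 0.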
This commit was SVN r14494.
Make sure that the wrapper selection is compiled out when FT is not enabled. Previously the
logic would skip over it at run time because the conditional if statements would not be satisfied;
now there are no additional if statements at all when FT is compiled out.
With this modification the selection logic looks nearly identical to pre-r14051
with the exception of the non-FT related improvements.
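
The difference between skipping a branch at run time and compiling it out can be sketched as below. The macro SKETCH_ENABLE_FT and the function are hypothetical names chosen for this example, not the symbols used in the selection code.

```c
#include <assert.h>

/* Hypothetical build-time switch; defaults to FT disabled. */
#ifndef SKETCH_ENABLE_FT
#define SKETCH_ENABLE_FT 0
#endif

/* Returns 1 if the FT wrapper would be selected, 0 for the regular
 * selection path. With FT disabled the wrapper test is not compiled
 * at all, so no extra if statement remains in the function. */
int sketch_select_wrapper(int wrapper_requested)
{
    assert(wrapper_requested == 0 || wrapper_requested == 1);
#if SKETCH_ENABLE_FT
    if (wrapper_requested) {
        return 1;
    }
#endif
    return 0;
}
```

Building with -DSKETCH_ENABLE_FT=1 restores the wrapper branch; without it the function reduces to the regular path.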
This commit was SVN r14491.
The following SVN revision numbers were found above:
r14051 --> open-mpi/ompi@dadca7da88
use_default_rpm_opt_flags. It defaults to a value of 1, meaning that
we'll try to use $RPM_OPT_FLAGS. But if you're not compiling with the
GNU compilers, you might want to set this value to 0 so that your
compiler doesn't get flags that it doesn't understand (e.g., PGI 7.0
will barf on flags that it doesn't understand).
This commit was SVN r14477.
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
This also involved a slight change to the oob.xcast API, so propagated that as required.
Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)
Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.
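
For readers unfamiliar with the log-2 relay pattern mentioned above, here is a sketch of one common binomial-tree numbering, with rank 0 playing the role of the HNP. It is not the xcast code (which, as noted, is not selectable); the function name and the rank numbering are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fills `children` (capacity `cap`) with the ranks that `rank` relays
 * to in a binomial tree rooted at rank 0 over `size` nodes, and
 * returns how many were stored. Children beyond `cap` are dropped. */
size_t sketch_binomial_children(unsigned rank, unsigned size,
                                unsigned *children, size_t cap)
{
    size_t n = 0;

    assert(rank < size);
    for (unsigned mask = 1; mask != 0 && mask < size; mask <<= 1) {
        if (rank & mask) {
            break;              /* reached the bit that links rank to its parent */
        }
        unsigned child = rank | mask;
        if (child < size && n < cap) {
            children[n++] = child;
        }
    }
    return n;
}
```

With 8 nodes, rank 0 relays to ranks 1, 2 and 4, rank 4 relays to 5 and 6, and rank 1 is a leaf, so every node is reached in at most three hops.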
This commit was SVN r14475.