openmpi

Jelena Pjesivac-Grbovic 3eac49aa59 Adding flow control for leaf nodes in generalized reduce structure.

This "feature" is disabled by default and it should not affect the current performance.

When the message is large and the segment size is smaller than the eager size for a particular interface,
the leaf nodes in the generalized reduce function can flood their parent nodes by sending all segments
without any synchronization. This can leave the parent with a HIGH number of unexpected messages (think of
a 16MB message with 1KB segments, for example). In the binomial algorithm the root node always has at least
one child that is a leaf, so this can significantly affect the root's performance [especially in
large communicators, where the root may have quite a few children (binomial tree, for example)].
When the segment size is bigger than the eager size, the rendezvous protocol ensures that this does
not happen, so flow control is not necessary.
Originally, the problem showed up as "infinite" bucket allocator clean-up time for "small" segment sizes
(which may explain some "deadlocks" on Thunderbird tests).
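
To put a number on the 16MB / 1KB example above, here is a minimal sketch of the segment count a single leaf would send. The function name segment_count is hypothetical and is not taken from the coll_tuned sources; it only does the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the Open MPI sources): number of segments
 * needed to send message_bytes in pieces of at most segment_bytes each. */
size_t segment_count(size_t message_bytes, size_t segment_bytes)
{
    assert(segment_bytes > 0);  /* this sketch requires a real segment size */
    /* Round up so that a trailing partial segment is counted too. */
    return (message_bytes + segment_bytes - 1) / segment_bytes;
}
```

With message_bytes = 16 * 1024 * 1024 and segment_bytes = 1024 this gives 16384 segments, all of which a leaf could send eagerly before the parent has posted a single matching receive.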

To prevent this, the user can now specify the MCA parameter "--mca coll_tuned_reduce_algorithm_max_requests NUM",
which limits the number of outstanding messages from a leaf node to its parent in generalized reduce to NUM.
Messages are sent as non-blocking synchronous messages, so synchronization happens at "wait" time.
The synchronization actually improved performance of the pipeline and binomial algorithms for large message sizes
with 1KB segments over MX, but I need to test it some more to make sure the result is consistent.
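
The bookkeeping behind the limit can be sketched without any MPI calls. The function below is a simulation, not the code from this commit: it only counts how many sends a leaf would have in flight at once. The name peak_outstanding_sends and the convention that max_requests == 0 means "no limit" are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simulation only (hypothetical, no MPI): returns the peak number of sends
 * a leaf has outstanding while sending num_segments segments.
 * Assumption: max_requests == 0 means the limit is disabled. */
size_t peak_outstanding_sends(size_t num_segments, size_t max_requests)
{
    size_t in_flight = 0;
    size_t peak = 0;

    for (size_t seg = 0; seg < num_segments; seg++) {
        if (max_requests > 0 && in_flight == max_requests) {
            /* Window is full. Real code would wait on the oldest
             * non-blocking synchronous send here (e.g. MPI_Wait on a
             * request started with MPI_Issend); that send completes only
             * once the parent has matched it, which throttles the leaf. */
            in_flight--;
        }
        in_flight++;  /* post the send for this segment */
        if (in_flight > peak) {
            peak = in_flight;
        }
    }

    /* With the limit active, the window must never have been exceeded. */
    assert(max_requests == 0 || peak <= max_requests);
    return peak;
}
```

In this model, 16384 segments with the limit disabled peak at 16384 sends in flight, while a limit of 8 keeps the peak at 8.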

Since there is no easy way to find out the eager size for a particular BTL, I set the limit to 4000B.
If the message/individual segment size is greater than 4000B, this feature is not used. This limit may
or may not be exposed as an MCA parameter later...
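
The decision described above might look roughly like the following sketch. The 4000B constant comes from this commit message; the function name use_flow_control, the macro name, and the convention that segment_bytes == 0 means "unsegmented" are assumptions made for illustration and do not mirror the actual coll_tuned code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fixed stand-in for the unknown per-BTL eager size (value from the commit
 * message; the macro name is hypothetical). */
#define ASSUMED_EAGER_LIMIT_BYTES 4000

/* Hypothetical helper: should a leaf throttle its sends?
 * Assumptions: max_requests == 0 means the feature is off (the default),
 * and segment_bytes == 0 means the message is sent unsegmented. */
bool use_flow_control(size_t message_bytes, size_t segment_bytes,
                      int max_requests)
{
    assert(max_requests >= 0);  /* a negative limit is a caller error here */
    if (max_requests == 0) {
        return false;           /* disabled by default */
    }

    /* Size of one individual send: a segment, or the whole message. */
    size_t send_bytes = message_bytes;
    if (segment_bytes > 0 && segment_bytes < message_bytes) {
        send_bytes = segment_bytes;
    }

    /* Larger sends are assumed to use the rendezvous protocol, which
     * already synchronizes sender and receiver. */
    return send_bytes <= ASSUMED_EAGER_LIMIT_BYTES;
}
```

Under these assumptions, a 16MB message with 1KB segments and a limit of 8 is throttled, while the same message with 8KB segments is not.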

I did not have any problems running it, and both the "default" and "synchronous" runs passed the Intel Reduce* tests
up to 80 processes (over MX).

This commit was SVN r14518.

2007-04-25 20:39:53 +00:00

coll_tuned_allgather.c

Fix for the special case where np=2 and the sendbuf is set to MPI_IN_PLACE.

2007-03-13 19:01:20 +00:00

coll_tuned_allreduce.c

Adding segmented ring algorithm for Allreduce for commutative operations.

2007-02-27 20:32:30 +00:00

coll_tuned_alltoall.c

Adding variant of linear alltoall algorithm where the number of

2007-02-20 04:25:00 +00:00