3eac49aa59
This feature is disabled by default, so it should not affect current performance. When the message is large and the segment size is smaller than the eager size for a particular interface, the leaf nodes in the generalized reduce function can flood their parent nodes by sending all segments without any synchronization. This can leave the parent with a very large number of unexpected messages (consider a 16MB message with 1KB segments). In the binomial algorithm the root always has at least one child that is a leaf, so this can significantly affect the root's performance, especially in large communicators where the root may have quite a few children (a binomial tree, for example). When the segment size is larger than the eager size, the rendezvous protocol ensures that this does not happen, so the limit is not necessary. The problem was originally exposed as seemingly "infinite" bucket allocator clean-up time for "small" segment sizes (which may explain some "deadlocks" on Thunderbird tests). To prevent this, the user can specify the MCA parameter "--mca coll_tuned_reduce_algorithm_max_requests NUM", which limits the number of outstanding messages from a leaf node to its parent in generalized reduce to NUM. Messages are sent as non-blocking synchronous messages, so synchronization happens at "wait" time. The synchronization actually improved the performance of the pipeline and binomial algorithms for large message sizes with 1KB segments over MX, but I need to test it some more to make sure the result is consistent. Since there is no easy way to find out the eager size for a particular BTL, I set the limit to 4000B: if the message or individual segment size is greater than 4000B, this feature is not used. This variable may or may not be exposed as an MCA parameter later. I did not have any problems running it, and both the "default" and "synchronous" tests passed the Intel Reduce* tests with up to 80 processes (over MX). This commit was SVN r14518. |
||
---|---|---|
.. | ||
coll_tuned_allgather.c | ||
coll_tuned_allreduce.c | ||
coll_tuned_alltoall.c | ||
coll_tuned_barrier.c | ||
coll_tuned_bcast.c | ||
coll_tuned_component.c | ||
coll_tuned_decision_dynamic.c | ||
coll_tuned_decision_fixed.c | ||
coll_tuned_dynamic_file.c | ||
coll_tuned_dynamic_file.h | ||
coll_tuned_dynamic_rules.c | ||
coll_tuned_dynamic_rules.h | ||
coll_tuned_forced.c | ||
coll_tuned_forced.h | ||
coll_tuned_gather.c | ||
coll_tuned_module.c | ||
coll_tuned_reduce_scatter.c | ||
coll_tuned_reduce.c | ||
coll_tuned_scatter.c | ||
coll_tuned_topo.c | ||
coll_tuned_topo.h | ||
coll_tuned_util.c | ||
coll_tuned_util.h | ||
coll_tuned.h | ||
configure.params | ||
Makefile.am |
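
The throttling described in the commit message above can be illustrated with a small, MPI-free model. This is only a sketch under stated assumptions, not the code in coll_tuned_reduce.c: the names `SYNC_SEGMENT_LIMIT_BYTES`, `segment_count`, `use_throttle`, and `peak_outstanding` are hypothetical, a `max_requests` value of 0 or less is assumed to mean "disabled", and the model assumes the worst case in which the parent matches nothing until the leaf waits on a send.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant mirroring the 4000B cutoff from the commit message. */
#define SYNC_SEGMENT_LIMIT_BYTES 4000

/* Number of segments needed to send msg_bytes in pieces of seg_bytes. */
size_t segment_count(size_t msg_bytes, size_t seg_bytes)
{
    assert(seg_bytes > 0);
    return (msg_bytes + seg_bytes - 1) / seg_bytes;
}

/* Returns 1 when the limit applies: it must be enabled (assumed to mean
 * max_requests > 0) and the segment must be small enough that it is
 * presumed to be sent eagerly. */
int use_throttle(size_t seg_bytes, int max_requests)
{
    return max_requests > 0 && seg_bytes <= SYNC_SEGMENT_LIMIT_BYTES;
}

/* Worst-case model of a leaf sending small (eager) segments to its parent.
 * Without the limit, every segment is posted and none is assumed complete,
 * so the peak equals the segment count. With the limit, the leaf waits for
 * its oldest synchronous send before posting another one. Larger segments
 * are paced by the rendezvous protocol, which this model does not cover. */
size_t peak_outstanding(size_t msg_bytes, size_t seg_bytes, int max_requests)
{
    size_t nseg = segment_count(msg_bytes, seg_bytes);
    size_t outstanding = 0;
    size_t peak = 0;
    int throttled = use_throttle(seg_bytes, max_requests);

    for (size_t i = 0; i < nseg; ++i) {
        if (throttled && outstanding == (size_t)max_requests) {
            outstanding--; /* stands in for waiting on the oldest send */
        }
        outstanding++;     /* stands in for posting a non-blocking send */
        if (outstanding > peak) {
            peak = outstanding;
        }
    }
    return peak;
}
```

In this model, a 16MB message (16 * 1024 * 1024 bytes) with 1KB segments gives a peak of 16384 outstanding sends when the limit is disabled and a peak of 4 when `max_requests` is 4, which is the effect the MCA parameter is meant to have on the parent's unexpected-message queue.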