4e6a6fc146
zeroes); if so, use it for bit operations like opal_cube_dim and opal_hibit. Implement two versions of power-of-two. For opal_next_poweroftwo, this reduces the average execution time from 83 cycles to 4 cycles (Intel Nehalem, icc, -O2, inlining, measured with rdtsc in a loop over 2^27 values). Numbers for the other functions are similar, but depend heavily on usage (e.g. opal_hibit() with a start of 4 does not save much). The bsr instruction on AMD Opteron is also not as fast. - Replace various places where the next power-of-two is computed. Tested on an Intel Nehalem cluster with openib, compilers GNU-4.6.1 and Intel-12.0.4, using mpi_testsuite -t "Collective" with 128 processes. This commit was SVN r25270. |
||
---|---|---|
.windows | ||
coll_basic_allgather.c | ||
coll_basic_allgatherv.c | ||
coll_basic_allreduce.c | ||
coll_basic_alltoall.c | ||
coll_basic_alltoallv.c | ||
coll_basic_alltoallw.c | ||
coll_basic_barrier.c | ||
coll_basic_bcast.c | ||
coll_basic_component.c | ||
coll_basic_exscan.c | ||
coll_basic_gather.c | ||
coll_basic_gatherv.c | ||
coll_basic_module.c | ||
coll_basic_reduce_scatter.c | ||
coll_basic_reduce.c | ||
coll_basic_scan.c | ||
coll_basic_scatter.c | ||
coll_basic_scatterv.c | ||
coll_basic.h | ||
Makefile.am |
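
The commit message above describes replacing a shift loop with a count-leading-zeros style instruction when computing the next power of two. The following is only a minimal sketch of that idea, not the code from this commit: the names `next_pow2_loop` and `next_pow2_clz` are hypothetical, the GCC/Clang builtin `__builtin_clz` is assumed to be available, and "next power of two" is assumed here to mean the smallest power of two strictly greater than the input.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: portable version that shifts until the value is used up.
 * Assumes value <= UINT_MAX / 2 so the result fits in an unsigned int. */
static unsigned int next_pow2_loop(unsigned int value)
{
    unsigned int power2 = 1u;

    assert(value <= UINT_MAX / 2);
    while (value > 0) {
        value >>= 1;
        power2 <<= 1;
    }
    return power2;
}

/* Hypothetical helper: same result, using a count-leading-zeros builtin
 * where the compiler provides one (assumption: GCC or Clang). */
static unsigned int next_pow2_clz(unsigned int value)
{
    assert(value <= UINT_MAX / 2);
#if defined(__GNUC__) || defined(__clang__)
    if (value == 0) {
        return 1u; /* __builtin_clz(0) is undefined, so handle it separately */
    }
    /* Bit width minus leading zeros is the position just above the highest set bit. */
    return 1u << (sizeof(unsigned int) * CHAR_BIT
                  - (unsigned int) __builtin_clz(value));
#else
    return next_pow2_loop(value); /* no builtin: fall back to the loop */
#endif
}
```

The builtin version does a fixed amount of work regardless of the input, whereas the loop runs once per significant bit. How much that saves depends on the processor and on the typical input values, as the commit message notes.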