2004-06-29 00:02:25 +00:00
|
|
|
/*
|
2005-11-05 19:57:48 +00:00
|
|
|
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
|
|
|
|
* University Research and Technology
|
|
|
|
* Corporation. All rights reserved.
|
|
|
|
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
|
|
|
* of Tennessee Research Foundation. All rights
|
|
|
|
* reserved.
|
2015-06-23 20:59:57 -07:00
|
|
|
* Copyright (c) 2004-2011 High Performance Computing Center Stuttgart,
|
2004-11-28 20:09:25 +00:00
|
|
|
* University of Stuttgart. All rights reserved.
|
2005-03-24 12:43:37 +00:00
|
|
|
* Copyright (c) 2004-2005 The Regents of the University of California.
|
|
|
|
* All rights reserved.
|
2004-11-22 01:38:40 +00:00
|
|
|
* $COPYRIGHT$
|
2015-06-23 20:59:57 -07:00
|
|
|
*
|
2004-11-22 01:38:40 +00:00
|
|
|
* Additional copyrights may follow
|
2015-06-23 20:59:57 -07:00
|
|
|
*
|
2004-06-29 00:02:25 +00:00
|
|
|
* $HEADER$
|
|
|
|
*/
|
|
|
|
|
2005-07-04 00:13:44 +00:00
|
|
|
#ifndef OPAL_BIT_OPS_H
|
|
|
|
#define OPAL_BIT_OPS_H
|
2004-06-29 00:02:25 +00:00
|
|
|
|
- Check, whether the compiler supports __builtin_clz (count leading
zeroes);
if so, use it for bit-operations like opal_cube_dim and opal_hibit.
Implement two versions of power-of-two.
In case of opal_next_poweroftwo, this reduces the average execution
time from 83 cycles to 4 cycles (Intel Nehalem, icc, -O2, inlining,
measured rdtsc, with loop over 2^27 values).
Numbers for other functions are similar (but of course heavily depend
on the usage, e.g. opal_hibit() with a start of 4 does not save
much). The bsr instruction on AMD Opteron is also not as fast.
- Replace various places where the next power-of-two is computed.
Tested on Intel Nehalem Cluster with openib, compilers GNU-4.6.1 and
Intel-12.0.4 using mpi_testsuite -t "Collective" with 128 processes.
This commit was SVN r25270.
2011-10-11 22:49:01 +00:00
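The compiler check described in the commit message above can be approximated directly in the preprocessor. The following is only a minimal sketch under stated assumptions: `SKETCH_HAVE_BUILTIN_CLZ` and `sketch_clz` are hypothetical names that do not appear in this header, and the header itself relies on `OPAL_C_HAVE_BUILTIN_CLZ` being defined elsewhere.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical feature test: use __has_builtin where the compiler offers it,
 * otherwise assume that GCC-compatible compilers provide the builtin. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_clz)
#    define SKETCH_HAVE_BUILTIN_CLZ 1
#  endif
#elif defined(__GNUC__)
#  define SKETCH_HAVE_BUILTIN_CLZ 1
#endif
#ifndef SKETCH_HAVE_BUILTIN_CLZ
#  define SKETCH_HAVE_BUILTIN_CLZ 0
#endif

/* Count the leading zero bits of a nonzero value (hypothetical helper). */
static inline int sketch_clz(unsigned int x)
{
    assert(x != 0);   /* __builtin_clz(0) is undefined, so reject it here */
#if SKETCH_HAVE_BUILTIN_CLZ
    return __builtin_clz(x);
#else
    /* Portable fallback: walk a probe bit down from the top position. */
    int zeros = 0;
    unsigned int probe = 1u << (sizeof(unsigned int) * CHAR_BIT - 1);
    while (0 == (x & probe)) {
        probe >>= 1;
        ++zeros;
    }
    return zeros;
#endif
}
```

Both branches return the same count, so callers do not need to know which one was compiled in.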
|
|
|
#include "opal/prefetch.h"
|
|
|
|
|
2004-06-29 00:02:25 +00:00
|
|
|
/**
|
|
|
|
* Calculates the position of the highest set bit in an integer
|
|
|
|
*
|
|
|
|
* @param value The integer value to examine
|
|
|
|
* @param start Position to start looking
|
|
|
|
*
|
|
|
|
* @returns pos Position of the highest set bit, or -1 if none are set.
|
|
|
|
*
|
|
|
|
* Look at the integer "value" starting just below position "start"
|
|
|
|
* (bits start-1 down to 0) and move to the right. Return the index
|
|
|
|
* of the highest bit that is set to 1.
|
|
|
|
*
|
|
|
|
* WARNING: *NO* error checking is performed. This is meant to be a
|
|
|
|
* fast inline function.
|
2011-10-11 22:49:01 +00:00
|
|
|
* With __builtin_clz (count leading zeros) this takes 3 cycles instead
|
|
|
|
* of the 17 cycles (on average, with start=32)
|
|
|
|
* of the loop version (on Intel Nehalem, with icc-12.1.0 -O2).
|
2004-06-29 00:02:25 +00:00
|
|
|
*/
|
2005-07-04 00:13:44 +00:00
|
|
|
static inline int opal_hibit(int value, int start)
|
2004-06-29 00:02:25 +00:00
|
|
|
{
|
2011-10-11 22:49:01 +00:00
|
|
|
unsigned int mask;
|
2004-06-29 00:02:25 +00:00
|
|
|
|
2011-10-11 22:49:01 +00:00
|
|
|
#if OPAL_C_HAVE_BUILTIN_CLZ
|
|
|
|
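/* Only examine the bits below "start". Shifting by the full width of
* int is undefined, so skip the mask when start covers every bit. */
|
|
|
|
mask = (start < (int) (8 * sizeof(int))) ? ((unsigned int) value & ((1u << start) - 1u)) : (unsigned int) value;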
|
2004-06-29 00:02:25 +00:00
|
|
|
|
2011-10-11 22:49:01 +00:00
|
|
|
if (OPAL_UNLIKELY (0 == mask)) {
|
|
|
|
return -1;
|
2004-06-29 00:02:25 +00:00
|
|
|
}
|
2011-10-11 22:49:01 +00:00
|
|
|
|
|
|
|
start = (8*sizeof(int)-1) - __builtin_clz(mask);
|
|
|
|
#else
|
|
|
|
--start;
|
|
|
|
mask = (start >= 0) ? (1u << start) : 0u; /* unsigned shift; start may be -1 here */
|
|
|
|
|
|
|
|
for (; start >= 0; --start, mask >>= 1) {
|
|
|
|
if (value & mask) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
2015-06-23 20:59:57 -07:00
|
|
|
|
2011-10-11 22:49:01 +00:00
|
|
|
return start;
|
2004-06-29 00:02:25 +00:00
|
|
|
}
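The masking and scanning steps of opal_hibit can also be written without compiler builtins and without undefined shifts. The following is a minimal sketch under stated assumptions, not the implementation used in this header: `sketch_hibit` is a hypothetical name, it takes an unsigned argument, and it asserts a precondition on "start" that the function above leaves unchecked.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: index of the highest set bit among bits
 * [0, start) of value, or -1 if none of those bits is set. */
static inline int sketch_hibit(unsigned int value, int start)
{
    const int width = (int) (sizeof(unsigned int) * CHAR_BIT);
    int pos = -1;

    /* Precondition that the original function does not check. */
    assert(start >= 0 && start <= width);

    /* Keep only bits below start; a full-width shift would be undefined. */
    if (start < width) {
        value &= (1u << start) - 1u;
    }

    /* Shift right until nothing is left, counting positions. */
    while (value != 0) {
        value >>= 1;
        ++pos;
    }
    return pos;
}
```

For example, `sketch_hibit(0x10, 4)` yields -1 because bit 4 lies outside the examined range, while `sketch_hibit(0x1F, 4)` yields 3.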
|
|
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Returns the cube dimension of a given value.
|
|
|
|
*
|
|
|
|
* @param value The integer value to examine
|
|
|
|
*
|
|
|
|
* @returns cubedim The smallest cube dimension containing that value
|
|
|
|
*
|
|
|
|
* Look at the integer "value" and calculate the smallest dimension
|
|
|
|
* dim such that 2^dim is greater than or equal to that value.
|
|
|
|
*
|
|
|
|
* WARNING: *NO* error checking is performed. This is meant to be a
|
|
|
|
* fast inline function.
|
2011-10-11 22:49:01 +00:00
|
|
|
* With __builtin_clz (count leading zeros) this takes 3 cycles instead of the
|
|
|
|
* 50 cycles of the loop version (on Intel Nehalem, with icc-12.1.0 -O2).
|
2004-06-29 00:02:25 +00:00
|
|
|
*/
|
2015-06-23 20:59:57 -07:00
|
|
|
static inline int opal_cube_dim(int value)
|
2004-06-29 00:02:25 +00:00
|
|
|
{
|
2004-10-19 23:58:12 +00:00
|
|
|
int dim, size;
|
2004-06-29 00:02:25 +00:00
|
|
|
|
2011-10-11 22:49:01 +00:00
|
|
|
#if OPAL_C_HAVE_BUILTIN_CLZ
|
|
|
|
if (OPAL_UNLIKELY (1 >= value)) {
|
|
|
|
return 0;
|
2004-06-29 00:02:25 +00:00
|
|
|
}
|
2011-10-11 22:49:01 +00:00
|
|
|
size = 8 * sizeof(int);
|
2015-06-23 20:59:57 -07:00
|
|
|
dim = size - __builtin_clz(value-1);
|
2011-10-11 22:49:01 +00:00
|
|
|
#else
|
|
|
|
for (dim = 0, size = 1; size < value; ++dim, size <<= 1) /* empty */;
|
|
|
|
#endif
|
2004-06-29 00:02:25 +00:00
|
|
|
|
|
|
|
return dim;
|
|
|
|
}
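The relationship between a value and its cube dimension is easiest to see on small inputs. The following is a minimal sketch of the loop variant, assuming a hypothetical name `sketch_cube_dim` and using an unsigned counter so that the final doubling stays well defined; it is an illustration, not a replacement for the function above.

```c
#include <assert.h>

/* Hypothetical helper: smallest dim such that 2^dim >= value. */
static inline int sketch_cube_dim(int value)
{
    int dim = 0;
    unsigned int size = 1u;

    /* Values of 1 or less fit into a zero-dimensional cube. */
    if (value <= 1) {
        return 0;
    }

    /* Double size until it covers value, counting the doublings. */
    while (size < (unsigned int) value) {
        size <<= 1;
        ++dim;
    }

    assert(size >= (unsigned int) value);   /* sanity check on the loop result */
    return dim;
}
```

With this definition the values 3 and 4 both map to dimension 2, and 5 through 8 map to dimension 3.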
|
|
|
|
|
2011-10-11 22:49:01 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* @brief Returns the next power of two strictly greater than the given value.
|
|
|
|
*
|
|
|
|
* @param value The integer value whose next power of two is returned
|
|
|
|
*
|
|
|
|
* @returns The next power of two
|
|
|
|
*
|
|
|
|
* WARNING: *NO* error checking is performed. This is meant to be a
|
|
|
|
* fast inline function.
|
|
|
|
* With __builtin_clz (count leading zeros) this takes 4 cycles instead of the
|
|
|
|
* 77 cycles of the loop version (on Intel Nehalem, with icc-12.1.0 -O2).
|
|
|
|
*/
|
|
|
|
static inline int opal_next_poweroftwo(int value)
|
|
|
|
{
|
|
|
|
int power2;
|
|
|
|
|
|
|
|
#if OPAL_C_HAVE_BUILTIN_CLZ
|
|
|
|
if (OPAL_UNLIKELY (0 >= value)) {
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
power2 = 1 << (8 * sizeof (int) - __builtin_clz(value));
|
|
|
|
#else
|
|
|
|
for (power2 = 1; value > 0; value >>= 1, power2 <<= 1) /* empty */;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
return power2;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
* @brief Returns the next power of two of the given value (or the value itself if it is already a power of two).
|
|
|
|
*
|
|
|
|
* @param value The integer value whose next power of two is returned
|
|
|
|
*
|
|
|
|
* @returns The next power of two (inclusive)
|
|
|
|
*
|
|
|
|
* WARNING: *NO* error checking is performed. This is meant to be a
|
|
|
|
* fast inline function.
|
|
|
|
* With __builtin_clz (count leading zeros) this takes 4 cycles instead of the
|
|
|
|
* 56 cycles of the loop version (on Intel Nehalem, with icc-12.1.0 -O2).
|
|
|
|
*/
|
|
|
|
static inline int opal_next_poweroftwo_inclusive(int value)
|
|
|
|
{
|
|
|
|
int power2;
|
|
|
|
|
|
|
|
#if OPAL_C_HAVE_BUILTIN_CLZ
|
|
|
|
if (OPAL_UNLIKELY (1 >= value)) {
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
power2 = 1 << (8 * sizeof (int) - __builtin_clz(value - 1));
|
|
|
|
#else
|
|
|
|
for (power2 = 1 ; power2 < value; power2 <<= 1) /* empty */;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
return power2;
|
|
|
|
}
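The two power-of-two variants differ only in how they treat inputs that are already powers of two. The following is a minimal sketch that places them side by side, assuming hypothetical names (`sketch_next_pow2`, `sketch_next_pow2_inclusive`) and unsigned arguments; the asserts express range limits that the functions above leave unchecked.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: smallest power of two strictly greater than value. */
static inline unsigned int sketch_next_pow2(unsigned int value)
{
    unsigned int power2 = 1u;

    assert(value <= UINT_MAX / 2);   /* otherwise the result does not fit */
    while (power2 <= value) {
        power2 <<= 1;
    }
    return power2;
}

/* Hypothetical helper: smallest power of two greater than or equal to value. */
static inline unsigned int sketch_next_pow2_inclusive(unsigned int value)
{
    unsigned int power2 = 1u;

    assert(value <= UINT_MAX / 2 + 1);   /* largest representable power of two */
    while (power2 < value) {
        power2 <<= 1;
    }
    return power2;
}
```

For an input of 4 the first helper yields 8 and the second yields 4; for an input of 5 both yield 8.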
|
|
|
|
|
|
|
|
|
2005-07-04 00:13:44 +00:00
|
|
|
#endif /* OPAL_BIT_OPS_H */
|
2011-10-11 22:49:01 +00:00
|
|
|
|