Per RFC. There are two optimizations in this commit:
- Allocate requests for blocking sends and receives on the stack. This
bypasses the request free list and saves two atomics on the critical path.
This change improves the small message ping-pong by 50-200ns on both AMD
and Intel CPUs.
- For small messages try to use the btl sendi function before intializing a
send request. If the sendi fails or the btl does not have a sendi function
silently fallback on the standard send path.
cmr=v1.7.5:reviewer=brbarret
This commit was SVN r30343.