There are 2 reasons for this:
- pending CUDA events are not progressed by this BTL, so anything that becomes
asychronous will never be completed.
- we use the packed data on the shared memory backing file, and this will be
returned to the peer process upon return (thus if we copy asynchronously we
might not copy the right data).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>