cd11fc3081
The send code in the ugni btl has an optimization that enables it to return 1 (fragment gone) in some cases. This optimization involved removing the btl ownership and callback flags to ensure the fragment stuck around long enough for its completion flag to be checked. This works fine for the single-threaded case but not in the multi-threaded case. It is possible that a fragment will be completed by another thread while a thread is in mca_btl_ugni_send. This competition can lead to a leaked fragment, missed callback, or both. To fix the issue without removing the optimization a reference count has been added to the fragment. Callbacks and fragment release will not be made until the fragment reference count has reach 0. The count is incremented before sending the frag and decremented after the completion flag has been checked. The fix has been verified to work using a multi-threaded RMA benchmark with the osc/pt2pt component. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> |
||
---|---|---|
.. | ||
base | ||
openib | ||
portals4 | ||
scif | ||
self | ||
sm | ||
smcuda | ||
tcp | ||
template | ||
ugni | ||
usnic | ||
vader | ||
btl.h | ||
Makefile.am |