MTL OFI: Fix Deadlock in fi_cancel given completion during cancel

- If the message for a recv that is being cancelled completes after
the call to fi_cancel, the OFI MTL deadlocks waiting for
ofi_req->super.ompi_req->req_status._cancelled to be set, which never
happens because the recv finished successfully.

- To resolve this issue, the OFI MTL now checks ofi_req->req_started
inside the loop that waits for the cancel event. If the request has
started completing, the loop is broken and ompi_mtl_ofi_cancel exits,
setting ofi_req->super.ompi_req->req_status._cancelled = false.

Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 767135c580f75d3dde9cb9c88601dd18afda949a)
Spruit, Neil R 2018-05-08 13:00:19 -04:00
parent 5704d4fab5
commit 9cc6bc1ea6


@@ -1003,8 +1003,11 @@ ompi_mtl_ofi_cancel(struct mca_mtl_base_module_t *mtl,
                  */
                 while (!ofi_req->super.ompi_req->req_status._cancelled) {
                     opal_progress();
+                    if (ofi_req->req_started)
+                        goto ofi_cancel_not_possible;
                 }
             } else {
+ofi_cancel_not_possible:
                 /**
                  * Could not cancel the request.
                  */