From 9cc6bc1ea64897723d772bb9960ab58a38eea277 Mon Sep 17 00:00:00 2001
From: "Spruit, Neil R"
Date: Tue, 8 May 2018 13:00:19 -0400
Subject: [PATCH] MTL OFI: Fix Deadlock in fi_cancel given completion during
 cancel

- If a message for a recv that is being cancelled gets completed after
  the call to fi_cancel, then the OFI mtl will enter a deadlock state
  waiting for ofi_req->super.ompi_req->req_status._cancelled which will
  never happen since the recv was successfully finished.

- To resolve this issue, the OFI mtl now checks ofi_req->req_started
  to see if the request has been started within the loop waiting for the
  event to be cancelled. If the request is being completed, then the loop
  is broken and fi_cancel exits setting
  ofi_req->super.ompi_req->req_status._cancelled = false;

Signed-off-by: Spruit, Neil R
(cherry picked from commit 767135c580f75d3dde9cb9c88601dd18afda949a)
---
 ompi/mca/mtl/ofi/mtl_ofi.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ompi/mca/mtl/ofi/mtl_ofi.h b/ompi/mca/mtl/ofi/mtl_ofi.h
index 45a66673d1..d4c5f8a7b6 100644
--- a/ompi/mca/mtl/ofi/mtl_ofi.h
+++ b/ompi/mca/mtl/ofi/mtl_ofi.h
@@ -1003,8 +1003,11 @@ ompi_mtl_ofi_cancel(struct mca_mtl_base_module_t *mtl,
              */
             while (!ofi_req->super.ompi_req->req_status._cancelled) {
                 opal_progress();
+                if (ofi_req->req_started)
+                    goto ofi_cancel_not_possible;
             }
         } else {
+ofi_cancel_not_possible:
             /**
              * Could not cancel the request.
              */