From 2acc4b7e7f2b564b8f28d2f0fb99d28e2dfcd98d Mon Sep 17 00:00:00 2001
From: Brian Barrett
Date: Sat, 13 Oct 2018 01:46:36 +0000
Subject: [PATCH] btl tcp: Add workaround for "dropped connection" issue

Work around a race condition in the TCP BTL's proc setup code.  The
Cisco MTT results have been failing on TCP tests due to a "dropped
connection" message some percentage of the time.  Some digging shows
that the issue occurs with a combination of multiple NICs and
multiple threads.  The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.

This patch doesn't fix the race, but avoids it by forcing the MPI
layer to complete all calls to add_procs across the entire job before
any process leaves MPI_INIT.  It also reduces the scalability of the
TCP BTL by increasing start-up time, but slower start-up is better
than hanging.

The long-term fix is to do all endpoint setup in the first call to
add_procs for a given remote proc, removing the race.  This patch is
a workaround until that fix can be developed.

Signed-off-by: Brian Barrett
---
 opal/mca/btl/tcp/btl_tcp_component.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/opal/mca/btl/tcp/btl_tcp_component.c b/opal/mca/btl/tcp/btl_tcp_component.c
index 2f891997a3..85a472b82b 100644
--- a/opal/mca/btl/tcp/btl_tcp_component.c
+++ b/opal/mca/btl/tcp/btl_tcp_component.c
@@ -1300,6 +1300,24 @@ mca_btl_base_module_t** mca_btl_tcp_component_init(int *num_btl_modules,
         }
     }
 
+    /* Avoid a race in wire-up when using threads (progress or user)
+       and multiple BTL modules.  The details of the race are in
+       https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032,
+       but the summary is that the lookup code in
+       component_recv_handler() below assumes that add_procs() is
+       atomic across all active TCP BTL modules, but in multi-threaded
+       code, that isn't guaranteed, because the locking is inside
+       add_procs(), and add_procs() is called once per module.  This
+       isn't a proper fix, but will solve the "dropped connection"
+       problem until we can come up with a more complete fix to how we
+       initialize procs, endpoints, and modules in the TCP BTL. */
+    if (mca_btl_tcp_component.tcp_num_btls > 1 &&
+        (enable_mpi_threads || 0 < mca_btl_tcp_progress_thread_trigger)) {
+        for (i = 0; i < mca_btl_tcp_component.tcp_num_btls; i++) {
+            mca_btl_tcp_component.tcp_btls[i]->super.btl_flags |= MCA_BTL_FLAGS_SINGLE_ADD_PROCS;
+        }
+    }
+
 #if OPAL_CUDA_SUPPORT
     mca_common_cuda_stage_one_init();
 #endif /* OPAL_CUDA_SUPPORT */
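
---

Note (illustration only, not part of the patch): the race class described
above can be hard to picture from prose alone.  The following standalone
C sketch is a hypothetical simplification -- none of these names are Open
MPI code -- showing how a per-module registration step (standing in for
add_procs()) that locks only one module at a time lets a concurrent
receive handler (standing in for component_recv_handler()) observe a proc
known to module 0 but not yet to module 1, producing a "dropped
connection".

/* race_sketch.c: illustration only -- NOT Open MPI code.  All names
 * here are hypothetical stand-ins.  Build: cc -pthread race_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_MODULES 2

/* One "proc is known" flag per module, each guarded by its own lock,
 * mirroring locking that lives inside each per-module add_procs() call
 * rather than around the whole wire-up sequence. */
static int known[NUM_MODULES];
static pthread_mutex_t locks[NUM_MODULES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Stand-in for the MPI layer calling add_procs() once per module:
 * each module is updated atomically, but the sequence as a whole
 * is not atomic. */
static void *wireup_thread(void *arg)
{
    (void) arg;
    for (int m = 0; m < NUM_MODULES; m++) {
        pthread_mutex_lock(&locks[m]);
        known[m] = 1;
        pthread_mutex_unlock(&locks[m]);
        usleep(1000);           /* widen the window between modules */
    }
    return NULL;
}

/* Stand-in for component_recv_handler(): an incoming connection
 * arrives for module 1 while wire-up is still in flight. */
static void *receiver_thread(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&locks[1]);
    int found = known[1];
    pthread_mutex_unlock(&locks[1]);
    printf(found ? "connection accepted\n"
                 : "dropped connection: proc not yet known to module 1\n");
    return NULL;
}

int main(void)
{
    pthread_t wireup, receiver;
    pthread_create(&wireup, NULL, wireup_thread, NULL);
    pthread_create(&receiver, NULL, receiver_thread, NULL);
    pthread_join(wireup, NULL);
    pthread_join(receiver, NULL);
    return 0;
}

Whether the sketch prints "accepted" or "dropped" depends entirely on
thread scheduling, which is the same nondeterminism seen in the MTT
runs.  Per the commit message above, setting MCA_BTL_FLAGS_SINGLE_ADD_PROCS
sidesteps the window by forcing all add_procs() calls to complete before
any process leaves MPI_INIT, at some cost in start-up time.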