btl tcp: Add workaround for "dropped connection" issue
Work around a race condition in the TCP BTL's proc setup code. The Cisco MTT results have been failing on TCP tests due to a "dropped connection" message some percentage of the time. Some digging shows that the issue happens in a combination of multiple NICs and multiple threads. The race is detailed in https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032. This patch doesn't fix the race, but avoids it by forcing the MPI layer to complete all calls to add_procs across the entire job before any process leaves MPI_INIT. It also reduces the scalability of the TCP BTL by increasing start-up time, but better than hanging. The long term fix is to do all endpoint setup in the first call to add_procs for a given remote proc, removing the race. THis patch is a work around until that patch can be developed. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Этот коммит содержится в:
родитель
37a3f32e47
Коммит
2acc4b7e7f
@ -1300,6 +1300,24 @@ mca_btl_base_module_t** mca_btl_tcp_component_init(int *num_btl_modules,
|
||||
}
|
||||
}
|
||||
|
||||
/* Avoid a race in wire-up when using threads (progess or user)
|
||||
and multiple BTL modules. The details of the race are in
|
||||
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032,
|
||||
but the summary is that the lookup code in
|
||||
component_recv_handler() below assumes that add_procs() is
|
||||
atomic across all active TCP BTL modules, but in multi-threaded
|
||||
code, that isn't guaranteed, because the locking is inside
|
||||
add_procs(), and add_procs() is called once per module. This
|
||||
isn't a proper fix, but will solve the "dropped connection"
|
||||
problem until we can come up with a more complete fix to how we
|
||||
initialize procs, endpoints, and modules in the TCP BTL. */
|
||||
if (mca_btl_tcp_component.tcp_num_btls > 1 &&
|
||||
(enable_mpi_threads || 0 < mca_btl_tcp_progress_thread_trigger)) {
|
||||
for( i = 0; i < mca_btl_tcp_component.tcp_num_btls; i++) {
|
||||
mca_btl_tcp_component.tcp_btls[i]->super.btl_flags |= MCA_BTL_FLAGS_SINGLE_ADD_PROCS;
|
||||
}
|
||||
}
|
||||
|
||||
#if OPAL_CUDA_SUPPORT
|
||||
mca_common_cuda_stage_one_init();
|
||||
#endif /* OPAL_CUDA_SUPPORT */
|
||||
|
Загрузка…
x
Ссылка в новой задаче
Block a user