openmpi/opal
Brian Barrett 2acc4b7e7f btl tcp: Add workaround for "dropped connection" issue
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads.  The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.

This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT.  The cost is
reduced scalability of the TCP BTL, because start-up time
increases, but that is better than hanging.
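
The ordering constraint described above (no process leaves init until
every process has finished add_procs) can be illustrated with a small
stand-alone sketch.  This is not Open MPI code and not the mechanism
the patch uses: threads stand in for MPI processes, and the names
fake_add_procs, fake_mpi_init, and run_job are hypothetical, chosen
only for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define NPROCS 4   /* assumed job size for this illustration */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t all_added = PTHREAD_COND_INITIALIZER;
static int procs_added;   /* "processes" that have finished add_procs */
static int clean_exits;   /* exits that saw every peer already set up */

/* Hypothetical stand-in for per-process endpoint setup. */
static void fake_add_procs(int rank) { (void)rank; }

/* Hypothetical stand-in for MPI_INIT: set up, then wait for all peers. */
static void *fake_mpi_init(void *arg)
{
    fake_add_procs(*(int *)arg);

    pthread_mutex_lock(&lock);
    procs_added++;
    if (procs_added == NPROCS) {
        /* Last one in releases everybody. */
        pthread_cond_broadcast(&all_added);
    }
    while (procs_added < NPROCS) {
        /* Nobody leaves init until the whole job has called add_procs. */
        pthread_cond_wait(&all_added, &lock);
    }
    if (procs_added == NPROCS) {
        clean_exits++;
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Runs one simulated job; returns how many exits happened after the
 * whole job had completed add_procs. */
int run_job(void)
{
    pthread_t threads[NPROCS];
    int ranks[NPROCS];
    int i;

    procs_added = 0;
    clean_exits = 0;
    for (i = 0; i < NPROCS; i++) {
        ranks[i] = i;
        int rc = pthread_create(&threads[i], NULL, fake_mpi_init, &ranks[i]);
        assert(rc == 0);
    }
    for (i = 0; i < NPROCS; i++) {
        pthread_join(threads[i], NULL);
    }
    return clean_exits;
}
```

Calling run_job() from a main() returns NPROCS, since every simulated
process is held at the barrier until the last one arrives.  The same
wait is what adds to start-up time: each process is delayed by the
slowest peer.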

The long-term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race.  This patch is a workaround until that fix can
be developed.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-16 18:33:30 -07:00
class opal/free_list: fix race condition 2018-10-16 13:17:09 -06:00
datatype opal/dataype: add additional interface to retrieve more details about 2018-06-21 09:25:50 -05:00
dss Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
etc Correct the comment in the default MCA param template - we do not support a param called "component_path". The correct syntax is "mca_base_component_path" 2018-01-05 08:46:44 -08:00
include opal/atomic: always use C11 atomics if available 2018-09-14 10:51:05 -06:00
mca btl tcp: Add workaround for "dropped connection" issue 2018-10-16 18:33:30 -07:00
memoryhooks opal: rename opal_atomic_init to opal_atomic_lock_init 2017-08-07 14:15:11 -06:00
runtime Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
test/reachable opal: update some string handling 2018-10-14 16:04:28 -07:00
threads opal/atomic: always use C11 atomics if available 2018-09-14 10:51:05 -06:00
tools Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
util opal: update some string handling 2018-10-14 16:04:28 -07:00
win32 opal: convert from strncpy() -> opal_string_copy() 2018-09-27 11:56:18 -07:00
common_sym_whitelist.txt opal: add code patcher framework 2016-04-13 17:16:13 -06:00
Makefile.am opal: remove generated asm code 2017-08-03 09:18:58 -06:00
win_makefile Purge whitespace from the repo 2015-06-23 20:59:57 -07:00