openmpi/opal/mca/btl
Brian Barrett 2acc4b7e7f btl tcp: Add workaround for "dropped connection" issue
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads.  The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.

This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT.  This
reduces the scalability of the TCP BTL by increasing start-up
time, but that is better than hanging.

The long-term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race.  This patch is a workaround until that fix can
be developed.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-16 18:33:30 -07:00
..
base Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
ofi Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
openib Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
portals4 opal: add types for atomic variables 2018-09-14 10:48:55 -06:00
self mca: Dynamic components link against project lib 2017-08-24 11:56:16 -04:00
sm btl/sm: fix CID 1415105 2018-03-26 14:21:21 -07:00
smcuda Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
tcp btl tcp: Add workaround for "dropped connection" issue 2018-10-16 18:33:30 -07:00
template mca: Dynamic components link against project lib 2017-08-24 11:56:16 -04:00
uct btl/uct: fix deadlock in connection code 2018-10-16 18:28:47 -06:00
ugni Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
usnic Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
vader Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
btl.h btl/vader: add support for atomics and emulated rdma 2018-07-02 13:57:11 -06:00
Makefile.am Purge whitespace from the repo 2015-06-23 20:59:57 -07:00