2acc4b7e7f
Work around a race condition in the TCP BTL's proc setup code. The Cisco MTT results have been failing on TCP tests due to a "dropped connection" message some percentage of the time. Some digging shows that the issue occurs with a combination of multiple NICs and multiple threads. The race is detailed in https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032. This patch doesn't fix the race, but avoids it by forcing the MPI layer to complete all calls to add_procs across the entire job before any process leaves MPI_INIT. This reduces the scalability of the TCP BTL by increasing start-up time, but that is better than hanging. The long-term fix is to do all endpoint setup in the first call to add_procs for a given remote proc, removing the race. This patch is a workaround until that fix can be developed. Signed-off-by: Brian Barrett <bbarrett@amazon.com> |
||
---|---|---|
.. | ||
class | ||
datatype | ||
dss | ||
etc | ||
include | ||
mca | ||
memoryhooks | ||
runtime | ||
test/reachable | ||
threads | ||
tools | ||
util | ||
win32 | ||
common_sym_whitelist.txt | ||
Makefile.am | ||
win_makefile |