The problem was in mca_btl_openib_proc_create. This function may be called
from several places simultaneously:
* from the main thread when somebody wants to do `MPI_Send()` (for example) for
the first time;
* from udcm if the counterpart peer is trying to connect and `mca_btl_openib_get_ep()`
is called.
In this case one of the threads may add an uninitialized proc structure
to the `mca_btl_openib_component.ib_procs` and the other will read it and
treat as initialized.
This commit turns ib_proc initialization into a single atomic operation.
Move .so version numbers to their appropriate project in the top-level
VERSION file. Also add the project name to all .so version number
names. Remove no-longer-used .so names.
Update the comment explaining why we don't build the verbs libfabric
provider (because Open MPI doesn't use it). Specifically: the
upstream libfabric bug about building libfabric under travis has been
fixed -- so that's no longer the reason we don't build it.
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.
Closes#1204 (replaces it)
Fixes#1064
One of the criticisms of Travis on the call earlier today was that
when there's a compile failure, you really only see the *first*
compile failure. Then you fix that, resubmit, and 15-60 minutes
later, you see the *next* compile failure.
Per some discussion with @bosilca and @rhc54 this afternoon, use "make
-k", which will have "make" try to continue the build even after
compile failures. Hence, you'll be able to see many more compile
errors that just one at a time.
In some cases (e.g., if there's a typo in a top-level header file), it
will cause a bazillion lines of compile error output, but hopefully
that should be easy to spot / ignore.
A previous commit updated the one-sided code to register the state
region only once. This created an issue when using the scratch lock
with fetching atomics. In this case on any rank that isn't local rank
0 the module->state_handle is NULL. This commit fixes the issue by
removing the scratch lock and using a fragment pointer instead.
Fixesopen-mpi/ompi#1290
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
It was decided some time ago that there is no benefit to using any
per-peer receive queues on infiniband. At the time we decided not to
change the default but that objection has been dropped. This commit
changes the 128 message queue to use SRQ instead of PP. This has no
impact on iWarp which sets the default in a different way.
Closesopen-mpi/ompi#1156
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>