This commit fixes a bug that can occur when communicating via XRC to
peers on the same node. UDCM was not saving the SRQ numbers on the
loopback endpoint (which shares its ib_addr info with all local peers)
so any messages to local peers use an invalid SRQ number.
Fixes open-mpi/ompi#1383
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
There is a bug in MPMD detection that disables totalview if a : is
found anywhere on the command line. This includes inside an argument
option or MCA variable value. This commit changes the check to look
for the string " : " instead of the character : which should eliminate
the issue in most cases.
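
A minimal sketch of the difference between the two checks. The helper
name cmdline_looks_mpmd and the idea of a single flattened command-line
string are assumptions for illustration, not the actual code touched by
this commit:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: report whether a flattened command line looks like
 * an MPMD launch, i.e. contains a standalone ":" separator. */
static bool cmdline_looks_mpmd(const char *cmdline)
{
    /* The old, overly broad check was equivalent to:
     *     strchr(cmdline, ':') != NULL
     * which also matches a ':' embedded in an argument or MCA value. */
    return strstr(cmdline, " : ") != NULL;
}
```

An argument that itself contains " : " (for example inside a quoted
string) would still match, which is why the fix only covers most cases.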
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes two issues with the ib_addr lock:
- The ib_addr lock must always be obtained regardless of
opal_using_threads() as the CPC is run in a separate thread.
- The ib_addr lock is held in mca_btl_openib_endpoint_connected when
calling back into the CPC start_connect on any pending
connections. This will attempt to obtain the ib_addr lock
again. Since this is not a performance-critical part of the code
the lock has been changed to be recursive.
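
The re-entrant locking pattern described above can be sketched with a
plain POSIX recursive mutex. This is only an illustration: the names
addr_lock, start_connect_sketch and endpoint_connected_sketch are
hypothetical stand-ins, and the real code uses OPAL's own lock types
rather than raw pthreads:

```c
/* Request POSIX/XSI definitions so PTHREAD_MUTEX_RECURSIVE is visible
 * even in strict ISO C modes. */
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <pthread.h>

/* Hypothetical stand-in for the ib_addr lock. */
static pthread_mutex_t addr_lock;

/* Initialize addr_lock as a recursive mutex. Returns 0 on success. */
static int addr_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (0 != rc) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (0 == rc) {
        rc = pthread_mutex_init(&addr_lock, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Stand-in for a CPC start_connect callback that takes the lock itself. */
static int start_connect_sketch(void)
{
    int rc = pthread_mutex_lock(&addr_lock);
    if (0 != rc) {
        return rc;
    }
    /* ... connection setup would happen here ... */
    return pthread_mutex_unlock(&addr_lock);
}

/* Stand-in for the "endpoint connected" path: it holds the lock while
 * calling back into start_connect_sketch on the same thread. With a
 * non-recursive mutex this second acquisition would deadlock or fail. */
static int endpoint_connected_sketch(void)
{
    int rc = pthread_mutex_lock(&addr_lock);
    if (0 != rc) {
        return rc;
    }
    rc = start_connect_sketch();
    pthread_mutex_unlock(&addr_lock);
    return rc;
}
```

A recursive mutex costs slightly more per acquisition than a plain one,
which is acceptable here because this path is not performance-critical.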
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug that occurs when attempting a get or put
operation on an endpoint that is not already connected. In this case
the remote_srqn may be set to an invalid value as the rem_srqs array
on the endpoint is not populated. This commit moves the usage of the
rem_srqs array to the internal put/get functions where it is
guaranteed this array is populated.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK.
If we only partially use a page, any data allocated on the remainder of
the page will be inaccessible to the child process.
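
One way to avoid sharing a page with unrelated data is to round every
allocation up to a whole number of pages. The helper below is a
hypothetical sketch of that arithmetic, assuming a power-of-two page
size; it is not taken from the actual change:

```c
#include <stddef.h>

/* Hypothetical helper: round len up to a whole number of pages.
 * page_size must be a power of two, for example the value returned by
 * sysconf(_SC_PAGESIZE). */
static size_t round_up_to_page(size_t len, size_t page_size)
{
    return (len + page_size - 1) & ~(page_size - 1);
}
```

With a 4096-byte page, a 1-byte request becomes 4096 bytes and a
4097-byte request becomes 8192 bytes, so no other allocation lands on a
page that may be marked MADV_DONTFORK.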
Fixes open-mpi/ompi#1363
This commit fixes a memory corruption bug when parsing lines of the
form:
-x FOO=bar
The code was making changes to the size of the buffer allocated for
key_buffer without making the appropriate changes to
key_buffer_len. This was causing subsequent calls to save_param_name
to write to invalid memory.
This commit makes the following changes:
- Fix the above bug by modifying trim_name to move the string within
the buffer instead of re-allocating space for the trimmed string.
- Clean up both trim_name and save_param_name. Both functions took
a prefix and suffix to trim, but the prefix was not treated like a
prefix. Instead the "prefix" was located inside the string using
strstr and the trimmed value started after the substring (even in
the middle of the string). To allow trimming both -x and --x (as
well as -mca and --mca) trim_name is now called with each prefix.
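
The in-place approach can be sketched as follows. The function name
trim_prefix_in_place and its exact behavior are assumptions for
illustration; the real trim_name also handles a suffix and may differ
in detail:

```c
#include <string.h>

/* Hypothetical sketch: if buffer starts with prefix, drop the prefix,
 * then drop any leading blanks. The remaining text is moved to the front
 * of the same buffer, so the allocation (and any recorded buffer length)
 * never changes. */
static void trim_prefix_in_place(char *buffer, const char *prefix)
{
    size_t prefix_len = strlen(prefix);
    char *start = buffer;

    /* Only a match at the very beginning counts as a prefix. */
    if (0 == strncmp(buffer, prefix, prefix_len)) {
        start += prefix_len;
    }
    while (' ' == *start || '\t' == *start) {
        ++start;
    }
    if (start != buffer) {
        /* memmove handles the overlapping source and destination. */
        memmove(buffer, start, strlen(start) + 1);
    }
}
```

Calling it once per accepted prefix (for example "--x" and then "-x")
mirrors the approach of invoking trim_name with each prefix in turn.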
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit ensures ib_addr->remote_xrc_rcv_qp_num value is set when
creating the loopback queue pair. This is needed when communicating
with any other local peer.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes two bugs in XRC support:
- When dynamic add_procs support was added to master the remote
process name was added to the non-XRC request structure. The same
value was not added to the XRC xconnect structure. This error was
not caught because the send/recv code was incorrectly using the
wrong structure member. This commit adds the member and ensures the
xconnect code uses the correct structure.
- XRC loopback QP support has been fixed. It was 1) not setting the
correct fields on the endpoint structure, 2) calling
udcm_xrc_recv_qp_connect, and 3) not initializing the endpoint
data.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
mca_btl_openib_put incorrectly checks the qp inline max before
allowing an inline put. This check will always fail for an endpoint
that has not been connected. The commit changes the check to use the
btl_put_local_registration_threshold instead.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Commit open-mpi/ompi@400af6c52d
introduced a regression in XRC support. The commit reversed the
ordering of shared receive queue (SRQ) and completion queue (CQ)
creation. CQ creation must always precede SRQ creation when using
XRC as the CQs are needed to create the SRQs. This commit fixes the
ordering so that CQs are always created before SRQs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This fixes open-mpi/ompi@8b05f308f9
libmpi.so cannot be built (unresolved symbols) when configure'd with
--disable-mem-debug --disable-mem-profile --disable-memchecker --without-memory-manager