Some BTLs do not require local registration for some rdma
transactions. For example: inline put on openib, fma put on ugni. This
commit adds code to expose the local registration thresholds to BTL
users. Optimized code can take advantage of this information to
improve rdma performance.
@ggouaillardet identified that HAVE_ALIAS_ATTRIBUTE was not properly
being defined in the embedded libfabric. This is because the
embedded configury missed the test for it (i.e., the real configure.ac
for libfabric always defines HAVE_ALIAS_ATTRIBUTE to 0 or 1 -- we
didn't emulate that properly here in libfabric's configure.m4).
Also, fix some grammar and properly escape another AC_MSG_CHECKING
message in libfabric's configure.m4.
When configured --with-devel-headers, there's now 2 "osd.h" header
files in libfabric (in different dirs). Automake's "install" target
didn't like this, and errored out.
Since embedding libfabric is a temporary measure, just avoid the
problem by not installing any libfabric headers.
Add the functions that changed between BTL 2.0 and 3.0 into compat.h
and compat.c:
* module.btl_prepare_src: the signature and body of this method
changed between 2.0 and 3.0. However, the functions that this
method calls did *not* need to change, so they are copied over
wholesale (with the exception that they no longer accept the unused
`registration` parameter).
* module.btl_prepare_dst: this method does not exist in BTL 3.0.
* module.btl_put: the signature and body of this method changed
between 2.0 and 3.0.
The send inline optimization uses the btl_sendi function to achieve lower
latency and higher message rates. Before this commit BTLs were allowed to
assume the descriptor was non-NULL and were expected to return a valid
descriptor if the send could not be completed using btl_sendi. This
behavior was fine until the usage of btl_sendi was changed in ob1. This
commit allows the caller to specify NULL for the descriptor. The affected
btls have been updated to handle this case.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds an interface for btl's to export support for 64-bit atomic
operations on integers. BTL's that can support atomic operations should
implement these functions and set the appropriate btl_flags and btl_atomic_flags.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The old BTL interface provided support for RDMA through the use of
the btl_prepare_src and btl_prepare_dst functions. These functions were
expected to prepare as much of the user buffer as possible for the RDMA
operation and return a descriptor. The descriptor contained segment
information on the prepared region. The btl user could then pass the
RDMA segment information to a remote peer. Once the peer received that
information it then packed it into a similar descriptor on the other
side that could then be passed into a single btl_put or btl_get
operation.
Changes:
- Added functions to register and deregister memory regions with the
btl. If no registration is needed a btl should set these function
pointers to NULL. These function take over for btl_prepare_src/dst
and btl_free for RDMA operations. The caller should specify the
maximum permissions needed on the memory.
- Changed the function signatures for both btl_put and btl_get. In
place of a prepared descriptor the caller should provide the source
and destination addresses and registration handles as well as a
new callback function. The callback will be provided with the local
address and registration handle, callback context, callback data, and
status. See mca_btl_base_rdma_completion_fn_t in btl.h.
- Added a new btl constraint: MCA_BTL_REG_HANDLE_MAX_SIZE. This
value specifies the maximum size of any btl's registration handle.
- Removed the btl_prepare_dst function. This reflects the fact that
RDMA operations no longer depend on "prepared" descriptors.
- Removed the btl_seg_size member. There is no need to btl's to
subclass the mca_btl_base_segment_t class anymore.
- Expose the btl's put/get limitations with new struct members:
btl_put_limit, btl_put_alignment, btl_get_limit, btl_get_alignment.
- Remove the mca_mpool_base_registration_t argument from the btl_prepare_src
function. The argument was intended to support RDMA operations and is no
longer necessary.
- Remove des_remote/des_remote_count from the mca_btl_base_descriptor_t
structure. This structure member was originally used to specify the remote
segment for RDMA operations. Since the new btl interface no longer uses
desriptors for RDMA this member no longer has a purpose. In addition
to removing these members the local segment structure fields have been
renamed to from des_local/des_local_count to des_segments/des_segment_count.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per feedback from rhc, manually set the base_ptr member
of the opal_buffer_t variable to NULL prior to calling
OBJ_RELEASE. A similar feature of opal_dss.load also
exists so likewise reset the base_ptr to NULL prior to
invoking it.
Hopefully the opal_buffer_t struct does not change
frequently.
Minor cleanups to reduce output when pmix_base_verbose
mca paramater is set.
usnic_fls() can actually return 0, leading us to incorrectly free() a
buffer instead of OMPI_FREE_LIST_RETURN_MT'ing it.
So add an explicit bool in the struct that tracks whether the buffer
came from malloc or a freelist.
This was CID 1269660.
Coverity alerted us to the fact that there are places where
the synonym_for param is hard-coded to -1 when calling
register_variable(). It would be a coding error if synonym_for==-1
and (flags & MCA_BASE_VAR_FLAG_SYNONYM)>0, so let's add that to the
debug-only check at the top of the function.
This was CID 993717.
Remove use of the Cray PMI KVS - which is designed for a lighweight
MPI that exchanges only a minimimal amount of connection info
(about 128 bytes per rank) - within cray/pmix. Use Cray PMI
collective extensions instead.
This is the first of several steps to accelerate launch of
Open MPI on Cray systems using either native aprun or nativized
slurm.
in this code and it ended up changing the logic that is used to set up eager RDMA.
Rather than setting up eager RDMA with a high priority message, it did it the other
way around. For some reason, CUDA-aware support did not like this. So, basically,
restore the logic to the way it was prior to the refactoring. The refactoring did not
intend to change this. Lightly reviewed by hjelmn.
Per open-mpi/ompi#381, convert the specific intialization of opal_pmix
to use the generic "= { 0 }" initializer. This form can be used to
initialize any type when the intent is just to zero out / assign
*some* value.
Embedding libfabric is a temporary measure; I'm removing some warning
notifications so that the output isn't so cluttered (we're getting
the real warnings fixed upstream, but the OMPI community doesn't
really care/need to see the warnings in the meantime).
This commit adds an MCA variable to select Portals4 logical
addressing, populates the logical-to-physical mapping table and
initializes the NI in this mode.
For static builds, we need to also set
<framework>_<component>_WRAPPER_EXTRA_LIBS so that the wrappers know
what other libraries to add to link executables.
Ensure to count *this* process when checking for how many VFs we need
on the local server.
(cherry picked from commit 386c01934e98cb8dcb48ff648ecdfb0c8677baa9)
There was a mismatch between the structure for mca_rcache_vma_t and
the OBJ_CLASS_INSTANCE. One was opal_list_item_t and the other was
ompi_free_list_item_t. The super class in the structure looks like it
is the correct one. Changed the superclass in OBJ_CLASS_INSTANCE to
match.
If there are not enough resources (e.g., low VFs), we can end up
calling finalize_one_channel() on the same channel multiple times. So
ensure to NULL out fields that we have freed already so that we do not
try to free them a second time.
Fixes CSCus26648.
Fix the ordering so that we obtain the usnic netmask information
*before* we do the filtering based on CIDR-specified networks.
Also requires upstream Github libfabric commit 3976745.
Fixes CSCus22495.
We had several problems in the old code:
1. We were specifying an arbitrary timeout (100 ms) and then abandoning
all remaining pending AV insert operations. We would then free the
endpoint buffer that we gave to fi_av_insert(), usually causing
libfabric's progress thread to write to a freed buffer.
2. We were claiming in a show_help message that the timeout was
controllable via an MCA parameter. This commit removes that
parameter, since there's no good method for us to specify a timeout
like this to libfabric right now.
3. We also weren't waiting for the correct number of fi_av_insert()
operations to complete. We were waiting for nprocs, which is
accidentally fine for 2 procs on separate hosts, but not for most
other proc counts.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
There was a bug in the openib btl handling this valid sequence of
calls:
desc = btl_alloc ();
btl_free (desc);
When triggered the bug would cause either fragment loss or undefined
behavior (SEGV, etc). The problem occured because btl_alloc contained
the logic to modify the pending fragment (length, etc) and these
changes were not corrected if the fragment was freed instead of sent.
To fix this issue I 1) moved some of the coalescing logic to the
btl_send function, and 2) retry the coalesced fragment on btl_free
if it was never sent. This appears to completely address the issue.
For systems with OFED's lacking XRC support, commit b3617e73
broke the build of the openib btl. This commit addresses
the issues introduced by this commit.
This was more complicated than I would like, but it's just an
unfortunate GCC/clang difference. I don't have access to all the C
compilers out there, so this may still have problems with other
compilers that implement some form of `#pragma GCC diagnostic` support
but don't actually behave the same as some versions of GCC.
fixes#323
Use the pkg-config related m4 functions to find out where
Cray's xpmem.h and libxpmem are located on a system.
With this commit, there is no longer any need to have to
explicitly indicate an xpmem install location on the configure
line, at least for Cray systems running CLE 4.X and 5.X.
Replace temporary environment variables with a MCA
parameter for the ugni btl. A user wishing to
use the ugni btl async. progress thread needs to
set the request_progress_thread param to true.
For example, using env. variable format:
export OMPI_MCA_btl_ugni_request_progress_thread=1
Verified via testing with unit tests, etc. that
in fact BTE TX descriptors using CQs configured to
generate IRQs were in fact working correctly on Cray XC. Disable
send message back to self and just use IRQs generated
by completion of TX descriptors posted to BTE.
Honor enable_mpi_threads setting to enable the ugni btl
async progress thread. If the app doesn't request thread-multiple
the thread will not be created.
- by default allow to register maximum possible (i.e 2 * total_memory)
memory. This beheviour can be turned off using mca parameter
"btl_openib_allow_max_memory_registration"
- In fallback case, use device specific parameters to calulate
memory limit.
Properly test for some dependent libraries; don't just assume
elsewhere in Open MPI's configury will find those libraries. Also
consolidate some CPPFLAGS and clarify some comments.