This commit makes datagram checks time-based and reduces their
frequency when only the wildcard datagram is posted. This change
improves latency on KNL systems.
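
A minimal sketch of the time-based check, not the btl/ugni code itself.
The structure, the function name, and both interval values are
hypothetical; the commit does not state the intervals that are used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical intervals in nanoseconds (assumed for illustration). */
#define DG_INTERVAL_ACTIVE_NS    1000u    /* directed datagrams posted */
#define DG_INTERVAL_WILDCARD_NS  100000u  /* only the wildcard posted  */

struct dg_poll_state {
    uint64_t last_check_ns;    /* time of the last datagram check */
    int      directed_posted;  /* count of non-wildcard datagrams posted */
};

/* Return true when enough time has passed to poll for datagrams again.
 * The interval is longer when only the wildcard datagram is posted. */
static bool dg_should_check(struct dg_poll_state *s, uint64_t now_ns)
{
    uint64_t interval = (s->directed_posted > 0) ? DG_INTERVAL_ACTIVE_NS
                                                 : DG_INTERVAL_WILDCARD_NS;

    if (now_ns - s->last_check_ns < interval) {
        return false;  /* too soon, skip the check */
    }

    s->last_check_ns = now_ns;
    return true;
}
```

The caller supplies the current time, for example from
clock_gettime(CLOCK_MONOTONIC), so the progress loop pays for one
timestamp comparison instead of a datagram probe on every pass.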
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit removes the local operation count check from the local SMSG
completion queue. This check was leading to hangs due to an undocumented
feature of the ugni library. The local SMSG CQ is used to send credit
return messages back to the sender. The ugni library never checks for
the completion itself but relies on the SMSG user to periodically
check the CQ.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the ugni btl to make use of multiple device
contexts to improve the multi-threaded RMA performance. This commit
contains the following:
- Clean up the endpoint structure by removing unnecessary fields. The
structure now also contains all the fields originally handled by the
common/ugni endpoint.
- Clean up the fragment allocation code to remove the need to
initialize the my_list member of the fragment structure. This
member is not initialized by the free list initializer function.
- Remove the (now unused) common/ugni component. btl/ugni no longer
needs the component. common/ugni was originally split out of
btl/ugni to support bcol/ugni. As that component no longer exists
there is no reason to keep this component.
- Create wrappers for the ugni functionality required by
btl/ugni. This was done to ease supporting multiple device
contexts. The wrappers are thread safe and currently use a spin
lock instead of a mutex. This produces better performance when
using multiple threads spread over multiple cores. In the future
this lock may be replaced by another serialization mechanism. The
wrappers are located in a new file: btl_ugni_device.h.
- Remove unnecessary device locking from serial parts of the ugni
btl. This includes the first add-procs and module finalize.
- Clean up fragment wait list code by moving enqueue into a common
function.
- Expose the communication domain flags as an MCA variable. The
defaults have been updated to reflect the recommended setting for
KNL and Haswell.
- Avoid allocating fragments for communication with already
overloaded peers.
- Allocate RDMA endpoints dynamically. This is needed to support
spreading RMA operations across multiple contexts.
- Add support for spreading RMA communication over multiple ugni
device contexts. This should greatly improve the threading
performance when communicating with multiple peers. By default the
number of virtual devices depends on 1) whether
opal_using_threads() is set, 2) how many local processes are in the
job, and 3) how many bits are available in the pid. The last is
used to ensure that each CDM is created with a unique id.
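
The spin-lock wrappers and the spreading of operations over several
device contexts described above can be sketched as follows. This is an
illustration under stated assumptions, not the contents of
btl_ugni_device.h: struct vdevice and every function name are
hypothetical, and round-robin selection is an assumed policy that the
commit does not specify.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for one virtual device context. */
struct vdevice {
    atomic_flag lock;    /* spin lock serializing access to this context */
    int         posted;  /* example state that is only touched under the lock */
};

static void vdevice_init(struct vdevice *dev)
{
    atomic_flag_clear(&dev->lock);
    dev->posted = 0;
}

static void vdevice_lock(struct vdevice *dev)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&dev->lock, memory_order_acquire)) {
        /* spin */
    }
}

static void vdevice_unlock(struct vdevice *dev)
{
    atomic_flag_clear_explicit(&dev->lock, memory_order_release);
}

/* Thread-safe wrapper: the lock is held only around the device operation. */
static int vdevice_post(struct vdevice *dev)
{
    int rc;

    vdevice_lock(dev);
    rc = ++dev->posted;  /* placeholder for the real device call */
    vdevice_unlock(dev);

    return rc;
}

/* Assumed policy: pick the next device round-robin so that concurrent
 * threads tend to land on different contexts and different locks. */
static struct vdevice *vdevice_select(struct vdevice *devs, unsigned count,
                                      atomic_uint *next)
{
    assert(count > 0);
    return &devs[atomic_fetch_add(next, 1) % count];
}
```

Because each context has its own lock, two threads contend only when
they select the same virtual device.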
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit exposes ugni statistics for use with MPI_T. There is
no overhead to providing these counters.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This is mostly for error cases, where we need to release the
newly created proc. Currently the code deadlocks because the endpoint
lock is held at the release and the lock is not recursive.
Also added some code to print the IP addresses that don't match during
the TCP connection step.
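
A minimal sketch of the locking pattern behind the deadlock fix,
assuming the release path itself takes the endpoint lock. The structure
and function names are hypothetical and are not the btl/tcp code.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical endpoint protected by a non-recursive mutex. */
struct endpoint {
    pthread_mutex_t lock;
    int             released;
};

/* The release path takes the endpoint lock, so the caller must not
 * already hold it: a default (non-recursive) mutex would deadlock. */
static void proc_release(struct endpoint *ep)
{
    pthread_mutex_lock(&ep->lock);
    ep->released = 1;
    pthread_mutex_unlock(&ep->lock);
}

/* Error path: drop the endpoint lock before releasing the proc. */
static int endpoint_connect_failed(struct endpoint *ep)
{
    pthread_mutex_lock(&ep->lock);
    /* ... the error is detected while the lock is held ... */
    pthread_mutex_unlock(&ep->lock);

    proc_release(ep);  /* safe: the lock is no longer held here */
    return -1;
}
```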
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).
This commit:
1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
rest of the code base.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Add verbose output to show all the failed attempts to match the
remote interfaces based on the modex info.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Since the oob and connections systems do not work the same way they
did in older versions of Open MPI these operations are no longer
necessary. At best they do nothing and at worst they hurt performance
by making us enter the event library more often in opal_progress().
Fixes #2839
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Due to the conversion from ssize_t to int we were losing bytes, and
ended up writing outside the receiver buffer. Similarly on the send,
due to the conversion to a lesser type, we could misinterpret the
end of the fragment.
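
A small sketch of the hazard, assuming a receive path that tracks an
offset into a fixed-size buffer. The helper names are hypothetical. The
idea is to keep byte counts in their full-width type and to narrow to
int only when the value is known to fit.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/* Narrow a byte count to int only when no bits would be lost. */
static int len_to_int(ssize_t len, int *out)
{
    if (len < 0 || len > (ssize_t) INT_MAX) {
        return -1;  /* would not survive the conversion */
    }
    *out = (int) len;
    return 0;
}

/* Advance a receive offset with the full-width count returned by a
 * read-style call, refusing to move past the end of the buffer.
 * Assumes *offset <= capacity on entry. */
static int recv_advance(size_t *offset, size_t capacity, ssize_t nread)
{
    if (nread < 0 || (size_t) nread > capacity - *offset) {
        return -1;
    }
    *offset += (size_t) nread;
    return 0;
}
```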
the hwloc topology might not contain a NUMA object with hwloc < v2
if the node is not NUMA, so force the NUMA object count to one
in order to correctly allocate mca_btl_sm_component.sm_mpools.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This should probably not go to the v2.x branch, since it changes the
output format of the usnic stats.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Double check the queue lengths that we get back from libfabric to
ensure that they are at least as long as we need. They *should* never
be shorter than we need, but let's just check to be sure.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Don't just blindly send ACKs; ensure that we have send credits before
doing so. If we don't have any send credits, just don't send the ACK
(it'll come again soon enough; it's not a tragedy if we don't send it
now).
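
The credit check can be sketched like this. The channel structure and
function name are hypothetical and the credit accounting is simplified;
this is not the usnic BTL code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-channel send state. */
struct channel {
    int send_credits;  /* send slots currently available */
    int acks_sent;     /* ACKs actually posted */
};

/* Post an ACK only if a send credit is available. When there is none
 * the ACK is skipped; a later progress pass gets another chance. */
static bool try_send_ack(struct channel *ch)
{
    if (ch->send_credits <= 0) {
        return false;  /* no credit: do not send now */
    }

    --ch->send_credits;
    ++ch->acks_sent;   /* placeholder for the real send */
    return true;
}
```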
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The libfabric usnic provider may give you back TX/RX queues that are
longer than you asked for. So just use the TX/RQ/CQ lengths that we
asked for, regardless of what length comes back.
Additionally, keep the length of the priority channel CQ separate from
the length of the data CQ.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Since the usnic BTL is single-threaded in this area, there really is
no danger, but don't use one of the pointers hanging off the frag
after we return it to the freelist. Instead, save the endpoint
pointer before returning the frag.
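
A minimal sketch of the pattern, using a hypothetical free list and
fragment type rather than the usnic BTL structures:

```c
#include <assert.h>
#include <stddef.h>

struct endpoint {
    int pending;  /* fragments still outstanding on this endpoint */
};

struct frag {
    struct endpoint *endpoint;
    struct frag     *next;
};

struct freelist {
    struct frag *head;
};

/* Return a fragment to the free list. The list owns the fragment from
 * here on and may reset or reuse its fields. */
static void freelist_return(struct freelist *fl, struct frag *f)
{
    f->endpoint = NULL;
    f->next = fl->head;
    fl->head = f;
}

static void frag_complete(struct freelist *fl, struct frag *f)
{
    struct endpoint *ep = f->endpoint;  /* save before giving up the frag */

    freelist_return(fl, f);
    --ep->pending;  /* uses the saved pointer, never f->endpoint */
}
```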
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The types are technically typedef equivalent, but it's less confusing
to use the types that agree with the name of the constructor.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:
* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
instead of the topology itself, if available, thus avoiding instantiating the topology
* openib BTL. This uses the distance matrix. At present, I haven't developed a method
for replacing that reference. Thus, this component will instantiate the topology
* usnic BTL. Uses the distance matrix.
* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
the topology
* ess base functions. If a process is direct launched and not bound at launch, this
code attempts to bind it. Thus, procs in this scenario will instantiate the
topology
Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.
Fix pernode binding policy
Properly handle the unbound case
Correct pointer usage
Do not free static error messages!
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
opal_convertor_pack() might pack fewer bytes than requested,
so always set frag->segments[0].seg_len.
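
A sketch of the pattern, with pack_some() as a hypothetical stand-in
for a packing routine that reports the number of bytes actually packed.
It does not reproduce the real opal_convertor_pack() signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct segment {
    void  *addr;
    size_t seg_len;
};

/* Hypothetical packer: on entry *size is the requested byte count, on
 * return it is the count actually packed, which may be smaller. */
static void pack_some(const char *src, size_t avail, void *dst, size_t *size)
{
    if (*size > avail) {
        *size = avail;
    }
    memcpy(dst, src, *size);
}

static void prepare_segment(struct segment *seg, void *buf,
                            const char *src, size_t avail, size_t requested)
{
    size_t packed = requested;

    pack_some(src, avail, buf, &packed);

    seg->addr = buf;
    seg->seg_len = packed;  /* always the packed size, not the request */
}
```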
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
since pthreads are now mandatory, the MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD
is always true and hence can be safely removed
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Remove BTL_OPENIB_FAILOVER_ENABLED code in the openib btl source.
Remove the failover-specific files from the openib btl.
Update the openib/Makefile.am accordingly.
Remove the --enable-openib-failover config logic.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
This commit rewrites much of the btl/self component to fix a long
standing memory usage bug. Before this commit the prepare_src path
would always allocate a max send fragment (256kB). This caused the
rank to allocate 32 * 256k useless buffers from one send. This commit
makes the following changes:
- Add the MCA_BTL_FLAGS_GET flag by default. No reason not to set it.
- Reduce the eager limit, max send size, buffers per allocation, and
maximum buffer count per fragment size. These changes should have
no noticeable effect on performance but should greatly reduce the
memory usage of the component.
- Implement the sendi function. This should reduce self send latency
somewhat.
- Rewrite prepare_src to never allocate an eager or max send fragment
for contiguous data.
- add_procs needs to return something in the peer array for the proc
self not just set the reachability bit. Now stores (void *) 1.
- Various cleanups. Removed an unused file.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The vader btl kept a per-peer registration cache to keep track of
attachments. This is not really a problem with small numbers of local
ranks but can be a problem with large SMP machines. To reduce the
footprint there is now one registration cache for all xpmem
attachments. This will probably increase the lookup time for large
transfers but is a worthwhile trade-off.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
An OMPI header cannot be included in OPAL source code, so remove it.
Fixes: (740b636db) btl/openib: Disqualify rdmacm CPC if
MPI_THREAD_MULTIPLE.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>