This commit improves the injection rate and latency for RDMA
operations. This is done by the following improvements:
- If C11's _Thread_local keyword is available then always use the
same virtual device index for the same thread when using RDMA. If
the keyword is not available then attempt to use any device that
isn't already in use. The binding support is enabled by default but
can be disabled via the btl_ugni_bind_devices MCA variable.
- When posting FMA and RDMA operations always attempt to reap
completions after posting the operation. This allows us to
better balance the work of reaping completions across all
application threads.
- Limit the total number of outstanding BTE transactions. This
fixes a performance bug when using many threads.
- Split out RDMA and local SMSG completion queue sizes. The RDMA
queue size is better tuned for performance with RMA-MT.
- Split out put and get FMA limits. The old btl_ugni_fma_limit MCA
variable is deprecated. The new variable names are:
btl_ugni_fma_put_limit and btl_ugni_fma_get_limit.
- Change how post descriptors are handled. They are no longer
allocated seperately from the RDMA endpoints.
- Some cleanup to move error code out of the critical path.
- Disable the FMA sharing flag on the CDM when we detect that there
should be enough FMA descriptors for the number of virtual devices
we plan will create. If the user sets this flag we will not unset
it. This change should improve the small-message RMA performance by
~ 10%.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new btl for one-sided and two-sided. This btl
uses the uct layer in OpenUCX. This btl makes use of multiple uct
contexts and per-thread device pinning to provide good performance
when using threads and osc/rdma. This btl has been tested extensively
with osc/rdma and passes all MTT tests on aries and IB hardware.
For now this new component disables itself but can be enabled by
setting the btl_ucx_transports MCA variable with a comma-delimited
list of supported memory domains/transport layers. For example:
--mca btl_uct_memory_domains ib/mlx5_0. The specific transports used
can be selected using --mca btl_uct_transports. The default is to use
any available transport.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The giant size of the TCP proc struct is causing a problem in some
environments (because it is allocated on the stack), and it was too
big, anyway.
Instead, use a hash map. That way, it starts small and can grow if it
needs to. It also makes no assumptions about the values of the kernel
interface indexes.
Fixes#5292.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Since the new binding option is tied to the --cpu-list orterun CLI
option, make the --bind-to option reflect the same name (vs. the
--cpu-set CLI option, which is entirely different). For example:
mpirun --bind-to cpu-list:ordered ...
Note that "--bind-to cpulist:ordered" is accepted as a synonym,
because people will be lazy.
Also add some minor updates to the orterun.1in man page for
clarification.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Cover all data types for OPAL-to-PMIx conversion, generating error logs when we hit something we don't support
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
In the opal list parsing behavior paths should be separated by ':' while files are separated by ','. In the opal and pmix code (the pmix fix is in a separate commit) there was a mistake in the parsing such that files were being separated by ':' when they should be separated by ','s. This commit attempts to address this mismatch.
Signed-off-by: Noah Evans <noah.evans@gmail.com>
This commit add support for scalable endpoint to enhance multithreaded
application performance. The BTL will detect the support from ofi
provider and will fallback to normal usage of scalable endpoint is not
supported.
NEW MCA parameters:
- mca_btl_ofi_disable_sep: force the btl to not use scalable endpoint.
- mca_btl_ofi_num_contexts_per_module: number of communication context
to create (should be the same as number of thread).
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This code is the implementation of Software-base Performance Counters as described in the paper 'Using Software-Base Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.
All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example.
Signed-off-by: David Eberius <deberius@vols.utk.edu>
FI_MR_UNSPEC is not supposed to be used beyond ofi version 1.5. This
commit replaces FI_MR_UNSPEC with the new FI_MR_BASIC mode bits
(FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR).
The btl functionality remains the same.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
OFI sockets provider will sometime return FI_EINTR. Apparently this flag
does not exist in OFI version 1.5. This commit added an ifdef check.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit adds patcher support of aarch64. The current
implementation uses mov, movk, and br to perform the jump. This uses 5
instructions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes problem where users have libfabric version < 1.5
installed and build failed because the new MR modebits does not exist.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>