This commit improves the injection rate and latency for RDMA
operations. This is done by the following improvements:
- If C11's _Thread_local keyword is available then always use the
same virtual device index for the same thread when using RDMA. If
the keyword is not available then attempt to use any device that
isn't already in use. The binding support is enabled by default but
can be disabled via the btl_ugni_bind_devices MCA variable.
- When posting FMA and RDMA operations always attempt to reap
completions after posting the operation. This allows us to
better balance the work of reaping completions across all
application threads.
- Limit the total number of outstanding BTE transactions. This
fixes a performance bug when using many threads.
- Split out RDMA and local SMSG completion queue sizes. The RDMA
queue size is better tuned for performance with RMA-MT.
- Split out put and get FMA limits. The old btl_ugni_fma_limit MCA
variable is deprecated. The new variable names are:
btl_ugni_fma_put_limit and btl_ugni_fma_get_limit.
- Change how post descriptors are handled. They are no longer
allocated seperately from the RDMA endpoints.
- Some cleanup to move error code out of the critical path.
- Disable the FMA sharing flag on the CDM when we detect that there
should be enough FMA descriptors for the number of virtual devices
we plan will create. If the user sets this flag we will not unset
it. This change should improve the small-message RMA performance by
~ 10%.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit makes datagram checks time based and reduces their
frequency when only the wildcard datagram is posted. This change
improves latency on knl systems.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the ugni btl to make use of multiple device
contexts to improve the multi-threaded RMA performance. This commit
contains the following:
- Cleanup the endpoint structure by removing unnecessary field. The
structure now also contains all the fields originally handled by the
common/ugni endpoint.
- Clean up the fragment allocation code to remove the need to
initialize the my_list member of the fragment structure. This
member is not initialized by the free list initializer function.
- Remove the (now unused) common/ugni component. btl/ugni no longer
need the component. common/ugni was originally split out of
btl/ugni to support bcol/ugni. As that component exists there is no
reason to keep this component.
- Create wrappers for the ugni functionality required by
btl/ugni. This was done to ease supporting multiple device
contexts. The wrappers are thread safe and currently use a spin
lock instead of a mutex. This produces better performance when
using multiple threads spread over multiple cores. In the future
this lock may be replaced by another serialization mechanism. The
wrappers are located in a new file: btl_ugni_device.h.
- Remove unnecessary device locking from serial parts of the ugni
btl. This includes the first add-procs and module finalize.
- Clean up fragment wait list code by moving enqueue into common
function.
- Expose the communication domain flags as an MCA variable. The
defaults have been updated to reflect the recommended setting for
knl and haswell.
- Avoid allocating fragments for communication with already
overloaded peers.
- Allocate RDMA endpoints dyncamically. This is needed to support
spreading RMA operations accross multiple contexts.
- Add support for spreading RMA communication over multiple ugni
device contexts. This should greatly improve the threading
performance when communicating with multiple peers. By default the
number of virtual devices depends on 1) whether
opal_using_threads() is set, 2) how many local processes are in the
job, and 3) how many bits are available in the pid. The last is
used to ensure that each CDM is created with a unique id.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit reduces the overhead of calling the ugni progress
function. It does the following:
- Check for new connections once every eight calls.
- Do not call remote smsg progress unless we are connected to at
least one remote peer.
- Do not call rdma progress unless at least one rdma fragment is
outstanding.
- Check endpoint wait list size before obtaining a lock.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit rewrites both the mpool and rcache frameworks. Summary of
changes:
- Before this change a significant portion of the rcache
functionality lived in mpool components. This meant that it was
impossible to add a new memory pool to use with rdma networks
(ugni, openib, etc) without duplicating the functionality of an
existing mpool component. All the registration functionality has
been removed from the mpool and placed in the rcache framework.
- All registration cache mpools components (udreg, grdma, gpusm,
rgpusm) have been changed to rcache components. rcaches are
allocated and released in the same way mpool components were.
- It is now valid to pass NULL as the resources argument when
creating an rcache. At this time the gpusm and rgpusm components
support this. All other rcache components require non-NULL
resources.
- A new mpool component has been added: hugepage. This component
supports huge page allocations on linux.
- Memory pools are now allocated using "hints". Each mpool component
is queried with the hints and returns a priority. The current hints
supported are NULL (uses posix_memalign/malloc), page_size=x (huge
page mpool), and mpool=x.
- The sm mpool has been moved to common/sm. This reflects that the sm
mpool is specialized and not meant for any general
allocations. This mpool may be moved back into the mpool framework
if there is any objection.
- The opal_free_list_init arguments have been updated. The unused0
argument is not used to pass in the registration cache module. The
mpool registration flags are now rcache registration flags.
- All components have been updated to make use of the new framework
interfaces.
As this commit makes significant changes to both the mpool and rcache
frameworks both versions have been bumped to 3.0.0.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds an opal hash table to keep track of mapping between
process identifiers and ompi_proc_t's. This hash table is used by the
ompi_proc_by_name() function to lookup (in O(1) time) a given
process. This can be used by a BTL or other component to get a
ompi_proc_t when handling an incoming message from an as yet unknown
peer.
Additionally, this commit adds a new MCA variable to control the new
add_procs behavior: mpi_add_procs_cutoff. If the number of ranks in
the process falls below the threshold a ompi_proc_t is created for
every process. If the number of ranks is above the threshold then a
ompi_proc_t is only created for the local rank. The code needed to
generate additional ompi_proc_t's for a communicator is not yet
complete.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Use of the old ompi_free_list_t and ompi_free_list_item_t is
deprecated. These classes will be removed in a future commit.
This commit updates the entire code base to use opal_free_list_t and
opal_free_list_item_t.
Notes:
OMPI_FREE_LIST_*_MT -> opal_free_list_* (uses opal_using_threads ())
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Replace temporary environment variables with a MCA
parameter for the ugni btl. A user wishing to
use the ugni btl async. progress thread needs to
set the request_progress_thread param to true.
For example, using env. variable format:
export OMPI_MCA_btl_ugni_request_progress_thread=1
Honor enable_mpi_threads setting to enable the ugni btl
async progress thread. If the app doesn't request thread-multiple
the thread will not be created.
The old BTL interface provided support for RDMA through the use of
the btl_prepare_src and btl_prepare_dst functions. These functions were
expected to prepare as much of the user buffer as possible for the RDMA
operation and return a descriptor. The descriptor contained segment
information on the prepared region. The btl user could then pass the
RDMA segment information to a remote peer. Once the peer received that
information it then packed it into a similar descriptor on the other
side that could then be passed into a single btl_put or btl_get
operation.
Changes:
- Removed the btl_prepare_dst function. This reflects the fact that
RDMA operations no longer depend on "prepared" descriptors.
- Removed the btl_seg_size member. There is no need to btl's to
subclass the mca_btl_base_segment_t class anymore.
...
Add more
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
of preregistered buffers
Before this change eager gets we retried on each progress loop. This commit
modifies the protocol to only retry eager gets when another eager get has
completed. This commit also cleans up some callback code that is no longer
needed.
This commit adds initial ugni thread safety support.
With this commit, sun thread tests (excepting MPI-2 RMA)
pass with various process counts and threads/process.
Also osu_latency_mt passes.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.