
7 Commits

Author SHA1 Message Date
Nathan Hjelm
b5428aaf71 btl/uct: add support for UCX 1.6.x
This commit updates the uct btl to support the v1.6.x release of
UCX, which breaks API compatibility with earlier releases.

Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>
(cherry picked from commit b78066720c3e3299bd76f2e22d2c0e415db572fc)
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2019-06-07 15:54:47 -05:00
Nathan Hjelm
0957861689 btl/uct: fix some issues when using UCX over ugni
Though it is not a recommended configuration, it is possible to use
Open MPI over UCX over uGNI. This configuration had some issues related
to connection management and TL selection. This commit fixes those
issues.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit e07a64c52d92adf51732ea78e17b679f6deffa12)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-06 10:57:59 -07:00
Nathan Hjelm
e6f84e79de btl/uct: fix deadlock in connection code
This commit fixes a deadlock that can occur when using a TL that
supports the connect-to-endpoint model. The deadlock occurred while
processing incoming connection requests, which was done from an
active-message callback. For reasons not yet understood, this callback
sometimes hung. To avoid the issue, the connection active-message is
now saved for later processing.
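
The deferral described above can be sketched roughly as follows. This is an illustration only, not the btl/uct code: conn_msg_t, pending_queue_t, am_callback_save, process_pending, and count_bytes are hypothetical names, and the sketch assumes single-threaded access (a real implementation would need a lock or a lock-free queue).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical saved connection message (not the btl/uct type). */
typedef struct conn_msg {
    struct conn_msg *next;
    size_t len;
    unsigned char payload[];   /* private copy of the incoming data */
} conn_msg_t;

typedef struct {
    conn_msg_t *head;
    conn_msg_t *tail;
} pending_queue_t;

/* Called from the active-message callback: copy the message and return
 * right away instead of handling the connection request in place. */
int am_callback_save(pending_queue_t *q, const void *data, size_t len)
{
    assert(NULL != q && NULL != data);
    conn_msg_t *msg = malloc(sizeof(*msg) + len);
    if (NULL == msg) {
        return -1;
    }
    msg->next = NULL;
    msg->len = len;
    memcpy(msg->payload, data, len);
    if (NULL != q->tail) {
        q->tail->next = msg;
    } else {
        q->head = msg;
    }
    q->tail = msg;
    return 0;
}

/* Called later from the normal progress loop, outside any callback.
 * Returns the number of messages handled. */
int process_pending(pending_queue_t *q,
                    void (*handle)(const void *data, size_t len))
{
    int count = 0;
    while (NULL != q->head) {
        conn_msg_t *msg = q->head;
        q->head = msg->next;
        if (NULL == q->head) {
            q->tail = NULL;
        }
        handle(msg->payload, msg->len);
        free(msg);
        ++count;
    }
    return count;
}

/* Example handler for illustration: only tallies the bytes it sees. */
size_t sketch_bytes_handled = 0;

void count_bytes(const void *data, size_t len)
{
    (void) data;
    sketch_bytes_handled += len;
}
```

The point of the pattern is that the callback does a bounded amount of work (one allocation and one copy), while anything that may block or re-enter the transport runs from process_pending.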

At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.

This commit also fixes some bugs in the active-message send path:

 - Correctly set all fragment fields in prepare_src.

 - Fix a bug when using buffered send. The return code, which is a
   byte count, was not being interpreted correctly. This resulted in
   a message being sent multiple times.

 - Don't try to progress sends from the btl_send function when in an
   active-message callback. It could lead to deep recursion and an
   eventual crash if we get a trace like
   send->progress->am_complete->ob1_callback->send->am_complete...
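
The last two fixes in the list above can be sketched as follows. All names here (sketch_module_t, fake_bcopy_send, sketch_progress, sketch_send) are hypothetical stand-ins, not the btl/uct or UCX API. The sketch assumes that the buffered send returns the number of bytes packed on success and a negative value on failure, and that the send function progresses the transport only when it could not post the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical module state, for illustration only. */
typedef struct {
    bool in_am_callback;  /* set while an active-message callback runs */
    bool no_resources;    /* simulates a transport that cannot accept sends */
    int  progress_calls;  /* how often the progress engine was entered */
    int  sends_posted;    /* how many messages reached the transport */
} sketch_module_t;

/* Stand-in for a buffered send: returns the number of bytes packed on
 * success and a negative value on failure (assumed convention). */
long fake_bcopy_send(sketch_module_t *m, size_t len)
{
    if (m->no_resources) {
        return -1;
    }
    m->sends_posted++;
    return (long) len;
}

/* Stand-in for the progress engine. */
void sketch_progress(sketch_module_t *m)
{
    m->progress_calls++;
}

/* Returns 1 if the message was sent, 0 if the caller should queue it. */
int sketch_send(sketch_module_t *m, size_t len)
{
    assert(NULL != m);
    long rc = fake_bcopy_send(m, len);
    if (rc >= 0) {
        /* Any non-negative value is a byte count and means success.
         * Comparing it against a zero-valued "OK" status would treat a
         * successful send as a failure and send the message again. */
        return 1;
    }
    /* Only progress when not already inside an active-message callback;
     * otherwise send -> progress -> callback -> send can recurse. */
    if (!m->in_am_callback) {
        sketch_progress(m);
    }
    return 0;
}
```

With this guard, a send issued from inside a callback simply reports that the message must be queued, and the pending work is progressed later from the top-level progress loop.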

Closes #5820
Closes #5821

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 707d35deeb62a93ea8a3806d07e07e3a96c51d19)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-16 19:16:11 -06:00
Nathan Hjelm
0c4ba45af2 btl/uct: use the correct tl interface attributes
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.
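
As a rough illustration of the layout change, the sketch below keeps the interface limits inside each context rather than once per TL. The type and function names (sketch_iface_attr_t, sketch_context_t, sketch_tl_t, sketch_can_put_short) and the fixed context limit are assumptions made for the example, not the actual btl/uct structures.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CONTEXTS 4   /* arbitrary limit for the example */

/* Hypothetical subset of the interface attributes. */
typedef struct {
    size_t max_short_put;
} sketch_iface_attr_t;

/* Each context carries the limits queried for its own interface. */
typedef struct {
    sketch_iface_attr_t attr;
} sketch_context_t;

/* The TL no longer holds a single shared copy of the attributes. */
typedef struct {
    int context_count;
    sketch_context_t contexts[SKETCH_MAX_CONTEXTS];
} sketch_tl_t;

/* Decide whether a put of len bytes fits the short protocol on the
 * given context. Returns 1 if it fits, 0 otherwise. */
int sketch_can_put_short(const sketch_tl_t *tl, int ctx, size_t len)
{
    assert(NULL != tl && ctx >= 0 && ctx < tl->context_count);
    return len <= tl->contexts[ctx].attr.max_short_put;
}
```

Two contexts of the same TL can then report different answers for the same message size, which is the situation the commit message describes.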

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 6ed68da870c391d88575dc027a3de4826a77f57e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-11 11:34:33 -06:00
Nathan Hjelm
b6bd3d33f1 btl/uct: fix compile warnings/errors
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 47ed8e8830749b6b59c84592c15b7576ea164f0c)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-23 14:05:17 -06:00
Nathan Hjelm
6c089518e7 btl/uct: make uct endpoints array a flexible array member
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-25 18:14:58 -06:00
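
For readers unfamiliar with the C99 feature named in the commit subject above, a minimal sketch of a flexible array member follows. The struct and function names (sketch_endpoint_t, sketch_endpoint_new) are hypothetical and the element type is a placeholder; the real endpoint structure in btl/uct differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical endpoint holding one handle per context. */
typedef struct {
    int   count;
    void *eps[];   /* flexible array member: storage follows the struct */
} sketch_endpoint_t;

/* Allocate the header and the array in a single block. */
sketch_endpoint_t *sketch_endpoint_new(int count)
{
    assert(count >= 0);
    sketch_endpoint_t *ep =
        malloc(sizeof(*ep) + (size_t) count * sizeof(ep->eps[0]));
    if (NULL == ep) {
        return NULL;
    }
    ep->count = count;
    for (int i = 0; i < count; ++i) {
        ep->eps[i] = NULL;   /* no handle created yet */
    }
    return ep;
}
```

Compared with a separately allocated array, this needs one allocation and one free, and the array sits next to the header in memory.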
Nathan Hjelm
c5c5b42307 btl: add a new btl for the UCT layer in OpenUCX
This commit adds a new btl for one-sided and two-sided communication.
This btl uses the uct layer in OpenUCX. It makes use of multiple uct
contexts and per-thread device pinning to provide good performance
when using threads and osc/rdma. This btl has been tested extensively
with osc/rdma and passes all MTT tests on aries and IB hardware.

For now this new component is disabled by default but can be enabled
by setting the btl_uct_memory_domains MCA variable to a
comma-delimited list of supported memory domains/transport layers.
For example: --mca btl_uct_memory_domains ib/mlx5_0. The specific
transports used can be selected using --mca btl_uct_transports. The
default is to use any available transport.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-25 18:14:58 -06:00