1
1

596 Коммитов

Автор SHA1 Сообщение Дата
Austen Lauria
ecd990a67c
Merge pull request #6933 from devreal/osc-ucx-excl-lock
UCX osc: properly release exclusive lock to avoid lockup
2019-10-25 09:16:51 -04:00
Nathan Hjelm
c4d0752036
Merge pull request #6803 from guserav/fix-osc-sm-post-32-bit-atomics
Fix osc sm posts when only 32 bit atomics support
2019-08-27 18:23:48 -07:00
Joseph Schuchart
a5cc380416 UCX osc: properly release exclusive lock to avoid lockup
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-08-27 23:01:35 +02:00
Tomislav Janjusic
d5f6b088ae osc/ucx: Fix error path
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2019-08-12 21:54:01 +03:00
Nysal Jan K.A
3529d44702 osc/ucx: Fix data corruption with non-contiguous accumulates
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
2019-07-24 13:07:59 +05:30
Nysal Jan K.A
14808922cf osc/ucx: Add support for the no_locks info key
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
2019-07-18 17:29:01 +05:30
guserav
3c9f4e6823 Fix osc sm posts when only 32 bit atomics support
Signed-off-by: guserav <erik.zeiske@hpe.com>
2019-07-09 15:13:25 -07:00
Geoff Paulsen
f1b2a09675
Merge pull request #6649 from devreal/rdma-fetchop-local
OSC rdma: make sure accumulating in shared memory is safe
2019-06-28 14:46:21 -05:00
Artem Polyakov
6678ac0f55 osc/ucx: Fix possible win creation/destruction race condition
To avoid fully initializing the osc/ucx component for MPI application
that are not using One-Sided functionality, the initialization happens
at the first MPI window creation.

This commit ensures atomicity of global state modifications.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-06-20 09:05:03 -07:00
Artem Polyakov
0857742624 osc/ucx: Fix worker pool finalization
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-06-20 09:05:03 -07:00
Joseph Schuchart
8f27cc26d9 OSC rdma win allocate: synchronize error codes across shared memory group
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-06-07 11:03:21 +02:00
Yossi Itigin
9d1994b906 OSC/UCX: Fix deadlock with atomic lock
Atomic lock must progress local worker while obtaining the remote lock,
otherwise an active message which actually releases the lock might not
be processed while polling on local memory location.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2019-05-19 20:10:09 +03:00
Joseph Schuchart
c67e229193 OSC rdma: make sure accumulating in shared memory is safe
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-05-10 17:27:10 +02:00
Nathan Hjelm
4345308dfd osc/rdma: fix CAS 32-bit network atomic compatibility check
When checking for btl compatibility with 32-bit CAS osc/rdma was
checking the incorrect flag field.

Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>
2019-05-10 07:27:53 -06:00
Mikhail Brinskii
2ef5bd8b36 SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h
The new routine transfers the data asynchronously from the source PE to all
PEs in the OpenSHMEM job. The routine returns immediately. The source and
target buffers are reusable only after the completion of the routine.
After the data is transferred to the target buffers, the counter object
is updated atomically. The counter object can be read either using atomic
operations such as shmem_atomic_fetch or can use point-to-point synchronization
routines such as shmem_wait_until and shmem_test.

Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
2019-04-26 14:47:58 +03:00
Valentin Petrov
30970bdfdf OSC/UCX: correctly handle NULL origin addr and MPI_NO_OP
Addtional bugfix: origin_addr -> result_addr for no_op, replace_op
    and sum_op fetch destination.

Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2019-04-17 10:30:21 +03:00
Artem Polyakov
bfff5783f9
Merge pull request #6371 from artpol84/osc/select_dbg
osc/base: Add debug output stating a selected component
2019-03-22 22:24:04 -07:00
Nathan Hjelm
73085e9ce3
Merge pull request #6413 from nuriallv/issue_osc_rdma
osc/rdma: fix when determining the node with the rank_array info for a peer
2019-02-27 16:30:06 -07:00
Nuria Losada
3cae149262 osc/rdma: fix when determining the node with the rank_array info for a peer
Signed-off-by: Nuria Losada <nlosada@icl.utk.edu>
2019-02-20 13:12:00 -05:00
Artem Polyakov
13a8e42108
Merge pull request #6163 from artpol84/osc/mt_submission
Refactoring of osc/ucx component for MT
2019-02-20 09:41:27 -08:00
Gilles Gouaillardet
fe05fcc11a osc/rdma: correctly handle communications to self
mark the "self" peer OMPI_OSC_RDMA_PEER_LOCAL_BASE when
the window is dynamically created and use_cpu_atomics is set
in order to correctly handle communications to self.

Thanks Bart Janssens for reporting this issue.

Refs. open-mpi/ompi#6394

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-20 09:52:17 +09:00
Artem Polyakov
19e2ae2efb opal/common/ucx: Switch to opal/tsd
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-19 14:22:07 -08:00
Artem Polyakov
7984d7d997 opal/common/ucx: Remove unused debugging macro
Will be reintroduced later if needed and after adaptation to the OMPI
infrastructure.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-19 14:22:07 -08:00
Artem Polyakov
43f16d8796 opal/common/ucx: Remove common_ucx_int.h
Place the content of common_ucx_int.h back to the common_ucx.h and
include common_ucx_wpool.h explicitly.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-19 14:22:07 -08:00
Xin Zhao
c6de09940f ompi/osc/ucx: Switch osc/ucx code to use Worker Pool.
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-19 14:22:07 -08:00
Artem Polyakov
35090b69f1 osc/base: Add debug output stating a selected component
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-07 15:54:20 -08:00
Yossi Itigin
83cca9d52a ucx: add owner.txt for components
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-12-01 17:14:03 +02:00
Nathan Hjelm
5ebcbe444e
Merge pull request #6083 from devreal/rdma-plug-memleak
Plug two memory leaks in rdma osc
2018-11-27 09:56:19 -07:00
Sergey Oblomov
2d230b3aac OSC/UCX: set max level value to 60
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-11-27 14:20:28 +02:00
Yossi Itigin
ed967d867b
Merge pull request #6073 from hoopoepg/topic/set-osc-ucx-level-200
OSC: set UCX module used by default
2018-11-22 10:53:37 +02:00
Joseph Schuchart
91885f5876 Plug two memory leaks in rdma osc
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2018-11-14 14:31:54 -05:00
Sergey Oblomov
e91f214982 OSC/UCX: added UCX version evaluation
- added UCX version evaluation to set OSC UCX priority

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-11-14 10:03:13 +02:00
KAWASHIMA Takahiro
cacd6f389c datatype: Remove #if HAVE_[TYPE] for C99 types
Now Open MPI requires a C99 compiler. Checking availability of
the following types is no more needed.

- `long long` (`signed` and `unsigned`)
- `long double`
- `float _Complex`
- `double _Complex`
- `long double _Complex`

Furthermore, the `#if HAVE_[TYPE]` style checking is not correct.
Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`.
`AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h`
if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]`
(instead of defining as `0`) if it is not available. So even if we
need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`.

I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac`
since someone may use `HAVE_[TYPE]` macros somewhere.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2018-11-14 09:32:52 +09:00
Sergey Oblomov
36934a8bb2 OSC: set UCX module used by default
- OSC/UCX module set priority to 200 to be used by default

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-11-12 15:08:22 +02:00
Joseph Schuchart
a193ae26bf Fix regression introduced earlier by re-adding a barrier after shared memory has been registered
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2018-10-24 15:54:19 -04:00
Sergey Oblomov
1099d5f023 COMMON/UCX: added error code to log output
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-21 11:37:25 +03:00
Sergey Oblomov
df765595e3 COMMON/UCX: suppressed coverity warnings
- suppressed coverity warnings - added log messages on failed calls

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-17 16:11:03 +03:00
Joseph Schuchart
d9dcdfdfba RDMA OSC: initialize segment memory before registering the segment
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2018-10-16 16:12:14 -04:00
Yossi Itigin
b8e1af6fcb osc_ucx: add worker flush before osc module free
Make sure all pending communications are done on all ranks before
closing the window. This way it will be safe to close the endpoints when
closing the component.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 20:47:16 +03:00
Yossi Itigin
bcc48515e4 Revert "osc_ucx: fix hang/timeout in component finalize"
This reverts commit 438d13b4ca1e7333b789ca3fb536fda17b0feb38.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 20:47:13 +03:00
Yossi Itigin
a012ee91d8
Merge pull request #5886 from yosefe/topic/osc-ucx-fix-finalize-hang
osc_ucx: fix hang/timeout in component finalize
2018-10-10 16:29:29 +03:00
Yossi Itigin
dc6809495d osc_ucx: fix hang/timeout in component finalize
Add barrier to make sure all endpoints are destroyed before destroying
the worker.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 14:38:06 +03:00
Sergey Oblomov
ae6f81983f OSC/UCX: fixed zero-size window processing
- added processing of zero-size MPI window

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-10 13:08:01 +03:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Nathan Hjelm
000f9eed4d opal: add types for atomic variables
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:48:55 -06:00
Nathan Hjelm
7fdf887937 osc/portal: use c99 subobject naming to initialize module
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-29 10:34:24 -06:00
Yossi Itigin
4bb6845888
Merge pull request #5570 from hoopoepg/topic/opal-mem-hooks-syno
MCA/COMMON/UCX: added synonym to opal_mem_hook variable
2018-08-29 14:16:33 +03:00
Nathan Hjelm
2221720fc7
Merge pull request #5591 from hjelmn/osc_rdma_cleanup
osc/rdma: clean out stale aggregation code
2018-08-27 10:05:50 -06:00
Sergey Oblomov
b72dd83f05 MCA/COMMON/UCX: added synonims for common ucx variables
- added synonims for atomic/osc modules

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-26 18:25:21 +03:00
Nathan Hjelm
d0cd80e902 osc/rdma: clean out stale aggregation code
The aggregation code in osc/rdma is currently broken and will likely
not be reused. This commit cleans it out.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-23 15:40:21 -06:00