Commit graph

584 commits

Author SHA1 Message Date
Yossi Itigin
9d1994b906 OSC/UCX: Fix deadlock with atomic lock
Atomic lock must progress local worker while obtaining the remote lock,
otherwise an active message which actually releases the lock might not
be processed while polling on local memory location.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2019-05-19 20:10:09 +03:00
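Note on 9d1994b906: the deadlock described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions. `lock_word`, `pending_release_msgs`, `progress_local_worker` and `acquire_lock` are hypothetical names that do not come from OSC/UCX, and the "active message" is modelled as a simple counter. The point it shows is that the acquire loop has to drive progress between attempts, because the handler that frees the lock only runs inside the progress call.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins: none of these names come from OSC/UCX. */
static atomic_int lock_word = 1;      /* 1 = held, 0 = free */
static int pending_release_msgs = 1;  /* one queued "release" active message */

/* Process one queued message, as a worker progress call would. */
static void progress_local_worker(void)
{
    if (pending_release_msgs > 0) {
        pending_release_msgs--;
        atomic_store(&lock_word, 0);  /* the handler releases the lock */
    }
}

/* Try to take the lock; drive progress after every failed attempt. */
static bool acquire_lock(int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
            return true;
        }
        /* Without this call the queued release is never handled
         * and the loop can only spin until it gives up. */
        progress_local_worker();
    }
    return false;
}
```

In this model, removing the `progress_local_worker()` call makes `acquire_lock` fail for any attempt count, which mirrors the hang the commit message describes.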
Nathan Hjelm
4345308dfd osc/rdma: fix CAS 32-bit network atomic compatibility check
When checking for btl compatibility with 32-bit CAS osc/rdma was
checking the incorrect flag field.

Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>
2019-05-10 07:27:53 -06:00
Mikhail Brinskii
2ef5bd8b36 SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h
The new routine transfers the data asynchronously from the source PE to all
PEs in the OpenSHMEM job. The routine returns immediately. The source and
target buffers are reusable only after the completion of the routine.
After the data is transferred to the target buffers, the counter object
is updated atomically. The counter object can be read either using atomic
operations such as shmem_atomic_fetch or can use point-to-point synchronization
routines such as shmem_wait_until and shmem_test.

Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
2019-04-26 14:47:58 +03:00
Valentin Petrov
30970bdfdf OSC/UCX: correctly handle NULL origin addr and MPI_NO_OP
Additional bugfix: origin_addr -> result_addr for no_op, replace_op
    and sum_op fetch destination.

Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2019-04-17 10:30:21 +03:00
Artem Polyakov
bfff5783f9
Merge pull request #6371 from artpol84/osc/select_dbg
osc/base: Add debug output stating a selected component
2019-03-22 22:24:04 -07:00
Nathan Hjelm
73085e9ce3
Merge pull request #6413 from nuriallv/issue_osc_rdma
osc/rdma: fix when determining the node with the rank_array info for a peer
2019-02-27 16:30:06 -07:00
Nuria Losada
3cae149262 osc/rdma: fix when determining the node with the rank_array info for a peer
Signed-off-by: Nuria Losada <nlosada@icl.utk.edu>
2019-02-20 13:12:00 -05:00
Artem Polyakov
13a8e42108
Merge pull request #6163 from artpol84/osc/mt_submission
Refactoring of osc/ucx component for MT
2019-02-20 09:41:27 -08:00
Gilles Gouaillardet
fe05fcc11a osc/rdma: correctly handle communications to self
mark the "self" peer OMPI_OSC_RDMA_PEER_LOCAL_BASE when
the window is dynamically created and use_cpu_atomics is set
in order to correctly handle communications to self.

Thanks Bart Janssens for reporting this issue.

Refs. open-mpi/ompi#6394

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-20 09:52:17 +09:00
Artem Polyakov
19e2ae2efb opal/common/ucx: Switch to opal/tsd
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-19 14:22:07 -08:00
Artem Polyakov
7984d7d997 opal/common/ucx: Remove unused debugging macro
Will be reintroduced later if needed and after adaptation to the OMPI
infrastructure.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-19 14:22:07 -08:00
Artem Polyakov
43f16d8796 opal/common/ucx: Remove common_ucx_int.h
Place the content of common_ucx_int.h back to the common_ucx.h and
include common_ucx_wpool.h explicitly.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-19 14:22:07 -08:00
Xin Zhao
c6de09940f ompi/osc/ucx: Switch osc/ucx code to use Worker Pool.
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-19 14:22:07 -08:00
Artem Polyakov
35090b69f1 osc/base: Add debug output stating a selected component
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-02-07 15:54:20 -08:00
Yossi Itigin
83cca9d52a ucx: add owner.txt for components
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-12-01 17:14:03 +02:00
Nathan Hjelm
5ebcbe444e
Merge pull request #6083 from devreal/rdma-plug-memleak
Plug two memory leaks in rdma osc
2018-11-27 09:56:19 -07:00
Sergey Oblomov
2d230b3aac OSC/UCX: set max level value to 60
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-11-27 14:20:28 +02:00
Yossi Itigin
ed967d867b
Merge pull request #6073 from hoopoepg/topic/set-osc-ucx-level-200
OSC: set UCX module used by default
2018-11-22 10:53:37 +02:00
Joseph Schuchart
91885f5876 Plug two memory leaks in rdma osc
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2018-11-14 14:31:54 -05:00
Sergey Oblomov
e91f214982 OSC/UCX: added UCX version evaluation
- added UCX version evaluation to set OSC UCX priority

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-11-14 10:03:13 +02:00
KAWASHIMA Takahiro
cacd6f389c datatype: Remove #if HAVE_[TYPE] for C99 types
Now Open MPI requires a C99 compiler. Checking availability of
the following types is no longer needed.

- `long long` (`signed` and `unsigned`)
- `long double`
- `float _Complex`
- `double _Complex`
- `long double _Complex`

Furthermore, the `#if HAVE_[TYPE]` style checking is not correct.
Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`.
`AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h`
if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]`
(instead of defining as `0`) if it is not available. So even if we
need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`.

I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac`
since someone may use `HAVE_[TYPE]` macros somewhere.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2018-11-14 09:32:52 +09:00
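Note on cacd6f389c: the `defined()` style recommended in the message above can be sketched as follows. The two macros are set up by hand here to imitate what the message says `AC_CHECK_TYPES` produces (a `1` for an available type, no definition at all otherwise). `HAVE_HYPOTHETICAL_TYPE` is an invented name used only for illustration.

```c
/* Mimic the config header described in the commit message: an available
 * type gets "#define HAVE_X 1", an unavailable one gets no define at all.
 * Both macro names are used here purely for illustration. */
#define HAVE_LONG_LONG 1
/* HAVE_HYPOTHETICAL_TYPE is intentionally left undefined. */

#if defined(HAVE_LONG_LONG)
static const int have_long_long = 1;
#else
static const int have_long_long = 0;
#endif

#if defined(HAVE_HYPOTHETICAL_TYPE)
static const int have_hypothetical_type = 1;
#else
static const int have_hypothetical_type = 0;
#endif
```

Testing with `defined()` states the intent directly: the question is whether the macro exists, not what value it expands to.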
Sergey Oblomov
36934a8bb2 OSC: set UCX module used by default
- OSC/UCX module set priority to 200 to be used by default

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-11-12 15:08:22 +02:00
Joseph Schuchart
a193ae26bf Fix regression introduced earlier by re-adding a barrier after shared memory has been registered
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2018-10-24 15:54:19 -04:00
Sergey Oblomov
1099d5f023 COMMON/UCX: added error code to log output
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-21 11:37:25 +03:00
Sergey Oblomov
df765595e3 COMMON/UCX: suppressed coverity warnings
- suppressed coverity warnings
- added log messages on failed calls

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-17 16:11:03 +03:00
Joseph Schuchart
d9dcdfdfba RDMA OSC: initialize segment memory before registering the segment
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2018-10-16 16:12:14 -04:00
Yossi Itigin
b8e1af6fcb osc_ucx: add worker flush before osc module free
Make sure all pending communications are done on all ranks before
closing the window. This way it will be safe to close the endpoints when
closing the component.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 20:47:16 +03:00
Yossi Itigin
bcc48515e4 Revert "osc_ucx: fix hang/timeout in component finalize"
This reverts commit 438d13b4ca1e7333b789ca3fb536fda17b0feb38.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 20:47:13 +03:00
Yossi Itigin
a012ee91d8
Merge pull request #5886 from yosefe/topic/osc-ucx-fix-finalize-hang
osc_ucx: fix hang/timeout in component finalize
2018-10-10 16:29:29 +03:00
Yossi Itigin
dc6809495d osc_ucx: fix hang/timeout in component finalize
Add barrier to make sure all endpoints are destroyed before destroying
the worker.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 14:38:06 +03:00
Sergey Oblomov
ae6f81983f OSC/UCX: fixed zero-size window processing
- added processing of zero-size MPI window

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-10 13:08:01 +03:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
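Note on e9e4d2a4bc: a minimal sketch of the wrapper idea described above, assuming a glibc-style `vasprintf` is available. `safe_asprintf` is a hypothetical name and this is not the actual `opal_asprintf` implementation; it only shows how a wrapper can guarantee that the output pointer is NULL whenever the call fails.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* vasprintf is an extension on glibc */
#endif
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper, not Open MPI's opal_asprintf implementation:
 * on failure it always leaves *out set to NULL. */
static int safe_asprintf(char **out, const char *fmt, ...)
{
    assert(out != NULL && fmt != NULL);

    va_list ap;
    va_start(ap, fmt);
    int len = vasprintf(out, fmt, ap);
    va_end(ap);

    if (len < 0) {
        *out = NULL;  /* the library may have left *out indeterminate */
    }
    return len;
}
```

Callers can then test either the return value or the pointer, and calling `free()` on the result is safe on both the success and the failure path.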
Nathan Hjelm
000f9eed4d opal: add types for atomic variables
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:48:55 -06:00
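Note on 000f9eed4d: the move from `volatile` variables to dedicated atomic types can be pictured with plain C11 atomics. This is a sketch only. `my_atomic_int32_t`, `ref_count` and `ref_count_add` are hypothetical names, not the opal types the commit introduces, and a C11 compiler with `<stdatomic.h>` is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical typedef in the spirit of a project-specific atomic type;
 * the name is not taken from the Open MPI sources. */
typedef _Atomic int32_t my_atomic_int32_t;

static my_atomic_int32_t ref_count = 0;

/* Atomically add delta and return the new value. */
static int32_t ref_count_add(int32_t delta)
{
    return atomic_fetch_add(&ref_count, delta) + delta;
}
```

Routing every atomic variable through one typedef means the underlying representation can change later (for example to `_Atomic`) without touching each declaration.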
Nathan Hjelm
7fdf887937 osc/portal: use c99 subobject naming to initialize module
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-29 10:34:24 -06:00
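Note on 7fdf887937: "C99 subobject naming" refers to designated initializers. The sketch below uses an invented `demo_module_t` table with illustrative field names; it is not the osc/portals module structure.

```c
#include <stddef.h>

/* Hypothetical module table; field names are illustrative only. */
typedef struct {
    int (*put)(int value);
    int (*get)(void);
    int (*flush)(void);
} demo_module_t;

static int stored;
static int demo_put(int value) { stored = value; return 0; }
static int demo_get(void) { return stored; }

/* Members are named explicitly; any member not listed (flush) is
 * zero-initialized, so it ends up as a null function pointer. */
static const demo_module_t demo_module = {
    .put = demo_put,
    .get = demo_get,
};
```

Compared with positional initialization, the named form keeps working when fields are reordered or new ones are added to the struct.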
Yossi Itigin
4bb6845888
Merge pull request #5570 from hoopoepg/topic/opal-mem-hooks-syno
MCA/COMMON/UCX: added synonym to opal_mem_hook variable
2018-08-29 14:16:33 +03:00
Nathan Hjelm
2221720fc7
Merge pull request #5591 from hjelmn/osc_rdma_cleanup
osc/rdma: clean out stale aggregation code
2018-08-27 10:05:50 -06:00
Sergey Oblomov
b72dd83f05 MCA/COMMON/UCX: added synonyms for common ucx variables
- added synonyms for atomic/osc modules

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-26 18:25:21 +03:00
Nathan Hjelm
d0cd80e902 osc/rdma: clean out stale aggregation code
The aggregation code in osc/rdma is currently broken and will likely
not be reused. This commit cleans it out.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-23 15:40:21 -06:00
Nathan Hjelm
29320872b3 osc/rdma: quiet warning
gcc complains about ret possibly being used uninitialized. That will
never happen but we should still quiet the warning. This commit sets
ret to a valid value.

Fixes #5513

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-21 15:54:53 -06:00
Nathan Hjelm
438c40de03 osc/pt2pt: use c99 for module initialization
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-21 11:23:33 -06:00
Sergey Oblomov
fa33e322e7 OSC/UCX: code deduplication
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-19 12:39:15 +03:00
Sergey Oblomov
6f0a7a2005 OSC/UCX: opal progress register/unregister optimization
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-19 12:07:26 +03:00
Sergey Oblomov
55b934bacf OSC/UCX: enable progress when at least one window is allocated
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-18 17:52:30 +03:00
Sergey Oblomov
a081fba046 OSC/UCX: fixed hang on OSC init
- worker progress was missing on startup, which caused a hang
  on one of the ranks

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-18 17:01:53 +03:00
Joshua Ladd
b12868239c
Merge pull request #4765 from xinzhao3/topic/osc-ucx-mem-hook
OMPI/OSC/UCX: move memory hooks init in osc to win creation.
2018-07-13 09:36:20 -04:00
Xin Zhao
74ef51af1b OMPI/OSC/UCX: move memory hooks init in osc to win creation.
Move memory hooks init (for request based operation) in osc ucx to window
creation time, to avoid performance issue in MPI initialization.

Signed-off-by: Xin Zhao <xinz@mellanox.com>
2018-07-12 15:03:02 -07:00
Nathan Hjelm
304a6a52d4 osc/rdma: use local base for local process when possible
This commit fixes a crash that occurs when using btl/vader as an RDMA
btl. This btl supports using CPU atomics and does not support using
the btl for self communication so we must use the local memory
optimizations in osc/rdma.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-12 15:50:50 -06:00
Nathan Hjelm
35a75a6bf5 osc/sm: avoid filename collision when multiple windows share same CID
This commit fixes an issue identified by MTT where we can have two
different sets of processes on the same node creating a shared memory
window with communicators sharing the same CID. To avoid this issue
the temporary filename now includes the creating process's vpid.

References #5363

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-11 14:32:27 -06:00
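Note on 35a75a6bf5: the collision fix can be sketched as adding the creator's vpid to the name of the backing file. The format string, the `osc_sm` prefix and all parameter names below are assumptions made for illustration; the actual layout used by osc/sm is not shown in this log.

```c
#include <stdio.h>

/* Build a backing-file name. The format string and parameter names are
 * assumptions for illustration, not the layout used by osc/sm. */
static int make_window_filename(char *buf, size_t len, const char *dir,
                                unsigned cid, unsigned creator_vpid)
{
    return snprintf(buf, len, "%s/osc_sm.%u.%u", dir, cid, creator_vpid);
}
```

Two process groups on the same node that happen to share a CID then produce different names as long as their creating processes differ.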
Nathan Hjelm
037656bc1d osc/rdma: fix bug introduced in b90c838
This commit fixes a bug that was introduced back in 2016 which
impacts request-based RMA in some cases.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-10 18:17:55 -06:00
Yossi Itigin
e77e31b50b
Merge pull request #5378 from hoopoepg/topic/unify-ucx-logging
MCA/COMMON/UCX: unified logging across all UCX modules
2018-07-08 12:45:26 +03:00