1
1

544 Коммитов

Автор SHA1 Сообщение Дата
Sergey Oblomov
fa33e322e7 OSC/UCX: code deduplication
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-19 12:39:15 +03:00
Sergey Oblomov
6f0a7a2005 OSC/UCX: opal progress register/unregister optimization
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-19 12:07:26 +03:00
Sergey Oblomov
55b934bacf OSC/UCX: enable progress when at least one window is allocated
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-18 17:52:30 +03:00
Sergey Oblomov
a081fba046 OSC/UCX: fixed hang on OSC init
- there worked progress was missed on startup which caused hang
  on one of ranks

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-18 17:01:53 +03:00
Joshua Ladd
b12868239c
Merge pull request #4765 from xinzhao3/topic/osc-ucx-mem-hook
OMPI/OSC/UCX: move memory hooks init in osc to win creation.
2018-07-13 09:36:20 -04:00
Xin Zhao
74ef51af1b OMPI/OSC/UCX: move memory hooks init in osc to win creation.
Move memory hooks init (for request based operation) in osc ucx to window
creation time, to avoid performance issue in MPI initialization.

Signed-off-by: Xin Zhao <xinz@mellanox.com>
2018-07-12 15:03:02 -07:00
Nathan Hjelm
304a6a52d4 osc/rdma: use local base for local process when possible
This commit fixes a crash that occurs when using btl/vader as an RDMA
btl. This btl supports using CPU atomics and does not support using
the btl for self communication so we must use the local memory
optimizations in osc/rdma.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-12 15:50:50 -06:00
Nathan Hjelm
35a75a6bf5 osc/sm: avoid filename collision when multiple windows share same CID
This commit fixes an issue identified by MTT where we can have two
different sets of processes on the same node creating a shared memory
window with communicators sharing the same CID. To avoid this issue
the temporary filename now includes the creating processes vpid.

References #5363

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-11 14:32:27 -06:00
Nathan Hjelm
037656bc1d osc/rdma: fix bug introduced in b90c838
This commit fixes an bug that was introduced back in 2016 which
impacts request-based RMA in some cases.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-10 18:17:55 -06:00
Yossi Itigin
e77e31b50b
Merge pull request #5378 from hoopoepg/topic/unify-ucx-logging
MCA/COMMON/UCX: unified logging across all UCX modules
2018-07-08 12:45:26 +03:00
Sergey Oblomov
240670152e MCA/COMMON/UCX: code beautify - alignment
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-06 19:40:58 +03:00
Sergey Oblomov
eb7010933d OSC/UCX: suppressed compilation warnings
- suppressed sing/unsign-compare warnings

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-06 10:58:09 +03:00
Sergey Oblomov
bef47b792c MCA/COMMON/UCX: unified logging across all UCX modules
- added common logging infrastructure for all
  UCX modules
- all UCX modules are switched to new infra

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-05 16:25:39 +03:00
Sergey Oblomov
8080283b3d MCA/COMMON/UCX: changed return type for wait_request
- for now wait_request returns OMPI status
- updated callers

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-04 23:29:38 +03:00
Xin Zhao
c1ac0c00c5
Merge pull request #5185 from jjolly/fix-memcpy-size-mismatch
- Build warning: stringop-overflow in get_dynamic_win_info() at osc_ucx_comm.c
2018-06-29 19:37:53 -07:00
Yossi Itigin
aca61a6bfb
Merge pull request #5238 from hoopoepg/topic/fixed-coverity-issues-ucx-pml
UCX/PML: fixed few coverity issues
2018-06-27 11:14:06 +03:00
Nathan Hjelm
4c230683e7 osc/sm: fix a typo
This commit fixes a typo where a bcast is used instead of the intended
collective (barrier).

References #5262

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-26 12:53:12 -06:00
Sergey Oblomov
502d04bf12 UCX/PML/SPML: fixed few coverity issues
- fixed incorrect pointer manipulation/free
- cleaned dead code
- minor optimization on process delete routine
- fixed error handling - free pointers
- added debug output for woker flush failure

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-26 18:52:39 +03:00
Yossi Itigin
ee873f4f79
Merge pull request #5322 from hoopoepg/topic/mca-ucx-common
MCA/UCX: added common module
2018-06-26 13:54:12 +03:00
Joshua Ladd
256ad707f1
Merge pull request #5293 from yosefe/topic/osc-ucx-on-demand-progress
osc_ucx: register progress on-demand
2018-06-25 15:09:11 -04:00
Nathan Hjelm
e4989714c2 osc/rdma: fix data race on teardown
The osc/rdma module did not wait for all pending atomics to complete
before tearing down. This could lead to weird issues as the target
location may no longer be registered or allocated.

This commit also fixes an offset calculation issue in
ompi_osc_get_data_blocking ().

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-25 11:47:34 -06:00
Sergey Oblomov
bf7fd480e9 MCA/COMMON/UCX: added non-blocking implementations of atomics
- added implementation of swap/cswap/fadd operations
- blocking add64 is replaced by non-blocking routine

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-25 12:25:31 +03:00
Sergey Oblomov
63e7ba6843 MCA/COMMON/UCX: added parameter for UCX/opal progress
- added parameter to set UCX/opal progresses
- minor refactoring of request wait routines

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-25 11:00:12 +03:00
Sergey Oblomov
d57ae62dee MCA/UCX: added common module
- implemented non-blocking routines for flush operations

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-22 16:41:09 +03:00
Yossi Itigin
c2fbf3a3e8 osc_ucx: register progress on-demand
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-06-19 12:47:08 +03:00
Howard Pritchard
64de269cc3 topo/treematch - quash compiler warning
quash a compiler warning showing up with gcc 7.3

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-06-13 16:34:17 -05:00
Nathan Hjelm
64a5baaa28
Merge pull request #5193 from hjelmn/osc_sm_location
Use /dev/shm for shared memory files in osc components
2018-06-05 09:42:14 -06:00
Nathan Hjelm
e9de42544e osc/sm: add support for controlling location of backing store
This commit adds a new MCA variable to set the location of the backing
store: osc_sm_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-05-29 21:44:01 -06:00
Nathan Hjelm
d0d59b1d7d osc/rdma: add support for controlling location of backing store
This commit adds a new MCA variable to set the location of the backing
store: osc_rdma_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-05-29 21:43:33 -06:00
Howard Pritchard
5b7c866f59 osc/pt2pt: disable when THREAD_MULTIPLE.
Per discussion at
https://github.com/open-mpi/ompi/issues/2614#issuecomment-392815654,
do not allow for selection of the OSC PT2PT when creating an MPI RMA
window when THREAD_MULTIPLE is active.  Print a helpful message and
return a not-supported error.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit d0ffd660841623c02d1dfa3151e7f7afd3327698)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-05-29 08:59:53 -07:00
John L. Jolly
36b9e15fb7 - Build warning: stringop-overflow in
get_dynamic_win_info() at osc_ucx_comm.c

In file included from /usr/include/string.h:494:0,
                 from ../../../../ompi/info/info.h:29,
                 from ../../../../ompi/mca/osc/base/base.h:24,
                 from osc_ucx_comm.c:13:
In function 'memcpy',
    inlined from 'get_dynamic_win_info' at osc_ucx_comm.c:359:5,
    inlined from 'ompi_osc_ucx_put' at osc_ucx_comm.c:401:18:
/usr/include/bits/string_fortified.h:34:10: warning: '__builtin___memcpy_chk' writing 8 bytes into a region of size 4 overflows the destination [-Wstringop-overflow=]
   return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is caused by a type size mismatch in a call to memcpy

This fix corrects the type definition of the win_count variable.

Signed-off-by: John Jolly <jjolly@suse.com>
2018-05-22 10:11:57 -06:00
bosilca
2ab628b92e
Merge pull request #5074 from bosilca/topic/remove_warnings
Remove warnings identified by clang.
2018-05-15 11:15:23 -04:00
Nathan Hjelm
cf585d725c osc/rdma: fix SEGV will null origin in FOP in debug build
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-05-08 14:10:20 -06:00
Joshua Ladd
32ddc6af7e
Merge pull request #5094 from xinzhao3/topic/osc-win-fix-master
OMPI/OSC/UCX: fix issue in impl of MPI_Win_create_dynamic/MPI_Win_attach/MPI_Win_detach
2018-05-02 17:42:34 -04:00
Xin Zhao
3f5ac97649 OMPI/OSC/UCX: set priority to 0.
Signed-off-by: Xin Zhao <xinz@mellanox.com>
2018-05-02 21:40:06 +03:00
Xin Zhao
53bdfd1dcb OMPI/OSC/UCX: fix issue in impl of MPI_Win_create_dynamic/MPI_Win_attach/MPI_Win_detach
Signed-off-by: Xin Zhao <xinz@mellanox.com>
2018-04-24 23:09:52 +03:00
George Bosilca
6ff11267fb
Remove warnings identified by clang.
Plus minor spacing and indentation issues.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-04-14 17:14:12 -04:00
Nathan Hjelm
e79debc320 osc/rdma: fix overflow in offset calculation
This commit fixes a bug is osc/rdma that can occur if the total size
of the shared memory segment gets larger than 4 GiB. The bug was
caused by a typo. The type of my_base_offset should have been size_t
not int.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-03-27 09:33:44 -06:00
Nathan Hjelm
f7faacca4e osc/rdma: fix 32-bit builds
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-03-27 09:16:04 -06:00
Jeff Squyres
06af6f1c4c
Merge pull request #4962 from jsquyres/pr/cid-fixes
A bunch of CID fixes
2018-03-26 22:30:31 -04:00
Jeff Squyres
08ceb66a19 osc/pt2pt: fix (effectively false positive) CID 1402113
This will almost certainly never happen, but be defensive and
guarantee that we never return an uninitialized variable.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:26:17 -07:00
Jeff Squyres
124208198c osc/rdma: fix CID 1424327
Fix minor memory leak.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:21 -07:00
Ralph Castain
3a93b535ec Silence the flood of OSC/RDMA warnings
Fixes #4950

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-25 16:12:41 -07:00
Nathan Hjelm
7f4872d483 osc/rdma: performance improvments and bug fixes
This commit is a large update to the osc/rdma component. Included in
this commit:

 - Add support for using hardware atomics for fetch-and-op and single
   count accumulate  when using the accumulate lock. This will improve
   the performance of these operations even when not setting the
   single intrinsic info key.

 - Rework how large accumulates are done. They now block on the get
   operation to fix some bugs discovered by an IBM one-sided test. I
   may roll back some of the changes if the underlying bug in the
   original design is discovered. There appear to be no real
   difference (on the hardware this was tested with) in performance so
   its probably a non-issue. References #2530.

 - Add support for an additional lock-all algorithm: on-demand. The
   on-demand algorithm will attempt to acquire the peer lock when
   starting an RMA operation. The lock algorithm default has not
   changed. The algorithm can be selected by setting the
   osc_rdma_locking_mode MCA variable. The valid values are two_level
   and on_demand.

 - Make use of the btl_flush function if available. This can improve
   performance with some btls.

 - When using btl_flush do not keep track of the number of put
   operations. This reduces the number of atomic operations in the
   critical path.

 - Make the window buffers more friendly to multi-threaded
   applications. This was done by dropping support for multiple
   buffers per MPI window. I intend to re-add that support once the
   underlying performance bug under the old buffering scheme is
   fixed.

 - Fix a bug in request completion in the accumulate, get, and put
   paths. This also helps with #2530.

 - General code cleanup and fixes.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-03-15 14:53:53 -06:00
Jeff Squyres
e7f91f8068
Merge pull request #4527 from clementFoyer/osc-no-includes
Remove inter-dependencies between OSC modules.
2018-02-09 15:49:56 -05:00
Clement Foyer
f5b4fc05f8 Remove inter-dependencies between OSC modules.
The osc monitoring component needed to include other OSC components
header in order to be able tu access communicator through the
component specific ompi_osc_*_module_t structures. This commit remove
the dependency, and resolve the issue #4523.

Extend the common monitoring API.

  * Now it's possible to translate from local rank to world rank from
    both the communicator and the group.
  * Remove useless hashtable as we directly use the w_group contained
    in window structure.

Add automatic generation at config time.

The templates are expanded at configure time. It creates a new header
file that generates all the variables/functions needed. Adding this
during the autogen automagicaly generates for each of the available
modules the proper functions.

Only keep a generated argv-style array.

Following Jeff's advice, the configure.m4 file generate a simple array
of module variables to be iterated over to find the proper module.

Signed-off-by: Clement Foyer <clement.foyer@inria.fr>
2018-02-07 11:52:00 +00:00
Gilles Gouaillardet
34b45cc879 osc/sm: fix the osc_free callback
If component selection fails, then module->bases might be unallocated
when ompi_osc_sm_free() in invoked, so test it before trying to free()
module->bases[0].

Thanks Martin Binder for the report.

Refs open-mpi/ompi#4770

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-01-31 11:23:21 +09:00
Xin Zhao
72ff2b1135 OMPI/OSC/UCX: adding atomic lock for fetch_and_op and compare_and_swap
Signed-off-by: Xin Zhao <xinz@mellanox.com>
2018-01-18 00:36:22 +02:00
Matias A Cabral
009ba475e1 osc/rmda: fix missing opal_argv_free in mtls search.
Use asprintf in description message to avoid missing default values
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
2018-01-11 14:29:16 -08:00
Nathan Hjelm
7893248c5a opal/asm: add fetch-and-op atomics
This commit adds support for fetch-and-op atomics. This is needed
because and and or are irreversible operations so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is not atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:23 -07:00