1
1
Граф коммитов

1385 Коммитов

Автор SHA1 Сообщение Дата
Brelle Emmanuel
e630046a4b pml/ob1: fixed local handle sent during PUT control message
In case of using a btl_put in ob1, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the sent handle is necessary the handle in the frag but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request one.

I suggest to check if the handle in the fragment is valid and if not to
send the handle from the request.

Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
2019-04-01 18:45:05 +02:00
Sergey Oblomov
d8e3562bae PML/SPML/UCX: added evaluation of mmap events
- there was a set of UCX related issues reported which caused
  by mmap API hooks conflicts. We added diagnostic of such
  problems to simplify bug-resolving pipeline

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2019-03-12 21:14:27 +02:00
Mikhail Brinskii
751d88192d PML/UCX: Use net worker address for remote peers
For remote node peers pack smaller worker address, which contains
network device addresses only. This would reduce amount of OOB traffic
during startup.

Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
2019-02-14 18:06:36 +02:00
Thananon Patinyasakdikul
0263456cf4 pml/ob1: fix deadlock with communicator flag ALLOW_OVERTAKE.
We missed an assert to check if ALLOW_OVERTAKE is set or not before
validating the sequence number and this will cause deadlock.

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
2019-01-29 14:55:06 -05:00
Yossi Itigin
83cca9d52a ucx: add owner.txt for components
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-12-01 17:14:03 +02:00
Yossi Itigin
f36eeef4c5 pml_ucx: initialize req_mpi_object.comm for error handler
without this fix, an error handler invoked on pml_ucx request would
segfault while trying to dereference requests[i]->req_mpi_object.comm

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-11-25 19:37:54 +02:00
Yossi Itigin
4c442f2601
Merge pull request #5934 from hoopoepg/topic/suppressed-cov-warn-added-log-msg
COMMON/UCX: suppressed coverity warnings
2018-10-22 11:00:47 +03:00
Sergey Oblomov
1099d5f023 COMMON/UCX: added error code to log output
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-21 11:37:25 +03:00
George Bosilca
dc972f0b92
Fix the PML monitoring.
The monitoring PML hides it's existence from the OMPI infrastructure by
removing itself from the list of PML loaded components, remaining hidden
until MPI_Finalize.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-18 00:29:23 -04:00
George Bosilca
668aa15dda
Early selection of the best PML.
With this patch the best PML is selected earlier, before finalizing
the others PML. This provides a simpler mechanism to intercept and
highjack the PML (as done in the monitoring PML)

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-18 00:29:23 -04:00
Howard Pritchard
7d6774acf8 remove the bfo pml
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-10-17 06:50:11 -06:00
Jeff Squyres
54ca3310ea ompi: cleanup various string operations
Several fixes to string handling:

1. strncpy() -> opal_string_copy() (because opal_string_copy()
   guarantees to NULL-terminate, and strncpy() does not)
2. Simplify a few places, such as:
   * Since opal_string_copy() guarantees to NULL terminate, eliminate
     some memsets(), etc.
   * Use opal_asprintf() to eliminate multi-step string creation

There's more work that could be done; e.g., this commit doesn't
attempt to clean up any strcpy() usage.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:10:20 -07:00
Yossi Itigin
b71e85b8d5 pml_ucx: fix return code from mca_pml_ucx_init() error flow
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-11 18:48:54 +03:00
Yossi Itigin
40ac9e4771 pml_ucx: fix return code from mca_pml_ucx_init()
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 14:41:05 +03:00
Yossi Itigin
4763822a64 pml_ucx: add ompi datatype attribute to release ucp_datatype
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-09 17:34:34 +03:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Nathan Hjelm
000f9eed4d opal: add types for atomic variables
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:48:55 -06:00
Gilles Gouaillardet
fed33c1530 pml/ob1: plug a memory leak in mca_pml_ob1_component_fini()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:17 +09:00
Yossi Itigin
68206a5635
Merge pull request #5569 from hoopoepg/topic/optimize-blocked-calls
PML/UCX: blocked calls optimizations
2018-08-29 14:19:09 +03:00
Yossi Itigin
4bb6845888
Merge pull request #5570 from hoopoepg/topic/opal-mem-hooks-syno
MCA/COMMON/UCX: added synonym to opal_mem_hook variable
2018-08-29 14:16:33 +03:00
Sergey Oblomov
c201c0abb3 PML/UCX: blocked calls optimizations: removed reset progress count
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:39 +03:00
Sergey Oblomov
2cd9e04166 PML/UCX: optimization of mprobe call - renamed vars
- renamed of internal variable names
- used unsigned datatypes

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:39 +03:00
Sergey Oblomov
38e908f83e PML/UCX: optimization of mprobe call
- refactoring of opal/UCX progress calls

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:38 +03:00
Sergey Oblomov
b0f87f2235 PML/UCX: blocked calls optimizations
- added UCX progress priority

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:38 +03:00
Jeff Squyres
fe0852bcb4 Miscellaneous compiler warning stomps.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-24 07:39:14 -07:00
Nathan Hjelm
1c84f48640 config: remove OPAL_ENABLE_MULTI_THREADS config macro
We long ago hard-coded this value to 1. This commit cleans it out
entirely.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-23 13:47:02 -06:00
Sergey Oblomov
e00f7a68ba MCA/COMMON/UCX: added synonim to opal_mem_hook variable
- added synonim to opal_mem_hook variable to allow
  to print it in opal_info -a

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-21 15:05:12 +03:00
Nathan Hjelm
c294bbc352
Merge pull request #5508 from hjelmn/fuzzy_match
Bring fuzzy matching support into master
2018-08-06 13:52:04 -06:00
Matthew Dosanjh
c8d13486cc Fixed promotion bug
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-06 12:56:36 -06:00
Boris Karasev
57683366ca pmix: added check for pmix fence status
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-06 15:01:57 +06:00
Nathan Hjelm
dd74c6252f pml/ob1: custom matching cleanup and configury
This commit updates the new custom matching code in pml/ob1 so it can
not be enabled with a configure option. This commit also renames the
fuzzy-matching headers to avoid potential name conflicts and removes
the use of C reserved identifiers.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-02 13:06:19 -06:00
Matthew Dosanjh
572694b621 Adding custom match source.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-02 12:23:08 -06:00
Sergey Oblomov
d204b8a678 PML/SPML/UCX/COMPONENT: applied C99 initialization
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-28 09:44:03 +03:00
Sergey Oblomov
2806504290 PML/SPML/UCX: init global objects using C99 style
- to avoid value mix used C99 style of object initializations

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-25 14:52:45 +03:00
Sergey Oblomov
6fe0a73861 PML/UCX: fixed ucp request free on persistent request completion
- in sine cases persistent request was deleted during completion
  callback, this cause double free of linked UCX request (assert
  in debug build or hang in release build)
- UCX request is freed prior completion calback

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-20 19:32:20 +03:00
Yossi Itigin
29812494f2
Merge pull request #5402 from hoopoepg/topic/common-del-procs
MCA/COMMON/UCX: del_procs calls are unified to common module
2018-07-19 11:19:45 +03:00
Sergey Oblomov
920cc2e0d9 MCA/COMMON/UCX: del_procs calls are unified to common module
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-18 07:37:25 +03:00
Sergey Oblomov
1c7ae22dfb MCA/COMMON/UCX: shift opal memhooks into common UCX
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-17 13:46:38 +03:00
KAWASHIMA Takahiro
0021616984 pml/ob1: Fix data corruption of MPI_BSEND
Data transferred by `MPI_BSEND` may corrupt if all of the following
conditions are met.

- The message size is less than the eager limit.
- The `btl_alloc` function in the BTL interface returns `NULL`
  for some reason.
- The MPI program overwrites the send buffer after `MPI_BSEND`
  returns.

The problem is in the way of pending a send request in ob1 PML.
The `mca_pml_ob1_send_request_start_copy` function retruns
`OMPI_ERR_OUT_OF_RESOURCE` if `mca_bml_base_alloc` function returns
`des = NULL`. In this case, the send request is added to the
`send_pending` list and `MPI_BSEND` returns immediately. Next time
the `mca_pml_ob1_send_request_start_copy` function tries sending,
the user buffer may have been overwritten by the MPI program.

Call hierarchy of `MPI_BSEND`:

```
  MPI_Bsend
    mca_pml_ob1_send
      if (MCA_PML_BASE_SEND_BUFFERED == sendmode)
        mca_pml_ob1_isend
          MCA_PML_OB1_SEND_REQUEST_START_W_SEQ
            mca_pml_ob1_send_request_start_seq
              mca_pml_ob1_send_request_start_btl
                if (size <= eager_limit)
                  if (req_send_mode == MCA_PML_BASE_SEND_BUFFERED)
                    mca_pml_ob1_send_request_start_copy
                      mca_bml_base_alloc
                        btl_alloc
              if (OMPI_ERR_OUT_OF_RESOURCE == rc)
                add_request_to_send_pending
        ompi_request_free
```

To solve this problem, we should save the data to the buffer
attached by `MPI_BUFFER_ATTACH` before leaving `MPI_BSEND`.

This problem was introduced by ob1 optimization (commits 2b57f422
and a06e491c) in v1.8 series.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2018-07-12 14:30:58 +09:00
Sergey Oblomov
240670152e MCA/COMMON/UCX: code beautify - alignment
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-06 19:40:58 +03:00
Sergey Oblomov
bef47b792c MCA/COMMON/UCX: unified logging across all UCX modules
- added common logging infrastructure for all
  UCX modules
- all UCX modules are switched to new infra

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-05 16:25:39 +03:00
Sergey Oblomov
8080283b3d MCA/COMMON/UCX: changed return type for wait_request
- for now wait_request returns OMPI status
- updated callers

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-04 23:29:38 +03:00
Sergey Oblomov
c2bd6af9f2 MCA/COMMON/UCX: minor unification of del_proces calls
- some common functionality of del_procs calls is moved into
  mca_common module
- blocking ucp_put call is replaced by non-blocking routine

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-02 15:10:53 +03:00
Sergey Oblomov
074f30ba27 PML/UCX: suppressed compilation warning
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-27 12:05:07 +03:00
Sergey Oblomov
502d04bf12 UCX/PML/SPML: fixed few coverity issues
- fixed incorrect pointer manipulation/free
- cleaned dead code
- minor optimization on process delete routine
- fixed error handling - free pointers
- added debug output for woker flush failure

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-26 18:52:39 +03:00
Yossi Itigin
ee873f4f79
Merge pull request #5322 from hoopoepg/topic/mca-ucx-common
MCA/UCX: added common module
2018-06-26 13:54:12 +03:00
Sergey Oblomov
d57ae62dee MCA/UCX: added common module
- implemented non-blocking routines for flush operations

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-22 16:41:09 +03:00
Gilles Gouaillardet
edd02b7144 pml/ucx: silence a warning
declare 'fenced' volatile in order to silence CID 1437465

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-06-22 13:11:42 +09:00
Sergey Oblomov
5f03628560 PML/UCX: removed uneeded flush
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-21 12:40:46 +03:00
Sergey Oblomov
2745da7dcc PML/UCX: use non-blocking fence instead of async progress
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-21 09:46:03 +03:00