
4010 Commits

Author SHA1 Message Date
Shintaro Iwasaki
84dcb233bf mca/threads: set THREAD_* flags in the component's root configure.m4
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-10-05 11:39:26 -05:00
Shintaro Iwasaki
919a16300c mca/threads/qthreads: implement missing functionalities
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-10-05 11:39:18 -05:00
Shintaro Iwasaki
db3e598b6a mca/threads/qthreads: remove Argobots dependency
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-10-05 11:39:09 -05:00
Shintaro Iwasaki
6cc17b0c6a mca/threads/qthreads: rework configury to be smarter
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-10-05 11:39:01 -05:00
Brian Barrett
8f89d15d31 build: Move PMIx to a 3rd-party package
With Open MPI 5.0, the decision was made to stop building
3rd-party packages, such as Libevent, HWLOC, PMIx, and PRRTE as
MCA components and instead 1) start relying on external libraries
whenever possible and 2) Open MPI builds the 3rd party
libraries (if needed) as independent libraries, rather than
linked into libopen-pal.

This patch moves the PMIx library bundled with Open MPI from a
MCA framework to a stand-alone library built outside of OPAL.  Due
to the amount of code in the MCA base (and its assumptions about
being part of an MCA framework), the framework is left with no
active components.  Any pre-installed version of PMIx 3.0.0 or
newer is preferred over the internal version.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-10-01 16:56:00 +00:00
Brian Barrett
0e9581d478 build: Move hwloc to a 3rd-party package
With Open MPI 5.0, the decision was made to stop building
3rd-party packages, such as Libevent, HWLOC, PMIx, and PRRTE as
MCA components and instead 1) start relying on external libraries
whenever possible and 2) Open MPI builds the 3rd party
libraries (if needed) as independent libraries, rather than
linked into libopen-pal.

This patch moves the hwloc library bundled with Open MPI from a
MCA framework to a stand-alone library built outside of OPAL.  Due
to the amount of code in the MCA base (and its assumptions about
being part of an MCA framework), the framework is left with no
active components.  Any pre-installed version of HWLOC 1.6 or
newer is preferred over the internal version.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-10-01 16:55:59 +00:00
Brian Barrett
9ffac85650 build: Move libevent to a 3rd-party package
With Open MPI 5.0, the decision was made to stop building
3rd-party packages, such as Libevent, HWLOC, PMIx, and PRRTE as
MCA components and instead 1) start relying on external libraries
whenever possible and 2) Open MPI builds the 3rd party
libraries (if needed) as independent libraries, rather than
linked into libopen-pal.

This patch moves libevent from an MCA framework to a stand-alone
library built outside of OPAL.  A wrapper in opal/util is provided
to minimize the unnecessary changes in the rest of the code.  When
using the internal Libevent, it will be installed as a stand-alone
libevent.a, instead of bundled in OPAL.  Any pre-installed version
of Libevent at or after 2.0.21 is preferred over the internal
version.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-10-01 16:55:58 +00:00
Nathan Hjelm
7fca99b2f7 patcher: remove the linux component
The Linux component was an attempt to hook calls by patching the dynamic
symbol table. It, unfortunately, does not work as it will always miss
calls made internally by glibc. For example, it might catch a user call
directly to munmap but will miss the chain free -> munmap. Since the
latter is the common case we were trying to hook, this made the component
unusable. This PR finally kills the component.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-09-18 10:23:01 -06:00
bosilca
eca00a7a3b
Merge pull request #8042 from bosilca/fix/sm_emu
Fix a copy/paste in the RDMA emulation.
2020-09-14 11:43:00 -04:00
Jeff Squyres
3a93e4f94d
Merge pull request #8038 from devreal/fix-opal-pmix-cond-init
Use correct conditional variable initializer in opal/mca/pmix/base
2020-09-14 09:38:43 -04:00
George Bosilca
49da998f33
Fix a copy/paste in the RDMA emulation.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-09-13 22:56:58 -04:00
Joseph Schuchart
b78c7e93db Use correct conditional variable initializer in opal/mca/pmix/base
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-09-09 09:05:30 +02:00
Joseph Schuchart
fc025c78df UCX: do not dereference NULL pointer in wpmem_[free|flush]
Flushing or freeing a newly created dynamic window causes NULL to be passed.

Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-09-04 09:31:18 +02:00
Nathan Hjelm
556a4ac0da btl: remove unused descriptor flags
This PR removes the MCA_BTL_DES_FLAGS_PUT and MCA_BTL_DES_FLAGS_GET
descriptor flags. At some point these had some meaning but they were
replaced by the rcache access flags.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-08-31 13:07:32 -06:00
Brian Barrett
f3832c1ab9
Merge pull request #7973 from wckzhang/btlexclude
btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0
2020-08-12 13:34:03 -07:00
William Zhang
41acfee2bb btl/ofi: Disable ofi_rxm provider
The ofi_rxm provider is dependent upon the underlying hardware for its
implementation of FI_DELIVERY_COMPLETE. Since this can lead to early
completions, we disable the provider to avoid correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-08-11 16:47:19 -07:00
bosilca
b6a06ca37b
Merge pull request #7974 from abouteiller/bugfix/vader_des_tag
bug fix: des->tag = hdr->frag, should be hdr->tag
2020-08-11 11:13:14 -04:00
Mikhail Kurnosov
4708458d6b Fix a typo in parsing locality string: L0 changed to L1
(`prte_hwloc_base_get_locality_string` never returns locality string with L0).

Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
2020-08-11 08:43:47 +07:00
Jeff Squyres
9a0f661a66
Merge pull request #7975 from wckzhang/btlcommonlist
btl/ofi: Use common provider include/exclude list
2020-08-10 14:41:53 -04:00
William Zhang
9b8f463a76 btl/ofi: Use common provider include/exclude list
The btl/ofi does not currently utilize the common ofi include/exclude
list. Added verification code similar to the mtl/ofi that will check if
the info object is in the include or exclude list. If it isn't in the
include list or is in the exclude list, validate_info will return
OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint
when calling getinfo, instead filtering the provider during
validate_info.

This patch also moves the is_in_list MTL function into common code and
adds additional debugging output to the BTL to match the MTL standard.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-07-31 12:13:00 -07:00
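
The include/exclude filtering described in commit 9b8f463a76 can be sketched roughly as follows. This is a minimal illustration, not the Open MPI code: the names provider_in_list and validate_provider, the NULL-terminated list representation, and the 0 / -1 return values are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if `name` equals an entry of the
 * NULL-terminated array `list`. A NULL list matches nothing. */
static bool provider_in_list(const char *const *list, const char *name)
{
    if (NULL == list || NULL == name) {
        return false;
    }
    for (size_t i = 0; NULL != list[i]; ++i) {
        if (0 == strcmp(list[i], name)) {
            return true;
        }
    }
    return false;
}

/* Hypothetical filter: returns 0 to accept a provider, -1 to reject it.
 * A provider is rejected when an include list exists and does not name it,
 * or when an exclude list exists and does name it. */
static int validate_provider(const char *const *include,
                             const char *const *exclude,
                             const char *name)
{
    if (NULL != include && !provider_in_list(include, name)) {
        return -1;
    }
    if (NULL != exclude && provider_in_list(exclude, name)) {
        return -1;
    }
    return 0;
}
```

Filtering after the provider query, rather than passing a provider name as a hint, lets one check serve both list types.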
William Zhang
a7dcfd9874 btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0
EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric
versions. While FI_DELIVERY_COMPLETE would be advertised by the
provider, completions would return too early by not accounting for
bounce buffers on the receive side. This would cause the BTL
to receive early completions that lead to correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-07-30 13:53:16 -07:00
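
A version gate like the one described in commit a7dcfd9874 might look like the sketch below. It is only an illustration: SKETCH_VERSION and exclude_efa_provider are hypothetical names, the major/minor packing is defined locally to keep the example self-contained, and matching the provider by the string "efa" is an assumption.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical packing of major.minor into one comparable integer. */
#define SKETCH_VERSION(major, minor) \
    (((unsigned) (major) << 16) | (unsigned) (minor))

/* Hypothetical check: true if the named provider should be skipped
 * because it is EFA running on a libfabric older than 1.12. */
static bool exclude_efa_provider(const char *prov_name,
                                 unsigned libfabric_version)
{
    if (0 != strcmp(prov_name, "efa")) {
        return false; /* only EFA is affected by this rule */
    }
    return libfabric_version < SKETCH_VERSION(1, 12);
}
```

Packing the version into a single integer makes "earlier than 1.12" a plain integer comparison that also handles a future major version correctly.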
Aurelien Bouteiller
8e0cb1d49d
des->tag = hdr->frag, should be hdr->tag
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-30 14:02:22 -04:00
Tommy Janjusic
2c8da2c0a9 Further code reduction and simplifications.
Co-authored-by: Artem Polyakov <artpol84@gmail.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 20:00:22 +03:00
Tomislav Janjusic
cbfc9a3263 opal/mca/common/ucx: Use new TSD api
Co-authored-by: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Tomislav Janjusic
72296e12f4 opal/common/ucx:
-mutex lock/unlock suggestions
-common destructor/cleanup

Co-authored-with: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Tomislav Janjusic
27ba4b612f ompi/osc/ucx: Remove workerpool's global thread storage tables.
Co-authored-by: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Austen Lauria
d0152eb51e
Merge pull request #7940 from awlauria/revert_libevent_commit
Revert "Address a race condition in libevent select."
2020-07-28 11:34:59 -04:00
Tomislav Janjusic
d809f6ba27 New TSD API interface fix for various components
Co-authored by: Artem Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:40 +03:00
Tomislav Janjusic
cba5a0e117 Rename tsd interface function calls
Co-authored by: Artem Polyakov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Tomislav Janjusic
cb1955bb53 Fix renamed interface functions for argo, q, and pthreads
Co-authored by: Artem Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Tomislav Janjusic
07dc86eb3a opal/thread: New TSD API
Co-authored-by: Artem Polyakov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Ralph Castain
c0bc89dc50
Sync to PMIx and PRRTE master
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-07-23 12:35:17 -07:00
George Bosilca
8bc1f3d8fb Don't allow any asynchronous CUDA operations.
There are 2 reasons for this:
- pending CUDA events are not progressed by this BTL, so anything that becomes
  asynchronous will never be completed.
- we use the packed data on the shared memory backing file, and this will be
  returned to the peer process upon return (thus if we copy asynchronously we
  might not copy the right data).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-15 01:37:09 -04:00
George Bosilca
0e32b0acef Avoid a lock if no CUDA IPC operations are pending.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-15 01:35:34 -04:00
Austen Lauria
67d90166cf Revert "Address a race condition in libevent select."
We do not want to be patching upstream components anymore.
The proper method is to get this merged upstream, then
pull it in the next upstream release.

This reverts commit c39fb5758a772c062e20db9b42f2b06805884802.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-07-14 16:23:21 -04:00
George Bosilca
fd4ca394e2 Make the smcuda BTL great again.
It has been broken for months because of the lack of initialization of the
HWLOC library. The smcuda process creating the backing file (local rank 0)
uses opal_cache_line_size to align the objects in the backing file, and the
opal_cache_line_size is initialized by default to 128. Later on, when the rest
of the processes attach the same backing file, HWLOC has been called and the
cache line size has now been updated to the correct value. If this value
differs from the default (as it usually does, since most cache lines are
64 bytes right now), the objects in the backing file will be misaligned.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-14 01:48:08 -04:00
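
The misalignment described in commit fd4ca394e2 can be illustrated with a small sketch. The helper names and the 40-byte header size are hypothetical; only the 128-byte default and the 64-byte detected value come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment`,
 * which must be a nonzero power of two. */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical layout rule: the first object in the backing file is
 * placed at the first cache-line boundary after the header. */
static size_t first_object_offset(size_t header_size, size_t cache_line_size)
{
    return align_up(header_size, cache_line_size);
}
```

With a 40-byte header, the creating process (using 128) places the first object at offset 128, while an attaching process (using 64) expects it at offset 64, so the two sides disagree about where every object lives.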
George Bosilca
96e8cbe25f First step on fixing the BTL API conversion for the SMCUDA BTL
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-13 14:46:10 -04:00
Nathan Hjelm
d0c0cb7144
Merge pull request #7913 from hjelmn/btl_base_atomics_are_awesome
btl: change argument type of BTL receive callbacks
2020-07-11 12:13:26 -06:00
Howard Pritchard
677b662295
Merge pull request #7912 from tomhers/fix_opal_ofi_compile_bug
BTL/OFI: Fix missing include file.
2020-07-10 14:43:38 -06:00
Nathan Hjelm
88f51fbb8e btl: change argument type of BTL receive callbacks
This commit updates the btl interface to change the parameters
passed to receive callbacks. The interface used to pass the tag,
a btl base descriptor, and the callback context. Most of the
values in the btl base descriptor were unused and only helped
simplify the callbacks from the self btl. All of the arguments
have now been replaced with a single receive callback descriptor.
This descriptor contains the incoming endpoint, data segment(s),
tag, and callback context. All btls have been updated to use
the new callback and the btl interface version has been bumped
to v3.2.0.

As part of this change the descriptor argument (and the segments
contained within it) have been marked as const. They were treated
as const before but this change could allow the compiler to make
better optimization decisions and will enforce that the callback
does not attempt to change the data in the descriptor.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-07-08 07:38:46 -07:00
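
The single const receive descriptor described in commit 88f51fbb8e could be shaped roughly like the sketch below. Every type and field name here is hypothetical and chosen for illustration (including the 8-bit tag type); the actual definitions live in the btl interface headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical data segment: a read-only buffer and its length. */
typedef struct sketch_segment {
    const void *addr;
    size_t len;
} sketch_segment_t;

/* Hypothetical receive descriptor bundling everything the callback needs:
 * incoming endpoint, data segment(s), tag, and callback context. */
typedef struct sketch_recv_descriptor {
    const void *endpoint;
    const sketch_segment_t *segments;
    size_t segment_count;
    uint8_t tag;
    void *cbdata;
} sketch_recv_descriptor_t;

/* The callback receives one const descriptor instead of several arguments. */
typedef void (*sketch_recv_cb_t)(const sketch_recv_descriptor_t *desc);

/* Example callback: adds the received byte count to a size_t counter
 * passed through the callback context. */
static void count_bytes_cb(const sketch_recv_descriptor_t *desc)
{
    size_t *total = (size_t *) desc->cbdata;
    for (size_t i = 0; i < desc->segment_count; ++i) {
        *total += desc->segments[i].len;
    }
}
```

Because the descriptor and its segments are const, an accidental write such as desc->tag = 0 inside the callback fails to compile.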
tomhers
88f9d2c90f BTL/OFI: Fix missing include file.
The missing include file causes an error when using an external version of LibEvent.

Signed-off-by: tomhers <tom.herschberg@gmail.com>
2020-07-06 16:32:37 -04:00
Nikola Dancejic
7e46371301 common/ofi: added address format check to fix provider selection
bugfix: provider selection would not differentiate between ipv4
and ipv6 addresses which would cause some nodes to be unable
to communicate between each other. Adding a check for address
format to provider selection to ensure that all nodes use the
same address format.

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
2020-07-02 23:45:59 +00:00
Austen Lauria
b6b300d25d
Merge pull request #6784 from abouteiller/export/event-infloop
Address a race condition in libevent select.
2020-06-30 10:26:57 -04:00
Austen Lauria
868eee31c1
Merge pull request #7883 from hoopoepg/topic/fixed-potential-deadlock-wpool
UCX/WPOOL: fixed potential deadlock
2020-06-29 17:21:39 -04:00
Sergey Oblomov
a383312393 UCX/WPOOL: fixed potential deadlock
- fixed funcs:
  opal_common_ucx_wpmem_putget
  opal_common_ucx_wpmem_cmpswp
  opal_common_ucx_wpmem_post
  opal_common_ucx_wpmem_fetch
  opal_common_ucx_wpmem_fetch_nb

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2020-06-29 13:40:50 +03:00
Ralph Castain
ba27fb79b5
Sync to PMIx/PRRTE master branches
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-27 16:50:09 -07:00
Sergey Oblomov
34f2f6af84 UCX/WPOOL: fixed potential deadlock
- fixed potential deadlock in error processing

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2020-06-27 15:19:01 +03:00
Christoph Niethammer
f0f206b247
Merge pull request #7673 from cniethammer/uct-supported-version-update
Accept UCX 1.8 in configure of btl/uct
2020-06-26 20:53:36 +02:00
Jeff Squyres
f64c30e93c common_ofi: fix preprocessor macro typo
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-06-26 07:23:27 -07:00
Joseph Schuchart
634f67b216
Merge pull request #7843 from devreal/clang-tidy-free
Some fixups for issues detected by clang-tidy
2020-06-25 17:30:04 +02:00