1
1

5866 Коммитов

Автор SHA1 Сообщение Дата
Brian Barrett
f3832c1ab9
Merge pull request #7973 from wckzhang/btlexclude
btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0
2020-08-12 13:34:03 -07:00
William Zhang
41acfee2bb btl/ofi: Disable ofi_rxm provider
The ofi_rxm provider is dependent upon the underlying hardware for its
implementation of FI_DELIVERY_COMPLETE. Since this can lead to early
completions, we disable the provider to avoid correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-08-11 16:47:19 -07:00
bosilca
b6a06ca37b
Merge pull request #7974 from abouteiller/bugfix/vader_des_tag
bug fix: des->tag = hdr->frag, should be hdr->tag
2020-08-11 11:13:14 -04:00
Mikhail Kurnosov
4708458d6b Fix a typo in parsing locality string: L0 changed to L1
(`prte_hwloc_base_get_locality_string` never returns locality string with L0).

Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
2020-08-11 08:43:47 +07:00
Jeff Squyres
9a0f661a66
Merge pull request #7975 from wckzhang/btlcommonlist
btl/ofi: Use common provider include/exclude list
2020-08-10 14:41:53 -04:00
Nathan Hjelm
a44914cb6b
Merge pull request #7915 from bosilca/fix/intel_2330_warning_take2
Second take on fixing the Intel _Atomic atomic operation warning
2020-08-08 06:30:15 -06:00
Jeff Squyres
f5cb1a49b1
Merge pull request #7897 from cniethammer/cmd_line_fixes
Minor fix in cmd line parser help
2020-08-08 07:13:22 -04:00
William Zhang
9b8f463a76 btl/ofi: Use common provider include/exclude list
The btl/ofi does not currently utilize the common ofi include/exclude
list. Added verification code similar to the mtl/ofi that will check if
the info object is in the include or exclude list. If it isn't in the
include list or is in the exclude list, validate_info will return
OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint
when calling getinfo, instead filtering the provider during
validate_info.

This patch also moves the is_in_list MTL function into common code and
adds additional debugging output to the BTL to match the MTL standard.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-07-31 12:13:00 -07:00
William Zhang
a7dcfd9874 btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0
EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric
versions. While FI_DELIVERY_COMPLETE would be advertised by the
provider, completions would return too early by not accounting for
bounce buffers on the receive side. This would cause the BTL
to receive early completions that lead to correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-07-30 13:53:16 -07:00
Aurelien Bouteiller
8e0cb1d49d
des->tag = hdr->frag, should be hdr->tag
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-30 14:02:22 -04:00
Tommy Janjusic
2c8da2c0a9 Further code reduction and simplifications.
Co-authored-by: Artem Polyakov <artpol84@gmail.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 20:00:22 +03:00
Tomislav Janjusic
cbfc9a3263 opal/mca/common/ucx: Use new TSD api
Co-authored-by: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Tomislav Janjusic
72296e12f4 opal/common/ucx:
-mutex lock/unlock suggestions
-common destructor/cleanup

Co-authored-with: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Tomislav Janjusic
27ba4b612f ompi/osc/ucx: Remove workerpool's global thread storage tables.
Co-authored-by: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Austen Lauria
d0152eb51e
Merge pull request #7940 from awlauria/revert_libevent_commit
Revert "Address a race condition in libevent select."
2020-07-28 11:34:59 -04:00
Tomislav Janjusic
d809f6ba27 New TSD API interface fix for various components
Co-authored by: Artem Polykaov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:40 +03:00
Tomislav Janjusic
cba5a0e117 Rename tsd interface function calls
Co-authored by: Artem Polykaov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Tomislav Janjusic
cb1955bb53 Fix renamed interface functions for argo, q, and pthreads
Co-authored by: Artem Polykaov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Tomislav Janjusic
07dc86eb3a opal/thread: New TSD API
Co-authored-by: Artem Polyakov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Ralph Castain
c0bc89dc50
Sync to PMIx and PRRTE master
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-07-23 12:35:17 -07:00
Aurelien Bouteiller
816acbdfb1
Merge pull request #7840 from abouteiller/mpi-next/init-errh
MPI-4: Initial error handler
2020-07-21 11:55:14 -04:00
Aurélien Bouteiller
3cd85a9ec5
Add the initial_errhandler info key to MPI_INFO_ENV and populate the
value from prun populated paremeters

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>

Allow errhandlers to invoke the initial error handler before MPI_INIT

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

Indentation

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-16 03:10:32 -04:00
George Bosilca
8bc1f3d8fb Don't allow any asynchronous CUDA operations.
There are 2 reasons for this:
- pending CUDA events are not progressed by this BTL, so anything that becomes
  asychronous will never be completed.
- we use the packed data on the shared memory backing file, and this will be
  returned to the peer process upon return (thus if we copy asynchronously we
  might not copy the right data).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-15 01:37:09 -04:00
George Bosilca
0e32b0acef Avoid a lock if no CUDA IPC operations are pending.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-15 01:35:34 -04:00
Austen Lauria
67d90166cf Revert "Address a race condition in libevent select."
We do not want to be patching upstream components anymore.
The proper method is to get this merged upstream, then
pull it in the next upstream release.

This reverts commit c39fb5758a772c062e20db9b42f2b06805884802.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-07-14 16:23:21 -04:00
George Bosilca
fd4ca394e2 Make the smcuda BTL great again.
It has been broken for months because of the lack of initialization of the
HWLOC library. The smcuda process creating the backing file (local rank 0)
uses opal_cache_line_size to align the objects in the backing file, and the
opal_cache_line_size is initialized by default to 128. Later on, when the rest
of the processes attach the same backing file, HWLOC has been called and the
cache size has now been updated to the correct value. If this value is
different than the default one (and they are as most cache sizes are 64 bytes
right now) the objects in the backing file will be misaligned.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-14 01:48:08 -04:00
George Bosilca
96e8cbe25f First step on fixing the BTL API conversion for the SMCUDA BTL
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-13 14:46:10 -04:00
Nathan Hjelm
d0c0cb7144
Merge pull request #7913 from hjelmn/btl_base_atomics_are_awesome
btl: change argument type of BTL receive callbacks
2020-07-11 12:13:26 -06:00
Howard Pritchard
677b662295
Merge pull request #7912 from tomhers/fix_opal_ofi_compile_bug
BTL/OFI: Fix missing include file.
2020-07-10 14:43:38 -06:00
Brian Barrett
4a6e1629e8
Merge pull request #7906 from dancejic/multi
common/ofi: added address format check to fix provider selection
2020-07-09 16:59:02 -07:00
Nathan Hjelm
88f51fbb8e btl: change argument type of BTL receive callbacks
This commit updates the btl interface to change the parameters
passed to receive callbacks. The interface used to pass the tag,
a btl base descriptor, and the callback context. Most of the
values in the btl base descriptor were unused and only helped
simplify the callbacks from the self btl. All of the arguments
have now been replaced with a single receive callback descriptor.
This descriptor contains the incoming endpoint, data segment(s),
tag, and callback context. All btls have been updated to use
the new callback and the btl interface version has been bumped
to v3.2.0.

As part of this change the descriptor argument (and the segments
contained within it) have been marked as const. The were treated
as const before but this change could allow the compiler to make
better optimization decisions and will enforce that the callback
does not attempt to change the data in the descriptor.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-07-08 07:38:46 -07:00
George Bosilca
ddfb4def2d Second take on fixing the Inel _Atomic atomic operation warning.
We completely disable C11 atomic op support for _Atomic for
all Intel compiler prior to 20200310 (which is currently the
latest released), by switching to our pre-C11 atomic
operations.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-08 09:58:43 -04:00
tomhers
88f9d2c90f BTL/OFI: Fix missing include file.
The missing include file causes an error when using an external version of LibEvent.

Signed-off-by: tomhers <tom.herschberg@gmail.com>
2020-07-06 16:32:37 -04:00
Christoph Niethammer
a3483e4b71 Fix null pointer arithmetic resulting in potential undefined behavior
NULL pointer arithmetic is undefined behaviour in c.
The payload_ptr can be NULL in the moment when mpool is not initialized.

References from the c11 standard:
- 6.5.6 Additive operators
- 6.3.2.3 Pointers

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-07-03 22:18:17 +02:00
Nikola Dancejic
7e46371301 common/ofi: added address format check to fix provider selection
bugfix: provider selection would not differentiate between ipv4
and ipv6 addresses which would cause some nodes to be unable
to communicate between each other. Adding a check for address
format to provider selection to ensure that all nodes use the
same address format.

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
2020-07-02 23:45:59 +00:00
Jeff Squyres
e547e2d3ea
Merge pull request #7905 from cniethammer/integer_shift_fix
Fix shifting signed integer
2020-07-02 14:35:40 -04:00
Christoph Niethammer
8483ec41ab Fix shifting signed integer
Shifting signed int will lead to undefined behavior in case of 1<<31 here

Signed-off-by: Christoph Niethammer niethammer@hlrs.de
2020-07-02 17:43:45 +02:00
Christoph Niethammer
6ac111c6dc Fix initialization warning reported by clang-10
Delete duplicate initializer for cpuset in opal_process_info.

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-07-02 15:15:54 +02:00
Christoph Niethammer
7f6263cdd4 Update out of date documentation for opal_cmd_line_create()
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-06-30 22:04:55 +02:00
Christoph Niethammer
f9c42c96d6 Fix wrong enum type in default command line option category assignment
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-06-30 22:04:48 +02:00
Austen Lauria
b6b300d25d
Merge pull request #6784 from abouteiller/export/event-infloop
Address a race condition in libevent select.
2020-06-30 10:26:57 -04:00
Austen Lauria
868eee31c1
Merge pull request #7883 from hoopoepg/topic/fixed-potential-deadlock-wpool
UCX/WPOOL: fixed potential deadlock
2020-06-29 17:21:39 -04:00
Sergey Oblomov
a383312393 UCX/WPOOL: fixed potential deadlock
- fixed funcs:
  opal_common_ucx_wpmem_putget
  opal_common_ucx_wpmem_cmpswp
  opal_common_ucx_wpmem_post
  opal_common_ucx_wpmem_fetch
  opal_common_ucx_wpmem_fetch_nb

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2020-06-29 13:40:50 +03:00
Ralph Castain
ba27fb79b5
Sync ot PMIx/PRRTE master branches
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-27 16:50:09 -07:00
Sergey Oblomov
34f2f6af84 UCX/WPOOL: fixed potential deadlock
- fixed potential deadlock in error processing

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2020-06-27 15:19:01 +03:00
Christoph Niethammer
f0f206b247
Merge pull request #7673 from cniethammer/uct-supported-version-update
Accept UCX 1.8 in configure of btl/uct
2020-06-26 20:53:36 +02:00
Jeff Squyres
f64c30e93c common_ofi: fix preprocessor macro typo
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-06-26 07:23:27 -07:00
Joseph Schuchart
634f67b216
Merge pull request #7843 from devreal/clang-tidy-free
Some fixups for issues detected by clang-tidy
2020-06-25 17:30:04 +02:00
Artem Polyakov
907f4e196a
Merge pull request #6980 from devreal/ucx-acc-singel-intrinsics
UCX osc: add support for acc_single_intrinsic
2020-06-25 07:39:42 -07:00
Austen Lauria
910a030d1c
Merge pull request #7854 from devreal/remove-opal-dataype-get-pack
Remove stale datatype functions from opal header
2020-06-25 08:07:48 -04:00