Commit graph

5821 commits

Author SHA1 Message Date
Ralph Castain
ba27fb79b5
Sync to PMIx/PRRTE master branches
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-27 16:50:09 -07:00
Christoph Niethammer
f0f206b247
Merge pull request #7673 from cniethammer/uct-supported-version-update
Accept UCX 1.8 in configure of btl/uct
2020-06-26 20:53:36 +02:00
Jeff Squyres
f64c30e93c common_ofi: fix preprocessor macro typo
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-06-26 07:23:27 -07:00
Joseph Schuchart
634f67b216
Merge pull request #7843 from devreal/clang-tidy-free
Some fixups for issues detected by clang-tidy
2020-06-25 17:30:04 +02:00
Artem Polyakov
907f4e196a
Merge pull request #6980 from devreal/ucx-acc-singel-intrinsics
UCX osc: add support for acc_single_intrinsic
2020-06-25 07:39:42 -07:00
Austen Lauria
910a030d1c
Merge pull request #7854 from devreal/remove-opal-dataype-get-pack
Remove stale datatype functions from opal header
2020-06-25 08:07:48 -04:00
Joseph Schuchart
e3b417c776 Add missing copyright header
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-23 12:41:52 +02:00
Joseph Schuchart
434c9055ee UCX osc: fall back to get-compare-put for unsupported datatypes
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-23 12:41:52 +02:00
Joseph Schuchart
7d5a6e3e8b UCX osc: safely load/store 64bit integer from variable size pointer
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-23 12:41:52 +02:00
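
One common way to do what this commit title describes is to dispatch on the operand size and copy through `memcpy`, which avoids over-wide or misaligned accesses. The sketch below is only an illustration of that general technique; `demo_load_u64` and `demo_store_u64` are hypothetical names and are not taken from the UCX osc component.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read an unsigned integer of `size` bytes
 * (1, 2, 4, or 8) from `ptr` and widen it to 64 bits. */
static uint64_t demo_load_u64(const void *ptr, size_t size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    switch (size) {
    case 1:  { uint8_t  v; memcpy(&v, ptr, sizeof(v)); return v; }
    case 2:  { uint16_t v; memcpy(&v, ptr, sizeof(v)); return v; }
    case 4:  { uint32_t v; memcpy(&v, ptr, sizeof(v)); return v; }
    default: { uint64_t v; memcpy(&v, ptr, sizeof(v)); return v; }
    }
}

/* Hypothetical helper: narrow `value` to `size` bytes and write it to `ptr`.
 * Only `size` bytes are written, so neighbouring memory is left untouched. */
static void demo_store_u64(void *ptr, size_t size, uint64_t value)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    switch (size) {
    case 1:  { uint8_t  v = (uint8_t)value;  memcpy(ptr, &v, sizeof(v)); break; }
    case 2:  { uint16_t v = (uint16_t)value; memcpy(ptr, &v, sizeof(v)); break; }
    case 4:  { uint32_t v = (uint32_t)value; memcpy(ptr, &v, sizeof(v)); break; }
    default: { memcpy(ptr, &value, sizeof(value)); break; }
    }
}
```

A caller would pass the datatype size alongside the pointer, for example `demo_load_u64(&x, sizeof(x))`.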
Joseph Schuchart
824afac483 UCX common: add non-blocking compare-and-swap
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-23 12:41:52 +02:00
Joseph Schuchart
70776b43fe Remove stale datatype functions from opal header
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-22 15:56:31 +02:00
Christoph Niethammer
bd7f002675 Fix wrongly placed bounds check; mark failure as unlikely
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-06-20 16:09:38 +02:00
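
As a generic illustration of the idea in this commit title, the sketch below places the bounds check before the array access and hints to the compiler that the failure branch is rare. `DEMO_UNLIKELY` and `demo_get` are hypothetical names; the macro assumes a GCC- or Clang-compatible compiler that provides `__builtin_expect`, and the code is not taken from the commit itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a branch-prediction hint macro
 * (assumes GCC/Clang __builtin_expect). */
#define DEMO_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Copy array[index] into *out. Returns 0 on success, -1 if the index
 * is out of range. The check runs before any access to the array. */
static int demo_get(const int *array, size_t count, size_t index, int *out)
{
    assert(array != NULL && out != NULL);
    if (DEMO_UNLIKELY(index >= count)) {
        return -1;              /* rare failure path */
    }
    *out = array[index];        /* only reached with a valid index */
    return 0;
}
```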
Joseph Schuchart
602f833e57 Add missing OBJ_RELEASE to opal_reachable_allocate
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-19 14:30:07 +02:00
Joseph Schuchart
ae3974d249 Add missing free call to mca_btl_tcp_component_exchange
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-19 14:30:07 +02:00
Joseph Schuchart
950e08091c Add missing free to mca_base_alias_register
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-19 14:30:07 +02:00
Austen Lauria
d03a99c647
Merge pull request #7776 from simonbyrne/patch-1
Fix language in CUDA error
2020-06-18 09:49:50 -04:00
Geoff Paulsen
692f96e87a
Merge pull request #7799 from markalle/interception_early_toc_read
noinline to avoid compiler reading TOC before PATCHER_BEGIN
2020-06-17 14:26:24 -05:00
Sergey Oblomov
d6bff6ffbd COMMON/UCX: improved missing events test
- there is a new API to detect missing memory events.
  Enabled use of the new UCX API to detect missing events

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2020-06-16 12:36:44 +03:00
Jeff Squyres
17acb775e9 Rename the use of "whitelist"
Use the term "allowlist" instead of "whitelist" in the script that
looks for common symbols.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-06-15 17:06:25 -04:00
Mark Allen
ddd1f578ec noinline to avoid compiler reading TOC before PATCHER_BEGIN
This bug was first seen in a different product that's using the same
interception code as OMPI.  But I think it's potentially in OMPI too.

In my vanilla build of OMPI master on RH8 if I "gdb libopen-pal.so" and
"disassemble intercept_brk", I'm seeing a suspicious extra instruction
in front of PATCHER_BEGIN:
   0x00000000000d6778 <+40>:    std     r2,24(r1) // something gcc put in front
   0x00000000000d677c <+44>:    std     r2,96(r1) // PATCHER_BEGIN's toc_save
   0x00000000000d6780 <+48>:    nop               // NOPs from PATCHER_BEGIN
   0x00000000000d6784 <+52>:    nop               // that get replaced
   0x00000000000d6788 <+56>:    nop               // by instructions that
   0x00000000000d678c <+60>:    nop               // change r2
   0x00000000000d6790 <+64>:    nop               //

Later there are loads from that location like
   0x000000000019e0e4 <+132>:   ld      r2,24(r1)
that make me nervous since that's the pre-updated value.

I believe this is the same thing Nathan is describing way back in a9bc692d
and his solution was to put a second call around each interception, where
the outer call is just
    intercept_brk():
        PATCHER_BEGIN
        _intercept_brk()
        PATCHER_END
and the inner call _intercept_brk() is where the bulk of the code goes.

What I'm seeing is that _intercept_brk() is being inlined and probably
negating Nathan's fix.  So I want to add __opal_attribute_noinline__ to
restore the fix.

With this commit in place, the disassembly of intercept_brk becomes tiny
because it's no longer inlining _intercept_brk() and the suspicious
early save of r2 is gone.  I made the same fix to all the intercept_*
functions, although intercept_brk was the only one that had a suspicious
save of r2.

As far as empirical failures though, we only have those from the non-OMPI
product that's using the same patcher code.  I'm not actually getting OMPI
to fail from the above suspicious data being saved in r1+24.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2020-06-09 19:25:59 -04:00
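
The two-level call structure described in the message above can be sketched as follows. This is a simplified illustration, not the OMPI patcher code: `DEMO_PATCHER_BEGIN` and `DEMO_PATCHER_END` are hypothetical stand-ins that only track nesting, and `__attribute__((noinline))` is assumed here as the GCC/Clang spelling of what the commit refers to as `__opal_attribute_noinline__`.

```c
#include <assert.h>

/* Hypothetical stand-ins for PATCHER_BEGIN / PATCHER_END: here they
 * only count whether we are inside the protected region. */
static int demo_depth = 0;
#define DEMO_PATCHER_BEGIN (demo_depth++)
#define DEMO_PATCHER_END   (demo_depth--)

/* Inner function holding the bulk of the work. noinline keeps it a
 * real call, so its code cannot be merged into the wrapper and
 * scheduled ahead of DEMO_PATCHER_BEGIN. */
__attribute__((noinline))
static int demo_inner_intercept(int addr)
{
    assert(demo_depth > 0);     /* must run between BEGIN and END */
    return addr + 1;            /* placeholder for the real work */
}

/* Outer wrapper: nothing but BEGIN, the inner call, and END. */
static int demo_intercept(int addr)
{
    int result;
    DEMO_PATCHER_BEGIN;
    result = demo_inner_intercept(addr);
    DEMO_PATCHER_END;
    return result;
}
```

Keeping the wrapper this small leaves the compiler little to hoist in front of the begin marker, which is the effect the commit message reports seeing in the disassembly.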
Ralph Castain
9bdf1274c0
Sync to PMIx and PRRTE master branches
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-09 15:32:22 -07:00
Jeff Squyres
9b55419b40
Merge pull request #7777 from markalle/IPCOP_shmat
adding op-codes for syscall ipc for shmat/shmdt
2020-06-08 15:09:17 -04:00
Ralph Castain
a879a16df5
Merge pull request #7794 from rhc54/topic/sy
Sync to PMIx and PRRTE master branches
2020-06-08 12:05:33 -07:00
Howard Pritchard
46d834d674
Merge pull request #7781 from hkuno/john.l.byrne/mca_btl_ofi_rcache_init
mtl_btl_ofi_rcache_init() before creating domain
2020-06-08 13:01:45 -06:00
Ralph Castain
ad8a567212
Sync to PMIx and PRRTE master branches
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-08 10:50:10 -07:00
Mark Allen
e8fab058da adding op-codes for syscall ipc for shmat/shmdt
These op codes used to be in bits/ipc.h but were removed in glibc in 2015
with a comment saying they should be defined in internal headers:
https://sourceware.org/bugzilla/show_bug.cgi?id=18560
and when glibc uses that syscall it seems to do so from its own definitions:
https://github.com/bminor/glibc/search?q=IPCOP_shmat&unscoped_q=IPCOP_shmat

So I think using #ifndef and defining them if they're not already defined
using the values from glibc is the best option.

At IBM it was the testing on redhat 8 that found this as an issue
(with the opcodes undefined on the system, they were
left undefined, so shmat/shmdt memory events went unintercepted).

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2020-06-04 14:20:40 -04:00
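
The `#ifndef` approach described above might look like the sketch below. The values 21 and 22 are the ones glibc uses in its internal headers; `demo_shm_event_name` is a hypothetical helper added only to show the constants in use, and is not part of the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Define the op codes only if the system headers did not already
 * provide them. Values match glibc's internal definitions. */
#ifndef IPCOP_shmat
#define IPCOP_shmat 21
#endif
#ifndef IPCOP_shmdt
#define IPCOP_shmdt 22
#endif

/* Hypothetical helper: name the shared-memory event for an ipc op code,
 * or return NULL if the op code is not an attach or detach. */
static const char *demo_shm_event_name(int op)
{
    assert(op >= 0);            /* op codes are small non-negative integers */
    if (op == IPCOP_shmat) {
        return "shmat";
    }
    if (op == IPCOP_shmdt) {
        return "shmdt";
    }
    return NULL;
}
```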
Harumi Kuno
f1b21cb776 mtl_btl_ofi_rcache_init() before creating domain
mtl_btl_ofi_rcache_init() initializes the patcher, which should only take
place while things are single threaded.  OFI providers may start spawning
threads, so initialize the rcache before creating OFI objects to prevent races.

Authored-by: John L. Byrne <john.l.byrne@hpe.com>
Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
2020-06-03 09:56:29 -06:00
Simon Byrne
27a2ed8cba Fix language in CUDA error
Removes a malapropism (passed should be past), and hopefully makes it a bit clearer.

Signed-off-by: Simon Byrne <simonbyrne@gmail.com>
2020-06-02 13:25:31 -07:00
Brian Barrett
0a21a58f08
Merge pull request #7771 from dancejic/multi
common/ofi: Fixing compilation issue with ofi versions that do not support fi_info.nic
2020-06-01 18:42:07 -07:00
Nikola Dancejic
ae2a447b0e common/ofi: Fixing compilation issue with ofi versions that do not support fi_info.nic
Added the flag OPAL_OFI_PCI_DATA_AVAILABLE to remove accessing the nic
object in fi_info when the ofi version does not support that structure.

Signed-off-by: Nikola Dancejic dancejic@amazon.com
2020-06-01 23:14:41 +00:00
Howard Pritchard
c074a23e8f
Merge pull request #7675 from hppritcha/topic/fix_issue_7578
rework argobots configury to be smarter
2020-06-01 14:02:32 -06:00
Gilles Gouaillardet
c450b21405 opal/util: fix opal_str_to_bool()
correctly use strlen(char *) instead of sizeof(char *)

Thanks Georg Geiser for reporting this issue.

Refs. open-mpi/ompi#7772

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-05-30 20:47:41 +09:00
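
The difference between the two operators is easy to see in a small example: `sizeof` applied to a `char *` gives the size of the pointer, while `strlen` gives the number of characters in the string. The sketch below uses a hypothetical helper, `demo_trimmed_length`, and is not the actual body of `opal_str_to_bool()`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of `str` with trailing whitespace ignored. */
static size_t demo_trimmed_length(const char *str)
{
    assert(str != NULL);
    /* strlen(str) counts the characters actually stored in the string.
     * sizeof(str) would instead be the size of the pointer itself,
     * a fixed value unrelated to the string contents. */
    size_t len = strlen(str);
    while (len > 0 && isspace((unsigned char)str[len - 1])) {
        len--;
    }
    return len;
}
```

With `sizeof(str)` in place of `strlen(str)`, the loop would start at a fixed offset and could read past the end of a short string or ignore the tail of a long one.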
Ralph Castain
b27db0e2a3
Sync to PMIx and PRRTE masters
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-26 20:11:14 -07:00
Howard Pritchard
b9498ec31b rework argobots configury to be smarter
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-05-23 14:46:41 -07:00
Howard Pritchard
45b643d0cf OFI common: set include list explicitly to NULL
related to #7755

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2020-05-23 14:05:29 -06:00
Austen Lauria
b419edead4
Merge pull request #7732 from karasevb/fix_sys_limits
sys limits: fixed soft limit setting if it is less than hard limit
2020-05-20 16:25:34 -04:00
Joshua Hursey
05e095a1ee A slightly stronger check for LSF's libevent
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-05-18 15:08:10 -04:00
Ralph Castain
54f8b6d23c
Pickup the OMPI system-default parameters
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-16 12:43:04 -07:00
Ralph Castain
337fcb0047
Sync to PMIx and PRRTE masters
Roll in new mapping/binding methods and report outputs. Fix a few bugs

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-16 07:39:31 -07:00
Josh Hursey
9c0a2bb2d6
Merge pull request #7734 from jjhursey/fix-lsf-libevent
Move to `libevent_core` and add checks for libevent.so conflict with LSF
2020-05-15 14:36:22 -05:00
Boris Karasev
fb9eca55cf sys limits: fixed soft limit setting if it is less than hard limit
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2020-05-14 10:54:16 +07:00
Austen Lauria
9996b9f54d
Merge pull request #7720 from abouteiller/bugfix/tcp-failed-lock
Race condition when closing TCP endpoint with error
2020-05-13 16:52:21 -04:00
Joshua Hursey
33afdb6649 Move from legacy -levent to recommended -levent_core
* `libevent_core.so` contains the core functionality that we depend upon
   - `libevent.so` library has been identified as the legacy target.
   - `libevent_core.so` exists as far back as Libevent 2.0.5 (oldest supported by OMPI)
 * `libevent_pthreads.so` can work with either `-levent` or `-levent_core`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 886f41fe33)
2020-05-13 10:48:24 -04:00
Joshua Hursey
959353b421 Add checks for libevent.so conflict with LSF
* LSF ships a `libevent.so` that is not related to the `libevent.so`
   shipped with Libevent.
 * Add some checks to the configure logic to detect scenarios where this
   conflict can be detected, and provide the user with a descriptive
   warning message.
   - When detected by `event/external` this is just a warning since
     the internal component may be able to be used instead.
     - This happens when the user supplies the LSF path via the
       `LDFLAGS` envar instead of via `--with-lsf-libdir`.
   - When detected by a LSF component and LSF was explicitly requested
     then this becomes an error. Otherwise it will just print the warning
     and that component will fail to build.
 * Note for `master` the `orter_check_lsf.m4` portion of this cherry-pick
   was moved to `prrte/config/prrte_check_lsf.m4`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit fc4199e3ba)
2020-05-13 10:47:02 -04:00
Joshua Hursey
a73a89f6cf event/external: Fix typo in LDFLAGS vs LIBS var before check
* This should have been `LDFLAGS` not `LIBS`. Either works, but
   `LDFLAGS` is more correct. We should also include `CPPFLAGS`
   just in case the header is important to the check.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 22d8fa197b)
2020-05-13 10:45:26 -04:00
Howard Pritchard
f744668f5f
Merge pull request #7646 from hppritcha/topic/ofi_common_wl
add a common ofi whitelist/blacklist
2020-05-13 06:44:05 -06:00
Howard Pritchard
3078485eee
Merge pull request #7712 from shintaro-iwasaki/fix7697
opal/mca/threads/argobots: fix compilation error
2020-05-11 09:02:22 -06:00
Aurelien Bouteiller
0e93d0f647
Bugfix: when a TCP socket is closed in error, it could update the
endpoint state without holding the endpoint lock, resulting in a race
condition.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-05-11 01:11:05 -04:00
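
The general shape of such a fix is to take the endpoint lock around every state update, including the error path. The sketch below shows that pattern with POSIX mutexes; the `demo_endpoint_t` type and its functions are hypothetical and do not mirror the actual TCP BTL structures.

```c
#include <assert.h>
#include <pthread.h>

typedef enum { DEMO_CONNECTED, DEMO_FAILED } demo_state_t;

/* Hypothetical endpoint: a state field protected by a mutex. */
typedef struct {
    pthread_mutex_t lock;
    demo_state_t    state;
} demo_endpoint_t;

static void demo_endpoint_init(demo_endpoint_t *ep)
{
    assert(ep != NULL);
    pthread_mutex_init(&ep->lock, NULL);
    ep->state = DEMO_CONNECTED;
}

/* Error path: the state change happens only while the lock is held,
 * so it cannot interleave with another thread using the endpoint. */
static void demo_endpoint_mark_failed(demo_endpoint_t *ep)
{
    assert(ep != NULL);
    pthread_mutex_lock(&ep->lock);
    ep->state = DEMO_FAILED;
    pthread_mutex_unlock(&ep->lock);
}

/* Readers take the same lock to observe a consistent state. */
static demo_state_t demo_endpoint_state(demo_endpoint_t *ep)
{
    demo_state_t state;
    assert(ep != NULL);
    pthread_mutex_lock(&ep->lock);
    state = ep->state;
    pthread_mutex_unlock(&ep->lock);
    return state;
}
```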
Howard Pritchard
9f1081a07a add a common ofi whitelist/blacklist
also add common verbose variable.

Note the verbosity thing is a little tricky owing to the way the MCA frameworks and components are registered
and initialized.  The BTLs are registered/initialized prior to the MTL components even getting registered.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-05-09 14:50:31 -06:00
Joseph Schuchart
fa1b12ac33 Fix potential out-of-bounds write in opal_progress_unregister
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-05-08 21:11:51 +02:00