5807 Commits

Author SHA1 Message Date
Joseph Schuchart
7d5a6e3e8b UCX osc: safely load/store 64bit integer from variable size pointer
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-23 12:41:52 +02:00
Joseph Schuchart
824afac483 UCX common: add non-blocking compare-and-swap
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-23 12:41:52 +02:00
Austen Lauria
d03a99c647
Merge pull request #7776 from simonbyrne/patch-1
Fix language in CUDA error
2020-06-18 09:49:50 -04:00
Geoff Paulsen
692f96e87a
Merge pull request #7799 from markalle/interception_early_toc_read
noinline to avoid compiler reading TOC before PATCHER_BEGIN
2020-06-17 14:26:24 -05:00
Sergey Oblomov
d6bff6ffbd COMMON/UCX: improved missing events test
- There is a new UCX API to detect missing memory events.
  Enabled use of this API to detect missing events.

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2020-06-16 12:36:44 +03:00
Jeff Squyres
17acb775e9 Rename the use of "whitelist"
Use the term "allowlist" instead of "whitelist" in the script that
looks for common symbols.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-06-15 17:06:25 -04:00
Mark Allen
ddd1f578ec noinline to avoid compiler reading TOC before PATCHER_BEGIN
This bug was first seen in a different product that's using the same
interception code as OMPI.  But I think it's potentially in OMPI too.

In my vanilla build of OMPI master on RH8 if I "gdb libopen-pal.so" and
"disassemble intercept_brk", I'm seeing a suspicious extra instruction
in front of PATCHER_BEGIN:
   0x00000000000d6778 <+40>:    std     r2,24(r1) // something gcc put in front
   0x00000000000d677c <+44>:    std     r2,96(r1) // PATCHER_BEGIN's toc_save
   0x00000000000d6780 <+48>:    nop               // NOPs from PATCHER_BEGIN
   0x00000000000d6784 <+52>:    nop               // that get replaced
   0x00000000000d6788 <+56>:    nop               // by instructions that
   0x00000000000d678c <+60>:    nop               // change r2
   0x00000000000d6790 <+64>:    nop               //

Later there are loads from that location like
   0x000000000019e0e4 <+132>:   ld      r2,24(r1)
that make me nervous since that's the pre-updated value.

I believe this is the same thing Nathan is describing way back in a9bc692d
and his solution was to put a second call around each interception, where
the outer call is just
    intercept_brk():
        PATCHER_BEGIN
        _intercept_brk()
        PATCHER_END
and the inner call _intercept_brk() is where the bulk of the code goes.

What I'm seeing is that _intercept_brk() is being inlined and probably
negating Nathan's fix.  So I want to add __opal_attribute_noinline__ to
restore the fix.

With this commit in place, the disassembly of intercept_brk becomes tiny
because it's no longer inlining _intercept_brk() and the suspicious
early save of r2 is gone.  I made the same fix to all the intercept_*
functions, although intercept_brk was the only one that had a suspicious
save of r2.

As far as empirical failures though, we only have those from the non-OMPI
product that's using the same patcher code.  I'm not actually getting OMPI
to fail from the above suspicious data being saved in r1+24.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2020-06-09 19:25:59 -04:00
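The two-level wrapper described in ddd1f578ec can be sketched roughly as below. This is a minimal illustration, not the OMPI patcher code: `SKETCH_PATCHER_BEGIN`, `SKETCH_PATCHER_END`, `sketch_intercept`, and `sketch_intercept_inner` are hypothetical stand-ins, and the attribute is written in plain GCC/Clang syntax rather than as `__opal_attribute_noinline__`.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the real PATCHER_BEGIN / PATCHER_END macros.
 * Here they only count entries and exits so the sketch compiles anywhere. */
static int patcher_depth = 0;
#define SKETCH_PATCHER_BEGIN (patcher_depth++)
#define SKETCH_PATCHER_END   (patcher_depth--)

/* Inner function: holds the bulk of the work. The noinline attribute keeps
 * the compiler from merging this body into the wrapper, where it could
 * schedule register saves ahead of the BEGIN macro. */
__attribute__((noinline))
static intptr_t sketch_intercept_inner(intptr_t arg)
{
    return arg + 1; /* placeholder for the real interception logic */
}

/* Outer wrapper: nothing but BEGIN, the call, and END. */
intptr_t sketch_intercept(intptr_t arg)
{
    intptr_t result;
    SKETCH_PATCHER_BEGIN;
    result = sketch_intercept_inner(arg);
    SKETCH_PATCHER_END;
    return result;
}
```

Keeping the wrapper this small is the point of the pattern: with the inner call out of line, the compiler has little reason to touch r2 before the BEGIN macro runs.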
Ralph Castain
9bdf1274c0
Sync to PMIx and PRRTE master branches
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-09 15:32:22 -07:00
Jeff Squyres
9b55419b40
Merge pull request #7777 from markalle/IPCOP_shmat
adding op-codes for syscall ipc for shmat/shmdt
2020-06-08 15:09:17 -04:00
Ralph Castain
a879a16df5
Merge pull request #7794 from rhc54/topic/sy
Sync to PMIx and PRRTE master branches
2020-06-08 12:05:33 -07:00
Howard Pritchard
46d834d674
Merge pull request #7781 from hkuno/john.l.byrne/mca_btl_ofi_rcache_init
mtl_btl_ofi_rcache_init() before creating domain
2020-06-08 13:01:45 -06:00
Ralph Castain
ad8a567212
Sync to PMIx and PRRTE master branches
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-08 10:50:10 -07:00
Mark Allen
e8fab058da adding op-codes for syscall ipc for shmat/shmdt
These op codes used to be in bits/ipc.h but were removed in glibc in 2015
with a comment saying they should be defined in internal headers:
https://sourceware.org/bugzilla/show_bug.cgi?id=18560
and when glibc uses that syscall it seems to do so from its own definitions:
https://github.com/bminor/glibc/search?q=IPCOP_shmat&unscoped_q=IPCOP_shmat

So I think using #ifndef and defining them if they're not already defined
using the values from glibc is the best option.

At IBM it was testing on Red Hat 8 that found this issue
(because the opcodes were undefined on the system, they were
left undefined, so shmat/shmdt memory events went unintercepted).

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2020-06-04 14:20:40 -04:00
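The `#ifndef` approach described in e8fab058da might look like the sketch below. The values 21 and 22 are the SHMAT/SHMDT call numbers of the Linux `ipc` multiplexer; the helper `sketch_is_shm_attach_or_detach` is hypothetical and only shows how the op-codes could be consumed.

```c
/* Fallback definitions for the ipc(2) multiplexer op-codes, used only
 * when the system headers no longer provide them. */
#ifndef IPCOP_shmat
#define IPCOP_shmat 21
#endif
#ifndef IPCOP_shmdt
#define IPCOP_shmdt 22
#endif

/* Hypothetical helper: reports whether an ipc op-code is one of the
 * shared-memory attach/detach calls that a memory hook cares about. */
int sketch_is_shm_attach_or_detach(int ipc_call)
{
    return ipc_call == IPCOP_shmat || ipc_call == IPCOP_shmdt;
}
```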
Harumi Kuno
f1b21cb776 mtl_btl_ofi_rcache_init() before creating domain
mtl_btl_ofi_rcache_init() initializes patcher, which should only take
place while things are single threaded.  OFI providers may start spawning threads,
so initialize the rcache before creating OFI objects to prevent races.

Authored-by: John L. Byrne <john.l.byrne@hpe.com>
Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
2020-06-03 09:56:29 -06:00
Simon Byrne
27a2ed8cba Fix language in CUDA error
Removes a malapropism (passed should be past), and hopefully makes it a bit clearer.

Signed-off-by: Simon Byrne <simonbyrne@gmail.com>
2020-06-02 13:25:31 -07:00
Brian Barrett
0a21a58f08
Merge pull request #7771 from dancejic/multi
common/ofi: Fixing compilation issue with ofi versions that do not support fi_info.nic
2020-06-01 18:42:07 -07:00
Nikola Dancejic
ae2a447b0e common/ofi: Fixing compilation issue with ofi versions that do not support fi_info.nic
Added the flag OPAL_OFI_PCI_DATA_AVAILABLE to avoid accessing the nic
object in fi_info when the ofi version does not support that structure.

Signed-off-by: Nikola Dancejic dancejic@amazon.com
2020-06-01 23:14:41 +00:00
Howard Pritchard
c074a23e8f
Merge pull request #7675 from hppritcha/topic/fix_issue_7578
rework argobots configury to be smarter
2020-06-01 14:02:32 -06:00
Gilles Gouaillardet
c450b21405 opal/util: fix opal_str_to_bool()
correctly use strlen(char *) instead of sizeof(char *)

Thanks Georg Geiser for reporting this issue.

Refs. open-mpi/ompi#7772

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-05-30 20:47:41 +09:00
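The class of bug fixed in c450b21405 can be illustrated with a small sketch. The function below is hypothetical and is not the actual opal_str_to_bool(); it only shows why the length of a string passed as `char *` has to come from strlen().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true when the string ends with `c`. */
bool sketch_ends_with(const char *str, char c)
{
    /* strlen() counts the characters in the string. sizeof(str) would
     * yield the size of the pointer itself (typically 8 on 64-bit
     * systems), regardless of how long the string is. */
    size_t len = strlen(str);
    return len > 0 && str[len - 1] == c;
}
```

With sizeof() in place of strlen(), the index would be correct only by coincidence, when the string happens to be as long as a pointer is wide.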
Ralph Castain
b27db0e2a3
Sync to PMIx and PRRTE masters
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-26 20:11:14 -07:00
Howard Pritchard
b9498ec31b rework argobots configury to be smarter
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-05-23 14:46:41 -07:00
Howard Pritchard
45b643d0cf OFI common: set include list explicitly to NULL
related to #7755

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2020-05-23 14:05:29 -06:00
Austen Lauria
b419edead4
Merge pull request #7732 from karasevb/fix_sys_limits
sys limits: fixed soft limit setting if it is less than hard limit
2020-05-20 16:25:34 -04:00
Joshua Hursey
05e095a1ee A slightly stronger check for LSF's libevent
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-05-18 15:08:10 -04:00
Ralph Castain
54f8b6d23c
Pick up the OMPI system-default parameters
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-16 12:43:04 -07:00
Ralph Castain
337fcb0047
Sync to PMIx and PRRTE masters
Roll in new mapping/binding methods and report outputs. Fix a few bugs

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-16 07:39:31 -07:00
Josh Hursey
9c0a2bb2d6
Merge pull request #7734 from jjhursey/fix-lsf-libevent
Move to `libevent_core` and add checks for libevent.so conflict with LSF
2020-05-15 14:36:22 -05:00
Boris Karasev
fb9eca55cf sys limits: fixed soft limit setting if it is less than hard limit
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2020-05-14 10:54:16 +07:00
Austen Lauria
9996b9f54d
Merge pull request #7720 from abouteiller/bugfix/tcp-failed-lock
Race condition when closing TCP endpoint with error
2020-05-13 16:52:21 -04:00
Joshua Hursey
33afdb6649 Move from legacy -levent to recommended -levent_core
* `libevent_core.so` contains the core functionality that we depend upon
   - `libevent.so` library has been identified as the legacy target.
   - `libevent_core.so` exists as far back as Libevent 2.0.5 (oldest supported by OMPI)
 * `libevent_pthreads.so` can work with either `-levent` or `-levent_core`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 886f41fe3381a338eac215f26360980c612e6bb8)
2020-05-13 10:48:24 -04:00
Joshua Hursey
959353b421 Add checks for libevent.so conflict with LSF
* LSF ships a `libevent.so` that is not related to the `libevent.so`
   shipped with Libevent.
 * Add some checks to the configure logic to detect scenarios where this
   conflict can be detected, and provide the user with a descriptive
   warning message.
   - When detected by `event/external` this is just a warning since
     the internal component may be able to be used instead.
     - This happens when the user supplies the LSF path via the
       `LDFLAGS` envar instead of via `--with-lsf-libdir`.
   - When detected by a LSF component and LSF was explicitly requested
     then this becomes an error. Otherwise it will just print the warning
     and that component will fail to build.
 * Note for `master` the `orte_check_lsf.m4` portion of this cherry-pick
   was moved to `prrte/config/prrte_check_lsf.m4`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit fc4199e3ba567a672ce1da0dc46efbfd996d71f6)
2020-05-13 10:47:02 -04:00
Joshua Hursey
a73a89f6cf event/external: Fix typo in LDFLAGS vs LIBS var before check
* This should have been `LDFLAGS` not `LIBS`. Either works, but
   `LDFLAGS` is more correct. We should also include `CPPFLAGS`
   just in case the header is important to the check.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 22d8fa197b73eff7afc6d5fd11a99ced396c388a)
2020-05-13 10:45:26 -04:00
Howard Pritchard
f744668f5f
Merge pull request #7646 from hppritcha/topic/ofi_common_wl
add a common ofi whitelist/blacklist
2020-05-13 06:44:05 -06:00
Howard Pritchard
3078485eee
Merge pull request #7712 from shintaro-iwasaki/fix7697
opal/mca/threads/argobots: fix compilation error
2020-05-11 09:02:22 -06:00
Aurelien Bouteiller
0e93d0f647
Bugfix: when a TCP socket is closed in error, it could update the
endpoint state without holding the endpoint lock, resulting in a race
condition.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-05-11 01:11:05 -04:00
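The fix described in 0e93d0f647 amounts to taking the endpoint lock before changing the endpoint state on the error path. A minimal sketch of that idea follows, using POSIX mutexes; the struct, enum, and function names are hypothetical and do not correspond to the TCP BTL's actual types.

```c
#include <pthread.h>

/* Hypothetical endpoint states. */
enum sketch_state { SKETCH_CONNECTED, SKETCH_FAILED };

/* Hypothetical endpoint: a state field guarded by a lock. */
struct sketch_endpoint {
    pthread_mutex_t lock;
    enum sketch_state state;
};

/* Error path: update the state only while holding the endpoint lock,
 * so concurrent readers and writers never see a half-made transition. */
void sketch_endpoint_mark_failed(struct sketch_endpoint *ep)
{
    pthread_mutex_lock(&ep->lock);
    ep->state = SKETCH_FAILED;
    pthread_mutex_unlock(&ep->lock);
}
```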
Howard Pritchard
9f1081a07a add a common ofi whitelist/blacklist
also add common verbose variable.

Note the verbosity thing is a little tricky owing to the way the MCA frameworks and components are registered
and initialized.  The BTLs are registered/initialized prior to the MTL components even getting registered.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-05-09 14:50:31 -06:00
Joseph Schuchart
fa1b12ac33 Fix potential out-of-bounds write in opal_progress_unregister
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-05-08 21:11:51 +02:00
Brian Barrett
0dc2325297
Merge pull request #7641 from dancejic/multi-NIC
Added multi-NIC support to provider selection
2020-05-07 15:24:41 -07:00
Shintaro Iwasaki
0fc2033c75 opal/mca/threads/argobots: fix compilation error
Fixes #7697

Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-05-07 16:07:12 +00:00
Ralph Castain
120cd31aaa
Update PMIx to fix some bugs
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-06 15:53:11 -07:00
Ralph Castain
ebd164b4c1
Update PMIx and PRRTE
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-06 12:40:11 -07:00
Nathan Hjelm
9d8f634044 btl/vader: rename vader -> sm
Now that the old sm btl has been gone for some time there was a request
to rename vader to sm. This commit does just that (reluctantly).

An alias has been generated so specifying vader in the btl selection
variable or specifying vader parameters will continue to work.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-05-05 06:43:19 -07:00
Nathan Hjelm
9fae5bfdf3 mca/base: add support for component aliasing
This commit adds support for aliasing component names. A component
name alias is created by calling: mca_base_alias_register. The name
of the project and framework are optional. The component name and
component alias are required. Once an alias is registered all
variables registered after the alias creation will have synonyms
also registered. For example:

```c
mca_base_alias_register("opal", "btl", "vader", "sm", false);
```

would cause all of the variables registered by btl/vader to have
aliases that start with btl_sm. Ex: btl_vader_single_copy_mechanism
would have the synonym: btl_sm_single_copy_mechanism.

If aliases are registered before component filtering the alias
can also be used for component selection. For example, if sm is
registered as an alias to vader in the btl framework register
function then ```--mca btl self,sm``` would be equivalent to
```--mca btl self,vader```.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-05-05 06:43:19 -07:00
Nathan Hjelm
3a036f8486 opal/class: add additional object helper functions
This commit adds two additional helpers to opal/class:

 - OPAL_HASH_TABLE_FOREACH_PTR: Same as OPAL_HASH_TABLE_FOREACH but
   operating on ptr hash tables. This is needed because the _ptr
   iterator functions take an additional argument.

 - OPAL_LIST_FOREACH_DECL: Same as OPAL_LIST_FOREACH but declares
   the variable specified in the first argument.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-05-05 06:43:19 -07:00
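The difference between a plain FOREACH macro and a `_DECL` variant, as described in 3a036f8486, can be sketched on a toy singly linked list. Everything here is hypothetical (`sketch_node`, `SKETCH_LIST_FOREACH`, `SKETCH_LIST_FOREACH_DECL`); the real OPAL macros operate on opal_list_t and their argument lists may differ.

```c
#include <stddef.h>

/* Hypothetical list node used only for this illustration. */
struct sketch_node {
    int value;
    struct sketch_node *next;
};

/* Plain variant: the caller must declare `item` beforehand. */
#define SKETCH_LIST_FOREACH(item, head) \
    for ((item) = (head); NULL != (item); (item) = (item)->next)

/* _DECL variant: the macro declares the loop variable itself (C99),
 * so its scope is limited to the loop body. */
#define SKETCH_LIST_FOREACH_DECL(item, head) \
    for (struct sketch_node *item = (head); NULL != item; item = item->next)

/* Sums all values in the list using the declaring variant. */
int sketch_sum(struct sketch_node *head)
{
    int sum = 0;
    SKETCH_LIST_FOREACH_DECL(node, head) {
        sum += node->value;
    }
    return sum;
}
```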
Ralph Castain
f608575eec
Remove references to numa_rank
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-01 13:32:29 -07:00
Ralph Castain
10c93a10e2
Ensure proper handling of default MCA param files
Update PMIx/PRRTE to ensure we pick up the default system and user MCA
param definitions during PMIx_server_setup_application so they get
propagated. Protect OPAL's MCA var processing so it doesn't try to
process a NULL filename when PMIx provides the params for it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-01 12:02:10 -07:00
Nikola Dancejic
167d75b42a common/ofi: Added multi-NIC support to provider selection
Adds the capability to select a NIC based on hardware locality.
Creates a list of NICs that share the same cpuset as the process,
then selects the NIC based on the (local rank) % (number of NICs).
If no NICs are available that share the same cpuset, the selection process
will create a list of all available NICs and make a selection based on
(local rank) % (number of NICs)

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
2020-05-01 01:05:13 +00:00
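The selection rule described in 167d75b42a can be sketched as follows. This is an illustration under simplifying assumptions, not the common/ofi code: `sketch_nic` is a hypothetical descriptor with a precomputed `shares_cpuset` flag, whereas the real implementation derives locality from hardware topology data.

```c
#include <stddef.h>

/* Hypothetical NIC descriptor: only records whether the NIC shares
 * a cpuset with the calling process. */
struct sketch_nic {
    const char *name;
    int shares_cpuset;
};

/* Returns the index of the selected NIC in nics[], or -1 if there are none.
 * NICs sharing the process's cpuset are preferred; if there are none,
 * all NICs are candidates. The choice is (local rank) % (number of NICs). */
int sketch_select_nic(const struct sketch_nic *nics, size_t num_nics,
                      unsigned local_rank)
{
    size_t num_local = 0;
    size_t i;

    if (0 == num_nics) {
        return -1;
    }
    for (i = 0; i < num_nics; i++) {
        if (nics[i].shares_cpuset) {
            num_local++;
        }
    }
    if (0 == num_local) {
        /* Fallback: choose among all available NICs. */
        return (int)(local_rank % num_nics);
    }

    /* Pick the (local_rank % num_local)-th NIC that shares the cpuset. */
    size_t target = local_rank % num_local;
    for (i = 0; i < num_nics; i++) {
        if (nics[i].shares_cpuset) {
            if (0 == target) {
                return (int)i;
            }
            target--;
        }
    }
    return -1; /* not reached */
}
```

The modulo spreads processes on the same node across the candidate NICs in round-robin fashion by local rank.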
Ralph Castain
bd29ab0ae9
Update dpm to handle deprecation of MPI_Info keys
Deprecate the current OMPI-specific MPI_Info key definitions for
MPI_Comm_spawn and replace them with their PMIx equivalents. Issue a
deprecation/conversion warning as this is done. Also issue deprecation
warnings for options such as "ompi_non_mpi" that are no longer used.

Handle both cases where the user might pass either the PMIx attribute
name itself (e.g., "PMIX_MAPBY") or the string value of the attribute
(e.g., PMIX_MAPBY, which translates to "pmix.mapby"). This can only be
done for PMIx v4 and above, so protect that code.

Silence a couple of Coverity warnings and add a test along the way.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-29 14:56:38 -07:00
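Accepting either the attribute name or its string value, as bd29ab0ae9 describes, could be sketched with a small lookup table. The table and `sketch_normalize_key` are hypothetical; only the "PMIX_MAPBY" / "pmix.mapby" pair is taken from the commit message, and the real dpm code may resolve keys differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry pairing an attribute name with its value. */
struct sketch_attr {
    const char *name;   /* e.g. the literal text "PMIX_MAPBY" */
    const char *value;  /* e.g. "pmix.mapby" */
};

static const struct sketch_attr sketch_attrs[] = {
    { "PMIX_MAPBY", "pmix.mapby" },
};

/* Returns the canonical string value for a key given either as the
 * attribute name or as the value itself; NULL if the key is unknown. */
const char *sketch_normalize_key(const char *key)
{
    size_t n = sizeof(sketch_attrs) / sizeof(sketch_attrs[0]);
    for (size_t i = 0; i < n; i++) {
        if (0 == strcmp(key, sketch_attrs[i].name) ||
            0 == strcmp(key, sketch_attrs[i].value)) {
            return sketch_attrs[i].value;
        }
    }
    return NULL;
}
```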
Ralph Castain
6146d52772
Sync PMIx
Remove pmix_config.h from the tarball. Deal with the case of no local
procs when register_nspace is called.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-29 09:00:04 -07:00
Ralph Castain
fd098d0eba
Sync PMIx and PRRTE
Remove prrte_config.h from tarball plus misc bug fixes

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-28 07:46:21 -07:00