This commit updates the btl interface to change the parameters
passed to receive callbacks. The interface used to pass the tag,
a btl base descriptor, and the callback context. Most of the
values in the btl base descriptor were unused and only helped
simplify the callbacks from the self btl. All of the arguments
have now been replaced with a single receive callback descriptor.
This descriptor contains the incoming endpoint, data segment(s),
tag, and callback context. All btls have been updated to use
the new callback and the btl interface version has been bumped
to v3.2.0.
As part of this change the descriptor argument (and the segments
contained within it) have been marked as const. The were treated
as const before but this change could allow the compiler to make
better optimization decisions and will enforce that the callback
does not attempt to change the data in the descriptor.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
bugfix: provider selection would not differentiate between ipv4
and ipv6 addresses which would cause some nodes to be unable
to communicate between each other. Adding a check for address
format to provider selection to ensure that all nodes use the
same address format.
Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
- there is new API to detect missing memmory events.
Enabled using of new UCX API to detect missing events
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
This bug was first seen in a different product that's using the same
interception code as OMPI. But I think it's potentially in OMPI too.
In my vanilla build of OMPI master on RH8 if I "gdb libopen-pal.so" and
"disassemble intercept_brk", I'm seeing a suspicious extra instruction
in front of PATCHER_BEGIN:
0x00000000000d6778 <+40>: std r2,24(r1) // something gcc put in front
0x00000000000d677c <+44>: std r2,96(r1) // PATCHER_BEGIN's toc_save
0x00000000000d6780 <+48>: nop // NOPs from PATCHER_BEGIN
0x00000000000d6784 <+52>: nop // that get replaced
0x00000000000d6788 <+56>: nop // by instructions that
0x00000000000d678c <+60>: nop // change r2
0x00000000000d6790 <+64>: nop //
Later there are loads from that location like
0x000000000019e0e4 <+132>: ld r2,24(r1)
that make me nervous since that's the pre-updated value.
I believe this is the same thing Nathan is describing way back in a9bc692d
and his solution was to put a second call around each interception, where
the outer call is just
intercept_brk():
PATCHER_BEGIN
_intercept_brk()
PATCHER_END
and the inner call _intercept_brk() is where the bulk of the code goes.
What I'm seeing is that _intercept_brk() is being inlined and probably
negating Nathan's fix. So I want to add __opal_attribute_noinline__ to
restore the fix.
With this commit in place, the disassembly of intercept_brk becomes tiny
because it's no longer inlining _intercept_brk() and the susipicious
early save of r2 is gone. I made the same fix to all the intercept_*
functions, although intercept_brk was the only one that had a suspicious
save of r2.
As far as empirical failures though, we only have those from the non-OMPI
product that's using the same patcher code. I'm not actually getting OMPI
to fail from the above suspicious data being saved in r1+24.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
These op codes used to be in bits/ipc.h but were removed in glibc in 2015
with a comment saying they should be defined in internal headers:
https://sourceware.org/bugzilla/show_bug.cgi?id=18560
and when glibc uses that syscall it seems to do so from its own definitions:
https://github.com/bminor/glibc/search?q=IPCOP_shmat&unscoped_q=IPCOP_shmat
So I think using #ifndef and defining them if they're not already defined
using the values from glibc is the best option.
At IBM it was the testing on redhat 8 that found this as an issue
(the opcodes being undefined on the system made it select the
left undefined so shmat/shmdt memory events went unintercepted).
Signed-off-by: Mark Allen <markalle@us.ibm.com>
mtl_btl_ofi_rcache_init() initializes patcher which should only take
place things are single threaded. OFI providers may start spawn threads,
so initialize the rcache before creating OFI objects to prevent races.
Authored-by: John L. Byrne <john.l.byrne@hpe.com>
Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
Added the flag OPAL_OFI_PCI_DATA_AVAILABLE to remove accessing the nic
object in
fi_info when the ofi version does not support that structure.
Signed-off-by: Nikola Dancejic dancejic@amazon.com
* `libevent_core.so` contains the core functionality that we depend upon
- `libevent.so` library has been identified as the legacy target.
- `libevent_core.so` exists as far back as Libevent 2.0.5 (oldest supported by OMPI)
* `libevent_pthreads.so` can work with either `-levent` or `-levent_core`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 886f41fe3381a338eac215f26360980c612e6bb8)
* LSF ships a `libevent.so` that is no related to the `libevent.so`
shipped with Libevent.
* Add some checks to the configure logic to detect scenarios where this
conflict can be detected, and provide the user with a descriptive
warning message.
- When detected by `event/external` this is just a warning since
the internal component may be able to be used instead.
- This happens when the user supplies the LSF path via the
`LDFLAGS` envar instead of via `--with-lsf-libdir`.
- When detected by a LSF component and LSF was explicitly requested
then this becomes an error. Otherwise it will just print the warning
and that component will fail to build.
* Note for `master` the `orter_check_lsf.m4` portion of this cherry-pick
was moved to `prrte/config/prrte_check_lsf.m4`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit fc4199e3ba567a672ce1da0dc46efbfd996d71f6)
* This should have been `LDFLAGS` not `LIBS`. Either works, but
`LDFLAGS` is more correct. We should also include `CPPFLAGS`
just in case the header is important to the check.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 22d8fa197b73eff7afc6d5fd11a99ced396c388a)