- there is new API to detect missing memmory events.
Enabled using of new UCX API to detect missing events
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
This bug was first seen in a different product that's using the same
interception code as OMPI. But I think it's potentially in OMPI too.
In my vanilla build of OMPI master on RH8 if I "gdb libopen-pal.so" and
"disassemble intercept_brk", I'm seeing a suspicious extra instruction
in front of PATCHER_BEGIN:
0x00000000000d6778 <+40>: std r2,24(r1) // something gcc put in front
0x00000000000d677c <+44>: std r2,96(r1) // PATCHER_BEGIN's toc_save
0x00000000000d6780 <+48>: nop // NOPs from PATCHER_BEGIN
0x00000000000d6784 <+52>: nop // that get replaced
0x00000000000d6788 <+56>: nop // by instructions that
0x00000000000d678c <+60>: nop // change r2
0x00000000000d6790 <+64>: nop //
Later there are loads from that location like
0x000000000019e0e4 <+132>: ld r2,24(r1)
that make me nervous since that's the pre-updated value.
I believe this is the same thing Nathan is describing way back in a9bc692d
and his solution was to put a second call around each interception, where
the outer call is just
intercept_brk():
PATCHER_BEGIN
_intercept_brk()
PATCHER_END
and the inner call _intercept_brk() is where the bulk of the code goes.
What I'm seeing is that _intercept_brk() is being inlined and probably
negating Nathan's fix. So I want to add __opal_attribute_noinline__ to
restore the fix.
With this commit in place, the disassembly of intercept_brk becomes tiny
because it's no longer inlining _intercept_brk() and the susipicious
early save of r2 is gone. I made the same fix to all the intercept_*
functions, although intercept_brk was the only one that had a suspicious
save of r2.
As far as empirical failures though, we only have those from the non-OMPI
product that's using the same patcher code. I'm not actually getting OMPI
to fail from the above suspicious data being saved in r1+24.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
These op codes used to be in bits/ipc.h but were removed in glibc in 2015
with a comment saying they should be defined in internal headers:
https://sourceware.org/bugzilla/show_bug.cgi?id=18560
and when glibc uses that syscall it seems to do so from its own definitions:
https://github.com/bminor/glibc/search?q=IPCOP_shmat&unscoped_q=IPCOP_shmat
So I think using #ifndef and defining them if they're not already defined
using the values from glibc is the best option.
At IBM it was the testing on redhat 8 that found this as an issue
(the opcodes being undefined on the system made it select the
left undefined so shmat/shmdt memory events went unintercepted).
Signed-off-by: Mark Allen <markalle@us.ibm.com>
mtl_btl_ofi_rcache_init() initializes patcher which should only take
place things are single threaded. OFI providers may start spawn threads,
so initialize the rcache before creating OFI objects to prevent races.
Authored-by: John L. Byrne <john.l.byrne@hpe.com>
Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
Added the flag OPAL_OFI_PCI_DATA_AVAILABLE to remove accessing the nic
object in
fi_info when the ofi version does not support that structure.
Signed-off-by: Nikola Dancejic dancejic@amazon.com
correctly use strlen(char *) instead of sizeof(char *)
Thanks Georg Geiser for reporting this issue.
Refs. open-mpi/ompi#7772
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
* `libevent_core.so` contains the core functionality that we depend upon
- `libevent.so` library has been identified as the legacy target.
- `libevent_core.so` exists as far back as Libevent 2.0.5 (oldest supported by OMPI)
* `libevent_pthreads.so` can work with either `-levent` or `-levent_core`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 886f41fe33)
* LSF ships a `libevent.so` that is no related to the `libevent.so`
shipped with Libevent.
* Add some checks to the configure logic to detect scenarios where this
conflict can be detected, and provide the user with a descriptive
warning message.
- When detected by `event/external` this is just a warning since
the internal component may be able to be used instead.
- This happens when the user supplies the LSF path via the
`LDFLAGS` envar instead of via `--with-lsf-libdir`.
- When detected by a LSF component and LSF was explicitly requested
then this becomes an error. Otherwise it will just print the warning
and that component will fail to build.
* Note for `master` the `orter_check_lsf.m4` portion of this cherry-pick
was moved to `prrte/config/prrte_check_lsf.m4`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit fc4199e3ba)
* This should have been `LDFLAGS` not `LIBS`. Either works, but
`LDFLAGS` is more correct. We should also include `CPPFLAGS`
just in case the header is important to the check.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 22d8fa197b)
also add common verbose variable.
Note the verbosity thing is a little tricky owing to the way the MCA frameworks and components are registered and
and initialized. The BTL's are registered/initialized prior to the MTL components even getting registered.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>