Adds the capability to select a NIC based on hardware locality.
Creates a list of NICs that share the same cpuset as the process,
then selects the NIC based on the (local rank) % (number of NICs).
If no NICs are available that share the same cpuset, the selection process
will create a list of all available NICs and make a selection based on
(local rank) % (number of NICs)
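
The selection rule above can be sketched as follows. This is a minimal illustration, not the code from the commit: `struct nic`, its `shares_cpuset` field, and `select_nic()` are hypothetical names, and the real implementation works on Libfabric and hwloc objects rather than a flat array.

```c
#include <stddef.h>

/* Hypothetical NIC descriptor, used only for this sketch. */
struct nic {
    const char *name;
    int shares_cpuset;   /* non-zero if the NIC shares the process cpuset */
};

/*
 * Pick a NIC for the given local rank. NICs that share the process
 * cpuset are preferred; if there are none, fall back to the full list.
 * Returns NULL only when the list is empty.
 */
static const struct nic *select_nic(const struct nic *nics, size_t num_nics,
                                    unsigned local_rank)
{
    size_t num_local = 0;
    size_t i, seen = 0, target;

    if (num_nics == 0)
        return NULL;

    for (i = 0; i < num_nics; i++)
        if (nics[i].shares_cpuset)
            num_local++;

    /* No NIC is local: (local rank) % (number of NICs) over all NICs. */
    if (num_local == 0)
        return &nics[local_rank % num_nics];

    /* Otherwise apply the same modulo over the local NICs only. */
    target = local_rank % num_local;
    for (i = 0; i < num_nics; i++) {
        if (!nics[i].shares_cpuset)
            continue;
        if (seen++ == target)
            return &nics[i];
    }
    return NULL; /* not reached */
}
```

With one remote and two local NICs, local ranks 0, 1, 2 map to the first, second, and first local NIC respectively.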
Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
(cherry picked from commit 167d75b42ac3ca4770d59c796c011d72e0fffde3)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Per suggestion of @awlauria
Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
(cherry picked from commit ab4875ddc2b76ceced120fbfe09d8dcbde6e6ff3)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Per suggestion of @awlauria, added some comments about
the need to free ep before resources it points to.
Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
(cherry picked from commit 1bc3dab118bb694d932ef365a7a922e514f82a20)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This fix is from John L. Byrne (john.l.byrne@hpe.com).
When OFI Libfabric binds objects to endpoints, before the object can
be successfully closed, the endpoint must first be freed. For scalable
endpoints, objects can also be bound to transmit and receive contexts,
and for objects that are bound to contexts, we need to first free the
contexts before freeing the endpoint. We also need to clear the memory
registration cache.
If we don't clean up properly, then fi_close may not be able to close
the domain because the domain will have a non-zero reference count.
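
The ordering constraint can be illustrated with a small reference-counting model. Nothing here is Libfabric API: `struct model_obj`, `model_open()`, and `model_close()` are stand-ins invented for the sketch, and the assumption that a close fails with `-EBUSY` while something still references the object is a simplification of the behavior described above.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for any fabric object (domain, CQ, endpoint, context). */
struct model_obj {
    int users;                  /* open objects that still reference this one */
    int open;
    struct model_obj *refs[2];  /* objects this one holds a reference on */
};

/* Open an object that references up to two others (either may be NULL). */
static void model_open(struct model_obj *o, struct model_obj *a,
                       struct model_obj *b)
{
    o->users = 0;
    o->open = 1;
    o->refs[0] = a;
    o->refs[1] = b;
    if (a) a->users++;
    if (b) b->users++;
}

/* Close fails while anything still references the object. */
static int model_close(struct model_obj *o)
{
    int i;

    if (!o->open)
        return 0;
    if (o->users > 0)
        return -EBUSY;
    for (i = 0; i < 2; i++)
        if (o->refs[i])
            o->refs[i]->users--;
    o->open = 0;
    return 0;
}
```

In this model, closing a context, then the endpoint, then the bound object, then the domain succeeds at every step, while closing the domain first returns `-EBUSY`.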
Signed-off-by: harumi kuno <harumi.kuno@hpe.com>
(cherry picked from commit 3095fabf94de249716fc146a4c4a609bd33a1aff)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Make sure to get an RDM provider that can provide both local and
remote communication. We need this check because some providers could
be selected via RXD or RXM, but can't provide local communication, for
example.
Add OPAL_CHECK_OFI_VERSION_GE() m4 macro to check that the Libfabric
we're building against is >= a target version. Use this check in two
places:
1. MTL/OFI: Make sure it is >= v1.5, because the FI_LOCAL_COMM /
FI_REMOTE_COMM constants were introduced in Libfabric API v1.5.
2. BTL/usnic: It already had similar configury to check for Libfabric
>= v1.1, but the usnic component was checking for >= v1.3. So
update the btl/usnic configury to use the new macro and check for
>= v1.3.
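
For illustration, the ">= target version" comparison that the m4 macro performs at configure time looks roughly like this in C. The `SKETCH_*` macros and `sketch_version_ge()` are hypothetical local names that pack a version as major/minor; they are not the macro added by this commit, nor Libfabric's own header definitions.

```c
/* Hypothetical helpers: pack major/minor into one integer. */
#define SKETCH_VERSION(major, minor) \
    (((unsigned)(major) << 16) | (unsigned)(minor))
#define SKETCH_MAJOR(v) ((v) >> 16)
#define SKETCH_MINOR(v) ((v) & 0xFFFFu)

/* Return non-zero if version `have` is >= version `want`. */
static int sketch_version_ge(unsigned have, unsigned want)
{
    if (SKETCH_MAJOR(have) != SKETCH_MAJOR(want))
        return SKETCH_MAJOR(have) > SKETCH_MAJOR(want);
    return SKETCH_MINOR(have) >= SKETCH_MINOR(want);
}
```

Under this sketch, a v1.4 Libfabric fails the MTL/OFI check for v1.5, while v1.5 and v2.0 both pass.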
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 21bc9042e1ff9c3075ababda9eaf49ccc90f64db)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
As discussed in open-mpi/ompi#2519 the common component does not depend
on libfabric yet. This commit introduces this dependency by just calling
fi_version().
Signed-off-by: guserav <erik.zeiske@hpe.com>
(cherry picked from commit 8a67a95c993dbfc2e3fa652777cab6ee20a4a735)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The changes in f5e1a672ccd5db127e85e1e8f6bcfeb8a8b04527 were made
after the common/ofi component was removed, so the component does not
reflect them.
Namely f5e1a672ccd5db127e85e1e8f6bcfeb8a8b04527 changed:
- How to call OPAL_CHECK_OFI (It sets opal_ofi_happy to yes now)
- Dropped the common part in the build flags for ofi
Signed-off-by: guserav <erik.zeiske@web.de>
(cherry picked from commit 0e25c95eaeabca67cbe5ad6370171e1cff47e52a)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Replace all tabs with spaces. No code or logic changes.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit b556cabfe937f84f30e8870e0a8c128b839e5dfd)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
It doesn't seem like the BTL was using an uninitialized pointer. But simply
setting the rcache pointer to NULL after destroying it makes the valgrind
errors go away.
Fixes Issue #6345
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit 786e686d4347655b574e609c65626c8323bb49b2)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Two-sided communication support is added so that non-tag-matching
providers can take advantage of this BTL and PML OB1. The current state
is "functional" and not optimized for performance.
Two-sided support is disabled by default and can be turned on with the
MCA parameter "mca_btl_ofi_mode".
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
(cherry picked from commit 080115d44069e0c461a1af105cd41f28849cdffc)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
These op codes used to be in bits/ipc.h but were removed in glibc in 2015
with a comment saying they should be defined in internal headers:
https://sourceware.org/bugzilla/show_bug.cgi?id=18560
and when glibc uses that syscall it seems to do so from its own definitions:
https://github.com/bminor/glibc/search?q=IPCOP_shmat&unscoped_q=IPCOP_shmat
So I think using #ifndef and defining them if they're not already defined
using the values from glibc is the best option.
At IBM, testing on Red Hat 8 exposed this issue
(the opcodes being undefined on the system made the #define HAS_SHMDT
evaluate to false so intercept_shmat / intercept_shmdt were
left undefined so shmat/shmdt memory events went unintercepted).
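
A minimal sketch of the guarded definitions described above. The values 21 and 22 are the ones glibc uses internally for these op codes; the placement after `<sys/ipc.h>` is an assumption about where the guard would sit.

```c
#include <sys/ipc.h>

/* Define the op codes only if the system headers no longer provide them. */
#ifndef IPCOP_shmat
#define IPCOP_shmat 21
#endif

#ifndef IPCOP_shmdt
#define IPCOP_shmdt 22
#endif
```

With the guards in place, code conditioned on these op codes being defined is compiled on systems whose headers dropped them.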
(cherry picked from commit e8fab058dac7300569cb54b08e5500115f8bab8f)
Signed-off-by: Mark Allen <markalle@us.ibm.com>
correctly use strlen(char *) instead of sizeof(char *)
Thanks Georg Geiser for reporting this issue.
Refs. open-mpi/ompi#7772
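
To illustrate the difference: `sizeof` applied to a `char *` yields the size of the pointer, not the length of the string it points to. The helper below is a hypothetical example, not the code that was fixed.

```c
#include <stdlib.h>
#include <string.h>

/* Return a heap copy of src, or NULL on allocation failure; caller frees. */
static char *copy_string(const char *src)
{
    /* sizeof(src) would be the pointer size (typically 4 or 8 bytes),
     * so it must not be used as the string length. */
    size_t len = strlen(src);
    char *dst = malloc(len + 1);

    if (dst != NULL)
        memcpy(dst, src, len + 1);   /* +1 copies the terminating NUL */
    return dst;
}
```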
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit c450b2140540a1f8eae1a6e6f9a22d17cd40e7d8)
* `libevent_core.so` contains the core functionality that we depend upon
- `libevent.so` library has been identified as the legacy target.
- `libevent_core.so` exists as far back as Libevent 2.0.5 (oldest supported by OMPI)
* `libevent_pthreads.so` can work with either `-levent` or `-levent_core`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* LSF ships a `libevent.so` that is not related to the `libevent.so`
shipped with Libevent.
* Add some checks to the configure logic to detect scenarios where this
conflict can be detected, and provide the user with a descriptive
warning message.
- When detected by `event/external` this is just a warning since
the internal component may be able to be used instead.
- This happens when the user supplies the LSF path via the
`LDFLAGS` envar instead of via `--with-lsf-libdir`.
- When detected by an LSF component and LSF was explicitly requested
then this becomes an error. Otherwise it will just print the warning
and that component will fail to build.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* This should have been `LDFLAGS` not `LIBS`. Either works, but
`LDFLAGS` is more correct. We should also include `CPPFLAGS`
just in case the header is important to the check.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Keep track of the connected procs in vader_add_procs().
Otherwise, the same rank will reconnect the same shmem
segment (rank 0+...) multiple times instead of the next
one as intended.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit f69c8d6819dcc14f471cea90b50dc8ca98de12d4)
Build was broken by mistake in commit d40662edc41a5a4d09ae690b640cfdeeb24e15a1
Fixes #7362
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
(cherry picked from commit 907ad854b46b42ae7cb1e9c87238691a5cc25e36)
Both opal_hwloc_base_get_relative_locality() and _get_locality_string()
iterate over hwloc levels to build the proc locality information.
Unfortunately, NUMA nodes are not in those normal levels anymore since 2.0.
We have to explicitly look at the special NUMA level to get that locality info.
I am factoring the core of the iterations out into dedicated "_by_depth"
functions and calling them again for the NUMA level at the end of the loops.
Thanks to Hatem Elshazly for reporting the NUMA communicator split failure
at https://www.mail-archive.com/users@lists.open-mpi.org/msg33589.html
It looks like only the opal_hwloc_base_get_locality_string() part is needed
to fix that split, but there's no reason not to fix get_relative_locality()
as well.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
(cherry picked from commit ea80a20e108cb69efc67ad04ad968da7b85772af)
This PR removes the constant defining the max attachment address and
replaces it with the largest address that shows up in /proc/self/maps.
This should address issues found on AARCH64 where the max address
may differ based on the configuration.
Since the calculated max address may differ between processes the
max address is sent as part of the modex and stored in the endpoint
data.
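
A rough sketch of deriving the largest mapped address from a maps-formatted stream. `max_mapped_address()` is a hypothetical helper, not the function added by this PR; it assumes each line starts with `low-high` in hexadecimal, as in /proc/self/maps, and does not handle lines longer than its buffer.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Scan a stream in /proc/self/maps format ("low-high perms ...") and
 * return the largest end address seen, or 0 if no line could be parsed.
 */
static uintptr_t max_mapped_address(FILE *maps)
{
    char line[1024];
    uintptr_t max = 0;

    while (fgets(line, sizeof(line), maps) != NULL) {
        uintptr_t low, high;

        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &low, &high) == 2 &&
            high > max)
            max = high;
    }
    return max;
}
```

On Linux the stream would come from `fopen("/proc/self/maps", "r")`; taking a `FILE *` keeps the parsing testable with any input.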
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
(cherry picked from commit 728d51f9f3f2df6577e5f9729b9d6a0fe9441d37)
This commit fixes an issue discovered in the XPMEM registration cache. It
was possible for a registration to be invalidated by multiple threads
leading to a double-free situation or re-use of an invalidated registration.
This commit fixes the issue by setting the INVALID flag on a registration
when it will be deleted. The flag is set while iterating over the tree
to take advantage of the fact that a registration can not be removed
from the VMA tree by a thread while another thread is traversing the VMA
tree.
References #6524
References #7030
Closes #6534
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
(cherry picked from commit f86f805be1145ace46b570c5c518555b38e58cee)
There are cases where the same interval may be in the tree multiple
times. This generally isn't a problem when searching the tree but
may cause issues when attempting to delete a particular registration
from the tree. The issue is fixed by breaking a low value tie by
checking the high value then the interval data.
If the high, low, and data of a new insertion exactly matches an
existing interval then an assertion is raised.
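
The tie-breaking order can be sketched as a comparison function. `struct interval` and `interval_compare()` are hypothetical names for this illustration; the tree's actual node layout may differ.

```c
#include <stdint.h>

/* Hypothetical interval record: [low, high] plus an opaque payload. */
struct interval {
    uintptr_t low;
    uintptr_t high;
    const void *data;
};

/*
 * Order by low value, break ties on the high value, then on the data
 * pointer. Returns <0, 0, or >0; 0 means an exact duplicate, which an
 * insertion would treat as an error.
 */
static int interval_compare(const struct interval *a,
                            const struct interval *b)
{
    if (a->low != b->low)
        return a->low < b->low ? -1 : 1;
    if (a->high != b->high)
        return a->high < b->high ? -1 : 1;
    if (a->data != b->data)
        return (uintptr_t)a->data < (uintptr_t)b->data ? -1 : 1;
    return 0;
}
```

Because two registrations covering the same range still differ in their data pointer, a delete can locate exactly the entry it was asked to remove.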
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
(cherry picked from commit 1145abc0b790f82ea25e24a3becad91ff502769c)
Thanks @devreal for catching this.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 3de636dc6fda43fb31bcd69b6a91b7f8de1a985c)
This patch fixes #7147 by preventing overflow when multiplying
the count and the blocklen. The count reflects MPI count and is
therefore bound to the size of an int (it is a uint32_t) while the
blocklen can be merged together to represent the largest contiguous
memory layout and it is therefore promoted to a size_t.
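
A minimal sketch of the overflow and of the widening that avoids it. Both function names are hypothetical; the sketch only shows that the product must be computed in `size_t`, and it does not guard against overflow of `size_t` itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen count to size_t before multiplying, so the product is
 * computed at the width of blocklen. */
static size_t total_bytes(uint32_t count, size_t blocklen)
{
    return (size_t)count * blocklen;
}

/* For contrast: with both operands 32 bits wide, the product wraps
 * modulo 2^32. */
static uint32_t total_bytes_32(uint32_t count, uint32_t blocklen)
{
    return count * blocklen;
}
```

For example, 70000 * 70000 is 4900000000, which wraps to 605032704 in 32 bits but is represented exactly when `size_t` is 64 bits wide.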
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 59fb02618e75022e3cd137199e3d390757e3c7e0)
Clarify in README what --with-hwloc does in its different use cases.
Also, ensure that the behavior when specifying `--with-hwloc` is the
same as if that option is not specified at all. This is what we did
in Open MPI <= v3.x; looks like we inadvertently caused `--with-hwloc`
to be synonymous with `--with-hwloc=external` in v4.0.0.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 18c3e1af5ef281e8b502a7ad778256889cc707d1)
syscall() returns a long, but we are invoking shmat(), which returns
a void*.
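
One way to express the conversion, as a sketch only: `syscall_result_to_ptr()` is a hypothetical helper, and it assumes `intptr_t` is wide enough to hold the `long` result, which is the case on the usual Linux targets.

```c
#include <stdint.h>

/*
 * Convert the raw long returned by syscall() into the void * that
 * shmat() callers expect. A failure value of -1 becomes (void *)-1,
 * which is what shmat() itself returns on error.
 */
static void *syscall_result_to_ptr(long result)
{
    return (void *)(intptr_t)result;
}
```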
Signed-off-by: Maxwell Coil <mcoil@nd.edu>
(cherry picked from commit 52a9cce6f3dcd87e9ae66177398b60b9317e9339)
This commit fixes a configure bug that caused flow control to be
disabled regardless of the configure options used.
Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
(cherry picked from commit f7e74b6a3d19cd6e3edc3364783311e595f81b3d)
The prior implementation was simply wrong.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 53ebea12aa98ba5440924a8ee914f6707f63608b)