Some macros defined by the embedded hwloc ends up in opal_config.h
because hwloc configury m4 files are slurped into Open MPI. These
macros are not required here, and they might conflict with an external
hwloc install, so simply #undef them in hwloc/external/external.h
after including <opal_config.h> but before including the external
<hwloc.h>.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit updates btl/vader to use an mpool for handling all shared
memory allocations (frags, fboxes).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new mpool base module type: basic. This module can
be used with an opal_free_list_t to allocate space from a
pre-allocated block (such as a shared memory region). The new module
only supports allocation and is not meant for more dynamic use cases
at this time.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Update the OPAL glue configure code to correctly link the opal/pmix4
component to the hwloc used by OMPI instead of defaulting to the
system-level hwloc. Required a corresponding update to the PMIx hwloc
configure code so we treat hwloc the same way we handle libevent in
embedded scenarios.
Signed-off-by: Ralph Castain <rhc@pmix.org>
The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard.
Open MPI does not really need them (they are only used to be reported as not supported),
so smply #ifdef protect them to support an external PMIx v3.1
Refs. open-mpi/ompi#6247
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit fixes a bug where add_procs can incorrectly return an
error when going through the dynamic add_procs path. This doesn't
happen normally, only when pml/ob1 is not in use.
References #6201
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit contains the following changes:
- Remove the unused opal_test_init/opal_test_finalize
functions. These functions are not used by anything in the code
base or MTT. Tests use opal_init_util/opal_finalize_util instead.
- Get rid of gotos in opal_init_util and opal_init. Replaced them
with a cleaner solution.
- Automatically register cleanup functions in init functions. The
cleanup functions are executed in the reverse order of the
initialization functions. The cleanup functions are run in
opal_finalize_util() before tearing down the class system.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
remove whitespace around '=' when setting btl_uct_LIBS
Thanks Ake Sandgren for reporting this
Refs. open-mpi/ompi#6173
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Though not a recommended configuration it is possible to use Open MPI
over UCX over uGNI. This configuration had some issues related to the
connection management and tl selection. This commit fixes those
issues.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Add the --pset option for app_contexts so the user can provide a string
name for each app_context. Use the new PMIx pset attribute to store the
names in the PMIx local storage for retrieval
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Now Open MPI requires a C99 compiler. Checking availability of
the following types is no more needed.
- `long long` (`signed` and `unsigned`)
- `long double`
- `float _Complex`
- `double _Complex`
- `long double _Complex`
Furthermore, the `#if HAVE_[TYPE]` style checking is not correct.
Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`.
`AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h`
if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]`
(instead of defining as `0`) if it is not available. So even if we
need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`.
I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac`
since someone may use `HAVE_[TYPE]` macros somewhere.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Under certain circumstances, ibv_exp_query_device was
returning an error due to uninitialized fields in the
extended attributes struct.
Fixes: #5810Fixes: #5914
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
In 457f058 I broke the TCP BTL with --enable-ipv6. This patch
fixes the compile error, so IPv6 works again.
Fixed#5996
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Only default to the external component if its version is
greater or equal than the internal libevent (2.0.22)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
- Always use the external component when configure'd with --with-libevent=external
- Fix the external libevent library version detection
by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros
- Use the event2/event.h header (event.h is deprecated since libevent 2.0
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
AC_CHECK_DECLS take a comma separated list of macros/symbols,
so replace the whitespace separator with a comma.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The monitoring PML hides it's existence from the OMPI infrastructure by
removing itself from the list of PML loaded components, remaining hidden
until MPI_Finalize.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code. Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure. This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.
Refs #5818
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Today, a btl tcp module is associated with exactly one IP
address (IPv4 or IPv6). There's no need to reserve space
for both an IPv4 and IPv6 address in the module structure,
since the module will only be associated with one or the
other.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads. The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.
This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT. It also
reduces the scalability of the TCP BTL by increasing start-up
time, but better than hanging.
The long term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race. THis patch is a work around until that patch can
be developed.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing an incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.
At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.
This commit also fixes some bugs in the active-message send path:
- Correctly set all fragment fields in prepare_src.
- Fix bug when using buffered-send. We were not reading the return
code correctly (which is in bytes). This resulted in a message
getting sent multiple times.
- Don't try to progress sends from the btl_send function when in an
active-message callback. It could lead to deep recursion and an
eventual crash if we get a trace like
send->progress->am_complete->ob1_callback->send->am_complete...
Closes#5820Closes#5821
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This reverts commit 6acebc40a1.
This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run. See
https://github.com/open-mpi/ompi/issues/5849 for more information.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Looks like a filename was missed when pmix sucked in the installdirs
framework. Fixing the typo fixes "make ctags" and "make cscope".
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.
Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.
Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the
priority list. This will give better out of the box performance for
multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.
This commit also fixes a number of leaks and a possible deadlock when
using RDMA.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error. However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined. Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
We only need to pass a custom range if the target is a single process.
Otherwise, we let the range be "session".
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:
_Atomic intptr x, y;
intptr_t z = 0;
x = y = z;
Will produce a compiler error of the form:
operand cannot have void type: op "="
assignment type mismatch:
long "=" void
To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.
Fixes#5814
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
On some platfoms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Remove the pack/unpack pragma around net/if.h on MacOS, which
was added to fix a bug in MacOS X 10.4.x on 64-bit platforms.
The bug was fixed in Mac OS X 10.5.0 and, sometime in the last
11 years, compilers started emitting warnings about the fact
that the Apple header stomped over the pragma pack settings
from the workaround. We already don't support versions of MacOS
earlier than 10.5, so there's no point in keeping the workaround.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Open MPI doesn't support any transports on MacOS which require
memory manager hooks. The memory patcher component uses the
syscall interface, which has been deprecated in recent versions
of MacOS. Since we don't need it and it emits warnings about
deprecation, disable the memory patcher component on MacOS.
Fixes#5671
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.
Add a more clear error message in debug mode.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.
1. On the endpoint, it is storing the moral equivalent of a
(struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).
The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr). Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).
This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module. This only happens with interfaces that have
more than one IP address. This commit does not fix that issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
In many cases, this was a simple string replace. In a few places, it
entailed:
1. Updating some comments and removing now-redundant foo[size-1]='\0'
statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
wants the entire destination buffer length).
This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix the test that determined whether we output "writeable" or
"read-only" for MCA vars (it was checking the wrong flag).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Disable async receive for CUDA under OpenIB. While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit. This change has been validated
and approved by Akshay.
References #3972
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.
Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
To ensure fast box entries are complete when processed by the
receiving process the tag must be written last. This includes a zero
header for the next fast box entry (in some cases). This commit fixes
two instances where the tag was written too early. In one case, on
32-bit systems it is possible for the tag part of the header to be
written before the size. The second instance is an ordering issue. The
zero header was being written after the fastbox header.
Fixes#5375, #5638
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Since openib is on its long, slow way out the door, don't let it
complain about not being able to find any NICs at run time.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit updates the patcher component to either use the
__clear_cache intrinsic or the correct assembly to flush the
instruction cache.
Fixes#5631
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug when using the UCT btl with the UCX memory
hooks disabled. We were misssing a call to
opal_mem_hooks_unregister_release to remove the btl memory hook
callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Ensure that the project, framework, component, and variable names are
lower than max lengths. This is a follow-on to
992a8e8297, per discussion on
https://github.com/open-mpi/ompi/pull/5642.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
If someone specifies --with-verbs-usnic, actually do a configury check
to ensure that it will compile (vs. assuming that it will compile if
someone asks for it).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Since version hwloc 2.0.0 has a new organization of NUMA nodes on the
topology tree. This commit adds the detection of local NUMA object for
hwloc => 2.0.0, which fixes the procs bindings policy for rmaps mindist
component.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
bugfix: major: openib send credits returned correctly after a fault for pending frags to dead processes; also tweak the default IB retry timeouts tomake this happen faster
Make it compile in non-debug builds
Mark the IB endpoint as failed when invoking an error; this resolves UDCM connection deadlocks
Changing the default IB retry timeouts is not a good idea.
We'll need to find another way to speedup credit recovery in failure cases.
Remove ULFM specific cases
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
When an error is returned by the socket operations, trigger the
appropriate error path in the PML to give an opportunity for
rerouting/error handling.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
The write memory barrier was intended to precede setting a fast-box
header but instead follows it. This commit moves the memory barrier to
the intended location.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The Autoconf AC_CONFIG_* macros can only be instantiated exacly once
for any given file, *and* they must be in a code execution path at run
time for the target file to be generated at the end of configure.
For example, if you want to generate file ABC at the end of configure,
you must invoke the AC_CONFIG_FILES(ABC) macro in a code path that
will get executed when configure is run.
That's pretty straightforward.
What's not straightforward is two corner cases:
1. You cannot invoke the AC_CONFIG_FILES(ABC) macro for the same file
more than once. If you do, autoreconf will fail (even before you
can run configure).
2. If AC_CONFIG_FILES(ABC) is not in a code path that is executed by
configure, the file ABC is not registered properly, and ABC will
not be generated at the end of configure.
This applies to hwloc because hwloc's HWLOC_SETUP_CORE macro calls
both AC_CONFIG_FILES and AC_CONFIG_HEADER to setup its Makefiles
(etc.) so that targets like "make distclean" and "make distcheck" will
work properly. Hence, we *have* to invoke HWLOC_SETUP_CORE.
However, the MCA_opal_hwloc_hwloc201_CONFIG macro has a few side
effects. It would be nice to do able to do something like this:
```
if hwloc:extern is going to be used:
Invoke minimal HWLOC_SETUP_CORE (with no side effects)
else
Invoke full HWLOC_SETUP_CORE (with side effects)
fi
```
But we can't, because autoreconf will detect that AC_CONFIG_FILES has
been invoked on the same files more than once (regardless of whether
those code paths will be executed at run time or not). Kaboom.
Similarly, we can't do this:
```
if hwloc:extern is not going to be used:
Invoke full HWLOC_SETUP_CORE (with side effects)
fi
```
Because then hwloc's AC_CONFIG_FILES won't be registered properly when
hwloc:external *is* used (i.e., when the HWLOC_SETUP_CORE macro is not
in a code path that is executed at run time), and targets like "make
distclean" will fail because hwloc's Makefiles won't have been setup.
Kaboom.
But remember that the hwloc framework is a bit special: there will
only ever be 2 comoponents: external and internal. External is
guaranteed to be configured first because of its priority. So the
internal component (i.e., this component) immediately knows if it is
going to be used or not based on whether the external component
configuration succeeded or failed.
Specifically: regardless of whether the internal component (i.e., this
component) is going to be used, we have to invoke HWLOC_SETUP_CORE.
But we can manage the side effects: allow the side effects when
this/internal component is going to be used, and avoid the side
effects when this/internal component is not going to be used.
This is a little less clean than I would have liked, but because of
Autoconf's oddity about its AC_CONFIG_* macros, this is the only
solution I could come up with.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
In order to make "make distclean" (and friends) work, we need to
*always* invoke the embedded configure script -- even if we know that
we're not going to use this component.
But in cases where we know we're not going to use this component, we
also need to avoid the side effects of the code path that is used when
we *do* want to use this component. So split the two possibilities
into two different macros:
1. MCA_opal_event_libevent2022_FAKE_CONFIG: which does almost nothing
except invoke the underlying "configure" script.
2. MCA_opal_event_libevent2022_REAL_CONFIG: which does all the real
work (including invoking the underlying "configure" script).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
We know that event:external will be configured first (because of its
priority). Take advantage of that here in libevent2022 by having it
refuse to configure / politely fail if event:external succeeded.
Also print out some additional lines in configure output indicating
what is going on (i.e., event:external succeeded, so this component
will be skipped, or event:external failed, so this component will be
used).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
We know that hwloc:external will be configured first (because of its
priority). Take advantage of that here in hwloc201 by having it
refuse to configure / politely fail if hwloc:external succeeded.
Also print out some additional lines in configure output indicating
what is going on (i.e., hwloc:external succeeded, so this component
will be skipped, or hwloc:external failed, so this component will be
used).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The 2 sided communication support is added for non-tagmatching provider
to take advantage of this BTL and PML OB1. The current state is
"functional" and not optimized for performance.
Two sided support is disabled by default and can be turned on by mca
parameter: "mca_btl_ofi_mode".
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.
Thanks to @tonyreina for the error report
Signed-off-by: Ralph Castain <rhc@open-mpi.org>