According to clang on MacOS, passing a bool parameter -- which
undergoes default parameter promotion -- to va_start() results in
undefined behavior. So just change these params to int and avoid the
issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit disables the use of both the builtin and hand-written
atomics if proper C11 atomic support is detected. This is the first
step towards requiring the availability of C11 atomics for the C
compiler used to build Open MPI.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
To ensure fast box entries are complete when processed by the
receiving process the tag must be written last. This includes a zero
header for the next fast box entry (in some cases). This commit fixes
two instances where the tag was written too early. In one case, on
32-bit systems it is possible for the tag part of the header to be
written before the size. The second instance is an ordering issue. The
zero header was being written after the fastbox header.
Fixes#5375, #5638
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Since openib is on its long, slow way out the door, don't let it
complain about not being able to find any NICs at run time.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit updates the patcher component to either use the
__clear_cache intrinsic or the correct assembly to flush the
instruction cache.
Fixes#5631
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug when using the UCT btl with the UCX memory
hooks disabled. We were misssing a call to
opal_mem_hooks_unregister_release to remove the btl memory hook
callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Ensure that the project, framework, component, and variable names are
lower than max lengths. This is a follow-on to
992a8e8297, per discussion on
https://github.com/open-mpi/ompi/pull/5642.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
If someone specifies --with-verbs-usnic, actually do a configury check
to ensure that it will compile (vs. assuming that it will compile if
someone asks for it).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Since version hwloc 2.0.0 has a new organization of NUMA nodes on the
topology tree. This commit adds the detection of local NUMA object for
hwloc => 2.0.0, which fixes the procs bindings policy for rmaps mindist
component.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
bugfix: major: openib send credits returned correctly after a fault for pending frags to dead processes; also tweak the default IB retry timeouts tomake this happen faster
Make it compile in non-debug builds
Mark the IB endpoint as failed when invoking an error; this resolves UDCM connection deadlocks
Changing the default IB retry timeouts is not a good idea.
We'll need to find another way to speedup credit recovery in failure cases.
Remove ULFM specific cases
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
The important part of this fix is a couple places 5 was hard-coded that needed to be
strlen(OPAL_INFO_SAVE_PREFIX).
But also this contains a fix for a gcc 7.3.0 compiler warning about snprintf(). There
was an "if" statement making sure all the arguments had appropriate strlen(), but gcc
still complained about the following snprintf() because the size of the struct element
is iterator->ie_key[OPAL_MAX_INFO_KEY + 1].
Signed-off-by: Mark Allen <markalle@us.ibm.com>
When an error is returned by the socket operations, trigger the
appropriate error path in the PML to give an opportunity for
rerouting/error handling.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
The write memory barrier was intended to precede setting a fast-box
header but instead follows it. This commit moves the memory barrier to
the intended location.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The Autoconf AC_CONFIG_* macros can only be instantiated exacly once
for any given file, *and* they must be in a code execution path at run
time for the target file to be generated at the end of configure.
For example, if you want to generate file ABC at the end of configure,
you must invoke the AC_CONFIG_FILES(ABC) macro in a code path that
will get executed when configure is run.
That's pretty straightforward.
What's not straightforward is two corner cases:
1. You cannot invoke the AC_CONFIG_FILES(ABC) macro for the same file
more than once. If you do, autoreconf will fail (even before you
can run configure).
2. If AC_CONFIG_FILES(ABC) is not in a code path that is executed by
configure, the file ABC is not registered properly, and ABC will
not be generated at the end of configure.
This applies to hwloc because hwloc's HWLOC_SETUP_CORE macro calls
both AC_CONFIG_FILES and AC_CONFIG_HEADER to setup its Makefiles
(etc.) so that targets like "make distclean" and "make distcheck" will
work properly. Hence, we *have* to invoke HWLOC_SETUP_CORE.
However, the MCA_opal_hwloc_hwloc201_CONFIG macro has a few side
effects. It would be nice to do able to do something like this:
```
if hwloc:extern is going to be used:
Invoke minimal HWLOC_SETUP_CORE (with no side effects)
else
Invoke full HWLOC_SETUP_CORE (with side effects)
fi
```
But we can't, because autoreconf will detect that AC_CONFIG_FILES has
been invoked on the same files more than once (regardless of whether
those code paths will be executed at run time or not). Kaboom.
Similarly, we can't do this:
```
if hwloc:extern is not going to be used:
Invoke full HWLOC_SETUP_CORE (with side effects)
fi
```
Because then hwloc's AC_CONFIG_FILES won't be registered properly when
hwloc:external *is* used (i.e., when the HWLOC_SETUP_CORE macro is not
in a code path that is executed at run time), and targets like "make
distclean" will fail because hwloc's Makefiles won't have been setup.
Kaboom.
But remember that the hwloc framework is a bit special: there will
only ever be 2 comoponents: external and internal. External is
guaranteed to be configured first because of its priority. So the
internal component (i.e., this component) immediately knows if it is
going to be used or not based on whether the external component
configuration succeeded or failed.
Specifically: regardless of whether the internal component (i.e., this
component) is going to be used, we have to invoke HWLOC_SETUP_CORE.
But we can manage the side effects: allow the side effects when
this/internal component is going to be used, and avoid the side
effects when this/internal component is not going to be used.
This is a little less clean than I would have liked, but because of
Autoconf's oddity about its AC_CONFIG_* macros, this is the only
solution I could come up with.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
In order to make "make distclean" (and friends) work, we need to
*always* invoke the embedded configure script -- even if we know that
we're not going to use this component.
But in cases where we know we're not going to use this component, we
also need to avoid the side effects of the code path that is used when
we *do* want to use this component. So split the two possibilities
into two different macros:
1. MCA_opal_event_libevent2022_FAKE_CONFIG: which does almost nothing
except invoke the underlying "configure" script.
2. MCA_opal_event_libevent2022_REAL_CONFIG: which does all the real
work (including invoking the underlying "configure" script).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
We know that event:external will be configured first (because of its
priority). Take advantage of that here in libevent2022 by having it
refuse to configure / politely fail if event:external succeeded.
Also print out some additional lines in configure output indicating
what is going on (i.e., event:external succeeded, so this component
will be skipped, or event:external failed, so this component will be
used).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
We know that hwloc:external will be configured first (because of its
priority). Take advantage of that here in hwloc201 by having it
refuse to configure / politely fail if hwloc:external succeeded.
Also print out some additional lines in configure output indicating
what is going on (i.e., hwloc:external succeeded, so this component
will be skipped, or hwloc:external failed, so this component will be
used).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
libevent does not support multiple threads calling the event loop on
the same event base. This causes external libevent's to print out
re-entrant warning messages. This commit fixes the issue by protecting
the call to the event loop with an atomic swap check.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The 2 sided communication support is added for non-tagmatching provider
to take advantage of this BTL and PML OB1. The current state is
"functional" and not optimized for performance.
Two sided support is disabled by default and can be turned on by mca
parameter: "mca_btl_ofi_mode".
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.
Thanks to @tonyreina for the error report
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Per https://github.com/open-mpi/ompi/issues/5031, if the user didn't specify a particular PMIx installation, then default back to the internal version if it is newer than the discovered external one. PMIx doesn't yet provide a full signature so we have to just get as close as possible for now.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Due to decreasing support by vendors/other orgs for the OpenIB BTL,
only look for iWarp/RoCE devices by default. Allow IB HCAs
with ports configured for ethernet.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
OFI BTL uses context for completion but never ask for it in
fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT
to eliminate any potential error.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit fixes two bugs in the RMA/atomic emulation code:
1) Fix a fragment leak when using AMO emulation.
2) Always initialize the single-copy emulation code. This is required
to use the AMO support.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the atomic fifo code to fix a consistency issue
observed on Power9 systems when builtin atomics are used. The cause
was two things: 1) a missing write memory barrier in fifo push, and 2)
a read ordering issue when reading the fifo head non-atomically. This
commit fixes both issues and appears to correct then inconsistency.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If OMPI is configured with `--with-hwloc=external` or `--with-hwloc=DIR`
and gfortran is used, I see a lot of warnings when compiling files
under the `ompi/mpi/fortran` directory.
```
f951: Warning: Nonexistent include directory
'BUILD_DIR/opal/mca/hwloc/external/hwloc/include' [-Wmissing-include-dirs]
```
There is no such `include` directory in the source tree and `configure`-
created tree. I think these lines in the `configure.m4` file are wrongly
copied from that for the embedded `hwlocXXX` component in the past.
The `-Wmissing-include-dirs` option is enabled in gfortran by default
but it is not enabled by default (or even with `-Wall`) in gcc and g++.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
- added common logging infrastructure for all
UCX modules
- all UCX modules are switched to new infra
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
The descriptor flags field in a fragment were being ready after the
fragment may have been freed. This commit reads the flags before
calling the user callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for atomic operations as well as rdma for
systems without rdma support. This support is implemented using an
internal send tag.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
- some common functionality of del_procs calls is moved into
mca_common module
- blocking ucp_put call is replaced by non-blocking routine
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
The current cast is *functional*, but isn't really the way it should
be done. This commit makes the cast the way it should be done.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix two facepalms:
1. The "uint32" in the hash map functions refer to the *key* size, not
the *value* size. The values are always 64 bits.
2. Pass the straight value to the "set" functions -- not the pointer
to the value.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit changed the way btl/ofi call progress. Before, we force
progression with every rdma/atomic call. This gives performance boost in
some case and slow down on others. Now we only force progression after
some number of rdma calls which result in better performance overall.
Also added new MCA parameter 'mca_btl_ofi_progress_threshold' to set
the threshold number. The new default is 64.
Also:
Added FI_DELIVERY_COMPLETE to tx_rtx flags to ensure that the completion
is generated after the message has been received on the remote side.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit improves the injection rate and latency for RDMA
operations. This is done by the following improvements:
- If C11's _Thread_local keyword is available then always use the
same virtual device index for the same thread when using RDMA. If
the keyword is not available then attempt to use any device that
isn't already in use. The binding support is enabled by default but
can be disabled via the btl_ugni_bind_devices MCA variable.
- When posting FMA and RDMA operations always attempt to reap
completions after posting the operation. This allows us to
better balance the work of reaping completions across all
application threads.
- Limit the total number of outstanding BTE transactions. This
fixes a performance bug when using many threads.
- Split out RDMA and local SMSG completion queue sizes. The RDMA
queue size is better tuned for performance with RMA-MT.
- Split out put and get FMA limits. The old btl_ugni_fma_limit MCA
variable is deprecated. The new variable names are:
btl_ugni_fma_put_limit and btl_ugni_fma_get_limit.
- Change how post descriptors are handled. They are no longer
allocated seperately from the RDMA endpoints.
- Some cleanup to move error code out of the critical path.
- Disable the FMA sharing flag on the CDM when we detect that there
should be enough FMA descriptors for the number of virtual devices
we plan will create. If the user sets this flag we will not unset
it. This change should improve the small-message RMA performance by
~ 10%.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new btl for one-sided and two-sided. This btl
uses the uct layer in OpenUCX. This btl makes use of multiple uct
contexts and per-thread device pinning to provide good performance
when using threads and osc/rdma. This btl has been tested extensively
with osc/rdma and passes all MTT tests on aries and IB hardware.
For now this new component disables itself but can be enabled by
setting the btl_ucx_transports MCA variable with a comma-delimited
list of supported memory domains/transport layers. For example:
--mca btl_uct_memory_domains ib/mlx5_0. The specific transports used
can be selected using --mca btl_uct_transports. The default is to use
any available transport.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The giant size of the TCP proc struct is causing a problem in some
environments (because it is allocated on the stack), and it was too
big, anyway.
Instead, use a hash map. That way, it starts small and can grow if it
needs to. It also makes no assumptions about the values of the kernel
interface indexes.
Fixes#5292.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Since the new binding option is tied to the --cpu-list orterun CLI
option, make the --bind-to option reflect the same name (vs. the
--cpu-set CLI option, which is entirely different). For example:
mpirun --bind-to cpu-list:ordered ...
Note that "--bind-to cpulist:ordered" is accepted as a synonym,
because people will be lazy.
Also add some minor updates to the orterun.1in man page for
clarification.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
cuda buffer
the existing interface in opal_datatype_cuda do not allow to distinguish whether a
buffer is a managed or unmanaged cuda buffer. Add an interface that allows to
retrieve this information throug a convertor, since the information is actually available
in the mca_common_cuda_* routines.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Cover all data types for OPAL-to-PMIx conversion, generating error logs when we hit something we don't support
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
In the opal list parsing behavior paths should be separated by ':' while files are separated by ','. In the opal and pmix code (the pmix fix is in a separate commit) there was a mistake in the parsing such that files were being separated by ':' when they should be separated by ','s. This commit attempts to address this mismatch.
Signed-off-by: Noah Evans <noah.evans@gmail.com>
Fix CID 1435996: use the proper % type to render the size.
Also use opal_output(), not fprintf(). For debug builds, abort
without dumping core (dumping core is very unfriendly when running
thousands of automated tests) -- the stderr output is sufficient to
find the coding error. For non-debug builds, truncate the key and
emit a warning that it almost certainly will not work properly.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit add support for scalable endpoint to enhance multithreaded
application performance. The BTL will detect the support from ofi
provider and will fallback to normal usage of scalable endpoint is not
supported.
NEW MCA parameters:
- mca_btl_ofi_disable_sep: force the btl to not use scalable endpoint.
- mca_btl_ofi_num_contexts_per_module: number of communication context
to create (should be the same as number of thread).
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
configure: add checks for `__thread` on top of current check for `_Thread_local` and define OPAL_HAVE_THREAD_LOCAL if the compiler support TLS.
Added `opal_thread_local` keyword to unify the definition.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>