Per open-mpi/ompi#381, convert the specific intialization of opal_pmix
to use the generic "= { 0 }" initializer. This form can be used to
initialize any type when the intent is just to zero out / assign
*some* value.
Embedding libfabric is a temporary measure; I'm removing some warning
notifications so that the output isn't so cluttered (we're getting
the real warnings fixed upstream, but the OMPI community doesn't
really care/need to see the warnings in the meantime).
This commit adds an MCA variable to select Portals4 logical
addressing, populates the logical-to-physical mapping table and
initializes the NI in this mode.
For static builds, we need to also set
<framework>_<component>_WRAPPER_EXTRA_LIBS so that the wrappers know
what other libraries to add to link executables.
Ensure to count *this* process when checking for how many VFs we need
on the local server.
(cherry picked from commit 386c01934e98cb8dcb48ff648ecdfb0c8677baa9)
There was a mismatch between the structure for mca_rcache_vma_t and
the OBJ_CLASS_INSTANCE. One was opal_list_item_t and the other was
ompi_free_list_item_t. The super class in the structure looks like it
is the correct one. Changed the superclass in OBJ_CLASS_INSTANCE to
match.
If there are not enough resources (e.g., low VFs), we can end up
calling finalize_one_channel() on the same channel multiple times. So
ensure to NULL out fields that we have freed already so that we do not
try to free them a second time.
Fixes CSCus26648.
Fix the ordering so that we obtain the usnic netmask information
*before* we do the filtering based on CIDR-specified networks.
Also requires upstream Github libfabric commit 3976745.
Fixes CSCus22495.
We had several problems in the old code:
1. We were specifying an arbitrary timeout (100 ms) and then abandoning
all remaining pending AV insert operations. We would then free the
endpoint buffer that we gave to fi_av_insert(), usually causing
libfabric's progress thread to write to a freed buffer.
2. We were claiming in a show_help message that the timeout was
controllable via an MCA parameter. This commit removes that
parameter, since there's no good method for us to specify a timeout
like this to libfabric right now.
3. We also weren't waiting for the correct number of fi_av_insert()
operations to complete. We were waiting for nprocs, which is
accidentally fine for 2 procs on separate hosts, but not for most
other proc counts.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
There was a bug in the openib btl handling this valid sequence of
calls:
desc = btl_alloc ();
btl_free (desc);
When triggered the bug would cause either fragment loss or undefined
behavior (SEGV, etc). The problem occured because btl_alloc contained
the logic to modify the pending fragment (length, etc) and these
changes were not corrected if the fragment was freed instead of sent.
To fix this issue I 1) moved some of the coalescing logic to the
btl_send function, and 2) retry the coalesced fragment on btl_free
if it was never sent. This appears to completely address the issue.
For systems with OFED's lacking XRC support, commit b3617e73
broke the build of the openib btl. This commit addresses
the issues introduced by this commit.
This was more complicated than I would like, but it's just an
unfortunate GCC/clang difference. I don't have access to all the C
compilers out there, so this may still have problems with other
compilers that implement some form of `#pragma GCC diagnostic` support
but don't actually behave the same as some versions of GCC.
fixes#323
Use the pkg-config related m4 functions to find out where
Cray's xpmem.h and libxpmem are located on a system.
With this commit, there is no longer any need to have to
explicitly indicate an xpmem install location on the configure
line, at least for Cray systems running CLE 4.X and 5.X.
Replace temporary environment variables with a MCA
parameter for the ugni btl. A user wishing to
use the ugni btl async. progress thread needs to
set the request_progress_thread param to true.
For example, using env. variable format:
export OMPI_MCA_btl_ugni_request_progress_thread=1
Verified via testing with unit tests, etc. that
in fact BTE TX descriptors using CQs configured to
generate IRQs were in fact working correctly on Cray XC. Disable
send message back to self and just use IRQs generated
by completion of TX descriptors posted to BTE.
Honor enable_mpi_threads setting to enable the ugni btl
async progress thread. If the app doesn't request thread-multiple
the thread will not be created.
- by default allow to register maximum possible (i.e 2 * total_memory)
memory. This beheviour can be turned off using mca parameter
"btl_openib_allow_max_memory_registration"
- In fallback case, use device specific parameters to calulate
memory limit.
Properly test for some dependent libraries; don't just assume
elsewhere in Open MPI's configury will find those libraries. Also
consolidate some CPPFLAGS and clarify some comments.
As pointed out by @ggouaillardet, we were adding some unnecessary -I
flags to CPPFLAFGS when --without-hwloc was being used. This commit
slightly updates the hwloc191 component configury to only add such
things when the component is, in fact, going to be
compiled/installed.
Ensure that the <provider>_happy shell variables are initialized to
0. Without this, the --without-libfabric case would leave them
initialized, resulting in "test: -eq operator expecting a value" kinds
of errors.
while cleaning up after receiving a zero byte on the connect socket
(localyy started connection), while another was trying to accept a
new connection from the same peer. Create a zero-timed event and
delocalize the accept into a timer_event.
Add support for registering an error callback, that can be used when a
connection is discovered as failed during the initialization process.
This is a minor update to
open-mpi/ompi@c52601f0c5.
If we have vsnprintf(), we might as well not have the rest of the
guess_strlen() routine. Also document the nifty trick/behavior of
vsnprintf() that enables this shortcut (it was new to me!).
was quite subtle, and only happened on the process with the smallest
guid (as this process will tear down the connection created locally and
replace it with the result of accept). If multiple threads are active in
the system, the deadlock occurs during the recv event deletion as one
thread will hold the recv event lock of the endpoint and try to access
the TCP event base lock, while the other thread will hold the TCP event
base lock while trying to access the recv event lock (in case data is
available on the socket).
The proposed solution let the event callback fail to process the data,
preventing the deadlock and allowing the other thread to always complete
it's job. As the event is not execute the same triggered will trigger
again at the next opportunity, so this solution introduce a minimal
delay in the connection establishement.
On x86_64 reading a 128-bit value requires multiple instructions.
Under some conditions if the counted pointer counter is read before
the item pointer the fifo can be left in an inconsistent state. This
commit forces the read of the counter to always be read first.
The fifo does not appear to suffer from the same race.
It is possible the compiler can reorder the read of the head item and
the head itself. This could lead to a situation where the item
returned was not really the head item.
Thanks to Nathan for pointing out that I missed snipping one line in
2f9c69f016 (I removed the trailing
comment, but not the trailing pragma -- oops!).
Retain the hetero-nodes flag for those cases where the user *knows* that there are differences and our automated system isn't good enough to see it.
Will obviously require further refinement as we find out which variances it can detect, and which it cannot.
1. Ensure to override CFLAGS properly. Move the setting of CFLAGS outside the AM_CONDITIONAL so that Automake doesn't get confused (because CFLAGS is already set inside an AM_CONDITIONAL -- moving it outside the conditional ensure that this local CFLAGS override trumps all other CFLAGS overrides).
2. Only build libfabric on Linux. Add a little more configury to ensure that we only try to build libfabric on Linux.
3. Remove a dead/unused file
4. Fix typo in condition check
5. Use "false", not "/bin/false"
This commit represents the conversion of the usnic BTL from verbs to
libfabric.
For the moment, libfabric is embedded in Open MPI (currently in the
usnic BTL). This is because the libfabric API is still changing, and
also has not yet been released. Ultimately, this embedded copy of
libfabric will likely disappear and the usnic BTL will rely on an
external installation of libfabric.
New configure options:
* --with-libfabric: will cause configure to fail if libfabric support
cannot be built
* --without-libfabric: will prevent libfabric support from being built
* --with-libfabric=DIR: use an external libfabric installation
* --with-libfabric-libdir=LIBDIR: when paired with --with-libfabric=DIR,
use LIBDIR for the libfabric installation library dir
The --with-libnl3[-libdir] arguments are now gone.
This commit adds a new class: opal_fifo.h. The new class has atomic, non-atomic,
and opal_using_threads() conditoned routines. It should be used when first-in
first-out is required and should perform much better than using locks and an
opal_list_t. Like with opal_lifo_t there are two versions of the atomic
implementation: 128-bit compare-and-swap, and spin-locked. More implementations
can be added later (LL/SC comes to mind).
This commit also adds a unit test for the opal_fifo_t class. This test verifies
the fifo implementation when using multiple threads.
- Rename opal_atomic_lifo_t to opal_lifo_t to reflect both atomic and
non-atomic usage. Added new routines (opal_lifo_*_st) for non-atomic
usage as well as routines conditioned off opal_using_threads(). The
atomic versions are always thread safe and the non-atomic are always
not thread safe.
- Add a new atomic lifo implementation that makes use of 128-bit
compare-and-swap. The new implementation should scale better with
larger numbers of threads.
- Add threading unit test for opal_lifo_t.
There currently is no standard support for 128-bit integer types. Any use
of the __int128 and int128_t types can lead to warnings from the compiler
when using -Wpedantic. Additionally, some compilers may support __int128
and other may support int128_t. This commit addresses both issues by
defining opal_int128_t if there is a supported 128-bit type. In the
case of GCC a pragma has been added to suppress warnings about __int128
not being a standard C type.
A 128-bit compare-and-swap will enable a better atomic lifo implementation
that uses the pointer + counter method to avoid ABA issues. This commit
adds configury to check for the instruction (cmpxchg16b) and adds an
implementation that uses the __int128 type available in C99.