Resolve a race condition between registering for a file to be removed upon termination and actual creation of that file by providing attributes that identify whether the path is a file or directory. This removes the need for PMIx to detect the difference.
Refs #4686
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This commit fixes an issue when a registration is created for a large
region and then invalidated while part of it is in use.
References #4509
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit moves the backing files to /dev/shm to avoid limitations
that may be set on /tmp. The files are registered with pmix to ensure
they are cleaned up after an erroneous exit.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 48101278160672317ade352365592f56ef3b8977)
If available, have apps use the registration capability to clean up their session directories. Set up the capability for vader to register its shared memory file location, but let someone familiar with that code do the actual registration.
Final cleanup to track uid/gid, update the opal/pmix API to pass flags for ignore and leave top directory alone
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
It is possible to have parts of an in-use registered region be passed
to munmap or madvise. This does not necessarily mean the user has made
an error but does mean the entire region should be invalidated. This
commit checks that the munmap or madvise base matches the beginning of
the cached region. If it does and the region is in-use then we print
an error. There will certainly be false-negatives where a user
unmaps something that really is in-use but that is preferable to a
false-positive.
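The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `example_region_t` structure, the `example_action_t` values and `example_on_release()` are hypothetical names, not the rcache data structures or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one cached registration. */
typedef struct {
    uintptr_t base;      /* first byte covered by the registration */
    size_t    length;    /* size of the registration in bytes */
    int       ref_count; /* greater than zero while the registration is in use */
} example_region_t;

typedef enum {
    EXAMPLE_NO_OVERLAP,       /* released range does not touch the region */
    EXAMPLE_INVALIDATE,       /* overlap: invalidate the entire region */
    EXAMPLE_INVALIDATE_ERROR  /* released at the region base while in use: also report */
} example_action_t;

/* Decide what to do when [base, base + length) is passed to munmap or madvise. */
example_action_t example_on_release(const example_region_t *reg,
                                    uintptr_t base, size_t length)
{
    assert(reg != NULL);

    bool overlaps = base < reg->base + reg->length &&
                    reg->base < base + length;
    if (!overlaps) {
        return EXAMPLE_NO_OVERLAP;
    }

    /* Only a release that starts exactly at the region base is reported.
     * A release that starts inside an in-use region is invalidated silently,
     * which is where the accepted false negatives come from. */
    if (base == reg->base && reg->ref_count > 0) {
        return EXAMPLE_INVALIDATE_ERROR;
    }
    return EXAMPLE_INVALIDATE;
}
```

Any overlap invalidates the whole region; the base comparison only decides whether an error is also printed.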
References #4509
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.
References #4260
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
set the key of all mpool_tree_item objects, so they can be retrieved
in mpool_base_free and then returned back to the
mca_mpool_base_tree_item_free_list free list.
Refs. open-mpi/ompi#4567
Thanks Philip Blakely for the bug report.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit adds support for fetch-and-op atomics. This is needed
because AND and OR are irreversible operations, so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is no atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.
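As an illustration of the relationship between the two forms, here is a minimal sketch using only C11 `<stdatomic.h>`. The `example_*` function names are hypothetical and are not the opal atomic API. The op-and-fetch value is recomputed from the old value, which works for every operation; going the other way (recovering the old value from the new one) is impossible for AND and OR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* fetch-and-op: returns the value stored BEFORE the operation (C11 semantics). */
uint32_t example_fetch_and(_Atomic uint32_t *addr, uint32_t mask)
{
    assert(addr != NULL);
    return atomic_fetch_and(addr, mask);
}

/* op-and-fetch built on fetch-and-op: reapply the operation to the old value
 * to obtain the value that was just stored. */
uint32_t example_and_fetch(_Atomic uint32_t *addr, uint32_t mask)
{
    assert(addr != NULL);
    return atomic_fetch_and(addr, mask) & mask;
}

uint32_t example_or_fetch(_Atomic uint32_t *addr, uint32_t bits)
{
    assert(addr != NULL);
    return atomic_fetch_or(addr, bits) | bits;
}

uint32_t example_add_fetch(_Atomic uint32_t *addr, uint32_t delta)
{
    assert(addr != NULL);
    return atomic_fetch_add(addr, delta) + delta;
}
```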
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.
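For readers unfamiliar with the difference, the sketch below uses the C11 `atomic_compare_exchange_strong` call, whose semantics the new names mirror. The `example_cmpset()` and `example_atomic_max()` helpers are hypothetical illustrations, not opal functions. The key difference from a cmpset-style call is that a failed compare-exchange writes the value it actually found back into the expected-value argument, so a retry loop needs no separate reload.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* cmpset-style helper: reports only success or failure; the observed value
 * is discarded because oldval is a local copy. */
bool example_cmpset(_Atomic int32_t *addr, int32_t oldval, int32_t newval)
{
    assert(addr != NULL);
    return atomic_compare_exchange_strong(addr, &oldval, newval);
}

/* Atomic maximum written as a compare-exchange retry loop. Returns the value
 * that was observed before this call changed anything. */
int32_t example_atomic_max(_Atomic int32_t *addr, int32_t candidate)
{
    assert(addr != NULL);
    int32_t seen = atomic_load(addr);
    while (seen < candidate &&
           !atomic_compare_exchange_strong(addr, &seen, candidate)) {
        /* The failed exchange refreshed seen with the current value,
         * so the loop condition is re-evaluated against fresh data. */
    }
    return seen;
}
```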
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
if PMIx (version > 1.x) is active since all diagnostic messages will instead flow thru
the PMIx connection. Unfortunately, PMIx v1 does not support this
feature, but we can remove the stddiag support once PMIx v1 slides out
of the support window
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
Store the pointer to the object handle and not the pointer to the
pointer.
We should not assert(0) in the code!
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit renames the atomic compare-and-swap functions to indicate
the return value. This is in preparation for adding support for a
compare-and-swap that returns the old value. At the same time the
return type has been changed to bool.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The problem is that the waiting thread is cycling using OMPI_LAZY_WAIT_FOR_COMPLETION so it can exercise opal_progress. This probably isn't as critical for the modex step, but definitely necessary for the barrier at the end of mpi_init. The problem this creates is that the lazy macro exits as soon as "active" becomes false, and then we destruct the lock.
However, wakeup_thread sets "active" to false - and then calls the condition broadcast to wakeup any waiting threads. So there is a race condition between that broadcast and the lock destruct.
Add OPAL_ACQUIRE_OBJECT and OPAL_POST_OBJECT memory barriers to help protect against thread race conditions on some platforms
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Search external libevent library in both DIR/lib64 and DIR/lib
when --with-libevent=DIR is specified but --with-libevent-libdir=DIR is not
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fix an apparent typo in external libevent configury
Require external libevent for install of separate libpmix
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
gcc 7.[1,2] (at least) fails to correctly parse the OSX 10.13 sys/syslog.h
header. As a result we need to protect syslog support in OPAL, PMIX and
ORTE.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Turns out UCX PML calls opal_pmix.fence in its del procs
method without checking whether or not the fence method
for the pmix component was defined. Rather than patch
UCX PML, actually define a fence method for the cray pmix.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Give packagers a configure CLI option to set the value of the MCA
variable mca_base_component_show_load_errors.
The --disable form of this option is intended for Open MPI packagers
who tend to enable support for many different types of networks and
systems in their packages. For example, consider a packager who
includes support for both the FOO and BAR networks in their Open MPI
package, both of which require support libraries (libFOO.so and
libBAR.so). If an end user only has BAR hardware, they likely only
have libBAR.so available on their systems -- not libFOO.so. Disabling
load errors by default will prevent the user from seeing potentially
confusing warnings about the FOO components failing to load because
libFOO.so is not available on their systems.
Conversely, system administrators tend to build an Open MPI that is
targeted at their specific environment, and contains few (if any)
components that are not needed. In such cases, they might want their
users to be warned that the FOO network components failed to load
(e.g., if libFOO.so was mistakenly unavailable), because Open MPI may
otherwise silently fail over to a slower network path for MPI traffic.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Debounce "unreachable" notifications for tools when they disconnect
Enable the -x cmd line option for prun
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 0a5b36180a22959654461ac1303cec35313f8b4a)
Remove some build product. Tell PMIx that we don't need a new nspace generated when OMPI calls connect
Add missing Makefile
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
There is a race condition in TCP connection establishment
during a simultaneous handshake. This PR fixes it.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Make sure hostnames are null terminated, even when they were
too long to fit in the hostname buffer.
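A minimal sketch of the pattern, assuming a hypothetical helper `example_copy_hostname()` (not a function from the code base). POSIX leaves it unspecified whether `gethostname()` null-terminates a truncated name, and `strncpy()` does not terminate when the source fills the buffer, so the terminator is written explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, truncating if necessary, and always leave dst
 * null terminated. */
void example_copy_hostname(char *dst, size_t dst_len, const char *src)
{
    assert(dst != NULL && src != NULL && dst_len > 0);
    strncpy(dst, src, dst_len - 1); /* leaves room for the terminator */
    dst[dst_len - 1] = '\0';        /* terminate even if src was too long */
}
```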
Fixes: CID 1418232
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Be a little more deliberate about converting OMPI's --with-cuda CLI
value to hwloc's --enable-cuda configure option.
Also, unconditionally disable hwloc NVML support (because Open MPI is
not currently using it).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
purpose. Continue to link the new library back to libopen-pal to resolve the renamed symbols.
Update opal configure logic to set disable_dlopen when disable_mca_dso is given. Fix typos in disable_dlopen when setting variables (incorrect inclusion of quotes)
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Amazon is going to use the reachable framework to fix some connection
bugs in the TCP BTL, so claim support ownership of the weighted and
netlink components.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Wire up the libnl utilities Jeff and Ralph added previously to
the netlink reachable component so that it actually does work.
The algorithm is a bit simplistic, but should work for our use
cases. If there's a route, assume the two interfaces can talk.
If there's no gateway, assume the two interfaces are in the
same subnet, and give preference to that connection. If there's
a gateway, assume there's a route, but the interfaces are not
in the same subnet.
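The decision rule above can be written down in a few lines. This is a sketch only: the weight names and values and the `example_netlink_weight()` function are hypothetical, and the real component obtains the route and gateway information from libnl.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical weights; a larger value means a more preferred pairing. */
enum {
    EXAMPLE_WEIGHT_UNREACHABLE = 0, /* no route between the interfaces */
    EXAMPLE_WEIGHT_ROUTED      = 1, /* route through a gateway */
    EXAMPLE_WEIGHT_SAME_SUBNET = 2  /* route without a gateway */
};

int example_netlink_weight(bool has_route, bool has_gateway)
{
    /* Assumption of this sketch: a gateway is only reported as part of a route. */
    assert(has_route || !has_gateway);

    if (!has_route) {
        return EXAMPLE_WEIGHT_UNREACHABLE;
    }
    return has_gateway ? EXAMPLE_WEIGHT_ROUTED : EXAMPLE_WEIGHT_SAME_SUBNET;
}
```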
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The netlink component's libnl wrapper code returned the
next hop in the route table to allow the calling code
to differentiate between same and different networks,
which is a fine comparison for IPv4, but is pretty
expensive for IPv6 (coming soon to a netlink component
near you). Rather than provide extra information
(the address of the next hop), just provide whether
there is a gateway or not, which is all the netlink
component actually needs.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The netlink reachable component has never been released in a usable
form, but had code copied from usNIC to support both libnl-1 and
libnl-3. If nothing else, this code was a little buggy in
handling the case where libnl-3 but not libnl-route-3 were
installed. Jeff and I decided to drop libnl-1 support from the
netlink reachable component, given that it's getting pretty old
and the weighted component provides the same information that
the TCP BTL and OOB are using today, so libnl-1 customers won't
see a step backwards from where they are today.
Signed-off-by: Brian Barrett <bbarrett@mazon.com>
Based on work from usNIC, the best way to use the reachability
information the reachable components return is to build a
connectivity graph between the two peers and run a bipartite
graph solver. Rather than returning the "best" pairing,
the reachability framework now returns the entire mapping,
allowing a (soon to be added) graph solver to build the
"optimal" connectivity pairing.
Practically, this means changing the return type of the
reachable() function and rewriting the weighted_reachable()
function to return the full mapping. The netlink_reachable()
function still always returns NULL.
At the same time, fix bit-rot in the weighted component and
enable builds of the component by removing the opal_ignore.
Also, add IPv6 support to the weighted component to support
both use cases in the TCP BTL.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Ralph and Jeff created the reachable framework and added the
netlink component based on code copied from the usnic btl.
However, they never renamed all the symbols from the libnl
compatibility code. This patch finishes the rename.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This fixes a hang on immediate PMIx requests. PMIx v1.2 does not support
the info key `PMIX_IMMEDIATE`, which leads to the hang. For such requests
the fix uses the key `PMIX_OPTIONAL` so that the request does not go to the server.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
Still in the "needs to be done" category:
* mapping/ranking/binding options aren't correctly supported
* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Cisco wrote a bipartite graph solver to properly solve
interface pair selection for usNIC. Using the reachable
framework, the TCP BTL (and possibly the runtime network
code) can use the graph solver to make more optimal pair
selection. Jeff was happy to have the code more broadly
used, but didn't have time to do the move, hence this
commit.
There are a couple of minor changes to the code compared
to the usNIC version. Obviously, the functions have
been renamed to match naming convention for their new
home. Since it's easier to write unit tests for
util/ code, the unit tests have been made first class
tests run at "make check" time. This last bit required
moving some of the definitions into a new header,
bipartite_graph_internal.h, so that they could be
included in both the library code and the test code.
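To show why a graph solver pairs interfaces better than picking the best link greedily, here is a deliberately simplified sketch. It is an unweighted augmenting-path matching, not the solver moved by this commit, and every name in it (`example_graph_t`, `example_match()`, `EXAMPLE_MAX_IF`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EXAMPLE_MAX_IF 8

/* reach[l][r] is true when local interface l can talk to remote interface r. */
typedef struct {
    int  n_local;
    int  n_remote;
    bool reach[EXAMPLE_MAX_IF][EXAMPLE_MAX_IF];
} example_graph_t;

/* Try to pair local interface l, re-pairing earlier interfaces if needed. */
static bool example_augment(const example_graph_t *g, int l,
                            bool *visited, int *remote_owner)
{
    for (int r = 0; r < g->n_remote; ++r) {
        if (!g->reach[l][r] || visited[r]) {
            continue;
        }
        visited[r] = true;
        if (remote_owner[r] < 0 ||
            example_augment(g, remote_owner[r], visited, remote_owner)) {
            remote_owner[r] = l;
            return true;
        }
    }
    return false;
}

/* Fills remote_owner[r] with the paired local index (or -1) and returns
 * the number of pairs found. */
int example_match(const example_graph_t *g, int *remote_owner)
{
    assert(g->n_local <= EXAMPLE_MAX_IF && g->n_remote <= EXAMPLE_MAX_IF);

    int pairs = 0;
    for (int r = 0; r < g->n_remote; ++r) {
        remote_owner[r] = -1;
    }
    for (int l = 0; l < g->n_local; ++l) {
        bool visited[EXAMPLE_MAX_IF];
        memset(visited, 0, sizeof(visited));
        if (example_augment(g, l, visited, remote_owner)) {
            ++pairs;
        }
    }
    return pairs;
}
```

With local 0 able to reach both remotes and local 1 able to reach only remote 0, a greedy choice of (0, 0) leaves local 1 unpaired, while the matching moves local 0 to remote 1 so that both interfaces are used.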
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit adds the code necessary to support forming connections across
subnets. The primary changes are to 1) add the gid to the modex, and 2)
use the gid to create the address handle.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Unlike "orterun", "prun" is a PMIx-only program that discovers the DVM connection instead of requiring that we explicitly provide it. Only build "prun" if PMIx v2.x is available.
This gets the DVM working again, but still is showing problems for multiple executions. I'll detail those in a separate issue. Thus, the DVM should still be considered "broken".
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Some OSes have hardcoded limits to prevent overflowing over an int32_t.
We can either detect this at configure (which might be a nicer but
incomplete solution), or always force the pipelined protocol over TCP.
As it only covers data larger than 1GB, no performance penalty is to be
expected.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
as the writev and readv support a sum larger than a uint32_t
this version will work. For the other OSes a different patch
is required. This patch is a slight modification of the one
proposed by @ggouaillardet.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
 * Resolves #3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la
```
Note: The changes in this commit were automated by the
`libadd_mca_comp_update.py` script from the commit that precedes it.
Some components were not included in this change because
they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Adopting can fail if the server-side hole isn't available on the client.
We can fall back to other ways to load the topology.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
This commit makes two changes:
1. Adding the magic string during the handshake can cause
issues when used with older versions of MPI. Hence set the
RCVTIMEO parameter to 2 seconds.
2. Use a single call during the handshake instead of
two calls.
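For reference, a receive timeout of this kind is normally set with the POSIX `SO_RCVTIMEO` socket option. The sketch below assumes a hypothetical helper `example_set_recv_timeout()`; it is not the code from the TCP BTL.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Make blocking receives on fd give up after the given number of seconds.
 * Returns 0 on success and -1 on failure, like setsockopt(). */
int example_set_recv_timeout(int fd, long seconds)
{
    assert(seconds >= 0);

    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

A receive that times out fails with `EAGAIN` or `EWOULDBLOCK`, so a peer that never sends the magic string cannot block the handshake indefinitely.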
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improving TCP debugging, we are moving a few
BTL_ERROR messages to show_help and also updating
mca_btl_tcp_endpoint_complete_connect to report
SUCCESS and ERROR cases through its return value.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improving failure handling in the TCP BTL,
we use a magic string to verify that a connection comes
from an MPI process. If the magic string is missing or
does not match, we can identify that we are trying to
connect with some other process.
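A minimal sketch of such a check, assuming a made-up magic value: `example_magic` and `example_handshake_ok()` are hypothetical, and the actual string and receive path are defined by the TCP BTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical magic string used only for this example. */
static const char example_magic[] = "EXAMPLE-MAGIC";

/* Return true only if the peer sent exactly the expected magic string,
 * including its terminating null byte. */
bool example_handshake_ok(const char *received, size_t received_len)
{
    assert(received != NULL || received_len == 0);

    if (received_len != sizeof(example_magic)) {
        return false; /* missing, short, or oversized handshake */
    }
    return memcmp(received, example_magic, sizeof(example_magic)) == 0;
}
```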
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Moving the non-blocking send/receive functions to btl_tcp
makes them reusable wherever needed.
In this case we plan to reuse the receive function to
retrieve the magic string and validate that an established
connection is from an MPI process.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Not sure how/when this got deleted, but put back the "Cisco usNIC"
line in the transport summary at the end of configure.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Turns out that supplying NULL to group_register in
mca_base_var_group_component_register is not a good
idea if one wants ompi_info to work as intended.
The ugni and vader btls both call this before
registering component variables. This borks up
ompi_info since NULL is supplied as the project
name. So, now supply the project name rather than
just NULL to group register.
Fixes #4020.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
usnic endpoints were always created with the default send credit value of 8. This
commit assigns the correct number from the hardware instead.
Signed-off-by: Thananon Patinyasakdikul <apatinya@cisco.com>
It is not possible to use the patcher based memory hooks without
hooking madvise (MADV_DONTNEED). This commit updates the patcher
memory hooks to always hook madvise. This should be safe with recent
rcache updates.
References #3685. Close when merged into v2.0.x, v2.x, and v3.0.x.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The current VMA cache implementation backing rcache/grdma can run into
a deadlock situation in multi-threaded code when madvise is hooked and
the c library uses locks. In this case we may run into the following
situation:
Thread 1:
    ...
    free ()           <- Holding libc lock
    madvise_hook ()
    vma_iteration ()  <- Blocked waiting for vma lock
Thread 2:
    ...
    vma_insert ()     <- Holding vma lock
    vma_item_new ()
    malloc ()         <- Blocked waiting for libc lock
To fix this problem we previously chose to remove the madvise () hook, but that
fix is causing issue #3685. This commit aims to greatly reduce the
chance that the deadlock will be hit by putting vma items into a free
list. This moves the allocation outside the vma lock. In general there
are a relatively small number of vma items so the default is to
allocate 2048 vma items. This default is configurable, but if anything
the number is likely too large rather than too small.
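The idea can be sketched as below. This is a simplified illustration with hypothetical names (`example_cache_t`, `example_cache_init()`, `example_cache_insert()`), not the rcache/grdma implementation: the sketch guards its free list with the same mutex and simply fails when the list is empty, whereas a real free list could grow.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct example_item {
    struct example_item *next;
    int key;
} example_item_t;

typedef struct {
    pthread_mutex_t lock;      /* stands in for the vma lock */
    example_item_t *free_list; /* items allocated ahead of time */
    example_item_t *in_use;    /* items currently inserted */
} example_cache_t;

/* Allocate all items up front, before the lock is ever held. */
int example_cache_init(example_cache_t *c, size_t prealloc)
{
    assert(c != NULL);
    c->free_list = NULL;
    c->in_use = NULL;
    if (pthread_mutex_init(&c->lock, NULL) != 0) {
        return -1;
    }
    for (size_t i = 0; i < prealloc; ++i) {
        example_item_t *item = malloc(sizeof(*item));
        if (item == NULL) {
            return -1;
        }
        item->next = c->free_list;
        c->free_list = item;
    }
    return 0;
}

/* Insert without calling malloc() while the lock is held, so the libc
 * allocator lock is never requested from inside the critical section. */
int example_cache_insert(example_cache_t *c, int key)
{
    assert(c != NULL);
    int rc = -1;

    pthread_mutex_lock(&c->lock);
    example_item_t *item = c->free_list;
    if (item != NULL) {
        c->free_list = item->next;
        item->key = key;
        item->next = c->in_use;
        c->in_use = item;
        rc = 0;
    }
    pthread_mutex_unlock(&c->lock);
    return rc;
}
```

Because the critical section never enters the allocator, the lock ordering shown for Thread 2 above cannot occur on this path.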
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>