Hwloc upstream has fixed a problem with embedded "make distcheck" that
was breaking that surfaced when you ran autogen in an Open MPI
tarball.
This submodule update takes in the upstream hwloc fixes for this
issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
If someone gives us a namespace that doesn't easily translate to an
integer, we have to create a mechanism for working around the
disconnect. PRRTE has been updated to give us a flag so we know we were
"natively" launched. If we don't see it, then fall back to generating a
hash of the nspace as our jobid. We then have to translate back/forth
between nspace and jobid using a lookup table.
Probably not the right long-term solution, but hopefully helps get us
thru for a bit.
Includes update of PRRTE pointer
Signed-off-by: Ralph Castain <rhc@pmix.org>
- remove stale s390 and MIPS atomics
- ensure envars from spawn are propagated
- fix make tarball
- ensure cleanup of default hostfile
Signed-off-by: Ralph Castain <rhc@pmix.org>
This commit fixes an issue in the MCA base variable system. The
code was retrieving the user home directory (from HOME) and
attempting to use it to build a search path for config files.
In this case user-level configuration directories have been
enabled so the appropriate thing to do is to print an error
message and return. This commit makes that change. It does not
ensure that HOME is set correctly.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
If you autogen.pl --without-prrte, we wouldn't configure or build PRRTE
support. However, configuring with --disable-internal-rte wasn't working
as it was being ignored. This led to some false errors when compiling
with an earlier PMIx v2.2 release.
That said, there were a couple of places that needed protection against
PMIx v2.2.
Signed-off-by: Ralph Castain <rhc@pmix.org>
Fix a couple of spots in OMPI to resolve warnings. The one in comm_cid
in particular may be responsible for some/all of the comm_spawn issues
as it was passing an incorrect pointer to a macro, thus causing memory
corruption.
Update PRRTE and PMIx to deal with v3/v4 differences.
Signed-off-by: Ralph Castain <rhc@pmix.org>
When a job fails very quickly, it is possible that the spawning tool
won't get notified that the spawn completed prior to be told that the
job terminated. This can cause the tool to "hang" in PMIx_Spawn. Ensure
that PRRTE handles this case by guaranteeing we notify the spawner.
Track both PRRTE and PMIx masters as both have changed, though only the
PRRTE one is involved in this particular fix.
Signed-off-by: Ralph Castain <rhc@pmix.org>
- Ensure we accurately handle node name aliases
- Apply the local launch environ to apps prior to spawn
- Add error check if PMIx_Spawn fails
- Fix compiler warning
- Fix PGI vendor check
- Prevent mpirun from hanging if prte segfaults
- Fix absolute/relative path names to ensure "prte" and
"prted" are taken from same distribution as "mpirun"
Signed-off-by: Ralph Castain <rhc@pmix.org>
This commit fixes a potential deadlock that can occur between the
memory hooks and region registation. This deadlock occurs because
of a hold and wait error between two mutexes. The first mutex is
the VMA lock used to protect internal rcache/grdma structures and
the reader/writer lock in the interval tree.
In the case of the memory hooks a reader lock is obtained on the
interval tree then the VMA lock is obtained to remove the
registration from the LRU. In the case of LRU evictions the VMA
lock is obtained then the writer lock on the interval tree is
obtained. This leads to the deadlock.
To fix the issue the code that evicts from the LRU has been
updated to only invalidate the registration while the VMA lock
is held then remove the registration from the VMA after the
lock is released. This should completely eliminate the above
deadlock.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.
remove orte: misc fixes
- UCX fixes
- VPATH issue
- oshmem fixes
- remove useless definition
- Add PRRTE submodule
- Get autogen.pl to traverse PRRTE submodule
- Remove stale orcm reference
- Configure embedded PRRTE
- Correctly pass the prefix to PRRTE
- Correctly set the OMPI_WANT_PRRTE am_conditional
- Move prrte configuration to the end of OMPI's configure.ac
- Make mpirun a symlink to prun, when available
- Fix makedist with --no-orte/--no-prrte option
- Add a `--no-prrte` option which is the same as the legacy
`--no-orte` option.
- Remove embedded PMIx tarball. Replace it with new submodule
pointing to OpenPMIx master repo's master branch
- Some cleanup in PRRTE integration and add config summary entry
- Correctly set the hostname
- Fix locality
- Fix singleton operations
- Fix support for "tune" and "am" options
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Previously we used a fairly simple algorithm in
mca_btl_tcp_proc_insert() to pair local and remote modules. This was a
point in time solution rather than a global optimization problem (where
global means all modules between two peers). The selection logic would
often fail due to pairing interfaces that are not routable for traffic.
The complexity of the selection logic was Θ(n^n), which was expensive.
Due to poor scalability, this logic was only used when the number of
interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More
details can be found in this ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates
in the ticket do not match what I calculated from the function)
As a fallback, when interfaces surpassed this threshold, a brute force
O(n^2) double for loop was used to match interfaces.
This commit solves two problems. First, the point-in-time solution is
turned into a global optimization solution. Second, the reachability
framework was used to create a more realistic reachability map. We
switched from using IP/netmask to using the reachability framework,
which supports route lookup. This will help many corner cases as well as
utilize any future development of the reachability framework.
The solution implemented in this commit has a complexity mainly derived
from the bipartite assignment solver. If the local and remote peer both
have the same number of interfaces (n), the complexity of matching will
be O(n^5).
With the decrease in complexity to O(n^5), I calculated and tested
that initialization costs would be 5000 microseconds with 30 interfaces
per node (Likely close to the maximum realistic number of interfaces we
will encounter). For additional datapoints, data up to 300 (a very
unrealistic number) of interfaces was simulated. Up until 150
interfaces, the matching costs will be less than 1 second, climbing to
10 seconds with 300 interfaces. Reflecting on these results, I removed
the suboptimal O(n^2) fallback logic, as it no longer seems necessary.
Data was gathered comparing the scaling of initialization costs with
ranks. For low number of interfaces, the impact of initialization is
negligible. At an interface count of 7-8, the new code has slightly
faster initialization costs. At an interface count of 15, the new code
has slower initialization costs. However, all initialization costs
scale linearly with the number of ranks.
In order to use the reachable function, we populate local and remote
lists of interfaces. We then convert the interface matching problem
into a graph problem. We create a bipartite graph with the local and
remote interfaces as vertices and use negative reachability weights as
costs. Using the bipartite assignment solver, we generate the matches
for the graph. To ensure that both the local and remote process have
the same output, we ensure we mirror their respective inputs for the
graphs. Finally, we store the endpoint matches that we created earlier
in a hash table. This is stored with the btl_index as the key and a
struct mca_btl_tcp_addr_t* as the value. This is then retrieved during
insertion time to set the endpoint address.
Signed-off-by: William Zhang <wilzhang@amazon.com>
Advance to hwloc-2.1.0rc2-33-g38433c0f, which includes a .gitignore
update that we want here in Open MPI.
Be warned; this is actually 33 commits beyond the hwloc v2.1.0 tag.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This PR removes the constant defining the max attachment address and
replaces it with the largest address that shows up in /proc/self/maps.
This should address issues found on AARCH64 where the max address
may differ based on the configuration.
Since the calculated max address may differ between processes the
max address is sent as part of the modex and stored in the endpoint
data.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
The opal_gethostname() function provides a more robust mechanism
to retrieve the hostname than gethostname(), which can return
results that are not null-terminated, and which can vary in its
behavior from system to system.
opal_gethostname() just returns the value in opal_process_info.nodename;
this is populated in opal_init_gethostname() inside opal_init.c.
-Changed all gethostname calls in opal subtree to opal_gethostname
-Changed all gethostname calls in orte subtree to opal_gethostname
-Changed all gethostname calls in ompi subdir to opal_gethostname
-Changed all gethostname calls in oshmem subdir to opal_gethostname
-Changed opal_if.c in test subdir to use opal_gethostname
-Changed opal_init.c to include opal_init_gethostname. This function
returns an int and directly sets opal_process_info.nodename per
jsquyres' modifications.
Relates to open-mpi#6801
Signed-off-by: Charles Shereda <cpshereda@lanl.gov>
Make sure to get an RDM provider that can provide both local and
remote communication. We need this check because some providers could
be selected via RXD or RXM, but can't provide local communication, for
example.
Add OPAL_CHECK_OFI_VERSION_GE() m4 macro to check that the Libfabric
we're building against is >= a target version. Use this check in two
places:
1. MTL/OFI: Make sure it is >= v1.5, because the FI_LOCAL_COMM /
FI_REMOTE_COMM constants were introduced in Libfabric API v1.5.
2. BTL/usnic: It already had similar configury to check for Libfabric
>= v1.1, but the usnic component was checking for >= v1.3. So
update the btl/usnic configury to use the new macro and check for
>= v1.3.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes an issue discovered in the XPMEM registration cache. It
was possible for a registration to be invalidated by multiple threads
leading to a double-free situation or re-use of an invalidated registration.
This commit fixes the issue by setting the INVALID flag on a registation
when it will be deleted. The flag is set while iterating over the tree
to take advantage of the fact that a registration can not be removed
from the VMA tree by a thread while another thread is traversing the VMA
tree.
References #6524
References #7030Closes#6534
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Rename the component to be "hwloc2" (since it can now be any v2.x.y
version of hwloc), and make the embedded copy of hwloc be a git
submodule.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Clarify in README what --with-hwloc does in its different use cases.
Also, ensure that the behavior when specifying `--with-hwloc` is the
same as if that option is not specified at all. This is what we did
in Open MPI <= v3.x; looks like we inadvertantly caused `--with-hwloc`
to be synonymous with `--with-hwloc=external` in v4.0.0.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>