Amazon is going to use the reachable framework to fix some connection
bugs in the TCP BTL, so claim support ownership of the weighted and
netlink components.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Wire up the libnl utilities Jeff and Ralph added previously to
the netlink reachable component so that it actually does work.
The algorithm is a bit simplistic, but should work for our use
cases. If there's a route, assume the two interfaces can talk.
If there's no gateway, assume the two interfaces are in the
same subnet, and give preference to that connection. If there's
a gateway, assume there's a route, but the interfaces are not
in the same subnet.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The netlink component's libnl wrapper code returned the
next hop in the route table to allow the calling code
to differentiate between same and different networks,
which is a fine comparison for IPv4, but is pretty
expensive for IPv6 (coming soon to a netlink component
near you). Rather than provide extra information
(the address of the next hop), just provide whether
there is a gateway or not, which is all the netlink
component actually needs.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The netlink reachable component has never been released in a usable
form, but had code copied from usNIC to support both libnl-1 and
libnl-3. If nothing else, this code was a little buggy in
handling the case where libnl-3 but not libnl-route-3 were
installed. Jeff and I decided to drop libnl-1 support from the
netlink reachable component, given that it's getting pretty old
and the weighted component provides the same information that
the TCP BTL and OOB are using today, so libnl-1 customers won't
see a step backwards from where they are today.
Signed-off-by: Brian Barrett <bbarrett@mazon.com>
Based on work from usNIC, the best way to use the reachability
information the reachable components return is to build a
connectivity graph between the two peers and run a bipartite
graph solver. Rather than returning the "best" pairing,
the reachability framework now returns the entire mapping,
allowing a (soon to be added) graph solver to build the
"optimal" connectivity pairing.
Practically, this means changing the return type of the
reachable() function and rewriting the weighted_reachable()
function to return the full mapping. The netlink_reachable()
function still always returns NULL.
At the same time, fix bit-rot in the weighted component and
enable builds of the component by removing the opal_ignore.
Also, add IPv6 support to the weighted component to support
both use cases in the TCP BTL.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Initialize the reachable framework during opal_init() and tear
it back down during opal_finalize(). The framework was never
used, so the lack of initialization didn't matter, but this is
a required step in using the framework.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Ralph and Jeff created the reachable framework and added the
netlink component based on code copied from the usnic btl.
However, they never renamed all the symbols from the libnl
compatibility code. This patch finishes the rename.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Add a check for link-local IPv6 addresses to the net
interface to support better computation of network
pairings in the weighted reachable component.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This fixes a hang of immediate PMIx request. PMIx v1.2 does not support
the info key `PMIX_IMMEDIATE` that leads to hanging. For that request
the fix uses the key `PMIX_OPTIONAL` for not go to the server.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
Fix an allocation bug that could occur on non-LP64 platforms.
match_edges_out is an array of integers representing the
edges of the graph (where vertices are ints), with two ints
for every edge. The previous code allocated enough space
for num_dges * sizeof(int*), which happens to be the same
as num_edges * 2 * sizeof(int) on LP64 platforms, but would
be wrong on all other platforms.
Fixes: CID 1417754
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Still in the "needs to be done" category:
* mapping/ranking/binding options aren't correctly supported
* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Cisco wrote a bipartite graph solver to properly solve
interface pair selection for usNIC. Using the reachable
framework, the TCP BTL (and possibly the runtime network
code) can use the graph solver to make more optimal pair
selection. Jeff was happy to have the code more broadly
used, but didn't have time to do the move, hence this
commit.
There are a couple of minor changes to the code compared
to the usNIC version. Obviously, the functions have
been renamed to match naming convention for their new
home. Since it's easier to write unit tests for
util/ code, the unit tests have been made first class
tests run at "make check" time. This last bit required
moving some of the definitions into a new header,
bipartite_graph_internal.h, so that they could be
included in both the library code and the test code.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit adds the code necessary to support forming connections across
subnets. The primary changes are to 1) add the gid to the modex, and 2)
use the gid to create the address handle.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Unlike "orterun", "prun" is a PMIx-only program that discovers the DVM connection instead of requiring that we explicitly provide it. Only build "prun" if PMIx v2.x is available.
This gets the DVM working again, but still is showing problems for multiple executions. I'll detail those in a separate issue. Thus, the DVM should still be considered "broken".
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The recent changes to remove non-inline atomics have caused
a cascade of issues with cmpset_64 on IA32. cmpxchg8 requires
the use of a bunch of registers (2 for every operand, 3 operands),
and one of them is ebx, which is used by the compiler to do
shared library things. Some compilers don't deal well with
ebx being clobbered (I'm looking at you, gcc 4.1). Rather than
continue trying to fight, remove cmpset_64 from the supported
atomic operations on IA32. Other 32 bit platforms (MIPS32,
SPARC32, ARM, etc.) already don't support a 64 bit compare-and-
swap, so while this might slightly reduce performance, it will
at least be correct.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PSM2 enables support for GPU buffers and CUDA managed memory and it can
directly recognize GPU buffers, handle copies between HFIs and GPUs.
Therefore, it is not required for OMPI to handle GPU buffers for pt2pt cases.
In this patch, we allow the PSM2 MTL to specify when
it does not require CUDA convertor support. This allows us to skip CUDA
convertor init phases and lets PSM2 handle the memory transfers.
This translates to improvements in latency.
The patch enables blocking collectives and workloads with GPU contiguous,
GPU non-contiguous memory.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
Some OSes have hardcoded limits to prevent overflowing over an int32_t.
We can either detect this at configure (which might be a nicer but
incomplete solution), or always force the pipelined protocol over TCP.
As it only covers data larger than 1GB, no performance penalty is to be
expected.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
as the writev and readv support a sum larger than a uint32_t
this version will work. For the other OSes a different patch
is required. This patch is a slight modification of the one
proposed by @ggouaillardet.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This reverts commit b5ea5e0994a827915107e03d6744e73156534a04
This commit reverts a change that is hopefully not necessary. If this
is the case this will fix#4146.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Resolves#3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la"
```
Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Adopting can fail if the server-side hole isn't available on the client.
We can fallback to other ways to load the topology.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
This commit has two changes
1. Adding magic string during handshake can cause
issue when used with older version of MPI. Hence set
RCVTIMEO paramter to 2 second
2. Using single call during handshake instead of
two calls
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>