Commit graph

27738 commits

Author SHA1 Message Date
Ralph Castain
3493c43468 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-22 10:48:00 -07:00
Ralph Castain
b4ad81da85 Silence warnings about verbose output
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 2c9655bb631742fd7693e00289d1949f4b2fc155)
2017-09-22 09:05:03 -07:00
Ralph Castain
9edea02b46 Merge pull request #4246 from rhc54/topic/spawn
Fully support OMPI spawn options.
2017-09-21 11:23:34 -07:00
Jeff Squyres
9708e9dd21 Merge pull request #4245 from jsquyres/pr/disable-hwloc-cuda
hwloc: do not build hwloc CUDA support if --without-cuda used (and also always disable hwloc GL and OpenCL support)
2017-09-21 13:43:01 -04:00
Ralph Castain
fe9b584c05 Fully support OMPI spawn options. Fix a bug in the round-robin mappers where we weren't adding nodes to the job map node array, and so resources were not released
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 285d8cfef74ffc899e9c51e1d9c597b7fb2ceb89)
2017-09-21 10:29:27 -07:00
Brice Goglin
84a721d17a hwloc: disable GL and OpenCL in the hwloc component
Open MPI doesn't use GL or OpenCL OS devices, so just disable them in
hwloc.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-09-21 08:25:46 -07:00
Jeff Squyres
f5d51dc2f5 hwloc: do not build hwloc CUDA support if --without-cuda used
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-09-21 08:24:54 -07:00
Gilles Gouaillardet
d704712bad Merge pull request #4242 from ggouaillardet/topic/libnl3
configury: do not use libnl-3 when it is half broken
2017-09-21 16:16:00 +09:00
Gilles Gouaillardet
94747a1d28 configury: do not use libnl-3 when it is half broken
Refs. open-mpi/ompi#4211

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-09-21 15:27:59 +09:00
Jeff Squyres
a182b4fbaa Merge pull request #3883 from jsquyres/pr/readme-pathscale-update
README: Pathscale updates
2017-09-20 10:07:48 -04:00
Gilles Gouaillardet
da2966ace1 Merge pull request #2191 from ggouaillardet/topic/remove_disable_mpi_io
configury: remove the --disable-mpi-io option
2017-09-20 15:55:48 +09:00
Gilles Gouaillardet
b9315edb85 configury: remove the --disable-mpi-io option
Fixes open-mpi/ompi#2185

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-09-20 14:39:09 +09:00
bosilca
ab68aced23 Merge pull request #3738 from bosilca/topic/tcp_event_count
Fix the TCP performance impact when BTL not used
2017-09-19 23:08:58 -04:00
Brian Barrett
2c59fb7a58 Merge pull request #4221 from AntoineD/master
Fix: Outdated README link #4220
2017-09-19 19:48:13 -07:00
Gabe Saba
c6235a9a0f reachable: add tests
Add test suite for netlink and weighted reachable components.  We
don't have a great way of running components through unit tests
today, so make them stand-alone tests that are run with mpirun
and such.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
Brian Barrett
ae122c4b17 reachable: Change ownership to Amazon
Amazon is going to use the reachable framework to fix some connection
bugs in the TCP BTL, so claim support ownership of the weighted and
netlink components.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
Gabe Saba
9e53605a6f reachable: Implement netlink component
Wire up the libnl utilities Jeff and Ralph added previously to
the netlink reachable component so that it actually does work.
The algorithm is a bit simplistic, but should work for our use
cases.  If there's a route, assume the two interfaces can talk.
If there's no gateway, assume the two interfaces are in the
same subnet, and give preference to that connection.  If there's
a gateway, assume there's a route, but the interfaces are not
in the same subnet.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
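The decision rule described in 9e53605a6f (route present means reachable, no gateway means same subnet and preferred) can be sketched as below. This is a minimal illustration only: the enum, its values, and the function name `classify_reachability` are hypothetical and are not taken from the netlink component.

```c
/* Hypothetical weights: a larger value means a more preferred pairing. */
enum reach_weight {
    REACH_NONE = 0,        /* no route between the two interfaces */
    REACH_ROUTED = 1,      /* route exists, but goes through a gateway */
    REACH_SAME_SUBNET = 2  /* route exists and needs no gateway */
};

/* Classify one interface pair from two facts a route lookup would give:
 * whether a route exists, and whether that route uses a gateway. */
enum reach_weight classify_reachability(int has_route, int has_gateway)
{
    if (!has_route) {
        return REACH_NONE;
    }
    /* No gateway: assume both interfaces sit in the same subnet. */
    return has_gateway ? REACH_ROUTED : REACH_SAME_SUBNET;
}
```

A caller would rank candidate interface pairs by the returned weight, so that same-subnet pairs win over routed ones.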
Gabe Saba
4d81006222 reachable: Add IPv6 support to libnl code
Add IPv6 support to the netlink component's utility
wrappers around libnl-3.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
Brian Barrett
4d5bfd0429 reachable: Simplify gateway check in netlink
The netlink component's libnl wrapper code returned the
next hop in the route table to allow the calling code
to differentiate between same and different networks,
which is a fine comparison for IPv4, but is pretty
expensive for IPv6 (coming soon to a netlink component
near you).  Rather than provide extra information
(the address of the next hop), just provide whether
there is a gateway or not, which is all the netlink
component actually needs.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
Brian Barrett
a543e7f130 reachable: remove libnl-1 support from netlink
The netlink reachable component has never been released in a usable
form, but had code copied from usNIC to support both libnl-1 and
libnl-3.  If nothing else, this code was a little buggy in
handling the case where libnl-3 but not libnl-route-3 were
installed.  Jeff and I decided to drop libnl-1 support from the
netlink reachable component, given that it's getting pretty old
and the weighted component provides the same information that
the TCP BTL and OOB are using today, so libnl-1 customers won't
see a step backwards from where they are today.

Signed-off-by: Brian Barrett <bbarrett@mazon.com>
2017-09-19 19:42:54 -07:00
Gabe Saba
3f8d294191 reachable: Enable weighted component / fix interface
Based on work from usNIC, the best way to use the reachability
information the reachable components return is to build a
connectivity graph between the two peers and run a bipartite
graph solver.  Rather than returning the "best" pairing,
the reachability framework now returns the entire mapping,
allowing a (soon to be added) graph solver to build the
"optimal" connectivity pairing.

Practically, this means changing the return type of the
reachable() function and rewriting the weighted_reachable()
function to return the full mapping.  The netlink_reachable()
function still always returns NULL.

At the same time, fix bit-rot in the weighted component and
enable builds of the component by removing the opal_ignore.
Also, add IPv6 support to the weighted component to support
both use cases in the TCP BTL.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
Gabe Saba
8f2df42055 reachable: Initialize / Finalize reachable framework
Initialize the reachable framework during opal_init() and tear
it back down during opal_finalize().  The framework was never
used, so the lack of initialization didn't matter, but this is
a required step in using the framework.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
Brian Barrett
6048c543fa reachable: Rename code copied from usnic
Ralph and Jeff created the reachable framework and added the
netlink component based on code copied from the usnic btl.
However, they never renamed all the symbols from the libnl
compatibility code.  This patch finishes the rename.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
Brian Barrett
502f383f4d util: Add link-local check to net interface
Add a check for link-local IPv6 addresses to the net
interface to support better computation of network
pairings in the weighted reachable component.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-19 19:42:54 -07:00
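For readers unfamiliar with the link-local check mentioned in 502f383f4d, here is a minimal sketch using the standard POSIX macro `IN6_IS_ADDR_LINKLOCAL`, which tests for the fe80::/10 prefix. The function name `is_ipv6_link_local` is hypothetical and is not the name used in the Open MPI net utility code.

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Return true if `text` parses as an IPv6 address in fe80::/10.
 * Unparsable input is treated as "not link-local". */
bool is_ipv6_link_local(const char *text)
{
    struct in6_addr addr;

    if (inet_pton(AF_INET6, text, &addr) != 1) {
        return false;
    }
    return IN6_IS_ADDR_LINKLOCAL(&addr) != 0;
}
```

Link-local addresses are only meaningful on a single link, which is presumably why a pairing computation would want to tell them apart from global addresses.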
Ralph Castain
a09c090709 Merge pull request #4237 from rhc54/topic/cnct
Fix tool connection logic so we properly search for default session server, perform specified number of retries, etc.
2017-09-19 14:27:43 -07:00
Ralph Castain
e575c4d6f9 Fix tool connection logic so we properly search for default session server, perform specified number of retries, etc.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 7c755e01004f8b86c71f1729662979ea45ab1adb)
2017-09-19 13:35:46 -07:00
Howard Pritchard
bfd5ed6e98 Merge pull request #1910 from hpcraink/pr/shmem_fix_f77
Fix shmem.fh: fails to compile with F77 fixed-form compiled programs...
2017-09-19 14:28:08 -06:00
Ralph Castain
16de607607 Merge pull request #4234 from rhc54/topic/upstream
Ensure we update the total_slots_alloc field on each job. Correct the client example
2017-09-19 09:03:04 -07:00
Jeff Squyres
daa48906f5 Merge pull request #4233 from jsquyres/pr/remove-extraneous-done-output-from-rmaps-base
rmaps/base: remove debugging "DONE" message
2017-09-19 11:34:52 -04:00
Ralph Castain
658c3d1d51 Ensure we update the total_slots_alloc field on each job. Correct the client example
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit bcedd12a8a24dd246f04ff13b4fd2f1bbac6ce5a)
2017-09-19 08:14:14 -07:00
Jeff Squyres
7cccee9d92 rmaps/base: remove debugging "DONE" message
Thanks to Ben Menadue for reporting and supplying the patch.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-09-19 07:10:00 -07:00
Ralph Castain
3b3ce243bb Merge pull request #4214 from karasevb/pmix1_hang_fix
pmix: fixed immediate request for PMIx v1.2
2017-09-19 06:51:25 -07:00
Ralph Castain
48bbf707c3 Merge pull request #4232 from rhc54/topic/local
Implement support for "local" range when publishing data
2017-09-18 20:18:06 -07:00
Ralph Castain
5708872112 Implement support for "local" range when publishing data
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 2d54f7e0dd3a47260b0b2634aae3361316005933)
2017-09-18 19:34:08 -07:00
Jeff Squyres
2e5e7b8891 Merge pull request #4224 from bwbarrett/graph-coverity
util: Fix graph allocation size
2017-09-18 15:04:34 -04:00
Ralph Castain
08c93091f7 Merge pull request #4223 from rhc54/topic/stale
Remove stale tools
2017-09-18 09:43:06 -07:00
Josh Hursey
252be7ffb0 Merge pull request #4215 from jjhursey/fix/plm-lsf-rc
plm/lsf: Improve error message if lsb_launch fails
2017-09-18 11:14:25 -05:00
Josh Hursey
5cb5eb68f5 Merge pull request #4204 from jjhursey/fix/master/without-lsf
Fix --without-lsf and LSF in default search path
2017-09-18 11:04:02 -05:00
Ralph Castain
ed508010b4 Remove stale tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-18 07:30:47 -07:00
Antoine Dechaume
08e5ab4d9a Fix: Outdated README link #4220
2017-09-18 11:31:07 +02:00
Boris Karasev
2929f52ffc pmix1: fixed immediate request
This fixes a hang in immediate PMIx requests. PMIx v1.2 does not support
the info key `PMIX_IMMEDIATE`, which leads to the hang. For such requests,
the fix uses the key `PMIX_OPTIONAL` so that the request does not go to the server.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2017-09-18 09:17:44 +03:00
Brian Barrett
abbe2ffb9f util: Fix graph allocation size
Fix an allocation bug that could occur on non-LP64 platforms.
match_edges_out is an array of integers representing the
edges of the graph (where vertices are ints), with two ints
for every edge.  The previous code allocated enough space
for num_edges * sizeof(int*), which happens to be the same
as num_edges * 2 * sizeof(int) on LP64 platforms, but would
be wrong on all other platforms.

Fixes: CID 1417754

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-17 19:49:26 +00:00
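The allocation bug described in abbe2ffb9f can be illustrated with a small sketch. The helper names `match_edges_bytes` and `alloc_match_edges` are hypothetical and are not the functions changed by the commit; only the sizing arithmetic follows the commit message.

```c
#include <stdlib.h>

/* Bytes needed for num_edges edges, each stored as two ints (u, v). */
size_t match_edges_bytes(size_t num_edges)
{
    return num_edges * 2 * sizeof(int);
}

/* Allocate the edge array with the corrected size.
 *
 * The buggy form sized the buffer as
 *     num_edges * sizeof(int *)
 * which equals the size above only when a pointer is exactly twice the
 * size of an int (as on LP64). Where pointers and ints have the same
 * size, that form allocates half of what is needed. */
int *alloc_match_edges(size_t num_edges)
{
    return malloc(match_edges_bytes(num_edges));
}
```

Sizing from the element type (`int`) and the element count (two per edge) keeps the expression correct regardless of the platform's pointer width.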
Ralph Castain
79f82f2c6d Merge pull request #4217 from rhc54/topic/dvm
Complete the fix of the ORTE DVM.
2017-09-16 14:53:24 -07:00
Ralph Castain
3c914a7a97 Complete the fix of the ORTE DVM. We will now use "prun" instead of "orterun -hnp foo" to execute jobs. This provides the feature of automatic discovery of the orte-dvm so you don't need to manually enter URI's or contact file locations. All IO is forwarded to prun.
Still in the "needs to be done" category:

* mapping/ranking/binding options aren't correctly supported

* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-16 13:13:07 -07:00
Brian Barrett
bffcc3bca0 util: move graph solver from usnic to util
Cisco wrote a bipartite graph solver to properly solve
interface pair selection for usNIC.  Using the reachable
framework, the TCP BTL (and possibly the runtime network
code) can use the graph solver to make more optimal pair
selection.  Jeff was happy to have the code more broadly
used, but didn't have time to do the move, hence this
commit.

There are a couple of minor changes to the code compared
to the usNIC version.  Obviously, the functions have
been renamed to match naming convention for their new
home.  Since it's easier to write unit tests for
util/ code, the unit tests have been made first class
tests run at "make check" time.  This last bit required
moving some of the definitions into a new header,
bipartite_graph_internal.h, so that they could be
included in both the library code and the test code.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-15 15:08:47 -07:00
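To give a feel for what pair selection over a bipartite graph involves, the sketch below computes an unweighted maximum matching between local and remote interfaces with the classic augmenting-path method. It is not the solver moved in bffcc3bca0, whose algorithm and API are not shown in this log; `MAX_IF`, `try_augment`, and `max_pairing` are hypothetical names chosen for illustration.

```c
#include <string.h>

#define MAX_IF 8 /* hypothetical upper bound on interfaces per side */

/* Try to find an augmenting path starting at local interface u.
 * match_remote[v] holds the local interface paired with remote v, or -1. */
static int try_augment(int u, int n_remote, int reach[MAX_IF][MAX_IF],
                       int visited[MAX_IF], int match_remote[MAX_IF])
{
    for (int v = 0; v < n_remote; v++) {
        if (!reach[u][v] || visited[v]) {
            continue;
        }
        visited[v] = 1;
        /* Take v if it is free, or if its current partner can move. */
        if (match_remote[v] < 0 ||
            try_augment(match_remote[v], n_remote, reach, visited,
                        match_remote)) {
            match_remote[v] = u;
            return 1;
        }
    }
    return 0;
}

/* Return the number of pairs found; reach[u][v] != 0 means local
 * interface u can talk to remote interface v. */
int max_pairing(int n_local, int n_remote, int reach[MAX_IF][MAX_IF],
                int match_remote[MAX_IF])
{
    int pairs = 0;

    for (int v = 0; v < MAX_IF; v++) {
        match_remote[v] = -1;
    }
    for (int u = 0; u < n_local; u++) {
        int visited[MAX_IF];
        memset(visited, 0, sizeof(visited));
        pairs += try_augment(u, n_remote, reach, visited, match_remote);
    }
    return pairs;
}
```

A weighted solver would additionally prefer better-quality links; this sketch only maximizes the number of pairs.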
Joshua Hursey
89c1aaf646 plm/lsf: Improve error message if lsb_launch fails
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-09-15 09:45:58 -05:00
Rainer Keller
d529c289db Fails to compile with F77 fixed-form compiled programs...
Convert to F77 notation and split into two (shorter) lines.
Also, make use of the SHMEM_MAX_NAME_LEN definition, by moving
that first.

Signed-off-by: Rainer Keller <rainer.keller@hft-stuttgart.de>
2017-09-15 15:09:43 +02:00
Ralph Castain
f69466d633 Merge pull request #4213 from rhc54/topic/dvm2
Backport changes from PMIx reference server
2017-09-14 13:17:53 -07:00
Ralph Castain
7c7d8a69a0 Backport changes from PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-14 11:48:56 -07:00
Nathan Hjelm
0851122cce btl/openib/udcm: add support for connection across subnets
This commit adds the code necessary to support forming connections across
subnets. The primary changes are to 1) add the gid to the modex, and 2)
use the gid to create the address handle.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2017-09-14 06:42:06 -10:00