Commit graph

30036 commits

Author SHA1 Message Date
Howard Pritchard
fae374ce54 OFI: patch OFI MTL for GNI provider
Uncovered a problem using the GNI provider with the OFI MTL.
See https://github.com/ofiwg/libfabric/issues/6194.

Related to #8001

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
(cherry picked from commit d6ac41cbbd)
2020-08-28 14:30:31 -06:00
Brian Barrett
8b2bcbc9ca
Merge pull request #8021 from wckzhang/v4.1.x
coll/tuned: Revert RSB and RS default algorithms
2020-08-27 10:33:42 -07:00
Howard Pritchard
c4128b21a0 suppress icc long double message
Improve configury to check whether icc supports -Wno-long-double.
This prevents seeing 100s of messages like this:

icc: command line warning #10148: option '-Wno-long-double' not supported

A similar patch will be needed for pmix.

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
(cherry picked from commit 6df0e53421)
2020-08-27 16:04:50 +00:00
William Zhang
dceea5ad87 coll/tuned: Revert RSB and RS default algorithms
Reduce scatter block and reduce scatter algorithms were hitting
correctness issues for non commutative strided tests. We will revert to
the original default algorithms for those two collectives (basic linear
and non overlapping respectively) in the non commutative op case.
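
The fallback described above can be sketched as a small decision function. This is only an illustration: `pick_algorithm`, `REVERTED_DEFAULTS`, and the algorithm name strings are hypothetical and do not reflect the actual coll/tuned decision code.

```python
# Hypothetical names; this is not the coll/tuned decision code itself.
# Mapping taken from the commit text: reduce_scatter_block falls back to
# basic linear, reduce_scatter falls back to non overlapping.
REVERTED_DEFAULTS = {
    "reduce_scatter_block": "basic_linear",
    "reduce_scatter": "non_overlapping",
}


def pick_algorithm(collective, op_is_commutative, tuned_choice):
    """Return the algorithm to use for one collective call.

    tuned_choice stands for whatever the newer default rules would select.
    """
    if not op_is_commutative and collective in REVERTED_DEFAULTS:
        # Non-commutative op: use the original default for these two collectives.
        return REVERTED_DEFAULTS[collective]
    return tuned_choice
```

Under these assumptions, only the two named collectives are affected, and only when the op is non-commutative; every other case keeps the newer selection.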

See #8010

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 57b95bcb45)
2020-08-25 15:48:45 -07:00
Jeff Squyres
c4eb2fe40a
Merge pull request #8014 from hppritcha/topic/fix_ofi_mtl_mrecv_v41x
ofi mtl: fix problem with mrecv
2020-08-19 13:50:25 -04:00
Howard Pritchard
833f8b2b41 ofi mtl: fix problem with mrecv
The OFI MTL mrecv was not setting the message in/out argument of
MPI_MRECV to MPI_MESSAGE_NULL.

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
(cherry picked from commit e6f81ed6d6)
2020-08-19 10:22:37 -06:00
Brian Barrett
f63d6e2335
Merge pull request #7998 from wckzhang/v4.1.x
V4.1.x btl/ofi backport disabling EFA and ofi_rxm providers
2020-08-17 12:12:07 -07:00
Jeff Squyres
2a0757f997
Merge pull request #8006 from bosilca/fix/smcuda_alignment
Fix the cacheline usage in the CUDA BTL.
2020-08-17 15:08:42 -04:00
George Bosilca
a81c16b9d0 Fix the cacheline usage in the CUDA BTL.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-08-14 14:36:33 -04:00
William Zhang
27c252e211 btl/ofi: Disable ofi_rxm provider
The ofi_rxm provider is dependent upon the underlying hardware for its
implementation of FI_DELIVERY_COMPLETE. Since this can lead to early
completions, we disable the provider to avoid correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 41acfee2bb)
2020-08-12 14:03:08 -07:00
William Zhang
357ac2e2f5 btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0
EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric
versions. While FI_DELIVERY_COMPLETE would be advertised by the
provider, completions would return too early by not accounting for
bounce buffers on the receive side. This would cause the BTL
to receive early completions that lead to correctness issues.
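
A version gate of this kind might look roughly like the sketch below. The function names, the plain provider name "efa", and the purely numeric version string are assumptions made for illustration; the real BTL check is written in C and may be structured differently.

```python
def parse_version(text):
    """Turn '1.11.2' into (1, 11, 2), padding short versions with zeros.

    Assumes purely numeric, dot-separated fields.
    """
    parts = tuple(int(part) for part in text.split("."))
    return parts + (0,) * (3 - len(parts))


def btl_may_use_provider(provider_name, libfabric_version):
    """Return False when the BTL should skip this provider."""
    # Assumption: the provider is identified by the plain name "efa".
    if provider_name == "efa" and parse_version(libfabric_version) < (1, 12, 0):
        return False
    return True
```

For example, `btl_may_use_provider("efa", "1.11.2")` is rejected while `btl_may_use_provider("efa", "1.12.0")` is accepted; other providers are not affected by this particular check.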

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit a7dcfd9874)
2020-08-12 14:02:51 -07:00
Jeff Squyres
e7b5e2550b
Merge pull request #7997 from bwbarrett/backports/v4.1.x-7957
Use the unaligned SSE memory access primitive.
2020-08-12 10:12:44 -04:00
George Bosilca
c20626e60d Check unaligned ops for correctness.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit c4e88a43a3)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-08-12 02:32:10 +00:00
George Bosilca
eb9ced786a Use the unaligned SSE memory access primitive.
Alter the test to validate misaligned data.

Fixes #7954.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit b6d71aa893)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-08-12 02:32:01 +00:00
Brian Barrett
22163d37b5
Merge pull request #7991 from wckzhang/v4.1.x
v4.1.x: btl/ofi: Use common provider include/exclude list
2020-08-11 11:21:07 -07:00
Brian Barrett
ec04896d70
Merge pull request #7995 from mkurnosov/fix-parsing-locality-strng
v4.1.x: opal/hwloc: fix a typo in parsing locality string: L0 changed to L1
2020-08-11 11:20:08 -07:00
Mikhail Kurnosov
11620038f9 Fix a typo in parsing locality string: L0 changed to L1
(`prte_hwloc_base_get_locality_string` never returns locality string with L0).

Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
(cherry picked from commit 4708458d6b)
2020-08-11 22:35:00 +07:00
William Zhang
abe3aaa4a7 btl/ofi: Use common provider include/exclude list
The btl/ofi does not currently utilize the common ofi include/exclude
list. Added verification code similar to the mtl/ofi that will check if
the info object is in the include or exclude list. If it isn't in the
include list or is in the exclude list, validate_info will return
OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint
when calling getinfo, instead filtering the provider during
validate_info.

This patch also moves the is_in_list MTL function into common code and
adds additional debugging output to the BTL to match the MTL standard.
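
The filtering rule described above can be sketched as follows. This is a simplified model, not the C implementation: the Python signatures are hypothetical, the return codes are placeholders, and provider names are compared by exact match, which is an assumption.

```python
# Placeholder return codes for illustration only.
OPAL_SUCCESS = 0
OPAL_ERROR = -1


def is_in_list(names, item):
    """Assumed semantics: exact match of the provider name against the list."""
    return item in names


def validate_info(provider_name, include_list=None, exclude_list=None):
    """Reject a provider that is missing from the include list
    or present in the exclude list."""
    if include_list and not is_in_list(include_list, provider_name):
        return OPAL_ERROR
    if exclude_list and is_in_list(exclude_list, provider_name):
        return OPAL_ERROR
    return OPAL_SUCCESS
```

With this shape, filtering happens after the provider query rather than through a provider-name hint, which matches the behavior change the commit describes.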

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 9b8f463a76)
2020-08-10 14:28:59 -07:00
Brian Barrett
704d019dc1
Merge pull request #7952 from wckzhang/v4.1.x
coll/tuned: Change the default collective algorithm selection
2020-08-03 12:05:39 -07:00
William Zhang
5a13c5352f coll/tuned: Change the default collective algorithm selection
The default algorithm selections were out of date and not performing
well. After gathering data from OMPI developers, new default algorithm
decisions were selected for:

    allgather
    allgatherv
    allreduce
    alltoall
    alltoallv
    barrier
    bcast
    gather
    reduce
    reduce_scatter_block
    reduce_scatter
    scatter

These results were gathered using the ompi-collectives-tuning package
and then averaged amongst the results gathered from multiple OMPI
developers on their clusters.

You can access the graphs and averaged data here:
https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit ce40cfbaa5)
2020-07-28 15:48:20 -07:00
Brian Barrett
e08e15cf93
Merge pull request #7959 from cniethammer/configure-leak-fix-v4.1.x
v4.1: Fix memory leak in configure, which prevents leak sanitizer usage
2020-07-28 08:06:11 -07:00
Jeff Squyres
36bcc48dc1
Merge pull request #7902 from vspetrov/v4.1.x_hcoll_reduce_scatter
V4.1.x hcoll reduce scatter
2020-07-27 15:50:09 -04:00
Jeff Squyres
6b4e2f02f8
Merge pull request #7956 from tomhers/topic/fix_ofi_btl_include_file
v4.1.x: BTL/OFI: Fix missing include file.
2020-07-27 15:26:01 -04:00
Christoph Niethammer
e0bd64f843 Fix memory leak in configure, which prevents leak sanitizer usage
When building Open MPI with sanitizers, e.g.
$ configure CC=clang CFLAGS=-fsanitize=address ....
the configure test programs are also built with the sanitizers and will
report errors, causing configure to fail.

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-07-22 14:49:10 +02:00
Brian Barrett
18418bfdf6
Merge pull request #7943 from dancejic/multi-4.1.x
v4.1.x: common/ofi: added address format check to fix provider selection
2020-07-20 12:22:20 -07:00
Jeff Squyres
981b8858e7
Merge pull request #7951 from devreal/v4.1.x
(v4.1.x) osc/rdma: fail query_btls if no endpoint for non-local peer is found
2020-07-20 15:12:59 -04:00
Joseph Schuchart
3d08d790e9 osc/rdma: fail query_btls if no endpoint for non-local peer is found
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit eebc451ec8)
2020-07-17 08:38:37 +02:00
Brian Barrett
bd16024a0b
Merge pull request #7946 from gpaulsen/topic/v4.1.x/README_for_SLURM_binding
Adding SLURM binding policy change to README
2020-07-16 12:50:15 -07:00
Geoffrey Paulsen
390772023a Adding SLURM binding policy change to README
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2020-07-15 18:15:47 -05:00
tomhers
229724799e BTL/OFI: Fix missing include file.
The missing include file causes an error when using an external version of LibEvent.

Signed-off-by: tomhers <tom.herschberg@gmail.com>
(cherry picked from commit 88f9d2c90f)
2020-07-15 17:14:55 -04:00
Nikola Dancejic
c018337d03 v4.1.x: common/ofi: added address format check to fix provider selection
Bugfix: provider selection did not differentiate between IPv4
and IPv6 addresses, which could leave some nodes unable
to communicate with each other. Add an address format check
to provider selection to ensure that all nodes use the
same address format.
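
One way to picture the added check is the sketch below. It is a loose model only: the dictionary layout, the format labels "ipv4" and "ipv6", and the function name are hypothetical and are not taken from the common/ofi code.

```python
def select_provider(candidates, required_format):
    """Return the first candidate whose address format matches, else None.

    candidates is a list of dicts with hypothetical keys
    'name' and 'addr_format'.
    """
    for cand in candidates:
        if cand["addr_format"] == required_format:
            return cand
    return None
```

If every node applies the same `required_format`, no node can end up with an IPv6 endpoint while its peers selected IPv4.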

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
(cherry picked from commit 7e46371301)
2020-07-15 00:32:37 +00:00
Jeff Squyres
57044aa0b0
Merge pull request #7934 from jjhursey/v4-jsm-no-bind
v4.1.x: schizo/jsm: Disable binding when direct launched
2020-07-14 11:19:19 -04:00
Jeff Squyres
dfac9fae16
Merge pull request #7933 from artpol84/fix/4.1.x/schizo_slurm
v4.1.x: schizo/slurm: Fix binding detection
2020-07-14 11:19:07 -04:00
Jeff Squyres
91c28f136b
Merge pull request #7935 from jsquyres/pr/v4.1.x/avx-for-strength
v4.1.x: Add supports for MPI_OP using AVX512, AVX2 and MMX
2020-07-14 11:04:11 -04:00
dongzhong
b4e04bbd8a Add supports for MPI_OP using AVX512, AVX2 and MMX
Add logic to handle different architectural capabilities.
Detect the compiler flags necessary to build specialized
versions of the MPI_OP. Once the different flavors (AVX512,
AVX2, AVX) are built, detect at runtime which is the best
match with the current processor capabilities.
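
The runtime selection step can be sketched like this. The names are hypothetical, and the fallback to a "generic" non-vectorized flavor is an assumption; the commit text only states that the best match for the current processor is detected at runtime.

```python
# Preference order, widest vectors first.
PREFERENCE = ("avx512", "avx2", "avx")


def best_flavor(built, cpu_supports):
    """Pick the most capable flavor that was both built and is supported
    by the running processor."""
    for flavor in PREFERENCE:
        if flavor in built and flavor in cpu_supports:
            return flavor
    # Assumption: a plain, non-vectorized implementation is always available.
    return "generic"
```

A build that produced all three flavors but runs on a processor without AVX512 would then select "avx2".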

Add validation checks for loadu 256 and 512 bits.
Add validation tests for MPI_Op.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 14b3c70628)
2020-07-13 13:49:00 -07:00
Joshua Hursey
c71e1fa1db
v4.1.x: schizo/jsm: Disable binding when direct launched
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-07-13 15:31:21 -05:00
Artem Polyakov
8b5b0c5a2a schizo/slurm: Disable binding in case of Slurm direct launch
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit c72f295dfaced7e9f879b1eba3eebb7c35cb5bf5)
2020-07-13 12:54:24 -07:00
Brian Barrett
0d1af43851
Merge pull request #7924 from hkuno/hkuno/cherry-pick_5655d64b
mpi/c: fix param checks in [I]Neighbor_alltoall{v,w}
2020-07-13 12:02:12 -07:00
Brian Barrett
4fd3cdc848
Merge pull request #7923 from jsquyres/pr/v4.1.x/disallow-when-cint-not-equal-to-finteger
v4.1.x: fortran.m4: disallow when sizeof(int) != sizeof(INTEGER)
2020-07-13 12:00:56 -07:00
Gilles Gouaillardet
79a737ca94 mpi/c: fix param checks in [I]Neighbor_alltoall{v,w}
do not check some input parameters when an {in,out}degree is zero

Thanks Junchao Zhang for analyzing and reporting this issue.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 5655d64bd3)
2020-07-10 16:58:52 -06:00
Jeff Squyres
94922937c2 fortran.m4: disallow when sizeof(int) != sizeof(INTEGER)
NOTE: This is intentionally not a cherry pick from master.  Instead,
this is a cherry-pick from the equivalent commit on the v4.0.x branch.
See below.

There is a problem with the mpi_f08 module when sizeof(int) !=
sizeof(INTEGER): the size of TYPE(MPI_Status) is too small.  This
causes buffer overruns when Open MPI is configured with (for example)
sizeof(int)==4 and sizeof(INTEGER)==8, and then you call the mpi_f08
MPI_RECV subroutine.  This will end up copying the resulting C
MPI_Status to the buffer pointing to the Fortran status, but the code
does not know if the Fortran status is an mpif.h status or a
TYPE(MPI_Status) -- it just blindly copies over as if the Fortran
status is an INTEGER array of length MPI_STATUS_SIZE.  Unfortunately,
TYPE(MPI_Status) is actually smaller than this, so we overrun the
buffer.  Hilarity ensues.

The simple fix for this is to make TYPE(MPI_Status) the same size as
INTEGER(MPI_STATUS_SIZE), but we can't do that here on the release
branch because it will break ABI.

This commit does the following:

- checks to see if we're in a sizeof(int) != sizeof(INTEGER) scenario
- if so, if the user has not specifically excluded building the
  mpi_f08 module, display a Giant Error Message (GEM) and abort
  configure.
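
The two-step decision listed above amounts to the following logic, shown here as a Python sketch rather than the m4 used by configure. The function and parameter names are hypothetical.

```python
def fortran_size_check(sizeof_int, sizeof_integer, build_mpi_f08):
    """Return 'abort' when configure should stop with the error message,
    'continue' otherwise.

    build_mpi_f08 is True unless the user explicitly excluded the
    mpi_f08 module.
    """
    if sizeof_int != sizeof_integer and build_mpi_f08:
        return "abort"
    return "continue"
```

For example, sizeof(int)==4 with sizeof(INTEGER)==8 aborts unless mpi_f08 has been excluded, while matching sizes always continue.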

This is unusual; we don't usually abort configure when feature XYZ
can't be built -- if the user didn't specifically ask for XYZ, we
just emit a notice that we won't build XYZ and continue.

This situation is a little different because we're on a release
branch: prior releases have built mpi_f08 by default -- even in this
"bad" scenario.  Hence, in this case, we explicitly tell the user that
this is now a known-bad scenario and abort.  In the GEM, we give the
user two options:

1. Change their compiler flags so that sizeof(int) == sizeof(INTEGER)
   and re-run configure, or
2. Explicitly disable the mpi_f08 module via --enable-mpi-fortran=usempi

Thanks to @ahaichen for reporting the issue.

Note: the proper fix has been implemented on master (i.e., what will
become v5.0.0), but since that breaks ABI, we can't cherry pick it
back here to an existing release branch series. Hence, we
cherry-picked this fix from the v4.0.x branch.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 27836a614b9c29d7636cdf1a9b838b1532281a8a)
2020-07-10 14:39:34 -07:00
Brian Barrett
64c5e2158c
Merge pull request #7911 from bwbarrett/dist/v4.1.x
Prep for v4.1.0rc1 release
2020-07-06 14:15:20 -07:00
Brian Barrett
9dc3a9e85d dist: Update NEWS file for 4.1.0rc1
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-07-06 19:35:45 +00:00
Brian Barrett
55eab422b5 dist: Move version to 4.1.0rc1
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-07-06 19:35:38 +00:00
Jeff Squyres
80eebbee58
Merge pull request #7896 from rhc54/cmr41/rk
v4.1.x: Increment the vpid after assignment
2020-07-06 13:38:32 -04:00
Jeff Squyres
d17685a6c9
Merge pull request #7890 from tkordenbrock/topic/v4.1.x/portals4.call-pml-add_procs
v4.1.x: mtl-portals4: use the active PML to call add_procs()
2020-07-06 07:34:41 -04:00
Jeff Squyres
13b7844513
Merge pull request #7881 from cniethammer/uct-supported-version-update
v4.1.x: Accept UCX 1.8 in configure of btl/uct
2020-07-06 07:34:17 -04:00
Jeff Squyres
a878569386
Merge pull request #7895 from tkordenbrock/topic/v4.1.x/portals4.fix-inappropriate-use-of-abort
v4.1.x: portals4: fix inappropriate use of abort() in mtl-portals4 and coll-portals4 components
2020-07-06 07:32:32 -04:00
Valentin Petrov
6f401186f7 coll/hcoll: compile warning fix
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2020-07-02 08:44:25 +03:00
Valentin Petrov
2441fb2baf coll/hcoll: reduce_scatter(block) interface
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2020-07-02 08:44:22 +03:00