
30126 Commits

Author SHA1 Message Date
William Zhang
357ac2e2f5 btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0
EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric
versions. While FI_DELIVERY_COMPLETE would be advertised by the
provider, completions would return too early by not accounting for
bounce buffers on the receive side. This would cause the BTL
to receive early completions that lead to correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit a7dcfd98742d7f44d96d74a60a38529894d4099c)
2020-08-12 14:02:51 -07:00
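The version gate described in commit 357ac2e2f5 can be pictured roughly as follows. This is a minimal sketch, not the code from the commit: `make_version` and `btl_provider_allowed` are hypothetical names, and the packed version layout (major in the high 16 bits) is an assumption chosen to resemble the layout of libfabric's `FI_VERSION` macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical packed version: major in the high 16 bits, minor in the low. */
static unsigned make_version(unsigned major, unsigned minor)
{
    return (major << 16) | minor;
}

/* Returns true when the BTL may use the named provider.
 * Only "efa" is version-gated in this sketch. */
static bool btl_provider_allowed(const char *prov_name, unsigned runtime_version)
{
    assert(prov_name != NULL);
    if (strcmp(prov_name, "efa") == 0 &&
        runtime_version < make_version(1, 12)) {
        /* Older releases may signal delivery completion too early. */
        return false;
    }
    return true;
}
```

In a real build the runtime version would presumably come from the library itself (libfabric exposes `fi_version()`), rather than from a hand-built value.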
Jeff Squyres
e7b5e2550b
Merge pull request #7997 from bwbarrett/backports/v4.1.x-7957
Use the unaligned SSE memory access primitive.
2020-08-12 10:12:44 -04:00
George Bosilca
c20626e60d Check unaligned ops for correctness.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit c4e88a43a3a1f6a13f8eaa8b90d2a0a8caba5adf)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-08-12 02:32:10 +00:00
George Bosilca
eb9ced786a Use the unaligned SSE memory access primitive.
Alter the test to validate misaligned data.

Fixes #7954.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit b6d71aa893db51c393df9c993fde597eff5457e8)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-08-12 02:32:01 +00:00
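To illustrate what "unaligned SSE memory access primitive" means in commit eb9ced786a, here is a small sketch of an element-wise sum that uses `_mm_loadu_si128` and `_mm_storeu_si128`, which accept any address, instead of the aligned variants, which require 16-byte alignment. The function name `sum_int32` is hypothetical and the code is not taken from Open MPI; a scalar loop covers the tail and non-SSE2 builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* inout[i] += in[i] for 32-bit integers; neither pointer needs to be
 * 16-byte aligned. */
static void sum_int32(const int32_t *in, int32_t *inout, size_t count)
{
    assert(count == 0 || (in != NULL && inout != NULL));
    size_t i = 0;
#if defined(__SSE2__)
    /* Process four elements per iteration with unaligned loads and stores. */
    for (; i + 4 <= count; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(in + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(inout + i));
        _mm_storeu_si128((__m128i *)(inout + i), _mm_add_epi32(a, b));
    }
#endif
    /* Scalar loop for the remaining elements (or for all of them). */
    for (; i < count; i++) {
        inout[i] += in[i];
    }
}
```

Passing `buf + 1` for an `int32_t` buffer is a simple way to exercise the misaligned case, in the spirit of the test change the commit mentions.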
Brian Barrett
22163d37b5
Merge pull request #7991 from wckzhang/v4.1.x
v4.1.x: btl/ofi: Use common provider include/exclude list
2020-08-11 11:21:07 -07:00
Brian Barrett
ec04896d70
Merge pull request #7995 from mkurnosov/fix-parsing-locality-strng
v4.1.x: opal/hwloc: fix a typo in parsing locality string: L0 changed to L1
2020-08-11 11:20:08 -07:00
Mikhail Kurnosov
11620038f9 Fix a typo in parsing locality string: L0 changed to L1
(`prte_hwloc_base_get_locality_string` never returns locality string with L0).

Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
(cherry picked from commit 4708458d6b12d1b7a8e11fa3c3f784c545780707)
2020-08-11 22:35:00 +07:00
William Zhang
abe3aaa4a7 btl/ofi: Use common provider include/exclude list
The btl/ofi does not currently utilize the common ofi include/exclude
list. Added verification code similar to the mtl/ofi that will check if
the info object is in the include or exclude list. If it isn't in the
include list or is in the exclude list, validate_info will return
OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint
when calling getinfo, instead filtering the provider during
validate_info.

This patch also moves the is_in_list MTL function into common code and
adds additional debugging output to the BTL to match the MTL standard.

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 9b8f463a768206a26e5b1dcea0612a403462b1d0)
2020-08-10 14:28:59 -07:00
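The include/exclude filtering that commit abe3aaa4a7 describes could look roughly like the sketch below. The names `name_in_list` and `validate_provider` are hypothetical stand-ins for `is_in_list` and `validate_info`, the comma-separated list format is an assumption, and the return values 0 and -1 stand in for a success code and `OPAL_ERROR`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `name` equals one entry of the comma-separated `list`. */
static bool name_in_list(const char *list, const char *name)
{
    assert(name != NULL);
    if (list == NULL) {
        return false;
    }
    size_t name_len = strlen(name);
    const char *p = list;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current entry */
        if (len == name_len && strncmp(p, name, len) == 0) {
            return true;
        }
        p += len;
        if (*p == ',') {
            p++;                        /* skip the separator */
        }
    }
    return false;
}

/* Returns 0 to accept the provider, -1 to reject it.
 * A NULL list means "not configured" (an assumption of this sketch). */
static int validate_provider(const char *include, const char *exclude,
                             const char *prov_name)
{
    if (include != NULL && !name_in_list(include, prov_name)) {
        return -1;
    }
    if (exclude != NULL && name_in_list(exclude, prov_name)) {
        return -1;
    }
    return 0;
}
```

Filtering after the query, rather than passing a provider name as a hint, means one routine can serve both the BTL and the MTL.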
Brian Barrett
704d019dc1
Merge pull request #7952 from wckzhang/v4.1.x
coll/tuned: Change the default collective algorithm selection
2020-08-03 12:05:39 -07:00
William Zhang
5a13c5352f coll/tuned: Change the default collective algorithm selection
The default algorithm selections were out of date and not performing
well. After gathering data from OMPI developers, new default algorithm
decisions were selected for:

    allgather
    allgatherv
    allreduce
    alltoall
    alltoallv
    barrier
    bcast
    gather
    reduce
    reduce_scatter_block
    reduce_scatter
    scatter

These results were gathered using the ompi-collectives-tuning package
and then averaged amongst the results gathered from multiple OMPI
developers on their clusters.

You can access the graphs and averaged data here:
https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit ce40cfbaa53406be71319041e13e893b0def7ad9)
2020-07-28 15:48:20 -07:00
Brian Barrett
e08e15cf93
Merge pull request #7959 from cniethammer/configure-leak-fix-v4.1.x
v4.1: Fix memory leak in configure, which prevents leak sanitizer usage
2020-07-28 08:06:11 -07:00
Jeff Squyres
36bcc48dc1
Merge pull request #7902 from vspetrov/v4.1.x_hcoll_reduce_scatter
V4.1.x hcoll reduce scatter
2020-07-27 15:50:09 -04:00
Jeff Squyres
6b4e2f02f8
Merge pull request #7956 from tomhers/topic/fix_ofi_btl_include_file
v4.1.x: BTL/OFI: Fix missing include file.
2020-07-27 15:26:01 -04:00
Christoph Niethammer
e0bd64f843 Fix memory leak in configure, which prevents leak sanitizer usage
If building Open MPI with sanitizers, e.g.
$ configure CC=clang CFLAGS=-fsanitize=address ....
configure test programs are also built with the sanitizers and will
report errors, causing configure to fail.

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-07-22 14:49:10 +02:00
Brian Barrett
18418bfdf6
Merge pull request #7943 from dancejic/multi-4.1.x
v4.1.x: common/ofi: added address format check to fix provider selection
2020-07-20 12:22:20 -07:00
Jeff Squyres
981b8858e7
Merge pull request #7951 from devreal/v4.1.x
(v4.1.x) osc/rdma: fail query_btls if no endpoint for non-local peer is found
2020-07-20 15:12:59 -04:00
Joseph Schuchart
3d08d790e9 osc/rdma: fail query_btls if no endpoint for non-local peer is found
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit eebc451ec8313975998a63e25938fb4e0b4d6c44)
2020-07-17 08:38:37 +02:00
Brian Barrett
bd16024a0b
Merge pull request #7946 from gpaulsen/topic/v4.1.x/README_for_SLURM_binding
Adding SLURM binding policy change to README
2020-07-16 12:50:15 -07:00
Geoffrey Paulsen
390772023a Adding SLURM binding policy change to README
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2020-07-15 18:15:47 -05:00
tomhers
229724799e BTL/OFI: Fix missing include file.
The missing include file causes an error when using an external version of LibEvent.

Signed-off-by: tomhers <tom.herschberg@gmail.com>
(cherry picked from commit 88f9d2c90f1730f2e0b5bc4951893f60ff5a1332)
2020-07-15 17:14:55 -04:00
Nikola Dancejic
c018337d03 v4.1.x: common/ofi: added address format check to fix provider selection
bugfix: provider selection would not differentiate between IPv4
and IPv6 addresses, which would leave some nodes unable
to communicate with each other. Adding a check for address
format to provider selection to ensure that all nodes use the
same address format.

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
(cherry picked from commit 7e463713014ae58f7e78d7a5b49e9e63d62e374e)
2020-07-15 00:32:37 +00:00
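One way to picture the address format check from commit c018337d03 is a selection loop that skips any candidate whose address format differs from the one all nodes are expected to use. Everything in this sketch is hypothetical (`addr_format_t`, `provider_t`, `select_provider`); the actual change operates on libfabric info objects, which are not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address format tags. */
typedef enum { ADDR_IPV4, ADDR_IPV6, ADDR_OTHER } addr_format_t;

/* Hypothetical candidate record returned by a provider query. */
typedef struct {
    const char   *name;
    addr_format_t addr_format;
} provider_t;

/* Returns the index of the first candidate whose address format equals
 * `wanted`, or -1 if no candidate matches. */
static int select_provider(const provider_t *cands, size_t n,
                           addr_format_t wanted)
{
    assert(cands != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cands[i].addr_format == wanted) {
            return (int)i;
        }
    }
    return -1;
}
```

Without the format comparison, two nodes could each take the first candidate in their list and end up with one on IPv4 and the other on IPv6.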
Jeff Squyres
57044aa0b0
Merge pull request #7934 from jjhursey/v4-jsm-no-bind
v4.1.x: schizo/jsm: Disable binding when direct launched
2020-07-14 11:19:19 -04:00
Jeff Squyres
dfac9fae16
Merge pull request #7933 from artpol84/fix/4.1.x/schizo_slurm
v4.1.x: schizo/slurm: Fix binding detection
2020-07-14 11:19:07 -04:00
Jeff Squyres
91c28f136b
Merge pull request #7935 from jsquyres/pr/v4.1.x/avx-for-strength
v4.1.x: Add supports for MPI_OP using AVX512, AVX2 and MMX
2020-07-14 11:04:11 -04:00
dongzhong
b4e04bbd8a Add supports for MPI_OP using AVX512, AVX2 and MMX
Add logic to handle different architectural capabilities.
Detect the compiler flags necessary to build specialized
versions of the MPI_OP. Once the different flavors (AVX512,
AVX2, AVX) are built, detect at runtime which is the best
match with the current processor capabilities.

Add validation checks for loadu 256 and 512 bits.
Add validation tests for MPI_Op.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 14b3c706289cfc26e53b13efd3eb85641636e459)
2020-07-13 13:49:00 -07:00
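The "build several flavors, pick the best at runtime" idea in commit b4e04bbd8a can be sketched as a selection over two bitmasks: the flavors that were compiled in and the flavors the current CPU supports. The flag names and `pick_flavor` are hypothetical, and how the CPU mask is obtained (for example via CPUID) is deliberately left out.

```c
#include <assert.h>

/* Hypothetical flavor flags. */
enum {
    FLAVOR_AVX    = 1 << 0,
    FLAVOR_AVX2   = 1 << 1,
    FLAVOR_AVX512 = 1 << 2,
    FLAVOR_ALL    = (1 << 3) - 1
};

/* Returns the widest flavor that was both built and is supported by the
 * CPU, or 0 to mean "fall back to the generic implementation". */
static unsigned pick_flavor(unsigned built, unsigned cpu)
{
    static const unsigned preference[] = {
        FLAVOR_AVX512, FLAVOR_AVX2, FLAVOR_AVX   /* widest first */
    };
    assert((built & ~(unsigned)FLAVOR_ALL) == 0);
    assert((cpu & ~(unsigned)FLAVOR_ALL) == 0);

    unsigned usable = built & cpu;
    for (unsigned i = 0; i < sizeof(preference) / sizeof(preference[0]); i++) {
        if (usable & preference[i]) {
            return preference[i];
        }
    }
    return 0;
}
```

Keeping the two masks separate reflects the two stages in the commit message: configure decides what can be built, and the running process decides what can be executed.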
Joshua Hursey
c71e1fa1db
v4.1.x: schizo/jsm: Disable binding when direct launched
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-07-13 15:31:21 -05:00
Artem Polyakov
8b5b0c5a2a schizo/slurm: Disable binding in case of Slurm direct launch
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit c72f295dfaced7e9f879b1eba3eebb7c35cb5bf5)
2020-07-13 12:54:24 -07:00
Brian Barrett
0d1af43851
Merge pull request #7924 from hkuno/hkuno/cherry-pick_5655d64b
mpi/c: fix param checks in [I]Neighbor_alltoall{v,w}
2020-07-13 12:02:12 -07:00
Brian Barrett
4fd3cdc848
Merge pull request #7923 from jsquyres/pr/v4.1.x/disallow-when-cint-not-equal-to-finteger
v4.1.x: fortran.m4: disallow when sizeof(int) != sizeof(INTEGER)
2020-07-13 12:00:56 -07:00
Gilles Gouaillardet
79a737ca94 mpi/c: fix param checks in [I]Neighbor_alltoall{v,w}
do not check some input parameters when an {in,out}degree is zero

Thanks Junchao Zhang for analyzing and reporting this issue.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 5655d64bd3e064cdd6925b994e555e75d00d0f08)
2020-07-10 16:58:52 -06:00
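The commit message for 79a737ca94 does not say which parameters are skipped, so the following is only an illustration of the pattern: validate the send-side arguments when the outdegree is non-zero and the receive-side arguments when the indegree is non-zero. `check_neighbor_args` is a hypothetical helper, and the count arrays are used as example parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 0 if the arguments are acceptable, -1 otherwise.
 * With a zero degree the matching array is never read, so a NULL
 * pointer is tolerated for it. */
static int check_neighbor_args(int indegree, int outdegree,
                               const int *sendcounts, const int *recvcounts)
{
    assert(indegree >= 0 && outdegree >= 0);
    if (outdegree > 0 && sendcounts == NULL) {
        return -1;
    }
    if (indegree > 0 && recvcounts == NULL) {
        return -1;
    }
    return 0;
}
```

A process with no neighbors in one direction can then pass NULL for the corresponding arrays without triggering a parameter error.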
Jeff Squyres
94922937c2 fortran.m4: disallow when sizeof(int) != sizeof(INTEGER)
NOTE: This is intentionally not a cherry pick from master.  Instead,
this is a cherry-pick from the equivalent commit on the v4.0.x branch.
See below.

There is a problem with the mpi_f08 module when sizeof(int) !=
sizeof(INTEGER): the size of TYPE(MPI_Status) is too small.  This
causes buffer overruns when Open MPI is configured with (for example)
sizeof(int)==4 and sizeof(INTEGER)==8, and then you call the mpi_f08
MPI_RECV subroutine.  This will end up copying the resulting C
MPI_Status to the buffer pointing to the Fortran status, but the code
does not know if the Fortran status is an mpif.h status or a
TYPE(MPI_Status) -- it just blindly copies over as if the Fortran
status is an INTEGER array of length MPI_STATUS_SIZE.  Unfortunately,
TYPE(MPI_Status) is actually smaller than this, so we overrun the
buffer.  Hilarity ensues.

The simple fix for this is to make TYPE(MPI_Status) the same size as
INTEGER(MPI_STATUS_SIZE), but we can't do that here on the release
branch because it will break ABI.

This commit does the following:

- checks to see if we're in a sizeof(int) != sizeof(INTEGER) scenario
- if so, if the user has not specifically excluded building the
  mpi_f08 module, display a Giant Error Message (GEM) and abort
  configure.

This is unusual; we don't usually abort configure when feature XYZ
can't be built -- if the user didn't specifically ask for XYZ, we
just emit a notice that we won't build XYZ and continue.

This situation is a little different because we're on a release
branch: prior releases have built mpi_f08 by default -- even in this
"bad" scenario.  Hence, in this case, we explicitly tell the user that
this is now a known-bad scenario and abort.  In the GEM, we give the
user two options:

1. Change their compiler flags so that sizeof(int) == sizeof(INTEGER)
   and re-run configure, or
2. Explicitly disable the mpi_f08 module via --enable-mpi-fortran=usempi

Thanks to @ahaichen for reporting the issue.

Note: the proper fix has been implemented on master (i.e., what will
become v5.0.0), but since that breaks ABI, we can't cherry pick it
back here to an existing release branch series. Hence, we
cherry-picked this fix from the v4.0.x branch.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 27836a614b9c29d7636cdf1a9b838b1532281a8a)
2020-07-10 14:39:34 -07:00
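The decision that commit 94922937c2 adds to configure reduces to a small predicate. The real check lives in `fortran.m4`; the C sketch below only restates the logic, with a hypothetical `check_fortran_integer_size` function and action enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of the configure-time check. */
typedef enum { CONFIGURE_CONTINUE, CONFIGURE_ABORT } configure_action_t;

/* Abort only when the sizes differ AND the mpi_f08 module would be built.
 * Excluding mpi_f08 (e.g. --enable-mpi-fortran=usempi) lets configure
 * continue even with mismatched sizes. */
static configure_action_t check_fortran_integer_size(size_t c_int_size,
                                                     size_t f_integer_size,
                                                     bool build_mpi_f08)
{
    assert(c_int_size > 0 && f_integer_size > 0);
    if (c_int_size != f_integer_size && build_mpi_f08) {
        return CONFIGURE_ABORT;
    }
    return CONFIGURE_CONTINUE;
}
```

For example, sizeof(int)==4 with sizeof(INTEGER)==8 aborts when mpi_f08 is enabled, which matches the "bad" scenario the message describes.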
Brian Barrett
64c5e2158c
Merge pull request #7911 from bwbarrett/dist/v4.1.x
Prep for v4.1.0rc1 release
2020-07-06 14:15:20 -07:00
Brian Barrett
9dc3a9e85d dist: Update NEWS file for 4.1.0rc1
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-07-06 19:35:45 +00:00
Brian Barrett
55eab422b5 dist: Move version to 4.1.0rc1
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-07-06 19:35:38 +00:00
Jeff Squyres
80eebbee58
Merge pull request #7896 from rhc54/cmr41/rk
v4.1.x: Increment the vpid after assignment
2020-07-06 13:38:32 -04:00
Jeff Squyres
d17685a6c9
Merge pull request #7890 from tkordenbrock/topic/v4.1.x/portals4.call-pml-add_procs
v4.1.x: mtl-portals4: use the active PML to call add_procs()
2020-07-06 07:34:41 -04:00
Jeff Squyres
13b7844513
Merge pull request #7881 from cniethammer/uct-supported-version-update
v4.1.x: Accept UCX 1.8 in configure of btl/uct
2020-07-06 07:34:17 -04:00
Jeff Squyres
a878569386
Merge pull request #7895 from tkordenbrock/topic/v4.1.x/portals4.fix-inappropriate-use-of-abort
v4.1.x: portals4: fix inappropriate use of abort() in mtl-portals4 and coll-portals4 components
2020-07-06 07:32:32 -04:00
Valentin Petrov
6f401186f7 coll/hcoll: compile warning fix
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2020-07-02 08:44:25 +03:00
Valentin Petrov
2441fb2baf coll/hcoll: reduce_scatter(block) interface
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2020-07-02 08:44:22 +03:00
Ralph Castain
12468349d2
Increment the vpid after assignment
Fix the rank-by operation

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-30 06:54:04 -07:00
Jeff Squyres
bc6587d3fa
Merge pull request #7873 from devreal/osc-ucx-rget-rput-fetch-alignment-v4.1.x
OSC UCX: make sure no-op fetch in rget/rput is properly aligned (v4.1.x)
2020-06-29 15:20:12 -04:00
Jeff Squyres
249c57a4bc
Merge pull request #7889 from devreal/osc-rdma-noncontig-requests-v4.1.x
osc rdma: check for outstanding fragments before completing a request (II) (v4.1.x)
2020-06-29 15:19:49 -04:00
Todd Kordenbrock
20f9ed98f2 mtl-portals4: replace abort() with ompi_rte_abort()
coll-portals4: replace abort() with ompi_rte_abort()

Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
(cherry picked from commit 04b94637dd2c5e05edf0917d02b9d1e48316d063)
2020-06-29 10:06:12 -05:00
Todd Kordenbrock
540b14fc32 Use the active PML to call add_procs()
ompi_mtl_portals4_get_endpoint() was incorrectly making a direct
call to ompi_mtl_portals4_add_procs().  Instead use the active PML
to call add_procs().  If add_procs() fails, call ompi_rte_abort()
to terminate the job.

Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>

(cherry picked from commit 0a637967fac24246c84da0952423b3dc1b2f62f1)
2020-06-29 09:55:23 -05:00
Joseph Schuchart
2d3f862f1d osc rdma: check for outstanding fragments before completing a request in ompi_osc_rdma_put_complete_flush as well
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit caed3b2eed478c76f34d56b5d0495bf26e44a9bb)
2020-06-29 15:44:44 +02:00
Jeff Squyres
220f9ed99f
Merge pull request #7885 from jsquyres/pr/v4.1.x/die-enable-install-libpmix-die-die-die
v4.1.x: pmix3x: Remove --enable-install-libpmix option
2020-06-28 10:41:37 -04:00
Jeff Squyres
b4106e94e1 pmix3x: Remove --enable-install-libpmix option
This option is problematic, and has never worked in an Open MPI v4.0.x
release tarball.  Given that PMIx is now available elsewhere, it isn't
worth fixing this option.

See https://github.com/open-mpi/ompi/issues/6228 for more detail.

NOTE: This is a v4.0.x-specific commit because this option no longer
exists on master because we deleted the entire pmix3x component.
Hence, it's not possible to cherry-pick anything from master back to
the v4.0.x branch.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 447b14061880e218371f9eb0cbe427b8358d45b8)
2020-06-27 10:01:23 -07:00
Brian Barrett
a6d97e2d6d
Merge pull request #7815 from raafatfeki/topic/ime
Topic/ime: Bring over IME component from master to v4.1
2020-06-26 12:40:35 -07:00
Christoph Niethammer
ad1d427d60 Accept UCX 1.8 in configure of btl/uct
The configure script for the btl uct component reports an error for
the new UCX 1.8.0 version because the check accepted versions only up to UCX 1.7.

This fixes #7612

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
(cherry picked from commit 9b10f46126b0a5aa796d9fec063c2d454a9a1bc9)
2020-06-26 21:20:49 +02:00