29969 Commits

Author SHA1 Message Date
Howard Pritchard
c4128b21a0 suppress icc long double message
Improve the configury check for whether icc supports the -Wno-long-double option.
This prevents hundreds of warnings like this:

icc: command line warning #10148: option '-Wno-long-double' not supported

A similar patch will be needed for pmix.

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
(cherry picked from commit 6df0e53421)
2020-08-27 16:04:50 +00:00
Brian Barrett
704d019dc1
Merge pull request #7952 from wckzhang/v4.1.x
coll/tuned: Change the default collective algorithm selection
2020-08-03 12:05:39 -07:00
William Zhang
5a13c5352f coll/tuned: Change the default collective algorithm selection
The default algorithm selections were out of date and not performing
well. After gathering data from OMPI developers, new default algorithm
decisions were selected for:

    allgather
    allgatherv
    allreduce
    alltoall
    alltoallv
    barrier
    bcast
    gather
    reduce
    reduce_scatter_block
    reduce_scatter
    scatter

These results were gathered using the ompi-collectives-tuning package
and then averaged amongst the results gathered from multiple OMPI
developers on their clusters.

You can access the graphs and averaged data here:
https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3

Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit ce40cfbaa5)
2020-07-28 15:48:20 -07:00
Brian Barrett
e08e15cf93
Merge pull request #7959 from cniethammer/configure-leak-fix-v4.1.x
v4.1: Fix memory leak in configure, which prevents leak sanitizer usage
2020-07-28 08:06:11 -07:00
Jeff Squyres
36bcc48dc1
Merge pull request #7902 from vspetrov/v4.1.x_hcoll_reduce_scatter
V4.1.x hcoll reduce scatter
2020-07-27 15:50:09 -04:00
Jeff Squyres
6b4e2f02f8
Merge pull request #7956 from tomhers/topic/fix_ofi_btl_include_file
v4.1.x: BTL/OFI: Fix missing include file.
2020-07-27 15:26:01 -04:00
Christoph Niethammer
e0bd64f843 Fix memory leak in configure, which prevents leak sanitizer usage
If Open MPI is built with sanitizers, e.g.
$ configure CC=clang CFLAGS=-fsanitize=address ....
the configure test programs are also built with the sanitizers, and any
errors they report cause configure to fail.
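
A minimal sketch of the kind of probe that stays clean under a leak checker. The function name `sketch_probe` and its body are hypothetical and only illustrate the pattern; they are not the test program changed by this commit.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical body of a configure-time probe; returns 0 on success. */
static int sketch_probe(void)
{
    char *buf = malloc(16);
    if (buf == NULL) {
        return 1;
    }
    memset(buf, 0, 16);
    int ok = (buf[0] == 0);
    /* Without this free(), a build with -fsanitize=address would report
     * a leak at exit and the process would end with a nonzero status,
     * which configure would read as a failed test. */
    free(buf);
    return ok ? 0 : 1;
}
```

Calling `sketch_probe()` from the probe's `main` and returning its result keeps the exit status tied to the feature being tested rather than to the sanitizer.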

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-07-22 14:49:10 +02:00
Brian Barrett
18418bfdf6
Merge pull request #7943 from dancejic/multi-4.1.x
v4.1.x: common/ofi: added address format check to fix provider selection
2020-07-20 12:22:20 -07:00
Jeff Squyres
981b8858e7
Merge pull request #7951 from devreal/v4.1.x
(v4.1.x) osc/rdma: fail query_btls if no endpoint for non-local peer is found
2020-07-20 15:12:59 -04:00
Joseph Schuchart
3d08d790e9 osc/rdma: fail query_btls if no endpoint for non-local peer is found
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit eebc451ec8)
2020-07-17 08:38:37 +02:00
Brian Barrett
bd16024a0b
Merge pull request #7946 from gpaulsen/topic/v4.1.x/README_for_SLURM_binding
Adding SLURM binding policy change to README
2020-07-16 12:50:15 -07:00
Geoffrey Paulsen
390772023a Adding SLURM binding policy change to README
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2020-07-15 18:15:47 -05:00
tomhers
229724799e BTL/OFI: Fix missing include file.
The missing include file causes an error when using an external version of LibEvent.

Signed-off-by: tomhers <tom.herschberg@gmail.com>
(cherry picked from commit 88f9d2c90f)
2020-07-15 17:14:55 -04:00
Nikola Dancejic
c018337d03 v4.1.x: common/ofi: added address format check to fix provider selection
Bugfix: provider selection did not differentiate between IPv4
and IPv6 addresses, which could leave some nodes unable
to communicate with each other. Add an address format check
to provider selection to ensure that all nodes use the
same address format.

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
(cherry picked from commit 7e46371301)
2020-07-15 00:32:37 +00:00
Jeff Squyres
57044aa0b0
Merge pull request #7934 from jjhursey/v4-jsm-no-bind
v4.1.x: schizo/jsm: Disable binding when direct launched
2020-07-14 11:19:19 -04:00
Jeff Squyres
dfac9fae16
Merge pull request #7933 from artpol84/fix/4.1.x/schizo_slurm
v4.1.x: schizo/slurm: Fix binding detection
2020-07-14 11:19:07 -04:00
Jeff Squyres
91c28f136b
Merge pull request #7935 from jsquyres/pr/v4.1.x/avx-for-strength
v4.1.x: Add supports for MPI_OP using AVX512, AVX2 and MMX
2020-07-14 11:04:11 -04:00
dongzhong
b4e04bbd8a Add supports for MPI_OP using AVX512, AVX2 and MMX
Add logic to handle different architectural capabilities
Detect the compiler flags necessary to build specialized
versions of the MPI_OP. Once the different flavors (AVX512,
AVX2, AVX) are built, detect at runtime which is the best
match with the current processor capabilities.
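
The runtime selection described above could look roughly like the following sketch. It assumes a GCC or Clang compiler on x86, where `__builtin_cpu_supports` is available; the `sketch_` names and the bitmask encoding are hypothetical and are not taken from the Open MPI sources.

```c
#include <stddef.h>

/* Flavors ordered from least to most capable. */
enum sketch_flavor {
    SKETCH_SCALAR = 0,
    SKETCH_AVX,
    SKETCH_AVX2,
    SKETCH_AVX512
};

/* Pick the most capable flavor that was built and that the CPU supports.
 * Both arguments are bitmasks with bit (1u << flavor) set per flavor. */
static enum sketch_flavor sketch_pick(unsigned built_mask, unsigned cpu_mask)
{
    unsigned usable = built_mask & cpu_mask;
    if (usable & (1u << SKETCH_AVX512)) return SKETCH_AVX512;
    if (usable & (1u << SKETCH_AVX2))   return SKETCH_AVX2;
    if (usable & (1u << SKETCH_AVX))    return SKETCH_AVX;
    return SKETCH_SCALAR;
}

/* Query the running CPU. On other compilers or architectures this
 * returns 0, so sketch_pick() falls back to the scalar flavor. */
static unsigned sketch_cpu_mask(void)
{
    unsigned mask = 0;
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))     mask |= 1u << SKETCH_AVX;
    if (__builtin_cpu_supports("avx2"))    mask |= 1u << SKETCH_AVX2;
    if (__builtin_cpu_supports("avx512f")) mask |= 1u << SKETCH_AVX512;
#endif
    return mask;
}
```

Keeping the selection separate from the CPU query makes the fallback order easy to check without needing a particular processor.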

Add validation checks for loadu 256 and 512 bits.
Add validation tests for MPI_Op.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 14b3c70628)
2020-07-13 13:49:00 -07:00
Joshua Hursey
c71e1fa1db
v4.1.x: schizo/jsm: Disable binding when direct launched
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-07-13 15:31:21 -05:00
Artem Polyakov
8b5b0c5a2a schizo/slurm: Disable binding in case of Slurm direct launch
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit c72f295dfaced7e9f879b1eba3eebb7c35cb5bf5)
2020-07-13 12:54:24 -07:00
Brian Barrett
0d1af43851
Merge pull request #7924 from hkuno/hkuno/cherry-pick_5655d64b
mpi/c: fix param checks in [I]Neighbor_alltoall{v,w}
2020-07-13 12:02:12 -07:00
Brian Barrett
4fd3cdc848
Merge pull request #7923 from jsquyres/pr/v4.1.x/disallow-when-cint-not-equal-to-finteger
v4.1.x: fortran.m4: disallow when sizeof(int) != sizeof(INTEGER)
2020-07-13 12:00:56 -07:00
Gilles Gouaillardet
79a737ca94 mpi/c: fix param checks in [I]Neighbor_alltoall{v,w}
do not check some input parameters when an {in,out}degree is zero
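
A minimal sketch of this kind of guard, assuming a simplified check on a counts array. The function `sketch_check_counts` and the `SKETCH_` return codes are hypothetical and do not reproduce the actual parameter checks.

```c
#include <stddef.h>

#define SKETCH_OK      0
#define SKETCH_ERR_ARG 1

/* Validate a counts array only when it will actually be read.
 * With a degree of zero there are no neighbors, so the array is
 * never dereferenced and a NULL pointer is acceptable. */
static int sketch_check_counts(const int *counts, int degree)
{
    if (degree == 0) {
        return SKETCH_OK;
    }
    if (degree < 0 || counts == NULL) {
        return SKETCH_ERR_ARG;
    }
    for (int i = 0; i < degree; ++i) {
        if (counts[i] < 0) {
            return SKETCH_ERR_ARG;
        }
    }
    return SKETCH_OK;
}
```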

Thanks Junchao Zhang for analyzing and reporting this issue.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 5655d64bd3)
2020-07-10 16:58:52 -06:00
Jeff Squyres
94922937c2 fortran.m4: disallow when sizeof(int) != sizeof(INTEGER)
NOTE: This is intentionally not a cherry pick from master.  Instead,
this is a cherry-pick from the equivalent commit on the v4.0.x branch.
See below.

There is a problem with the mpi_f08 module when sizeof(int) !=
sizeof(INTEGER): the size of TYPE(MPI_Status) is too small.  This
causes buffer overruns when Open MPI is configured with (for example)
sizeof(int)==4 and sizeof(INTEGER)==8, and then you call the mpi_f08
MPI_RECV subroutine.  This will end up copying the resulting C
MPI_Status to the buffer pointing to the Fortran status, but the code
does not know if the Fortran status is an mpif.h status or a
TYPE(MPI_Status) -- it just blindly copies over as if the Fortran
status is an INTEGER array of length MPI_STATUS_SIZE.  Unfortunately,
TYPE(MPI_Status) is actually smaller than this, so we overrun the
buffer.  Hilarity ensues.
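
As a rough illustration of the size mismatch described above, the following C sketch compares the bytes written by the blind copy with the size of a stand-in for the derived type. The struct layout, the `SKETCH_STATUS_SIZE` value of 6, and all `sketch_` names are assumptions made for illustration; they are not taken from the Open MPI sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed value for illustration only; the real constant comes from
 * the MPI headers. */
#define SKETCH_STATUS_SIZE 6

/* Hypothetical stand-in for the derived type when INTEGER is 8 bytes:
 * three INTEGER-sized fields followed by C-sized private fields. */
struct sketch_f08_status {
    int64_t  source;
    int64_t  tag;
    int64_t  error;
    int32_t  cancelled;  /* sized like a 4-byte C int */
    uint64_t count;      /* sized like a 64-bit size_t */
};

/* Bytes written when the status is copied as an INTEGER array. */
static size_t sketch_bytes_copied(size_t fortran_integer_size)
{
    return SKETCH_STATUS_SIZE * fortran_integer_size;
}

/* Bytes written past the end of the stand-in type for 8-byte INTEGERs. */
static long sketch_overrun_bytes(void)
{
    return (long)sketch_bytes_copied(8) -
           (long)sizeof(struct sketch_f08_status);
}
```

Under these assumptions the array copy writes more bytes than the stand-in type holds, so `sketch_overrun_bytes()` is positive; the exact figure depends on the platform's struct padding.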

The simple fix for this is to make TYPE(MPI_Status) the same size as
INTEGER(MPI_STATUS_SIZE), but we can't do that here on the release
branch because it will break ABI.

This commit does the following:

- checks to see if we're in a sizeof(int) != sizeof(INTEGER) scenario
- if so, if the user has not specifically excluded building the
  mpi_f08 module, display a Giant Error Message (GEM) and abort
  configure.

This is unusual; we don't usually abort configure when feature XYZ
can't be built -- if the user didn't specifically ask for XYZ, we
just emit a notice that we won't build XYZ and continue.

This situation is a little different because we're on a release
branch: prior releases have built mpi_f08 by default -- even in this
"bad" scenario.  Hence, in this case, we explicitly tell the user that
this is now a known-bad scenario and abort.  In the GEM, we give the
user two options:

1. Change their compiler flags so that sizeof(int) == sizeof(INTEGER)
   and re-run configure, or
2. Explicitly disable the mpi_f08 module via --enable-mpi-fortran=usempi

Thanks to @ahaichen for reporting the issue.

Note: the proper fix has been implemented on master (i.e., what will
become v5.0.0), but since that breaks ABI, we can't cherry pick it
back here to an existing release branch series. Hence, we
cherry-picked this fix from the v4.0.x branch.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 27836a614b9c29d7636cdf1a9b838b1532281a8a)
2020-07-10 14:39:34 -07:00
Brian Barrett
64c5e2158c
Merge pull request #7911 from bwbarrett/dist/v4.1.x
Prep for v4.1.0rc1 release
2020-07-06 14:15:20 -07:00
Brian Barrett
9dc3a9e85d dist: Update NEWS file for 4.1.0rc1
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-07-06 19:35:45 +00:00
Brian Barrett
55eab422b5 dist: Move version to 4.1.0rc1
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-07-06 19:35:38 +00:00
Jeff Squyres
80eebbee58
Merge pull request #7896 from rhc54/cmr41/rk
v4.1.x: Increment the vpid after assignment
2020-07-06 13:38:32 -04:00
Jeff Squyres
d17685a6c9
Merge pull request #7890 from tkordenbrock/topic/v4.1.x/portals4.call-pml-add_procs
v4.1.x: mtl-portals4: use the active PML to call add_procs()
2020-07-06 07:34:41 -04:00
Jeff Squyres
13b7844513
Merge pull request #7881 from cniethammer/uct-supported-version-update
v4.1.x: Accept UCX 1.8 in configure of btl/uct
2020-07-06 07:34:17 -04:00
Jeff Squyres
a878569386
Merge pull request #7895 from tkordenbrock/topic/v4.1.x/portals4.fix-inappropriate-use-of-abort
v4.1.x: portals4: fix inappropriate use of abort() in mtl-portals4 and coll-portals4 components
2020-07-06 07:32:32 -04:00
Valentin Petrov
6f401186f7 coll/hcoll: compile warning fix
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2020-07-02 08:44:25 +03:00
Valentin Petrov
2441fb2baf coll/hcoll: reduce_scatter(block) interface
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2020-07-02 08:44:22 +03:00
Ralph Castain
12468349d2
Increment the vpid after assignment
Fix the rank-by operation

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-30 06:54:04 -07:00
Jeff Squyres
bc6587d3fa
Merge pull request #7873 from devreal/osc-ucx-rget-rput-fetch-alignment-v4.1.x
OSC UCX: make sure no-op fetch in rget/rput is properly aligned (v4.1.x)
2020-06-29 15:20:12 -04:00
Jeff Squyres
249c57a4bc
Merge pull request #7889 from devreal/osc-rdma-noncontig-requests-v4.1.x
osc rdma: check for outstanding fragments before completing a request (II) (v4.1.x)
2020-06-29 15:19:49 -04:00
Todd Kordenbrock
20f9ed98f2 mtl-portals4: replace abort() with ompi_rte_abort()
coll-portals4: replace abort() with ompi_rte_abort()

Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
(cherry picked from commit 04b94637dd)
2020-06-29 10:06:12 -05:00
Todd Kordenbrock
540b14fc32 Use the active PML to call add_procs()
ompi_mtl_portals4_get_endpoint() was incorrectly making a direct
call to ompi_mtl_portals4_add_procs().  Instead use the active PML
to call add_procs().  If add_procs() fails, call ompi_rte_abort()
to terminate the job.

Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>

(cherry picked from commit 0a637967fa)
2020-06-29 09:55:23 -05:00
Joseph Schuchart
2d3f862f1d osc rdma: check for outstanding fragments before completing a request in ompi_osc_rdma_put_complete_flush as well
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit caed3b2eed)
2020-06-29 15:44:44 +02:00
Jeff Squyres
220f9ed99f
Merge pull request #7885 from jsquyres/pr/v4.1.x/die-enable-install-libpmix-die-die-die
v4.1.x: pmix3x: Remove --enable-install-libpmix option
2020-06-28 10:41:37 -04:00
Jeff Squyres
b4106e94e1 pmix3x: Remove --enable-install-libpmix option
This option is problematic, and has never worked in an Open MPI v4.0.x
release tarball.  Given that PMIx is now available elsewhere, it isn't
worth fixing this option.

See https://github.com/open-mpi/ompi/issues/6228 for more detail.

NOTE: This is a v4.0.x-specific commit because this option no longer
exists on master because we deleted the entire pmix3x component.
Hence, it's not possible to cherry-pick anything from master back to
the v4.0.x branch.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 447b14061880e218371f9eb0cbe427b8358d45b8)
2020-06-27 10:01:23 -07:00
Brian Barrett
a6d97e2d6d
Merge pull request #7815 from raafatfeki/topic/ime
Topic/ime: Bring over IME component from master to v4.1
2020-06-26 12:40:35 -07:00
Christoph Niethammer
ad1d427d60 Accept UCX 1.8 in configure of btl/uct
The configure script for the btl uct component reports an error for
the new UCX 1.8.0 release because the accepted version range was capped at UCX 1.7.

This fixes #7612

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
(cherry picked from commit 9b10f46126)
2020-06-26 21:20:49 +02:00
Brian Barrett
25abbb219a
Merge pull request #7808 from bwbarrett/backports/v4.1.x-collectives-updates
Backport Collective changes from master to v4.1.x
2020-06-26 12:01:51 -07:00
raafatfeki
6e145188d9 fs/ime & fbtl/ime: Support of IME file system
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
2020-06-26 12:26:51 -04:00
Jeff Squyres
7987a7f56e common_ofi: fix preprocessor macro typo
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f64c30e93c)
2020-06-26 07:56:53 -07:00
Brian Barrett
339ee6378a dist: Add Collectives backports to NEWS
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-06-25 23:07:26 +00:00
William Zhang
db6ed187b2 coll/tuned: Add NULL check to prevent segfault
Signed-off-by: William Zhang <wilzhang@amazon.com>

cr https://code.amazon.com/reviews/CR-23837553

(cherry picked from commit 771f9c011d)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-06-25 23:06:51 +00:00
William Zhang
03758b1ef7 coll/tuned: Fix typos
Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 50640402ab)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-06-25 23:06:51 +00:00
Mikhail Brinskii
7eb94164a0 COLL/TUNED: Add linear scatter using isend for mlnx platform
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit f2cbd4806e)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-06-25 23:06:51 +00:00