
30919 Commits

Author SHA1 Message Date
Aurelien Bouteiller
efbc6ff6a5
Merge pull request #7798 from abouteiller/mpi-next/unbounderr-self
MPI-4 error handling: 'unbound' errors to MPI_COMM_SELF
2020-08-03 15:59:14 -04:00
Aurelien Bouteiller
ee149fcfcb
MPI3 (unchanged in 4) says that errors after MPI_REQUEST_FREE are FATAL
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-31 17:49:38 -04:00
Aurelien Bouteiller
bec7dfc1b1
Errors in non-api calls remain fatal
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-31 17:49:35 -04:00
Aurelien Bouteiller
e0df0f4bd9
Make errors_mpi3 compat a global mpi-3 compatibility flag
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-31 17:48:47 -04:00
Aurelien Bouteiller
7dfe6c1adc
Thread-shift errors reported by PMIx to the main MPI progress engine
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

make things happen before the terminal call

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-31 17:48:44 -04:00
Artem Polyakov
dfb0ae748f
Merge pull request #7681 from janjust/master-tls-refactor_v3
ompi/osc/ucx: remove global TLS tables
2020-07-31 10:39:53 -07:00
Tommy Janjusic
2c8da2c0a9 Further code reduction and simplifications.
Co-authored-by: Artem Polyakov <artpol84@gmail.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 20:00:22 +03:00
Tomislav Janjusic
cbfc9a3263 opal/mca/common/ucx: Use new TSD api
Co-authored-by: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Tomislav Janjusic
72296e12f4 opal/common/ucx:
-mutex lock/unlock suggestions
-common destructor/cleanup

Co-authored-with: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Tomislav Janjusic
27ba4b612f ompi/osc/ucx: Remove workerpool's global thread storage tables.
Co-authored-by: Artem Y. Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-30 00:21:26 +03:00
Brian Barrett
41df122083
Merge pull request #7730 from wckzhang/newdefaults
coll/tuned: Change the default collective algorithm selection
2020-07-28 15:27:46 -07:00
William Zhang
ce40cfbaa5 coll/tuned: Change the default collective algorithm selection
The default algorithm selections were out of date and not performing
well. After gathering data from OMPI developers, new default algorithm
decisions were selected for:

    allgather
    allgatherv
    allreduce
    alltoall
    alltoallv
    barrier
    bcast
    gather
    reduce
    reduce_scatter_block
    reduce_scatter
    scatter

These results were gathered using the ompi-collectives-tuning package
and then averaged amongst the results gathered from multiple OMPI
developers on their clusters.

You can access the graphs and averaged data here:
https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-07-28 10:41:48 -07:00
Austen Lauria
d0152eb51e
Merge pull request #7940 from awlauria/revert_libevent_commit
Revert "Address a race condition in libevent select."
2020-07-28 11:34:59 -04:00
Jeff Squyres
c07d77fbf2
Merge pull request #7957 from bosilca/fix/avx_alignment
Use the unaligned SSE memory access primitive.
2020-07-27 15:50:40 -04:00
Artem Polyakov
e5ef80fe8c
Merge pull request #7936 from janjust/master-new-tsd-thread-api
Master: new thread-specific-data (tsd) api
2020-07-24 14:58:03 -07:00
Ralph Castain
863a058f8d
Merge pull request #7964 from rhc54/topic/sync
Sync to PRRTE master
2020-07-24 14:57:32 -07:00
Ralph Castain
8c0269cd4f
Sync to PRRTE master
Pickup the FT and libev cleanups

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-07-24 14:11:34 -07:00
Tomislav Janjusic
d809f6ba27 New TSD API interface fix for various components
Co-authored-by: Artem Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:40 +03:00
Tomislav Janjusic
cba5a0e117 Rename tsd interface function calls
Co-authored-by: Artem Polyakov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Tomislav Janjusic
cb1955bb53 Fix renamed interface functions for argo, q, and pthreads
Co-authored-by: Artem Polyakov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Tomislav Janjusic
07dc86eb3a opal/thread: New TSD API
Co-authored-by: Artem Polyakov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Ralph Castain
06c585c316
Merge pull request #7962 from rhc54/topic/sync
Sync to PMIx and PRRTE master
2020-07-23 16:22:32 -07:00
Ralph Castain
c0bc89dc50
Sync to PMIx and PRRTE master
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-07-23 12:35:17 -07:00
Aurelien Bouteiller
06c563625a
Add a test for mpi_errors_mpi3 behavior and non-catastrophic errors
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-23 05:09:29 -04:00
Aurélien Bouteiller
b37202c74e
Add compliance mode with MPI-4 routing of errors to MPI_COMM_SELF by
default
And other streamlining of aborting behavior.

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>

Remove OMPI_COMM_ERRORS and use NOHANDLE macros instead.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

route unbound errors to self error handler

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

Do not raise the error handler from within components

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-23 05:09:29 -04:00
George Bosilca
c4e88a43a3
Check unaligned ops for correctness.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-22 11:26:07 -04:00
Joshua Ladd
366e92ce54
Merge pull request #7860 from vspetrov/hcoll_reduce_scatter
Coll/Hcoll: reduce_scatter(block) interface
2020-07-22 09:45:34 -04:00
George Bosilca
b6d71aa893
Use the unaligned SSE memory access primitive.
Alter the test to validate misaligned data.

Fixes #7954.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-22 01:19:12 -04:00
Jeff Squyres
30ba603c2c
Merge pull request #7953 from cniethammer/configure-leak-fix
Fix memory leak in configure, which prevents leak sanitizer usage
2020-07-21 16:34:27 -04:00
Christoph Niethammer
6564c1b942 Fix memory leak in configure, which prevents leak sanitizer usage
If building Open MPI with sanitizers, e.g.
$ configure CC=clang CFLAGS=-fsanitize=address ....
configure test programs are also built with the sanitizers and will
report errors, causing configure to fail.

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-07-21 21:28:29 +02:00
Aurelien Bouteiller
816acbdfb1
Merge pull request #7840 from abouteiller/mpi-next/init-errh
MPI-4: Initial error handler
2020-07-21 11:55:14 -04:00
Joseph Schuchart
60aa97b301
Merge pull request #7948 from devreal/osc-rdma-check-endpoints
osc/rdma: fail query_btls if no endpoint for non-local peer is found
2020-07-20 15:14:25 +02:00
bosilca
1139d9ecae
Merge pull request #7931 from bosilca/fix/7928
Fix the BTL API conversion for the SMCUDA BTL
2020-07-18 17:35:39 -04:00
Joseph Schuchart
eebc451ec8 osc/rdma: fail query_btls if no endpoint for non-local peer is found
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-07-16 17:06:35 +02:00
Aurelien Bouteiller
7118755ae8
Add a tester for the initial error handler
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-16 03:10:32 -04:00
Aurelien Bouteiller
5f1f7fe313
route errors to self/initial error handler depending upon the state of
MPI initialization

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-16 03:10:32 -04:00
Aurélien Bouteiller
bed909c3ba
Read the info key mpi_initial_errhandler from spawn/spawn_multiple
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>

Use the same env to transmit the initial error handler to spawnees

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-16 03:10:32 -04:00
Aurélien Bouteiller
83d0f92152
Set the initial error handler onto predefined communicators
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>

update to the predefined initial error handler selection

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-16 03:10:32 -04:00
Aurélien Bouteiller
3cd85a9ec5
Add the initial_errhandler info key to MPI_INFO_ENV and populate the
value from prun-populated parameters

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>

Allow errhandlers to invoke the initial error handler before MPI_INIT

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

Indentation

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-16 03:10:32 -04:00
Aurélien Bouteiller
703b8c356f
Make error_class and error_string callable before/after
MPI_INIT/FINALIZE

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>

make lazy initialization opal unlikely

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-16 03:10:32 -04:00
Ralph Castain
7702dfcdd2
Merge pull request #7942 from rhc54/topic/init
Ensure we init and protect values
2020-07-15 07:59:01 -07:00
George Bosilca
8bc1f3d8fb Don't allow any asynchronous CUDA operations.
There are 2 reasons for this:
- pending CUDA events are not progressed by this BTL, so anything that becomes
  asynchronous will never be completed.
- we use the packed data on the shared memory backing file, and this will be
  returned to the peer process upon return (thus if we copy asynchronously we
  might not copy the right data).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-15 01:37:09 -04:00
George Bosilca
0e32b0acef Avoid a lock if no CUDA IPC operations are pending.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-15 01:35:34 -04:00
Ralph Castain
a574addce9
Ensure we init and protect values
Scrub the entire ompi_rte.c file to initialize and protect values
received from PMIx.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-07-14 15:25:14 -07:00
Austen Lauria
67d90166cf Revert "Address a race condition in libevent select."
We do not want to be patching upstream components anymore.
The proper method is to get this merged upstream, then
pull it in the next upstream release.

This reverts commit c39fb5758a772c062e20db9b42f2b06805884802.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-07-14 16:23:21 -04:00
George Bosilca
fd4ca394e2 Make the smcuda BTL great again.
It has been broken for months because of the lack of initialization of the
HWLOC library. The smcuda process creating the backing file (local rank 0)
uses opal_cache_line_size to align the objects in the backing file, and the
opal_cache_line_size is initialized by default to 128. Later on, when the rest
of the processes attach the same backing file, HWLOC has been called and the
cache size has now been updated to the correct value. If this value differs
from the default one (and it usually does, as most cache line sizes are 64
bytes right now), the objects in the backing file will be misaligned.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-14 01:48:08 -04:00
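
The misalignment described in the smcuda commit above can be illustrated with a small sketch. It is not code from Open MPI: the helper names (align_up, layout) and the object sizes are hypothetical, and it only assumes that each object in the backing file is placed at the next cache-line boundary. The creator lays out the file with the default value of 128, while an attaching process recomputes the same offsets with 64.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def layout(object_sizes, cache_line_size):
    """Return the start offset of each object when every object is placed
    at the next cache-line boundary in a shared backing file."""
    offsets = []
    cursor = 0
    for size in object_sizes:
        cursor = align_up(cursor, cache_line_size)
        offsets.append(cursor)
        cursor += size
    return offsets


# Hypothetical object sizes in bytes; not taken from the smcuda BTL.
sizes = [40, 24, 72]

# Local rank 0 creates the file before HWLOC has set the real value.
creator_view = layout(sizes, 128)
# Other ranks attach later, after the value has been updated to 64.
attacher_view = layout(sizes, 64)

print("creator :", creator_view)
print("attacher:", attacher_view)
```

Under these assumptions the two processes agree only on the first object; every later offset differs, so an attaching process would look for objects where the creator never put them.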
George Bosilca
96e8cbe25f First step on fixing the BTL API conversion for the SMCUDA BTL
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-07-13 14:46:10 -04:00
Joshua Ladd
aa8f7f4ede
Merge pull request #7893 from bureddy/cuda-ucx
UCX: initialize cuda from ucx pml component
2020-07-13 14:18:48 -04:00
bosilca
1f237f5fc9
Merge pull request #7419 from bosilca/topic/avx512
Add support for AVX512/AVX2/SSE/MMX
2020-07-13 11:56:50 -04:00
Devendar Bureddy
2547e24c55 UCX: initialize cuda from ucx pml component
Signed-off-by: Devendar Bureddy <devendar@mellanox.com>
2020-07-12 18:41:40 +03:00