1
1

10449 Коммитов

Автор SHA1 Сообщение Дата
George Bosilca
4f754d0156
Optimized datatype description.
Move toward a base type of vector (count, type, blocklen, extent, disp)
with disp and extent applying toward the count repertition and blocklen
being a contiguous memory of type type.
Implement 2 optimizations on this description used during type_commit:
- collapse: successive similar datatype descriptions are collapsed
together with an increased count.
- fusion: fuse successive datatype descriptions in order to minimize the
number of resulting memcpy during pack/unpack.

Fixes at the OMPI datatype level including:
 - Fix the create_hindexed and vector creation.
 - Fix the handling of [get|set]_elements and _count.
 - Correctly compute the dispacement for block indexed types.
 - Support the MPI_LB and MPI_UB deprecation, aka. OMPI_ENABLE_MPI1_COMPAT.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-08-05 09:35:07 -04:00
George Bosilca
f68b06e9ee
Fix incorrect behavior with length == 0
Fixes #6575.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-08-05 09:33:28 -04:00
Howard Pritchard
e547a2b94d
Merge pull request #6838 from ggouaillardet/topic/v4.0.x/misc_fortran_bindings
v4.0.x: misc Fortran related backports
2019-08-02 13:00:31 -06:00
Howard Pritchard
31aa52f11a
Merge pull request #6846 from nysal/topic/v4.0.x/ucx_accumulate_fix
v4.0.x: osc/ucx: Fix data corruption with non-contiguous accumulates
2019-08-02 12:43:40 -06:00
Nysal Jan K.A
359cdf2b53 osc/ucx: Fix data corruption with non-contiguous accumulates
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
(cherry picked from commit 3529d447020684ab305411caa97423826bb40906)
2019-07-26 14:41:08 +05:30
Mikhail Brinskii
b9998a14dc COLL/TUNED: Minor var names/comments fixes
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 65618f8db848613c95cbe112033df94721d326a8)
2019-07-26 11:29:12 +03:00
Mikhail Brinskii
3d5b7b4a1b COLL/TUNED: Update alltoall selection rule for mlx
Use linear with sync alltoall algorithm for certain message/comm size
ranges. Does not affect default fixed decision, unless HPCX (with its
custom parameters) is used or corresponding mca is set.

Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 404c4800688548b021bda68bdf10792424e6b1c5)
2019-07-26 11:28:47 +03:00
KAWASHIMA Takahiro
1ffb9b10bb pcollreq/mpif-h: fix MPIX_Alltoallw_init() binding
These issues were introduced in the recent commit b71af0eca0.
This commit fixes Coverity CID 1451661 and 1451660.

Though `c_info` part was an actual bug, the `c_sendtypes` part was not.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>

(cherry picked from commit open-mpi/ompi@facf8c5e98)
2019-07-24 17:12:10 +09:00
Gilles Gouaillardet
13ba2b0d75 pcollreq/mpif-h: fix MPIX_Alltoallw_init() binding
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@b71af0eca0)
2019-07-24 17:11:44 +09:00
Gilles Gouaillardet
5ab26e490a fortran/mpif-h: fix [i]alltoallw bindings
Fix a regression introduced in open-mpi/ompi@cdaed89d04

Fixes CID 1451610, 1451611 and 1451612

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@ed703bec1b)
2019-07-24 17:10:58 +09:00
Gilles Gouaillardet
fbf7d31fd1 fortran/mpif-h: fix MPI_[I]Alltoallw() binding
- ignore sendcounts, sendispls and sendtypes arguments when MPI_IN_PLACE is used
 - use the right size when an inter-communicator is used.

Thanks Markus Geimer for reporting this.

Refs. open-mpi/ompi#5459

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@cdaed89d04)
2019-07-24 17:10:27 +09:00
Gilles Gouaillardet
aae73d9cf7 fortran/mpif-h: fix C to Fortran error code conversion
- remove incorrect use of OMPI_INT_2_FINT()
 - use homogenous syntax (e.g. c_ierr = PMPI_...())

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@223e6cc537)
2019-07-24 17:10:00 +09:00
Howard Pritchard
667aba9913
Merge pull request #6810 from janjust/v4.0.x
v4.0.x OSC: Reset external request to NULL
2019-07-23 09:05:03 -06:00
Tomislav Janjusic
63605fc466 v4.0.x OSC: Reset external request to NULL to avoid double request
completion
Co-authored with Artem Polyakov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2019-07-12 22:49:34 +03:00
Gilles Gouaillardet
c9e4240e70 mpi: retain operation and datatype in non blocking collectives
MPI standard states a user MPI_Op and/or user MPI_Datatype can be free'd
after a call to a non blocking collective and before the non-blocking
collective completes.
Retain user (only) MPI_Op and MPI_Datatype when the non blocking call is
invoked, and set a request callback so they are free'd when the MPI_Request
completes.

Thanks Thomas Ponweiser for reporting this

Fixes open-mpi/ompi#2151
Fixes open-mpi/ompi#1304

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@0fe756d416)
2019-07-12 10:27:04 +09:00
Aurelien Bouteiller
9499dcfe41 Manage errors in NBC collective ops
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

Correctly bubble up errors in NBC collective operations

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

The error field of requests needs to be rearmed at start, not at create

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

(cherry picked from commit open-mpi/ompi@65660e5999)
2019-07-12 10:26:08 +09:00
Nysal Jan K.A
b6da090090 pml/ucx: Fix the max tag and context id values
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
(cherry picked from commit fe4ef147f81b2ac56661175005de6c330eace690)
2019-07-03 16:38:07 +03:00
Geoff Paulsen
514e273968
Merge pull request #6770 from devreal/osc_winalloc_err_v4.0.x
OSC rdma win allocate: propagate errors to avoid deadlocks (v4.0.x)
2019-06-28 14:04:12 -05:00
Howard Pritchard
6424857029
Merge pull request #6634 from jsquyres/pr/v4.0.x/ob1-fixes
v4.0.x: Cherry pick ob1 fixes from master
2019-06-26 10:49:32 -06:00
Harald Klimach
16e1d74c8f Suggestion to fix division by zero in file view.
In common_ompi_aggregators calc_cost routine:
do not cast the real division to an int intermediately.
This patch removes the obsolete int variable c and assigns
the result of the P_a/P_x division directly to n_as.

With the intermediate int c variable, n_as gets 0 if P_a < P_x,
resulting in a division by 0 when computing n_s.

Signed-off-by: Harald Klimach <harald.klimach@uni-siegen.de>
(cherry picked from commit e222a04ae57e5d09b8559f3c111de1f10a47246a)
2019-06-25 09:29:08 -06:00
Howard Pritchard
28d300915f
Merge pull request #6725 from bosilca/cherrypick/6683
Cherrypick/6683
2019-06-24 13:24:02 -06:00
Joseph Schuchart
c5cf3432b9 OSC rdma win allocate: synchronize error codes across shared memory group
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit 8f27cc26d9845b5b207979b2a4621ef1089d1afb)
2019-06-24 17:49:26 +02:00
Howard Pritchard
73c4aac12d
Merge pull request #6750 from brminich/topic/all2all_linear_sync_fix_v4.0
COLL/BASE: Fix linear sync all2all - v4.0.x
2019-06-17 13:45:52 -06:00
Howard Pritchard
cb8dd569ff
Merge pull request #6747 from devreal/rdma-fetchop-local-v4.0.x
OSC rdma: make sure accumulating in shared memory is safe
2019-06-13 18:55:53 -06:00
Mikhail Brinskii
adba7f55f7 COLL/BASE: Fix linear sync all2all
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 79006f4e5a578d32bfa08de7b98e747ae18706f6)
2019-06-09 21:31:19 +03:00
Joseph Schuchart
900f0fa21f OSC rdma: make sure accumulating in shared memory is safe
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit c67e2291937a09947c421dc84c6b3a8d07bec07f)
2019-06-07 12:45:00 +02:00
Tsubasa Yanagibashi
5dd8830dca mpiext/pcollreq: Add _f08 to procedure names
The procedure names don't contain "_f08" of Fortran 2008 bindings of
Persistent Collective Operations(mpiext/pcollreq/use-mpi-f08).
This fix adds "_f08" to the procedure names of pcollreq/use-mpi-f08,
same as other Fortran 2008 routines in `ompi/mpi/fortran/use-mpi-f08/mod`.

Signed-off-by: Tsubasa Yanagibashi <fj2505dt@aa.jp.fujitsu.com>
(cherry picked from commit 3148b0cfaa04843e7219acb8c7e04f43f6d219fe)
2019-06-07 10:59:01 +09:00
Geoff Paulsen
a04f5f0c70
Merge pull request #6692 from vspetrov/v4.0.x
V4.0.x Coll/hcoll: don't init opal memhooks unless explicitely requested
2019-06-03 15:00:36 -05:00
George Bosilca
a8d5da67db
Fix the man pages for some of the MPI_T_* functions.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-05-31 00:19:14 -04:00
George Bosilca
dbf89404d7
Fix the SPC initialization.
Use the PVAR ctx to save the SPC index, so that no lookup nor
restriction on the SPC vars position is imposed.
Make sure the PVAR are always registered.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-05-31 00:19:14 -04:00
George Bosilca
cadf315ca9
Fixed SPC/MPI_T initialization error.
Signed-off-by: Yong Qin <yongq@mellanox.com>
2019-05-30 17:54:26 -04:00
Howard Pritchard
e78851a6c7
Merge pull request #6704 from edgargabriel/pr/v4.0.x-empty-fileview-fix
common/ompio: fix division by zero problem with empty fview
2019-05-26 09:45:52 -06:00
Howard Pritchard
386ed07d54
Merge pull request #6689 from hoopoepg/topic/suppressed-pml-ucx-mt-warning-v4.0
PML/UCX: disable PML UCX if MT is requested but not supported - v4.0
2019-05-26 09:44:05 -06:00
Edgar Gabriel
c7250cd11d common/ompio: fix division by zero problem with empty fview
When using an empty fileview, a division by zero bug can occur in ompio. Not entirely sure why the problem did not show up previously, but some recent changes trigger that bug in one of our tests.

This pr is part of a fix applied in commit f6b3a0a

Fixes Issue #6703

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-23 13:48:57 -05:00
Valentin Petrov
8f82c899bc Coll/hcoll: don't init opal memhooks unless explicitely requested by user
If user sets HCOLL_EXTERNAL_UCM_EVENTS=1 then we try init opal
    memory framework and register a mem release cb. Otherwise, rely on ucx.

Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2019-05-20 14:00:50 +03:00
Sergey Oblomov
1edd36638b PML/UCX: disable PML UCX if MT is requested but not supported
- in case if multithreading requested but not supported
  disable PML UCX

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit a3578d9ece2b40a349529e7b223df50b0aac64aa)
2019-05-20 09:59:59 +03:00
Yossi Itigin
4f9fb3e9ce OSC/UCX: Fix deadlock with atomic lock
Atomic lock must progress local worker while obtaining the remote lock,
otherwise an active message which actually releases the lock might not
be processed while polling on local memory location.

(picked from master 9d1994b)

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2019-05-20 09:54:01 +03:00
George Bosilca
4946570b24 Remove few warnings identified by @rhc in #5514.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

(cherry picked from commit open-mpi/ompi@6d11a45f44)
2019-05-11 16:38:31 +09:00
Geoff Paulsen
73f9bcc374
Merge pull request #6632 from brminich/topic/shmem_all2all_put_4.0.x
SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h 4.0.x
2019-05-07 08:05:01 -05:00
Howard Pritchard
8e968f16a6
Merge pull request #6626 from ggouaillardet/topic/v4.0.x/mpi_combiner_xyz_integer
v4.0.x: mpi: mark MPI_COMBINER_{HVECTOR,HINDEXED,STRUCT}_INTEGER removed
2019-05-04 07:25:40 -06:00
George Bosilca
48f824327c Fix the leak of fragments for persistent sends.
The rdma_frag attached to the send request was not correctly released
upon request completion, leaking until MPI_Finalize. A quick solution
would have been to add RDMA_FRAG_RETURN at different locations on the
send request completion, but it would have unnecessarily made the
sendreq completion path more complex. Instead, I added the length to
the RDMA fragment so that it can be completed during the remote ack.

Be more explicit on the comment.

The rdma_frag can only be freed once when the peer forced a protocol
change (from RDMA GET to send/recv). Otherwise the fragment will be
returned once all data pertaining to it has been trasnferred.

NOTE: Had to add a typedef for "opal_atomic_size_t" from master into
opal/threads/thread_usage.h into this cherry pick (it is in
opal/include/opal_stdatomic.h on master, but that file does not exist
here on the v4.0.x branch).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a16cf0e4dd6df4dea820fecedd5920df632935b8)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-05-03 06:20:02 -07:00
Brelle Emmanuel
c44821aef5 pml/ob1: fixed local handle sent during PUT control message
In case of using a btl_put in ob1, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the sent handle is necessary the handle in the frag but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request one.

I suggest to check if the handle in the fragment is valid and if not to
send the handle from the request.

Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
(cherry picked from commit e630046a4b82bc01379fb055af4c0e414c2a8e8f)
2019-05-03 05:53:35 -07:00
Mikhail Brinskii
e4ee56d1f3 SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h
The new routine transfers the data asynchronously from the source PE to all
PEs in the OpenSHMEM job. The routine returns immediately. The source and
target buffers are reusable only after the completion of the routine.
After the data is transferred to the target buffers, the counter object
is updated atomically. The counter object can be read either using atomic
operations such as shmem_atomic_fetch or can use point-to-point synchronization
routines such as shmem_wait_until and shmem_test.

Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 2ef5bd8b3671f1e10caf00d06d66d120eac9c5be)
2019-05-02 21:25:59 +03:00
Howard Pritchard
3cafd02c7f
Merge pull request #6572 from markalle/v40x_fortran_macro
in-place conversion macro writes into INPUT argument
2019-05-01 11:54:12 -06:00
Howard Pritchard
41ef5c7a10
Merge pull request #6594 from vspetrov/osc_ucx_rget_rkey_fix
OSC/UCX: use correct rkey for atomic_fadd in rget/rput
2019-05-01 11:53:17 -06:00
Gilles Gouaillardet
e2638dbbf2 mpi: mark MPI_COMBINER_{HVECTOR,HINDEXED,STRUCT}_INTEGER removed
unless configure'd with --enable-mpi1-compatibility

This is a one-off commit for the v4.0.x branch since these symbols were
simply removed from master.

Thanks Lisandro Dalcin for reporting this.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-05-01 10:50:57 +09:00
Mark Allen
c081757462 fixing an unsafe usage of integer disps[] (romio321 gpfs)
There are a couple MPI_Alltoallv calls in ad_gpfs_aggrs.c where the
send/recv data comes from places like req[r].lens, and the send
buffer and send displacements for example were being calculated as
    sbuf = pick one of the reqs: req[bottom].lens
    sdisps[r] = req[r].lens - req[bottom].lens
which might be okay if the .lens was data inside of req[] so they'd
all be close to each other. But each .lens field is just a pointer
that's malloced, so those addresses can be all over the place, so the
integer-sized sdisps[] isn't safe.

I changed it to have a new extra array sbuf and rbuf for those two
Alltoallv calls, and copied the data into the sbuf from the same
locations it used to be setting up the sdisps[] at, and after the
Alltoallv I copy the data out of the new rbuf into the same
locations it used to be setting up the rdisps[] at.

For what it's worth I was able to get this to fail -np 2 on a GPFS
filesystem with hints romio_cb_write enable. I didn't whittle the
test down to something small, but it was failing in an
MPI_File_write_all call.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit d85cac8f1a11495415b67ecab69d2ae1cd19d155)
2019-04-25 14:22:19 -04:00
Brelle Emmanuel
2a4bc0cb58 pml/ob1: fixed exit from get_frag_fail when falling back on btl_put
In the case the btl_get fails Ob1 tries to fallback on btl_put first but
the return code was ignored. So the code fell back on both btl_put and
btl_send.

Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
(cherry picked from commit 9c689f2225d29aa152627f39bab841afead254af)
2019-04-22 14:25:34 -07:00
Valentin Petrov
2947ab2dbc OSC/UCX: correctly handle NULL origin addr and MPI_NO_OP
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2019-04-17 10:35:34 +03:00
Valentin Petrov
68c88e86f2 OSC/UCX: use correct rkey for atomic_fadd in rget/rput
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2019-04-16 15:24:57 +03:00