
10357 Commits

Author SHA1 Message Date
Ralph Castain
e17203b4f7
Silence Coverity warning
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-08-12 12:42:41 -07:00
Ralph Castain
14f3fbb8c1
Provide locality for all procs on node
Update PMIx to latest master to get supporting updates. For
connect/accept (part of comm_spawn as well), lookup locality for all
participating procs on the node and compute the relative locality so it
can be used for MPI operations.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit d202e10c1407d2f9177e9b871eadde1f25526676)
2019-08-12 12:42:40 -07:00
Tomislav Janjusic
e9a0343780 osc/ucx: Fix possible win creation/destruction race condition
To avoid fully initializing the osc/ucx component for MPI applications
that do not use One-Sided functionality, the initialization happens
at the first MPI window creation.

This commit ensures atomicity of global state modifications.

ported from: 6678ac0f557935b291ec2310216b7ea46e0c13b1
Signed-off-by: Artem Polyakov <artpol84@gmail.com>

fix alignment, and fix error path
2019-08-12 22:23:17 +03:00
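The idea of deferring initialization to the first window creation while keeping the global state updates atomic can be sketched as follows. This is a minimal illustration that assumes a single mutex guards the state; the names `state_lock`, `num_windows`, `comp_initialized`, `window_create` and `window_destroy` are hypothetical and are not taken from the osc/ucx source.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical global state; these are not the names used in osc/ucx. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int  num_windows      = 0;      /* number of live windows */
static bool comp_initialized = false;  /* has the expensive setup run? */
static int  init_calls       = 0;      /* how often the setup ran */

/* Stand-ins for the expensive component setup and teardown. */
static void component_setup(void)    { init_calls++; }
static void component_teardown(void) { }

/* Called on every window creation. The check and the update happen under
 * one lock, so two threads cannot both observe "not initialized". */
void window_create(void)
{
    pthread_mutex_lock(&state_lock);
    if (!comp_initialized) {
        component_setup();
        comp_initialized = true;
    }
    num_windows++;
    pthread_mutex_unlock(&state_lock);
}

/* Called on every window destruction; the last window tears the state down. */
void window_destroy(void)
{
    pthread_mutex_lock(&state_lock);
    assert(num_windows > 0);
    num_windows--;
    if (0 == num_windows && comp_initialized) {
        component_teardown();
        comp_initialized = false;
    }
    pthread_mutex_unlock(&state_lock);
}
```

Without the lock, a creation racing with the destruction of the last window could see the component as initialized just before it is torn down.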
Gilles Gouaillardet
39ec580b76 coll/base: only retain datatypes/op if the request has not yet completed
A non-blocking collective might return ompi_request_null, so nothing
should be retained in that case.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@63d3ccde9d)
2019-08-13 00:13:40 +09:00
Gilles Gouaillardet
ae26957619 coll/base: cleanup ompi_coll_base_nbc_request_t elements
Since ompi_coll_base_nbc_request_t is to be used in an
opal_free_list_t, it must be returned in a "clean" state.
So clean up some data in the completion callback subroutines.

This fixes a regression introduced in open-mpi/ompi@0fe756d416

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@0862c409f1)
2019-08-13 00:13:40 +09:00
Gilles Gouaillardet
b37c85dcca coll/libnbc: fix ompi_coll_libnbc_request_t parent
base ompi_coll_libnbc_request_t on top of ompi_coll_base_nbc_request_t
to correctly support the retention of datatypes/operators

This fixes a regression introduced in open-mpi/ompi@0fe756d416

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@f8eef0fde9)
2019-08-13 00:13:40 +09:00
Sergey Oblomov
2fa112c0a6 UCX: added PPN hint for UCX context
- added PPN hint for UCX context init

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 43186e494b47ca29e8d5e7a864b6b98b8e873195)

Conflicts:
	opal/mca/common/ucx/common_ucx_wpool.c
2019-08-09 11:51:30 +03:00
George Bosilca
8b794235b8
Update the datatype dump to match the actual types.
Update the comments to better reflect what is going on.
Minor indentations.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-08-05 09:37:47 -04:00
George Bosilca
4f754d0156
Optimized datatype description.
Move toward a base type of vector (count, type, blocklen, extent, disp)
with disp and extent applying toward the count repetition and blocklen
being a contiguous memory of type type.
Implement 2 optimizations on this description used during type_commit:
- collapse: successive similar datatype descriptions are collapsed
together with an increased count.
- fusion: fuse successive datatype descriptions in order to minimize the
number of resulting memcpy during pack/unpack.

Fixes at the OMPI datatype level including:
 - Fix the create_hindexed and vector creation.
 - Fix the handling of [get|set]_elements and _count.
 - Correctly compute the displacement for block indexed types.
 - Support the MPI_LB and MPI_UB deprecation, aka. OMPI_ENABLE_MPI1_COMPAT.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-08-05 09:35:07 -04:00
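The "collapse" optimization described in this commit can be illustrated with a small sketch. It assumes a simplified descriptor in which repetition i starts at byte offset disp + i * extent; the `desc_t` layout and the `can_collapse` and `collapse` helpers are hypothetical and much simpler than the real OPAL datatype engine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified descriptor (not the OPAL structure). */
typedef struct {
    int       type;      /* basic type id */
    size_t    count;     /* number of repetitions */
    size_t    blocklen;  /* contiguous elements per repetition */
    ptrdiff_t extent;    /* byte stride between repetitions */
    ptrdiff_t disp;      /* byte offset of the first repetition */
} desc_t;

/* b continues a when it has the same shape and starts exactly where
 * repetition number a->count of a would start. */
static int can_collapse(const desc_t *a, const desc_t *b)
{
    return a->type == b->type && a->blocklen == b->blocklen &&
           a->extent == b->extent &&
           b->disp == a->disp + (ptrdiff_t)a->count * a->extent;
}

/* Collapse successive similar entries in place by increasing the count.
 * Returns the new number of entries. */
size_t collapse(desc_t *d, size_t n)
{
    assert(d != NULL || 0 == n);
    if (0 == n) return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (can_collapse(&d[out], &d[i])) {
            d[out].count += d[i].count;
        } else {
            d[++out] = d[i];
        }
    }
    return out + 1;
}
```

Fewer entries mean fewer descriptor iterations during pack and unpack, which is the stated goal of both optimizations.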
George Bosilca
f68b06e9ee
Fix incorrect behavior with length == 0
Fixes #6575.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-08-05 09:33:28 -04:00
Howard Pritchard
e547a2b94d
Merge pull request #6838 from ggouaillardet/topic/v4.0.x/misc_fortran_bindings
v4.0.x: misc Fortran related backports
2019-08-02 13:00:31 -06:00
Howard Pritchard
31aa52f11a
Merge pull request #6846 from nysal/topic/v4.0.x/ucx_accumulate_fix
v4.0.x: osc/ucx: Fix data corruption with non-contiguous accumulates
2019-08-02 12:43:40 -06:00
Nysal Jan K.A
359cdf2b53 osc/ucx: Fix data corruption with non-contiguous accumulates
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
(cherry picked from commit 3529d447020684ab305411caa97423826bb40906)
2019-07-26 14:41:08 +05:30
Mikhail Brinskii
b9998a14dc COLL/TUNED: Minor var names/comments fixes
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 65618f8db848613c95cbe112033df94721d326a8)
2019-07-26 11:29:12 +03:00
Mikhail Brinskii
3d5b7b4a1b COLL/TUNED: Update alltoall selection rule for mlx
Use the linear-with-sync alltoall algorithm for certain message/comm size
ranges. Does not affect the default fixed decision unless HPCX (with its
custom parameters) is used or the corresponding MCA parameter is set.

Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 404c4800688548b021bda68bdf10792424e6b1c5)
2019-07-26 11:28:47 +03:00
KAWASHIMA Takahiro
1ffb9b10bb pcollreq/mpif-h: fix MPIX_Alltoallw_init() binding
These issues were introduced in the recent commit b71af0eca0.
This commit fixes Coverity CID 1451661 and 1451660.

Though the `c_info` part was an actual bug, the `c_sendtypes` part was not.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>

(cherry picked from commit open-mpi/ompi@facf8c5e98)
2019-07-24 17:12:10 +09:00
Gilles Gouaillardet
13ba2b0d75 pcollreq/mpif-h: fix MPIX_Alltoallw_init() binding
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@b71af0eca0)
2019-07-24 17:11:44 +09:00
Gilles Gouaillardet
5ab26e490a fortran/mpif-h: fix [i]alltoallw bindings
Fix a regression introduced in open-mpi/ompi@cdaed89d04

Fixes CID 1451610, 1451611 and 1451612

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@ed703bec1b)
2019-07-24 17:10:58 +09:00
Gilles Gouaillardet
fbf7d31fd1 fortran/mpif-h: fix MPI_[I]Alltoallw() binding
- ignore sendcounts, sdispls and sendtypes arguments when MPI_IN_PLACE is used
- use the right size when an inter-communicator is used.

Thanks Markus Geimer for reporting this.

Refs. open-mpi/ompi#5459

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@cdaed89d04)
2019-07-24 17:10:27 +09:00
Gilles Gouaillardet
aae73d9cf7 fortran/mpif-h: fix C to Fortran error code conversion
- remove incorrect use of OMPI_INT_2_FINT()
- use homogeneous syntax (e.g. c_ierr = PMPI_...())

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@223e6cc537)
2019-07-24 17:10:00 +09:00
Howard Pritchard
667aba9913
Merge pull request #6810 from janjust/v4.0.x
v4.0.x OSC: Reset external request to NULL
2019-07-23 09:05:03 -06:00
Tomislav Janjusic
63605fc466 v4.0.x OSC: Reset external request to NULL to avoid double request completion
Co-authored with Artem Polyakov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2019-07-12 22:49:34 +03:00
Gilles Gouaillardet
c9e4240e70 mpi: retain operation and datatype in non blocking collectives
The MPI standard states that a user MPI_Op and/or user MPI_Datatype can be freed
after a call to a non-blocking collective and before the non-blocking
collective completes.
Retain user (only) MPI_Op and MPI_Datatype when the non-blocking call is
invoked, and set a request callback so they are freed when the MPI_Request
completes.

Thanks Thomas Ponweiser for reporting this

Fixes open-mpi/ompi#2151
Fixes open-mpi/ompi#1304

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@0fe756d416)
2019-07-12 10:27:04 +09:00
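The retain-on-start, release-on-completion pattern from this commit can be sketched with plain reference counting. This is an illustration only: `user_obj_t`, `nbc_req_t`, `nbc_start`, `user_free` and `nbc_complete` are hypothetical stand-ins and do not correspond to the actual Open MPI object system or request callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stands in for a user MPI_Op or MPI_Datatype (hypothetical). */
typedef struct {
    int  refcount;
    bool destroyed;   /* set when the last reference goes away */
} user_obj_t;

static void obj_retain(user_obj_t *o) { o->refcount++; }

static void obj_release(user_obj_t *o)
{
    assert(o->refcount > 0);
    if (0 == --o->refcount) o->destroyed = true;
}

/* Stands in for the request of a non-blocking collective (hypothetical). */
typedef struct {
    user_obj_t *retained;  /* NULL when nothing was retained */
    bool        complete;
} nbc_req_t;

/* Start a non-blocking operation: take a reference on behalf of the request. */
void nbc_start(nbc_req_t *req, user_obj_t *o)
{
    req->complete = false;
    req->retained = o;
    obj_retain(o);
}

/* The user frees the handle: only the user's reference is dropped. */
void user_free(user_obj_t *o) { obj_release(o); }

/* Completion callback: drop the reference taken in nbc_start(). */
void nbc_complete(nbc_req_t *req)
{
    if (NULL != req->retained) {
        obj_release(req->retained);
        req->retained = NULL;
    }
    req->complete = true;
}
```

With this shape, an early free by the user leaves the object alive until the request completes, which is the behavior the commit message attributes to the MPI standard.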
Aurelien Bouteiller
9499dcfe41 Manage errors in NBC collective ops
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

Correctly bubble up errors in NBC collective operations

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

The error field of requests needs to be rearmed at start, not at create

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

(cherry picked from commit open-mpi/ompi@65660e5999)
2019-07-12 10:26:08 +09:00
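Why the error field has to be reset at start rather than at create can be seen with a request that is started more than once. The sketch below assumes such a reusable request; `preq_t`, `preq_create`, `preq_start` and `preq_finish` are hypothetical names, not Open MPI functions.

```c
#include <assert.h>

#define REQ_SUCCESS 0

/* Hypothetical reusable request: created once, started many times. */
typedef struct {
    int error;   /* outcome of the most recent run */
    int starts;  /* number of times the request was started */
} preq_t;

void preq_create(preq_t *r)
{
    r->error  = REQ_SUCCESS;
    r->starts = 0;
}

/* Rearm here: without this reset, an error from run k would still be
 * reported after a successful run k+1. */
void preq_start(preq_t *r)
{
    r->error = REQ_SUCCESS;
    r->starts++;
}

/* Record a failure, if any; a successful run leaves the field untouched. */
void preq_finish(preq_t *r, int err)
{
    assert(r->starts > 0);
    if (REQ_SUCCESS != err) r->error = err;
}
```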
Nysal Jan K.A
b6da090090 pml/ucx: Fix the max tag and context id values
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
(cherry picked from commit fe4ef147f81b2ac56661175005de6c330eace690)
2019-07-03 16:38:07 +03:00
Geoff Paulsen
514e273968
Merge pull request #6770 from devreal/osc_winalloc_err_v4.0.x
OSC rdma win allocate: propagate errors to avoid deadlocks (v4.0.x)
2019-06-28 14:04:12 -05:00
Howard Pritchard
6424857029
Merge pull request #6634 from jsquyres/pr/v4.0.x/ob1-fixes
v4.0.x: Cherry pick ob1 fixes from master
2019-06-26 10:49:32 -06:00
Harald Klimach
16e1d74c8f Suggestion to fix division by zero in file view.
In common_ompi_aggregators calc_cost routine:
do not cast the real division to an int intermediately.
This patch removes the obsolete int variable c and assigns
the result of the P_a/P_x division directly to n_as.

With the intermediate int c variable, n_as gets 0 if P_a < P_x,
resulting in a division by 0 when computing n_s.

Signed-off-by: Harald Klimach <harald.klimach@uni-siegen.de>
(cherry picked from commit e222a04ae57e5d09b8559f3c111de1f10a47246a)
2019-06-25 09:29:08 -06:00
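The truncation described in this commit is easy to reproduce in isolation. The sketch below assumes P_a and P_x are floating-point values, as the phrase "real division" suggests; the function names are hypothetical and the rest of the calc_cost routine is not shown.

```c
#include <assert.h>

/* Buggy shape: the real quotient passes through an int, so it truncates
 * to 0 whenever P_a < P_x. */
double n_as_truncated(double P_a, double P_x)
{
    assert(P_x != 0.0);
    int c = (int)(P_a / P_x);
    return (double)c;
}

/* Fixed shape: assign the real quotient directly. */
double n_as_real(double P_a, double P_x)
{
    assert(P_x != 0.0);
    return P_a / P_x;
}
```

For P_a = 2 and P_x = 4 the truncated version yields 0 and the real version yields 0.5, so any later division by n_as is only safe with the second form.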
Howard Pritchard
28d300915f
Merge pull request #6725 from bosilca/cherrypick/6683
Cherrypick/6683
2019-06-24 13:24:02 -06:00
Joseph Schuchart
c5cf3432b9 OSC rdma win allocate: synchronize error codes across shared memory group
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit 8f27cc26d9845b5b207979b2a4621ef1089d1afb)
2019-06-24 17:49:26 +02:00
Howard Pritchard
73c4aac12d
Merge pull request #6750 from brminich/topic/all2all_linear_sync_fix_v4.0
COLL/BASE: Fix linear sync all2all - v4.0.x
2019-06-17 13:45:52 -06:00
Howard Pritchard
cb8dd569ff
Merge pull request #6747 from devreal/rdma-fetchop-local-v4.0.x
OSC rdma: make sure accumulating in shared memory is safe
2019-06-13 18:55:53 -06:00
Mikhail Brinskii
adba7f55f7 COLL/BASE: Fix linear sync all2all
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 79006f4e5a578d32bfa08de7b98e747ae18706f6)
2019-06-09 21:31:19 +03:00
Joseph Schuchart
900f0fa21f OSC rdma: make sure accumulating in shared memory is safe
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
(cherry picked from commit c67e2291937a09947c421dc84c6b3a8d07bec07f)
2019-06-07 12:45:00 +02:00
Tsubasa Yanagibashi
5dd8830dca mpiext/pcollreq: Add _f08 to procedure names
The procedure names of the Fortran 2008 bindings of Persistent Collective
Operations (mpiext/pcollreq/use-mpi-f08) don't contain "_f08".
This fix adds "_f08" to the procedure names of pcollreq/use-mpi-f08,
same as other Fortran 2008 routines in `ompi/mpi/fortran/use-mpi-f08/mod`.

Signed-off-by: Tsubasa Yanagibashi <fj2505dt@aa.jp.fujitsu.com>
(cherry picked from commit 3148b0cfaa04843e7219acb8c7e04f43f6d219fe)
2019-06-07 10:59:01 +09:00
Geoff Paulsen
a04f5f0c70
Merge pull request #6692 from vspetrov/v4.0.x
V4.0.x Coll/hcoll: don't init opal memhooks unless explicitly requested
2019-06-03 15:00:36 -05:00
George Bosilca
a8d5da67db
Fix the man pages for some of the MPI_T_* functions.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-05-31 00:19:14 -04:00
George Bosilca
dbf89404d7
Fix the SPC initialization.
Use the PVAR ctx to save the SPC index, so that no lookup nor
restriction on the SPC vars position is imposed.
Make sure the PVARs are always registered.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-05-31 00:19:14 -04:00
George Bosilca
cadf315ca9
Fixed SPC/MPI_T initialization error.
Signed-off-by: Yong Qin <yongq@mellanox.com>
2019-05-30 17:54:26 -04:00
Howard Pritchard
e78851a6c7
Merge pull request #6704 from edgargabriel/pr/v4.0.x-empty-fileview-fix
common/ompio: fix division by zero problem with empty fview
2019-05-26 09:45:52 -06:00
Howard Pritchard
386ed07d54
Merge pull request #6689 from hoopoepg/topic/suppressed-pml-ucx-mt-warning-v4.0
PML/UCX: disable PML UCX if MT is requested but not supported - v4.0
2019-05-26 09:44:05 -06:00
Edgar Gabriel
c7250cd11d common/ompio: fix division by zero problem with empty fview
When using an empty fileview, a division by zero bug can occur in ompio. Not entirely sure why the problem did not show up previously, but some recent changes trigger that bug in one of our tests.

This PR is part of a fix applied in commit f6b3a0a

Fixes Issue #6703

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-23 13:48:57 -05:00
Valentin Petrov
8f82c899bc Coll/hcoll: don't init opal memhooks unless explicitly requested by user
If the user sets HCOLL_EXTERNAL_UCM_EVENTS=1, then we try to init the opal
memory framework and register a mem release cb. Otherwise, rely on ucx.

Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2019-05-20 14:00:50 +03:00
Sergey Oblomov
1edd36638b PML/UCX: disable PML UCX if MT is requested but not supported
- if multithreading is requested but not supported, disable PML UCX

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit a3578d9ece2b40a349529e7b223df50b0aac64aa)
2019-05-20 09:59:59 +03:00
Yossi Itigin
4f9fb3e9ce OSC/UCX: Fix deadlock with atomic lock
Atomic lock must progress local worker while obtaining the remote lock,
otherwise an active message which actually releases the lock might not
be processed while polling on local memory location.

(picked from master 9d1994b)

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2019-05-20 09:54:01 +03:00
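The deadlock described in this commit can be modeled in a single-threaded simulation. The sketch assumes the lock is released only when a pending message is handled by the local worker; `sim_t`, `worker_progress` and `acquire` are hypothetical and do not use the UCX API.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated state: the lock word is held, and the message that releases
 * it sits unprocessed in the local worker's queue. */
typedef struct {
    int  lock_word;        /* 0 = free, 1 = held */
    bool release_pending;  /* unprocessed "release" message */
} sim_t;

/* Stand-in for progressing the local worker: handle pending messages. */
static void worker_progress(sim_t *s)
{
    if (s->release_pending) {
        s->lock_word = 0;
        s->release_pending = false;
    }
}

/* Try to take the lock, giving up after max_spins attempts.
 * Returns true when the lock was acquired. */
bool acquire(sim_t *s, bool progress, int max_spins)
{
    assert(max_spins >= 0);
    for (int i = 0; i < max_spins; i++) {
        if (0 == s->lock_word) {
            s->lock_word = 1;
            return true;
        }
        if (progress) worker_progress(s);
    }
    return false;
}
```

Polling without progress never sees the lock become free in this model, while polling with progress acquires it on the next iteration.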
George Bosilca
4946570b24 Remove a few warnings identified by @rhc in #5514.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

(cherry picked from commit open-mpi/ompi@6d11a45f44)
2019-05-11 16:38:31 +09:00
Geoff Paulsen
73f9bcc374
Merge pull request #6632 from brminich/topic/shmem_all2all_put_4.0.x
SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h 4.0.x
2019-05-07 08:05:01 -05:00
Howard Pritchard
8e968f16a6
Merge pull request #6626 from ggouaillardet/topic/v4.0.x/mpi_combiner_xyz_integer
v4.0.x: mpi: mark MPI_COMBINER_{HVECTOR,HINDEXED,STRUCT}_INTEGER removed
2019-05-04 07:25:40 -06:00
George Bosilca
48f824327c Fix the leak of fragments for persistent sends.
The rdma_frag attached to the send request was not correctly released
upon request completion, leaking until MPI_Finalize. A quick solution
would have been to add RDMA_FRAG_RETURN at different locations on the
send request completion, but it would have unnecessarily made the
sendreq completion path more complex. Instead, I added the length to
the RDMA fragment so that it can be completed during the remote ack.

Be more explicit on the comment.

The rdma_frag can only be freed once when the peer forced a protocol
change (from RDMA GET to send/recv). Otherwise the fragment will be
returned once all data pertaining to it has been transferred.

NOTE: Had to add a typedef for "opal_atomic_size_t" from master into
opal/threads/thread_usage.h into this cherry pick (it is in
opal/include/opal_stdatomic.h on master, but that file does not exist
here on the v4.0.x branch).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a16cf0e4dd6df4dea820fecedd5920df632935b8)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-05-03 06:20:02 -07:00
Brelle Emmanuel
c44821aef5 pml/ob1: fixed local handle sent during PUT control message
When using a btl_put in ob1, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the sent handle is necessarily the handle in the frag, but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request one.

I suggest checking whether the handle in the fragment is valid and, if
not, sending the handle from the request.

Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
(cherry picked from commit e630046a4b82bc01379fb055af4c0e414c2a8e8f)
2019-05-03 05:53:35 -07:00
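The fallback suggested in this commit amounts to a small selection rule. The sketch below assumes that an invalid fragment handle is represented by NULL; `handle_t`, `put_frag_t`, `put_req_t` and `handle_to_send` are hypothetical and are not the ob1 structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical registration handle and owners (not the ob1 types). */
typedef struct { int key; } handle_t;
typedef struct { handle_t *local_handle; } put_frag_t;
typedef struct { handle_t *local_handle; } put_req_t;

/* Prefer the fragment's own handle; fall back to the one registered on
 * the request when the fragment has none. */
const handle_t *handle_to_send(const put_frag_t *frag, const put_req_t *req)
{
    assert(frag != NULL && req != NULL);
    return (NULL != frag->local_handle) ? frag->local_handle
                                        : req->local_handle;
}
```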