1
1
Граф коммитов

6974 Коммитов

Автор SHA1 Сообщение Дата
Howard Pritchard
d6d73b7724 mtl/ofi: replace OMPI_UNLIKELY with OPAL version
one off patch for v4.0.x.  for some reason commit on master
didn't have this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 5f3dbdb5c8)

Note that this commit is actually a cherry-pick from the v4.0.x
branch.  This is the opposite direction than what we nornmally do: we
usually commit to master first and then cherry-pick to the release
branches (vs. the other way around).

As is probably evident from the original commit message above, through
a comedy of errors, this commit was actually applied to the v4.0.x
branch first and then cherry-picked back to master (i.e., the problem
*did* exist in the original master commit
3aca4af548, but it was not recongized at
the time).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-01 09:52:27 -07:00
Jeff Squyres
ee3564a2dc
Merge pull request #7004 from mwheinz/REFS6976-master
REF6976 Silent failure of OMPI over OFI with large messages sizes
2019-09-23 17:31:21 -04:00
Michael Heinz
3aca4af548 REF6976 Silent failure of OMPI over OFI with large messages sizes
INTERNAL: STL-59403

The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.

This patch adds this missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.

Change-Id: I05aa71d332f2df897133b30c28bf37d98f061996
Signed-off-by: Michael Heinz <michael.william.heinz@intel.com>
Reviewed-by: Adam Goldman <adam.goldman@intel.com>
Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>
2019-09-23 15:23:48 -04:00
Geoff Paulsen
5ff6cb6e6a
Merge pull request #6756 from markalle/romio_info
romio info: letting romio keep its internal setup
2019-09-05 15:43:07 -05:00
Raafat Feki
7877743784
Merge pull request #6857 from raafatfeki/pr/ompio_coll_write_clean
Pr/ompio_fcoll_write_clean
2019-09-04 11:06:56 -05:00
Mark Allen
14e3d7b8b0 romio info: letting romio keep its internal setup
I'm restoring the info function pointers to the IO module
but allowing the function pointers to be NULL (eg in ompio).
And letting romio321 set its function pointers for those
routines.

This means the info system uses the new OMPI-level info
system for most things, but skips it and uses the pre-existing
romio info system just for the romio module.

It's possible to convert romio, but I went a ways down that
path and found it kind of convoluted.  Having pointers from
the lower level ADIO_File back to the higher level ompi_file_t
wasn't too bad, but I got stuck trying to figure out where/how
to register the infosubscribe_subscribe callbacks vs the way
initial k/v values are scattered around the romio code currently.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2019-09-03 14:08:19 -04:00
Nathan Hjelm
c4d0752036
Merge pull request #6803 from guserav/fix-osc-sm-post-32-bit-atomics
Fix osc sm posts when only 32 bit atomics support
2019-08-27 18:23:48 -07:00
Valentin Petrov
a0d99ad190 Coll/hcoll: fixes hcoll non-blocking colls support
open-mpi/ompi@0fe756d416 Introduced
    a bug in coll/hcoll component. The ompi_requests allocated by
    libhcoll would be treated as coll_base_nbc_request during
    ompi_coll_base_retain_<> call. Afterwards this would lead to a
    segv in the request cleanup.

    Fix: since libhcoll interface does not distinguish between the
    blocling/non-blocking requests use coll_base_nbc_request all the
    time and initialize it properly in
    coll/hcoll/get_coll_handle(). It is still within 2 cache lines.

Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2019-08-27 17:22:58 +03:00
raafatfeki
2c6a5eed29 fcoll/dynamic_gen2: Adjustment of displacement index in collective write
Within the shuffle iteration, the aggregators have to set a displacement array needed to receive data from other processes. The array had 1 extra element. We adjust the displacement index to match the number of elements.

Signed-off-by: raafatfeki <fekiraafat@gmail.com>
2019-08-26 10:03:23 -05:00
raafatfeki
f45e9cfdbe fcoll/vulcan: Adjustment of displacement index in collective write
Within the shuffle iteration, the aggregators have to set a displacement array needed to receive data from other processes. The array had 1 extra element. We adjust the displacement index to match the number of elements.

Signed-off-by: raafatfeki <fekiraafat@gmail.com>
2019-08-26 10:03:23 -05:00
George Bosilca
2930bd9d21 Whitespace cleanup
No code or logic changes.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-08-14 11:06:47 -04:00
Artem Polyakov
d58c59eb71
Merge pull request #6893 from janjust/osc_error_path_fix
osc/ucx: Fix error path
2019-08-12 21:23:57 -07:00
Jeff Squyres
ae1f7e0c3b
Merge pull request #6879 from mwheinz/REF6877-master
PSM MTL is obsolete and should be removed
2019-08-12 15:08:25 -04:00
Tomislav Janjusic
d5f6b088ae osc/ucx: Fix error path
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2019-08-12 21:54:01 +03:00
Gilles Gouaillardet
63d3ccde9d coll/base: only retain datatypes/op if the request has not yet completed
a non blocking collective might return ompi_request_null, so we should not
retain anything in that case.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-08-09 09:57:56 +09:00
Gilles Gouaillardet
0862c409f1 coll/base: cleanup ompi_coll_base_nbc_request_t elements
Since ompi_coll_base_nbc_request_t is to be used in an
opal_free_list_t, it must be returned into a "clean" state.
So cleanup some data in the callback completion subroutines.

This fixes a regression introduced in open-mpi/ompi@0fe756d416

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-08-08 10:48:06 +09:00
Gilles Gouaillardet
f8eef0fde9 coll/libnbc: fixes ompi ompi_coll_libnbc_request_t parent
base ompi_coll_libnbc_request_t on top of ompi_coll_base_nbc_request_t
to correctly support the retention of datatypes/operators

This fixes a regression introduced in open-mpi/ompi@0fe756d416

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-08-08 10:47:48 +09:00
Michael Heinz
0348d14ff3 PSM MTL is obsolete and should be removed
The PSM MTL for Intel's TrueScale Infiniband HCAs is not being actively
maintained and should be removed from the master branch.

Fixes issue: #6877

Signed-off-by: Michael Heinz <michael.william.heinz@intel.com:
2019-08-07 11:43:03 -04:00
Yossi Itigin
ec9def1406
Merge pull request #6864 from hoopoepg/topic/ucx-ppn-hint
UCX: added PPN hint for UCX context
2019-08-07 13:45:38 +03:00
Edgar Gabriel
34b06dc8bd io_ompio_file_open: fix offset calculation with SEEK_END
and SEEK_CUR. fixes an issue reported by Wei-keng Liao

Fixes Issue #6858

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-08-05 15:56:25 -05:00
Sergey Oblomov
43186e494b UCX: added PPN hint for UCX context
- added PPN hint for UCX context init

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2019-08-05 18:07:06 +03:00
Nysal Jan K A
3c45542c51
Merge pull request #6840 from nysal/ucx_accumulate_fix
osc/ucx: Fix data corruption with non-contiguous accumulates
2019-07-25 22:11:52 +05:30
Yossi Itigin
98d0ecfe14
Merge pull request #6814 from brminich/tuned_all2all_select
COLL/TUNED: Update alltoall selection rule for mellanox platform
2019-07-25 17:51:55 +03:00
Mikhail Brinskii
65618f8db8 COLL/TUNED: Minor var names/comments fixes
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
2019-07-24 10:23:38 +00:00
Nysal Jan K.A
3529d44702 osc/ucx: Fix data corruption with non-contiguous accumulates
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
2019-07-24 13:07:59 +05:30
Nysal Jan K.A
14808922cf osc/ucx: Add support for the no_locks info key
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
2019-07-18 17:29:01 +05:30
Mikhail Brinskii
404c480068 COLL/TUNED: Update alltoall selection rule for mlx
Use linear with sync alltoall algorithm for certain message/comm size
ranges. Does not affect default fixed decision, unless HPCX (with its
custom parameters) is used or corresponding mca is set.

Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
2019-07-13 23:27:40 +03:00
Gilles Gouaillardet
0fe756d416 mpi: retain operation and datatype in non blocking collectives
MPI standard states a user MPI_Op and/or user MPI_Datatype can be free'd
after a call to a non blocking collective and before the non-blocking
collective completes.
Retain user (only) MPI_Op and MPI_Datatype when the non blocking call is
invoked, and set a request callback so they are free'd when the MPI_Request
completes.

Thanks Thomas Ponweiser for reporting this

Fixes open-mpi/ompi#2151
Fixes open-mpi/ompi#1304

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-07-12 09:15:45 +09:00
guserav
3c9f4e6823 Fix osc sm posts when only 32 bit atomics support
Signed-off-by: guserav <erik.zeiske@hpe.com>
2019-07-09 15:13:25 -07:00
Nysal Jan K.A
fe4ef147f8 pml/ucx: Fix the max tag and context id values
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
2019-07-03 14:33:01 +05:30
Geoff Paulsen
f1b2a09675
Merge pull request #6649 from devreal/rdma-fetchop-local
OSC rdma: make sure accumulating in shared memory is safe
2019-06-28 14:46:21 -05:00
Artem Polyakov
6678ac0f55 osc/ucx: Fix possible win creation/destruction race condition
To avoid fully initializing the osc/ucx component for MPI application
that are not using One-Sided functionality, the initialization happens
at the first MPI window creation.

This commit ensures atomicity of global state modifications.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-06-20 09:05:03 -07:00
Artem Polyakov
0857742624 osc/ucx: Fix worker pool finalization
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2019-06-20 09:05:03 -07:00
Nathan Hjelm
560886f095
Merge pull request #6746 from devreal/osc_winalloc_err
OSC rdma win allocate: propagate errors to avoid deadlocks
2019-06-18 17:57:53 -07:00
Harald Klimach
e222a04ae5 Suggestion to fix division by zero in file view.
In common_ompi_aggregators calc_cost routine:
do not cast the real division to an int intermediately.
This patch removes the obsolete int variable c and assigns
the result of the P_a/P_x division directly to n_as.

With the intermediate int c variable, n_as gets 0 if P_a < P_x,
resulting in a division by 0 when computing n_s.

Signed-off-by: Harald Klimach <harald.klimach@uni-siegen.de>
2019-06-13 18:47:32 +02:00
Jeff Squyres
7c3aeb3061
Merge pull request #6686 from alex-anenkov/coll-iallreduce-recursivedoubling
coll/libnbc: add recursive doubling algorithm for MPI_Iallreduce
2019-06-10 10:09:51 -04:00
Yossi Itigin
a46e5da3ca
Merge pull request #6744 from brminich/topic/all2all_linear_sync_fix
COLL/BASE: Fix linear sync all2all
2019-06-09 21:23:38 +03:00
Joseph Schuchart
8f27cc26d9 OSC rdma win allocate: synchronize error codes across shared memory group
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-06-07 11:03:21 +02:00
Mikhail Brinskii
79006f4e5a COLL/BASE: Fix linear sync all2all
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
2019-06-06 19:22:42 +03:00
Yossi Itigin
8535dd570b
Merge pull request #6732 from dmitrygladkov/topic/pml/ucx_init
PML/UCX: Don't destroy UCP worker if it wasn't created
2019-06-06 10:41:33 +03:00
Dmitry Gladkov
c864ca51d2 PML/UCX: Don't destroy UCP worker if it wasn't created
Signed-off-by: Dmitry Gladkov <dmitrygla@mellanox.com>
2019-06-03 10:49:36 +03:00
Tomislav Janjusic
6ea920e225 Coll/hcoll: adding scatterv interface
Signed-off-by: Valentin Petrov valentinp@mellanox.com
2019-05-27 12:27:43 +03:00
Edgar Gabriel
8eda9f2ecd common/ompio: fix coverty warnings
this commmit fixes coverty warnings CID 1445198 and CID 1445197
For a reason that is a bit unclear to me, coverty only complained about the read
files, but the write operations had the same issue, so I fixed that within the
same commit as well.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-23 13:40:39 -05:00
Edgar Gabriel
27b2ec71a7 common/ompio: add support for read operations and collective I/O
external32 data representation is now support by ompio for everything
but non-blocking collective I/O operations. The support can further be improved
in a second step to limit the temporary buffer size (at least for blocking operations),
but it does work now for many scenarios.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 17:56:16 -05:00
Edgar Gabriel
ab56e6f0db common/ompio: make individual read operations work.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 17:22:33 -05:00
Edgar Gabriel
f6b3a0af52 common/ompio: individual write of external32 works
both blocking and non-blocking. collective write and read operations not yet.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 16:26:14 -05:00
Edgar Gabriel
d955753cb8 common/ompio: abstraction for different convertor types
introduce separate convertors for memory vs. file representation. Adjust the interfaces for decode_datatype to provide the convertor to be used for that.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 13:35:38 -05:00
Edgar Gabriel
35be18b266 common/ompio: rename ompio_cuda* to ompio_buffer*
the infrastructure put in place to manage cuda buffers is actually
a lot more generic than just for cuda buffers. Specifically, we ca
reuse much of the code to implement the external32 data representation.
This commit converts the code from common_ompio_cuda* to
common_ompio_buffer*. There are just very few places where we actually need to keep the OPAL_CUDA_SUPPORT ifdef in place.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 12:50:04 -05:00
Edgar Gabriel
a96efb7620 common/ompio: add comm_ompio_read_all/write_all functions
in preparation for adding support for the external32 data
representation.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 12:49:36 -05:00
Valentin Petrov
f19f6f432a Coll/hcoll: don't init opal memhooks unless explicitely requested by user
If user sets HCOLL_EXTERNAL_UCM_EVENTS=1 then we try init opal
    memory framework and register a mem release cb. Otherwise, rely on ucx.

Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2019-05-20 11:17:44 +03:00