This patch fixes #7147 by preventing overflow when multiplying
the count and the blocklen. The count reflects the MPI count and is
therefore bound to the size of an int (it is a uint32_t) while the
blocklen can be merged together to represent the largest contiguous
memory layout and it is therefore promoted to a size_t.
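
A minimal sketch of the overflow described above, assuming a hypothetical
descriptor with a 32-bit count and a size_t blocklen (the struct and function
names are illustrative, not the actual opal_datatype code). It assumes a
platform where size_t is 64 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element descriptor; field names are illustrative only. */
typedef struct {
    uint32_t count;     /* repetitions, bounded by the MPI int count */
    size_t   blocklen;  /* contiguous elements per repetition, may grow large */
} elem_desc_t;

/* Narrow variant: the product is computed in 32 bits and silently wraps. */
uint32_t total_elems_narrow(const elem_desc_t *e)
{
    return e->count * (uint32_t)e->blocklen;
}

/* Wide variant: promote the count to size_t before multiplying. */
size_t total_elems_wide(const elem_desc_t *e)
{
    return (size_t)e->count * e->blocklen;
}
```

With count = 65536 and blocklen = 65536 the narrow variant wraps around to 0,
while the wide variant returns the expected 2^32 elements.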
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 59fb02618e)
This patch fixes the merge of contiguous elements into larger but more
compact datatypes, and allows for contiguous elements to have their
blocklen increasing instead of the count. The idea is to always maximize
the blocklen, aka. the contiguous part of the datatype.
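
The idea of maximizing the blocklen can be sketched as follows. This is an
illustration under assumptions: vec_desc_t and maximize_blocklen are
hypothetical names, and the real optimizer handles more cases than
back-to-back repetitions.

```c
#include <stddef.h>

/* Hypothetical description of a repeated block; not the real OPAL struct. */
typedef struct {
    size_t    count;      /* number of repetitions */
    size_t    blocklen;   /* contiguous elements per repetition */
    size_t    elem_size;  /* size in bytes of one element */
    ptrdiff_t extent;     /* distance in bytes between repetitions */
} vec_desc_t;

/* When the repetitions touch each other (extent equals the block size in
 * bytes), fold the count into the blocklen. Returns 1 if the description
 * was rewritten, 0 otherwise. */
int maximize_blocklen(vec_desc_t *d)
{
    if (d->count > 1 && d->extent == (ptrdiff_t)(d->blocklen * d->elem_size)) {
        d->blocklen *= d->count;
        d->extent   *= (ptrdiff_t)d->count;
        d->count     = 1;
        return 1;
    }
    return 0;
}
```

For example, 4 repetitions of 2 doubles with an extent of 16 bytes become a
single block of 8 doubles, while the same repetitions with an extent of 32
bytes are left unchanged.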
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 41e6f55807)
Start optimizing the code.
This commit divides the operations in 2 parts, the first, outside the
critical part, deals with partial blocks of predefined elements, and the
second, inside the critical path, only deals with full blocks of
elements. This reduces the number of expensive operations in the
critical path and results in a decent performance increase.
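
A simplified sketch of this split, assuming a byte-oriented strided layout
(pack_strided and its parameters are hypothetical, not the convertor API).
The partial blocks are handled before and after the loop, so the loop itself
only copies full blocks.

```c
#include <stddef.h>
#include <string.h>

/* Pack nbytes from a strided source, starting at logical packed offset
 * `start`. Each block holds blocklen bytes; blocks are stride bytes apart.
 * Returns the number of bytes packed. */
size_t pack_strided(char *dst, const char *src,
                    size_t blocklen, size_t stride,
                    size_t start, size_t nbytes)
{
    size_t done = 0;
    size_t blk  = start / blocklen;   /* block we start in */
    size_t off  = start % blocklen;   /* offset inside that block */

    /* Part 1, outside the critical path: finish a partially consumed block. */
    if (off != 0 && nbytes > 0) {
        size_t n = blocklen - off;
        if (n > nbytes) n = nbytes;
        memcpy(dst, src + blk * stride + off, n);
        done += n;
        if (off + n != blocklen) return done;  /* request ended mid-block */
        blk++;
    }

    /* Part 2, the critical path: full blocks only. */
    while (nbytes - done >= blocklen) {
        memcpy(dst + done, src + blk * stride, blocklen);
        done += blocklen;
        blk++;
    }

    /* Part 3, outside the critical path: trailing partial block. */
    if (done < nbytes) {
        memcpy(dst + done, src + blk * stride, nbytes - done);
        done = nbytes;
    }
    return done;
}
```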
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Amazing how bad instruction scheduling can have such a drastic impact
on the code performance. With this change, we get a boost of at least
50% on the performance of data with a small blocklen and/or count.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Optimize contiguous loops by collapsing them into a single element.
During datatype optimization collapse similar elements into larger
blocks.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Upon detecting a datatype loop representation skip the entire loop
according to the remaining space.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
- optimize handling of contiguous with gaps datatypes.
- fixes a performance issue for all datatypes with a count of 1.
- optimize the pack/unpack of contiguous with gaps datatype.
- optimize the case of blocklen == 1
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Move toward a base type of vector (count, type, blocklen, extent, disp)
with disp and extent applying to the count repetition and blocklen
being a contiguous block of memory of type type.
Implement 2 optimizations on this description used during type_commit:
- collapse: successive similar datatype descriptions are collapsed
together with an increased count.
- fusion: fuse successive datatype descriptions in order to minimize the
number of resulting memcpy during pack/unpack.
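
The two optimizations can be sketched roughly as below. This is a simplified
illustration, assuming a hypothetical vdesc_t that mirrors the (count, type,
blocklen, extent, disp) tuple with the type reduced to its element size; the
real implementation covers more cases.

```c
#include <stddef.h>

/* Hypothetical vector-like description; not the actual OPAL structure. */
typedef struct {
    size_t    count;      /* number of repetitions */
    size_t    blocklen;   /* contiguous elements per repetition */
    size_t    elem_size;  /* stands in for the type */
    ptrdiff_t extent;     /* distance in bytes between repetitions */
    ptrdiff_t disp;       /* displacement in bytes of the first repetition */
} vdesc_t;

/* collapse: b continues the repetition pattern of a, so a absorbs it
 * with an increased count. Returns 1 on success. */
int try_collapse(vdesc_t *a, const vdesc_t *b)
{
    if (a->blocklen != b->blocklen || a->elem_size != b->elem_size ||
        a->extent != b->extent)
        return 0;
    if (b->disp != a->disp + (ptrdiff_t)a->count * a->extent)
        return 0;
    a->count += b->count;
    return 1;
}

/* fusion: two single blocks of the same element size that are adjacent in
 * memory become one larger block, hence one memcpy instead of two. The
 * extent is left untouched since it is unused when count is 1. */
int try_fuse(vdesc_t *a, const vdesc_t *b)
{
    if (a->count != 1 || b->count != 1 || a->elem_size != b->elem_size)
        return 0;
    if (b->disp != a->disp + (ptrdiff_t)(a->blocklen * a->elem_size))
        return 0;
    a->blocklen += b->blocklen;
    return 1;
}
```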
Fixes at the OMPI datatype level including:
- Fix the create_hindexed and vector creation.
- Fix the handling of [get|set]_elements and _count.
- Correctly compute the displacement for block indexed types.
- Support the MPI_LB and MPI_UB deprecation, aka. OMPI_ENABLE_MPI1_COMPAT.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.
Thanks Ben Menadue for the initial bug report
Refs open-mpi/ompi#6016
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Reset ptypes when cloning a datatype in order to prevent
a double free() in the opal_datatype_t destructor.
This fixes a bug introduced in open-mpi/ompi@7c938f070f
Fixes open-mpi/ompi#6346
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@b395342c9f)
The issue was a little complicated due to the internal stack used in the
convertor. The main issue was that in the case where we run out of iov
space to save the raw description of the data while handling a
repetition (loop), instead of saving the current position and bailing out
directly we kept reading the next predefined type element. It worked in
most cases, except the one identified by the HDF5 test. However, the
biggest issue here was the drop in performance for all ensuing calls to
the convertor pack/unpack, as instead of handling contiguous loops as a
whole (and minimizing the number of memory copies) we copied data
description by data description.
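
The intended behavior can be sketched as follows, assuming a hypothetical
raw_loop helper and loop_pos_t state (neither is the real convertor
interface). When the iovec array fills up in the middle of a loop, the
position is kept and the function returns, so the next call resumes the same
loop.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical saved position inside a loop description. */
typedef struct {
    size_t next_rep;  /* first repetition not yet described */
} loop_pos_t;

/* Describe a loop of `count` blocks of blocklen bytes, stride bytes apart,
 * as iovec entries. Stops as soon as iov is full, leaving pos pointing at
 * the repetition to resume from. Returns the number of entries written. */
size_t raw_loop(struct iovec *iov, size_t iov_max, char *base,
                size_t count, size_t blocklen, size_t stride,
                loop_pos_t *pos)
{
    size_t used = 0;
    while (pos->next_rep < count) {
        if (used == iov_max)
            break;  /* out of iov space: position is saved, bail out */
        iov[used].iov_base = base + pos->next_rep * stride;
        iov[used].iov_len  = blocklen;
        used++;
        pos->next_rep++;
    }
    return used;
}
```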
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(back-ported from commit open-mpi/ompi@5a82c4fd07)
correctly handle the case in which iovec is full and the
last accessed element of the datatype is the beginning of a loop
Refs. open-mpi/ompi#6285
Thanks Axel Huebl for reporting this
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(back-ported from commit open-mpi/ompi@0832ab5acc)
correctly free ptypes if the datatype is not pre-defined.
Thanks Axel Huebl for reporting this.
Refs. open-mpi/ompi#6291
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 7c938f070f)
cuda buffer
the existing interface in opal_datatype_cuda does not allow distinguishing whether a
buffer is a managed or unmanaged cuda buffer. Add an interface that allows
retrieving this information through a convertor, since the information is actually available
in the mca_common_cuda_* routines.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Fixes issue #5069, which relates to a BigMPI bug with the use of
MPI_Type_vector to construct very large datatypes (>2GB).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
return true if the datatype has non-negative and monotonically
nondecreasing displacements, and false otherwise.
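
A minimal sketch of such a check, assuming the displacements are available
as a flat array (disps_are_monotonic is a hypothetical helper, not the
function added by this commit):

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if every displacement is non-negative and no displacement
 * is smaller than the one before it. An empty array is accepted. */
bool disps_are_monotonic(const ptrdiff_t *disps, size_t n)
{
    ptrdiff_t prev = 0;  /* starting at 0 also rejects a negative first entry */
    for (size_t i = 0; i < n; i++) {
        if (disps[i] < prev)
            return false;
        prev = disps[i];
    }
    return true;
}
```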
Thanks George for the guidance.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
PSM2 enables support for GPU buffers and CUDA managed memory, and it can
directly recognize GPU buffers and handle copies between HFIs and GPUs.
Therefore, it is not required for OMPI to handle GPU buffers for pt2pt cases.
In this patch, we allow the PSM2 MTL to specify when
it does not require CUDA convertor support. This allows us to skip CUDA
convertor init phases and lets PSM2 handle the memory transfers.
This translates to improvements in latency.
The patch enables blocking collectives and workloads with contiguous and
non-contiguous GPU memory.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
* Don't overflow the internal datatype count.
Change the type of the count to be a size_t (it does not alter the total
size of the internal structures, so has no impact on the ABI).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Optimize the datatype creation.
The internal array of counts of predefined types is now only created
when needed, which is either in a heterogeneous environment, or when
one calls get_elements. It saves space and makes the convertor creation a
little faster in some cases.
Rearrange the fields in the datatype description structs.
The macro OPAL_DATATYPE_INIT_PTYPES_ARRAY had a bug, and the
static array was only partially created. All predefined types should
have the ptypes array created and initialized.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Fix the boundary computation.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* test/datatype: add test for short unpack on heterogeneous cluster
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Trying to reduce the cost of creating a convertor.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Respect the unpack boundaries.
As Gilles suggested on #2535, the opal_unpack_general_function was
unpacking based on the requested count and not on the amount of packed
data provided.
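
The principle can be illustrated with a small sketch (unpack_bounded is a
hypothetical helper for homogeneous, contiguous data, not the actual
opal_unpack_general_function): the number of unpacked elements is limited by
the packed bytes available, not only by the requested count.

```c
#include <stddef.h>
#include <string.h>

/* Unpack at most requested_count elements of elem_size bytes, but never
 * more than packed_len bytes of packed data allow. Returns the number of
 * complete elements actually unpacked. */
size_t unpack_bounded(void *user_buf, size_t requested_count,
                      size_t elem_size,
                      const void *packed, size_t packed_len)
{
    if (elem_size == 0)
        return 0;
    size_t avail = packed_len / elem_size;  /* complete elements available */
    size_t n = requested_count < avail ? requested_count : avail;
    memcpy(user_buf, packed, n * elem_size);
    return n;
}
```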
Fixes #2535.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
since Open MPI now requires C99, and the ptrdiff_t type is part of C99,
there is no more need for the abstract OPAL_PTRDIFF_TYPE type.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
clang 7.0 with the picky option on is extremely verbose, and complains
about almost everything. Trying to make it happy, at least regarding
the datatype engine.
The heterogeneous code needs to gracefully handle the contiguous
datatype loops in order to have the "#if 0" code path enabled again.
This is a performance issue (the correctness is guaranteed by the
current code).
provide external32 support). Add a pack function that provides
send-side conversion (needed on little endian machines in
order to pack in the external32 format).
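
As an illustration of the conversion involved: external32 stores integers in
big-endian byte order, so a little endian sender has to reorder bytes while
packing. The sketch below uses a hypothetical pack_u32_external32 helper (not
the pack function added here) and writes the bytes explicitly, which gives
the same result on any host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack n 32-bit unsigned integers into dst in big-endian byte order.
 * dst must have room for 4 * n bytes. */
void pack_u32_external32(unsigned char *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[4 * i + 0] = (unsigned char)(src[i] >> 24);  /* most significant */
        dst[4 * i + 1] = (unsigned char)(src[i] >> 16);
        dst[4 * i + 2] = (unsigned char)(src[i] >> 8);
        dst[4 * i + 3] = (unsigned char)(src[i]);        /* least significant */
    }
}
```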