This patch fixes the merge of contiguous elements into larger but more
compact datatypes, and allows for contiguous elements to have thir
blocklen increasing instead of the count. The idea is to always maximize
the blocklen, aka. the contiguous part of the datatype.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Start optimizing the code.
This commit divides the operations in 2 parts, the first, outside the
critical part, deals with partial blocks of predefined elements, and the
second, inside the critical path, only deals with full blocks of
elements. This reduces the number of expensive operations in the
critical path and results in a decent performance increase.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Amazing how a bad instruction scheduling can have such a drastic impact
on the code performance. With this change, the get a boost of at least
50% on the performance of data with a small blocklen and/or count.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Optimize contiguous loops by collapsing them into a single element.
During datatype optimization collapse similar elements into larger
blocks.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Upon detecting a datatype loop representation skip the entire loop
according the the remaining space.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
- optimize handling of contiguous with gaps datatypes.
- fixes a performance issue for all datatypes with a count of 1.
- optimize the pack/unpack of contiguous with gaps datatype.
- optimize the case of blocklen == 1
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Move toward a base type of vector (count, type, blocklen, extent, disp)
with disp and extent applying toward the count repertition and blocklen
being a contiguous memory of type type.
Implement 2 optimizations on this description used during type_commit:
- collapse: successive similar datatype descriptions are collapsed
together with an increased count.
- fusion: fuse successive datatype descriptions in order to minimize the
number of resulting memcpy during pack/unpack.
Fixes at the OMPI datatype level including:
- Fix the create_hindexed and vector creation.
- Fix the handling of [get|set]_elements and _count.
- Correctly compute the dispacement for block indexed types.
- Support the MPI_LB and MPI_UB deprecation, aka. OMPI_ENABLE_MPI1_COMPAT.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
... and add `MPI_COMPLEX4`.
This commit changes values of existing `OMPI_DATATYPE_MPI_*` macros.
This change does not affect ABI compatibility of `libmpi.so` and the
like because these values are only used in OMPI internal code.
On the other hand, `ompi_datatype_t::id` values of existing datatypes
are not changed and 73 is newly assigned to for `MPI_COMPLEX4` to
retain ABI compatibility.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
... and `ompi_mpi_c_short_float_complex` and `ompi_mpi_cxx_sfltcplex`.
These are Open MPI internal variables intended to be defined as
`MPI_SHORT_FLOAT`, `MPI_C_SHORT_FLOAT_COMPLEX`, and
`MPI_CXX_SHORT_FLOAT_COMPLEX` in the future.
`OMPI_DATATYPE_MPI_C_SHORT_FLOAT_COMPLEX` is also required to
support `MPI_COMPLEX4` in the next commit.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
The type `short float`, which is proposed in ISO/IEC JTC 1/SC 22 WG 14
(C WG), is not supported by most compilers yet. But some compilers
(including gcc 7 for AArch64 and clang 6) support `_Float16`, which
is defined in ISO/IEC TS 18661-3:2015 (ISO/IEC JTC 1/SC 22/WG 14 N1945)
as an extensions for C. If it is detected in `configure`, it is used
as an alternate type of `short float` in Open MPI internal code.
This commit adds a `configure` option `--enable-alt-short-float=TYPE`.
It can be used to specify a type other than `short float` and `_Float16`
as the alternate type.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
The type `short float` is proposed for the C language in ISO/IEC JTC
1/SC 22 WG 14 (C WG) for mainly IEEE 754-2008 binary16, a.k.a.
half-precision floating point or FP16.
By this commit, `short float` and `short float _Complex` are detected
in `configure` and used in Open MPI internal code. `MPI_SHORT_FLOAT`
and its complex number version are not added yet.
This commit changes values of existing `OPAL_DATATYPE_*` macros.
This change does not affect ABI compatibility of `libmpi.so` and the
like because these values are only used in OPAL and OMPI internal code.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Reset ptypes when cloning a datatype in order to prevent
a double free() in the opal_datatype_t destructor.
This fixes a bug introduced in open-mpi/ompi@7c938f070fFixesopen-mpi/ompi#6346
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The issue was a little complicated due to the internal stack used in the
convertor. The main issue was that in the case where we run out of iov
space to save the raw description of the data while hanbdling a
repetition (loop), instead of saving the current position and bailing out
directly we reading of the next predefined type element. It worked in
most cases, except the one identified by the HDF5 test. However, the
biggest issue here was the drop in performance for all ensuing calls to
the convertor pack/unpack, as instead of handling contiguous loops as a
whole (and minimizing the number of memory copies) we copied data
description by data description.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
correctly handle the case in which iovec is full and the
last accessed element of the datatype is the beginning of a loop
Refs. open-mpi/ompi#6285
Thanks Axel Huebl for reporting this
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
correctly free ptypes if the datatype is not pre-defined.
Thanks Axel Huebl for reporting this.
Refs. open-mpi/ompi#6291
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit contains the following changes:
- Remove the unused opal_test_init/opal_test_finalize
functions. These functions are not used by anything in the code
base or MTT. Tests use opal_init_util/opal_finalize_util instead.
- Get rid of gotos in opal_init_util and opal_init. Replaced them
with a cleaner solution.
- Automatically register cleanup functions in init functions. The
cleanup functions are executed in the reverse order of the
initialization functions. The cleanup functions are run in
opal_finalize_util() before tearing down the class system.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.
Thanks Ben Menadue for the initial bug report
Refs open-mpi/ompi#6016
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Now Open MPI requires a C99 compiler. Checking availability of
the following types is no more needed.
- `long long` (`signed` and `unsigned`)
- `long double`
- `float _Complex`
- `double _Complex`
- `long double _Complex`
Furthermore, the `#if HAVE_[TYPE]` style checking is not correct.
Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`.
`AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h`
if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]`
(instead of defining as `0`) if it is not available. So even if we
need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`.
I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac`
since someone may use `HAVE_[TYPE]` macros somewhere.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
cuda buffer
the existing interface in opal_datatype_cuda do not allow to distinguish whether a
buffer is a managed or unmanaged cuda buffer. Add an interface that allows to
retrieve this information throug a convertor, since the information is actually available
in the mca_common_cuda_* routines.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Fixes issue #5069, which relates a BigMPI bug with the use of
MPI_Type_vectpor to construct very large datatypes (>2GB).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
return true if the datatype has non-negative displacements and
monotonically nondecreasing, and false otherwise.
Thanks George for the guidance.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
PSM2 enables support for GPU buffers and CUDA managed memory and it can
directly recognize GPU buffers, handle copies between HFIs and GPUs.
Therefore, it is not required for OMPI to handle GPU buffers for pt2pt cases.
In this patch, we allow the PSM2 MTL to specify when
it does not require CUDA convertor support. This allows us to skip CUDA
convertor init phases and lets PSM2 handle the memory transfers.
This translates to improvements in latency.
The patch enables blocking collectives and workloads with GPU contiguous,
GPU non-contiguous memory.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
* Don't overflow the internal datatype count.
Change the type of the count to be a size_t (it does not alter the total
size of the internal structures, so has no impact on the ABI).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Optimize the datatype creation.
The internal array of counts of predefined types is now only created
when needed, which is either in a heterogeneous environment, or when
one call get_elements. It saves space and makes the convertor creation a
little faster in some cases.
Rearrange the fields in the datatype description structs.
The macro OPAL_DATATYPE_INIT_PTYPES_ARRAY had a bug, and the
static array was only partially created. All predefined types should
have the ptypes array created and initialized.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Fix the boundary computation.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* test/datatype: add test for short unpack on heteregeneous cluster
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Trying to reduce the cost of creating a convertor.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Respect the unpack boundaries.
As Gilles suggested on #2535 the opal_unpack_general_function was
unpacking based on the requested count and not on the amount of packed
data provided.
Fixes#2535.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
since Open MPI now requires a C99, and ptrdiff_t type is part of C99,
there is no more need for the abstract OPAL_PTRDIFF_TYPE type.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>