In fcoll_two_phase_support_fns.c: the calculation of the aggregator index
failed for large offsets on 32-bit machines, due to improper handling of
64-bit offsets.
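As a standalone illustration of the failure mode (not the actual fcoll code), the sketch below shows how doing the index arithmetic in 32-bit int truncates offsets past 2GB, while keeping the arithmetic in a 64-bit offset type gives the expected aggregator index:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int64_t offset = (int64_t)5 * 1024 * 1024 * 1024;  /* 5 GB file offset */
        int64_t stripe = 1 << 20;                           /* bytes per aggregator chunk */
        int     n_aggr = 6;

        /* Buggy pattern: the offset is narrowed to 32 bits before the division,
         * so anything past 2 GB wraps (on common platforms) and the index is wrong. */
        int broken = (int)((int32_t)offset / (int32_t)stripe) % n_aggr;

        /* Fixed pattern: keep the arithmetic in the 64-bit offset type and only
         * narrow the final (small) index. */
        int fixed = (int)((offset / stripe) % n_aggr);

        printf("broken index = %d, fixed index = %d\n", broken, fixed);
        return 0;
    }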
Fixes Issue #7110
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
(cherry picked from commit ea1355beae)
Individual read/write operations exceeding 2GB fail in ompio
due to improper conversions from size_t to int in two different
locations. This commit fixes an issue reported by Richard Warren
from the HDF5 group.
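A minimal demonstration of the underlying conversion problem (independent of the actual ompio source layout): a byte count above 2GB no longer fits in an int, so any size_t-to-int narrowing corrupts the request length:

    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        size_t bytes = (size_t)3 * 1024 * 1024 * 1024;  /* a 3 GB read/write request */

        int as_int = (int)bytes;  /* buggy narrowing: wraps on platforms with 32-bit int */

        printf("size_t count: %zu\n", bytes);
        printf("int count   : %d\n", as_int);
        return 0;
    }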
Fixes Issue #7045
Cherry-picked from commit a130f569df
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Change the ncounts argument to MPI_Count and use
MPI_Status_set_elements_x for enabling read/write operations beyond
the 2GB limit.
Thanks to Richard Warren from the HDF5 group for reporting the issue
and providing the suggested fix for romio.
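A minimal sketch of the API usage (not the romio source itself), showing that MPI_Status_set_elements_x takes an MPI_Count and therefore preserves element counts beyond the 2GB int limit:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Status status;
        MPI_Count  nbytes = (MPI_Count)3 * 1024 * 1024 * 1024;  /* 3 GB transfer */

        /* MPI_Status_set_elements_x takes an MPI_Count (at least 64 bits), so
         * counts past the 2 GB int limit survive intact in the status object. */
        MPI_Status_set_elements_x(&status, MPI_BYTE, nbytes);

        MPI_Count out;
        MPI_Get_elements_x(&status, MPI_BYTE, &out);
        printf("elements recorded: %lld\n", (long long)out);

        MPI_Finalize();
        return 0;
    }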
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
(cherry picked from commit 8a3abbf803)
INTERNAL: STL-59403
The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.
This patch adds the missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.
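A minimal sketch of the idea (structure and helper names below are illustrative, not the actual MTL source): cache the provider limit reported in fi_info and reject sends that exceed it before they reach libfabric:

    #include <stddef.h>

    struct ofi_module_sketch {
        size_t max_msg_size;  /* taken from fi_info->ep_attr->max_msg_size */
    };

    static int ofi_send_checked(struct ofi_module_sketch *m,
                                const void *buf, size_t length)
    {
        if (length > m->max_msg_size)
            return -1;  /* would map to an MPI error in the real code */
        /* ... hand the message off to the libfabric send path here ... */
        (void)buf;
        return 0;
    }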
(cherry-picked from commit 3aca4af548)
Change-Id: Ie50445e5edfb0f30916de0836db0edc64ecf7c60
Signed-off-by: Michael Heinz <michael.william.heinz@intel.com>
Reviewed-by: Adam Goldman <adam.goldman@intel.com>
Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>
open-mpi/ompi@0fe756d416 introduced
a bug in the coll/hcoll component. The ompi_requests allocated by
libhcoll would be treated as coll_base_nbc_request during the
ompi_coll_base_retain_<> call, which would later lead to a
segv in the request cleanup.
Fix: since the libhcoll interface does not distinguish between
blocking and non-blocking requests, use coll_base_nbc_request all the
time and initialize it properly in
coll/hcoll/get_coll_handle(). It still fits within 2 cache lines.
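A schematic view of that fix with hypothetical field names (the real type is ompi_coll_base_nbc_request_t): if the generic cleanup always treats the request as the nbc type, every handle must be allocated as that type with its extra fields initialized, even for blocking calls:

    #include <stddef.h>

    struct base_request_sketch { int state; };

    struct nbc_request_sketch {
        struct base_request_sketch super;  /* must stay the first member */
        void (*release_cb)(void *);        /* fields read by the generic cleanup */
        void  *release_data;
    };

    static void init_nbc_request(struct nbc_request_sketch *req)
    {
        req->super.state  = 0;
        req->release_cb   = NULL;  /* cleanup checks this before dereferencing */
        req->release_data = NULL;
    }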
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
To avoid fully initializing the osc/ucx component for MPI applications
that do not use one-sided functionality, the initialization happens
at the first MPI window creation.
This commit ensures atomicity of global state modifications.
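A minimal sketch of the deferred, race-free initialization (not the osc/ucx source); pthread_once stands in for whatever atomic guard the component actually uses:

    #include <pthread.h>

    static pthread_once_t init_guard = PTHREAD_ONCE_INIT;

    static void component_global_init(void)
    {
        /* ... allocate workers, register progress hooks, etc. ... */
    }

    static void win_create_path(void)
    {
        /* Concurrent window creations run the global setup exactly once. */
        pthread_once(&init_guard, component_global_init);
        /* ... per-window setup continues here ... */
    }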
ported from: 6678ac0f55
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Fix alignment, and fix the error path.
A non-blocking collective might return ompi_request_null, so we should not
retain anything in that case.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@63d3ccde9d)
Since ompi_coll_base_nbc_request_t is to be used in an
opal_free_list_t, it must be returned into a "clean" state.
So clean up some data in the completion callback subroutines.
This fixes a regression introduced in open-mpi/ompi@0fe756d416
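A generic sketch of the pattern with hypothetical fields: an object recycled through a free list has to be scrubbed in its completion callback, otherwise the next borrower inherits stale callback pointers and retained handles:

    #include <stddef.h>

    struct pooled_request {
        void (*release_cb)(void *);
        void  *retained_op;
        void  *retained_dtype;
    };

    static void completion_cb(struct pooled_request *req)
    {
        /* release whatever was retained for this operation, then return the
         * object to a clean state before it goes back on the free list */
        req->release_cb     = NULL;
        req->retained_op    = NULL;
        req->retained_dtype = NULL;
    }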
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@0862c409f1)
base ompi_coll_libnbc_request_t on top of ompi_coll_base_nbc_request_t
to correctly support the retention of datatypes/operators
This fixes a regression introduced in open-mpi/ompi@0fe756d416
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@f8eef0fde9)
Use the linear-with-sync alltoall algorithm for certain message/communicator
size ranges. This does not affect the default fixed decision unless HPCX
(with its custom parameters) is used or the corresponding MCA parameter is set.
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 404c480068)
The MPI standard states that a user MPI_Op and/or user MPI_Datatype can be
freed after a call to a non-blocking collective and before the non-blocking
collective completes.
Retain the user (only) MPI_Op and MPI_Datatype when the non-blocking call is
invoked, and set a request callback so they are freed when the MPI_Request
completes.
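From the user's perspective, this is the behavior the commit makes safe (a small self-contained program using only standard MPI calls): the user-defined op is freed immediately after MPI_Iallreduce and before MPI_Wait:

    #include <mpi.h>
    #include <stdio.h>

    static void my_sum(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        int *a = in, *b = inout;
        for (int i = 0; i < *len; i++) b[i] += a[i];
        (void)dt;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int send = 1, recv = 0;
        MPI_Op op;
        MPI_Op_create(my_sum, 1, &op);

        MPI_Request req;
        MPI_Iallreduce(&send, &recv, 1, MPI_INT, op, MPI_COMM_WORLD, &req);

        /* Legal per the MPI standard: the user op may be freed before the
         * non-blocking collective completes; the retain/release added by this
         * commit is what makes it safe inside the library. */
        MPI_Op_free(&op);

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("result = %d\n", recv);

        MPI_Finalize();
        return 0;
    }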
Thanks to Thomas Ponweiser for reporting this.
Fixes open-mpi/ompi#2151
Fixes open-mpi/ompi#1304
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@0fe756d416)
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
Correctly bubble up errors in NBC collective operations
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
The error field of requests needs to be rearmed at start, not at create
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
(cherry picked from commit open-mpi/ompi@65660e5999)
In the common_ompi_aggregators calc_cost routine:
do not store the result of the real division in an intermediate int.
This patch removes the obsolete int variable c and assigns
the result of the P_a/P_x division directly to n_as.
With the intermediate int c variable, n_as gets 0 if P_a < P_x,
resulting in a division by 0 when computing n_s.
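A standalone illustration of the pattern being removed versus the fixed one (example values only):

    #include <stdio.h>

    int main(void)
    {
        double P_a = 2.0, P_x = 8.0;  /* example values with P_a < P_x */

        /* Buggy pattern: storing the real quotient in an int makes it 0
         * whenever P_a < P_x, so a later division by n_as blows up. */
        int    c        = P_a / P_x;
        double n_as_bad = c;

        /* Fixed pattern: assign the real quotient directly to n_as. */
        double n_as = P_a / P_x;

        printf("broken n_as = %g, fixed n_as = %g\n", n_as_bad, n_as);
        return 0;
    }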
Signed-off-by: Harald Klimach <harald.klimach@uni-siegen.de>
(cherry picked from commit e222a04ae5)
When using an empty fileview, a division-by-zero bug can occur in ompio. It is not entirely clear why the problem did not show up previously, but some recent changes trigger that bug in one of our tests.
This PR is part of a fix applied in commit f6b3a0a
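The shape of the guard, as a standalone illustration with made-up variable names (the actual fix lives in the ompio fileview handling):

    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        size_t total_bytes = 1048576;
        size_t view_len    = 0;  /* an empty fileview describes 0 bytes */

        /* Guard the divisor: with an empty view there is nothing to cycle
         * over, so fall back to 0 instead of dividing by zero. */
        size_t cycles = (view_len > 0) ? total_bytes / view_len : 0;

        printf("cycles = %zu\n", cycles);
        return 0;
    }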
Fixes Issue #6703
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
If the user sets HCOLL_EXTERNAL_UCM_EVENTS=1, then we try to initialize the
opal memory framework and register a memory release callback. Otherwise, rely on UCX.
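A sketch of that decision (the helper name is illustrative): check the environment variable and only hook up the opal memory callbacks when it is set:

    #include <stdlib.h>
    #include <string.h>

    static void maybe_register_mem_hooks(void)
    {
        const char *v = getenv("HCOLL_EXTERNAL_UCM_EVENTS");
        if (v != NULL && strcmp(v, "1") == 0) {
            /* initialize the opal memory framework and register a
             * memory-release callback here */
        }
        /* otherwise rely on UCX's own memory event handling */
    }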
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
- disable PML UCX in case multithreading is requested but not supported
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit a3578d9ece)
The atomic lock must progress the local worker while obtaining the remote
lock; otherwise, an active message which actually releases the lock might not
be processed while polling on the local memory location.
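A minimal sketch of the polling pattern (not the actual osc/ucx code): keep driving ucp_worker_progress while spinning on the local lock word, so the active message that releases the lock can be handled:

    #include <stdint.h>
    #include <ucp/api/ucp.h>

    static void acquire_lock_sketch(ucp_worker_h worker, volatile uint64_t *lock_word)
    {
        while (*lock_word != 0) {          /* lock still held remotely */
            ucp_worker_progress(worker);   /* without this, the release AM may never run */
        }
    }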
(picked from master 9d1994b)
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
The rdma_frag attached to the send request was not correctly released
upon request completion, leaking until MPI_Finalize. A quick solution
would have been to add RDMA_FRAG_RETURN at different locations on the
send request completion, but it would have unnecessarily made the
sendreq completion path more complex. Instead, I added the length to
the RDMA fragment so that it can be completed during the remote ack.
Be more explicit on the comment.
The rdma_frag can only be freed right away when the peer forced a protocol
change (from RDMA GET to send/recv). Otherwise the fragment will be
returned once all data pertaining to it has been transferred.
NOTE: Had to add a typedef for "opal_atomic_size_t" from master into
opal/threads/thread_usage.h into this cherry pick (it is in
opal/include/opal_stdatomic.h on master, but that file does not exist
here on the v4.0.x branch).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a16cf0e4dd)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
When using a btl_put in ob1, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the handle sent is necessarily the handle in the frag, but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request's one.
I suggest checking whether the handle in the fragment is valid and, if not,
sending the handle from the request.
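The selection logic, sketched with hypothetical structures (the real ones are the ob1 fragment and send request):

    #include <stddef.h>

    struct frag_sketch    { void *local_handle; };
    struct sendreq_sketch { void *local_handle; };

    /* Prefer the fragment's registration handle; fall back to the handle
     * registered on the request when the fragment carries none. */
    static void *pick_put_handle(const struct frag_sketch *frag,
                                 const struct sendreq_sketch *req)
    {
        return (frag->local_handle != NULL) ? frag->local_handle
                                            : req->local_handle;
    }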
Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
(cherry picked from commit e630046a4b)
The new routine transfers the data asynchronously from the source PE to all
PEs in the OpenSHMEM job. The routine returns immediately. The source and
target buffers are reusable only after the completion of the routine.
After the data is transferred to the target buffers, the counter object
is updated atomically. The counter object can be read either by using atomic
operations such as shmem_atomic_fetch or by using point-to-point
synchronization routines such as shmem_wait_until and shmem_test.
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 2ef5bd8b36)
There are a couple of MPI_Alltoallv calls in ad_gpfs_aggrs.c where the
send/recv data comes from places like req[r].lens, and the send
buffer and send displacements, for example, were being calculated as
    sbuf = pick one of the reqs: req[bottom].lens
    sdisps[r] = req[r].lens - req[bottom].lens
which might be okay if the .lens data lived inside req[], so the addresses
were all close to each other. But each .lens field is just a pointer
that's malloc'ed, so those addresses can be all over the place, and the
integer-sized sdisps[] isn't safe.
I changed it to use a new extra sbuf and rbuf array for those two
Alltoallv calls: the data is copied into the sbuf from the same
locations that used to be described by sdisps[], and after the
Alltoallv the data is copied out of the new rbuf into the same
locations that used to be described by rdisps[].
For what it's worth, I was able to get this to fail with -np 2 on a GPFS
filesystem with the hint romio_cb_write enabled. I didn't whittle the
test down to something small, but it was failing in an
MPI_File_write_all call.
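A simplified sketch of the fix (hypothetical helper; the real code operates on req[r].lens inside ad_gpfs_aggrs.c): pack the data into one contiguous send buffer so the int-sized displacements stay small and well defined:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    static void exchange_lens(int nprocs, int **lens, int count_per_rank,
                              int *out, MPI_Comm comm)
    {
        int *sbuf   = malloc((size_t)nprocs * count_per_rank * sizeof(int));
        int *counts = malloc((size_t)nprocs * sizeof(int));
        int *disps  = malloc((size_t)nprocs * sizeof(int));

        for (int r = 0; r < nprocs; r++) {
            counts[r] = count_per_rank;
            disps[r]  = r * count_per_rank;  /* small, contiguous displacement */
            memcpy(sbuf + disps[r], lens[r], count_per_rank * sizeof(int));
        }

        /* Symmetric layout on the receive side as well. */
        MPI_Alltoallv(sbuf, counts, disps, MPI_INT,
                      out,  counts, disps, MPI_INT, comm);

        free(sbuf); free(counts); free(disps);
    }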
Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit d85cac8f1a)