In fcoll_two_phase_support_fns.c: the calculation of the aggregator index
failed for large offsets on 32-bit machines, due to improper handling of
64-bit offsets.
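As a standalone illustration of the failure mode (not the actual fcoll code), the sketch below shows how doing the index arithmetic in 32-bit int truncates offsets past 2GB, while keeping the arithmetic in a 64-bit offset type gives the expected aggregator index:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int64_t offset = (int64_t)5 * 1024 * 1024 * 1024;  /* 5 GB file offset */
        int64_t stripe = 1 << 20;                           /* bytes per aggregator chunk */
        int     n_aggr = 6;

        /* Buggy pattern: the offset is narrowed to 32 bits before the division,
         * so anything past 2 GB wraps (on common platforms) and the index is wrong. */
        int broken = (int)((int32_t)offset / (int32_t)stripe) % n_aggr;

        /* Fixed pattern: keep the arithmetic in the 64-bit offset type and only
         * narrow the final (small) index. */
        int fixed = (int)((offset / stripe) % n_aggr);

        printf("broken index = %d, fixed index = %d\n", broken, fixed);
        return 0;
    }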
Fixes Issue #7110
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
(cherry picked from commit ea1355beae)
Individual read/write operations exceeding 2GB fail in ompio
due to improper conversions from size_t to int in two different
locations. This commit fixes an issue reported by Richard Warren
from the HDF5 group.
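A minimal demonstration of the underlying conversion problem (independent of the actual ompio source layout): a byte count above 2GB no longer fits in an int, so any size_t-to-int narrowing corrupts the request length:

    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        size_t bytes = (size_t)3 * 1024 * 1024 * 1024;  /* a 3 GB read/write request */

        int as_int = (int)bytes;  /* buggy narrowing: wraps on platforms with 32-bit int */

        printf("size_t count: %zu\n", bytes);
        printf("int count   : %d\n", as_int);
        return 0;
    }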
Fixes Issue #7045
Cherry-picked from commit a130f569df
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Change the ncounts argument to MPI_Count and use
MPI_Status_set_elements_x for enabling read/write operations beyond
the 2GB limit.
Thanks to Richard Warren from the HDF5 group for reporting the issue
and providing the suggested fix for romio.
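A minimal sketch of the API usage (not the romio source itself), showing that MPI_Status_set_elements_x takes an MPI_Count and therefore preserves element counts beyond the 2GB int limit:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Status status;
        MPI_Count  nbytes = (MPI_Count)3 * 1024 * 1024 * 1024;  /* 3 GB transfer */

        /* MPI_Status_set_elements_x takes an MPI_Count (at least 64 bits), so
         * counts past the 2 GB int limit survive intact in the status object. */
        MPI_Status_set_elements_x(&status, MPI_BYTE, nbytes);

        MPI_Count out;
        MPI_Get_elements_x(&status, MPI_BYTE, &out);
        printf("elements recorded: %lld\n", (long long)out);

        MPI_Finalize();
        return 0;
    }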
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
(cherry picked from commit 8a3abbf803)
INTERNAL: STL-59403
The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.
This patch adds the missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.
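A minimal sketch of the idea (structure and helper names below are illustrative, not the actual MTL source): cache the provider limit reported in fi_info and reject sends that exceed it before they reach libfabric:

    #include <stddef.h>

    struct ofi_module_sketch {
        size_t max_msg_size;  /* taken from fi_info->ep_attr->max_msg_size */
    };

    static int ofi_send_checked(struct ofi_module_sketch *m,
                                const void *buf, size_t length)
    {
        if (length > m->max_msg_size)
            return -1;  /* would map to an MPI error in the real code */
        /* ... hand the message off to the libfabric send path here ... */
        (void)buf;
        return 0;
    }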
(cherry-picked from commit 3aca4af548)
Change-Id: Ie50445e5edfb0f30916de0836db0edc64ecf7c60
Signed-off-by: Michael Heinz <michael.william.heinz@intel.com>
Reviewed-by: Adam Goldman <adam.goldman@intel.com>
Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>
open-mpi/ompi@0fe756d416 introduced
a bug in the coll/hcoll component. The ompi_requests allocated by
libhcoll would be treated as coll_base_nbc_request during the
ompi_coll_base_retain_<> call, which would later lead to a
segv in the request cleanup.
Fix: since the libhcoll interface does not distinguish between
blocking and non-blocking requests, use coll_base_nbc_request all the
time and initialize it properly in
coll/hcoll/get_coll_handle(). It still fits within 2 cache lines.
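A schematic view of that fix with hypothetical field names (the real type is ompi_coll_base_nbc_request_t): if the generic cleanup always treats the request as the nbc type, every handle must be allocated as that type with its extra fields initialized, even for blocking calls:

    #include <stddef.h>

    struct base_request_sketch { int state; };

    struct nbc_request_sketch {
        struct base_request_sketch super;  /* must stay the first member */
        void (*release_cb)(void *);        /* fields read by the generic cleanup */
        void  *release_data;
    };

    static void init_nbc_request(struct nbc_request_sketch *req)
    {
        req->super.state  = 0;
        req->release_cb   = NULL;  /* cleanup checks this before dereferencing */
        req->release_data = NULL;
    }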
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
To avoid fully initializing the osc/ucx component for MPI applications
that do not use one-sided functionality, the initialization happens
at the first MPI window creation.
This commit ensures atomicity of global state modifications.
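A minimal sketch of the deferred, race-free initialization (not the osc/ucx source); pthread_once stands in for whatever atomic guard the component actually uses:

    #include <pthread.h>

    static pthread_once_t init_guard = PTHREAD_ONCE_INIT;

    static void component_global_init(void)
    {
        /* ... allocate workers, register progress hooks, etc. ... */
    }

    static void win_create_path(void)
    {
        /* Concurrent window creations run the global setup exactly once. */
        pthread_once(&init_guard, component_global_init);
        /* ... per-window setup continues here ... */
    }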
ported from: 6678ac0f55
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Fix alignment, and fix the error path.
A non-blocking collective might return ompi_request_null, so we should not
retain anything in that case.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@63d3ccde9d)
Since ompi_coll_base_nbc_request_t is to be used in an
opal_free_list_t, it must be returned into a "clean" state.
So clean up some data in the completion callback subroutines.
This fixes a regression introduced in open-mpi/ompi@0fe756d416
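A generic sketch of the pattern with hypothetical fields: an object recycled through a free list has to be scrubbed in its completion callback, otherwise the next borrower inherits stale callback pointers and retained handles:

    #include <stddef.h>

    struct pooled_request {
        void (*release_cb)(void *);
        void  *retained_op;
        void  *retained_dtype;
    };

    static void completion_cb(struct pooled_request *req)
    {
        /* release whatever was retained for this operation, then return the
         * object to a clean state before it goes back on the free list */
        req->release_cb     = NULL;
        req->retained_op    = NULL;
        req->retained_dtype = NULL;
    }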
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@0862c409f1)
base ompi_coll_libnbc_request_t on top of ompi_coll_base_nbc_request_t
to correctly support the retention of datatypes/operators
This fixes a regression introduced in open-mpi/ompi@0fe756d416
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@f8eef0fde9)
Use the linear-with-sync alltoall algorithm for certain message/communicator
size ranges. This does not affect the default fixed decision unless HPCX
(with its custom parameters) is used or the corresponding MCA parameter is set.
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 404c480068)
The MPI standard states that a user MPI_Op and/or user MPI_Datatype can be
freed after a call to a non-blocking collective and before the non-blocking
collective completes.
Retain the user (only) MPI_Op and MPI_Datatype when the non-blocking call is
invoked, and set a request callback so they are freed when the MPI_Request
completes.
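From the user's perspective, this is the behavior the commit makes safe (a small self-contained program using only standard MPI calls): the user-defined op is freed immediately after MPI_Iallreduce and before MPI_Wait:

    #include <mpi.h>
    #include <stdio.h>

    static void my_sum(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        int *a = in, *b = inout;
        for (int i = 0; i < *len; i++) b[i] += a[i];
        (void)dt;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int send = 1, recv = 0;
        MPI_Op op;
        MPI_Op_create(my_sum, 1, &op);

        MPI_Request req;
        MPI_Iallreduce(&send, &recv, 1, MPI_INT, op, MPI_COMM_WORLD, &req);

        /* Legal per the MPI standard: the user op may be freed before the
         * non-blocking collective completes; the retain/release added by this
         * commit is what makes it safe inside the library. */
        MPI_Op_free(&op);

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("result = %d\n", recv);

        MPI_Finalize();
        return 0;
    }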
Thanks to Thomas Ponweiser for reporting this.
Fixes open-mpi/ompi#2151
Fixes open-mpi/ompi#1304
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@0fe756d416)
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
Correctly bubble up errors in NBC collective operations
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
The error field of requests needs to be rearmed at start, not at create
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
(cherry picked from commit open-mpi/ompi@65660e5999)
In the common_ompi_aggregators calc_cost routine:
do not store the result of the real division in an intermediate int.
This patch removes the obsolete int variable c and assigns
the result of the P_a/P_x division directly to n_as.
With the intermediate int c variable, n_as gets 0 if P_a < P_x,
resulting in a division by 0 when computing n_s.
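A standalone illustration of the pattern being removed versus the fixed one (example values only):

    #include <stdio.h>

    int main(void)
    {
        double P_a = 2.0, P_x = 8.0;  /* example values with P_a < P_x */

        /* Buggy pattern: storing the real quotient in an int makes it 0
         * whenever P_a < P_x, so a later division by n_as blows up. */
        int    c        = P_a / P_x;
        double n_as_bad = c;

        /* Fixed pattern: assign the real quotient directly to n_as. */
        double n_as = P_a / P_x;

        printf("broken n_as = %g, fixed n_as = %g\n", n_as_bad, n_as);
        return 0;
    }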
Signed-off-by: Harald Klimach <harald.klimach@uni-siegen.de>
(cherry picked from commit e222a04ae5)
When using an empty fileview, a division-by-zero bug can occur in ompio. It is not entirely clear why the problem did not show up previously, but some recent changes trigger that bug in one of our tests.
This PR is part of a fix applied in commit f6b3a0a
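The shape of the guard, as a standalone illustration with made-up variable names (the actual fix lives in the ompio fileview handling):

    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        size_t total_bytes = 1048576;
        size_t view_len    = 0;  /* an empty fileview describes 0 bytes */

        /* Guard the divisor: with an empty view there is nothing to cycle
         * over, so fall back to 0 instead of dividing by zero. */
        size_t cycles = (view_len > 0) ? total_bytes / view_len : 0;

        printf("cycles = %zu\n", cycles);
        return 0;
    }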
Fixes Issue #6703
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
If the user sets HCOLL_EXTERNAL_UCM_EVENTS=1, then we try to initialize the
opal memory framework and register a memory release callback. Otherwise, rely on UCX.
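A sketch of that decision (the helper name is illustrative): check the environment variable and only hook up the opal memory callbacks when it is set:

    #include <stdlib.h>
    #include <string.h>

    static void maybe_register_mem_hooks(void)
    {
        const char *v = getenv("HCOLL_EXTERNAL_UCM_EVENTS");
        if (v != NULL && strcmp(v, "1") == 0) {
            /* initialize the opal memory framework and register a
             * memory-release callback here */
        }
        /* otherwise rely on UCX's own memory event handling */
    }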
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
- disable PML UCX in case multithreading is requested but not supported
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit a3578d9ece)
The atomic lock must progress the local worker while obtaining the remote
lock; otherwise, an active message which actually releases the lock might not
be processed while polling on the local memory location.
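A minimal sketch of the polling pattern (not the actual osc/ucx code): keep driving ucp_worker_progress while spinning on the local lock word, so the active message that releases the lock can be handled:

    #include <stdint.h>
    #include <ucp/api/ucp.h>

    static void acquire_lock_sketch(ucp_worker_h worker, volatile uint64_t *lock_word)
    {
        while (*lock_word != 0) {          /* lock still held remotely */
            ucp_worker_progress(worker);   /* without this, the release AM may never run */
        }
    }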
(picked from master 9d1994b)
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
The rdma_frag attached to the send request was not correctly released
upon request completion, leaking until MPI_Finalize. A quick solution
would have been to add RDMA_FRAG_RETURN at different locations on the
send request completion, but it would have unnecessarily made the
sendreq completion path more complex. Instead, I added the length to
the RDMA fragment so that it can be completed during the remote ack.
Be more explicit on the comment.
The rdma_frag can only be freed right away when the peer forced a protocol
change (from RDMA GET to send/recv). Otherwise the fragment will be
returned once all data pertaining to it has been transferred.
NOTE: Had to add a typedef for "opal_atomic_size_t" from master into
opal/threads/thread_usage.h into this cherry pick (it is in
opal/include/opal_stdatomic.h on master, but that file does not exist
here on the v4.0.x branch).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a16cf0e4dd)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
When using a btl_put in ob1, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the handle sent is necessarily the handle in the frag, but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request's one.
I suggest checking whether the handle in the fragment is valid and, if not,
sending the handle from the request.
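The selection logic, sketched with hypothetical structures (the real ones are the ob1 fragment and send request):

    #include <stddef.h>

    struct frag_sketch    { void *local_handle; };
    struct sendreq_sketch { void *local_handle; };

    /* Prefer the fragment's registration handle; fall back to the handle
     * registered on the request when the fragment carries none. */
    static void *pick_put_handle(const struct frag_sketch *frag,
                                 const struct sendreq_sketch *req)
    {
        return (frag->local_handle != NULL) ? frag->local_handle
                                            : req->local_handle;
    }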
Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
(cherry picked from commit e630046a4b)
The new routine transfers the data asynchronously from the source PE to all
PEs in the OpenSHMEM job. The routine returns immediately. The source and
target buffers are reusable only after the completion of the routine.
After the data is transferred to the target buffers, the counter object
is updated atomically. The counter object can be read either by using atomic
operations such as shmem_atomic_fetch or by using point-to-point
synchronization routines such as shmem_wait_until and shmem_test.
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 2ef5bd8b36)
There are a couple of MPI_Alltoallv calls in ad_gpfs_aggrs.c where the
send/recv data comes from places like req[r].lens, and the send
buffer and send displacements, for example, were being calculated as
    sbuf = pick one of the reqs: req[bottom].lens
    sdisps[r] = req[r].lens - req[bottom].lens
which might be okay if the .lens data lived inside req[], so the addresses
were all close to each other. But each .lens field is just a pointer
that's malloc'ed, so those addresses can be all over the place, and the
integer-sized sdisps[] isn't safe.
I changed it to use a new extra sbuf and rbuf array for those two
Alltoallv calls: the data is copied into the sbuf from the same
locations that used to be described by sdisps[], and after the
Alltoallv the data is copied out of the new rbuf into the same
locations that used to be described by rdisps[].
For what it's worth, I was able to get this to fail with -np 2 on a GPFS
filesystem with the hint romio_cb_write enabled. I didn't whittle the
test down to something small, but it was failing in an
MPI_File_write_all call.
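A simplified sketch of the fix (hypothetical helper; the real code operates on req[r].lens inside ad_gpfs_aggrs.c): pack the data into one contiguous send buffer so the int-sized displacements stay small and well defined:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    static void exchange_lens(int nprocs, int **lens, int count_per_rank,
                              int *out, MPI_Comm comm)
    {
        int *sbuf   = malloc((size_t)nprocs * count_per_rank * sizeof(int));
        int *counts = malloc((size_t)nprocs * sizeof(int));
        int *disps  = malloc((size_t)nprocs * sizeof(int));

        for (int r = 0; r < nprocs; r++) {
            counts[r] = count_per_rank;
            disps[r]  = r * count_per_rank;  /* small, contiguous displacement */
            memcpy(sbuf + disps[r], lens[r], count_per_rank * sizeof(int));
        }

        /* Symmetric layout on the receive side as well. */
        MPI_Alltoallv(sbuf, counts, disps, MPI_INT,
                      out,  counts, disps, MPI_INT, comm);

        free(sbuf); free(counts); free(disps);
    }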
Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit d85cac8f1a)