openmpi

Author	SHA1	Message	Date
Howard Pritchard	31aa52f11a	Merge pull request #6846 from nysal/topic/v4.0.x/ucx_accumulate_fix v4.0.x: osc/ucx: Fix data corruption with non-contiguous accumulates	2019-08-02 12:43:40 -06:00
Nysal Jan K.A	359cdf2b53	osc/ucx: Fix data corruption with non-contiguous accumulates Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com> (cherry picked from commit 3529d447020684ab305411caa97423826bb40906)	2019-07-26 14:41:08 +05:30
Mikhail Brinskii	b9998a14dc	COLL/TUNED: Minor var names/comments fixes Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com> (cherry picked from commit 65618f8db848613c95cbe112033df94721d326a8)	2019-07-26 11:29:12 +03:00
Mikhail Brinskii	3d5b7b4a1b	COLL/TUNED: Update alltoall selection rule for mlx Use linear with sync alltoall algorithm for certain message/comm size ranges. Does not affect default fixed decision, unless HPCX (with its custom parameters) is used or corresponding mca is set. Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com> (cherry picked from commit 404c4800688548b021bda68bdf10792424e6b1c5)	2019-07-26 11:28:47 +03:00
Howard Pritchard	667aba9913	Merge pull request #6810 from janjust/v4.0.x v4.0.x OSC: Reset external request to NULL	2019-07-23 09:05:03 -06:00
Tomislav Janjusic	63605fc466	v4.0.x OSC: Reset external request to NULL to avoid double request completion Co-authored with Artem Polyakov <artemp@mellanox.com> Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>	2019-07-12 22:49:34 +03:00
Gilles Gouaillardet	c9e4240e70	mpi: retain operation and datatype in non blocking collectives MPI standard states a user MPI_Op and/or user MPI_Datatype can be free'd after a call to a non blocking collective and before the non-blocking collective completes. Retain user (only) MPI_Op and MPI_Datatype when the non blocking call is invoked, and set a request callback so they are free'd when the MPI_Request completes. Thanks Thomas Ponweiser for reporting this Fixes open-mpi/ompi#2151 Fixes open-mpi/ompi#1304 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@0fe756d416)	2019-07-12 10:27:04 +09:00
Aurelien Bouteiller	9499dcfe41	Manage errors in NBC collective ops Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu> Correctly bubble up errors in NBC collective operations Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu> The error field of requests needs to be rearmed at start, not at create Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu> (cherry picked from commit open-mpi/ompi@65660e5999)	2019-07-12 10:26:08 +09:00
Nysal Jan K.A	b6da090090	pml/ucx: Fix the max tag and context id values Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com> (cherry picked from commit fe4ef147f81b2ac56661175005de6c330eace690)	2019-07-03 16:38:07 +03:00
Geoff Paulsen	514e273968	Merge pull request #6770 from devreal/osc_winalloc_err_v4.0.x OSC rdma win allocate: propagate errors to avoid deadlocks (v4.0.x)	2019-06-28 14:04:12 -05:00
Howard Pritchard	6424857029	Merge pull request #6634 from jsquyres/pr/v4.0.x/ob1-fixes v4.0.x: Cherry pick ob1 fixes from master	2019-06-26 10:49:32 -06:00
Harald Klimach	16e1d74c8f	Suggestion to fix division by zero in file view. In common_ompi_aggregators calc_cost routine: do not cast the real division to an int intermediately. This patch removes the obsolete int variable c and assigns the result of the P_a/P_x division directly to n_as. With the intermediate int c variable, n_as gets 0 if P_a < P_x, resulting in a division by 0 when computing n_s. Signed-off-by: Harald Klimach <harald.klimach@uni-siegen.de> (cherry picked from commit e222a04ae57e5d09b8559f3c111de1f10a47246a)	2019-06-25 09:29:08 -06:00
Joseph Schuchart	c5cf3432b9	OSC rdma win allocate: synchronize error codes across shared memory group Signed-off-by: Joseph Schuchart <schuchart@hlrs.de> (cherry picked from commit 8f27cc26d9845b5b207979b2a4621ef1089d1afb)	2019-06-24 17:49:26 +02:00
Howard Pritchard	73c4aac12d	Merge pull request #6750 from brminich/topic/all2all_linear_sync_fix_v4.0 COLL/BASE: Fix linear sync all2all - v4.0.x	2019-06-17 13:45:52 -06:00
Mikhail Brinskii	adba7f55f7	COLL/BASE: Fix linear sync all2all Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com> (cherry picked from commit 79006f4e5a578d32bfa08de7b98e747ae18706f6)	2019-06-09 21:31:19 +03:00
Joseph Schuchart	900f0fa21f	OSC rdma: make sure accumulating in shared memory is safe Signed-off-by: Joseph Schuchart <schuchart@hlrs.de> (cherry picked from commit c67e2291937a09947c421dc84c6b3a8d07bec07f)	2019-06-07 12:45:00 +02:00
Geoff Paulsen	a04f5f0c70	Merge pull request #6692 from vspetrov/v4.0.x V4.0.x Coll/hcoll: don't init opal memhooks unless explicitely requested	2019-06-03 15:00:36 -05:00
Howard Pritchard	e78851a6c7	Merge pull request #6704 from edgargabriel/pr/v4.0.x-empty-fileview-fix common/ompio: fix division by zero problem with empty fview	2019-05-26 09:45:52 -06:00
Howard Pritchard	386ed07d54	Merge pull request #6689 from hoopoepg/topic/suppressed-pml-ucx-mt-warning-v4.0 PML/UCX: disable PML UCX if MT is requested but not supported - v4.0	2019-05-26 09:44:05 -06:00
Edgar Gabriel	c7250cd11d	common/ompio: fix division by zero problem with empty fview When using an empty fileview, a division by zero bug can occur in ompio. Not entirely sure why the problem did not show up previously, but some recent changes trigger that bug in one of our tests. This pr is part of a fix applied in commit f6b3a0a Fixes Issue #6703 Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-05-23 13:48:57 -05:00
Valentin Petrov	8f82c899bc	Coll/hcoll: don't init opal memhooks unless explicitely requested by user If user sets HCOLL_EXTERNAL_UCM_EVENTS=1 then we try init opal memory framework and register a mem release cb. Otherwise, rely on ucx. Signed-off-by: Valentin Petrov <valentinp@mellanox.com>	2019-05-20 14:00:50 +03:00
Sergey Oblomov	1edd36638b	PML/UCX: disable PML UCX if MT is requested but not supported - in case if multithreading requested but not supported disable PML UCX Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit a3578d9ece2b40a349529e7b223df50b0aac64aa)	2019-05-20 09:59:59 +03:00
Yossi Itigin	4f9fb3e9ce	OSC/UCX: Fix deadlock with atomic lock Atomic lock must progress local worker while obtaining the remote lock, otherwise an active message which actually releases the lock might not be processed while polling on local memory location. (picked from master 9d1994b) Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2019-05-20 09:54:01 +03:00
George Bosilca	4946570b24	Remove few warnings identified by @rhc in #5514 . Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit open-mpi/ompi@6d11a45f44)	2019-05-11 16:38:31 +09:00
George Bosilca	48f824327c	Fix the leak of fragments for persistent sends. The rdma_frag attached to the send request was not correctly released upon request completion, leaking until MPI_Finalize. A quick solution would have been to add RDMA_FRAG_RETURN at different locations on the send request completion, but it would have unnecessarily made the sendreq completion path more complex. Instead, I added the length to the RDMA fragment so that it can be completed during the remote ack. Be more explicit on the comment. The rdma_frag can only be freed once when the peer forced a protocol change (from RDMA GET to send/recv). Otherwise the fragment will be returned once all data pertaining to it has been trasnferred. NOTE: Had to add a typedef for "opal_atomic_size_t" from master into opal/threads/thread_usage.h into this cherry pick (it is in opal/include/opal_stdatomic.h on master, but that file does not exist here on the v4.0.x branch). Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit a16cf0e4dd6df4dea820fecedd5920df632935b8) Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-05-03 06:20:02 -07:00
Brelle Emmanuel	c44821aef5	pml/ob1: fixed local handle sent during PUT control message In case of using a btl_put in ob1, the handle of the locally registered memory is sent with a PUT control message. In the current master code the sent handle is necessary the handle in the frag but if the handle has been successfully registered in the request, the frag structure does not have any valid handle and all fragments use the request one. I suggest to check if the handle in the fragment is valid and if not to send the handle from the request. Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net> (cherry picked from commit e630046a4b82bc01379fb055af4c0e414c2a8e8f)	2019-05-03 05:53:35 -07:00
Mikhail Brinskii	e4ee56d1f3	SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h The new routine transfers the data asynchronously from the source PE to all PEs in the OpenSHMEM job. The routine returns immediately. The source and target buffers are reusable only after the completion of the routine. After the data is transferred to the target buffers, the counter object is updated atomically. The counter object can be read either using atomic operations such as shmem_atomic_fetch or can use point-to-point synchronization routines such as shmem_wait_until and shmem_test. Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com> (cherry picked from commit 2ef5bd8b3671f1e10caf00d06d66d120eac9c5be)	2019-05-02 21:25:59 +03:00
Howard Pritchard	41ef5c7a10	Merge pull request #6594 from vspetrov/osc_ucx_rget_rkey_fix OSC/UCX: use correct rkey for atomic_fadd in rget/rput	2019-05-01 11:53:17 -06:00
Mark Allen	c081757462	fixing an unsafe usage of integer disps[] (romio321 gpfs) There are a couple MPI_Alltoallv calls in ad_gpfs_aggrs.c where the send/recv data comes from places like req[r].lens, and the send buffer and send displacements for example were being calculated as sbuf = pick one of the reqs: req[bottom].lens sdisps[r] = req[r].lens - req[bottom].lens which might be okay if the .lens was data inside of req[] so they'd all be close to each other. But each .lens field is just a pointer that's malloced, so those addresses can be all over the place, so the integer-sized sdisps[] isn't safe. I changed it to have a new extra array sbuf and rbuf for those two Alltoallv calls, and copied the data into the sbuf from the same locations it used to be setting up the sdisps[] at, and after the Alltoallv I copy the data out of the new rbuf into the same locations it used to be setting up the rdisps[] at. For what it's worth I was able to get this to fail -np 2 on a GPFS filesystem with hints romio_cb_write enable. I didn't whittle the test down to something small, but it was failing in an MPI_File_write_all call. Signed-off-by: Mark Allen <markalle@us.ibm.com> (cherry picked from commit d85cac8f1a11495415b67ecab69d2ae1cd19d155)	2019-04-25 14:22:19 -04:00
Brelle Emmanuel	2a4bc0cb58	pml/ob1: fixed exit from get_frag_fail when falling back on btl_put In the case the btl_get fails Ob1 tries to fallback on btl_put first but the return code was ignored. So the code fell back on both btl_put and btl_send. Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net> (cherry picked from commit 9c689f2225d29aa152627f39bab841afead254af)	2019-04-22 14:25:34 -07:00
Valentin Petrov	2947ab2dbc	OSC/UCX: correctly handle NULL origin addr and MPI_NO_OP Signed-off-by: Valentin Petrov <valentinp@mellanox.com>	2019-04-17 10:35:34 +03:00
Valentin Petrov	68c88e86f2	OSC/UCX: use correct rkey for atomic_fadd in rget/rput Signed-off-by: Valentin Petrov <valentinp@mellanox.com>	2019-04-16 15:24:57 +03:00
Thananon Patinyasakdikul	5999fdad5a	pml/ob1: fix deadlock with communicator flag ALLOW_OVERTAKE. We missed an assert to check if ALLOW_OVERTAKE is set or not before validating the sequence number and this will cause deadlock. Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu> (cherry picked from commit 0263456cf4e99efc67d38acd100cf948e0399d63)	2019-04-09 11:24:24 -07:00
Sergey Oblomov	14c271f993	PML/SPML/UCX: added evaluation of mmap events - there was a set of UCX related issues reported which caused by mmap API hooks conflicts. We added diagnostic of such problems to simplify bug-resolving pipeline Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit d8e3562bae700d84873c1d5ca9c45c846d7387ed)	2019-03-14 16:48:25 +02:00
Howard Pritchard	7aeb65579b	Merge pull request #6395 from brminich/topic/ucx_net_waddr_4.0.x PML/UCX: Use net worker address for remote peers - v4.0.x	2019-02-21 20:29:47 -07:00
Mikhail Brinskii	1c514948f6	PML/UCX: Use net worker address for remote peers For remote node peers pack smaller worker address, which contains network device addresses only. This would reduce amount of OOB traffic during startup. Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com> (cherry picked from commit 751d88192d05edb7e1912bab4e48643c6f9e1574)	2019-02-21 16:58:20 +02:00
Gilles Gouaillardet	749f51845b	osc/rdma: correctly handle communications to self mark the "self" peer OMPI_OSC_RDMA_PEER_LOCAL_BASE when the window is dynamically created and use_cpu_atomics is set in order to correctly handle communications to self. Thanks Bart Janssens for reporting this issue. Refs. open-mpi/ompi#6394 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (back-ported from commit open-mpi/ompi@fe05fcc11a)	2019-02-20 13:06:05 +09:00
Howard Pritchard	0b915b7e56	Merge pull request #6333 from jsquyres/pr/v4.0.x/hwloc-macro-conflict-fixes v4.0.x: Various minor hwloc cleanups	2019-02-12 09:13:19 -07:00
Howard Pritchard	5dd63405ce	Merge pull request #6368 from jsquyres/pr/v4.0.x/fix-ofi-configury v4.0.x: fix OFI configury	2019-02-11 13:15:52 -07:00
Jeff Squyres	9ad871fc38	ofi: revamp OPAL_CHECK_OFI configury Update the OPAL_CHECK_OFI configury macro: - Make it safe to call the macro multiple times: - The checks only execute the first time it is invoked - Subsequent invocations, it just emits a friendly "checking..." message so that configure output is sensible/logical - With the goal of ultimately removing opal/mca/common/ofi, rename the output variables from OPAL_CHECK_OFI to be opal_ofi_{happy\|CPPFLAGS\|LDFLAGS\|LIBS}. - Update btl/usnic and mtl/ofi for these new conventions. - Also, don't use AC_REQUIRE to invoke OPAL_CHECK_OFI because that causes the macro to be invoked at a fairly random time, which makes configure stdout confusing / hard to grok. - Remove a little left-over kruft in OPAL_CHECK_OFI, too (which resulted in an indenting change, making the change to opal_check_ofi.m4 look larger than it really is). Thanks Alastair McKinstry for the report and initial fix. Thanks Rashika Kheria for the reminder. Updated from master cherry pick: the OFI BTL does not exist on the v4.0.x branch. Therefore, did not include the OFI BTL changes on master in this cherry pick. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit f5e1a672ccd5db127e85e1e8f6bcfeb8a8b04527)	2019-02-07 06:36:35 -08:00
René Widera	e30e5b95c6	common/ompio: possible rounding issue Similar to #6286 rounding number of bytes into a single precision floating point value to round up the result of a division is a potential risk due to rounding errors. - remove floating point operations for `round up` - removes floating point conversion for round down (native behavior of integer division) Signed-off-by: René Widera <r.widera@hzdr.de> (cherry picked from commit a91fab80a1e55e1df15f649e18d247e5d4654eb9)	2019-01-30 12:31:39 -06:00
Edgar Gabriel	d1e8779fe3	common/ompio: fix a floating point division problem This commit fixes a problem reported on the mailing list with individual writes larger than 512 MB. The culprit is a floating point division of two large, close values. Changing the datatypes from float to double (which is what is being used in the fcoll components) fixes the problem. See issue #6285 and https://forum.hdfgroup.org/t/cannot-write-more-than-512-mb-in-1d/5118 Thanks for Axel Huebl and René Widera for reporting the issue. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu> (cherry picked from commit c0f8ce0fff4684b670135043dd150abc9d83d988)	2019-01-30 12:31:16 -06:00
Gilles Gouaillardet	a247292275	topo/treematch: silence a hwloc related warning treematch/km_partitioning.c #include "config.h", but there is no such file when the embedded treematch is used. In order to prevent the embedded treematch from incorrectly using the config.h from the embedded hwloc, generate a dummy config.h. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit 0aeb27f77650d3ee97e17e770c9e5aa487d5e1f5)	2019-01-30 07:33:33 -05:00
Jeff Squyres	1a1a932acc	romio321: ensure to distribute ompi_grequestx.h Refs https://github.com/open-mpi/ompi/issues/6227. Thanks to George Marselis for reporting. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 62321be186dd7d3efcedc2e801f226f6660ea0c4)	2018-12-28 13:18:10 -08:00
Matias A Cabral	b2327049c1	MTL/PSM2: add missing default priority Missing default priority after PR #6153 Signed-off-by: Matias Cabral <matias.a.cabral@intel.com> (cherry picked from commit c76c6d8b2801ca43ba33168a0b92522786c7c5bb)	2018-12-07 16:22:59 -08:00
Matias A Cabral	80113a368f	MTL/PSM2: Do not lower the priority when all processes are local. The intention of lowering the priority when all processes are local was to favor Vader BTL. However, in builds including the OFI MTL it gets selected instead. Reviewed-by: Spruit, Neil R <neil.r.spruit@intel.com> Reviewed-by: Gopalakrishnan, Aravind <aravind.gopalakrishnan@intel.com> Signed-off-by: Matias Cabral <matias.a.cabral@intel.com> (cherry picked from commit fc8582c5606b7a3d1b711f8f7b6144808290a48f)	2018-12-07 11:11:43 -08:00
Geoff Paulsen	752bbd195f	Merge pull request #6102 from hoopoepg/topic/set-osc-ucx-level-200-v4.0 OSC: set UCX module used by default - v4.0	2018-12-04 10:26:37 -06:00
Geoff Paulsen	bd2990f502	Merge pull request #6131 from devreal/rdma-plug-memleak-v4.0.x v4.0.x: Plug two memory leaks in rdma osc	2018-11-30 13:54:51 -06:00
Joseph Schuchart	c5346751e6	Plug two memory leaks in rdma osc Signed-off-by: Joseph Schuchart <schuchart@hlrs.de> (cherry picked from commit 91885f5876129aa4fb43ed4b3404c9d1ca7e08b8)	2018-11-29 10:19:26 -05:00
Sergey Oblomov	6651672711	OSC/UCX: set max level value to 60 Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit 2d230b3aacce0185f0d46e69f608071b670eeb3c)	2018-11-27 20:35:30 +02:00

6819 Commits