openmpi

Author	SHA1	Message	Date
Yossi Itigin	98d0ecfe14	Merge pull request #6814 from brminich/tuned_all2all_select COLL/TUNED: Update alltoall selection rule for mellanox platform	2019-07-25 17:51:55 +03:00
Mikhail Brinskii	65618f8db8	COLL/TUNED: Minor var names/comments fixes Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>	2019-07-24 10:23:38 +00:00
Nysal Jan K.A	14808922cf	osc/ucx: Add support for the no_locks info key Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>	2019-07-18 17:29:01 +05:30
Mikhail Brinskii	404c480068	COLL/TUNED: Update alltoall selection rule for mlx Use linear with sync alltoall algorithm for certain message/comm size ranges. Does not affect default fixed decision, unless HPCX (with its custom parameters) is used or corresponding mca is set. Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>	2019-07-13 23:27:40 +03:00
Gilles Gouaillardet	0fe756d416	mpi: retain operation and datatype in non blocking collectives MPI standard states a user MPI_Op and/or user MPI_Datatype can be free'd after a call to a non blocking collective and before the non-blocking collective completes. Retain user (only) MPI_Op and MPI_Datatype when the non blocking call is invoked, and set a request callback so they are free'd when the MPI_Request completes. Thanks Thomas Ponweiser for reporting this Fixes open-mpi/ompi#2151 Fixes open-mpi/ompi#1304 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-07-12 09:15:45 +09:00
Nysal Jan K.A	fe4ef147f8	pml/ucx: Fix the max tag and context id values Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>	2019-07-03 14:33:01 +05:30
Geoff Paulsen	f1b2a09675	Merge pull request #6649 from devreal/rdma-fetchop-local OSC rdma: make sure accumulating in shared memory is safe	2019-06-28 14:46:21 -05:00
Artem Polyakov	6678ac0f55	osc/ucx: Fix possible win creation/destruction race condition To avoid fully initializing the osc/ucx component for MPI application that are not using One-Sided functionality, the initialization happens at the first MPI window creation. This commit ensures atomicity of global state modifications. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-06-20 09:05:03 -07:00
Artem Polyakov	0857742624	osc/ucx: Fix worker pool finalization Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-06-20 09:05:03 -07:00
Nathan Hjelm	560886f095	Merge pull request #6746 from devreal/osc_winalloc_err OSC rdma win allocate: propagate errors to avoid deadlocks	2019-06-18 17:57:53 -07:00
Harald Klimach	e222a04ae5	Suggestion to fix division by zero in file view. In common_ompi_aggregators calc_cost routine: do not cast the real division to an int intermediately. This patch removes the obsolete int variable c and assigns the result of the P_a/P_x division directly to n_as. With the intermediate int c variable, n_as gets 0 if P_a < P_x, resulting in a division by 0 when computing n_s. Signed-off-by: Harald Klimach <harald.klimach@uni-siegen.de>	2019-06-13 18:47:32 +02:00
Jeff Squyres	7c3aeb3061	Merge pull request #6686 from alex-anenkov/coll-iallreduce-recursivedoubling coll/libnbc: add recursive doubling algorithm for MPI_Iallreduce	2019-06-10 10:09:51 -04:00
Yossi Itigin	a46e5da3ca	Merge pull request #6744 from brminich/topic/all2all_linear_sync_fix COLL/BASE: Fix linear sync all2all	2019-06-09 21:23:38 +03:00
Joseph Schuchart	8f27cc26d9	OSC rdma win allocate: synchronize error codes across shared memory group Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2019-06-07 11:03:21 +02:00
Mikhail Brinskii	79006f4e5a	COLL/BASE: Fix linear sync all2all Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>	2019-06-06 19:22:42 +03:00
Yossi Itigin	8535dd570b	Merge pull request #6732 from dmitrygladkov/topic/pml/ucx_init PML/UCX: Don't destroy UCP worker if it wasn't created	2019-06-06 10:41:33 +03:00
Dmitry Gladkov	c864ca51d2	PML/UCX: Don't destroy UCP worker if it wasn't created Signed-off-by: Dmitry Gladkov <dmitrygla@mellanox.com>	2019-06-03 10:49:36 +03:00
Tomislav Janjusic	6ea920e225	Coll/hcoll: adding scatterv interface Signed-off-by: Valentin Petrov valentinp@mellanox.com	2019-05-27 12:27:43 +03:00
Edgar Gabriel	8eda9f2ecd	common/ompio: fix Coverity warnings this commit fixes Coverity warnings CID 1445198 and CID 1445197 For a reason that is a bit unclear to me, Coverity only complained about the read files, but the write operations had the same issue, so I fixed that within the same commit as well. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-05-23 13:40:39 -05:00
Edgar Gabriel	27b2ec71a7	common/ompio: add support for read operations and collective I/O external32 data representation is now supported by ompio for everything but non-blocking collective I/O operations. The support can further be improved in a second step to limit the temporary buffer size (at least for blocking operations), but it does work now for many scenarios. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-05-20 17:56:16 -05:00
Edgar Gabriel	ab56e6f0db	common/ompio: make individual read operations work. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-05-20 17:22:33 -05:00
Edgar Gabriel	f6b3a0af52	common/ompio: individual write of external32 works both blocking and non-blocking. collective write and read operations not yet. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-05-20 16:26:14 -05:00
Edgar Gabriel	d955753cb8	common/ompio: abstraction for different convertor types introduce separate convertors for memory vs. file representation. Adjust the interfaces for decode_datatype to provide the convertor to be used for that. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-05-20 13:35:38 -05:00
Edgar Gabriel	35be18b266	common/ompio: rename ompio_cuda* to ompio_buffer* the infrastructure put in place to manage cuda buffers is actually a lot more generic than just for cuda buffers. Specifically, we can reuse much of the code to implement the external32 data representation. This commit converts the code from common_ompio_cuda* to common_ompio_buffer*. There are just very few places where we actually need to keep the OPAL_CUDA_SUPPORT ifdef in place. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-05-20 12:50:04 -05:00
Edgar Gabriel	a96efb7620	common/ompio: add comm_ompio_read_all/write_all functions in preparation for adding support for the external32 data representation. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-05-20 12:49:36 -05:00
Valentin Petrov	f19f6f432a	Coll/hcoll: don't init opal memhooks unless explicitly requested by user If user sets HCOLL_EXTERNAL_UCM_EVENTS=1 then we try init opal memory framework and register a mem release cb. Otherwise, rely on ucx. Signed-off-by: Valentin Petrov <valentinp@mellanox.com>	2019-05-20 11:17:44 +03:00
Yossi Itigin	9d1994b906	OSC/UCX: Fix deadlock with atomic lock Atomic lock must progress local worker while obtaining the remote lock, otherwise an active message which actually releases the lock might not be processed while polling on local memory location. Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2019-05-19 20:10:09 +03:00
Alex Anenkov	77d466edf3	coll/libnbc: add recursive doubling algorithm for MPI_Iallreduce Signed-off-by: Alex Anenkov <anenkov.ru@gmail.com>	2019-05-19 18:39:11 +07:00
Sergey Oblomov	a3578d9ece	PML/UCX: disable PML UCX if MT is requested but not supported - in case if multithreading requested but not supported disable PML UCX Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-05-17 11:25:23 +03:00
Jeff Squyres	9442989e2c	Merge pull request #6382 from jsquyres/pr/ofi-mtl-gitignore mtl/ofi: add a .gitignore	2019-05-13 12:00:41 -04:00
Joseph Schuchart	c67e229193	OSC rdma: make sure accumulating in shared memory is safe Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2019-05-10 17:27:10 +02:00
Nathan Hjelm	4345308dfd	osc/rdma: fix CAS 32-bit network atomic compatibility check When checking for btl compatibility with 32-bit CAS osc/rdma was checking the incorrect flag field. Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>	2019-05-10 07:27:53 -06:00
KAWASHIMA Takahiro	dabad084b5	Merge pull request #6621 from bosilca/topic/persistent_req_leak Fix the leak of fragments for persistent sends (issue #6565)	2019-05-03 15:21:42 +09:00
George Bosilca	a16cf0e4dd	Fix the leak of fragments for persistent sends. The rdma_frag attached to the send request was not correctly released upon request completion, leaking until MPI_Finalize. A quick solution would have been to add RDMA_FRAG_RETURN at different locations on the send request completion, but it would have unnecessarily made the sendreq completion path more complex. Instead, I added the length to the RDMA fragment so that it can be completed during the remote ack. Be more explicit on the comment. The rdma_frag can only be freed once when the peer forced a protocol change (from RDMA GET to send/recv). Otherwise the fragment will be returned once all data pertaining to it has been transferred. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2019-05-02 09:40:11 -04:00
Jeff Squyres	ac54d771ec	mtl/ofi: add a .gitignore Ignore generated files. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-05-01 14:00:00 -07:00
Yossi Itigin	5d2200a7d6	Merge pull request #6605 from brminich/topic/shmem_all2all_put SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h	2019-05-01 12:00:21 +03:00
bosilca	399b7133ab	Merge pull request #6556 from EmmanuelBRELLE/PR_fix_local_handle_in_PUT_message pml/ob1: fixed local handle sent during PUT control message	2019-04-27 13:51:22 -04:00
Mikhail Brinskii	2ef5bd8b36	SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h The new routine transfers the data asynchronously from the source PE to all PEs in the OpenSHMEM job. The routine returns immediately. The source and target buffers are reusable only after the completion of the routine. After the data is transferred to the target buffers, the counter object is updated atomically. The counter object can be read either using atomic operations such as shmem_atomic_fetch or can use point-to-point synchronization routines such as shmem_wait_until and shmem_test. Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>	2019-04-26 14:47:58 +03:00
Mark Allen	d85cac8f1a	fixing an unsafe usage of integer disps[] (romio321 gpfs) There are a couple MPI_Alltoallv calls in ad_gpfs_aggrs.c where the send/recv data comes from places like req[r].lens, and the send buffer and send displacements for example were being calculated as sbuf = pick one of the reqs: req[bottom].lens sdisps[r] = req[r].lens - req[bottom].lens which might be okay if the .lens was data inside of req[] so they'd all be close to each other. But each .lens field is just a pointer that's malloced, so those addresses can be all over the place, so the integer-sized sdisps[] isn't safe. I changed it to have a new extra array sbuf and rbuf for those two Alltoallv calls, and copied the data into the sbuf from the same locations it used to be setting up the sdisps[] at, and after the Alltoallv I copy the data out of the new rbuf into the same locations it used to be setting up the rdisps[] at. For what it's worth I was able to get this to fail -np 2 on a GPFS filesystem with hints romio_cb_write enable. I didn't whittle the test down to something small, but it was failing in an MPI_File_write_all call. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2019-04-23 16:01:55 -04:00
Jeff Squyres	9a9d106296	Merge pull request #6555 from EmmanuelBRELLE/PR-pmlob1_fix_rc_for_putfrag_when_get_failed pml/ob1: fixed exit from get_frag_fail when falling back on btl_put	2019-04-22 17:19:12 -04:00
Edgar Gabriel	d43427fc76	common/ompio: refactor the build_io_array function abstract out the io_array structure to be used in common_ompio_build_io_array function. This is preparation for a future component that would like to use the same function, but not modify the io_array stored on the file handle itself. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-04-17 14:42:33 -05:00
Valentin Petrov	30970bdfdf	OSC/UCX: correctly handle NULL origin addr and MPI_NO_OP Additional bugfix: origin_addr -> result_addr for no_op, replace_op and sum_op fetch destination. Signed-off-by: Valentin Petrov <valentinp@mellanox.com>	2019-04-17 10:30:21 +03:00
Brelle Emmanuel	e630046a4b	pml/ob1: fixed local handle sent during PUT control message In case of using a btl_put in ob1, the handle of the locally registered memory is sent with a PUT control message. In the current master code the sent handle is necessarily the handle in the frag but if the handle has been successfully registered in the request, the frag structure does not have any valid handle and all fragments use the request one. I suggest to check if the handle in the fragment is valid and if not to send the handle from the request. Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>	2019-04-01 18:45:05 +02:00
Brelle Emmanuel	9c689f2225	pml/ob1: fixed exit from get_frag_fail when falling back on btl_put In the case the btl_get fails Ob1 tries to fallback on btl_put first but the return code was ignored. So the code fell back on both btl_put and btl_send. Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>	2019-04-01 18:17:10 +02:00
George Bosilca	6ea0c4eab9	Prevent a segfault when accessing a rank outside a communicator. This is not fixing any issue, it is simply preventing a segfault if the communicator creation has not happened as expected. Thus, this code path should never really be hit in a correct MPI application with a valid communicator creation support. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2019-03-28 12:03:29 -04:00
Artem Polyakov	bfff5783f9	Merge pull request #6371 from artpol84/osc/select_dbg osc/base: Add debug output stating a selected component	2019-03-22 22:24:04 -07:00
Sergey Oblomov	d8e3562bae	PML/SPML/UCX: added evaluation of mmap events - there was a set of UCX related issues reported which were caused by mmap API hooks conflicts. We added diagnostics for such problems to simplify the bug-resolving pipeline Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-03-12 21:14:27 +02:00
Nathan Hjelm	73085e9ce3	Merge pull request #6413 from nuriallv/issue_osc_rdma osc/rdma: fix when determining the node with the rank_array info for a peer	2019-02-27 16:30:06 -07:00
bosilca	8400502d8a	Merge pull request #6353 from bosilca/topic/fix_monitoring_pvar Fix the PVAR allocation usage.	2019-02-25 16:03:56 -05:00
Nuria Losada	3cae149262	osc/rdma: fix when determining the node with the rank_array info for a peer Signed-off-by: Nuria Losada <nlosada@icl.utk.edu>	2019-02-20 13:12:00 -05:00

6950 Commits