openmpi

Автор	SHA1	Сообщение	Дата
Yossi Itigin	8535dd570b	Merge pull request #6732 from dmitrygladkov/topic/pml/ucx_init PML/UCX: Don't destroy UCP worker if it wasn't created	2019-06-06 10:41:33 +03:00
Dmitry Gladkov	c864ca51d2	PML/UCX: Don't destroy UCP worker if it wasn't created Signed-off-by: Dmitry Gladkov <dmitrygla@mellanox.com>	2019-06-03 10:49:36 +03:00
Sergey Oblomov	a3578d9ece	PML/UCX: disable PML UCX if MT is requested but not supported - in case if multithreading requested but not supported disable PML UCX Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-05-17 11:25:23 +03:00
George Bosilca	a16cf0e4dd	Fix the leak of fragments for persistent sends. The rdma_frag attached to the send request was not correctly released upon request completion, leaking until MPI_Finalize. A quick solution would have been to add RDMA_FRAG_RETURN at different locations on the send request completion, but it would have unnecessarily made the sendreq completion path more complex. Instead, I added the length to the RDMA fragment so that it can be completed during the remote ack. Be more explicit on the comment. The rdma_frag can only be freed once when the peer forced a protocol change (from RDMA GET to send/recv). Otherwise the fragment will be returned once all data pertaining to it has been trasnferred. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2019-05-02 09:40:11 -04:00
bosilca	399b7133ab	Merge pull request #6556 from EmmanuelBRELLE/PR_fix_local_handle_in_PUT_message pml/ob1: fixed local handle sent during PUT control message	2019-04-27 13:51:22 -04:00
Jeff Squyres	9a9d106296	Merge pull request #6555 from EmmanuelBRELLE/PR-pmlob1_fix_rc_for_putfrag_when_get_failed pml/ob1: fixed exit from get_frag_fail when falling back on btl_put	2019-04-22 17:19:12 -04:00
Brelle Emmanuel	e630046a4b	pml/ob1: fixed local handle sent during PUT control message In case of using a btl_put in ob1, the handle of the locally registered memory is sent with a PUT control message. In the current master code the sent handle is necessary the handle in the frag but if the handle has been successfully registered in the request, the frag structure does not have any valid handle and all fragments use the request one. I suggest to check if the handle in the fragment is valid and if not to send the handle from the request. Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>	2019-04-01 18:45:05 +02:00
Brelle Emmanuel	9c689f2225	pml/ob1: fixed exit from get_frag_fail when falling back on btl_put In the case the btl_get fails Ob1 tries to fallback on btl_put first but the return code was ignored. So the code fell back on both btl_put and btl_send. Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>	2019-04-01 18:17:10 +02:00
George Bosilca	6ea0c4eab9	Prevent a segfault when accessing a rank outside a communicator. This is not fixing any issue, it is simply preventing a sefault if the communicator creation has not happened as expected. Thus, this code path should never really be hit in a correct MPI application with a valid communicator creation support. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2019-03-28 12:03:29 -04:00
Sergey Oblomov	d8e3562bae	PML/SPML/UCX: added evaluation of mmap events - there was a set of UCX related issues reported which caused by mmap API hooks conflicts. We added diagnostic of such problems to simplify bug-resolving pipeline Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-03-12 21:14:27 +02:00
Mikhail Brinskii	751d88192d	PML/UCX: Use net worker address for remote peers For remote node peers pack smaller worker address, which contains network device addresses only. This would reduce amount of OOB traffic during startup. Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>	2019-02-14 18:06:36 +02:00
Thananon Patinyasakdikul	0263456cf4	pml/ob1: fix deadlock with communicator flag ALLOW_OVERTAKE. We missed an assert to check if ALLOW_OVERTAKE is set or not before validating the sequence number and this will cause deadlock. Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>	2019-01-29 14:55:06 -05:00
Yossi Itigin	83cca9d52a	ucx: add owner.txt for components Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-12-01 17:14:03 +02:00
Yossi Itigin	f36eeef4c5	pml_ucx: initialize req_mpi_object.comm for error handler without this fix, an error handler invoked on pml_ucx request would segfault while trying to dereference requests[i]->req_mpi_object.comm Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-11-25 19:37:54 +02:00
Yossi Itigin	4c442f2601	Merge pull request #5934 from hoopoepg/topic/suppressed-cov-warn-added-log-msg COMMON/UCX: suppressed coverity warnings	2018-10-22 11:00:47 +03:00
Sergey Oblomov	1099d5f023	COMMON/UCX: added error code to log output Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-10-21 11:37:25 +03:00
George Bosilca	dc972f0b92	Fix the PML monitoring. The monitoring PML hides it's existence from the OMPI infrastructure by removing itself from the list of PML loaded components, remaining hidden until MPI_Finalize. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-18 00:29:23 -04:00
George Bosilca	668aa15dda	Early selection of the best PML. With this patch the best PML is selected earlier, before finalizing the others PML. This provides a simpler mechanism to intercept and highjack the PML (as done in the monitoring PML) Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-18 00:29:23 -04:00
Howard Pritchard	7d6774acf8	remove the bfo pml Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-17 06:50:11 -06:00
Jeff Squyres	54ca3310ea	ompi: cleanup various string operations Several fixes to string handling: 1. strncpy() -> opal_string_copy() (because opal_string_copy() guarantees to NULL-terminate, and strncpy() does not) 2. Simplify a few places, such as: * Since opal_string_copy() guarantees to NULL terminate, eliminate some memsets(), etc. * Use opal_asprintf() to eliminate multi-step string creation There's more work that could be done; e.g., this commit doesn't attempt to clean up any strcpy() usage. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:10:20 -07:00
Yossi Itigin	b71e85b8d5	pml_ucx: fix return code from mca_pml_ucx_init() error flow Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-10-11 18:48:54 +03:00
Yossi Itigin	40ac9e4771	pml_ucx: fix return code from mca_pml_ucx_init() Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-10-10 14:41:05 +03:00
Yossi Itigin	4763822a64	pml_ucx: add ompi datatype attribute to release ucp_datatype Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-10-09 17:34:34 +03:00
Brian Barrett	e9e4d2a4bc	Handle asprintf errors with opal_asprintf wrapper The Open MPI code base assumed that asprintf always behaved like the FreeBSD variant, where ptr is set to NULL on error. However, the C standard (and Linux) only guarantee that the return code will be -1 on error and leave ptr undefined. Rather than fix all the usage in the code, we use opal_asprintf() wrapper instead, which guarantees the BSD-like behavior of ptr always being set to NULL. In addition to being correct, this will fix many, many warnings in the Open MPI code base. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-08 16:43:53 -07:00
Nathan Hjelm	000f9eed4d	opal: add types for atomic variables This commit updates the entire codebase to use specific opal types for all atomic variables. This is a change from the prior atomic support which required the use of the volatile keyword. This is the first step towards implementing support for C11 atomics as that interface requires the use of types declared with the _Atomic keyword. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-14 10:48:55 -06:00
Gilles Gouaillardet	fed33c1530	pml/ob1: plug a memory leak in mca_pml_ob1_component_fini() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-08-30 10:07:17 +09:00
Yossi Itigin	68206a5635	Merge pull request #5569 from hoopoepg/topic/optimize-blocked-calls PML/UCX: blocked calls optimizations	2018-08-29 14:19:09 +03:00
Yossi Itigin	4bb6845888	Merge pull request #5570 from hoopoepg/topic/opal-mem-hooks-syno MCA/COMMON/UCX: added synonym to opal_mem_hook variable	2018-08-29 14:16:33 +03:00
Sergey Oblomov	c201c0abb3	PML/UCX: blocked calls optimizations: removed reset progress count Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-27 09:50:39 +03:00
Sergey Oblomov	2cd9e04166	PML/UCX: optimization of mprobe call - renamed vars - renamed of internal variable names - used unsigned datatypes Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-27 09:50:39 +03:00
Sergey Oblomov	38e908f83e	PML/UCX: optimization of mprobe call - refactoring of opal/UCX progress calls Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-27 09:50:38 +03:00
Sergey Oblomov	b0f87f2235	PML/UCX: blocked calls optimizations - added UCX progress priority Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-27 09:50:38 +03:00
Jeff Squyres	fe0852bcb4	Miscellaneous compiler warning stomps. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-24 07:39:14 -07:00
Nathan Hjelm	1c84f48640	config: remove OPAL_ENABLE_MULTI_THREADS config macro We long ago hard-coded this value to 1. This commit cleans it out entirely. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-23 13:47:02 -06:00
Sergey Oblomov	e00f7a68ba	MCA/COMMON/UCX: added synonim to opal_mem_hook variable - added synonim to opal_mem_hook variable to allow to print it in opal_info -a Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-21 15:05:12 +03:00
Nathan Hjelm	c294bbc352	Merge pull request #5508 from hjelmn/fuzzy_match Bring fuzzy matching support into master	2018-08-06 13:52:04 -06:00
Matthew Dosanjh	c8d13486cc	Fixed promotion bug Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-06 12:56:36 -06:00
Boris Karasev	57683366ca	pmix: added check for pmix fence status Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-08-06 15:01:57 +06:00
Nathan Hjelm	dd74c6252f	pml/ob1: custom matching cleanup and configury This commit updates the new custom matching code in pml/ob1 so it can not be enabled with a configure option. This commit also renames the fuzzy-matching headers to avoid potential name conflicts and removes the use of C reserved identifiers. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-02 13:06:19 -06:00
Matthew Dosanjh	572694b621	Adding custom match source. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-02 12:23:08 -06:00
Sergey Oblomov	d204b8a678	PML/SPML/UCX/COMPONENT: applied C99 initialization Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-28 09:44:03 +03:00
Sergey Oblomov	2806504290	PML/SPML/UCX: init global objects using C99 style - to avoid value mix used C99 style of object initializations Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-25 14:52:45 +03:00
Sergey Oblomov	6fe0a73861	PML/UCX: fixed ucp request free on persistent request completion - in sine cases persistent request was deleted during completion callback, this cause double free of linked UCX request (assert in debug build or hang in release build) - UCX request is freed prior completion calback Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-20 19:32:20 +03:00
Yossi Itigin	29812494f2	Merge pull request #5402 from hoopoepg/topic/common-del-procs MCA/COMMON/UCX: del_procs calls are unified to common module	2018-07-19 11:19:45 +03:00
Sergey Oblomov	920cc2e0d9	MCA/COMMON/UCX: del_procs calls are unified to common module Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-18 07:37:25 +03:00
Sergey Oblomov	1c7ae22dfb	MCA/COMMON/UCX: shift opal memhooks into common UCX Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-17 13:46:38 +03:00
KAWASHIMA Takahiro	0021616984	pml/ob1: Fix data corruption of MPI_BSEND Data transferred by `MPI_BSEND` may corrupt if all of the following conditions are met. - The message size is less than the eager limit. - The `btl_alloc` function in the BTL interface returns `NULL` for some reason. - The MPI program overwrites the send buffer after `MPI_BSEND` returns. The problem is in the way of pending a send request in ob1 PML. The `mca_pml_ob1_send_request_start_copy` function retruns `OMPI_ERR_OUT_OF_RESOURCE` if `mca_bml_base_alloc` function returns `des = NULL`. In this case, the send request is added to the `send_pending` list and `MPI_BSEND` returns immediately. Next time the `mca_pml_ob1_send_request_start_copy` function tries sending, the user buffer may have been overwritten by the MPI program. Call hierarchy of `MPI_BSEND`: ``` MPI_Bsend mca_pml_ob1_send if (MCA_PML_BASE_SEND_BUFFERED == sendmode) mca_pml_ob1_isend MCA_PML_OB1_SEND_REQUEST_START_W_SEQ mca_pml_ob1_send_request_start_seq mca_pml_ob1_send_request_start_btl if (size <= eager_limit) if (req_send_mode == MCA_PML_BASE_SEND_BUFFERED) mca_pml_ob1_send_request_start_copy mca_bml_base_alloc btl_alloc if (OMPI_ERR_OUT_OF_RESOURCE == rc) add_request_to_send_pending ompi_request_free ``` To solve this problem, we should save the data to the buffer attached by `MPI_BUFFER_ATTACH` before leaving `MPI_BSEND`. This problem was introduced by ob1 optimization (commits `2b57f422` and `a06e491c`) in v1.8 series. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2018-07-12 14:30:58 +09:00
Sergey Oblomov	240670152e	MCA/COMMON/UCX: code beautify - alignment Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-06 19:40:58 +03:00
Sergey Oblomov	bef47b792c	MCA/COMMON/UCX: unified logging across all UCX modules - added common logging infrastructure for all UCX modules - all UCX modules are switched to new infra Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-05 16:25:39 +03:00
Sergey Oblomov	8080283b3d	MCA/COMMON/UCX: changed return type for wait_request - for now wait_request returns OMPI status - updated callers Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-04 23:29:38 +03:00

1 2 3 4 5 ...

1393 Коммитов