openmpi

Author	SHA1	Message	Date
Sergey Oblomov	6fe0a73861	PML/UCX: fixed ucp request free on persistent request completion - in some cases persistent request was deleted during completion callback, this caused double free of linked UCX request (assert in debug build or hang in release build) - UCX request is freed prior to completion callback Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-20 19:32:20 +03:00
Yossi Itigin	29812494f2	Merge pull request #5402 from hoopoepg/topic/common-del-procs MCA/COMMON/UCX: del_procs calls are unified to common module	2018-07-19 11:19:45 +03:00
Sergey Oblomov	920cc2e0d9	MCA/COMMON/UCX: del_procs calls are unified to common module Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-18 07:37:25 +03:00
Sergey Oblomov	1c7ae22dfb	MCA/COMMON/UCX: shift opal memhooks into common UCX Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-17 13:46:38 +03:00
KAWASHIMA Takahiro	0021616984	pml/ob1: Fix data corruption of MPI_BSEND Data transferred by `MPI_BSEND` may be corrupted if all of the following conditions are met. - The message size is less than the eager limit. - The `btl_alloc` function in the BTL interface returns `NULL` for some reason. - The MPI program overwrites the send buffer after `MPI_BSEND` returns. The problem is in the way of pending a send request in ob1 PML. The `mca_pml_ob1_send_request_start_copy` function returns `OMPI_ERR_OUT_OF_RESOURCE` if `mca_bml_base_alloc` function returns `des = NULL`. In this case, the send request is added to the `send_pending` list and `MPI_BSEND` returns immediately. Next time the `mca_pml_ob1_send_request_start_copy` function tries sending, the user buffer may have been overwritten by the MPI program. Call hierarchy of `MPI_BSEND`: ``` MPI_Bsend mca_pml_ob1_send if (MCA_PML_BASE_SEND_BUFFERED == sendmode) mca_pml_ob1_isend MCA_PML_OB1_SEND_REQUEST_START_W_SEQ mca_pml_ob1_send_request_start_seq mca_pml_ob1_send_request_start_btl if (size <= eager_limit) if (req_send_mode == MCA_PML_BASE_SEND_BUFFERED) mca_pml_ob1_send_request_start_copy mca_bml_base_alloc btl_alloc if (OMPI_ERR_OUT_OF_RESOURCE == rc) add_request_to_send_pending ompi_request_free ``` To solve this problem, we should save the data to the buffer attached by `MPI_BUFFER_ATTACH` before leaving `MPI_BSEND`. This problem was introduced by ob1 optimization (commits `2b57f422` and `a06e491c`) in v1.8 series. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2018-07-12 14:30:58 +09:00
Sergey Oblomov	240670152e	MCA/COMMON/UCX: code beautify - alignment Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-06 19:40:58 +03:00
Sergey Oblomov	bef47b792c	MCA/COMMON/UCX: unified logging across all UCX modules - added common logging infrastructure for all UCX modules - all UCX modules are switched to new infra Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-05 16:25:39 +03:00
Sergey Oblomov	8080283b3d	MCA/COMMON/UCX: changed return type for wait_request - for now wait_request returns OMPI status - updated callers Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-04 23:29:38 +03:00
Sergey Oblomov	c2bd6af9f2	MCA/COMMON/UCX: minor unification of del_procs calls - some common functionality of del_procs calls is moved into mca_common module - blocking ucp_put call is replaced by non-blocking routine Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-02 15:10:53 +03:00
Sergey Oblomov	074f30ba27	PML/UCX: suppressed compilation warning Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-27 12:05:07 +03:00
Sergey Oblomov	502d04bf12	UCX/PML/SPML: fixed a few coverity issues - fixed incorrect pointer manipulation/free - cleaned dead code - minor optimization on process delete routine - fixed error handling - free pointers - added debug output for worker flush failure Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-26 18:52:39 +03:00
Yossi Itigin	ee873f4f79	Merge pull request #5322 from hoopoepg/topic/mca-ucx-common MCA/UCX: added common module	2018-06-26 13:54:12 +03:00
Sergey Oblomov	d57ae62dee	MCA/UCX: added common module - implemented non-blocking routines for flush operations Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-22 16:41:09 +03:00
Gilles Gouaillardet	edd02b7144	pml/ucx: silence a warning declare 'fenced' volatile in order to silence CID 1437465 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-06-22 13:11:42 +09:00
Sergey Oblomov	5f03628560	PML/UCX: removed unneeded flush Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-21 12:40:46 +03:00
Sergey Oblomov	2745da7dcc	PML/UCX: use non-blocking fence instead of async progress Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-21 09:46:03 +03:00
Sergey Oblomov	10f2d831ec	PML/UCX: fixed hang on MPI_Finalize - added async UCX progress thread to allow pending requests to complete Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-20 16:12:05 +03:00
Yossi Itigin	564f80d362	pml_ucx: add option to use opal memhooks instead of ucx internal hooks Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-06-17 15:30:44 +03:00
Thananon Patinyasakdikul	390d72addd	Merge pull request #4885 from davideberius/spc_pr Initial Software-based Performance Counters PR	2018-06-12 14:04:49 -07:00
David Eberius	d377a6b6f4	Added Software-based Performance Counters driver code along with several counters. This code is the implementation of Software-based Performance Counters as described in the paper 'Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI. All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations, which were a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example. Signed-off-by: David Eberius <deberius@vols.utk.edu>	2018-06-11 22:48:16 -04:00
Yossi Itigin	fd12540751	Merge pull request #5227 from hoopoepg/topic/pml-ucx-hang-on-finalize PML/UCX: fixed hang on MPI_Finalize	2018-06-08 13:19:49 +03:00
Sergey Oblomov	0a8261f3b0	PML/UCX: fixed hang on MPI_Finalize fixes issue https://github.com/openucx/ucx/issues/2656 added flush for worker object to complete all pending operations Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-05 17:22:03 +03:00
Jeff Squyres	35438ae9b5	mpi/finalized: revamp INITIALIZED/FINALIZED Per MPI-3.1:8.7.1 p361:11-13, it's valid for MPI_FINALIZED to be invoked during an attribute destruction callback (e.g., during the destruction of keyvals on MPI_COMM_SELF during the very beginning of MPI_FINALIZE). In such cases, MPI_FINALIZED must return "false". Prior to this commit, we hung in FINALIZED if it were invoked during a COMM_SELF attribute destruction callback in FINALIZE. See https://github.com/open-mpi/ompi/issues/5084. This commit converts the MPI_INITIALIZED / MPI_FINALIZED infrastructure to use a single enum (ompi_mpi_state, set atomically) to represent the state of MPI: - not initialized - init started - init completed - finalize started - finalize past COMM_SELF destruction - finalize completed The "finalize past COMM_SELF destruction" state is what allows us to return "false" from MPI_FINALIZED before COMM_SELF has been fully destroyed / all attribute callbacks have been invoked. Since this state is checked at nearly every MPI API call (to see if we're outside of the INIT/FINALIZE epoch), care was taken to use atomics to set the ompi_mpi_state value in ompi_mpi_init() and ompi_mpi_finalize(), but performance-critical code paths can simply read the variable without needing to use a slow call to an opal_atomic_*() function. Thanks to @AndrewGaspar for reporting the issue. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-01 13:36:29 -07:00
Sergey Oblomov	5ec26914a6	PML/UCX: do not set offset on ordered data recv Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-05-21 19:40:07 +03:00
Sergey Oblomov	19607daa32	PML/UCX: create convertor clone instead of stack reset Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-05-17 16:39:13 +03:00
Sergey Oblomov	7c5de01c57	PML/UCX: reset converter stack on unordered messages Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-05-17 13:11:02 +03:00
bosilca	2ab628b92e	Merge pull request #5074 from bosilca/topic/remove_warnings Remove warnings identified by clang.	2018-05-15 11:15:23 -04:00
Yossi Itigin	385f38ab4e	ucx: improve error messages during connection establishment Also, unite common code calling ucp_ep_create() Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-04-30 15:45:05 +03:00
George Bosilca	6ff11267fb	Remove warnings identified by clang. Plus minor spacing and indentation issues. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-04-14 17:14:12 -04:00
Jeff Squyres	c3adcb05eb	Miscellaneous compiler warnings fixes Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-23 11:45:30 -07:00
Thananon Patinyasakdikul	09cba8b30b	pml/ob1: fixed out of sequence bug. This commit fixes #4795 - Fixed typo that sometimes causes deadlock in change of protocol. - Redesigned out of sequence ordering and address the overflow case of sequence number from uint16_t. Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>	2018-02-27 13:49:40 -05:00
Jeff Squyres	e7f91f8068	Merge pull request #4527 from clementFoyer/osc-no-includes Remove inter-dependencies between OSC modules.	2018-02-09 15:49:56 -05:00
Nathan Hjelm	da9f833f4a	pml/ob1: ignore the eager limit of RDMA-only btls This commit fixes a flaw in the eager limit check in pml/ob1. The check was incorrectly checking if RDMA-only BTLs (BTLs without the send flag) has a valid eager limit. This commit fixes the check by adding an additional check for the send flag on the BTL module. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-02-07 12:42:44 -07:00
Clement Foyer	f5b4fc05f8	Remove inter-dependencies between OSC modules. The osc monitoring component needed to include other OSC components' headers in order to be able to access the communicator through the component-specific ompi_osc__module_t structures. This commit removes the dependency and resolves issue #4523. Extend the common monitoring API. Now it's possible to translate from local rank to world rank from both the communicator and the group. * Remove useless hashtable as we directly use the w_group contained in window structure. Add automatic generation at config time. The templates are expanded at configure time. It creates a new header file that generates all the variables/functions needed. Adding this during the autogen automagically generates for each of the available modules the proper functions. Only keep a generated argv-style array. Following Jeff's advice, the configure.m4 file generates a simple array of module variables to be iterated over to find the proper module. Signed-off-by: Clement Foyer <clement.foyer@inria.fr>	2018-02-07 11:52:00 +00:00
Sergey Oblomov	7a5811d0a8	request/state: update state for canceled request - fixed issue in set state for canceled request Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-01-29 18:26:20 +02:00
Yossi Itigin	f2851fd502	Merge pull request #4724 from alex-mikheev/topic/ucx_as_default ompi/oshmem: ucx is selected over yalla/ikrit by default	2018-01-17 17:41:49 +02:00
Alex Mikheev	640e945b9c	ompi: pml/ucx: blocking send using ucp_tag_send_nbr Signed-off-by: Alex Mikheev <alexm@mellanox.com>	2018-01-17 15:54:18 +02:00
Alex Mikheev	ae326546f4	ompi/oshmem: ucx is selected over yalla/ikrit by default Signed-off-by: Alex Mikheev <alexm@mellanox.com>	2018-01-17 15:08:04 +02:00
Alex Mikheev	e7bf0617cf	ompi: pml ucx: improve recv latency Signed-off-by: Alex Mikheev <alexm@mellanox.com>	2017-12-26 16:24:16 +02:00
Howard Pritchard	2233e44848	Merge pull request #4534 from hppritcha/topic/fix_a_segv_in_request pml/cm: check for request comp. before completing bsend	2017-12-04 09:09:41 -07:00
Nathan Hjelm	1282e98a01	opal/asm: rename existing arithmetic atomic functions This commit renames the arithmetic atomic operations in opal to indicate that they return the new value not the old value. This naming differentiates these routines from new functions that return the old value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-11-30 10:41:22 -07:00
Nathan Hjelm	647b40f3f2	Merge pull request #4442 from bosilca/topic/ob1_pvar Topic/ob1 pvar	2017-11-29 09:31:07 -07:00
Howard Pritchard	3285325884	pml/cm: check for request comp. before completing bsend. Turns out there are edge cases where an MTL's isend method may end up marking a send request complete prior to returning to the CM code. This would end up causing problems in the bsend path since the ompi_request_complete would end up getting invoked a second time on this request. This ended up causing segfaults, etc. in ompi_request_complete. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-11-27 12:27:43 -07:00
bosilca	3f43e2c8df	Merge pull request #4419 from thananon/pr_newob1 pml/ob1: match callback will now queue wrong sequence frag and return.	2017-11-10 20:10:53 -05:00
George Bosilca	409638bdf4	Keep the out-of-sequence fragment ordered. Rework the logic to handle the out-of-sequence fragments on the receiver side. A large number of OOS messages are still arriving even in single threaded scenarios. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-11-08 14:27:13 -05:00
Thomas Naughton	86bb6f8bac	Merge pull request #4444 from naughtont3/tjn-fix-plm-monitoring-configury configury: single quote to avoid trouble with BSD	2017-11-08 10:56:37 -05:00
Thomas Naughton	c5dc41ee1a	configury: single quote to avoid trouble with BSD Signed-off-by: Thomas Naughton <naughtont@ornl.gov>	2017-11-03 11:34:28 -04:00
George Bosilca	d261282029	OB1 pvars should be linked with the OB1 component. If not the pvars will remain valid after the OB1 PML is unloaded, and any access will segfault (the callbacks associated with the pvar will point to the memory of the dlclosed module). Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-11-03 01:18:26 -04:00
bosilca	3e1f771045	Merge pull request #4403 from naughtont3/tjn-fix-plm-monitoring-configury fix PML monitoring configury to compile DSOs	2017-11-02 19:08:20 -04:00
George Bosilca	b2b3da3046	Do not access the frag after returning it. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-10-31 16:39:23 -04:00

1351 commits