INTERNAL: STL-59403
The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.
This patch adds the missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.
(cherry picked from commit 3aca4af548a3d781b6b52f89f4d6c7e66d379609)
Change-Id: Ie50445e5edfb0f30916de0836db0edc64ecf7c60
Signed-off-by: Michael Heinz <michael.william.heinz@intel.com>
Reviewed-by: Adam Goldman <adam.goldman@intel.com>
Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>
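For illustration, a minimal sketch of the kind of length check described
above, assuming a module structure that caches the provider limit from
fi_info (info->ep_attr->max_msg_size); the example_* names are hypothetical,
not the actual OMPI symbols:

```c
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_tagged.h>

/* Hypothetical module type: max_msg_size is cached from the fi_info
 * returned by fi_getinfo() (info->ep_attr->max_msg_size). */
struct example_ofi_module {
    struct fid_ep *ep;
    size_t         max_msg_size;
};

/* Guarded send: refuse payloads larger than the provider's advertised limit. */
static ssize_t
example_ofi_send(struct example_ofi_module *mod, const void *buf,
                 size_t len, fi_addr_t dest, uint64_t tag)
{
    if (len > mod->max_msg_size) {
        return -FI_EMSGSIZE;   /* exceeds the limit reported in fi_info */
    }
    return fi_tsend(mod->ep, buf, len, NULL, dest, tag, NULL);
}
```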
Since the progress entries array is globally allocated, it is susceptible
to race conditions in multi-threaded applications. Allocating it on the
stack resolves any potential races, since stack storage is thread-local
by default.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit ed2343034d09b33eb44a0a727bef97a108edc8aa)
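A minimal sketch of the stack-local approach, assuming a tagged completion
queue; the names and batch size are illustrative, not the actual OMPI code:

```c
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

#define EXAMPLE_CQ_BATCH 16   /* illustrative batch size */

/* Each caller drains the CQ into its own stack-local array, so no global
 * buffer is shared between threads.  Error handling is omitted here. */
static int
example_ofi_progress(struct fid_cq *cq)
{
    struct fi_cq_tagged_entry events[EXAMPLE_CQ_BATCH];  /* thread-local: lives on the stack */
    int completed = 0;
    ssize_t n;

    while ((n = fi_cq_read(cq, events, EXAMPLE_CQ_BATCH)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            /* dispatch events[i].op_context to its completion handler */
        }
        completed += (int) n;
    }
    return completed;
}
```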
- Updated the blocking send to invoke the send functionality directly and
to initialize the expected completion-event count to 0. This enables an
optimization for providers that support fi_tinject up to larger sizes,
and reduces latency for small messages in the OFI MTL without requiring
calls to progress, since fi_tinject must complete the message before
returning and creates no events in the completion queue.
- Updated the non-blocking send to call fi_tsend directly and avoid
fi_tinject, since a non-blocking send should not wait on completions.
This resolves a bug where applications calling MPI_Isend could overrun
the TX buffer with small (inject) messages and deadlock. It also
improves message rates by never waiting for a message of any size to
complete in the non-blocking send path (see the sketch below).
- Created a common ompi_mtl_ofi_ssend_recv function to post the ssend
recv shared by the isend and send code paths.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 7dc8c8ba3fa630df8c5c7ab36fcf25249a82bfe7)
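A sketch of the blocking/non-blocking split described above, under the
assumption that small blocking sends can use fi_tinject (which completes
before returning and produces no CQ entry) while non-blocking sends always
use fi_tsend; the names and inject threshold are illustrative:

```c
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Blocking-send sketch: small messages go through fi_tinject, which
 * completes before returning and generates no CQ entry, so no progress
 * calls are needed afterwards. */
static ssize_t
example_blocking_send(struct fid_ep *ep, const void *buf, size_t len,
                      size_t inject_limit, fi_addr_t dest, uint64_t tag,
                      void *context)
{
    if (len <= inject_limit) {
        return fi_tinject(ep, buf, len, dest, tag);
    }
    /* Larger payloads are posted with fi_tsend; the caller then waits
     * for the completion event. */
    return fi_tsend(ep, buf, len, NULL, dest, tag, context);
}

/* Non-blocking sketch: always fi_tsend, never fi_tinject, so the call
 * cannot stall waiting for injected data to drain from the TX buffer. */
static ssize_t
example_nonblocking_send(struct fid_ep *ep, const void *buf, size_t len,
                         fi_addr_t dest, uint64_t tag, void *context)
{
    return fi_tsend(ep, buf, len, NULL, dest, tag, context);
}
```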
- If a message for a recv that is being cancelled completes after the
call to fi_cancel, the OFI MTL will deadlock waiting for
ofi_req->super.ompi_req->req_status._cancelled to be set, which will
never happen since the recv finished successfully.
- To resolve this issue, the OFI MTL now checks ofi_req->req_started
inside the loop that waits for the cancel event, to see whether the
request has already started. If the request is completing, the loop is
broken and the cancel routine exits, setting
ofi_req->super.ompi_req->req_status._cancelled = false
(see the sketch below).
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 767135c580f75d3dde9cb9c88601dd18afda949a)
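A sketch of the cancel wait loop with the req_started check, using
hypothetical request fields and a caller-supplied progress hook rather than
the real OMPI structures:

```c
#include <stdbool.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical request type: req_started flips to true once the matching
 * message has begun completing, at which point the cancel can no longer win. */
struct example_ofi_request {
    volatile bool req_started;
    volatile bool cancel_event_seen;
    bool          status_cancelled;
};

/* After fi_cancel(), spin on progress until either the cancel event arrives
 * or the request is seen to have started completing.  Without the
 * req_started check this loop can wait forever when the message completes
 * just after the cancel was issued. */
static void
example_wait_for_cancel(struct fid_ep *ep, struct example_ofi_request *req,
                        int (*progress)(void))
{
    fi_cancel(&ep->fid, req);

    while (!req->cancel_event_seen) {
        if (req->req_started) {
            req->status_cancelled = false;   /* recv completed normally */
            return;
        }
        progress();
    }
    req->status_cancelled = true;
}
```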
- Added support in MTL_OFI_RETRY_UNTIL_DONE for handling -FI_EAGAIN from
the provider by progressing the OFI completion queue via
ompi_mtl_ofi_progress.
- If pending events were blocking OFI operations from being enqueued,
they are completed and the OFI operation is retried once
ompi_mtl_ofi_progress has successfully completed.
- Updated MTL_OFI_RETRY_UNTIL_DONE to take a RETURN variable instead of
requiring the existence of a "ret" variable to pass back the return
value from completing the OFI operation (see the sketch below).
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit d4f408a7f867b2f7bab84b9c966e1eba59f59e0e)
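A sketch of a retry wrapper in this spirit; the macro name, arguments, and
progress hook are illustrative, not the actual MTL_OFI_RETRY_UNTIL_DONE
definition:

```c
#include <rdma/fi_errno.h>

/* The caller passes both the operation and the variable that receives its
 * return value; -FI_EAGAIN triggers a progress call before retrying. */
#define EXAMPLE_OFI_RETRY_UNTIL_DONE(OP, RETURN, PROGRESS_FN)   \
    do {                                                        \
        (RETURN) = (OP);                                        \
        if ((RETURN) != -FI_EAGAIN) {                           \
            break;                                              \
        }                                                       \
        /* Drain pending completions, then try the op again. */ \
        (PROGRESS_FN)();                                        \
    } while (1)

/* Usage sketch:
 *   ssize_t rc;
 *   EXAMPLE_OFI_RETRY_UNTIL_DONE(
 *       fi_tsend(ep, buf, len, NULL, dest, tag, ctx), rc, my_progress);
 */
```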
- Updated the design for sync send MPI calls to use 2 protocol bits
denoting "sync_send" or "sync_send_ack" (see the sketch below).
- "Sync_send" is set in the send tag only and is masked out during
receive matching, so the original Recv posted in the send/recv operation
still matches and can read the bit.
- "Sync_send_ack" is sent from the recv callback to the send side. This
0-byte send does not generate a completion entry; the message is sent
and the opal completion in the recv is completed immediately.
- Tag formats ofi_tag_1 and ofi_tag_2 have been updated to include 2
more tag bits per format type, due to the reduced number of protocol
bits required by OMPI.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
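A sketch of the two protocol bits and their masking; the bit positions and
helper names are purely illustrative, as the real ofi_tag_1/ofi_tag_2
layouts differ:

```c
#include <stdint.h>

/* Illustrative tag layout: two high protocol bits, positions chosen
 * here only for the sketch. */
#define EX_PROTO_SYNC_SEND      (1ULL << 63)
#define EX_PROTO_SYNC_SEND_ACK  (1ULL << 62)
#define EX_PROTO_MASK           (EX_PROTO_SYNC_SEND | EX_PROTO_SYNC_SEND_ACK)

/* Sender side: mark a synchronous send in the tag itself. */
static inline uint64_t ex_make_ssend_tag(uint64_t match_tag)
{
    return match_tag | EX_PROTO_SYNC_SEND;
}

/* Receiver side: ignore the sync bit when matching, so the ordinary
 * posted receive still matches a sync send and can inspect the bit
 * from the completed tag. */
static inline uint64_t ex_make_ignore_bits(uint64_t ignore)
{
    return ignore | EX_PROTO_SYNC_SEND;
}

/* Ack tag sent back by the receive callback as a zero-byte message. */
static inline uint64_t ex_make_ack_tag(uint64_t match_tag)
{
    return match_tag | EX_PROTO_SYNC_SEND_ACK;
}
```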
Extend the number of supported ranks for providers that support
FI_REMOTE_CQ_DATA. Add a README file to the OFI MTL.
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
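A sketch of the idea: when the provider advertises FI_REMOTE_CQ_DATA, the
source rank can ride in the immediate CQ data (fi_tsenddata) instead of
being packed into the tag, which frees tag bits for more ranks; the names
are illustrative, not the actual OMPI code paths:

```c
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Choose the send call based on the provider's capabilities. */
static ssize_t
example_send_with_source(struct fi_info *prov, struct fid_ep *ep,
                         const void *buf, size_t len, uint32_t my_rank,
                         fi_addr_t dest, uint64_t tag, void *context)
{
    if (prov->caps & FI_REMOTE_CQ_DATA) {
        /* Source rank travels in the remote CQ data field. */
        return fi_tsenddata(ep, buf, len, NULL, my_rank, dest, tag, context);
    }
    /* Fallback: the tag itself must still encode the source rank. */
    return fi_tsend(ep, buf, len, NULL, dest, tag, context);
}
```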
- Updated ompi_mtl_ofi_progress to read CQ events into an array, up to a
threshold that can be set by the Open MPI user (see the sketch below).
- Users can adjust the number of events handled per call to
ompi_mtl_ofi_progress by setting "--mca mtl_ofi_progress_event_cnt #".
- The default number of CQ events read in a single call to OFI progress
is 100, an average based on workload use-case analysis showing 70-128 as
the typical range of events returned during OFI progress.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
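A sketch of a batched progress call bounded by a user-tunable event count
(the mtl_ofi_progress_event_cnt parameter above, default 100); the buffer
handling and names are illustrative:

```c
#include <stdlib.h>
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Read up to event_cnt completions in one fi_cq_read() call. */
static int
example_ofi_progress_batch(struct fid_cq *cq, size_t event_cnt)
{
    struct fi_cq_tagged_entry *events = malloc(event_cnt * sizeof(*events));
    int handled = 0;

    if (NULL == events) {
        return 0;
    }

    ssize_t n = fi_cq_read(cq, events, event_cnt);
    if (n > 0) {
        for (ssize_t i = 0; i < n; i++) {
            /* dispatch events[i].op_context to its completion handler */
        }
        handled = (int) n;
    }

    free(events);
    return handled;
}
```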
This fixes a regression with the sockets provider, which can return
-EINTR from fi_cq_read() when a syscall is interrupted. That error value
is currently interpreted as a fatal condition. Relax the rule so that the
fi_cq_read() operation can be retried.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
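A sketch of the relaxed handling, treating -FI_EINTR from fi_cq_read() as
retryable rather than fatal; the names are illustrative:

```c
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>

/* An interrupted syscall surfaces as -FI_EINTR from fi_cq_read();
 * treat it like "nothing read yet" and retry instead of aborting. */
static ssize_t
example_cq_read_retry(struct fid_cq *cq,
                      struct fi_cq_tagged_entry *events, size_t count)
{
    ssize_t n;
    do {
        n = fi_cq_read(cq, events, count);
    } while (-FI_EINTR == n);
    return n;   /* > 0, -FI_EAGAIN (empty), or an error such as -FI_EAVAIL */
}
```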
At least with some providers (sockets and GNI), the mprobe/mrecv
OFI MTL methods were incorrect. For these two providers, at least, one
must supply the original tag and mask bits used with the prior
FI_PEEK | FI_CLAIM request that probed for the message.
These providers take a strict interpretation of the following sentence
from the libfabric fi_tagged man page:
```
Claimed messages can only be retrieved using a subsequent, paired receive operation with the FI_CLAIM flag set.
```
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
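A sketch of a claimed receive that reuses the tag and ignore bits saved from
the earlier FI_PEEK | FI_CLAIM probe; the saved_* parameters and names are
illustrative, not the actual mrecv implementation:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Retrieve a message previously probed with FI_PEEK | FI_CLAIM.  Per the
 * note above, some providers require this paired receive to carry the same
 * tag/ignore bits that were used when the message was claimed. */
static ssize_t
example_claimed_recv(struct fid_ep *ep, void *buf, size_t len,
                     fi_addr_t src, uint64_t saved_tag,
                     uint64_t saved_ignore, void *claim_context)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct fi_msg_tagged msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov   = &iov;
    msg.iov_count = 1;
    msg.addr      = src;
    msg.tag       = saved_tag;      /* same tag as the probe */
    msg.ignore    = saved_ignore;   /* same mask as the probe */
    msg.context   = claim_context;  /* context carrying the claim */

    return fi_trecvmsg(ep, &msg, FI_CLAIM);
}
```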
Currently, the progress function incorrectly interprets any return value
other than a positive count or -FI_EAVAIL to mean the CQ is empty. The CQ
is empty only if the fi_cq_read() call returned the -EAGAIN error code.
Fix that here.
While at it, fix the help text output for calls made to the OFI API.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
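A sketch of the corrected return-code handling for fi_cq_read(); the names
are illustrative:

```c
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>

/* Only -FI_EAGAIN means "CQ is empty"; -FI_EAVAIL means an error entry is
 * waiting in the error queue; any other negative value is a failure. */
static int
example_check_cq(struct fid_cq *cq, struct fi_cq_tagged_entry *ev, size_t cnt)
{
    ssize_t n = fi_cq_read(cq, ev, cnt);

    if (n > 0) {
        return (int) n;                 /* completions to dispatch */
    }
    if (-FI_EAGAIN == n) {
        return 0;                       /* queue is genuinely empty */
    }
    if (-FI_EAVAIL == n) {
        struct fi_cq_err_entry err;
        if (fi_cq_readerr(cq, &err, 0) > 0) {
            /* handle err.err / err.prov_errno for the failed operation */
        }
        return -1;
    }
    return -1;                          /* unexpected provider error */
}
```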
Retain the inline progress function for the OFI MTL, but add a
non-inlined progress function that is registered with the opal progress
mechanism (see the sketch below).
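A sketch of the inline/non-inline split; the registration call itself
(opal_progress_register() in Open MPI) is only referenced in a comment, and
all names and the progress body are illustrative:

```c
/* Keep a static inline progress routine for hot paths inside the MTL,
 * plus an out-of-line wrapper with external linkage that can be handed
 * to the generic progress engine (opal_progress_register() in the real
 * code). */

static inline int example_ofi_progress_inline(void)
{
    int completed = 0;
    /* ... drain the completion queue, invoke completion callbacks ... */
    return completed;
}

/* Non-inlined, addressable wrapper suitable for callback registration. */
int example_ofi_progress(void)
{
    return example_ofi_progress_inline();
}
```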
@jithinjosepkl
I've bad news about the psm provider. I still notice
segfaults - not always - but frequently at finalize
when using the psm provider. I don't notice this
when using the sockets provider.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>