openmpi

Author	SHA1	Message	Date
Todd Kordenbrock	2c082b6c7c	btl-portals4: fix a flow control configure bug This commit fixes a configure bug that caused flow control to be disabled regardless of the configure options used. Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com> (cherry picked from commit f7e74b6a3d19cd6e3edc3364783311e595f81b3d)	2019-12-12 08:43:20 -06:00
Todd Kordenbrock	1f5a79bbd4	mtl-portals4: don't finalize flow control if Portals4 was not initialized This commit fixes a segfault in mtl-portals4 finalize(). The segfault occurs if finalize() is called without any calls to add_procs(). This commit resolves the segfault by skipping the flow control fini() call if Portals4 was not initialized. Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com> (cherry picked from commit e7b867c044f8b776b75f3c6d917745c06237743e)	2019-12-12 08:43:20 -06:00
Geoff Paulsen	3652cd93c5	Merge pull request #7227 from wbailey2/pr/4.0.x/two_phase_support-fix v4.0.x: fcoll/two_phase: Compiler warning for wrong variable type used	2019-12-09 08:56:48 -06:00
William Bailey	71fe9d78e0	fcoll/two_phase: Compiler warning for wrong variable type used Squash compiler warning. Changed output specifier to match variable type (long int -> long long int). Signed-off-by: William Bailey <wbailey2@nd.edu> (cherry picked from commit e2718e01961782bffeef0bdfdb19ea286da0db2a)	2019-12-08 14:15:29 -05:00
Howard Pritchard	d4dd837a3c	Merge pull request #7194 from edgargabriel/pr/two-phase-aggr-calc-32bits-bug-v4.0.x fcoll/two_phase: fix error in calculating aggregators in 32bit mode	2019-11-27 12:20:34 -07:00
Howard Pritchard	41717c8b73	Merge pull request #7193 from edgargabriel/pr/simple-aggr-mode-fix-v4.0.x common/ompio: fix calculation in simple-grouping option	2019-11-26 09:26:26 -07:00
Edgar Gabriel	02da54c174	fcoll/two_phase: fix error in calculating aggregators in 32bit mode In fcoll_two_phase_supprot_fns.c: calculation of the aggregator index failed for large offsets on 32bit machine, due to improper handling of 64bit offsets. Fixes Issue #7110 Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu> (cherry picked from commit ea1355beae918b3acd67d5c0ccc44afbcc5b7ca9)	2019-11-25 09:06:36 -06:00
Edgar Gabriel	39acc3a251	common/ompio: fix calculation in simple-grouping option This is based on a bug reported on the mailing list using a netcdf testcase. The problem occurs if processes are using a custom file view, but on some of them it appears as if the default file view is being used. Because of that, the simple-grouping option lead to different number of aggregators used on different processes, and ultimately to a deadlock. This patch fixes the problem by not using the file_view size anymore for the calculation in the simple-grouping option, but the contiguous chunk size (which is identical on all processes). Fixes issue #7109 Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu> (cherry picked from commit ad5d0df4e91d66efa5b69e8388f7ed0b4b6a1d09)	2019-11-25 09:04:13 -06:00
Howard Pritchard	7cc5841b9a	Merge pull request #7180 from devreal/btl-ugni-deadlock-v4.0.x uGNI: Fix potential deadlock when processing outstanding transfers (v4.0.x)	2019-11-20 05:03:52 -07:00
Joseph Schuchart	a346756bf4	uGNI: Fix potential deadlock when processing outstanding transfers Signed-off-by: Joseph Schuchart <schuchart@hlrs.de> (cherry picked from commit c09ca039b4703e26d6d7a0494e042dd27827e091)	2019-11-19 22:39:21 +01:00
Howard Pritchard	d57bea0e6d	Merge pull request #7176 from hppritcha/topic/pr_7096_to_v4.0.x opal_check_alps: fix configure output	2019-11-19 12:35:39 -07:00
Geoff Paulsen	40b396c0e8	Merge pull request #7175 from Akshay-Venkatesh/topic/v4.0.x-openib-cx6-detect OPAL/MCA/BTL/OPENIB: Detect ConnectX-6 HCAs	2019-11-19 06:45:17 -06:00
Jeff Squyres	87c0178ed4	opal_check_alps: fix configure output There was a path where OPAL_CHECK_ALPS would exit its testing but still leave `opal_check_cray_alps_happy` blank. Fix that by setting it to "no". Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 26705efad0ab539031049d50f5e21775d9c3f84f)	2019-11-19 05:07:32 -07:00
Akshay Venkatesh	db3e563749	OPAL/MCA/BTL/OPENIB: Detect ConnectX-6 HCAs Signed-off-by: Akshay Venkatesh <akvenkatesh@nvidia.com>	2019-11-18 17:04:54 -08:00
Geoff Paulsen	a2304ebaae	Merge pull request #7152 from hppritcha/topic/btl_uct_fixes_for_v4.0.x Topic/btl uct fixes for v4.0.x	2019-11-08 14:25:43 -06:00
Geoff Paulsen	0e25083700	Merge pull request #7145 from devreal/shmem_memheap_alloc_band_v4.0.x Shmem: use bitwise and instead of logical and to check for allocator capabilities (v4.0.x)	2019-11-08 13:52:41 -06:00
Howard Pritchard	59b24ab4f7	btl/uct: add UCT API version check to configury related to #7128 The UCX crew is no longer guaranteeing that the UCT API is going to be frozen, so this is kind of a whack-a-mole problem trying to keep the BTL UCT working with various changing UCT APIs. Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit 9d345d9aa000233bec148540b071cecffc94438c)	2019-11-07 10:01:52 -07:00
Nathan Hjelm	55e01220cd	btl/uct: fix compilation for UCX 1.7.0 Ref #7128 Signed-off-by: Nathan Hjelm <hjelmn@google.com> (cherry picked from commit a3026c016a6a8be379f62585b6ddc070175c8106)	2019-11-07 10:00:33 -07:00
Nathan Hjelm	47ec3e4d2b	btl/uct: add support for OpenUCX v1.8 API changes OpenUCX broke the UCT API again in v1.8. This commit updates btl/uct to fix compilation with current OpenUCX master (future v1.8). Further changes will likely be needed for the final release. Signed-off-by: Nathan Hjelm <hjelmn@google.com> (cherry picked from commit 526775dfd7ad75c308532784de4fb3ffed25458f)	2019-11-07 10:00:21 -07:00
Joseph Schuchart	ad86d043cf	Shmem: use bitwise and instead of logical and to check for allocator capabilities Signed-off-by: Joseph Schuchart <schuchart@hlrs.de> (cherry picked from commit 9f2c6a42c3c9be42057c8f7881cc2fdff6a36962)	2019-11-06 08:46:06 +01:00
Geoff Paulsen	524960dcdd	Merge pull request #7119 from devreal/grequestx-progress-v4.0.x Ensure that grequestx continuously make progress (v4.0.x)	2019-11-01 14:12:48 -05:00
Howard Pritchard	608502ff85	Merge pull request #7102 from edgargabriel/pr/v4.0.x-romio321-status-set-elements-fix MPIR_Status_set_bytes: fix for large count sizes	2019-11-01 13:08:35 -06:00
Geoff Paulsen	dbb873f46f	Merge pull request #7047 from edgargabriel/pr/v4.0.x-hdf5-2gb-bug comomn_ompio_file_read/write: fix 2GB limiting issue	2019-11-01 14:06:46 -05:00
Geoff Paulsen	55527abbc7	Merge pull request #7125 from sam6258/smiller_rsh_chdir_v4.0.x plm/rsh: Add chdir option to change directory before orted exec	2019-10-29 15:59:09 -05:00
Scott Miller	8eae54fd27	plm/rsh: Add chdir option to change directory before orted exec Signed-off-by: Scott Miller <scott.miller1@ibm.com> (cherry picked from commit c1b8599528feaac3c1d9ad44f0113f65eb7cc97f) Conflicts: orte/mca/plm/rsh/plm_rsh_module.c	2019-10-29 15:49:41 -04:00
Joseph Schuchart	b7f5c17d83	Ensure that grequestx continuously make progress Signed-off-by: Joseph Schuchart <schuchart@hlrs.de> (cherry picked from commit 37e6bbb1e12a61b4a7ac90c411752a768fede67d)	2019-10-29 10:31:23 +01:00
Geoff Paulsen	3dba9ec2b4	Merge pull request #7033 from jjhursey/v4-fix-sigkill-wait v4.0.x:Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion	2019-10-22 12:40:58 -05:00
Edgar Gabriel	a3e1ecc14b	comomn_ompio_file_read/write: fix 2GB limiting issue individual read/write operations exceeding 2GB fail in ompio due to improper conversions from size_t to int in two different locations. This commit fixes an issue reported by Richard Warren from the HDF5 group. Fixes Issue #7045 Cherry-picked from commit a130f569df6badebe639d17c0a1a0cb79a596094 Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2019-10-22 12:12:55 -05:00
Edgar Gabriel	6185fa1946	MPIR_Status_set_bytes: fix for large count sizes Change the ncounts argument to MPI_Count and use MPI_Status_set_elements_x for enabling read/write operations beyond the 2GB limit. Thanks to Richard Warren from the HDF5 group for reporting the issue and providing the suggested fix for romio. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu> (cherry picked from commit 8a3abbf80373bb36eb78688954118f1fb0cc2e9d)	2019-10-22 09:51:29 -05:00
Howard Pritchard	106109a286	Merge pull request #7043 from jsquyres/pr/v4.0.x/usnic-fixes-and-optimizations v4.0.x: usnic fixes and optimizations	2019-10-22 09:05:27 -05:00
Jeff Squyres	c6592822c0	btl/usnic: set retrans_timeout back down to 5ms Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 3080033a8c4db64199b03b6058e18488f619088c)	2019-10-15 07:54:32 -07:00
Jeff Squyres	1565239506	btl/usnic: set ack_iteration_delay default to 4 It was previously accidentally set to 0. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 132e4cab3bc71df0da87368a332d6af0090a6977)	2019-10-15 07:54:31 -07:00
Geoff Paulsen	cb5f4e737a	Merge pull request #7048 from hppritcha/topic/update_news_for_v402 NEWS: update for the v4.0.2 release	2019-10-06 15:22:59 -05:00
Howard Pritchard	7c2efbd616	NEWS: update for the v4.0.2 release [skip ci] Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2019-10-05 13:14:31 -06:00
Jeff Squyres	22bc268e6e	btl/usnic: properly size freelist items Move the prefix area from the head to the body in relevant size computations. This fixes a problem in high traffic situations where usNIC may have sent from unregistered memory. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit fe7f772f21627b01838c007db7cedbbb0ce8b536)	2019-10-04 16:47:19 -07:00
Jeff Squyres	58155bc760	btl/usnic: cap the number of resends per progress iteration New MCA param: btl_usnic_max_resends_per_iteration. This is the max number of resends we'll do in a single pass through usNIC component progress. This prevents progress from getting stuck in an endless loop of retransmissions (i.e., if more retransmissions are triggered during the sending of retransmissions). Specifically: we need to leave the resend loop to allow receives to happen (which may ACK messages we have sent previously, and therefore cause pending resends to be moot). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 27e3040dfeba00a9a2615a217c164899f0009e59)	2019-10-04 16:47:13 -07:00
Jeff Squyres	8f929c68f1	btl/usnic: increase default retrans_timeout Significantly increase the default retrans timeout. If the retrans timeout is too soon, we can end up in a retransmission storm where the logic will continually re-transmit the same frames during a single run through the usNIC progress function (because the timer for a single frame expires before we have run through re-transmitting all the frames pending re-transmission). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 3cc95d86b2123f38f392e56adca7ac8a1fef6454)	2019-10-04 16:47:11 -07:00
Jeff Squyres	b5cb03450c	btl/usnic: clarifications and fixes regarding ACKs New MCA parameter: btl_usnic_ack_iteration_delay. Set this to the number of times through the usNIC component progress function before sending a standalone ACK (vs. piggy-backing the ACK on any other send going to the target peer). Use "ticks" language to clarify that we're really counting the number of times through the usNIC component DATA_CHANNEL completion check (to check for incoming messages) -- it has no relation to wall clock time whatsoever. Also slightly change the channel-checking scheme in usNIC component progress: only check the PRIORITY channel once (vs. checking it once, not finding anything, and then falling through the progress_2() where we check PRIORITY again and then check the DATA channel). As before, if our "progress" libevent fires, increment the tick counter enough to guarantee that all endpoints that need an ACK will get triggered to send standalone ACKs the next time through progress, if necessary. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 968b1a51b59898877a8c7268d463d3d7d78d86a3)	2019-10-04 16:47:09 -07:00
Jeff Squyres	0839a9c313	btl/usnic: s/get_nsec/get_nticks/g Rename "get_nsec()" to "get_ticks()" to more accurately reflect that this function has no correlation to wall clock time at all. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit ce2910a28aea61043b81324c67999f3a47cfe7ac)	2019-10-04 16:47:08 -07:00
Joshua Hursey	c6fab32137	Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion. * The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait 30 seconds before sending SIGTERM then another 30 seconds before sending SIGKILL to remaining processes. This usually happens on an abnormal termination. Sometimes the user wants to delay the cleanup to give the system time to write out corefile or run other diagnostics. * The problem is that child processes may be completing while ORTE is in this loop. The SIGCHLD will interrupt the `sleep` system call. Without the loop the sleep could effectively be ignored in this case. - Sleep returns the amount of time remaining to sleep. If it was interrupted by a signal then it is a positive number less than or equal to the parameter passed to it. If it slept the whole time then it returns 0. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com> (cherry picked from commit 0e8a97c598d841d472047ea1025931813c3ef8a9)	2019-10-02 14:49:47 -05:00
Howard Pritchard	2d0dcaeedc	Merge pull request #7018 from rhc54/cmr40/oob v4.0.x: Cleanup stale code in ORTE/OOB	2019-09-27 12:58:26 -06:00
Ralph Castain	edbfcf090a	Cleanup stale code in ORTE/OOB Remove code for multiple OOB progress threads as it is an optimization nobody uses. Also turns out to have a race condition that can cause segfault on finalize, so maybe good that nobody is using it. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 41eb41c3f224abcacec78f4d77c138197c36f172) (cherry picked from commit a2f35c1834ab2fcb216285621d177a179e33dfe7)	2019-09-26 15:21:15 -07:00
Geoff Paulsen	2ea5548dbe	Merge pull request #7017 from hppritcha/topic/patch_ofi_v4.0.x mtl/ofi: replace OMPI_UNLIKELY with OPAL version	2019-09-26 17:18:22 -05:00
Howard Pritchard	5f3dbdb5c8	mtl/ofi: replace OMPI_UNLIKELY with OPAL version one off patch for v4.0.x. for some reason commit on master didn't have this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2019-09-26 16:01:28 -05:00
Geoff Paulsen	9f4b7ba508	Merge pull request #7002 from hoopoepg/topic/restored-ikrit-compilation-v4.0 IKRIT: restored compilation - v4.0	2019-09-24 12:26:12 -05:00
Geoff Paulsen	32984ceb65	Merge pull request #7005 from mwheinz/REFS6976-4.0.x v4.0.x: REF6976 Silent failure of OMPI over OFI with large messages sizes	2019-09-24 12:25:43 -05:00
Howard Pritchard	aece12963b	Merge pull request #7001 from amaslenn/mlnx-no-missing-libcuda-warn-v4 platform/mellanox: disable missing libcuda warning — v4	2019-09-24 09:32:58 -06:00
Michael Heinz	89be953cfd	REF6976 Silent failure of OMPI over OFI with large messages sizes INTERNAL: STL-59403 The OFI (libfabric) MTL does not respect the maximum message size parameter that OFI provides in the fi_info data. This patch adds this missing max_msg_size field to the mca_ofi_module_t structure and adds a length check to the low-level send routines. (cherry-picked from commit 3aca4af548a3d781b6b52f89f4d6c7e66d379609) Change-Id: Ie50445e5edfb0f30916de0836db0edc64ecf7c60 Signed-off-by: Michael Heinz <michael.william.heinz@intel.com> Reviewed-by: Adam Goldman <adam.goldman@intel.com> Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>	2019-09-23 17:19:10 -04:00
Sergey Oblomov	f8843bba7c	IKRIT: restored compilation - due to some refactoring and adding new functionality compilation of ikrit module was broken - this commit restores compilation Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit 991082abf2da3a76849be021c5f7ecced8052709)	2019-09-23 15:49:50 +03:00
Andrey Maslennikov	226dfc4ef0	platform/mellanox: disable missing libcuda warning Signed-off-by: Andrey Maslennikov <andreyma@mellanox.com> (cherry picked from commit 63ba7bec46e6a08f9948c82cba602d3b8d50fada)	2019-09-23 11:18:55 +03:00
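
The sleep handling described in the c6fab32137 entry above (a `sleep` call cut short by SIGCHLD returns the time still left) can be sketched roughly as follows. This is a minimal illustration, not the ORTE code: `sleep_full_interval`, `sleep_fn_t`, and `interrupted_every_second` are hypothetical names introduced here, and the function-pointer parameter exists only so the loop can be exercised without waiting on a real timer.

```c
#include <assert.h>
#include <unistd.h>

/* Signature shared by sleep(3) and any stand-in used for testing. */
typedef unsigned int (*sleep_fn_t)(unsigned int seconds);

/*
 * Hypothetical helper (not the ORTE implementation): keep sleeping until
 * the whole interval has elapsed. sleep_fn returns the number of seconds
 * left unslept, which is nonzero when a signal such as SIGCHLD interrupted
 * the call, and 0 when it slept the whole time.
 * Returns the number of sleep_fn calls that were needed.
 */
unsigned int sleep_full_interval(unsigned int seconds, sleep_fn_t sleep_fn)
{
    unsigned int calls = 0;

    assert(sleep_fn != NULL);
    while (seconds > 0) {
        seconds = sleep_fn(seconds);  /* remaining time; 0 means done */
        calls++;
    }
    return calls;
}

/* Stand-in that behaves as if a signal arrived after one second each time. */
unsigned int interrupted_every_second(unsigned int seconds)
{
    return seconds > 0 ? seconds - 1 : 0;
}
```

Passing the real `sleep` as `sleep_fn` gives the behavior the commit message describes: without the loop, a single interrupted call would end the wait early.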

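The ad86d043cf entry above replaces a logical AND with a bitwise AND when checking allocator capabilities. The difference can be illustrated with the sketch below. The flag names and helper functions are hypothetical, since the log does not show the actual OSHMEM identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits; the real flag names are not in the log. */
#define CAP_ALLOC_HOST   0x1u
#define CAP_ALLOC_DEVICE 0x2u

/* Buggy form: true whenever both operands are nonzero, whatever bits are set. */
bool has_cap_logical(unsigned int caps, unsigned int wanted)
{
    return caps && wanted;
}

/* Intended form: true only if at least one wanted bit is set in caps. */
bool has_cap_bitwise(unsigned int caps, unsigned int wanted)
{
    return (caps & wanted) != 0;
}
```

With `caps` set to `CAP_ALLOC_HOST` only, the logical check wrongly reports `CAP_ALLOC_DEVICE` as available, while the bitwise check does not.
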

29644 commits