openmpi

Author	SHA1	Message	Date
Howard Pritchard	7c2efbd616	NEWS: update for the v4.0.2 release [skip ci] Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2019-10-05 13:14:31 -06:00
Jeff Squyres	22bc268e6e	btl/usnic: properly size freelist items Move the prefix area from the head to the body in relevant size computations. This fixes a problem in high traffic situations where usNIC may have sent from unregistered memory. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `fe7f772f21`)	2019-10-04 16:47:19 -07:00
Jeff Squyres	58155bc760	btl/usnic: cap the number of resends per progress iteration New MCA param: btl_usnic_max_resends_per_iteration. This is the max number of resends we'll do in a single pass through usNIC component progress. This prevents progress from getting stuck in an endless loop of retransmissions (i.e., if more retransmissions are triggered during the sending of retransmissions). Specifically: we need to leave the resend loop to allow receives to happen (which may ACK messages we have sent previously, and therefore cause pending resends to be moot). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `27e3040dfe`)	2019-10-04 16:47:13 -07:00
Jeff Squyres	8f929c68f1	btl/usnic: increase default retrans_timeout Significantly increase the default retrans timeout. If the retrans timeout is too soon, we can end up in a retransmission storm where the logic will continually re-transmit the same frames during a single run through the usNIC progress function (because the timer for a single frame expires before we have run through re-transmitting all the frames pending re-transmission). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `3cc95d86b2`)	2019-10-04 16:47:11 -07:00
Jeff Squyres	b5cb03450c	btl/usnic: clarifications and fixes regarding ACKs New MCA parameter: btl_usnic_ack_iteration_delay. Set this to the number of times through the usNIC component progress function before sending a standalone ACK (vs. piggy-backing the ACK on any other send going to the target peer). Use "ticks" language to clarify that we're really counting the number of times through the usNIC component DATA_CHANNEL completion check (to check for incoming messages) -- it has no relation to wall clock time whatsoever. Also slightly change the channel-checking scheme in usNIC component progress: only check the PRIORITY channel once (vs. checking it once, not finding anything, and then falling through the progress_2() where we check PRIORITY again and then check the DATA channel). As before, if our "progress" libevent fires, increment the tick counter enough to guarantee that all endpoints that need an ACK will get triggered to send standalone ACKs the next time through progress, if necessary. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `968b1a51b5`)	2019-10-04 16:47:09 -07:00
Jeff Squyres	0839a9c313	btl/usnic: s/get_nsec/get_nticks/g Rename "get_nsec()" to "get_ticks()" to more accurately reflect that this function has no correlation to wall clock time at all. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `ce2910a28a`)	2019-10-04 16:47:08 -07:00
Joshua Hursey	c6fab32137	Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion. * The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait 30 seconds before sending SIGTERM then another 30 seconds before sending SIGKILL to remaining processes. This usually happens on an abnormal termination. Sometimes the user wants to delay the cleanup to give the system time to write out corefile or run other diagnostics. * The problem is that child processes may be completing while ORTE is in this loop. The SIGCHLD will interrupt the `sleep` system call. Without the loop the sleep could effectively be ignored in this case. - Sleep returns the amount of time remaining to sleep. If it was interrupted by a signal then it is a positive number less than or equal to the parameter passed to it. If it slept the whole time then it returns 0. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com> (cherry picked from commit `0e8a97c598`)	2019-10-02 14:49:47 -05:00
Howard Pritchard	2d0dcaeedc	Merge pull request #7018 from rhc54/cmr40/oob v4.0.x: Cleanup stale code in ORTE/OOB	2019-09-27 12:58:26 -06:00
Ralph Castain	edbfcf090a	Cleanup stale code in ORTE/OOB Remove code for multiple OOB progress threads as it is an optimization nobody uses. Also turns out to have a race condition that can cause segfault on finalize, so maybe good that nobody is using it. Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit `41eb41c3f2`) (cherry picked from commit a2f35c1834ab2fcb216285621d177a179e33dfe7)	2019-09-26 15:21:15 -07:00
Geoff Paulsen	2ea5548dbe	Merge pull request #7017 from hppritcha/topic/patch_ofi_v4.0.x mtl/ofi: replace OMPI_UNLIKELY with OPAL version	2019-09-26 17:18:22 -05:00
Howard Pritchard	5f3dbdb5c8	mtl/ofi: replace OMPI_UNLIKELY with OPAL version one off patch for v4.0.x. for some reason commit on master didn't have this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2019-09-26 16:01:28 -05:00
Geoff Paulsen	9f4b7ba508	Merge pull request #7002 from hoopoepg/topic/restored-ikrit-compilation-v4.0 IKRIT: restored compilation - v4.0	2019-09-24 12:26:12 -05:00
Geoff Paulsen	32984ceb65	Merge pull request #7005 from mwheinz/REFS6976-4.0.x v4.0.x: REF6976 Silent failure of OMPI over OFI with large messages sizes	2019-09-24 12:25:43 -05:00
Howard Pritchard	aece12963b	Merge pull request #7001 from amaslenn/mlnx-no-missing-libcuda-warn-v4 platform/mellanox: disable missing libcuda warning — v4	2019-09-24 09:32:58 -06:00
Michael Heinz	89be953cfd	REF6976 Silent failure of OMPI over OFI with large messages sizes INTERNAL: STL-59403 The OFI (libfabric) MTL does not respect the maximum message size parameter that OFI provides in the fi_info data. This patch adds this missing max_msg_size field to the mca_ofi_module_t structure and adds a length check to the low-level send routines. (cherry-picked from commit `3aca4af548`) Change-Id: Ie50445e5edfb0f30916de0836db0edc64ecf7c60 Signed-off-by: Michael Heinz <michael.william.heinz@intel.com> Reviewed-by: Adam Goldman <adam.goldman@intel.com> Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>	2019-09-23 17:19:10 -04:00
Sergey Oblomov	f8843bba7c	IKRIT: restored compilation - due to some refactoring and adding new functionality compilation of ikrit module was broken - this commit restores compilation Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit `991082abf2`)	2019-09-23 15:49:50 +03:00
Andrey Maslennikov	226dfc4ef0	platform/mellanox: disable missing libcuda warning Signed-off-by: Andrey Maslennikov <andreyma@mellanox.com> (cherry picked from commit `63ba7bec46`)	2019-09-23 11:18:55 +03:00
Geoff Paulsen	2f101326fc	Merge pull request #6997 from jsquyres/pr/v4.0.x/vader-do-not-use-cma v4.0.x: Do not use CMA in user namespaces	2019-09-22 07:39:28 -05:00
Adrian Reber	674655c641	Do not use CMA in user namespaces Trying to run processes via mpirun in Podman containers has shown that the CMA btl_vader_single_copy_mechanism does not work when user namespaces are involved. Creating containers with Podman requires at least user namespaces to be able to do unprivileged mounts in a container. Even if running the container with user namespace user ID mappings which result in the same user ID on the inside and outside of all involved containers, the check in the kernel to allow ptrace (and thus process_vm_{read,write}v()) fails if the same IDs are not in the same user namespace. One workaround is to specify '--mca btl_vader_single_copy_mechanism none' and this commit adds code to automatically skip CMA if user namespaces are detected and fall back to MCA_BTL_VADER_EMUL. Signed-off-by: Adrian Reber <areber@redhat.com> (cherry picked from commit `fc68d8a90f`)	2019-09-20 19:12:48 -07:00
Geoff Paulsen	71d97f0355	Merge pull request #6994 from gpaulsen/gpaulsen_v4.0.2rc3 Updating VERSION v4.0.2rc3	2019-09-20 13:52:04 -05:00
Howard Pritchard	83df06275d	Merge pull request #6996 from jsquyres/pr/v4.0.x/enable-timings-compile-fix v4.0.x: ess/pmi: Fix `--enable-timing` compilation error	2019-09-20 12:42:52 -06:00
KAWASHIMA Takahiro	e5be033c14	ess/pmi: Fix `--enable-timing` compilation error This commit fixes a compilation error when configured with `--enable-timing`. Procedures in the function `orte_ess_base_app_setup` in `orte/mca/ess/base/ess_base_std_app.c` were moved to `orte/mca/ess/pmi/ess_pmi_module.c` and `orte/mca/ess/singleton/ess_singleton_module.c` in the recent commit `57f6b94fa5`. In `ess_pmi_module.c`, the first argument of the `OPAL_TIMING_ENV_NEXT` macro should have been adapted to the destination function but was not. In `ess_singleton_module.c`, `OPAL_TIMING_ENV_INIT` was not used in the destination function originally. So `OPAL_TIMING_ENV_NEXT` cannot be used in the function. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com> (cherry picked from commit `8e7d874e14`)	2019-09-19 18:14:06 -04:00
Howard Pritchard	265a47bdf8	Merge pull request #6990 from awlauria/fix_mpir_standard_v4.0.x v4.0.x: Conform MPIR_Breakpoint to MPIR standard.	2019-09-19 15:41:45 -06:00
Geoffrey Paulsen	0bb0e59345	Updating VERSION to v4.0.2rc3. Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>	2019-09-19 14:22:57 -05:00
Austen Lauria	1430df3c0f	Add 'orte_' prefix to noop_mpir_breakpoint_ptr. Signed-off-by: Austen Lauria <awlauria@us.ibm.com> (cherry picked from commit `77144689f0`)	2019-09-19 08:47:17 -04:00
Austen Lauria	3eb7b27d3a	Conform MPIR_Breakpoint to MPIR standard. - Fix MPIR_Breakpoint standard violation by returning void instead of a void*. Signed-off-by: Austen Lauria <awlauria@us.ibm.com> (cherry picked from commit `067adfa417`)	2019-09-18 09:53:15 -04:00
Geoff Paulsen	90b55db052	Merge pull request #6986 from hppritcha/topic/pr6961_to_4.0.x btl/vader: when using single-copy emulation fragment large rdma	2019-09-18 07:32:04 -05:00
Nathan Hjelm	5a945f668c	btl/vader: when using single-copy emulation fragment large rdma This commit changes how the single-copy emulation in the vader btl operates. Before this change the BTL set its put and get limits based on the max send size. After this change the limits are unset and the put or get operation is fragmented internally. References #6568 Signed-off-by: Nathan Hjelm <hjelmn@google.com> (cherry picked from commit `ae91b11de2`)	2019-09-17 20:01:37 -06:00
Geoff Paulsen	84e4af5175	Merge pull request #6969 from gpaulsen/topic/v4.0.x_VERSION_rc2 Reving VERSION to v4.0.2rc2	2019-09-10 09:58:52 -05:00
Geoffrey Paulsen	49a2558eff	Reving VERSION to v4.0.2rc2 Reving VERSION to v4.0.2rc2 Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>	2019-09-09 14:48:52 -04:00
Geoff Paulsen	a482edc14e	Merge pull request #6944 from jjhursey/v4/fix-tree-launch Fix tree spawn routed component issue	2019-09-09 13:10:36 -05:00
Geoff Paulsen	ce228d291f	Merge pull request #6952 from jsquyres/pr/v4.0.x/ddt-opt-and-fix v4.0.x: Datatype optimization and fix	2019-09-09 13:09:49 -05:00
Geoff Paulsen	287ee150d1	Merge pull request #6967 from rhc54/cmr40x/oob v4.0.x: Be a little less restrictive on interface requirements	2019-09-09 13:08:58 -05:00
Ralph Castain	95cc53e331	Be a little less restrictive on interface requirements If both types of interfaces are enabled, don't error out if one of them isn't able to open listener sockets. Only one interface family may be available on some machines, but someone might want to build the code to run more generally. Refs https://github.com/pmix/prrte/pull/249 Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit `06d188ebf3`)	2019-09-09 07:52:03 -07:00
George Bosilca	8f16780ee0	Add a test for datatypes composed by multiple predefined elements that can be merged into a larger UINT1 type. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit `82d632278a`)	2019-09-03 15:09:34 -04:00
George Bosilca	e2b154327e	Small optimization on the datatype commit. This patch fixes the merge of contiguous elements into larger but more compact datatypes, and allows for contiguous elements to have their blocklen increasing instead of the count. The idea is to always maximize the blocklen, i.e., the contiguous part of the datatype. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit `41e6f55807`)	2019-09-03 15:09:33 -04:00
Howard Pritchard	6912e09d7a	Merge pull request #6942 from guserav/v4-fix-osc-sm-post-32-bit-atomics v4.0.x: Fix osc sm posts when only 32 bit atomics support v4.0.x	2019-09-03 09:20:38 -06:00
guserav	9bf1873215	Fix osc sm posts when only 32 bit atomics support Signed-off-by: guserav <erik.zeiske@hpe.com> (cherry picked from commit `3c9f4e6823`)	2019-08-31 12:31:19 -07:00
Geoff Paulsen	893ea3f91f	Merge pull request #6929 from rhc54/cmr40/pmix314 Remove unnecessary error log	2019-08-30 14:10:36 -05:00
Howard Pritchard	c6fe859c28	Merge pull request #6946 from hkuno/intercept_mmap_fix v4.0.x: Fix mmap infinite recurse in memory patcher	2019-08-30 10:56:43 -06:00
Howard Pritchard	989461f305	Merge pull request #6915 from sam6258/smiller_regx_none v4.0.x: regx/naive: add regx/naive component	2019-08-30 07:51:10 -06:00
Harumi Kuno	fbbacc1303	Fix mmap infinite recurse in memory patcher This commit fixes issue #6853 by removing MacOS/Darwin-specific logic from intercept_mmap. Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>	2019-08-29 18:03:16 -07:00
Joshua Hursey	4c1160e257	Fix tree spawn routed component issue * Fix #6618 - See comments on Issue #6618 for finer details. * The `plm/rsh` component uses the highest priority `routed` component to construct the launch tree. The remote orteds will activate all available `routed` components when updating routes. This allows the opportunity for the parent vpid on the remote `orted` to not match that which was expected in the tree launch. The result is that the remote orted tries to contact its parent with the wrong contact information and orted wireup will fail. * This fix forces the orteds to use the same `routed` component as the HNP used when constructing the tree, if tree launch is enabled. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2019-08-29 16:26:43 -04:00
Geoff Paulsen	2d515f747f	Merge pull request #6934 from devreal/osc-ucx-excl-lock-v4.0.x UCX osc: properly release exclusive lock to avoid lockup (v4.0.x)	2019-08-29 13:41:03 -05:00
Geoff Paulsen	78b8b0126c	Merge pull request #6938 from jsquyres/pr/v4.0.x/fix-ddt-variable-names-in-make-check v4.0.x: Update OPAL DDT variable names	2019-08-28 13:57:57 -05:00
Joseph Schuchart	8d130e1964	UCX osc: properly release exclusive lock to avoid lockup Signed-off-by: Joseph Schuchart <schuchart@hlrs.de> (cherry picked from commit 08cb6389e034c1a70368671f745f20904c774a1e)	2019-08-27 23:12:56 +02:00
Howard Pritchard	061574f938	Merge pull request #6935 from vspetrov/v4.0.x_coll_hcoll_nbc_request_bugfix V4.0.x Coll/hcoll: fixes hcoll non-blocking colls support	2019-08-27 14:40:36 -06:00
Jeff Squyres	8b3fd5682f	Update OPAL DDT variable names These variables were renamed in 904276bb44caec207638247f23139bc21bc6a09e; update them to use the new names. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `2ab8109be1`)	2019-08-27 13:08:16 -07:00
Valentin Petrov	83a2518994	Coll/hcoll: fixes hcoll non-blocking colls support open-mpi/ompi@0fe756d416 introduced a bug in the coll/hcoll component. The ompi_requests allocated by libhcoll would be treated as coll_base_nbc_request during the ompi_coll_base_retain_<> call. Afterwards this would lead to a segv in the request cleanup. Fix: since the libhcoll interface does not distinguish between blocking and non-blocking requests, use coll_base_nbc_request all the time and initialize it properly in coll/hcoll/get_coll_handle(). It is still within 2 cache lines. Signed-off-by: Valentin Petrov <valentinp@mellanox.com>	2019-08-27 17:23:52 +03:00
Ralph Castain	8efc6e1dc1	Remove unnecessary error log Refs https://github.com/pmix/pmix/pull/1413 Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-08-26 23:48:34 -07:00
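
The sigkill timeout fix in commit c6fab32137 relies on the fact that POSIX `sleep()` returns the number of unslept seconds when a signal handler interrupts it, and 0 when the full interval has elapsed. A minimal sketch of that retry loop follows. The function name `sleep_uninterrupted` and the interruption counter are hypothetical and used only for illustration; this is not the ORTE code from the commit.

```c
#include <unistd.h>

/*
 * Sleep for the full requested number of seconds, restarting sleep()
 * with the remaining time whenever a signal (e.g. SIGCHLD) cuts it short.
 * Returns how many times the sleep was interrupted.
 * Hypothetical helper, not taken from the Open MPI sources.
 */
static unsigned int sleep_uninterrupted(unsigned int seconds)
{
    unsigned int interruptions = 0;
    unsigned int remaining = seconds;

    while (remaining > 0) {
        /* sleep() returns 0 on completion, else the unslept seconds. */
        remaining = sleep(remaining);
        if (remaining > 0) {
            ++interruptions;
        }
    }
    return interruptions;
}
```

Without the loop, a single SIGCHLD arriving early could end the wait almost immediately, which is the behavior the commit message describes.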


29811 commits