openmpi

Author	SHA1	Message	Date
NARIBAYASHI Akira	4ce2f96c10	opal/util: Fix typo Signed-off-by: NARIBAYASHI Akira <a.naribayashi@fujitsu.com> (cherry picked from commit 3dc3bbc1b17d84172f5ee8b2936b3e2cab55c813)	2020-10-12 16:19:38 +09:00
Brian Barrett	31496e28e6	Merge pull request #8062 from brminich/topic/shmem_scoll_fix_4.1.x SHMEM/SCOLL: Fix inplace reductions - v4.1.x	2020-10-01 12:37:19 -07:00
Brian Barrett	51dc13d9b8	Merge pull request #8071 from wzamazon/v4.1.x_fix_oob_tcp oob/tcp: fix a race condition on stop_thread pipe	2020-10-01 12:30:17 -07:00
Brian Barrett	2486c75475	Merge pull request #8073 from bwbarrett/dist/v4.1.x Prep for 4.1.0rc2 release	2020-10-01 12:28:52 -07:00
Brian Barrett	af3113d797	dist: Update NEWS for 4.1.0 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2020-10-01 15:52:41 +00:00
Brian Barrett	d68ec2c35d	dist: Update version to 4.1.0rc2 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2020-10-01 15:45:55 +00:00
Wei Zhang	48cca8ab83	oob/tcp: fix a race condition on stop_thread pipe This patch fixes a race condition that caused the main thread to sometimes write to a closed pipe. Details: Currently, during shutdown, the main thread does the following: 1. sets listen_thread_action to false. 2. writes to the stop_thread pipe to tell the listener thread to exit. The listener thread does the following: 1. calls select() to listen to a set of file descriptors with a maximum wait time. 2. checks listen_thread_action. If it is false, closes the stop_thread pipe. The main thread writes to a closed pipe when: 1. the listener's call to select() finishes because the maximum wait time was reached. 2. the main thread sets listen_thread_action to false. 3. the listener thread checks listen_thread_action and closes the pipe. 4. the main thread writes to the closed pipe. This patch addresses the issue by having the main thread close the pipe after the listener thread has been joined. This way, the main thread both writes to and closes the pipe, so there is no conflict. Note: This patch was opened directly against the v4.1.x branch because the orte/mca/oob/tcp directory has been removed from the master branch. Signed-off-by: Wei Zhang <wzam@amazon.com>	2020-09-30 18:22:53 +00:00
Mikhail Brinskii	0f6dd8f2a5	SHMEM/SCOLL: Fix inplace reductions Signed-off-by: Mikhail Brinskii <mikhailb@nvidia.com> (cherry picked from commit dfe20e0472dff1f1911d72ede2b20913be6bd2e3)	2020-09-26 14:16:48 +03:00
Jeff Squyres	edf03e52f3	Merge pull request #7944 from bosilca/4.1/adapt Import the ADAPT collective into the 4.1	2020-09-23 16:42:49 -04:00
Xi Luo	e65fa4ff5c	Bring ADAPT collective to 4.1 This is a meta commit, that encapsulate all the ADAPT commits in the master into a single PR for 4.1. The master commits included here are: fe73586, a4be3bb, d712645, c2970a3, e59bde9, ee592f3 and c98e387. Here is a detailed list of added capabilities: * coll/adapt: Fix naming conventions and C11 atomic use * coll/adapt: Remove unused component field in module * Consistent handling of zero counts in the MPI API. * Correctly handle non-blocking collectives tags * As it is possible to have multiple outstanding non-blocking collectives provided by different collective modules, we need a consistent mechanism to allow them to select unique tags for each instance of a collective. * Add support for fallback to previous coll module on non-commutative operations (#30) * Replace mutexes by atomic operations. * Use the correct nbc request type (for both ibcast and ireduce) * coll/base: document type casts in ompi_coll_base_retain_* * add module-wide topology cache * use standard instead of synchronous send and add mca parameter to control mode of initial send in ireduce/ibcast * reduce number of memory allocations * call the default request completion. * Remove the requests from the Fortran lookup conversion tables before completing and free it. * piggybacking Bull functionalities Signed-off-by: Xi Luo <xluo12@vols.utk.edu> Signed-off-by: George Bosilca <bosilca@icl.utk.edu> Signed-off-by: Marc Sergent <marc.sergent@atos.net> Co-authored-by: Joseph Schuchart <schuchart@hlrs.de> Co-authored-by: Lemarinier, Pierre <pierre.lemarinier@atos.net> Co-authored-by: pierrele <31764860+pierrele@users.noreply.github.com>	2020-09-23 11:45:45 -04:00
Jeff Squyres	3ec835d697	Merge pull request #8048 from wckzhang/v4.1.x v4.1.x: coll/tuned: Fix dynamic message size for gather and scatter	2020-09-16 09:10:51 -04:00
William Zhang	7922430c66	coll/tuned: Fix dynamic message size for gather and scatter The gather and scatter operations did not use the correct message size (Only did datatype size * com size). This did not correctly reflect the total message size and prevents fine tuning within a com size. This patch multiplies the value by the number of elements sent. Signed-off-by: William Zhang <wilzhang@amazon.com> (cherry picked from commit 50823fe9a9ef4f93e55ee2087b311303d49f90a8)	2020-09-15 08:56:56 -07:00
Jeff Squyres	abfc6ba931	Merge pull request #8025 from hppritcha/topic/suppress_icc_no_long_double_v41x suppress icc long double message	2020-09-14 09:39:05 -04:00
Jeff Squyres	1d8caab4f5	Merge pull request #8027 from hppritcha/topic/patch_ofi_mtl_for_gni_prov_v41x OFI: patch OFI MTL for GNI provider	2020-08-31 09:30:37 -04:00
Howard Pritchard	fae374ce54	OFI: patch OFI MTL for GNI provider Uncovered a problem using the GNI provider with the OFI MTL. See https://github.com/ofiwg/libfabric/issues/6194. Related to #8001 Signed-off-by: Howard Pritchard <hppritcha@gmail.com> (cherry picked from commit d6ac41cbbd8dbfdbf1a87533e2e292f2508a85ff)	2020-08-28 14:30:31 -06:00
Brian Barrett	8b2bcbc9ca	Merge pull request #8021 from wckzhang/v4.1.x coll/tuned: Revert RSB and RS default algorithms	2020-08-27 10:33:42 -07:00
Howard Pritchard	c4128b21a0	suppress icc long double message improve configury to check whether icc is handling no long double. This prevents seeing 100s of messages like this: icc: command line warning #10148: option '-Wno-long-double' not supported A similar patch will be needed for pmix. Signed-off-by: Howard Pritchard <hppritcha@gmail.com> (cherry picked from commit 6df0e53421b285eef2dc0b442d818026301d29f1)	2020-08-27 16:04:50 +00:00
William Zhang	dceea5ad87	coll/tuned: Revert RSB and RS default algorithms Reduce scatter block and reduce scatter algorithms were hitting correctness issues for non commutative strided tests. We will revert to the original default algorithms for those two collectives (basic linear and non overlapping respectively) in the non commutative op case. See #8010 Signed-off-by: William Zhang <wilzhang@amazon.com> (cherry picked from commit 57b95bcb45d5ce3ae1a1e00bd17ceeaa206526fe)	2020-08-25 15:48:45 -07:00
Jeff Squyres	c4eb2fe40a	Merge pull request #8014 from hppritcha/topic/fix_ofi_mtl_mrecv_v41x ofi mtl: fix problem with mrecv	2020-08-19 13:50:25 -04:00
Howard Pritchard	833f8b2b41	ofi mtl: fix problem with mrecv the ofi mtl mrecv was not properly setting the message in/out arg to MPI_MRECV to MPI_MESSAGE_NULL. Signed-off-by: Howard Pritchard <hppritcha@gmail.com> (cherry picked from commit e6f81ed6d6f4940c95732451d6ebf4d39cde1591)	2020-08-19 10:22:37 -06:00
Brian Barrett	f63d6e2335	Merge pull request #7998 from wckzhang/v4.1.x V4.1.x btl/ofi backport disabling EFA and ofi_rxm providers	2020-08-17 12:12:07 -07:00
Jeff Squyres	2a0757f997	Merge pull request #8006 from bosilca/fix/smcuda_alignment Fix the cacheline usage in the CUDA BTL.	2020-08-17 15:08:42 -04:00
George Bosilca	a81c16b9d0	Fix the cacheline usage in the CUDA BTL. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2020-08-14 14:36:33 -04:00
William Zhang	27c252e211	btl/ofi: Disable ofi_rxm provider The ofi_rxm provider is dependent upon the underlying hardware for its implementation of FI_DELIVERY_COMPLETE. Since this can lead to early completions, we disable the provider to avoid correctness issues. This is not an issue in the mtl/ofi as it does not require FI_DELIVERY_COMPLETE. Signed-off-by: William Zhang <wilzhang@amazon.com> (cherry picked from commit 41acfee2bbfc5495aeeeae4b72f385ca8d1d8cee)	2020-08-12 14:03:08 -07:00
William Zhang	357ac2e2f5	btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0 EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric versions. While FI_DELIVERY_COMPLETE would be advertised by the provider, completions would return too early by not accounting for bounce buffers on the receive side. This would cause the BTL to receive early completions that lead to correctness issues. This is not an issue in the mtl/ofi as it does not require FI_DELIVERY_COMPLETE. Signed-off-by: William Zhang <wilzhang@amazon.com> (cherry picked from commit a7dcfd98742d7f44d96d74a60a38529894d4099c)	2020-08-12 14:02:51 -07:00
Jeff Squyres	e7b5e2550b	Merge pull request #7997 from bwbarrett/backports/v4.1.x-7957 Use the unaligned SSE memory access primitive.	2020-08-12 10:12:44 -04:00
George Bosilca	c20626e60d	Check unaligned ops for correctness. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit c4e88a43a3a1f6a13f8eaa8b90d2a0a8caba5adf) Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2020-08-12 02:32:10 +00:00
George Bosilca	eb9ced786a	Use the unaligned SSE memory access primitive. Alter the test to validate misaligned data. Fixes #7954. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit b6d71aa893db51c393df9c993fde597eff5457e8) Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2020-08-12 02:32:01 +00:00
Brian Barrett	22163d37b5	Merge pull request #7991 from wckzhang/v4.1.x v4.1.x: btl/ofi: Use common provider include/exclude list	2020-08-11 11:21:07 -07:00
Brian Barrett	ec04896d70	Merge pull request #7995 from mkurnosov/fix-parsing-locality-strng v4.1.x: opal/hwloc: fix a typo in parsing locality string: L0 changed to L1	2020-08-11 11:20:08 -07:00
Mikhail Kurnosov	11620038f9	Fix a typo in parsing locality string: L0 changed to L1 (`prte_hwloc_base_get_locality_string` never returns locality string with L0). Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com> (cherry picked from commit 4708458d6b12d1b7a8e11fa3c3f784c545780707)	2020-08-11 22:35:00 +07:00
William Zhang	abe3aaa4a7	btl/ofi: Use common provider include/exclude list The btl/ofi does not currently utilize the common ofi include/exclude list. Added verification code similar to the mtl/ofi that will check if the info object is in the include or exclude list. If it isn't in the include list or is in the exclude list, validate_info will return OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint when calling getinfo, instead filtering the provider during validate_info. This patch also moves the is_in_list MTL function into common code and adds additional debugging output to the BTL to match the MTL standard. Signed-off-by: William Zhang <wilzhang@amazon.com> (cherry picked from commit 9b8f463a768206a26e5b1dcea0612a403462b1d0)	2020-08-10 14:28:59 -07:00
Brian Barrett	704d019dc1	Merge pull request #7952 from wckzhang/v4.1.x coll/tuned: Change the default collective algorithm selection	2020-08-03 12:05:39 -07:00
William Zhang	5a13c5352f	coll/tuned: Change the default collective algorithm selection The default algorithm selections were out of date and not performing well. After gathering data from OMPI developers, new default algorithm decisions were selected for: allgather allgatherv allreduce alltoall alltoallv barrier bcast gather reduce reduce_scatter_block reduce_scatter scatter These results were gathered using the ompi-collectives-tuning package and then averaged amongst the results gathered from multiple OMPI developers on their clusters. You can access the graphs and averaged data here: https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3 Signed-off-by: William Zhang <wilzhang@amazon.com> (cherry picked from commit ce40cfbaa53406be71319041e13e893b0def7ad9)	2020-07-28 15:48:20 -07:00
Brian Barrett	e08e15cf93	Merge pull request #7959 from cniethammer/configure-leak-fix-v4.1.x v4.1: Fix memory leak in configure, which prevents leak sanitizer usage	2020-07-28 08:06:11 -07:00
Jeff Squyres	36bcc48dc1	Merge pull request #7902 from vspetrov/v4.1.x_hcoll_reduce_scatter V4.1.x hcoll reduce scatter	2020-07-27 15:50:09 -04:00
Jeff Squyres	6b4e2f02f8	Merge pull request #7956 from tomhers/topic/fix_ofi_btl_include_file v4.1.x: BTL/OFI: Fix missing include file.	2020-07-27 15:26:01 -04:00
Christoph Niethammer	e0bd64f843	Fix memory leak in configure, which prevents leak sanitizer usage If building Open MPI with sanitizers, e.g. $ configure CC=clang CFLAGS=-fsanitize=address .... configure test programs are also built with the sanitizers and will report errors, causing configure to fail. Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>	2020-07-22 14:49:10 +02:00
Brian Barrett	18418bfdf6	Merge pull request #7943 from dancejic/multi-4.1.x v4.1.x: common/ofi: added address format check to fix provider selection	2020-07-20 12:22:20 -07:00
Jeff Squyres	981b8858e7	Merge pull request #7951 from devreal/v4.1.x (v4.1.x) osc/rdma: fail query_btls if no endpoint for non-local peer is found	2020-07-20 15:12:59 -04:00
Joseph Schuchart	3d08d790e9	osc/rdma: fail query_btls if no endpoint for non-local peer is found Signed-off-by: Joseph Schuchart <schuchart@hlrs.de> (cherry picked from commit eebc451ec8313975998a63e25938fb4e0b4d6c44)	2020-07-17 08:38:37 +02:00
Brian Barrett	bd16024a0b	Merge pull request #7946 from gpaulsen/topic/v4.1.x/README_for_SLURM_binding Adding SLURM binding policy change to README	2020-07-16 12:50:15 -07:00
Geoffrey Paulsen	390772023a	Adding SLURM binding policy change to README Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>	2020-07-15 18:15:47 -05:00
tomhers	229724799e	BTL/OFI: Fix missing include file. The missing include file causes an error when using an external version of LibEvent. Signed-off-by: tomhers <tom.herschberg@gmail.com> (cherry picked from commit 88f9d2c90f1730f2e0b5bc4951893f60ff5a1332)	2020-07-15 17:14:55 -04:00
Nikola Dancejic	c018337d03	v4.1.x: common/ofi: added address format check to fix provider selection bugfix: provider selection would not differentiate between ipv4 and ipv6 addresses which would cause some nodes to be unable to communicate between each other. Adding a check for address format to provider selection to ensure that all nodes use the same address format. Signed-off-by: Nikola Dancejic <dancejic@amazon.com> (cherry picked from commit 7e463713014ae58f7e78d7a5b49e9e63d62e374e)	2020-07-15 00:32:37 +00:00
Jeff Squyres	57044aa0b0	Merge pull request #7934 from jjhursey/v4-jsm-no-bind v4.1.x: schizo/jsm: Disable binding when direct launched	2020-07-14 11:19:19 -04:00
Jeff Squyres	dfac9fae16	Merge pull request #7933 from artpol84/fix/4.1.x/schizo_slurm v4.1.x: schizo/slurm: Fix binding detection	2020-07-14 11:19:07 -04:00
Jeff Squyres	91c28f136b	Merge pull request #7935 from jsquyres/pr/v4.1.x/avx-for-strength v4.1.x: Add supports for MPI_OP using AVX512, AVX2 and MMX	2020-07-14 11:04:11 -04:00
dongzhong	b4e04bbd8a	Add supports for MPI_OP using AVX512, AVX2 and MMX Add logic to handle different architectural capabilities Detect the compiler flags necessary to build specialized versions of the MPI_OP. Once the different flavors (AVX512, AVX2, AVX) are built, detect at runtime which is the best match with the current processor capabilities. Add validation checks for loadu 256 and 512 bits. Add validation tests for MPI_Op. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: dongzhong <zhongdong0321@hotmail.com> Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit 14b3c706289cfc26e53b13efd3eb85641636e459)	2020-07-13 13:49:00 -07:00
Joshua Hursey	c71e1fa1db	v4.1.x: schizo/jsm: Disable binding when direct launched Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2020-07-13 15:31:21 -05:00
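
The shutdown ordering described in commit 48cca8ab83 (oob/tcp: fix a race condition on stop_thread pipe) can be illustrated with a minimal sketch. This is not Open MPI's C code: it is a Python analogue, and the class name `Listener`, its attributes, and the timeout value are hypothetical names chosen for illustration. It only shows the idea stated in the commit message, namely that the thread which writes to the stop pipe is also the one that closes it, and only after the listener has been joined.

```python
import os
import select
import threading


class Listener:
    """Hypothetical stand-in for a listener thread with a stop pipe."""

    def __init__(self, timeout=0.05):
        # Pipe used only to wake the listener out of select().
        self.stop_r, self.stop_w = os.pipe()
        self.active = True      # plays the role of listen_thread_action
        self.closed = False
        self.timeout = timeout  # maximum wait time for select()
        self.thread = threading.Thread(target=self._run)

    def start(self):
        self.thread.start()

    def _run(self):
        while self.active:
            # Wait for a wake-up byte or for the maximum wait time.
            readable, _, _ = select.select([self.stop_r], [], [], self.timeout)
            if self.stop_r in readable:
                os.read(self.stop_r, 1)
        # The listener deliberately does NOT close the pipe here.
        # Closing it here is what allowed a write to a closed pipe.

    def stop(self):
        self.active = False            # 1. ask the listener to exit
        os.write(self.stop_w, b"x")    # 2. wake it if it is inside select()
        self.thread.join()             # 3. wait until it has really exited
        os.close(self.stop_r)          # 4. only now release the pipe
        os.close(self.stop_w)
        self.closed = True
```

Calling `start()` and then `stop()` on a `Listener` is safe even when `select()` has just timed out: the listener may observe the flag and exit before the wake-up byte is written, but the pipe is still open at that point, so the write cannot fail.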


30000 Commits