openmpi

Author	SHA1	Message	Date
Aurelien Bouteiller	9f4365fef6	Merge pull request #6783 from abouteiller/export/macos-epipe Prevent EPIPE on OSX.	2020-01-28 11:18:46 -05:00
Austen Lauria	3c0fa537c2	Merge pull request #7113 from devreal/osc_rdma_cas RDMA osc: perform CAS in shared memory if possible	2020-01-28 10:59:17 -05:00
Brian Barrett	d768d82231	Merge pull request #7167 from wckzhang/reachable_netlinks Reachable documentation change	2020-01-27 15:39:14 -08:00
Brian Barrett	fc8c7a5869	Merge pull request #7134 from wckzhang/btl_tcp_interface_match btl tcp: Use reachability and graph solving for global interface matching	2020-01-27 15:38:49 -08:00
Austen Lauria	10f6a77640	Merge pull request #7315 from abouteiller/export/tcp_errors_v2 Handle error cases in TCP BTL (v2)	2020-01-27 17:03:07 -05:00
Aurelien Bouteiller	76021e35ee	Adding a description of the FIN message for future reference. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	395fcc4253	Disable inband PML error reporting during MPI Finalize as it interferes with the Finalize process. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	93846fd0ee	Remove the pending event when socket is TCP_FAILED Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	6b3be224d4	Adding a FIN message to differentiate normal TCP closing from failures Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	b7be64482a	Revert "Revert "Handle error cases in TCP BTL"" This reverts commit `5162011428`. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:31:06 -05:00
Austen Lauria	969eb0286c	Merge pull request #6970 from ggouaillardet/topic/cuda_no_orte coll/cuda: remove unnecessary references to ORTE	2020-01-27 13:12:32 -05:00
Austen Lauria	4f6978466d	Merge pull request #7284 from bosilca/fix/monitoring_registration Minor cleanup in the monitoring PML.	2020-01-27 13:01:30 -05:00
Josh Hursey	01a671349b	Merge pull request #7337 from jjhursey/no-ssh-core plm/rsh: Fix segv on missing agent.	2020-01-27 10:32:41 -06:00
George Bosilca	ecbd842ca8	Trap wrong parameters to MPI_Init_thread. Instead of triggering the fault early in the initialization process, do a serialized initialization and report the error once all the supporting infrastructure is up and running. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2020-01-26 23:59:54 -05:00
Joshua Hursey	62d0058738	plm/rsh: Fix segv on missing agent. * Additionally, fixes the `NULL` option to `OMPI_MCA_plm_rsh_agent`, which would also lead to a segv. Now it operates as intended by disqualifying the `rsh` component and falling back onto the `isolated` component. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2020-01-24 14:02:53 -05:00
Howard Pritchard	e9a54e8e0e	Merge pull request #7268 from hppritcha/topic/fix_6539_master make mpifort obey disable-wrapper-runpath	2020-01-23 16:37:37 -07:00
Artem Polyakov	548ed56bef	Merge pull request #7332 from hoopoepg/topic/fixed-coverity-issues SPML/UCX: fixed coverity issues	2020-01-23 11:54:00 -08:00
Jeff Squyres	f16d2903aa	Merge pull request #7328 from DDNStorage/ime_fs_missing_header fs/ime: fix compilation errors due to missing header inclusion	2020-01-23 14:13:59 -05:00
Sergey Oblomov	8543860689	SPML/UCX: fixed coverity issues - fixed sizeof(char***) by variable datatype - fixed resource leak in proc_add Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2020-01-23 14:02:08 +02:00
Artem Polyakov	e04575e8b6	Merge pull request #7274 from janjust/master-oshmem-perf-multi-worker oshmem/ucx: multi-threaded spml ucx performance improvements	2020-01-22 14:24:54 -08:00
Tomislav Janjusic	3d6bf9fd8e	oshmem/ucx: improves spml ucx performance for multi-threaded applications. Improves multi-threaded performance by adding the option to create multiple ucx workers in threaded applications. Co-authored with: Artem Y. Polyakov <artemp@mellanox.com>, Manjunath Gorentla Venkata <manjunath@mellanox.com> Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>	2020-01-22 21:41:09 +02:00
Artem Polyakov	611a2bdf68	Merge pull request #7273 from janjust/master-oshmem-perf-progress oshmem/ucx: Fix progress in iput/iget: periodically poke progress to prevent hardware stalls when using DCT transport.	2020-01-22 11:39:57 -08:00
Sylvain Didelot	01ae0c22b8	fs/ime: fix compilation errors due to missing header inclusion OpenMPI doesn't compile anymore with IME because the header file "ompi/mca/fs/base/base.h" needs to be included in every file where mca_fs_base_get_mpi_err() is used. Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>	2020-01-22 17:56:55 +01:00
Tomislav Janjusic	1b58e3d073	oshmem/ucx: Improves performance for non-blocking put/get operations. Improves the performance when excess non-blocking operations are posted by periodically calling progress on ucx workers. Co-authored with: Artem Y. Polyakov <artemp@mellanox.com>, Manjunath Gorentla Venkata <manjunath@mellanox.com> Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>	2020-01-22 00:59:26 +02:00
Jeff Squyres	d14d8ad30d	Merge pull request #7326 from jsquyres/pr/make-opal-output-safe-before-opal-init opal: ensure opal_gethostname() always returns a value	2020-01-21 16:14:14 -05:00
Jeff Squyres	f84a692af0	Merge pull request #7324 from jsquyres/pr/advance-hwloc-submodule hwloc2: advance hwloc git submodule	2020-01-21 14:03:08 -05:00
William Zhang	e958f3cf22	btl tcp: Use reachability and graph solving for global interface matching Previously we used a fairly simple algorithm in mca_btl_tcp_proc_insert() to pair local and remote modules. This was a point in time solution rather than a global optimization problem (where global means all modules between two peers). The selection logic would often fail due to pairing interfaces that are not routable for traffic. The complexity of the selection logic was Θ(n^n), which was expensive. Due to poor scalability, this logic was only used when the number of interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More details can be found in this ticket: https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates in the ticket do not match what I calculated from the function) As a fallback, when interfaces surpassed this threshold, a brute force O(n^2) double for loop was used to match interfaces. This commit solves two problems. First, the point-in-time solution is turned into a global optimization solution. Second, the reachability framework was used to create a more realistic reachability map. We switched from using IP/netmask to using the reachability framework, which supports route lookup. This will help many corner cases as well as utilize any future development of the reachability framework. The solution implemented in this commit has a complexity mainly derived from the bipartite assignment solver. If the local and remote peer both have the same number of interfaces (n), the complexity of matching will be O(n^5). With the decrease in complexity to O(n^5), I calculated and tested that initialization costs would be 5000 microseconds with 30 interfaces per node (Likely close to the maximum realistic number of interfaces we will encounter). For additional datapoints, data up to 300 (a very unrealistic number) of interfaces was simulated. Up until 150 interfaces, the matching costs will be less than 1 second, climbing to 10 seconds with 300 interfaces. Reflecting on these results, I removed the suboptimal O(n^2) fallback logic, as it no longer seems necessary. Data was gathered comparing the scaling of initialization costs with ranks. For low number of interfaces, the impact of initialization is negligible. At an interface count of 7-8, the new code has slightly faster initialization costs. At an interface count of 15, the new code has slower initialization costs. However, all initialization costs scale linearly with the number of ranks. In order to use the reachable function, we populate local and remote lists of interfaces. We then convert the interface matching problem into a graph problem. We create a bipartite graph with the local and remote interfaces as vertices and use negative reachability weights as costs. Using the bipartite assignment solver, we generate the matches for the graph. To ensure that both the local and remote process have the same output, we ensure we mirror their respective inputs for the graphs. Finally, we store the endpoint matches that we created earlier in a hash table. This is stored with the btl_index as the key and a struct mca_btl_tcp_addr_t* as the value. This is then retrieved during insertion time to set the endpoint address. Signed-off-by: William Zhang <wilzhang@amazon.com>	2020-01-21 18:24:08 +00:00
Jeff Squyres	8c819e2a85	opal: ensure opal_gethostname() always returns a value We initially thought it was a safe bet that opal_gethostname() would never be called before opal_init(). However, it turns out that there are some cases -- e.g., developer debugging -- where it is useful to call opal_output() (which calls opal_gethostname()) before opal_init(). Hence, we need to guarantee that opal_gethostname() always returns a valid value. If opal_gethostname() finds NULL in opal_process_info.nodename, simply call the internal function to initialize opal_process_info.nodename. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-01-21 09:56:52 -08:00
Jeff Squyres	ed753afbc0	hwloc2: advance hwloc git submodule Advance to hwloc-2.1.0rc2-33-g38433c0f, which includes a .gitignore update that we want here in Open MPI. Be warned; this is actually 33 commits beyond the hwloc v2.1.0 tag. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-01-21 09:42:39 -08:00
Ralph Castain	1d0b87e170	Merge pull request #7317 from rhc54/topic/agen Check the return from "chdir" to avoid infinite loop	2020-01-18 15:04:01 -07:00
Ralph Castain	7de5f76b18	Check the return from "chdir" to avoid infinite loop When autogen attempts to change to a new directory while processing a subdirectory, it can get into an infinite loop if that directory doesn't exist as it will remain in the top-level directory, see itself there (as "autogen.pl"), and re-execute itself. Check the return code on "chdir" and error out if it fails. Signed-off-by: Ralph Castain <rhc@pmix.org>	2020-01-18 11:35:08 -08:00
Austen Lauria	df9745a251	Merge pull request #7299 from awlauria/fix_warnings Fix some compiler warnings.	2020-01-17 11:33:41 -05:00
Jeff Squyres	69bd54c83b	Merge pull request #7310 from itemko/artemry/azure_ci_updates_for_review Fixed several Mellanox Open MPI CI issues.	2020-01-16 09:25:43 -05:00
Artem Ryabov	fd7e94022b	Fixed several Mellanox Open MPI CI issues. The following issues have been fixed: - Corrected the link to CI status badge in README after some renaming. - Updated agent capabilities. - Switched to jenkins_scripts from master branch. - Corrected support e-mail. Signed-off-by: Artem Ryabov <artemry@mellanox.com>	2020-01-16 13:43:28 +03:00
Nathan Hjelm	037b0bd9ee	Merge pull request #7304 from hjelmn/btl_vader_fix_max_address_on_aarch64 btl/vader: modify how the max attachment address is determined	2020-01-15 17:02:33 -07:00
Nathan Hjelm	728d51f9f3	btl/vader: modify how the max attachment address is determined This PR removes the constant defining the max attachment address and replaces it with the largest address that shows up in /proc/self/maps. This should address issues found on AARCH64 where the max address may differ based on the configuration. Since the calculated max address may differ between processes the max address is sent as part of the modex and stored in the endpoint data. Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2020-01-14 15:15:36 -08:00
Nathan Hjelm	61f96b3d6d	Merge pull request #7283 from hjelmn/fix_issues_in_both_vader_and_opal_interval_tree_t_that_were_causing_issue_6524 Fix issues in both vader and opal interval tree t that were causing issue 6524	2020-01-14 14:54:51 -07:00
Jeff Squyres	b3fe523150	Merge pull request #7124 from devreal/fix-opal-align-min Fix OPAL_ALIGN_MIN to work on 32-bit systems	2020-01-14 15:18:55 -05:00
Jeff Squyres	02f02e7c1e	Merge pull request #7300 from itemko/artemry/mellanox-ci-with-azure-pipelines-review Reworked Mellanox Open MPI CI with Azure Pipelines	2020-01-14 14:11:34 -05:00
Artem Ryabov	98bfe87ee5	Reworked Mellanox OpenMPI CI with Azure Pipelines. Signed-off-by: Artem Ryabov <artemry@mellanox.com>	2020-01-14 20:20:51 +03:00
Jeff Squyres	25931ea8bf	Merge pull request #7200 from cpshereda/master-opal_gethostname-change Fix unsafe use of gethostname()	2020-01-13 16:07:43 -05:00
Jeff Squyres	887400c878	Merge pull request #7174 from jsquyres/pr/ofi-mtl-fi-version-bump mtl/ofi: increase the FI_VERSION requested to 1.5 and make sure to check for OFI_LOCAL_COMM	2020-01-13 13:22:22 -05:00
Charles Shereda	cbc6feaab2	Created opal_gethostname() as safer gethostname substitute. The opal_gethostname() function provides a more robust mechanism to retrieve the hostname than gethostname(), which can return results that are not null-terminated, and which can vary in its behavior from system to system. opal_gethostname() just returns the value in opal_process_info.nodename; this is populated in opal_init_gethostname() inside opal_init.c. -Changed all gethostname calls in opal subtree to opal_gethostname -Changed all gethostname calls in orte subtree to opal_gethostname -Changed all gethostname calls in ompi subdir to opal_gethostname -Changed all gethostname calls in oshmem subdir to opal_gethostname -Changed opal_if.c in test subdir to use opal_gethostname -Changed opal_init.c to include opal_init_gethostname. This function returns an int and directly sets opal_process_info.nodename per jsquyres' modifications. Relates to open-mpi#6801 Signed-off-by: Charles Shereda <cpshereda@lanl.gov>	2020-01-13 08:52:17 -08:00
Robert Wespetal	49128a7adb	mtl/ofi: Add workaround for EFA local/remote capabilities bug Some versions of Libfabric contain a bug in EFA where FI_REMOTE_COMM and FI_LOCAL_COMM are not advertised. In order to work around this, we need to call fi_getinfo() without those capability bits to see if EFA is available first. Also move around some of the provider include/exclude list logic so we can skip this workaround if applicable. Signed-off-by: Robert Wespetal <wesper@amazon.com>	2020-01-13 08:26:01 -08:00
Jeff Squyres	21bc9042e1	mtl/ofi: check for FI_LOCAL_COMM+FI_REMOTE_COMM Make sure to get an RDM provider that can provide both local and remote communication. We need this check because some providers could be selected via RXD or RXM, but can't provide local communication, for example. Add OPAL_CHECK_OFI_VERSION_GE() m4 macro to check that the Libfabric we're building against is >= a target version. Use this check in two places: 1. MTL/OFI: Make sure it is >= v1.5, because the FI_LOCAL_COMM / FI_REMOTE_COMM constants were introduced in Libfabric API v1.5. 2. BTL/usnic: It already had similar configury to check for Libfabric >= v1.1, but the usnic component was checking for >= v1.3. So update the btl/usnic configury to use the new macro and check for >= v1.3. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-01-13 08:19:53 -08:00
George Bosilca	05093f9cb1	Minor cleanup in the monitoring PML. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2020-01-13 09:24:00 -05:00
Austen Lauria	b65ec27307	Fix some compiler warnings. Silence unused variables, incompatible pointer types, un-initialized variables, and signed/unsigned comparisons. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>	2020-01-10 13:10:53 -05:00
Jeff Squyres	dd2d7d2866	Merge pull request #7189 from michaellass/fix-dims_create dims_create: fix calculation of factors for odd squares	2020-01-10 09:47:09 -05:00
Jeff Squyres	c471f1cb0b	Merge pull request #7286 from jsquyres/pr/add-git-submodule-status-to-github-issue-template Github issue template: updates	2020-01-08 12:53:57 -05:00
Jeff Squyres	a26a6c8e42	Github issue template: updates 1. Add more recent release version numbers in the examples 2. Add request for output from `git submodule status` when building/installing from a git clone Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-01-08 11:47:35 -05:00
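
Commit e958f3cf22 above describes turning interface matching into a bipartite assignment problem, with negative reachability weights used as costs. The following is only a minimal sketch of that idea, not Open MPI's C implementation: the function name `match_interfaces`, the plain list-of-lists weight matrix, and the use of a textbook Hungarian (Kuhn-Munkres) solver are all assumptions made for illustration. It also assumes there are no more local interfaces (rows) than remote interfaces (columns).

```python
def match_interfaces(reachability):
    """Pair each local interface (row) with a distinct remote interface (column).

    Hypothetical helper, not an Open MPI API. reachability[i][j] is an
    illustrative weight (higher means better reachable). Returns a list where
    entry i is the remote index chosen for local interface i, maximizing the
    total weight.
    """
    if not reachability:
        return []
    n, m = len(reachability), len(reachability[0])
    if n > m:
        raise ValueError("sketch assumes rows <= columns")

    # Negative weights as costs: minimizing cost maximizes reachability.
    cost = [[-w for w in row] for row in reachability]

    INF = float("inf")
    u = [0] * (n + 1)    # row potentials (1-indexed)
    v = [0] * (m + 1)    # column potentials (1-indexed)
    p = [0] * (m + 1)    # p[j] = row currently matched to column j (0 = none)
    way = [0] * (m + 1)  # back-pointers along the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so at least one new zero-reduced-cost edge appears.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        # Flip the matching along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    match = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            match[p[j] - 1] = j - 1
    return match


if __name__ == "__main__":
    # Illustrative weights only: local 0 reaches remote 0 well, and so on.
    print(match_interfaces([[10, 0], [10, 5]]))
```

A greedy, point-in-time choice on the example matrix could give remote 0 to whichever local interface asks first; solving the assignment globally keeps the pairing with the best total weight. For both peers to compute the same pairing, each side would have to build its matrix from mirrored inputs, as the commit message notes.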

30514 commits