openmpi

Автор	SHA1	Сообщение	Дата
Gilles Gouaillardet	174e967dbc	Remove ORTE project Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build without reference to ORTE. Setup opal/pmix framework to be static. Remove support for all PMI-1 and PMI-2 libraries. Add support for "external" pmix component as well as internal v4 one. remove orte: misc fixes - UCX fixes - VPATH issue - oshmem fixes - remove useless definition - Add PRRTE submodule - Get autogen.pl to traverse PRRTE submodule - Remove stale orcm reference - Configure embedded PRRTE - Correctly pass the prefix to PRRTE - Correctly set the OMPI_WANT_PRRTE am_conditional - Move prrte configuration to the end of OMPI's configure.ac - Make mpirun a symlink to prun, when available - Fix makedist with --no-orte/--no-prrte option - Add a `--no-prrte` option which is the same as the legacy `--no-orte` option. - Remove embedded PMIx tarball. Replace it with new submodule pointing to OpenPMIx master repo's master branch - Some cleanup in PRRTE integration and add config summary entry - Correctly set the hostname - Fix locality - Fix singleton operations - Fix support for "tune" and "am" options Signed-off-by: Ralph Castain <rhc@pmix.org> Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2020-02-07 18:20:06 -08:00
Jeff Squyres	dc0d6a5e1b	Merge pull request #7366 from bgoglin/master Fix hwloc <v2.0 compilation issue	2020-02-05 13:33:01 -05:00
Brice Goglin	329d4451a6	opal/hwloc: remove some unused variables when building with hwloc < 1.7 Refs #7362 Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2020-02-04 22:56:46 +01:00
Brice Goglin	907ad854b4	hwloc/base: fix opal proc locality wrt to NUMA nodes on hwloc 1.11 Build was broken by mistake in commit d40662edc41a5a4d09ae690b640cfdeeb24e15a1 Fixes #7362 Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2020-02-04 22:56:46 +01:00
Austen Lauria	395e2c9d8f	Merge pull request #7364 from awlauria/fix_compile Protect use of _Static_assert().	2020-02-04 15:24:07 -05:00
Austen Lauria	824dbcbcf3	Protect use of _Static_assert(). Signed-off-by: Austen Lauria <awlauria@us.ibm.com>	2020-02-04 13:46:58 -05:00
Jeff Squyres	827e9f66af	Merge pull request #7265 from jsquyres/pr/fortran-you-win-again Fortran fixes	2020-02-04 07:03:02 -05:00
Yossi Itigin	1047e28a28	Merge pull request #7353 from dmitrygx/topic/spml/ucx SPML/UCX: Fix compilation warnings with GCC	2020-02-04 10:38:37 +02:00
Ralph Castain	fd2de25609	Merge pull request #7358 from rhc54/topic/opt Always consider retrieval of HOSTNAME to be optional	2020-02-03 18:59:36 -08:00
Ralph Castain	2ee1aaf08d	Always consider retrieval of HOSTNAME to be optional Signed-off-by: Ralph Castain <rhc@pmix.org>	2020-02-03 16:03:05 -08:00
Jeff Squyres	ab398f4b9a	fortran: ensure not to use [AM_]CPPFLAGS Automake's Fortran compilation rules inexplicably use CPPFLAGS and AM_CPPFLAGS. Unfortunately, this can cause problems in some cases (e.g., picking up already-installed mpi.mod in a system-default include search path). So in relevant module-using Fortran compilation Makefile.am's, zero out CPPFLAGS and AM_CPPFLAGS. This has a side-effect of requiring that we compile the one .c file in the F08 library in a new, separate subdirectory (with its own Makefile.am that does _not_ have CPPFLAGS/AM_CPPFLAGS zeroed out). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2020-02-03 14:45:32 -08:00
Jeff Squyres	f4a47a5a8e	fortran: remove useless CPPFLAGS assignment These -D's are for C compilation, not Fortran compilation. Remove this useless statement. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-02-03 14:45:25 -08:00
Howard Pritchard	d2b68e6ecd	Merge pull request #7201 from bgoglin/master hwloc/base: fix opal proc locality wrt to NUMA nodes on hwloc 2.0	2020-02-03 11:23:42 -07:00
Howard Pritchard	9916b9124e	Merge pull request #7345 from hppritcha/topic/follow_up_pr7268 PR 7268 follow-up	2020-02-03 10:42:47 -07:00
Dmitry Gladkov	b6a658b24a	SPML/UCX: Fix compilation warnings with GCC Signed-off-by: Dmitry Gladkov <dmitrygla@mellanox.com>	2020-02-03 05:11:49 -08:00
Jeff Squyres	dfbaf23241	Merge pull request #7210 from MaskRay/mpi_fortran_argv_null_ Fix Fortran st_size fields of mpi_fortran_argv_null_, mpi_fortran_weights_empty_, mpi_fortran_unweighted_, mpi_fortran_errcodes_ignore_, and mpi_fortran_argvs_null_	2020-02-02 16:56:44 -05:00
Fangrui Song	5609268e90	Make C and Fortran types for MPI sentinels agree in size Fix the C types for the following: * MPI_UNWEIGHTED * MPI_WEIGHTS_EMPTY * MPI_ARGV_NULL * MPI_ARGVS_NULL * MPI_ERRCODES_IGNORE There is lengthy discussion on https://github.com/open-mpi/ompi/pull/7210 describing the issue; the gist of it is that the C and Fortran types for several MPI global sentenial values should agree (specifically: their sizes must(*) agree). We erroneously had several of these array-like sentinel values be "array-like" values in C. E.g., MPI_ERRCODES_IGNORE was an (int ) in C while its corresponding Fortran type was "integer, dimension(1)". On a 64 bit platform, this resulted in C expecting the symbol size to be sizeof(int)==8 while Fortran expected the symbol size to be sizeof(INTEGER, DIMENSION(1))==4. That is incorrect -- the corresponding C type needed to be (int). Then both C and Fortran expect the size of the symbol to be the same. () NOTE: This code has been wrong for years. This mismatch of types typically worked because, due to Fortran's call-by-reference semantics, Open MPI was comparing the addresses* of these instances, not their types (or sizes) -- so even if C expected the size of the symbol to be X and Fortran expected the size of the symbol to be Y (where X!=Y), all we really checked at run time was that the addresses of the symbols were the same. But it caused linker warning messages, and even caused errors in some cases. Specifically: due to a GNU ld bug (https://sourceware.org/bugzilla/show_bug.cgi?id=25236), the 5 common symbols are incorrectly versioned VER_NDX_LOCAL because their definitions in Fortran sources have smaller st_size than those in libmpi.so. This makes the Fortran library not linkable with lld in distributions that ship openmpi built with -Wl,--version-script (https://bugs.llvm.org/show_bug.cgi?id=43748): % mpifort -fuse-ld=lld /dev/null ld.lld: error: corrupt input file: version definition index 0 for symbol mpi_fortran_argv_null_ is out of bounds >>> defined in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_usempif08.so ... If we fix the C and Fortran symbols to actually be the same size, the problem goes away and the GNU ld bug does not come into play. This commit also fixes a minor issue that MPI_UNWEIGHTED and MPI_WEIGHTS_EMPTY were not declared as Fortran arrays (not fully fixed by commit 107c0073dd11fb90d18122c521686f692a32cdd8). Fixes open-mpi/ompi#7209 Signed-off-by: Fangrui Song <i@maskray.me> Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-02-02 04:20:39 -08:00
Jeff Squyres	48c04cd637	Merge pull request #7350 from artemry-mlnx/artemry/enable_mlnx_ci_for_release_branches Enabled Mellanox CI for release branches (changes for master branch).	2020-01-31 15:58:52 -05:00
Artem Ryabov	67b70ae9bd	Enabled Mellanox CI for release branches. Signed-off-by: Artem Ryabov <artemry@mellanox.com>	2020-01-31 21:57:03 +03:00
Howard Pritchard	5354b9b41c	PR 7268 follow-up Forgot to include a fix for the fortran test used to check if new dtags is supported. Related to #7268 This patch is already included on v4.0.x branch. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2020-01-28 12:34:49 -07:00
Austen Lauria	1275766037	Merge pull request #7207 from devreal/remove_shmem_seg_hdr Remove unused opal_shmem_seg_hdr_t to retain alignment	2020-01-28 13:57:55 -05:00
Aurelien Bouteiller	9f4365fef6	Merge pull request #6783 from abouteiller/export/macos-epipe Prevent EPIPE on OSX.	2020-01-28 11:18:46 -05:00
Austen Lauria	3c0fa537c2	Merge pull request #7113 from devreal/osc_rdma_cas RDMA osc: perform CAS in shared memory if possible	2020-01-28 10:59:17 -05:00
Brian Barrett	d768d82231	Merge pull request #7167 from wckzhang/reachable_netlinks Reachable documentation change	2020-01-27 15:39:14 -08:00
Brian Barrett	fc8c7a5869	Merge pull request #7134 from wckzhang/btl_tcp_interface_match btl tcp: Use reachability and graph solving for global interface matching	2020-01-27 15:38:49 -08:00
Austen Lauria	10f6a77640	Merge pull request #7315 from abouteiller/export/tcp_errors_v2 Handle error cases in TCP BTL (v2)	2020-01-27 17:03:07 -05:00
Aurelien Bouteiller	76021e35ee	Adding a description of the FIN message for future reference. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	395fcc4253	Disable inband PML error reporting during MPI Finalize as it interferes with the Finalize process. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	93846fd0ee	Remove the pending event when socket is TCP_FAILED Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	6b3be224d4	Adding a FIN message to differentiate normal TCP closing from failures Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	b7be64482a	Revert "Revert "Handle error cases in TCP BTL"" This reverts commit 51620114281720c9e375b95cadc45272b3135837. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:31:06 -05:00
Austen Lauria	969eb0286c	Merge pull request #6970 from ggouaillardet/topic/cuda_no_orte coll/cuda: remove unnecessary references to ORTE	2020-01-27 13:12:32 -05:00
Austen Lauria	4f6978466d	Merge pull request #7284 from bosilca/fix/monitoring_registration Minor cleanup in the monitoring PML.	2020-01-27 13:01:30 -05:00
Josh Hursey	01a671349b	Merge pull request #7337 from jjhursey/no-ssh-core plm/rsh: Fix segv on missing agent.	2020-01-27 10:32:41 -06:00
Joshua Hursey	62d0058738	plm/rsh: Fix segv on missing agent. * Additionally, fixes the `NULL` option to `OMPI_MCA_plm_rsh_agent` would would also lead to a segv. Now it operates as intended by disqualifying the `rsh` component and falling back onto the `isolated` component. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2020-01-24 14:02:53 -05:00
Howard Pritchard	e9a54e8e0e	Merge pull request #7268 from hppritcha/topic/fix_6539_master make mpifort obey disable-wrapper-runpath	2020-01-23 16:37:37 -07:00
Artem Polyakov	548ed56bef	Merge pull request #7332 from hoopoepg/topic/fixed-coverity-issues SPML/UCX: fixed coverity issues	2020-01-23 11:54:00 -08:00
Jeff Squyres	f16d2903aa	Merge pull request #7328 from DDNStorage/ime_fs_missing_header fs/ime: fix compilation errors due to missing header inclusion	2020-01-23 14:13:59 -05:00
Sergey Oblomov	8543860689	SPML/UCX: fixed coverity issues - fixed sizeof(char***) by variable datatype - fixed resorce leak in proc_add Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2020-01-23 14:02:08 +02:00
Artem Polyakov	e04575e8b6	Merge pull request #7274 from janjust/master-oshmem-perf-multi-worker oshmem/ucx: multi-threaded spml ucx performance improvements	2020-01-22 14:24:54 -08:00
Tomislav Janjusic	3d6bf9fd8e	oshmem/ucx: improves spml ucx performance for multi-threaded applications. Improves multi-threaded performance by adding the option to create multiple ucx workers in threaded applications. Co-authored with: Artem Y. Polyakov <artemp@mellanox.com>, Manjunath Gorentla Venkata <manjunath@mellanox.com> Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>	2020-01-22 21:41:09 +02:00
Artem Polyakov	611a2bdf68	Merge pull request #7273 from janjust/master-oshmem-perf-progress oshmem/ucx: Fix progress in iput/iget: periodically poke progress to prevent hardware stalls when using DCT transport.	2020-01-22 11:39:57 -08:00
Sylvain Didelot	01ae0c22b8	fs/ime: fix compilation errors due to missing header inclusion OpenMPI doesn't compile anymore with IME because the header file "ompi/mca/fs/base/base.h" needs to be include in every file where mca_fs_base_get_mpi_err() is used. Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>	2020-01-22 17:56:55 +01:00
Tomislav Janjusic	1b58e3d073	oshmem/ucx: Improves performance for non-blocking put/get operations. Improves the performance when excess non-blocking operations are posted by periodically calling progress on ucx workers. Co-authored with: Artem Y. Polyakov <artemp@mellanox.com>, Manjunath Gorentla Venkata <manjunath@mellanox.com> Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>	2020-01-22 00:59:26 +02:00
Jeff Squyres	d14d8ad30d	Merge pull request #7326 from jsquyres/pr/make-opal-output-safe-before-opal-init opal: ensure opal_gethostname() always returns a value	2020-01-21 16:14:14 -05:00
Jeff Squyres	f84a692af0	Merge pull request #7324 from jsquyres/pr/advance-hwloc-submodule hwloc2: advance hwloc git submodule	2020-01-21 14:03:08 -05:00
William Zhang	e958f3cf22	btl tcp: Use reachability and graph solving for global interface matching Previously we used a fairly simple algorithm in mca_btl_tcp_proc_insert() to pair local and remote modules. This was a point in time solution rather than a global optimization problem (where global means all modules between two peers). The selection logic would often fail due to pairing interfaces that are not routable for traffic. The complexity of the selection logic was Θ(n^n), which was expensive. Due to poor scalability, this logic was only used when the number of interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More details can be found in this ticket: https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates in the ticket do not match what I calculated from the function) As a fallback, when interfaces surpassed this threshold, a brute force O(n^2) double for loop was used to match interfaces. This commit solves two problems. First, the point-in-time solution is turned into a global optimization solution. Second, the reachability framework was used to create a more realistic reachability map. We switched from using IP/netmask to using the reachability framework, which supports route lookup. This will help many corner cases as well as utilize any future development of the reachability framework. The solution implemented in this commit has a complexity mainly derived from the bipartite assignment solver. If the local and remote peer both have the same number of interfaces (n), the complexity of matching will be O(n^5). With the decrease in complexity to O(n^5), I calculated and tested that initialization costs would be 5000 microseconds with 30 interfaces per node (Likely close to the maximum realistic number of interfaces we will encounter). For additional datapoints, data up to 300 (a very unrealistic number) of interfaces was simulated. Up until 150 interfaces, the matching costs will be less than 1 second, climbing to 10 seconds with 300 interfaces. Reflecting on these results, I removed the suboptimal O(n^2) fallback logic, as it no longer seems necessary. Data was gathered comparing the scaling of initialization costs with ranks. For low number of interfaces, the impact of initialization is negligible. At an interface count of 7-8, the new code has slightly faster initialization costs. At an interface count of 15, the new code has slower initialization costs. However, all initialization costs scale linearly with the number of ranks. In order to use the reachable function, we populate local and remote lists of interfaces. We then convert the interface matching problem into a graph problem. We create a bipartite graph with the local and remote interfaces as vertices and use negative reachability weights as costs. Using the bipartite assignment solver, we generate the matches for the graph. To ensure that both the local and remote process have the same output, we ensure we mirror their respective inputs for the graphs. Finally, we store the endpoint matches that we created earlier in a hash table. This is stored with the btl_index as the key and a struct mca_btl_tcp_addr_t* as the value. This is then retrieved during insertion time to set the endpoint address. Signed-off-by: William Zhang <wilzhang@amazon.com>	2020-01-21 18:24:08 +00:00
Jeff Squyres	8c819e2a85	opal: ensure opal_gethostname() always returns a value We initially thought it was a safe bet that opal_gethostname() would never be called before opal_init(). However, it turns out that there are some cases -- e.g., developer debugging -- where it is useful to call opal_output() (which calls opal_gethostname()) before opal_init(). Hence, we need to guarantee that opal_gethostname() always returns a valid value. If opal_gethostname() finds NULL in opal_process_info.nodename, simply call the internal function to initialize opal_process_info.nodename. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-01-21 09:56:52 -08:00
Jeff Squyres	ed753afbc0	hwloc2: advance hwloc git submodule Advance to hwloc-2.1.0rc2-33-g38433c0f, which includes a .gitignore update that we want here in Open MPI. Be warned; this is actually 33 commits beyond the hwloc v2.1.0 tag. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-01-21 09:42:39 -08:00
Ralph Castain	1d0b87e170	Merge pull request #7317 from rhc54/topic/agen Check the return from "chdir" to avoid infinite loop	2020-01-18 15:04:01 -07:00

1 2 3 4 5 ...

30330 Коммитов