openmpi

Author	SHA1	Message	Date
Jeff Squyres	6f5a453492	event: trivial comment change Switch from #-style to dnl-style. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit open-mpi/ompi@83e4a45a9f)	2018-07-24 09:52:56 +09:00
Gilles Gouaillardet	aa7a4d0f6f	hwloc: prefer external hwloc component Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit open-mpi/ompi@ce2c9fffd4)	2018-07-24 09:52:53 +09:00
Nathan Hjelm	b6bd3d33f1	btl/uct: fix compile warnings/errors Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `47ed8e8830`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-23 14:05:17 -06:00
Ralph Castain	508c3f391f	Default to internal PMIx if newer than external Per https://github.com/open-mpi/ompi/issues/5031, if the user didn't specify a particular PMIx installation, then default back to the internal version if it is newer than the discovered external one. PMIx doesn't yet provide a full signature so we have to just get as close as possible for now. Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit `1e6aaf7f22`)	2018-07-19 11:59:17 -07:00
Howard Pritchard	4447738098	Merge pull request #5414 from hppritcha/topic/iwarp_only_by_default btl/openib: only look for iwarp/roce by default	2018-07-17 20:08:00 -06:00
Howard Pritchard	bc8134dae1	Merge pull request #5448 from thananon/ofi_context btl/ofi: Added FI_CONTEXT as requirement.	2018-07-17 19:51:13 -06:00
Howard Pritchard	6818272392	btl/openib: only look for iwarp/roce by default Due to decreasing support by vendors/other orgs for the OpenIB BTL, only look for iWarp/RoCE devices by default. Allow IB HCAs with ports configured for ethernet. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-07-17 19:11:37 -06:00
Thananon Patinyasakdikul	033c364ee0	btl/ofi: Added FI_CONTEXT as requirement. OFI BTL uses context for completion but never asks for it in fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT to eliminate any potential error. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-07-17 12:18:43 -07:00
Sergey Oblomov	a4b8253fa2	MCA/COMMON/UCX: fixed initialization of malloc hooks Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-17 20:09:50 +03:00
Sergey Oblomov	1c7ae22dfb	MCA/COMMON/UCX: shift opal memhooks into common UCX Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-17 13:46:38 +03:00
Ralph Castain	4a596d35f7	Remove the PMIx ext4x component Update configury to redirect anything at or above v3 to the ext3x component Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-13 19:51:50 -07:00
Nathan Hjelm	9d3a79925b	btl/vader: fix bugs in rma emulation This commit fixes two bugs in the RMA/atomic emulation code: 1) Fix a fragment leak when using AMO emulation. 2) Always initialize the single-copy emulation code. This is required to use the AMO support. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-12 15:50:50 -06:00
Ralph Castain	fdca304268	Default to external PMIx installation Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-10 16:12:52 -07:00
Nathan Hjelm	8b090103e2	opal/fifo: fix 128-bit atomic fifo on Power9 This commit updates the atomic fifo code to fix a consistency issue observed on Power9 systems when builtin atomics are used. The cause was two things: 1) a missing write memory barrier in fifo push, and 2) a read ordering issue when reading the fifo head non-atomically. This commit fixes both issues and appears to correct the inconsistency. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-10 15:37:11 -06:00
KAWASHIMA Takahiro	3e179ba95f	hwloc/external: Suppress missing-include-dirs warning If OMPI is configured with `--with-hwloc=external` or `--with-hwloc=DIR` and gfortran is used, I see a lot of warnings when compiling files under the `ompi/mpi/fortran` directory. ``` f951: Warning: Nonexistent include directory 'BUILD_DIR/opal/mca/hwloc/external/hwloc/include' [-Wmissing-include-dirs] ``` There is no such `include` directory in the source tree and `configure`-created tree. I think these lines in the `configure.m4` file are wrongly copied from that for the embedded `hwlocXXX` component in the past. The `-Wmissing-include-dirs` option is enabled in gfortran by default but it is not enabled by default (or even with `-Wall`) in gcc and g++. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2018-07-09 10:55:33 +09:00
Yossi Itigin	e77e31b50b	Merge pull request #5378 from hoopoepg/topic/unify-ucx-logging MCA/COMMON/UCX: unified logging across all UCX modules	2018-07-08 12:45:26 +03:00
Ralph Castain	17c4cf0db8	Install PMIx v3.0.0 release Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-06 06:38:02 -07:00
Sergey Oblomov	bef47b792c	MCA/COMMON/UCX: unified logging across all UCX modules - added common logging infrastructure for all UCX modules - all UCX modules are switched to new infra Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-05 16:25:39 +03:00
Sergey Oblomov	8080283b3d	MCA/COMMON/UCX: changed return type for wait_request - for now wait_request returns OMPI status - updated callers Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-04 23:29:38 +03:00
Sergey Oblomov	f574c14e3a	ATOMICS/UCX: redefine atomic module API - now it accepts integer values directly instead of pointers Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-04 14:41:45 +03:00
Yossi Itigin	4962651567	Merge pull request #5366 from hoopoepg/topic/mca-common-ucx-unify-2 MCA/COMMON/UCX: minor unification of del_procs calls	2018-07-04 14:38:37 +03:00
Nathan Hjelm	bd5cd62df9	btl/ugni: fix up some warnings Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-03 16:30:44 -06:00
Nathan Hjelm	d8916a4672	btl/ugni: fix race condition in completing frags The descriptor flags field in a fragment was being read after the fragment may have been freed. This commit reads the flags before calling the user callback. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-03 10:48:54 -06:00
Nathan Hjelm	87d41da62b	btl/vader: add support for atomics and emulated rdma This commit adds support for atomic operations as well as rdma for systems without rdma support. This support is implemented using an internal send tag. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-02 13:57:11 -06:00
Nathan T. Weeks	08f9ae97ee	btl/ugni: update BTL_VERBOSE argument list Signed-off-by: Nathan T. Weeks <weeks@iastate.edu>	2018-07-02 09:23:30 -06:00
Sergey Oblomov	c2bd6af9f2	MCA/COMMON/UCX: minor unification of del_procs calls - some common functionality of del_procs calls is moved into mca_common module - blocking ucp_put call is replaced by non-blocking routine Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-02 15:10:53 +03:00
Jeff Squyres	7b0dd03e92	tcp/btl: fix a cast The current cast is functional, but isn't really the way it should be done. This commit makes the cast the way it should be done. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-29 07:25:46 -07:00
Jeff Squyres	57bc657e7f	btl/tcp: fix hash map usage Fix two facepalms: 1. The "uint32" in the hash map functions refer to the key size, not the value size. The values are always 64 bits. 2. Pass the straight value to the "set" functions -- not the pointer to the value. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-28 15:29:41 -07:00
Yossi Itigin	3a7271ef4e	Merge pull request #5344 from hoopoepg/topic/mca-common-ucx-fixed-build MCA/COMMON/UCX: fixed build scripts	2018-06-28 15:14:04 +03:00
Sergey Oblomov	624d59604b	MCA/COMMON/UCX: minor optimization of build scripts Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-28 12:58:07 +03:00
Thananon Patinyasakdikul	304cf97ab5	Merge pull request #5334 from thananon/ofi_progress_fix btl/ofi: progress now happens after a threshold.	2018-06-27 12:51:33 -07:00
Sergey Oblomov	de8568c822	MCA/COMMON/UCX: enabled fallback into older UCX API Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-27 19:59:40 +03:00
Sergey Oblomov	1223b05811	MCA/COMMON/UCX: fixed build scripts - updated evaluation of UCX lib - used call from UCX v1.3 - updated makefile compilation flags Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-27 11:10:25 +03:00
Thananon Patinyasakdikul	be76896f7c	btl/ofi: progress now happens after a threshold. This commit changes the way btl/ofi calls progress. Before, we forced progression with every rdma/atomic call. This gave a performance boost in some cases and a slowdown in others. Now we only force progression after some number of rdma calls, which results in better performance overall. Also added new MCA parameter 'mca_btl_ofi_progress_threshold' to set the threshold number. The new default is 64. Also: Added FI_DELIVERY_COMPLETE to tx_rtx flags to ensure that the completion is generated after the message has been received on the remote side. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-26 10:39:45 -07:00
Nathan Hjelm	b0ac6276a6	btl/ugni: improve multi-threaded RDMA performance This commit improves the injection rate and latency for RDMA operations. This is done by the following improvements: - If C11's _Thread_local keyword is available then always use the same virtual device index for the same thread when using RDMA. If the keyword is not available then attempt to use any device that isn't already in use. The binding support is enabled by default but can be disabled via the btl_ugni_bind_devices MCA variable. - When posting FMA and RDMA operations always attempt to reap completions after posting the operation. This allows us to better balance the work of reaping completions across all application threads. - Limit the total number of outstanding BTE transactions. This fixes a performance bug when using many threads. - Split out RDMA and local SMSG completion queue sizes. The RDMA queue size is better tuned for performance with RMA-MT. - Split out put and get FMA limits. The old btl_ugni_fma_limit MCA variable is deprecated. The new variable names are: btl_ugni_fma_put_limit and btl_ugni_fma_get_limit. - Change how post descriptors are handled. They are no longer allocated separately from the RDMA endpoints. - Some cleanup to move error code out of the critical path. - Disable the FMA sharing flag on the CDM when we detect that there should be enough FMA descriptors for the number of virtual devices we plan to create. If the user sets this flag we will not unset it. This change should improve the small-message RMA performance by ~ 10%. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-26 11:31:35 -06:00
Ralph Castain	0ddbc75ce5	Merge pull request #4930 from kizill/fix-ipv6 fixed ipv6 OOB connection problems (fix issue #1585)	2018-06-26 09:13:53 -07:00
Nathan Hjelm	abb87f9137	Merge pull request #5338 from ggouaillardet/topic/uct btl/uct: misc fixes	2018-06-26 08:56:40 -06:00
Yossi Itigin	ee873f4f79	Merge pull request #5322 from hoopoepg/topic/mca-ucx-common MCA/UCX: added common module	2018-06-26 13:54:12 +03:00
Gilles Gouaillardet	b40b835a70	btl/uct: remove debug code Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-06-26 16:03:16 +09:00
Gilles Gouaillardet	552d0809aa	btl/uct: add missing include file Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-06-26 14:53:02 +09:00
Nathan Hjelm	6c089518e7	btl/uct: make uct endpoints array a flexible array member Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-25 18:14:58 -06:00
Nathan Hjelm	c5c5b42307	btl: add a new btl for the UCT layer in OpenUCX This commit adds a new btl for one-sided and two-sided. This btl uses the uct layer in OpenUCX. This btl makes use of multiple uct contexts and per-thread device pinning to provide good performance when using threads and osc/rdma. This btl has been tested extensively with osc/rdma and passes all MTT tests on aries and IB hardware. For now this new component disables itself but can be enabled by setting the btl_uct_memory_domains MCA variable with a comma-delimited list of supported memory domains/transport layers. For example: --mca btl_uct_memory_domains ib/mlx5_0. The specific transports used can be selected using --mca btl_uct_transports. The default is to use any available transport. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-25 18:14:58 -06:00
Sergey Oblomov	bf7fd480e9	MCA/COMMON/UCX: added non-blocking implementations of atomics - added implementation of swap/cswap/fadd operations - blocking add64 is replaced by non-blocking routine Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-25 12:25:31 +03:00
Sergey Oblomov	63e7ba6843	MCA/COMMON/UCX: added parameter for UCX/opal progress - added parameter to set UCX/opal progresses - minor refactoring of request wait routines Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-25 11:00:12 +03:00
Jeff Squyres	3767ce27c0	btl/tcp: trivial whitespace clean No code/logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 08:04:12 -07:00
Jeff Squyres	9034717876	btl/tcp: use a hash map for kernel IP interface indexes The giant size of the TCP proc struct is causing a problem in some environments (because it is allocated on the stack), and it was too big, anyway. Instead, use a hash map. That way, it starts small and can grow if it needs to. It also makes no assumptions about the values of the kernel interface indexes. Fixes #5292. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 08:03:30 -07:00
Jeff Squyres	e3d6c5ce3a	pmix3/pmix_server.c: minor compiler warning stomp Submitted upstream https://github.com/pmix/pmix/pull/776. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 06:35:09 -07:00
Sergey Oblomov	d57ae62dee	MCA/UCX: added common module - implemented non-blocking routines for flush operations Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-22 16:41:09 +03:00
Jeff Squyres	e305e80aff	Merge pull request #5317 from jsquyres/pr/update-bind-to-cpulist-option orterun: use consistent CLI option name for --bind-to	2018-06-21 12:43:18 -04:00
Jeff Squyres	4603852740	orterun: use consistent CLI option name for --bind-to Since the new binding option is tied to the --cpu-list orterun CLI option, make the --bind-to option reflect the same name (vs. the --cpu-set CLI option, which is entirely different). For example: mpirun --bind-to cpu-list:ordered ... Note that "--bind-to cpulist:ordered" is accepted as a synonym, because people will be lazy. Also add some minor updates to the orterun.1in man page for clarification. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-21 08:22:00 -07:00
Edgar Gabriel	fb16d40775	Merge pull request #5196 from edgargabriel/topic/cuda io/ompio: introduce initial support for cuda buffers in ompio	2018-06-21 10:14:43 -05:00
Edgar Gabriel	8c2ea0ef49	opal/datatype: add additional interface to retrieve more details about cuda buffer The existing interface in opal_datatype_cuda does not allow distinguishing whether a buffer is a managed or unmanaged cuda buffer. Add an interface that allows retrieving this information through a convertor, since the information is actually available in the mca_common_cuda_* routines. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-06-21 09:25:50 -05:00
Ralph Castain	f17d47087a	Define a new binding method and qualifier Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-20 21:26:09 -07:00
Ralph Castain	5ac2ce6346	Cover all the PMIx data types Cover all data types for OPAL-to-PMIx conversion, generating error logs when we hit something we don't support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-20 09:06:19 -07:00
Boris Karasev	39c9cb12bb	pmix/ext2x: fixed detection PMIx v2.0 by pmix component Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-06-20 13:23:51 +03:00
Ralph Castain	08707c9762	Sync to updated PMIx v3.0.0rc Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-19 21:25:43 -07:00
Noah Evans	a64abadf97	Fix mca_base_var_files separator In the opal list parsing behavior paths should be separated by ':' while files are separated by ','. In the opal and pmix code (the pmix fix is in a separate commit) there was a mistake in the parsing such that files were being separated by ':' when they should be separated by ','s. This commit attempts to address this mismatch. Signed-off-by: Noah Evans <noah.evans@gmail.com>	2018-06-19 12:59:07 -06:00
Ralph Castain	f0a0d606a0	Correct accounting for tools Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 1be080f7b92bad39745f42628a8cb6afefad2d2a)	2018-06-18 13:24:25 -07:00
Jeff Squyres	cd8d169599	Merge pull request #5276 from jsquyres/pr/info-key-len-fix util/info: tighten up error detection on key length	2018-06-18 12:03:21 -04:00
Thananon Patinyasakdikul	13f58f3191	Merge pull request #5274 from thananon/ofi_sep btl/ofi: add scalable endpoint support.	2018-06-18 08:41:06 -07:00
Jeff Squyres	266d5b2110	Merge pull request #5277 from jsquyres/pr/cygwin-patch external libevent: fix for Cygwin	2018-06-18 11:08:30 -04:00
Ralph Castain	7981818b84	Update PMIx atomics Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-17 10:03:49 -07:00
Ralph Castain	fa18ba395d	Sync to latest PMIx v3.0rc Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-17 02:41:46 -07:00
Ralph Castain	ac7bb15505	Fix other typo in help message Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-16 16:30:52 -07:00
Ralph Castain	8cfce583c0	Correct typo to properly check for PMIx v4 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-16 16:29:05 -07:00
Jeff Squyres	07c8ec6a3c	external libevent: fix for Cygwin Fix from Marco Atzeri for building on Cygwin. Signed-off-by: Marco Atzeri <marco.atzeri@gmail.com> Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-16 09:08:58 -07:00
Jeff Squyres	2670a7f55c	util/info: tighten up error detection on key length Fix CID 1435996: use the proper % type to render the size. Also use opal_output(), not fprintf(). For debug builds, abort without dumping core (dumping core is very unfriendly when running thousands of automated tests) -- the stderr output is sufficient to find the coding error. For non-debug builds, truncate the key and emit a warning that it almost certainly will not work properly. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-16 08:45:03 -07:00
Ralph Castain	94794011f1	Silence warnings and ignore test binary Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-15 11:32:22 -07:00
Ralph Castain	e2e6da4379	Merge pull request #5258 from rhc54/topic/pmix4 Sync to PMIx v3.0rc and add ext4x	2018-06-15 09:41:08 -07:00
Thananon Patinyasakdikul	dae3c9447c	btl/ofi: add scalable endpoint support. This commit adds support for scalable endpoints to enhance multithreaded application performance. The BTL will detect the support from the ofi provider and will fall back to normal usage if scalable endpoints are not supported. NEW MCA parameters: - mca_btl_ofi_disable_sep: force the btl to not use scalable endpoints. - mca_btl_ofi_num_contexts_per_module: number of communication contexts to create (should be the same as the number of threads). Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-14 15:44:29 -07:00
Thananon Patinyasakdikul	08cfacddee	Merge pull request #5272 from thananon/opal__thread opal/threads: Added keyword `opal_thread_local` for TLS	2018-06-14 14:45:47 -07:00
Thananon Patinyasakdikul	469a404c3b	opal/thread: Added keyword `opal_thread_local` for TLS. configure: add checks for `__thread` on top of current check for `_Thread_local` and define OPAL_HAVE_THREAD_LOCAL if the compiler supports TLS. Added `opal_thread_local` keyword to unify the definition. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-14 13:25:04 -07:00
Nathan Hjelm	009a3f093b	opal/asm: fix 32-bit fetch and op atomics These functions were erroneously returning the new, not the old value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-14 13:54:52 -06:00
Howard Pritchard	7dcab6e4a4	Merge pull request #5269 from hppritcha/topic/squash_gcc7.3.0_warnings topo/treematch - quash compiler warning	2018-06-13 21:13:04 -05:00
Howard Pritchard	64de269cc3	topo/treematch - quash compiler warning quash a compiler warning showing up with gcc 7.3 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-06-13 16:34:17 -05:00
Thananon Patinyasakdikul	390d72addd	Merge pull request #4885 from davideberius/spc_pr Initial Software-based Performance Counters PR	2018-06-12 14:04:49 -07:00
Jeff Squyres	591b225527	Merge pull request #5245 from PeterGottesman/hwloc-fix Ensure required hwloc directories are in dist tarballs	2018-06-12 13:08:35 -04:00
Peter Gottesman	afe363b73b	Ensure required hwloc directories are in dist tarballs Fixes MTT failure when running autogen on a tarball Signed-off-by: Peter Gottesman <pgottesm@cisco.com>	2018-06-12 08:33:50 -07:00
Howard Pritchard	b43f94895f	Merge pull request #5261 from thananon/ofi_new_mr btl/ofi: change required mr mode bits.	2018-06-11 20:54:09 -06:00
David Eberius	d377a6b6f4	Added Software-based Performance Counters driver code along with several counters. This code is the implementation of Software-based Performance Counters as described in the paper 'Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI. All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations, which were a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example. Signed-off-by: David Eberius <deberius@vols.utk.edu>	2018-06-11 22:48:16 -04:00
Thananon Patinyasakdikul	b6e07bfac6	btl/ofi: change required mr mode bits. FI_MR_UNSPEC is not supposed to be used beyond ofi version 1.5. This commit replaces FI_MR_UNSPEC with the new FI_MR_BASIC mode bits (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR). The btl functionality remains the same. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-11 09:25:57 -07:00
Ralph Castain	48f27655a6	Sync to PMIx v3.0rc and add ext4x Sync to the draft rc for PMIx v3.0. Add an external component for PMIx master, which is at v4.0 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-11 05:54:23 -07:00
Jeff Squyres	84701cd2b0	Merge pull request #5204 from markalle/info_snprintf fix info-subscribe to use snprintf() and warn on long key	2018-06-08 15:22:55 -04:00
Nicolas Morey-Chaisemartin	038ceb0b61	btl: openib: Add support for Broadcom Cumulus RoCE HCA Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>	2018-06-07 15:16:58 +02:00
Nicolas Morey-Chaisemartin	d3416345ab	btl: openib: add support for QLogic RoCE HCA Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>	2018-06-07 15:16:41 +02:00
Arm	ce4d769aee	Merge pull request #5235 from thananon/ofi_einr_fix btl/ofi: handles FI_EINTR properly.	2018-06-06 11:30:48 -07:00
Ralph Castain	840fb42f93	PMIx rte component does support dynamics Minor cleanups Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-05 21:55:19 -07:00
Thananon Patinyasakdikul	f34a73af70	btl/ofi: handles FI_EINTR properly. OFI sockets provider will sometimes return FI_EINTR. Apparently this flag does not exist in OFI version 1.5. This commit added an ifdef check. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-05 18:40:00 -07:00
Nathan Hjelm	8192eb9aa6	patch/overwrite: add support for aarch64 This commit adds patcher support of aarch64. The current implementation uses mov, movk, and br to perform the jump. This uses 5 instructions. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-04 16:04:26 -06:00
Arm	555029b323	Merge pull request #5216 from thananon/btl_ofi btl/ofi: the component now compiles with libfabric < 1.5.	2018-06-04 13:47:00 -07:00
Thananon Patinyasakdikul	c18cf7315f	btl/ofi: configury: will not build if OFI ver < 1.5. This commit fixes a problem where users have libfabric version < 1.5 installed and the build failed because the new MR mode bits do not exist. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-04 10:35:50 -07:00
Ralph Castain	014bb3c8de	Fix external hwloc builds Remove spurious comma in header file definition. Remove unused variables Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-03 11:24:21 -07:00
Ralph Castain	55ac526a67	Enable the PMIx ompi/rte component Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using: * autogen --no-orte * with external libevent, external hwloc, and external PMIx master * configuring PMIx master with the same libevent and hwloc * execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately. Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)	2018-06-03 07:25:12 -07:00
Ralph Castain	4ceaed3a76	Merge https://github.com/bgoglin/ompi into topic/rte2	2018-06-03 07:24:12 -07:00
Howard Pritchard	623e36de8a	Merge pull request #5205 from thananon/btl_ofi new btl/ofi: RDMA only btl using libfabric.	2018-06-02 13:00:50 -06:00
Mark Allen	93fefc4d70	fix info-subscribe to use snprintf() and warn on long key This checkin mainly concerns our internal info keys that are registering for callbacks via opal_infosubscribe_subscribe(). Those keys need to have an extra __IN_<key>/val stored to preserve their pre-callback value. So that means our internal keys are limited to 5 chars shorter than the usual key length limit. The code previously would have been silently inactive if a large key happened to come in, now it warns and also uses snprintf() to avoid compiler warnings. I'm also making the top-level MPI_Info_set warn if the user uses our reserved "__IN_" prefix. I had wanted the feature to be more invisible than that, but it would require a more sophisticated approach to change that. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2018-06-01 18:31:32 -04:00
Thananon Patinyasakdikul	6efb8069ec	new btl/ofi: RDMA only btl using libfabric. This commit added new transport layer to be used with osc rdma module. This BTL provides put, get, atomic and fetch atomic operations. It can be used with multiple hardware vendors as long as they have their provider under Libfabric and have the right capabilities. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-01 15:22:04 -07:00
Nathan Hjelm	f8dbf62879	opal/asm: change ll/sc atomics to macros This commit fixes a hang that occurs with debug builds of Open MPI on aarch64 and power/powerpc systems. When the ll/sc atomics are inline functions the compiler emits load/store instructions for the function arguments with -O0. These extra load/store arguments can cause the ll reservation to be cancelled causing live-lock. Note that we did attempt to fix this with always_inline but the extra instructions are still emitted by the compiler (gcc). There may be another fix but this has been tested and is working well. References #3697. Close when applied to v3.0.x and v3.1.x. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-05-31 19:45:19 -06:00
Jeff Squyres	fb0473acb5	pmix3x: compiler warning stomp This fix was already included in pmix upstream (https://github.com/pmix/pmix/commit/fb7af8af2). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-05-30 10:14:37 -07:00
Jeff Squyres	dec247d96e	opal/datatype: minor compiler warning stomp Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-05-30 10:08:19 -07:00
bosilca	4ebed21b6d	Merge pull request #4670 from ggouaillardet/topic/opal_bitmap opal/bitmap: fix opal_bitmap_set_bit()	2018-05-29 10:50:21 -04:00
Sylvain Jeaugey	4eb75623ef	cuda: add option to remove warning about missing libcuda. Signed-off-by: Sylvain Jeaugey <sjeaugey@nvidia.com>	2018-05-24 14:56:46 -07:00
Brice Goglin	847f2e9933	opal/hwloc: remove now unused available field from opal_hwloc_obj_data_t Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:53:07 +02:00
Brice Goglin	b260600450	opal/hwloc: simplify df_search() and make it work with hwloc 2.x NUMA nodes Don't do a recursive search (hence no need for *idx anymore). Find the level depth, to hide cache-issues first. Then iterate over that level to find the objects we want. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:53:07 +02:00
Brice Goglin	a06fc74664	opal/hwloc: remove an obsolete comment about offline CPUs etc Only online/available objects are enabled in OMPI now. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:53:07 +02:00
Brice Goglin	369a7ea279	opal/hwloc: remove df_search_cores and fix things for hwloc 2.x NUMA nodes Just iterate over cores inside the given object cpuset. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:53:07 +02:00
Brice Goglin	0cd0c12111	opal/hwloc: remove min_bound() functions df_search_min_bound() would need to be fixed for hwloc 2.0, but it's only used in opal_hwloc_base_find_min_bound_target_under_obj() which isn't used anymore. So just remove all of them. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:53:07 +02:00
Brice Goglin	d12ef324c9	hwloc 2.0 doesn't have hwloc/myriexpress.h anymore Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:53:07 +02:00
Brice Goglin	33ea2f0de4	fix OPAL_HWLOC_WANT_SHMEM management in opal/mca/hwloc/external/external.h Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:53:07 +02:00
Brice Goglin	bd08a6ead9	hwloc: fix hwloc/shmem.h in the external case Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:53:07 +02:00
Jeff Squyres	af4299ebc5	hwloc: updates for hwloc 2.0.x API Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-05-24 11:53:07 +02:00
Brice Goglin	77cc3fcda5	hwloc: update to hwloc 2.0.1 Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2018-05-24 11:52:59 +02:00
Jeff Squyres	22bfdb194d	Merge pull request #5174 from hoopoepg/topic/typo-in-comment CONVERTOR: fixed typos in comments	2018-05-17 12:15:02 -04:00
Sergey Oblomov	52d5ca048e	CONVERTOR: fixed typos in comments Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-05-16 22:02:39 +03:00
bosilca	2ab628b92e	Merge pull request #5074 from bosilca/topic/remove_warnings Remove warnings identified by clang.	2018-05-15 11:15:23 -04:00
Howard Pritchard	db45d61dfa	Merge pull request #5147 from hppritcha/topic/plug_debug_hole_in_verbs btl/openib: add conditional around an assert	2018-05-05 08:12:53 -06:00
Howard Pritchard	30eed9f035	btl/openib: add conditional around an assert A user trying to build Open MPI with explicit use of CFLAGS on the make command line hit problems. This fixes one of the problems. https://www.mail-archive.com/users@lists.open-mpi.org//msg32241.html Signed-off-by: Howard Pritchard <hppritcha@gmail.com>	2018-05-04 14:17:07 -06:00
Ralph Castain	4ff61450a4	Ensure pmix_cleanup finalizes the class system Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-05-04 06:22:36 -07:00
Nathan Hjelm	380dcb57de	Merge pull request #5072 from bosilca/topic/datatype_add_size_t Allow OPAL DDT to receive size_t count argument.	2018-05-01 09:47:45 -06:00
Gilles Gouaillardet	edb8fe8e4b	pmix/ext1x: fix index handling when populating an info array Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-04-26 11:06:43 +09:00
Ralph Castain	f424aa367e	Fix external PMIx v1.2.5 support As @hjelmn and I discussed, this is a little hacky. However, it is the only solution that can be done solely from the OMPI side. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-04-25 13:42:36 -07:00
Howard Pritchard	0751578cbf	Merge pull request #4949 from hppritcha/topic/memkind_update mpool/memkind: refactor to use the current API	2018-04-25 07:24:28 -06:00
Howard Pritchard	824197f886	mpool/memkind: refactor to use the current API The mpool/memkind component was using a deprecated "partitions" API. This commit refactors the memkind component to make use of the supported public API. The public API uses 3 parameters to specify a mpool "kind": - a memkind type (which for now is just default or HBM) - a memkind policy - a memkind_bits (partly to specify pagesize) The MCA parameters were changed to reflect these memkind parameters. Add a make check test for sanity checking of the memkind component. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-04-24 22:11:21 -06:00
Nathan Hjelm	69456c8962	Merge pull request #5042 from Stonesjtu/patch-1 Fix typo	2018-04-16 12:13:35 -06:00
George Bosilca	6ff11267fb	Remove warnings identified by clang. Plus minor spacing and indentation issues. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-04-14 17:14:12 -04:00
George Bosilca	cd683e3eec	Allow OPAL DDT to receive size_t count argument. Fixes issue #5069, which relates to a BigMPI bug with the use of MPI_Type_vector to construct very large datatypes (>2GB). Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-04-14 15:32:19 -04:00
Todd Kordenbrock	55c6918316	Merge pull request #5053 from tkordenbrock/topic/master/btl-portals4.del_proc.fix master: btl-portals4: don't free module resources when proc count goes to zero	2018-04-12 12:12:34 -05:00
Gilles Gouaillardet	37e7bca867	pmix/ext1x: fix misc build time errors Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-04-12 14:58:55 +09:00
Kaiyu Shi	b7c5e65d4f	Fix typo Signed-off-by: Kaiyu Shi <skyisno.1@gmail.com>	2018-04-11 10:29:43 +08:00
Todd Kordenbrock	b569633ddf	btl-portals4: don't free module resources when proc count goes to zero This commit fixes a segfault in btl-portals4 add_procs(). The segfault occurs if add_procs() is called after a del_procs() call that reduces the proc count to zero, which would cause PT and NI resources to be freed. This commit resolves the segfault by using a common initialization boolean and only freeing module resources in finalize(). Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)	2018-04-10 14:20:22 -05:00
Jeff Squyres	45922c4e81	pmix/base: set PMIx to follow OPAL's mca_component_show_load_errors Have Open MPI's PMIx component set PMIx's "show_load_errors" to do the same thing that Open MPI's "show_load_errors" does. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-04-10 10:24:35 -07:00
Jeff Squyres	f200b866df	btl/tcp: roll back parts of `40afd525f8` Some of the show_help() messages that were added in `40afd525f8` were really normal / expected behavior (e.g., if 2 peers connect in the TCP BTL more-or-less simultaneously, one of them will drop the connection -- no need to show_help() about this; it's expected behavior). Roll back these messages to be opal_output_verbose() kinds of messages. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-04-07 12:28:10 -07:00
Jeff Squyres	a2fc1ace09	Merge pull request #4992 from jsquyres/pr/pmix-version-info-mca-vars pmix: add "pmix*_library_version" info MCA var	2018-04-04 17:29:06 -04:00
Ralph Castain	cd52ccdb68	Move past the '.' when getting jobstepid The strtoul function returns the pointer to the first non-digit character, which is a '.' in this case. Calling strtoul at that point will always yield a zero - you have to move past it to get the remaining number. Thanks to Greg Lee for the detailed analysis of the problem. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-04-04 11:22:38 -07:00
Jeff Squyres	9f472d8a7b	pmix: add "pmix*_library_version" info MCA var Simple MCA vars for ext1, ext2, and pmix3 components to reflect what the underlying PMIx library version is. For example: ``` $ ompi_info --param pmix pmix3x --parsable --level 9 | grep library_version mca:pmix:pmix3x:param:pmix_pmix3x_library_version:value:PMIx library version 3.0.0 (embedded in Open MPI) mca:pmix:pmix3x:param:pmix_pmix3x_library_version:source:default mca:pmix:pmix3x:param:pmix_pmix3x_library_version:status:writeable mca:pmix:pmix3x:param:pmix_pmix3x_library_version:level:4 mca:pmix:pmix3x:param:pmix_pmix3x_library_version:help:Version of the underlying PMIx library mca:pmix:pmix3x:param:pmix_pmix3x_library_version:deprecated:no mca:pmix:pmix3x:param:pmix_pmix3x_library_version:type:string mca:pmix:pmix3x:param:pmix_pmix3x_library_version:disabled:false ``` Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-29 14:21:07 -07:00
Jeff Squyres	8c419294a8	btl/tcp: fix CID 710596 sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses), but the memory that we are copying from (my_ss->sin_addr) is only 4 bytes long. Don't copy beyond the end of that source buffer. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-26 14:21:22 -07:00
Jeff Squyres	3003be14f3	btl/sm: fix CID 1415105 Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-26 14:21:21 -07:00
Jeff Squyres	a17f4afdc7	btl/tcp: fix CID 1416634 Fix resource leak in the TCP BTL. Also add a little defensive programming. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-26 14:21:21 -07:00
Nathan Hjelm	7f761d8434	opal_free_list: use lifo atomic functions in opal_free_list_wait_mt This commit fixes a multi-threading bug when using the thread-safe free list functions. opal_free_list_wait_mt() was using the conditional version of opal_lifo_pop() and not the thread-safe call. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-03-26 10:16:42 -06:00
Ralph Castain	e443adc7a1	Reset OMPI master to PMIx master Track PMIx master instead of the reference server - fixes problem of external PMIx master builds. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-03-25 08:36:46 -07:00
Artem Polyakov	77ff99e9ee	Merge pull request #4933 from karasevb/timings_update timings: added new timing points	2018-03-25 00:10:49 -07:00
Jeff Squyres	06ec93a61a	util/fd: fix CID 1430413 Take multiple defensive steps to fix CID 1430413 and ensure that ret is always initialized upon return. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-24 04:25:26 -07:00
Jeff Squyres	c3adcb05eb	Miscellaneous compiler warnings fixes Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-23 11:45:30 -07:00
Jeff Squyres	f66ac43fbc	opal/util: fix CID 1430381 Fix minor resource leak. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-23 08:48:11 -07:00
Jeff Squyres	023a4a82d3	Merge pull request #4942 from jsquyres/pr/tcp-btl-help-message-updates TCP help message updates	2018-03-22 08:53:04 -05:00
Jeff Squyres	0f8077ace6	oob/tcp: add show_help message about version mismatch Be more explicit about version mismatch between ORTE processes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-21 20:18:28 -07:00
Jeff Squyres	a15d8233c9	Merge pull request #3434 from dsharma283/pr-3431 ompi/opal: add support for HDR link speeds	2018-03-21 21:57:20 -05:00
Jeff Squyres	40afd525f8	btl/tcp: make error messages more specific Convert some verbose messages to opal_show_help() messages. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-21 19:34:03 -07:00
Jeff Squyres	e0d86b1c72	opal/util/fd: add opal_fd_get_peer_name() Returns a string name (either a resolved name or, if unresolvable, an IPv4/IPv6 address in a string). The caller is responsible for freeing the string. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-21 19:34:03 -07:00
Devesh Sharma	90e9b22196	ompi/opal: add support for HDR link speeds This patch enables the use of adapters with HDR speeds. Issue id 3431 Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>	2018-03-21 19:15:41 -07:00
Howard Pritchard	ade280eb7c	Merge pull request #3292 from markalle/pr/ibv_reg_mr__fork IB fork	2018-03-21 09:39:08 -06:00
Boris Karasev	85ce76fa36	Merge pull request #4926 from karasevb/pmix_dstore_enable pmix: dstore returned for direct modex	2018-03-21 14:24:51 +07:00
Boris Karasev	3796307a57	timings: added new timing points Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-03-21 05:16:25 +02:00
Stanislav Kirillov	c2bfca19ba	fix ipv6 missing copypaste Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>	2018-03-21 02:25:04 +03:00
Stanislav Kirillov	86061fbf8d	fixed ipv6 OOB connection problems Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>	2018-03-20 16:07:53 +00:00
Boris Karasev	dca3dd2ea4	pmix: dstore returned for direct modex Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-03-20 04:56:48 +02:00
Jeff Squyres	cc4bb433bc	mpool/memkind: fix typo in partition page sizes Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-19 16:33:08 -05:00
Boris Karasev	36a0c6a794	pmix: fixed the direct modex request This commit fixes the case when a local client asks for the key from a process on a remote node. The local server doesn't have the commit count for remote ranks; it is maintained by another PMIx server, so the commit count should be ignored for remote requests. Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-03-19 11:51:03 +02:00
Jeff Squyres	2713a24009	opal_datatype_module.c: reset opal_cuda_verbose `999de137ce` accidentally reset opal_cuda_verbose's default value. This commit puts it back. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-03-13 10:10:15 -07:00
George Bosilca	999de137ce	Fix the datatype debug. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-03-08 03:40:08 +09:00
George Bosilca	7848035195	Update the loop stats. The loop should be updated on each internal iteration. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-03-08 03:18:39 +09:00
Jordan Cherry	2f0e8153a5	Merge pull request #4247 from jocherry/btlTcpLinksBugFix tcp btl: Fix multiple-link connection establishment.	2018-03-07 08:40:37 -08:00
Ralph Castain	7241043809	Modify the internal logic for resolve nodes/peers The current code path for PMIx_Resolve_peers and PMIx_Resolve_nodes executes a threadshift in the preg components themselves. This is done to ensure thread safety when called from the user level. However, it causes thread-stall when someone attempts to call the regex functions from _inside_ the PMIx code base should the call occur from within an event. Accordingly, move the threadshift to the client-level functions and make the preg components just execute their algorithms. Create a new pnet/test component to verify that the preg code can be safely accessed - set that component to be selected only when the user directly specifies it. The new component will be used to validate various logical extensions during development, and can then be discarded. Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 456ac7f7af3d9ba09888e3c899eb001daaa24aef)	2018-03-02 02:00:31 -08:00
Ralph Castain	17c40f4cea	Implement support for proctable queries Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-03-02 02:00:31 -08:00
Ralph Castain	0434b615b5	Update ORTE to support PMIx v3 This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on": * initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch * IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet. * Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later. * Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-03-02 02:00:31 -08:00
Gilles Gouaillardet	9fedf2836e	btl/vader: handle unexpected short read/write in process_vm_{read,write}v Important note: According to the man page "On success, process_vm_readv() returns the number of bytes read and process_vm_writev() returns the number of bytes written. This return value may be less than the total number of requested bytes, if a partial read/write occurred. (Partial transfers apply at the granularity of iovec elements. These system calls won't perform a partial transfer that splits a single iovec element.)" So since we use a single iovec element, the returned size should either be 0 or size, and the do loop should not be needed here. We tried on various Linux kernels with size > 2 GB, and surprisingly, the returned value is always 0x7ffff000 (fwiw, it happens to be the size of the largest number of pages that fits in a signed 32-bit integer). We do not know whether this is a bug from the kernel, the libc or even the man page, but for the time being, we do as if process_vm_readv() could return any value. Thanks Heiko Bauke for the bug report. Refs. open-mpi/ompi#4829 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-03-02 13:20:46 +09:00
Nathan Hjelm	5ed2fc2d48	mca/base: add support for additional variable types This commit adds long, int32_t, uint32_t, int64_t, and uint64_t as possible MCA variable types. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-03-01 20:42:27 -07:00
Jordan Cherry	d7e7e3acb7	tcp btl: Fix multiple-link connection establishment. Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers. Three issues were resulting in hangs during large message transfer: * The 2nd..btl_tcp_link connections were dropped during establishment because the per-process address check was binary, rather than a count * The accept handler would not skip a btl module that was already in use, resulting in all connections for a given address being vectored to a single btl * Multiple addresses in the same subnet caused connections to be stalled, as the receiver would always use the same (first) address found. Binding the outgoing connection solves this issue * Lastly fix race condition created by connections being started at the exact same time by accepting connections not in the closed state, allowing endpoint_accept to resolve dispute Signed-off-by: Jordan Cherry <cherryj@amazon.com>	2018-02-27 16:36:44 +00:00
Nathan Hjelm	5380d7cce5	mpool/hugepage: add missing header Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-02-26 13:35:56 -07:00
Nathan Hjelm	38d9b10db8	rcache/base: update VMA tree to use opal_interval_tree_t This commit replaces the current VMA tree implementation with one that uses the new opal_interval_tree_t class. Since the VMA tree lock is no longer used this commit also updates rcache/grdma and btl/vader to take better care when searching for existing registrations. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-02-26 13:35:56 -07:00
Nathan Hjelm	7163fc98a0	opal/class: add a new class: opal_interval_tree_t This commit adds a new class to opal: opal_interval_tree_t. This is a thread-safe implementation of a 1-dimensional interval tree. The data structure is intended to provide a faster implementation of the registration cache VMA tree. The thread safety is provided by a relativistic red-black tree implementation. This structure provides support for multiple readers and a single writer. There is one caveat: an item may appear in the tree twice while the tree is being updated. Care needs to be taken to avoid issues associated with this "feature". I don't anticipate a problem with the current VMA tree usage. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-02-26 13:35:56 -07:00
Ralph Castain	60e6440603	Sync to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-02-19 09:20:13 -08:00
Jeff Squyres	b452991ad8	btl/usnic: missed a preprocessor check in `d36648b` Missed updating one instance of `==` to `>=` in `d36648b`. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-02-19 07:03:05 -08:00
Jeff Squyres	d36648b547	btl/usnic: update BTL_VERSION handling Follow-on to `8097d09858`: now that BTL_VERSION is defined in btl.h, be a little smarter about whether we define it or not. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-02-16 13:20:36 -08:00
Ralph Castain	8097d09858	Silence usnic warnings - BTL version has changed Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-02-16 10:00:18 -08:00
Nathan Hjelm	9aa21f4467	Merge pull request #4796 from hjelmn/btl_v3.1 opal/btl: add support for flushing RDMA/atomic operations	2018-02-15 12:44:18 -07:00
Nathan Hjelm	072a6a4850	opal/btl: add support for flushing RDMA/atomic operations This commit adds a new optional function to the BTL module: btl_flush. This function takes an optional BTL endpoint. When called this function completes all outstanding RDMA and atomic operations started prior to the call to btl_flush. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-02-13 12:49:51 -07:00
Gilles Gouaillardet	9121eb4ff9	opal/lifo: fix an ABA problem in opal_lifo_pop_atomic that was introduced in open-mpi/ompi@11bb8b09a0 Fixes open-mpi/ompi#4784 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-02-09 14:48:54 +09:00
Ralph Castain	1a7dfd7d54	Sync to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-02-07 12:16:51 -08:00
Ralph Castain	9fe8153d38	Sync to IOF branch and continue fix of request for job info from unknown nspace Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 02400d30d79ce3c7e7e28f9a08f7062a5b6f4c51)	2018-02-03 19:56:35 -08:00
Howard Pritchard	1adf0873f7	Merge pull request #4766 from hppritcha/topic/squash_grdma_comp_warning rcache/grdma: squash a compiler warning	2018-02-02 18:58:32 -07:00
Gilles Gouaillardet	43700faba1	pmix/ext3x: remove autogenerated ext3x.h header file This header file was meant to be autogenerated and, for some reason, was never removed from the repository. Update .gitignore as well Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-01-31 23:45:42 +09:00
Gilles Gouaillardet	8209fca842	pmix/ext3x: bring external component up-to-date with the embedded pmix3x add the callback prototype for the upcoming PMIx_IOF_push() API Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-01-31 13:35:34 +09:00
Gilles Gouaillardet	0481277e93	pmix/ext3x: bring external component up-to-date with the embedded pmix3x Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-01-31 13:33:33 +09:00
Nathan Hjelm	bb212e0c94	Merge pull request #4767 from ggouaillardet/topic/vader_backing_file btl/vader: make the backing file job specific	2018-01-30 21:27:02 -07:00
Gilles Gouaillardet	0285c63348	pmix/ext3x: generate component source when only static libraries are built Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-01-31 13:21:14 +09:00
Gilles Gouaillardet	611d7c2d27	btl/vader: make the backing file job specific Since open-mpi/ompi@47fd2313ab the backing file is now in /dev/shm by default. As a consequence, the backing file name has to include the jobid so more than one job can run at a time. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-01-30 16:52:51 +09:00
Howard Pritchard	c3cac6731f	rcache/grdma: squash a compiler warning tired of seeing this compiler warning Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-01-29 11:50:01 -07:00
Ralph Castain	a17df810ed	Sync with PMIx iof rfc Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-01-25 10:51:38 -08:00
Ralph Castain	e9cd7fd7e6	Update orte Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-01-25 08:53:43 -08:00
Ralph Castain	9fb80bd239	Update the opal/pmix base framework elements Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-01-25 08:37:52 -08:00
Ralph Castain	187352eb3d	Update the PMIx external components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-01-25 08:35:57 -08:00
Ralph Castain	a5679ef000	Update the PMIx 3.x component Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-01-25 08:34:44 -08:00
Yossi Itigin	7cee60346e	opal_progress: check timer only once per 8 calls Reading the system clock on every call to opal_progress() is an expensive operation on most architectures, and it can negatively affect the performance, for example of message rate benchmarks. We change opal_progress() to read the clock once per 8 calls, unless there are active users of the event mechanism. Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-01-16 19:18:53 +02:00
Ralph Castain	6216225bda	Ensure cleanup of registered files/dirs Resolve a race condition between registering for a file to be removed upon termination and actual creation of that file by providing attributes that identify whether the path is a file or directory. This removes the need for PMIx to detect the difference. Refs #4686 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-01-11 11:05:30 -08:00
Ralph Castain	6dacf40a8c	Ensure the epilog gets executed in PMIx server If we abnormally terminate, then we still want any cleanups to be executed. Remove debug Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-01-10 18:28:05 -08:00
Gilles Gouaillardet	1a17cb3b1c	opal/datatype: add opal_datatype_is_monotonic() Return true if the datatype has non-negative, monotonically nondecreasing displacements, and false otherwise. Thanks George for the guidance. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-01-09 18:05:14 +09:00
Ralph Castain	d620070c77	Correct the comment in the default MCA param template - we do not support a param called "component_path". The correct syntax is "mca_base_component_path" Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-01-05 08:46:44 -08:00
Nathan Hjelm	8b8aae372d	opal/asm: add atomic min/max convenience functions This commit adds atomic functions for min/max. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-01-02 08:38:36 -07:00
Gilles Gouaillardet	125169f057	opal/bitmap: fix opal_bitmap_set_bit() Correctly reallocate the bitmap when needed Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-12-27 14:56:43 +09:00
Nathan Hjelm	39d598899b	rcache/grdma: fix crash when part of a registration is unmapped This commit fixes an issue when a registration is created for a large region and then invalidated while part of it is in use. References #4509 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-12-22 10:36:35 -07:00
Ralph Castain	d5471d7898	Silence warnings in optimized build Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-12-20 12:00:28 -08:00
Ralph Castain	db8ebd33ad	Fix the optnone attribute, add extension attribute See how the various compilers handle these Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-12-18 19:18:53 -08:00
Nathan Hjelm	47fd2313ab	btl/vader: move backing files into /dev/shm on Linux This commit moves the backing files to /dev/shm to avoid limitations that may be set on /tmp. The files are registered with pmix to ensure they are cleaned up after an erroneous exit. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 48101278160672317ade352365592f56ef3b8977)	2017-12-18 07:09:18 -08:00
Ralph Castain	07427c6d89	Update to PMIx v3.0 PR for cleanup registration If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so. Final cleanup to track uid/gid, update the opal/pmix API to pass flags for ignore and leave top directory alone Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-12-18 06:53:11 -08:00
Ralph Castain	5c4185abd8	Add the __optnone__ attribute to help avoid optimizing out MPIR_Breakpoint Thanks to @kiranchandramohan for the suggestion Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-12-14 13:14:21 -08:00
Nathan Hjelm	d3fa1bbbb0	rcache/grdma: try to prevent erroneous free error messages It is possible to have parts of an in-use registered region be passed to munmap or madvise. This does not necessarily mean the user has made an error but does mean the entire region should be invalidated. This commit checks that the munmap or madvise base matches the beginning of the cached region. If it does and the region is in-use then we print an error. There will certainly be false-negatives where a user unmaps something that really is in-use but that is preferable to a false-positive. References #4509 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-12-12 09:18:39 -07:00
Nathan Hjelm	a82f761a4a	btl/vader: change the way fast boxes are used There were multiple paths that could lead to a fast box allocation. One of them made little sense (in-place send) so it has been removed to allow a rework of the fast-box send function. This should fix a number of issues with hanging/crashing when using the vader btl. References #4260 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-12-11 10:38:33 -07:00
Gilles Gouaillardet	11e5f86bf8	mpool/base: plug a memory leak set the key of all mpool_tree_item objects, so they can be retrieved in mpool_base_free and then returned back to the mca_mpool_base_tree_item_free_list free list. Refs. open-mpi/ompi#4567 Thanks Philip Blakely for the bug report. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-12-07 09:06:25 +09:00
Nathan Hjelm	2e74befa13	Merge pull request #4565 from benmenadue/master Use malloc instead of posix_memalign for small (<= sizeof(void *)) alignments	2017-12-05 16:35:24 -07:00
Nathan Hjelm	ad59b93266	Merge pull request #4566 from kawashima-fj/pr/arm64-atomic opal/asm/arm64: Fix `opal_atomic_compare_exchange_*` bug	2017-12-05 16:34:51 -07:00
Nathan Hjelm	8e0e184bc9	opal/asm: fix compilation of 128-bit compare-exchange with gcc7 This commit removes eax and edx from the clobber list. Older versions of gcc handled these ok but gcc 7 does not. They are not required as eax and edx are specified in output constraints. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-12-05 15:42:47 -07:00
Ben Menadue	90fa8af10b	Use correct alignment request in mca_mpool_base_alloc. Signed-off-by: Ben Menadue <ben.menadue@nci.org.au>	2017-12-06 07:02:17 +11:00
Nathan Hjelm	641bdc4ab7	opal/asm: fix 32-bit build Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-12-05 11:49:13 -07:00
KAWASHIMA Takahiro	08254e8b12	opal/asm/arm64: Fix `opal_atomic_compare_exchange_*` bug Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2017-12-05 15:57:29 +09:00
Ben Menadue	db3e25edad	Update mca_mpool_base_alloc to use malloc instead of posix_memalign for alignment requests of <= sizeof(void *). This works around issue #4564 . Signed-off-by: Ben Menadue <ben.menadue@nci.org.au>	2017-12-05 09:51:31 +11:00
Matias Cabral	2c86b8723d	Merge pull request #4510 from matcabral/mtl_psm2_shadow_vars New flag for MCA parameters that allows a default value of "unset".	2017-12-04 12:25:37 -08:00
Gilles Gouaillardet	d062db1a98	sync_builtin: fix misc typos Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-12-04 11:59:50 +09:00
Nathan Hjelm	7893248c5a	opal/asm: add fetch-and-op atomics This commit adds support for fetch-and-op atomics. This is needed because AND and OR are irreversible operations, so there needs to be a way to get the old value atomically. These are also the only semantics supported by C11 (there is no atomic_op_fetch, just atomic_fetch_op). The old op-and-fetch atomics have been defined in terms of fetch-and-op. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-11-30 10:41:23 -07:00
Nathan Hjelm	1282e98a01	opal/asm: rename existing arithmetic atomic functions This commit renames the arithmetic atomic operations in opal to indicate that they return the new value not the old value. This naming differentiates these routines from new functions that return the old value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-11-30 10:41:22 -07:00
Nathan Hjelm	9d0b3fe9f4	opal/asm: remove opal_atomic_bool_cmpset functions This commit eliminates the old opal_atomic_bool_cmpset functions. They have been replaced by the opal_atomic_compare_exchange_strong functions. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-11-30 10:41:22 -07:00
Nathan Hjelm	11bb8b09a0	opal/class: use new compare-and-swap functions Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-11-29 12:56:32 -07:00
Nathan Hjelm	84f63d0aca	opal/asm: add opal_atomic_compare_exchange_strong functions This commit adds a new set of compare-and-exchange functions. These functions have a signature similar to the functions found in C11. The old cmpset functions are now deprecated and defined in terms of the new compare-and-exchange functions. All asm backends have been updated. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-11-29 12:45:44 -07:00
Nathan Hjelm	647b40f3f2	Merge pull request #4442 from bosilca/topic/ob1_pvar Topic/ob1 pvar	2017-11-29 09:31:07 -07:00
Josh Hursey	7bbd24c868	Merge pull request #4517 from jjhursey/fix/ppc-asm-eieio ppc/asm: Fix opal_atomic_wmb definition	2017-11-28 07:43:58 -06:00
Gilles Gouaillardet	3b4b3bb6f9	pmix/ext3x: add a missing cnctcbfunc field to ext3x_opalcaddy_t Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-11-28 16:11:08 +09:00
Ralph Castain	3906aaf41a	Silence warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-11-25 11:50:18 -08:00
Ralph Castain	30f23ac67a	Save one more file descriptor per process by not opening one for stddiag if PMIx (version > 1.x) is active since all diagnostic messages will instead flow thru the PMIx connection. Unfortunately, PMIx v1 does not support this feature, but we can remove the stddiag support once PMIx v1 slides out of the support window Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-11-25 11:48:53 -08:00
Joshua Hursey	4f0d43686e	ppc/asm: Fix opal_atomic_wmb definition * Fix typo in the `opal_atomic_wmb` declaration. * Fix lingering `eieio` reference in the XL assembly to be `lwsync` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-11-20 09:51:12 -06:00
Matias A Cabral	1fad59465f	New flag for MCA parameters that allows a default value of "unset". mtl/psm2: Update some shadow mca parameters to use the default "unset". mtl/psm2: Add new shadow parameter to allow specifying the service level. Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>	2017-11-16 16:28:50 -08:00
Wojtek Wasko	276de13a1e	Make interface's kernel index an int instead of int16_t Sometimes, the ethernet interfaces can get quite high kernel indices. struct ifreq (see netdevice(7)) defines ifr_ifindex to be an int. The OOB component used int16_t internally for matching (in case of -mca oob_tcp_if_[in|ex]clude), which meant that any interface index > 32767 would never be matched because the integer would be truncated to int16_t upon return from the function. OOB would then refuse to work because it didn't find any usable interfaces, and the MPI job would abort. Signed-off-by: Wojtek Wasko <wwasko@nvidia.com>	2017-11-15 04:32:26 -05:00
Jeff Squyres	c19822dad4	pmix: pack pointer to object (vs. pointer to pointer) Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-11-13 09:50:44 -08:00
Ralph Castain	6eb3c124e1	Merge pull request #4498 from rhc54/topic/pmixup Some minor cleanups of the DVM	2017-11-12 19:01:15 -08:00
Ralph Castain	9c84e1485b	Some minor cleanups of the DVM Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-11-12 16:27:37 -08:00
Ralph Castain	64e838c1ac	Merge pull request #4495 from rhc54/topic/pmixup Sync to PMIx master	2017-11-11 18:25:45 -08:00
Ralph Castain	d75d0bc5f6	Sync to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-11-11 17:06:41 -08:00
Jeff Squyres	99662757e2	usnic: only output unknown frames in verbose mode Per https://www.mail-archive.com/users@lists.open-mpi.org/msg31758.html, only output unknown frames when we're outputting verbose BTL messages. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-11-10 11:14:05 -08:00
Gilles Gouaillardet	7a1c65007a	atomics: always #include <stdbool.h> so the bool type is defined when using old compilers that do not support gcc builtin atomics (such as gcc 4.1.x from CentOS 5) Fixes open-mpi/ompi#4478 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-11-10 14:05:46 +09:00
George Bosilca	8a9ef3dc2d	Delay the initialization until necessary. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-11-08 17:32:18 -05:00
Gilles Gouaillardet	cfdf042d89	Merge pull request #4461 from ggouaillardet/topic/cygwin_fixes memory/patcher: #ifdef out some parts when SYS_munmap is not defined	2017-11-08 13:24:12 +09:00
Ralph Castain	d4b83cc951	Sync with PMIx master Implement direct modex protection to turn off PMIx dstore when direct modex scenario is detected Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-11-07 18:10:56 -08:00
Gilles Gouaillardet	d19a8351c8	memory/patcher: #ifdef out some parts when SYS_munmap is not defined so memory/patcher can work under cygwin Thanks Marco Atzeri for bringing this to our attention Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-11-07 16:44:40 +09:00
George Bosilca	e57834aaaa	Point to the correct MPI object. Store the pointer to the object handle and not the pointer to the pointer. We should not assert(0) in the code! Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-11-03 01:20:34 -04:00
bosilca	63e8a8c608	Merge pull request #4431 from hjelmn/asm_cleanup opal: rename opal_atomic_cmpset* to opal_atomic_bool_cmpset*	2017-11-02 18:45:56 -04:00
Ralph Castain	b97caf8f05	Correct copy/paste error in example Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-11-02 10:33:28 -07:00
Nathan Hjelm	3ff34af355	opal: rename opal_atomic_cmpset* to opal_atomic_bool_cmpset* This commit renames the atomic compare-and-swap functions to indicate the return value. This is in preparation for adding support for a compare-and-swap that returns the old value. At the same time the return type has been changed to bool. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-10-31 12:47:23 -06:00
Nathan Hjelm	055f413d1b	opal/asm: add support for and, or, and xor atomics This commit adds additional atomics math operations that are needed throughout the codebase. The semantics of the new operations are consistent with the existing atomics (op then fetch). Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-10-31 11:39:50 -06:00
Ralph Castain	27f3d417ca	Revert the MPI_Init fence operations to use volatile bool instead of thread macros. The problem is that the waiting thread is cycling using OMPI_LAZY_WAIT_FOR_COMPLETION so it can exercise opal_progress. This probably isn't as critical for the modex step, but definitely necessary for the barrier at the end of mpi_init. The problem this creates is that the lazy macro exits as soon as "active" becomes false, and then we destruct the lock. However, wakeup_thread sets "active" to false - and then calls the condition broadcast to wakeup any waiting threads. So there is a race condition between that broadcast and the lock destruct. Add OPAL_ACQUIRE_OBJECT and OPAL_POST_OBJECT memory barriers to help protect against thread race conditions on some platforms Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-31 08:09:02 -07:00
Ralph Castain	7839dc91a8	Sync to PMIx v3.0 (master) Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-30 13:06:41 -07:00
Ralph Castain	36d7e752b6	I think we have all concluded that there is no good answer to locating the external libevent library, so surrender to the situation and simply remove that requirement. Users wanting to utilize the embedded PMIx library can install it, but will have to use mpicc _and_ add an explicit -lpmix to their cmd line to compile their application. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-29 07:39:02 -07:00
Gilles Gouaillardet	5c61a4e3a5	configury: fix handling of external libevent library Search external libevent library in both DIR/lib64 and DIR/lib when --with-libevent=DIR is specified but --with-libevent-libdir=DIR is not Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-10-27 15:52:18 +09:00
Ralph Castain	ea3508b26b	Sync to PMIx master (now v3.0) Fix an apparent typo in external libevent configury Require external libevent for install of separate libpmix Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-26 21:05:17 -07:00
Ralph Castain	01ed7548c4	Update to PMIx v3.0a Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-25 12:25:27 -07:00
Ralph Castain	8fbfe68754	Alter the PMIx embedded configuration so that we can build static with devel headers - if the builder requests that we install a separate libpmix, then don't prefix the PMIx variables. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-24 21:45:27 -07:00
Ralph Castain	292983261a	We should never block when requesting dmodex data from the PMIx server as this will block it from being able to accept connections from local clients. Do not deregister standing dmodx requests when a fence completes unless we actually collected the data in the fence Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-24 07:51:10 -07:00
bosilca	ac348da13a	Merge pull request #4374 from bosilca/topic/osx_syslog Topic/osx syslog	2017-10-23 18:06:36 -04:00
Ralph Castain	6ea3c8a0bd	Update the interlib example to show an alternative method for model declaration. Add a missing range value to the OPAL layer. Make it easier to see OMPI model callbacks Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-23 11:27:42 -07:00
George Bosilca	8f32b345de	Address syslog issues on OSX 10.13 with gcc 7.x gcc 7.[1,2] (at least) fails to correctly parse the OSX 10.13 sys/syslog.h header. As a result we need to protect syslog support in OPAL, PMIX and ORTE. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-10-23 14:02:10 -04:00
Ralph Castain	a63904d47f	Updates to support cross-version operations with OMPI v2.x Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-22 08:38:33 -07:00
Ralph Castain	f8ce31f13c	Fix event registration so OpenMP/MPI coordination sides can both get notification of model declarations Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-19 18:06:38 -07:00
Howard Pritchard	e8bfd494e7	pmix/cray: define fence method for cray pmix Turns out UCX PML calls opal_pmix.fence in its del procs method without checking whether or not the fence method for the pmix component was defined. Rather than patch UCX PML, actually define a fence method for the cray pmix. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-10-17 15:58:01 -06:00
Ralph Castain	60b338e857	Sync to PMIx v3. Ensure prun uses the ess/tool component. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-14 08:24:57 -07:00
Ralph Castain	8ae10c9e1a	Ensure we exit with an appropriate error code when hitting a PMI2 error Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-13 19:30:28 -07:00
Jeff Squyres	fba6990328	Merge pull request #4330 from jsquyres/pr/configure-option-for-show-load-errors configure: add --en|disable-show-load-errors-by-default	2017-10-12 10:42:33 -04:00
Nathan Hjelm	1c52d9dffe	opal/asm: clean up no longer supported architectures We no longer officially support MIPS or ARM before v6. This commit updates the configury to check for sync builtins on these architectures and removes the MIPS and IA64 assembly from opal/include/opal/sys. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-10-11 13:09:29 -06:00
Jeff Squyres	5705192151	configure: add --en|disable-show-load-errors-by-default Give packagers a configure CLI option to set the value of the MCA variable mca_base_component_show_load_errors. The --disable form of this option is intended for Open MPI packagers who tend to enable support for many different types of networks and systems in their packages. For example, consider a packager who includes support for both the FOO and BAR networks in their Open MPI package, both of which require support libraries (libFOO.so and libBAR.so). If an end user only has BAR hardware, they likely only have libBAR.so available on their systems -- not libFOO.so. Disabling load errors by default will prevent the user from seeing potentially confusing warnings about the FOO components failing to load because libFOO.so is not available on their systems. Conversely, system administrators tend to build an Open MPI that is targeted at their specific environment, and contains few (if any) components that are not needed. In such cases, they might want their users to be warned that the FOO network components failed to load (e.g., if libFOO.so was mistakenly unavailable), because Open MPI may otherwise silently failover to a slower network path for MPI traffic. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-10-11 11:02:21 -07:00
Ralph Castain	388034c814	Add support for the -v (verbose) option to prun and silence the "executing" and "completed" output otherwise. Debounce "unreachable" notifications for tools when they disconnect Enable the -x cmd line option for prun Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 0a5b36180a22959654461ac1303cec35313f8b4a)	2017-10-10 12:54:49 -07:00
Ralph Castain	c696e04c5e	Since PMIx is moving to release v3.0, embed the new release candidate in opal/pmix framework. Move the pmix2x code over to the ext2x component. Create a new ext3x component Remove some build product. Tell PMIx that we don't need a new nspace generated when OMPI calls connect Add missing Makefile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-09 13:51:08 -07:00
Ralph Castain	51f3fbdb3e	Fix cmd line passing of DVM URI Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-06 18:10:46 -07:00
Ralph Castain	c3b239cee8	Sync to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-06 12:40:23 -07:00
Ralph Castain	5352c31914	Enable remote tool connections for the DVM. Fix notifications so we "de-bounce" termination calls Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-06 10:47:05 -07:00
Ralph Castain	073eff5dcd	Update to track PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-05 10:50:08 -07:00
Ralph Castain	c341b53475	Fix the embedded hwloc configure to always disable cuda support. Add definitions for updated hwloc objects when old external versions are used Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-04 11:35:20 -07:00
Mohan Gandhi	fa01fad2ca	Merge pull request #4295 from mohanasudhan/iss4131 Btl tcp: Fix racing condition on simultaneous handshake	2017-10-03 17:14:24 -07:00
Mohan Gandhi	6d642e8d94	Btl tcp: Fix racing condition on simultaneous handshake There is a race condition in TCP connection establishment during simultaneous handshake. This PR handles the fix for it. Signed-off-by: Mohan Gandhi <mohgan@amazon.com>	2017-10-03 13:13:43 -07:00
Ralph Castain	3ad5a40ba8	Sync to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-10-03 10:56:30 -07:00
Ralph Castain	57c14cbfed	Sync to PMIx master to pickup a little bug fix Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-09-27 07:54:16 -07:00
Gilles Gouaillardet	b3558f261b	opal/util: initialize proc_hostname in the opal_proc_t constructor Refs open-mpi/ompi#4264 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-09-26 10:47:26 +09:00
Ralph Castain	d5db4ee965	Update to track PMIx master (v2.1.0) Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-09-25 10:24:13 -07:00
Jeff Squyres	db10da97e3	Merge pull request #4257 from jsquyres/pr/moar-hwloc-cuda-cleanup hwloc2a/configure.m4: be more careful in with_cuda->enable_cuda	2017-09-25 11:35:33 -04:00
Brian Barrett	c768f980e1	reachable: Fix string length Coverity warning Make sure hostnames are null terminated, even when they were too long to fit in the hostname buffer. Fixes: CID 1418232 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-24 19:38:45 -07:00
Jeff Squyres	2ec2a329dc	hwloc2a/configure.m4: be more careful in with_cuda->enable_cuda Be a little more deliberate about converting OMPI's --with-cuda CLI value to hwloc's --enable-cuda configure option. Also, unconditionally disable hwloc NVML support (because Open MPI is not currently using it). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-09-24 05:36:23 -07:00
Ralph Castain	5fed7330e7	Update the configure logic to separate the emitting of a libpmix library from with-devel-headers. Instead, we create a new --enable-install-libpmix expressly for that purpose. Continue to link the new library back to libopen-pal to resolve the renamed symbols. Update opal configure logic to set disable_dlopen when disable_mca_dso is given. Fix typos in disable_dlopen when setting variables (incorrect inclusion of quotes) Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-09-22 16:02:57 -07:00
Ralph Castain	3493c43468	Sync to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-09-22 10:48:00 -07:00
Ralph Castain	b4ad81da85	Silence warnings about verbose output Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 2c9655bb631742fd7693e00289d1949f4b2fc155)	2017-09-22 09:05:03 -07:00
Ralph Castain	9edea02b46	Merge pull request #4246 from rhc54/topic/spawn Fully support OMPI spawn options.	2017-09-21 11:23:34 -07:00
Ralph Castain	fe9b584c05	Fully support OMPI spawn options. Fix a bug in the round-robin mappers where we weren't adding nodes to the job map node array, and so resources were not released Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 285d8cfef74ffc899e9c51e1d9c597b7fb2ceb89)	2017-09-21 10:29:27 -07:00
Brice Goglin	84a721d17a	hwloc: disable GL and OpenCL in the hwloc component Open MPI doesn't use GL or OpenCL OS devices, so just disable them in hwloc. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-09-21 08:25:46 -07:00
Jeff Squyres	f5d51dc2f5	hwloc: do not build hwloc CUDA support if --without-cuda used Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-09-21 08:24:54 -07:00
bosilca	ab68aced23	Merge pull request #3738 from bosilca/topic/tcp_event_count Fix the TCP performance impact when BTL not used	2017-09-19 23:08:58 -04:00
Gabe Saba	c6235a9a0f	reachable: add tests Add test suite for netlink and weighted reachable components. We don't have a great way of running components through unit tests today, so make them stand-alone tests that are run with mpirun and such. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00
Brian Barrett	ae122c4b17	reachable: Change ownership to Amazon Amazon is going to use the reachable framework to fix some connection bugs in the TCP BTL, so claim support ownership of the weighted and netlink components. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00
Gabe Saba	9e53605a6f	reachable: Implement netlink component Wire up the libnl utilities Jeff and Ralph added previously to the netlink reachable component so that it actually does work. The algorithm is a bit simplistic, but should work for our use cases. If there's a route, assume the two interfaces can talk. If there's no gateway, assume the two interfaces are in the same subnet, and give preference to that connection. If there's a gateway, assume there's a route, but the interfaces are not in the same subnet. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00
Gabe Saba	4d81006222	reachable: Add IPv6 support to libnl code Add IPv6 support to the netlink component's utility wrappers around libnl-3. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00
Brian Barrett	4d5bfd0429	reachable: Simplify gateway check in netlink The netlink component's libnl wrapper code returned the next hop in the route table to allow the calling code to differentiate between same and different networks, which is a fine comparison for IPv4, but is pretty expensive for IPv6 (coming soon to a netlink component near you). Rather than provide extra information (the address of the next hop), just provide whether there is a gateway or not, which is all the netlink component actually needs. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00
Brian Barrett	a543e7f130	reachable: remove libnl-1 support from netlink The netlink reachable component has never been released in a usable form, but had code copied from usNIC to support both libnl-1 and libnl-3. If nothing else, this code was a little buggy in handling the case where libnl-3 but not libnl-route-3 were installed. Jeff and I decided to drop libnl-1 support from the netlink reachable component, given that it's getting pretty old and the weighted component provides the same information that the TCP BTL and OOB are using today, so libnl-1 customers won't see a step backwards from where they are today. Signed-off-by: Brian Barrett <bbarrett@mazon.com>	2017-09-19 19:42:54 -07:00
Gabe Saba	3f8d294191	reachable: Enable weighted component / fix interface Based on work from usNIC, the best way to use the reachability information the reachable components return is to build a connectivity graph between the two peers and run a bipartite graph solver. Rather than returning the "best" pairing, the reachability framework now returns the entire mapping, allowing a (soon to be added) graph solver to build the "optimal" connectivity pairing. Practically, this means changing the return type of the reachable() function and rewriting the weighted_reachable() function to return the full mapping. The netlink_reachable() function still always returns NULL. At the same time, fix bit-rot in the weighted component and enable builds of the component by removing the opal_ignore. Also, add IPv6 support to the weighted component to support both use cases in the TCP BTL. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00
Gabe Saba	8f2df42055	reachable: Initialize / Finalize reachable framework Initialize the reachable framework during opal_init() and tear it back down during opal_finalize(). The framework was never used, so the lack of initialization didn't matter, but this is a required step in using the framework. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00
Brian Barrett	6048c543fa	reachable: Rename code copied from usnic Ralph and Jeff created the reachable framework and added the netlink component based on code copied from the usnic btl. However, they never renamed all the symbols from the libnl compatibility code. This patch finishes the rename. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00
Brian Barrett	502f383f4d	util: Add link-local check to net interface Add a check for link-local IPv6 addresses to the net interface to support better computation of network pairings in the weighted reachable component. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-09-19 19:42:54 -07:00

5471 commits