openmpi

Author	SHA1	Message	Date
heasterday	2fc5ab7c8f	Update mpool_hugepage_component.c Signed-off-by: Hunter Easterday <heasterday@lanl.gov> (cherry picked from commit `ad0d2c451e`) (cherry picked from commit `509380d99f`)	2019-01-15 08:10:46 -07:00
Nathan Hjelm	b7fbdeb759	btl/vader: minor correction to match ompi coding style Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `edaf08bf6d`)	2019-01-07 16:40:01 -07:00
Nathan Hjelm	140842668f	btl/vader: don't try to set reachability in add_procs if not requested This commit fixes a bug where add_procs can incorrectly return an error when going through the dynamic add_procs path. This doesn't happen normally, only when pml/ob1 is not in use. References #6201 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `30b8336cb4`)	2019-01-07 14:20:33 -07:00
Gilles Gouaillardet	61108b6228	pmix/ext3x: fix support for external PMIx v3.1 The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard. Open MPI does not really need them (they are only used to be reported as not supported), so simply #ifdef protect them to support an external PMIx v3.1. The change only needs to be done in ext3x/ext3x.c. But since this file is automatically generated from pmix3x/pmix3x.c, we have to update the latter file. Refs. open-mpi/ompi#6247 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (back-ported from commit open-mpi/ompi@950ba16aa1)	2019-01-07 20:27:34 +09:00
Howard Pritchard	fd2259774d	Merge pull request #6167 from ggouaillardet/topic/v4.0.x/btl_uct_fix_warning btl/uct: fix misc warnings	2018-12-21 15:31:16 -07:00
Howard Pritchard	4cce2b84fa	Merge pull request #6176 from ggouaillardet/topic/v4.0.x/uct_configury v4.0.x: btl/uct: fix a typo in configure.m4	2018-12-11 09:19:06 -07:00
Howard Pritchard	ebd3421e53	Merge pull request #6165 from ggouaillardet/topic/v4.0.x/pmix_atomics pmix/pmix3x: fix macros usage in embedded pmix3x	2018-12-11 09:17:17 -07:00
Gilles Gouaillardet	f446472f06	btl/uct: fix a typo in configure.m4 remove whitespace around '=' when setting btl_uct_LIBS Thanks Ake Sandgren for reporting this Refs. open-mpi/ompi#6173 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@b89deeb1bb)	2018-12-11 15:15:12 +09:00
Gilles Gouaillardet	dd2b1ce49b	btl/uct: fix a warning Use the PRIsize_t macro to correctly print a size_t Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@78aa6fdd1d)	2018-12-07 17:02:16 +09:00
Gilles Gouaillardet	ffbe85c65f	btl/uct: fix AC_CHECK_DECLS usage AC_CHECK_DECLS takes a comma-separated list of macros/symbols, so replace the whitespace separator with a comma. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@b715dd2657)	2018-12-07 16:27:57 +09:00
Gilles Gouaillardet	195a07d03d	pmix/pmix3x: fix macros usage in embedded pmix3x Use PMIX_* macros instead of OPAL_* macros master does things differently, so this is a one-off commit Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-12-07 16:07:07 +09:00
Nathan Hjelm	0957861689	btl/uct: fix some issues when using UCX over ugni Though not a recommended configuration it is possible to use Open MPI over UCX over uGNI. This configuration had some issues related to the connection management and tl selection. This commit fixes those issues. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `e07a64c52d`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-12-06 10:57:59 -07:00
Gilles Gouaillardet	057118dbe6	btl/uct: fix AC_CHECK_DECLS usage AC_CHECK_DECLS takes a comma-separated list of macros/symbols, so replace the whitespace separator with a comma. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit `b715dd2657`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-12-06 10:56:45 -07:00
Joshua Hursey	e1f75d5ff1	Add OPAL_VPID to unpacking * Needed to properly read PMIx job data like the following - `OPAL_PMIX_LOCALLDR` - `OPAL_PMIX_RANK` - `OPAL_PMIX_GLOBAL_RANK` - `OPAL_PMIX_APPLDR` - `OPAL_PMIX_APP_RANK` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com> (cherry picked from commit `a557c4130c`)	2018-11-21 11:48:58 -06:00
Howard Pritchard	9275b692b7	Merge pull request #6043 from hjelmn/v4.0.x_fix_this_damn_memory_barrier_bug_that_is_referenced_in_github_bug_6014_now_lets_get_this_release_out_the_door v4.0.x: opal/asm: work around possible gcc compiler bug	2018-11-08 08:01:33 -07:00
Nathan Hjelm	5efc76ef44	pmix3x: fix potential memory barrier bug with __atomic builtin atomics See open-mpi/ompi#6014 for more information. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-11-06 10:37:14 -07:00
Nathan Hjelm	e57c3fb3c9	opal/asm: work around possible gcc compiler bug It seems in some cases (gcc older than v6.0.0) the __atomic_thread_fence is a no-op with __ATOMIC_ACQUIRE. This appears to be the case with X86_64 so go ahead and use __ATOMIC_SEQ_CST for the x86_64 read memory barrier. This should not cause any performance issues as it is equivalent to the memory barrier in the hand-written atomics. References #6014 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `30119ee339`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-11-06 10:28:13 -07:00
Howard Pritchard	bbfde1533b	btl/openib: fix a problem with ib query Under certain circumstances, ibv_exp_query_device was returning an error due to uninitialized fields in the extended attributes struct. Fixes: #5810 Fixes: #5914 Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit `8126779a35`)	2018-11-02 10:42:17 -06:00
Sergey Oblomov	0846c9d112	COMMON/UCX: added error code to log output Also fixes a PGI compilation error with --enable-debug. Signed-off-by: Geoff Paulsen <gpaulsen@users.noreply.github.com> Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit `1099d5f023`)	2018-10-30 09:55:25 -05:00
Gilles Gouaillardet	7f36dce348	event/external: fix version requirement Only default to the external component if its version is greater than or equal to that of the internal libevent (2.0.22) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit `b205039205`)	2018-10-29 15:20:48 -06:00
Gilles Gouaillardet	40a1f0c214	event/external: misc configury fixes - Always use the external component when configure'd with --with-libevent=external - Fix the external libevent library version detection by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros - Use the event2/event.h header (event.h is deprecated since libevent 2.0) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit `35e77a286c`)	2018-10-29 15:20:32 -06:00
Nathan Hjelm	270ae73611	btl/uct: update for UCT_CB_FLAG_SYNC removal Signed-off-by: Nathan Hjelm <hjelmn@me.com> (cherry picked from commit `1b37328ba8`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-23 10:18:43 -06:00
Howard Pritchard	e8f28a5506	Merge pull request #5882 from hjelmn/btl_uct_updates_for_the_v4_branch btl/uct: bug fixes and general improvements	2018-10-22 13:12:34 -06:00
Howard Pritchard	210b4c60aa	remove some dead crs components Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit `6564d3d217`)	2018-10-17 16:16:36 -06:00
Nathan Hjelm	e6f84e79de	btl/uct: fix deadlock in connection code This commit fixes a deadlock that can occur when using a TL that supports the connect to endpoint model. The deadlock was occurring while processing incoming connection requests. This was done from an active-message callback. For some unknown reason (at this time) this callback was sometimes hanging. To avoid the issue the connection active-message is saved for later processing. At the same time I cleaned up the connection code to eliminate duplicate messages when possible. This commit also fixes some bugs in the active-message send path: - Correctly set all fragment fields in prepare_src. - Fix bug when using buffered-send. We were not reading the return code correctly (which is in bytes). This resulted in a message getting sent multiple times. - Don't try to progress sends from the btl_send function when in an active-message callback. It could lead to deep recursion and an eventual crash if we get a trace like send->progress->am_complete->ob1_callback->send->am_complete... Closes #5820 Closes #5821 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `707d35deeb`) Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-16 19:16:11 -06:00
Nathan Hjelm	eaa98af52c	opal/free_list: fix race condition There was a race condition in opal_free_list_get. Code throughout the Open MPI codebase was assuming that a NULL return from this function was due to an out-of-memory condition. In some cases this can lead to a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for example). Before this commit opal_free_list_get_mt looked like this: ```c static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist) { opal_free_list_item_t *item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); if (OPAL_UNLIKELY(NULL == item)) { opal_mutex_lock (&flist->fl_lock); opal_free_list_grow_st (flist, flist->fl_num_per_alloc); opal_mutex_unlock (&flist->fl_lock); item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); } return item; } ``` The problem is that in a multithreaded environment it is possible for the free list to be grown successfully but the thread calling opal_free_list_get_mt to be left without an item. This happens if between the calls to opal_lifo_push_atomic in opal_free_list_grow_st and the call to opal_lifo_pop_atomic other threads pop all the items added to the free list. This commit fixes the issue by ensuring the thread that successfully grew the free list always gets a free list item. Fixes #2921 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `5c770a7bec`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 15:28:20 -06:00
Nathan Hjelm	0c4ba45af2	btl/uct: use the correct tl interface attributes It is apparently possible for different instances of the same UCT transport to have different limits (max short put for example). To account for this we need to store the attributes per TL context not per TL. This commit fixes the issue. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `6ed68da870`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-11 11:34:33 -06:00
Geoff Paulsen	c8ff7e3ef2	Merge pull request #5874 from rhc54/cmr40/config v4.0.0: Fix configury for internal PMIx	2018-10-10 10:47:26 -05:00
Nathan Hjelm	1153082a0f	btl/uct: bug fixes and general improvements This commit updates the uct btl to change the transports parameter into a priority list. It adds the dc_mlx5, rc_mlx5, and ud transports to the priority list. This will give better out of the box performance for multi-threaded codes because the *_mlx5 transports can avoid the mlx5 lock inside libmlx5_rdmav2. This commit also fixes a number of leaks and a possible deadlock when using RDMA. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `39be6ec15c`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-09 16:16:50 -06:00
Ralph Castain	376e2e4d98	Add missing file to tarball Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-09 06:46:09 -07:00
Ralph Castain	40270bd24b	Minor cleanups to the pmix/ext2x component Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-09 06:42:08 -07:00
Ralph Castain	12790e8ec6	Protect PMIx from bad configure entry Ignore with-hwloc=internal or external as those are meaningless to pmix (will upstream) Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit `c498a7e77a`)	2018-10-09 03:45:58 -07:00
Ralph Castain	3e2cc6f46a	Fail configure if pmix won't build If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit `f379ba9c8e`)	2018-10-09 03:45:37 -07:00
Nathan Hjelm	fba5eda436	btl/vader: fix race condition in writing header Signed-off-by: Nathan Hjelm <hjelmn@me.com> (cherry picked from commit `8291f6722d`) Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-08 08:49:01 -06:00
Nathan Hjelm	fa768d748f	btl/vader: work around Oracle compiler bug This commit works around an Oracle C compiler bug in 5.15 (not sure when it was introduced). The bug is triggered when we chain assignments of atomic variables. Ex: _Atomic intptr_t x, y; intptr_t z = 0; x = y = z; Will produce a compiler error of the form: operand cannot have void type: op "=" assignment type mismatch: long "=" void To work around the issue we are removing the chain assignment and setting the head and tail on different lines. Fixes #5814 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `dfa8d3a81a`)	2018-10-03 11:36:17 -04:00
Nathan Hjelm	df6dd69db8	btl/vader: ensure the fast box tag is always read first On some platforms reading a 64-bit value is non-atomic and it is possible that the two 32-bit values are read in the wrong order. To ensure the tag is always read first this commit reads the tag before reading the full 64-bit value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `66a7dc4c72`)	2018-10-03 11:36:17 -04:00
Geoff Paulsen	593d652077	Merge pull request #5823 from jsquyres/pr/v4.0.x/fix-tcp-btl-show-help-ip-address v4.0.x: btl/tcp: output the IP address correctly	2018-10-03 08:29:37 -05:00
George Bosilca	b63bee5da4	Small pedantic fixes. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit `a3a492b42c`)	2018-10-02 14:37:31 -04:00
George Bosilca	d450e460d6	Provide the correct socklen to bind. Get Brian's patch from #5825 and his log message: Fix a failure in binding the initiating side of a connection on MacOS. MacOS doesn't like passing the size of the storage structure (sockaddr_storage) instead of the expected size of the structure (sockaddr_in or sockaddr_in6), which was causing bind() failures. This patch simply changes the structure size to the expected size. Add a more clear error message in debug mode. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit `9164e26e2f`)	2018-10-02 14:37:31 -04:00
Jeff Squyres	81f2f19398	btl/tcp: output the IP address correctly Per https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673, it looks like the IP address for a given interface is being stashed in two places: on the endpoint and on the module. 1. On the endpoint, it is storing the moral equivalent of a (struct sockaddr_in.sin_addr). 2. On the module, it is storing a full (struct sockaddr_storage). The call to opal_net_get_hostname() expects a full (struct sockaddr*) -- not just the stripped-down (struct sockaddr_in.sin_addr). Hence, when the original code was passing in the endpoint's (struct sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it like a (struct sockaddr), hilarity ensued (i.e., we got the wrong output). This commit eliminates the call to opal_net_get_hostname() and just calls inet_ntop() directly to convert the (struct sockaddr_in.sin_addr) to a string. NOTE: Per the github comment cited above, there can be a disparity between the IP address cached on the endpoint vs. the IP address cached on the module. This only happens with interfaces that have more than one IP address. This commit does not fix that issue. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `5dae086f7e`)	2018-10-02 10:49:51 -04:00
Sergey Oblomov	68d3baffd5	OPAL/COMMON/UCX: used __func__ macro instead of __FUNCTION__ - used __func__ macro instead of __FUNCTION__ to unify macro usage with other components Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit `9a51e257d1`)	2018-09-27 12:04:07 +03:00
Jeff Squyres	8bdf4553d9	mca_base_var: fix output bug about settable vars Fix the test that determined whether we output "writeable" or "read-only" for MCA vars (it was checking the wrong flag). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `176da51aec`)	2018-09-24 14:14:51 -07:00
Geoff Paulsen	3d4164e1e1	Merge pull request #5752 from gpaulsen/misc-warnings-fixes Miscellaneous compiler warning stomps.	2018-09-22 15:01:53 -05:00
Geoff Paulsen	556367af31	Merge pull request #5754 from gpaulsen/event_threading opal/progress: protect against multiple threads in event base	2018-09-22 15:00:32 -05:00
Mark Allen	5ac3fac6c2	snprintf() length fix for info The important part of this fix is a couple places 5 was hard-coded that needed to be strlen(OPAL_INFO_SAVE_PREFIX). But also this contains a fix for a gcc 7.3.0 compiler warning about snprintf(). There was an "if" statement making sure all the arguments had appropriate strlen(), but gcc still complained about the following snprintf() because the size of the struct element is iterator->ie_key[OPAL_MAX_INFO_KEY + 1]. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2018-09-21 14:47:11 -05:00
Nathan Hjelm	cd88e307fd	opal/progress: protect against multiple threads in event base libevent does not support multiple threads calling the event loop on the same event base. This causes external libevent's to print out re-entrant warning messages. This commit fixes the issue by protecting the call to the event loop with an atomic swap check. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-21 14:40:08 -05:00
Jeff Squyres	2e37f97a38	Miscellaneous compiler warning stomps. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `fe0852bcb4`)	2018-09-21 14:35:51 -05:00
Ralph Castain	192f0f6fff	Remove the OFI/BTL component Remove this component pending re-architecture of the overall OFI components. We have had similar issues before when multiple components use the same library - typical issues are race conditions, initialize and finalize errors, etc. We are seeing similar problems here as we get broader exposure to different library version and environment combinations. The correct fix in the past has been to centralize the library interactions in a "common" component. We will pursue that here by moving some additional functions (e.g., endpoint creation) into the existing opal/mca/common/ofi component. We can't do that and thoroughly test it in time for the v4.0.0 release, so we'll simply remove this component from the release. Once we have things correctly fixed, we'll submit a PR to restore the component plus the related fixes to some future v4.x release. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-20 18:09:15 -07:00
Geoff Paulsen	4688da0631	Merge pull request #5736 from hoopoepg/topic/topic/common-del-procs-v4.0 MCA/COMMON/UCX: del_procs calls are unified to common module - v4.0	2018-09-20 18:12:25 -05:00
Geoff Paulsen	8dbfb9b032	Merge pull request #5745 from bwbarrett/v4.0.x-cuda-async openib: Disable CUDA async by default	2018-09-20 18:09:32 -05:00
Geoff Paulsen	4d9deb33ec	Merge pull request #5688 from rhc54/cmr40/pmix302 v4.0.x: Update to PMIx v3.0.2	2018-09-20 16:26:06 -05:00
Brian Barrett	a4fabb7e26	openib: Disable CUDA async by default Disable async receive for CUDA under OpenIB. While a performance optimization, it also causes incorrect results for transfers larger than the GPUDirect RDMA limit. This change has been validated and approved by Akshay. References #3972 Signed-off-by: Brian Barrett <bbarrett@amazon.com> (cherry picked from commit `9344afd485`) Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-09-20 14:20:54 -07:00
Howard Pritchard	27732bdf33	Merge pull request #5738 from hppritcha/topic/remove_scif_support_4.0.x SCIF: remove it	2018-09-19 12:43:26 -06:00
Howard Pritchard	730f98d8b0	Merge pull request #5665 from hjelmn/v4.0.x_cache_flush v4.0.x: patcher/base: improve instruction cache flush for aarch64	2018-09-19 12:12:57 -06:00
Howard Pritchard	01d4d52588	SCIF: remove it KNC is effectively dead. Remove corresponding SCIF support in Open MPI. cherry pick of PR #5737 + news update Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit `b9ac3d8931`)	2018-09-19 11:48:17 -06:00
Sergey Oblomov	3cace87749	MCA/COMMON/UCX: del_procs calls are unified to common module Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit `920cc2e0d9`)	2018-09-19 10:47:27 +03:00
Ralph Castain	131ea01320	Update to PMIx v3.0.2 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-18 13:03:18 -07:00
Howard Pritchard	06c62c6b30	Merge pull request #5685 from selvintxavier/brcm_hcas_v4.0.x v4.0.x: Add support for different Broadcom HCAs	2018-09-17 16:29:45 -06:00
Selvin Xavier	9114a9ac95	v4.0.x: Add support for different Broadcom HCAs Adds device ids of different Broadcom adapters from BCM57XXX and BCM58XXX family of HCAs. Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> (cherry-picked from `a53a6f7650`)	2018-09-16 22:54:07 -07:00
Nathan Hjelm	83668f4b47	btl/vader: ensure that the send tag is always written last To ensure fast box entries are complete when processed by the receiving process the tag must be written last. This includes a zero header for the next fast box entry (in some cases). This commit fixes two instances where the tag was written too early. In one case, on 32-bit systems it is possible for the tag part of the header to be written before the size. The second instance is an ordering issue. The zero header was being written after the fastbox header. Fixes #5375, #5638 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `850fbff441`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-14 12:34:26 -06:00
Nathan Hjelm	1cc295e4c4	patcher/base: improve instruction cache flush for aarch64 This commit updates the patcher component to either use the __clear_cache intrinsic or the correct assembly to flush the instruction cache. Fixes #5631 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `1cdbceb095`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-10 15:44:50 -06:00
Geoff Paulsen	17aab5ea5b	Merge pull request #5659 from ggouaillardet/topic/v4.0.x/misc_finalize_leaks Plug misc leaks on MPI_Finalize()	2018-09-10 14:06:31 -05:00
Howard Pritchard	38171be7ac	Merge pull request #5656 from jsquyres/pr/v4.0.x/dont-let-openib-yell v4.0.x: btl/openib: don't complain about no NICs	2018-09-10 09:39:22 -06:00
Gilles Gouaillardet	d2646cd565	mpool/memkind: plug a memory leak in mca_mpool_memkind_close() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@ad2c207a7e)	2018-09-10 09:19:57 +09:00
Gilles Gouaillardet	9410de0d27	opal/util: plug a memory leak in the opal_infosubscriber_t destructor Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@7556dd0abb)	2018-09-10 09:19:34 +09:00
Gilles Gouaillardet	baf41aceed	pmix/pmix3x: plug a memory leak in external_register() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@aeddd2f249)	2018-09-10 09:17:31 +09:00
Gilles Gouaillardet	8015e6f929	pmix/base: plug a memory leak in opal_pmix_base_select() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@6e47c5708e)	2018-09-10 09:16:56 +09:00
Geoff Paulsen	954692f06e	Merge pull request #5614 from karasevb/v4.0.x_fix_hwloc_numa_obj v4.0.x: Fixed the NUMA obj detection for hwloc ver >= 2.0.0	2018-09-07 14:49:26 -05:00
Jeff Squyres	38c7364896	btl/openib: don't complain about no NICs Since openib is on its long, slow way out the door, don't let it complain about not being able to find any NICs at run time. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `098ec55e37`)	2018-09-07 12:01:33 -07:00
Nathan Hjelm	2fb1a5e1b2	btl/uct: add missing opal_mem_hooks_unregister_release call This commit fixes a bug when using the UCT btl with the UCX memory hooks disabled. We were missing a call to opal_mem_hooks_unregister_release to remove the btl memory hook callback. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `36c206d2d6`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-05 14:02:58 -06:00
Howard Pritchard	b6dafb6b90	Merge pull request #5636 from jsquyres/pr/v4.0.x/verbs-usnic-configury-moar-strictness v4.0.x: make common/verbs-usnic actually check if it can compile	2018-09-04 09:13:12 -06:00
Geoff Paulsen	3282c61048	Merge pull request #5625 from hoopoepg/topic/optimize-blocked-calls-v4.0 PML/UCX: blocked calls optimizations - v4.0	2018-08-31 14:11:11 -05:00
Geoff Paulsen	334748753c	Merge pull request #5626 from hoopoepg/topic/opal-mem-hooks-syno-v4.0 MCA/COMMON/UCX: added synonym to opal_mem_hook variable - v4.0	2018-08-31 14:09:14 -05:00
Geoff Paulsen	b2daa0001f	Merge pull request #5565 from rhc54/cmr40/pmix301 Update to PMIx 3.0.1	2018-08-31 13:58:41 -05:00
Jeff Squyres	3e842348d1	common/verbs-usnic: check that it will actually compile If someone specifies --with-verbs-usnic, actually do a configury check to ensure that it will compile (vs. assuming that it will compile if someone asks for it). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `05e5f61fe1`)	2018-08-30 14:56:37 -07:00
Sergey Oblomov	028bcb8a73	MCA/COMMON/UCX: added synonym to opal_mem_hook variable - added synonym to common ucx variables so that they can be printed by opal_info -a Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit `e00f7a68ba`)	2018-08-29 15:17:00 +03:00
Sergey Oblomov	9215eb9a3b	PML/UCX: blocked calls optimizations - refactoring of opal/UCX progress calls - added UCX progress priority Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit `b0f87f2235`)	2018-08-29 14:38:22 +03:00
Boris Karasev	31ca3842da	Fixed copyrights of prev commit. Signed-off-by: Boris Karasev <karasev.b@gmail.com> (cherry picked from commit `beb0697f24`)	2018-08-28 12:29:16 +03:00
Boris Karasev	d995fb1b3f	Fixed the NUMA obj detection for hwloc ver >= 2.0.0 Since version 2.0.0, hwloc has a new organization of NUMA nodes on the topology tree. This commit adds the detection of the local NUMA object for hwloc >= 2.0.0, which fixes the procs binding policy for the rmaps mindist component. Signed-off-by: Boris Karasev <karasev.b@gmail.com> (cherry picked from commit `e5291ccc34`)	2018-08-28 12:29:08 +03:00
Zoltán Mizsei	b2628129fd	fcntl include bugfix Signed-off-by: Zoltán Mizsei <zmizsei@extrowerk.com> (cherry picked from commit open-mpi/ompi@ac3f8a16ed)	2018-08-27 09:48:54 +09:00
Jeff Squyres	4fd51a1563	Merge pull request #5592 from hjelmn/v4.0.x_sc_emu btl/vader: clean up debugging and squash warning	2018-08-23 17:15:09 -07:00
Nathan Hjelm	eba44d3709	btl/vader: clean up debugging and squash warning References #5512 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `c74cf666a9`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-23 15:53:43 -06:00
Howard Pritchard	c2cc336135	Merge pull request #5486 from rhc54/cmr40/maps Cleanup pmix selection and map-by modifiers	2018-08-22 04:40:00 -04:00
Ralph Castain	e27e945d9a	Complete job control integration Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-08-20 16:08:54 -07:00
Ralph Castain	3eef3d1d8f	Update to PMIx 3.0.1 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-08-20 14:00:41 -07:00
Boris Karasev	8873d901e8	pmix: added check for pmix fence status Signed-off-by: Boris Karasev <karasev.b@gmail.com> (cherry picked from commit `57683366ca`) Conflicts: opal/mca/common/ucx/common_ucx.c opal/mca/common/ucx/common_ucx.h Modified: ompi/mca/pml/ucx/pml_ucx.c oshmem/mca/spml/ucx/spml_ucx.c	2018-08-17 21:33:50 +06:00
Geoff Paulsen	8483eb4bf7	Merge pull request #5471 from hjelmn/v4.0.x_uct_btl_fix btl/uct: fix compile warnings/errors	2018-08-15 16:30:40 -05:00
Geoff Paulsen	98bd571cc8	Merge pull request #5472 from ggouaillardet/topic/v4.0.x/prefer-externals v4.0.x: Prefer external hwloc and libevent	2018-08-15 16:27:33 -05:00
Nathan Hjelm	b4f80e4e36	btl/vader: move memory barrier to where it belongs The write memory barrier was intended to precede setting a fast-box header but instead follows it. This commit moves the memory barrier to the intended location. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `dca3516765`)	2018-08-14 09:19:48 -07:00
Jeff Squyres	72e5766a56	hwloc201/configure.m4: make it safe when used with hwloc:external The Autoconf AC_CONFIG_* macros can only be instantiated exactly once for any given file, and they must be in a code execution path at run time for the target file to be generated at the end of configure. For example, if you want to generate file ABC at the end of configure, you must invoke the AC_CONFIG_FILES(ABC) macro in a code path that will get executed when configure is run. That's pretty straightforward. What's not straightforward is two corner cases: 1. You cannot invoke the AC_CONFIG_FILES(ABC) macro for the same file more than once. If you do, autoreconf will fail (even before you can run configure). 2. If AC_CONFIG_FILES(ABC) is not in a code path that is executed by configure, the file ABC is not registered properly, and ABC will not be generated at the end of configure. This applies to hwloc because hwloc's HWLOC_SETUP_CORE macro calls both AC_CONFIG_FILES and AC_CONFIG_HEADER to set up its Makefiles (etc.) so that targets like "make distclean" and "make distcheck" will work properly. Hence, we have to invoke HWLOC_SETUP_CORE. However, the MCA_opal_hwloc_hwloc201_CONFIG macro has a few side effects. It would be nice to be able to do something like this: ``` if hwloc:external is going to be used: Invoke minimal HWLOC_SETUP_CORE (with no side effects) else Invoke full HWLOC_SETUP_CORE (with side effects) fi ``` But we can't, because autoreconf will detect that AC_CONFIG_FILES has been invoked on the same files more than once (regardless of whether those code paths will be executed at run time or not). Kaboom. Similarly, we can't do this: ``` if hwloc:external is not going to be used: Invoke full HWLOC_SETUP_CORE (with side effects) fi ``` Because then hwloc's AC_CONFIG_FILES won't be registered properly when hwloc:external is used (i.e., when the HWLOC_SETUP_CORE macro is not in a code path that is executed at run time), and targets like "make distclean" will fail because hwloc's Makefiles won't have been set up. Kaboom. But remember that the hwloc framework is a bit special: there will only ever be 2 components: external and internal. External is guaranteed to be configured first because of its priority. So the internal component (i.e., this component) immediately knows if it is going to be used or not based on whether the external component configuration succeeded or failed. Specifically: regardless of whether the internal component (i.e., this component) is going to be used, we have to invoke HWLOC_SETUP_CORE. But we can manage the side effects: allow the side effects when this/internal component is going to be used, and avoid the side effects when this/internal component is not going to be used. This is a little less clean than I would have liked, but because of Autoconf's oddity about its AC_CONFIG_* macros, this is the only solution I could come up with. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `01e4570af7`)	2018-08-11 12:00:35 -07:00
Jeff Squyres	714f203985	libevent2022/configure.m4: always invoke sub-configure In order to make "make distclean" (and friends) work, we need to always invoke the embedded configure script -- even if we know that we're not going to use this component. But in cases where we know we're not going to use this component, we also need to avoid the side effects of the code path that is used when we do want to use this component. So split the two possibilities into two different macros: 1. MCA_opal_event_libevent2022_FAKE_CONFIG: which does almost nothing except invoke the underlying "configure" script. 2. MCA_opal_event_libevent2022_REAL_CONFIG: which does all the real work (including invoking the underlying "configure" script). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `69aa46e167`)	2018-08-11 12:00:35 -07:00
Jeff Squyres	5c5246f655	libevent2022/configure.m4: trivial cleanup Put argument to AM_CONDITIONAL inside []. No code or logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `80df3f040b`)	2018-08-11 12:00:35 -07:00
Jeff Squyres	63d68ded48	libevent2022/configure.m4: minor comment cleanup Change # -> dnl. No code or logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `17aa64e438`)	2018-08-11 12:00:35 -07:00
Jeff Squyres	6cb3d61dd1	libevent2022: only configure if event:external fails We know that event:external will be configured first (because of its priority). Take advantage of that here in libevent2022 by having it refuse to configure / politely fail if event:external succeeded. Also print out some additional lines in configure output indicating what is going on (i.e., event:external succeeded, so this component will be skipped, or event:external failed, so this component will be used). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `b063cb6b0f`)	2018-08-09 06:47:08 -07:00
Jeff Squyres	eca16720de	hwloc201: only configure if hwloc:external fails We know that hwloc:external will be configured first (because of its priority). Take advantage of that here in hwloc201 by having it refuse to configure / politely fail if hwloc:external succeeded. Also print out some additional lines in configure output indicating what is going on (i.e., hwloc:external succeeded, so this component will be skipped, or hwloc:external failed, so this component will be used). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit `4e5f432786`)	2018-08-09 06:47:08 -07:00
Ralph Castain	0cdf49ed8a	Fix typo Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit `f7a537cf04`)	2018-07-25 21:37:03 -07:00
Ralph Castain	511319c316	Fix the multiple pe/proc option Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up. Thanks to @tonyreina for the error report Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit `bcdb1f45ac`)	2018-07-25 19:55:28 -07:00
Ralph Castain	04054d63eb	Cleanup pmix selection check Allow for versions > 3 Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit `55cefedf9b`)	2018-07-25 19:55:18 -07:00
Ralph Castain	2c2f9b8169	Leave opal_event_external_support exposed as global var Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit open-mpi/ompi@5cab823979)	2018-07-24 09:53:00 +09:00
Jeff Squyres	1ea021933f	event/external: prefer external event component Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit open-mpi/ompi@a70ecf5267)	2018-07-24 09:52:58 +09:00
Jeff Squyres	6f5a453492	event: trivial comment change Switch from #-style to dnl-style. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit open-mpi/ompi@83e4a45a9f)	2018-07-24 09:52:56 +09:00
Gilles Gouaillardet	aa7a4d0f6f	hwloc: prefer external hwloc component Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit open-mpi/ompi@ce2c9fffd4)	2018-07-24 09:52:53 +09:00
Nathan Hjelm	b6bd3d33f1	btl/uct: fix compile warnings/errors Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit `47ed8e8830`) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-23 14:05:17 -06:00
Ralph Castain	508c3f391f	Default to internal PMIx if newer than external Per https://github.com/open-mpi/ompi/issues/5031, if the user didn't specify a particular PMIx installation, then default back to the internal version if it is newer than the discovered external one. PMIx doesn't yet provide a full signature so we have to just get as close as possible for now. Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit `1e6aaf7f22`)	2018-07-19 11:59:17 -07:00
Howard Pritchard	4447738098	Merge pull request #5414 from hppritcha/topic/iwarp_only_by_default btl/openib: only look for iwarp/roce by default	2018-07-17 20:08:00 -06:00
Howard Pritchard	bc8134dae1	Merge pull request #5448 from thananon/ofi_context btl/ofi: Added FI_CONTEXT as requirement.	2018-07-17 19:51:13 -06:00
Howard Pritchard	6818272392	btl/openib: only look for iwarp/roce by default Due to decreasing support by vendors/other orgs for the OpenIB BTL, only look for iWarp/RoCE devices by default. Allow IB HCAs with ports configured for ethernet. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-07-17 19:11:37 -06:00
Thananon Patinyasakdikul	033c364ee0	btl/ofi: Added FI_CONTEXT as requirement. OFI BTL uses context for completion but never asks for it in fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT to eliminate any potential error. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-07-17 12:18:43 -07:00
Sergey Oblomov	a4b8253fa2	MCA/COMMON/UCX: fixed initialization of malloc hooks Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-17 20:09:50 +03:00
Sergey Oblomov	1c7ae22dfb	MCA/COMMON/UCX: shift opal memhooks into common UCX Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-17 13:46:38 +03:00
Ralph Castain	4a596d35f7	Remove the PMIx ext4x component Update configury to redirect anything at or above v3 to the ext3x component Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-13 19:51:50 -07:00
Nathan Hjelm	9d3a79925b	btl/vader: fix bugs in rma emulation This commit fixes two bugs in the RMA/atomic emulation code: 1) Fix a fragment leak when using AMO emulation. 2) Always initialize the single-copy emulation code. This is required to use the AMO support. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-12 15:50:50 -06:00
Ralph Castain	fdca304268	Default to external PMIx installation Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-10 16:12:52 -07:00
Nathan Hjelm	8b090103e2	opal/fifo: fix 128-bit atomic fifo on Power9 This commit updates the atomic fifo code to fix a consistency issue observed on Power9 systems when builtin atomics are used. The cause was two things: 1) a missing write memory barrier in fifo push, and 2) a read ordering issue when reading the fifo head non-atomically. This commit fixes both issues and appears to correct the inconsistency. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-10 15:37:11 -06:00
KAWASHIMA Takahiro	3e179ba95f	hwloc/external: Suppress missing-include-dirs warning If OMPI is configured with `--with-hwloc=external` or `--with-hwloc=DIR` and gfortran is used, I see a lot of warnings when compiling files under the `ompi/mpi/fortran` directory. ``` f951: Warning: Nonexistent include directory 'BUILD_DIR/opal/mca/hwloc/external/hwloc/include' [-Wmissing-include-dirs] ``` There is no such `include` directory in the source tree or in the `configure`-created tree. I think these lines in the `configure.m4` file were wrongly copied in the past from the one for the embedded `hwlocXXX` component. The `-Wmissing-include-dirs` option is enabled in gfortran by default but it is not enabled by default (or even with `-Wall`) in gcc and g++. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2018-07-09 10:55:33 +09:00
Yossi Itigin	e77e31b50b	Merge pull request #5378 from hoopoepg/topic/unify-ucx-logging MCA/COMMON/UCX: unified logging across all UCX modules	2018-07-08 12:45:26 +03:00
Ralph Castain	17c4cf0db8	Install PMIx v3.0.0 release Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-06 06:38:02 -07:00
Sergey Oblomov	bef47b792c	MCA/COMMON/UCX: unified logging across all UCX modules - added common logging infrastructure for all UCX modules - all UCX modules are switched to new infra Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-05 16:25:39 +03:00
Sergey Oblomov	8080283b3d	MCA/COMMON/UCX: changed return type for wait_request - for now wait_request returns OMPI status - updated callers Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-04 23:29:38 +03:00
Sergey Oblomov	f574c14e3a	ATOMICS/UCX: redefine atomic module API - now it accepts integer values directly instead of pointers Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-04 14:41:45 +03:00
Yossi Itigin	4962651567	Merge pull request #5366 from hoopoepg/topic/mca-common-ucx-unify-2 MCA/COMMON/UCX: minor unification of del_procs calls	2018-07-04 14:38:37 +03:00
Nathan Hjelm	bd5cd62df9	btl/ugni: fix up some warnings Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-03 16:30:44 -06:00
Nathan Hjelm	d8916a4672	btl/ugni: fix race condition in completing frags The descriptor flags field in a fragment was being read after the fragment may have been freed. This commit reads the flags before calling the user callback. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-03 10:48:54 -06:00
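The ordering fix in d8916a4672 can be illustrated with a minimal sketch. This is not the btl/ugni code: `frag_t`, `FRAG_FLAG_LAYER_OWNED`, `frag_complete`, and the helpers are hypothetical names chosen for illustration. The only point shown is that the flags are copied into a local variable before the callback runs, because the callback may free the fragment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag: the transport layer, not the user, releases the fragment. */
#define FRAG_FLAG_LAYER_OWNED 0x1u

typedef struct frag frag_t;
typedef void (*frag_cb_t)(frag_t *frag);

struct frag {
    uint32_t  flags;
    frag_cb_t cb;
};

/* Complete a fragment. Returns 1 if this function released it, 0 otherwise. */
static int frag_complete(frag_t *frag)
{
    assert(frag != NULL);

    /* Snapshot everything we need first: once cb() returns, frag may be gone. */
    const uint32_t  flags = frag->flags;
    const frag_cb_t cb    = frag->cb;

    if (cb != NULL) {
        cb(frag);
    }

    /* Only the local copy of the flags is consulted from here on. */
    if (flags & FRAG_FLAG_LAYER_OWNED) {
        free(frag);
        return 1;
    }
    return 0;
}

/* A user callback that frees the fragment it was handed. */
static void user_frees(frag_t *frag)
{
    free(frag);
}

/* Build a fragment and complete it; returns -1 if allocation fails. */
static int complete_new_frag(uint32_t flags, frag_cb_t cb)
{
    frag_t *frag = malloc(sizeof(*frag));
    if (frag == NULL) {
        return -1;
    }
    frag->flags = flags;
    frag->cb    = cb;
    return frag_complete(frag);
}
```

In this sketch, `complete_new_frag(0, user_frees)` exercises the case where the callback frees the fragment; testing `frag->flags` after the callback instead of the local copy would be a use-after-free in that case.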
Nathan Hjelm	87d41da62b	btl/vader: add support for atomics and emulated rdma This commit adds support for atomic operations as well as rdma for systems without rdma support. This support is implemented using an internal send tag. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-02 13:57:11 -06:00
Nathan T. Weeks	08f9ae97ee	btl/ugni: update BTL_VERBOSE argument list Signed-off-by: Nathan T. Weeks <weeks@iastate.edu>	2018-07-02 09:23:30 -06:00
Sergey Oblomov	c2bd6af9f2	MCA/COMMON/UCX: minor unification of del_procs calls - some common functionality of del_procs calls is moved into mca_common module - blocking ucp_put call is replaced by non-blocking routine Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-02 15:10:53 +03:00
Jeff Squyres	7b0dd03e92	tcp/btl: fix a cast The current cast is functional, but isn't really the way it should be done. This commit makes the cast the way it should be done. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-29 07:25:46 -07:00
Jeff Squyres	57bc657e7f	btl/tcp: fix hash map usage Fix two facepalms: 1. The "uint32" in the hash map functions refer to the key size, not the value size. The values are always 64 bits. 2. Pass the straight value to the "set" functions -- not the pointer to the value. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-28 15:29:41 -07:00
Yossi Itigin	3a7271ef4e	Merge pull request #5344 from hoopoepg/topic/mca-common-ucx-fixed-build MCA/COMMON/UCX: fixed build scripts	2018-06-28 15:14:04 +03:00
Sergey Oblomov	624d59604b	MCA/COMMON/UCX: minor optimization of build scripts Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-28 12:58:07 +03:00
Thananon Patinyasakdikul	304cf97ab5	Merge pull request #5334 from thananon/ofi_progress_fix btl/ofi: progress now happens after a threshold.	2018-06-27 12:51:33 -07:00
Sergey Oblomov	de8568c822	MCA/COMMON/UCX: enabled fallback into older UCX API Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-27 19:59:40 +03:00
Sergey Oblomov	1223b05811	MCA/COMMON/UCX: fixed build scripts - updated evaluation of UCX lib - used call from UCX v1.3 - updated makefile compilation flags Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-27 11:10:25 +03:00
Thananon Patinyasakdikul	be76896f7c	btl/ofi: progress now happens after a threshold. This commit changes the way btl/ofi calls progress. Previously, we forced progression with every rdma/atomic call. This gave a performance boost in some cases and a slowdown in others. Now we only force progression after some number of rdma calls, which results in better performance overall. Also added a new MCA parameter 'mca_btl_ofi_progress_threshold' to set the threshold number. The new default is 64. Also: Added FI_DELIVERY_COMPLETE to tx_rtx flags to ensure that the completion is generated after the message has been received on the remote side. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-26 10:39:45 -07:00
Nathan Hjelm	b0ac6276a6	btl/ugni: improve multi-threaded RDMA performance This commit improves the injection rate and latency for RDMA operations. This is done by the following improvements: - If C11's _Thread_local keyword is available then always use the same virtual device index for the same thread when using RDMA. If the keyword is not available then attempt to use any device that isn't already in use. The binding support is enabled by default but can be disabled via the btl_ugni_bind_devices MCA variable. - When posting FMA and RDMA operations always attempt to reap completions after posting the operation. This allows us to better balance the work of reaping completions across all application threads. - Limit the total number of outstanding BTE transactions. This fixes a performance bug when using many threads. - Split out RDMA and local SMSG completion queue sizes. The RDMA queue size is better tuned for performance with RMA-MT. - Split out put and get FMA limits. The old btl_ugni_fma_limit MCA variable is deprecated. The new variable names are: btl_ugni_fma_put_limit and btl_ugni_fma_get_limit. - Change how post descriptors are handled. They are no longer allocated separately from the RDMA endpoints. - Some cleanup to move error code out of the critical path. - Disable the FMA sharing flag on the CDM when we detect that there should be enough FMA descriptors for the number of virtual devices we plan to create. If the user sets this flag we will not unset it. This change should improve the small-message RMA performance by ~ 10%. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-26 11:31:35 -06:00
Ralph Castain	0ddbc75ce5	Merge pull request #4930 from kizill/fix-ipv6 fixed ipv6 OOB connection problems (fix issue #1585)	2018-06-26 09:13:53 -07:00
Nathan Hjelm	abb87f9137	Merge pull request #5338 from ggouaillardet/topic/uct btl/uct: misc fixes	2018-06-26 08:56:40 -06:00
Yossi Itigin	ee873f4f79	Merge pull request #5322 from hoopoepg/topic/mca-ucx-common MCA/UCX: added common module	2018-06-26 13:54:12 +03:00
Gilles Gouaillardet	b40b835a70	btl/uct: remove debug code Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-06-26 16:03:16 +09:00
Gilles Gouaillardet	552d0809aa	btl/uct: add missing include file Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-06-26 14:53:02 +09:00
Nathan Hjelm	6c089518e7	btl/uct: make uct endpoints array a flexible array member Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-25 18:14:58 -06:00
Nathan Hjelm	c5c5b42307	btl: add a new btl for the UCT layer in OpenUCX This commit adds a new btl for one-sided and two-sided. This btl uses the uct layer in OpenUCX. This btl makes use of multiple uct contexts and per-thread device pinning to provide good performance when using threads and osc/rdma. This btl has been tested extensively with osc/rdma and passes all MTT tests on aries and IB hardware. For now this new component disables itself but can be enabled by setting the btl_uct_memory_domains MCA variable with a comma-delimited list of supported memory domains/transport layers. For example: --mca btl_uct_memory_domains ib/mlx5_0. The specific transports used can be selected using --mca btl_uct_transports. The default is to use any available transport. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-25 18:14:58 -06:00
Sergey Oblomov	bf7fd480e9	MCA/COMMON/UCX: added non-blocking implementations of atomics - added implementation of swap/cswap/fadd operations - blocking add64 is replaced by non-blocking routine Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-25 12:25:31 +03:00
Sergey Oblomov	63e7ba6843	MCA/COMMON/UCX: added parameter for UCX/opal progress - added parameter to set UCX/opal progresses - minor refactoring of request wait routines Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-25 11:00:12 +03:00
Jeff Squyres	3767ce27c0	btl/tcp: trivial whitespace clean No code/logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 08:04:12 -07:00
Jeff Squyres	9034717876	btl/tcp: use a hash map for kernel IP interface indexes The giant size of the TCP proc struct is causing a problem in some environments (because it is allocated on the stack), and it was too big, anyway. Instead, use a hash map. That way, it starts small and can grow if it needs to. It also makes no assumptions about the values of the kernel interface indexes. Fixes #5292. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 08:03:30 -07:00
Jeff Squyres	e3d6c5ce3a	pmix3/pmix_server.c: minor compiler warning stomp Submitted upstream https://github.com/pmix/pmix/pull/776. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 06:35:09 -07:00
Sergey Oblomov	d57ae62dee	MCA/UCX: added common module - implemented non-blocking routines for flush operations Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-22 16:41:09 +03:00
Jeff Squyres	e305e80aff	Merge pull request #5317 from jsquyres/pr/update-bind-to-cpulist-option orterun: use consistent CLI option name for --bind-to	2018-06-21 12:43:18 -04:00
Jeff Squyres	4603852740	orterun: use consistent CLI option name for --bind-to Since the new binding option is tied to the --cpu-list orterun CLI option, make the --bind-to option reflect the same name (vs. the --cpu-set CLI option, which is entirely different). For example: mpirun --bind-to cpu-list:ordered ... Note that "--bind-to cpulist:ordered" is accepted as a synonym, because people will be lazy. Also add some minor updates to the orterun.1in man page for clarification. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-21 08:22:00 -07:00
Edgar Gabriel	fb16d40775	Merge pull request #5196 from edgargabriel/topic/cuda io/ompio: introduce initial support for cuda buffers in ompio	2018-06-21 10:14:43 -05:00
Edgar Gabriel	8c2ea0ef49	opal/datatype: add additional interface to retrieve more details about cuda buffer The existing interface in opal_datatype_cuda does not allow distinguishing whether a buffer is a managed or unmanaged cuda buffer. Add an interface that allows retrieving this information through a convertor, since the information is actually available in the mca_common_cuda_* routines. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-06-21 09:25:50 -05:00
Ralph Castain	f17d47087a	Define a new binding method and qualifier Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-20 21:26:09 -07:00
Ralph Castain	5ac2ce6346	Cover all the PMIx data types Cover all data types for OPAL-to-PMIx conversion, generating error logs when we hit something we don't support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-20 09:06:19 -07:00
Boris Karasev	39c9cb12bb	pmix/ext2x: fixed detection PMIx v2.0 by pmix component Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-06-20 13:23:51 +03:00
Ralph Castain	08707c9762	Sync to updated PMIx v3.0.0rc Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-19 21:25:43 -07:00
Noah Evans	a64abadf97	Fix mca_base_var_files separator In the opal list parsing behavior paths should be separated by ':' while files are separated by ','. In the opal and pmix code (the pmix fix is in a separate commit) there was a mistake in the parsing such that files were being separated by ':' when they should be separated by ','s. This commit attempts to address this mismatch. Signed-off-by: Noah Evans <noah.evans@gmail.com>	2018-06-19 12:59:07 -06:00
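The separator convention described in a64abadf97 (files separated by ',', paths separated by ':') can be sketched with a small splitter. This is a hedged illustration, not the opal or pmix parsing code: `count_entries`, `FILE_LIST_SEP`, and `PATH_LIST_SEP` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for the two separators described in the commit message. */
#define FILE_LIST_SEP ','   /* separates file names */
#define PATH_LIST_SEP ':'   /* separates search paths */

/* Count the non-empty entries of `list`, where entries are separated by `sep`. */
static size_t count_entries(const char *list, char sep)
{
    assert(list != NULL);

    size_t count = 0;
    const char *p = list;

    while (*p != '\0') {
        const char *end = strchr(p, sep);
        size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);

        if (len > 0) {
            count++;
        }
        if (end == NULL) {
            break;          /* last entry reached */
        }
        p = end + 1;        /* skip the separator */
    }
    return count;
}
```

Splitting a file list on the wrong separator shows the kind of mismatch the commit describes: in this sketch, `"a.conf:b.conf"` split on `FILE_LIST_SEP` is treated as a single entry.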
Ralph Castain	f0a0d606a0	Correct accounting for tools Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 1be080f7b92bad39745f42628a8cb6afefad2d2a)	2018-06-18 13:24:25 -07:00
Jeff Squyres	cd8d169599	Merge pull request #5276 from jsquyres/pr/info-key-len-fix util/info: tighten up error detection on key length	2018-06-18 12:03:21 -04:00
Thananon Patinyasakdikul	13f58f3191	Merge pull request #5274 from thananon/ofi_sep btl/ofi: add scalable endpoint support.	2018-06-18 08:41:06 -07:00
Jeff Squyres	266d5b2110	Merge pull request #5277 from jsquyres/pr/cygwin-patch external libevent: fix for Cygwin	2018-06-18 11:08:30 -04:00
Ralph Castain	7981818b84	Update PMIx atomics Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-17 10:03:49 -07:00
Ralph Castain	fa18ba395d	Sync to latest PMIx v3.0rc Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-17 02:41:46 -07:00
Ralph Castain	ac7bb15505	Fix other typo in help message Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-16 16:30:52 -07:00
Ralph Castain	8cfce583c0	Correct typo to properly check for PMIx v4 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-16 16:29:05 -07:00
Jeff Squyres	07c8ec6a3c	external libevent: fix for Cygwin Fix from Marco Atzeri for building on Cygwin. Signed-off-by: Marco Atzeri <marco.atzeri@gmail.com> Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-16 09:08:58 -07:00
Jeff Squyres	2670a7f55c	util/info: tighten up error detection on key length Fix CID 1435996: use the proper % type to render the size. Also use opal_output(), not fprintf(). For debug builds, abort without dumping core (dumping core is very unfriendly when running thousands of automated tests) -- the stderr output is sufficient to find the coding error. For non-debug builds, truncate the key and emit a warning that it almost certainly will not work properly. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-16 08:45:03 -07:00
Ralph Castain	94794011f1	Silence warnings and ignore test binary Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-15 11:32:22 -07:00
Ralph Castain	e2e6da4379	Merge pull request #5258 from rhc54/topic/pmix4 Sync to PMIx v3.0rc and add ext4x	2018-06-15 09:41:08 -07:00
Thananon Patinyasakdikul	dae3c9447c	btl/ofi: add scalable endpoint support. This commit adds support for scalable endpoints to enhance multithreaded application performance. The BTL will detect the support from the ofi provider and will fall back to normal usage if scalable endpoints are not supported. NEW MCA parameters: - mca_btl_ofi_disable_sep: force the btl to not use scalable endpoints. - mca_btl_ofi_num_contexts_per_module: number of communication contexts to create (should be the same as the number of threads). Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-14 15:44:29 -07:00
Thananon Patinyasakdikul	08cfacddee	Merge pull request #5272 from thananon/opal__thread opal/threads: Added keyword `opal_thread_local` for TLS	2018-06-14 14:45:47 -07:00
Thananon Patinyasakdikul	469a404c3b	opal/thread: Added keyword `opal_thread_local` for TLS. configure: add checks for `__thread` on top of current check for `_Thread_local` and define OPAL_HAVE_THREAD_LOCAL if the compiler supports TLS. Added `opal_thread_local` keyword to unify the definition. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-14 13:25:04 -07:00
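A unified thread-local keyword of the kind 469a404c3b describes might look roughly like the sketch below. Note the substitution: the commit uses configure-time checks, whereas this sketch selects the keyword with preprocessor tests only. `my_thread_local`, `MY_HAVE_THREAD_LOCAL`, and the counter functions are hypothetical names, not the Open MPI definitions.

```c
#include <assert.h>

/* Pick whichever TLS spelling the compiler offers (assumption: preprocessor
 * detection stands in for the configure checks mentioned in the commit). */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#  define my_thread_local _Thread_local
#  define MY_HAVE_THREAD_LOCAL 1
#elif defined(__GNUC__) || defined(__clang__)
#  define my_thread_local __thread
#  define MY_HAVE_THREAD_LOCAL 1
#else
#  define my_thread_local            /* no TLS: falls back to a plain static */
#  define MY_HAVE_THREAD_LOCAL 0
#endif

/* Each thread gets its own copy when MY_HAVE_THREAD_LOCAL is 1. */
static my_thread_local int call_count = 0;

/* Increment and return the calling thread's counter. */
static int bump_call_count(void)
{
    return ++call_count;
}

/* Reset the calling thread's counter to a non-negative starting value. */
static void reset_call_count(int start)
{
    assert(start >= 0);
    call_count = start;
}
```

Code that needs per-thread state can then be written once against the single keyword and guarded with the `MY_HAVE_THREAD_LOCAL` macro where a fallback path is required.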
Nathan Hjelm	009a3f093b	opal/asm: fix 32-bit fetch and op atomics These functions were erroneously returning the new, not the old value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-14 13:54:52 -06:00
Howard Pritchard	7dcab6e4a4	Merge pull request #5269 from hppritcha/topic/squash_gcc7.3.0_warnings topo/treematch - quash compiler warning	2018-06-13 21:13:04 -05:00
Howard Pritchard	64de269cc3	topo/treematch - quash compiler warning quash a compiler warning showing up with gcc 7.3 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-06-13 16:34:17 -05:00
Thananon Patinyasakdikul	390d72addd	Merge pull request #4885 from davideberius/spc_pr Initial Software-based Performance Counters PR	2018-06-12 14:04:49 -07:00
Jeff Squyres	591b225527	Merge pull request #5245 from PeterGottesman/hwloc-fix Ensure required hwloc directories are in dist tarballs	2018-06-12 13:08:35 -04:00
Peter Gottesman	afe363b73b	Ensure required hwloc directories are in dist tarballs Fixes MTT failure when running autogen on a tarball Signed-off-by: Peter Gottesman <pgottesm@cisco.com>	2018-06-12 08:33:50 -07:00
Howard Pritchard	b43f94895f	Merge pull request #5261 from thananon/ofi_new_mr btl/ofi: change required mr mode bits.	2018-06-11 20:54:09 -06:00
David Eberius	d377a6b6f4	Added Software-based Performance Counters driver code along with several counters. This code is the implementation of Software-based Performance Counters as described in the paper 'Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI. All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example. Signed-off-by: David Eberius <deberius@vols.utk.edu>	2018-06-11 22:48:16 -04:00
Thananon Patinyasakdikul	b6e07bfac6	btl/ofi: change required mr mode bits. FI_MR_UNSPEC is not supposed to be used beyond ofi version 1.5. This commit replaces FI_MR_UNSPEC with the new FI_MR_BASIC mode bits (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR). The btl functionality remains the same. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-11 09:25:57 -07:00
Ralph Castain	48f27655a6	Sync to PMIx v3.0rc and add ext4x Sync to the draft rc for PMIx v3.0. Add an external component for PMIx master, which is at v4.0 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-11 05:54:23 -07:00
Jeff Squyres	84701cd2b0	Merge pull request #5204 from markalle/info_snprintf fix info-subscribe to use snprintf() and warn on long key	2018-06-08 15:22:55 -04:00
Nicolas Morey-Chaisemartin	038ceb0b61	btl: openib: Add support for Broadcom Cumulus RoCE HCA Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>	2018-06-07 15:16:58 +02:00
Nicolas Morey-Chaisemartin	d3416345ab	btl: openib: add support for QLogic RoCE HCA Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>	2018-06-07 15:16:41 +02:00
Arm	ce4d769aee	Merge pull request #5235 from thananon/ofi_einr_fix btl/ofi: handles FI_EINTR properly.	2018-06-06 11:30:48 -07:00
Ralph Castain	840fb42f93	PMIx rte component does support dynamics Minor cleanups Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-05 21:55:19 -07:00
Thananon Patinyasakdikul	f34a73af70	btl/ofi: handles FI_EINTR properly. OFI sockets provider will sometimes return FI_EINTR. Apparently this flag does not exist in OFI version 1.5. This commit adds an ifdef check. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-05 18:40:00 -07:00
Nathan Hjelm	8192eb9aa6	patch/overwrite: add support for aarch64 This commit adds patcher support of aarch64. The current implementation uses mov, movk, and br to perform the jump. This uses 5 instructions. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-04 16:04:26 -06:00
Arm	555029b323	Merge pull request #5216 from thananon/btl_ofi btl/ofi: the component now compiles with libfabric < 1.5.	2018-06-04 13:47:00 -07:00
Thananon Patinyasakdikul	c18cf7315f	btl/ofi: configury: will not build if OFI ver < 1.5. This commit fixes a problem where users have libfabric version < 1.5 installed and the build failed because the new MR mode bits do not exist. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-04 10:35:50 -07:00
Ralph Castain	014bb3c8de	Fix external hwloc builds Remove spurious comma in header file definition. Remove unused variables Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-03 11:24:21 -07:00
Ralph Castain	55ac526a67	Enable the PMIx ompi/rte component Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using: * autogen --no-orte * with external libevent, external hwloc, and external PMIx master * configuring PMIx master with the same libevent and hwloc * execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately. Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)	2018-06-03 07:25:12 -07:00
Ralph Castain	4ceaed3a76	Merge https://github.com/bgoglin/ompi into topic/rte2	2018-06-03 07:24:12 -07:00
Howard Pritchard	623e36de8a	Merge pull request #5205 from thananon/btl_ofi new btl/ofi: RDMA only btl using libfabric.	2018-06-02 13:00:50 -06:00
Mark Allen	93fefc4d70	fix info-subscribe to use snprintf() and warn on long key This checkin mainly concerns our internal info keys that are registering for callbacks via opal_infosubscribe_subscribe(). Those keys need to have an extra __IN_<key>/val stored to preserve their pre-callback value. So that means our internal keys are limited to 5 chars shorter than the usual key length limit. The code previously would have been silently inactive if a large key happened to come in, now it warns and also uses snprintf() to avoid compiler warnings. I'm also making the top-level MPI_Info_set warn if the user uses our reserved "__IN_" prefix. I had wanted the feature to be more invisible than that, but it would require a more sophisticated approach to change that. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2018-06-01 18:31:32 -04:00
Thananon Patinyasakdikul	6efb8069ec	new btl/ofi: RDMA only btl using libfabric. This commit adds a new transport layer to be used with the osc rdma module. This BTL provides put, get, atomic and fetch atomic operations. It can be used with multiple hardware vendors as long as they have their provider under Libfabric and have the right capabilities. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-01 15:22:04 -07:00
Nathan Hjelm	f8dbf62879	opal/asm: change ll/sc atomics to macros This commit fixes a hang that occurs with debug builds of Open MPI on aarch64 and power/powerpc systems. When the ll/sc atomics are inline functions the compiler emits load/store instructions for the function arguments with -O0. These extra load/store arguments can cause the ll reservation to be cancelled causing live-lock. Note that we did attempt to fix this with always_inline but the extra instructions are still emitted by the compiler (gcc). There may be another fix but this has been tested and is working well. References #3697. Close when applied to v3.0.x and v3.1.x. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-05-31 19:45:19 -06:00
Jeff Squyres	fb0473acb5	pmix3x: compiler warning stomp This fix was already included in pmix upstream (https://github.com/pmix/pmix/commit/fb7af8af2). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-05-30 10:14:37 -07:00
Jeff Squyres	dec247d96e	opal/datatype: minor compiler warning stomp Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-05-30 10:08:19 -07:00

5471 commits