openmpi

Author	SHA1	Message	Date
Ralph Castain	f609542bbf	Implement process set name support Add the --pset option for app_contexts so the user can provide a string name for each app_context. Use the new PMIx pset attribute to store the names in the PMIx local storage for retrieval Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-11-27 11:07:58 -08:00
Josh Hursey	3339e2aa33	Merge pull request #5986 from jjhursey/vpid-unpack Add OPAL_VPID to unpacking	2018-11-21 11:42:02 -06:00
KAWASHIMA Takahiro	cacd6f389c	datatype: Remove `#if HAVE_[TYPE]` for C99 types Now Open MPI requires a C99 compiler. Checking availability of the following types is no more needed. - `long long` (`signed` and `unsigned`) - `long double` - `float _Complex` - `double _Complex` - `long double _Complex` Furthermore, the `#if HAVE_[TYPE]` style checking is not correct. Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`. `AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h` if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]` (instead of defining as `0`) if it is not available. So even if we need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`. I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac` since someone may use `HAVE_[TYPE]` macros somewhere. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2018-11-14 09:32:52 +09:00
Thananon Patinyasakdikul	3dc1629771	Merge pull request #5241 from thananon/opal_progress Add MCA param for multithread opal_progress().	2018-11-09 12:30:07 -05:00
Gilles Gouaillardet	efe72d3d92	Merge pull request #6055 from ggouaillardet/topic/c11_atomics Misc C11 atomics related fixes	2018-11-08 09:11:42 +09:00
Gilles Gouaillardet	07970f192a	atomic/c11: fix include header path simply #include "opal_stdint.h" in order to work with --devel-headers Refs open-mpi/ompi#6053 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-11-07 13:23:28 +09:00
Thananon Patinyasakdikul	d9bd54c628	btl/ofi: fixed compiler warning on OSX. This commit closes #6049 Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>	2018-11-06 15:37:25 -05:00
Nathan Hjelm	30119ee339	opal/asm: work around possible gcc compiler bug It seems in some cases (gcc older than v6.0.0) the __atomic_thread_fence is a no-op with __ATOMIC_ACQUIRE. This appears to be the case with X86_64 so go ahead and use __ATOMIC_SEQ_CST for the x86_64 read memory barrier. This should not cause any performance issues as it is equivalent to the memory barrier in the hand-written atomics. References #6014 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-11-05 11:45:42 -07:00
Thananon Patinyasakdikul	4e23fedccb	opal progress: Added MCA for multithreaded opal_progress. This commit added MCA param `opal_max_thread_in_progress` to set the number of threads allowed to do opal_progress concurrently. The default value is 1. Component with multithreaded design can benefit from this change to parallelize their component progress function. Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>	2018-11-01 13:34:20 -04:00
Howard Pritchard	8126779a35	btl/openib: fix a problem with ib query Under certain circumstances, ibv_exp_query_device was returning an error due to uninitialized fields in the extended attributes struct. Fixes: #5810 Fixes: #5914 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-31 14:50:15 -06:00
Brian Barrett	a1e85b03aa	btl tcp: Fix compile error in IPv6 In `457f058` I broke the TCP BTL with --enable-ipv6. This patch fixes the compile error, so IPv6 works again. Fixed #5996 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-30 12:31:04 -07:00
Joshua Hursey	a557c4130c	Add OPAL_VPID to unpacking * Needed to properly read PMIx job data like the following - `OPAL_PMIX_LOCALLDR` - `OPAL_PMIX_RANK` - `OPAL_PMIX_GLOBAL_RANK` - `OPAL_PMIX_APPLDR` - `OPAL_PMIX_APP_RANK` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2018-10-29 14:31:47 -05:00
Gilles Gouaillardet	b205039205	event/external: fix version requirement Only default to the external component if its version is greater than or equal to that of the internal libevent (2.0.22) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-29 10:12:56 +09:00
Gilles Gouaillardet	35e77a286c	event/external: misc configury fixes - Always use the external component when configure'd with --with-libevent=external - Fix the external libevent library version detection by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros - Use the event2/event.h header (event.h is deprecated since libevent 2.0) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-29 10:12:56 +09:00
Aurelien Bouteiller	f2e6d7891e	Merge pull request #5975 from ICLDisco/export/pmix-fini-threadinterlock Avoid a double lock interlock when calling pmix_finalize	2018-10-26 10:01:23 -04:00
Gilles Gouaillardet	b715dd2657	btl/uct: fix AC_CHECK_DECLS usage AC_CHECK_DECLS takes a comma-separated list of macros/symbols, so replace the whitespace separator with a comma. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-26 15:36:02 +09:00
Aurelien Bouteiller	50cf707f3f	Avoid a double lock interlock when calling pmix_finalize Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2018-10-25 16:36:58 -04:00
Nathan Hjelm	7b34c9b3fe	Merge pull request #5958 from hjelmn/btl_uct_sync btl/uct: update for UCT_CB_FLAG_SYNC removal	2018-10-22 23:17:27 -06:00
Yossi Itigin	4c442f2601	Merge pull request #5934 from hoopoepg/topic/suppressed-cov-warn-added-log-msg COMMON/UCX: suppressed coverity warnings	2018-10-22 11:00:47 +03:00
Nathan Hjelm	1b37328ba8	btl/uct: update for UCT_CB_FLAG_SYNC removal Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-21 18:57:42 -06:00
bosilca	690024917d	Merge pull request #5938 from bwbarrett/bugfix/tcp-multi-address Always use same IP address for module and modex	2018-10-21 09:19:43 -07:00
Sergey Oblomov	1099d5f023	COMMON/UCX: added error code to log output Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-10-21 11:37:25 +03:00
George Bosilca	dc972f0b92	Fix the PML monitoring. The monitoring PML hides its existence from the OMPI infrastructure by removing itself from the list of PML loaded components, remaining hidden until MPI_Finalize. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-18 00:29:23 -04:00
Brian Barrett	457f058e73	btl tcp: Simplify modex address selection Simplify selection of the address to publish for a given BTL TCP module in the module exchange code. Rather than looping through all IP addresses associated with a node, looking for one that matches the kindex of a module, loop over the modules and use the address stored in the module structure. This also happens to be the address that the source will use to bind() in a connect() call, so this should eliminate any confusion (read: bugs) when an interface has multiple IPs associated with it. Refs #5818 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-18 02:23:21 +00:00
Howard Pritchard	7730db9982	Merge pull request #5937 from hppritcha/topic/remove_crs_components remove some dead crs components	2018-10-17 16:06:36 -06:00
Jeff Squyres	c2608fb597	Merge pull request #5939 from jsquyres/pr/coverity-cid-1440332 opal/os_path: fix minor string overrun	2018-10-17 14:24:42 -07:00
Howard Pritchard	6564d3d217	remove some dead crs components Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-17 10:29:00 -06:00
Brian Barrett	4f19221af2	btl tcp: Simplify module address storage Today, a btl tcp module is associated with exactly one IP address (IPv4 or IPv6). There's no need to reserve space for both an IPv4 and IPv6 address in the module structure, since the module will only be associated with one or the other. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-17 16:21:17 +00:00
Jeff Squyres	290d1c0534	opal/os_path: fix minor string overrun This is a follow on to `54ca3310ea` to fix a minor bug identified by Coverity. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-17 08:57:30 -07:00
Sergey Oblomov	df765595e3	COMMON/UCX: suppressed coverity warnings - suppressed coverity warnings - added log messages on failed calls Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-10-17 16:11:03 +03:00
Brian Barrett	2acc4b7e7f	btl tcp: Add workaround for "dropped connection" issue Work around a race condition in the TCP BTL's proc setup code. The Cisco MTT results have been failing on TCP tests due to a "dropped connection" message some percentage of the time. Some digging shows that the issue happens in a combination of multiple NICs and multiple threads. The race is detailed in https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032. This patch doesn't fix the race, but avoids it by forcing the MPI layer to complete all calls to add_procs across the entire job before any process leaves MPI_INIT. It also reduces the scalability of the TCP BTL by increasing start-up time, but that is better than hanging. The long term fix is to do all endpoint setup in the first call to add_procs for a given remote proc, removing the race. This patch is a workaround until that patch can be developed. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-16 18:33:30 -07:00
Nathan Hjelm	37a3f32e47	Merge pull request #5913 from hjelmn/uct_btl_fixes btl/uct: fix deadlock in connection code	2018-10-16 19:11:54 -06:00
Nathan Hjelm	707d35deeb	btl/uct: fix deadlock in connection code This commit fixes a deadlock that can occur when using a TL that supports the connect to endpoint model. The deadlock was occurring while processing an incoming connection requests. This was done from an active-message callback. For some unknown reason (at this time) this callback was sometimes hanging. To avoid the issue the connection active-message is saved for later processing. At the same time I cleaned up the connection code to eliminate duplicate messages when possible. This commit also fixes some bugs in the active-message send path: - Correctly set all fragment fields in prepare_src. - Fix bug when using buffered-send. We were not reading the return code correctly (which is in bytes). This resulted in a message getting sent multiple times. - Don't try to progress sends from the btl_send function when in an active-message callback. It could lead to deep recursion and an eventual crash if we get a trace like send->progress->am_complete->ob1_callback->send->am_complete... Closes #5820 Closes #5821 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 18:28:47 -06:00
Nathan Hjelm	1c001319d4	Merge pull request #5922 from hjelmn/opal_free_list_really_old_race_we_should_really_fix_now opal/free_list: fix race condition	2018-10-16 15:27:05 -06:00
Brian Barrett	da1189d771	Merge pull request #5916 from bwbarrett/revert/6acebc4 Revert "Handle error cases in TCP BTL"	2018-10-16 13:54:18 -07:00
Nathan Hjelm	5c770a7bec	opal/free_list: fix race condition There was a race condition in opal_free_list_get. Code throughout the Open MPI codebase was assuming that a NULL return from this function was due to an out-of-memory condition. In some cases this can lead to a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for example). Before this commit opal_free_list_get_mt looked like this: ```c static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist) { opal_free_list_item_t *item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); if (OPAL_UNLIKELY(NULL == item)) { opal_mutex_lock (&flist->fl_lock); opal_free_list_grow_st (flist, flist->fl_num_per_alloc); opal_mutex_unlock (&flist->fl_lock); item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); } return item; } ``` The problem is that in a multithreaded environment it is possible for the free list to be grown successfully but the thread calling opal_free_list_get_mt to be left without an item. This happens if between the calls to opal_lifo_push_atomic in opal_free_list_grow_st and the call to opal_lifo_pop_atomic other threads pop all the items added to the free list. This commit fixes the issue by ensuring the thread that successfully grew the free list always gets a free list item. Fixes #2921 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 13:17:09 -06:00
Nathan Hjelm	9ea5dfa799	class/opal_fifo: fix warning Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-15 19:18:31 -06:00
Jeff Squyres	cef6cf0ac5	opal: update some string handling Make various string handling locations a little more robust. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:04:28 -07:00
Jeff Squyres	e6de42c379	opal/util/info.c: fix compiler warning Remove unused variable. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:03:28 -07:00
Brian Barrett	5162011428	Revert "Handle error cases in TCP BTL" This reverts commit `6acebc40a1`. This patch is causing numerous "Socket closed" messages which are causing most of the failures on Cisco's MTT run. See https://github.com/open-mpi/ompi/issues/5849 for more information. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-12 15:01:54 -07:00
Brian Barrett	8feff49f6c	pmix: Fix filename typo Looks like a filename was missed when pmix sucked in the installdirs framework. Fixing the typo fixes "make ctags" and "make cscope". Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-12 17:05:01 +00:00
Nathan Hjelm	6ed68da870	btl/uct: use the correct tl interface attributes It is apparently possible for different instances of the same UCT transport to have different limits (max short put for example). To account for this we need to store the attributes per TL context not per TL. This commit fixes the issue. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-11 11:33:17 -06:00
Brian Barrett	902b57919c	btl tcp: Print number of endpoints on error While trying to debug #3035, it's not clear whether there is an issue with the modex data or printing the address list. Print the number of endpoints on the error, which will help determine which case is happening to Cisco. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-10 08:29:48 -07:00
Brian Barrett	bd07cc6707	btl tcp: Improve initialization debugging When creating TCP BTL modules, print more information about the module's ethernet association, including the first address associated with the device, as debug output. Fix a flipped output string for IPv4 and IPv6 addresses in the modex send code. Add the addresses being published in the modex to the debugging output in modex send, to help match failures in endpoint match. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-10 08:29:48 -07:00
Gilles Gouaillardet	ca9009e3d0	pmix/ext3x: fix minor typos refer to PMIx 3x instead of previous 2x Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-10 13:18:36 +09:00
Gilles Gouaillardet	b3bd785ce0	pmix/ext3x: add missing header files into the dist tarball Refs. open-mpi/ompi#5871 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-10 13:17:54 +09:00
Gilles Gouaillardet	67402f5f19	pmix/ext2x: add missing header files into the dist tarball Refs. open-mpi/ompi#5871 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-10 13:17:42 +09:00
Nathan Hjelm	39be6ec15c	btl/uct: bug fixes and general improvements This commit updates the uct btl to change the transports parameter into a priority list. The dc_mlx5, rc_mlx5, and ud transports are added to the priority list. This will give better out of the box performance for multi-threaded codes because the *_mlx5 transports can avoid the mlx5 lock inside libmlx5_rdmav2. This commit also fixes a number of leaks and a possible deadlock when using RDMA. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-09 15:15:45 -06:00
Brian Barrett	e9e4d2a4bc	Handle asprintf errors with opal_asprintf wrapper The Open MPI code base assumed that asprintf always behaved like the FreeBSD variant, where ptr is set to NULL on error. However, the C standard (and Linux) only guarantee that the return code will be -1 on error and leave ptr undefined. Rather than fix all the usage in the code, we use opal_asprintf() wrapper instead, which guarantees the BSD-like behavior of ptr always being set to NULL. In addition to being correct, this will fix many, many warnings in the Open MPI code base. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-08 16:43:53 -07:00
Brian Barrett	c087365ead	opal/util: Always support BSD behavior of asprintf Open MPI's developers like to assume that asprintf() always sets the ptr to NULL on error, but the standard (and Linux glibc) do not guarantee this. As a result, we're making opal_asprintf() always available for developers, which will guarantee that ptr is set to NULL on error. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-08 16:43:53 -07:00
Nathan Hjelm	8291f6722d	btl/vader: fix race condition in writing header Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-05 16:30:06 -06:00
Ralph Castain	76c11a1496	Remove build product - fix autogen.pl mode Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-05 13:38:39 -07:00
Ralph Castain	c498a7e77a	Protect PMIx from bad configure entry Ignore with-hwloc=internal or external as those are meaningless to pmix (will upstream) Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-05 07:07:05 -07:00
Ralph Castain	f379ba9c8e	Fail configure if pmix won't build If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-05 06:29:36 -07:00
Geoff Paulsen	56a55cdb4b	Merge pull request #5786 from jsquyres/pr/string-madness Replace strcpy() and strncpy() with (new) opal_string_copy()	2018-10-04 16:12:46 -05:00
Ralph Castain	86702b71bc	Correctly notify upon process failure We only need to pass a custom range if the target is a single process. Otherwise, we let the range be "session". Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-03 20:19:42 -07:00
Nathan Hjelm	dfa8d3a81a	btl/vader: work around Oracle compiler bug This commit works around an Oracle C compiler bug in 5.15 (not sure when it was introduced). The bug is triggered when we chain assignments of atomic variables. Ex: _Atomic intptr_t x, y; intptr_t z = 0; x = y = z; Will produce a compiler error of the form: operand cannot have void type: op "=" assignment type mismatch: long "=" void To work around the issue we are removing the chain assignment and setting the head and tail on different lines. Fixes #5814 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-03 08:55:51 -06:00
Nathan Hjelm	66a7dc4c72	btl/vader: ensure the fast box tag is always read first On some platforms reading a 64-bit value is non-atomic and it is possible that the two 32-bit values are read in the wrong order. To ensure the tag is always read first this commit reads the tag before reading the full 64-bit value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-03 07:17:32 -06:00
Ralph Castain	31f6c75498	Merge pull request #5819 from bosilca/fix/local_bind Fix/local bind	2018-10-02 11:27:36 -07:00
Brian Barrett	a25df3f29e	opal: Remove outdated MacOS workaround Remove the pack/unpack pragma around net/if.h on MacOS, which was added to fix a bug in MacOS X 10.4.x on 64-bit platforms. The bug was fixed in Mac OS X 10.5.0 and, sometime in the last 11 years, compilers started emitting warnings about the fact that the Apple header stomped over the pragma pack settings from the workaround. We already don't support versions of MacOS earlier than 10.5, so there's no point in keeping the workaround. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-02 13:35:46 -04:00
Brian Barrett	19e16d5fd0	opal: Disable memory patcher component on MacOS Open MPI doesn't support any transports on MacOS which require memory manager hooks. The memory patcher component uses the syscall interface, which has been deprecated in recent versions of MacOS. Since we don't need it and it emits warnings about deprecation, disable the memory patcher component on MacOS. Fixes #5671 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-02 13:35:15 -04:00
George Bosilca	a3a492b42c	Small pedantic fixes. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-02 12:08:18 -04:00
George Bosilca	9164e26e2f	Provide the correct socklen to bind. Get Brian's patch from #5825 and his log message: Fix a failure in binding the initiating side of a connection on MacOS. MacOS doesn't like passing the size of the storage structure (sockaddr_storage) instead of the expected size of the structure (sockaddr_in or sockaddr_in6), which was causing bind() failures. This patch simply changes the structure size to the expected size. Add a more clear error message in debug mode. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-02 12:06:40 -04:00
Ralph Castain	fcc1d30ab3	Merge pull request #5775 from karasevb/check_old_topo_key pmix: check the old topo key to keep compatibility with old RMs	2018-10-02 08:12:42 -07:00
Jeff Squyres	5dae086f7e	btl/tcp: output the IP address correctly Per https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673, it looks like the IP address for a given interface is being stashed in two places: on the endpoint and on the module. 1. On the endpoint, it is storing the moral equivalent of a (struct sockaddr_in.sin_addr). 2. On the module, it is storing a full (struct sockaddr_storage). The call to opal_net_get_hostname() expects a full (struct sockaddr*) -- not just the stripped-down (struct sockaddr_in.sin_addr). Hence, when the original code was passing in the endpoint's (struct sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it like a (struct sockaddr), hilarity ensued (i.e., we got the wrong output). This commit eliminates the call to opal_net_get_hostname() and just calls inet_ntop() directly to convert the (struct sockaddr_in.sin_addr) to a string. NOTE: Per the github comment cited above, there can be a disparity between the IP address cached on the endpoint vs. the IP address cached on the module. This only happens with interfaces that have more than one IP address. This commit does not fix that issue. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-01 16:12:57 -07:00
Jeff Squyres	293a938d29	string_copy: assert() fail if the copy is too long Add a heuristic: if the string copy is "too long", fail an assert(). This is based on the premise that Open MPI doesn't do large string copies. So if we see a dest_len that is over a certain threshold (currently set at 128K), this is likely a programmer error, and on debug builds, we should fail an assert(). In production builds, it will work just fine (assuming that it's not a programmer error). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-01 13:34:15 -07:00
Ralph Castain	172e770154	Sync to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-28 12:43:32 -07:00
Ralph Castain	c836c4282c	Update to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-27 20:57:23 -07:00
Jeff Squyres	28e51ad4d4	opal: convert from strncpy() -> opal_string_copy() In many cases, this was a simple string replace. In a few places, it entailed: 1. Updating some comments and removing now-redundant foo[size-1]='\0' statements. 2. Updating passing (size-1) to (size) (because opal_string_copy() wants the entire destination buffer length). This commit actually fixes a bunch of potential (yet quite unlikely) bugs where we could have ended up with non-null-terminated strings. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-27 11:56:18 -07:00
Jeff Squyres	e1cf8757ae	opal/util/string_copy: add "safe" string copy function Do essentially the same thing as strncpy(3), but a) ensure to always terminate the destination buffer with a \0, and b) do not \0-pad to the right. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-27 10:48:04 -07:00
Jeff Squyres	38fefcf1cf	opal/util/Makefile.am: fix whitespace (remove tabs) No code or logic changes Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-27 09:43:03 -07:00
Jeff Squyres	6bb356ab87	Squash a bunch of harmless compiler warnings. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-26 12:15:21 -07:00
Boris Karasev	ed42f568ae	pmix: check the old topo key to keep compatibility with old RMs Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-09-25 18:13:54 +03:00
Gilles Gouaillardet	db65dbd9a8	ucx: use the c99 __func__ macro instead __FUNCTION__ macro was never standardized and should not be used. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-09-25 11:19:18 +09:00
Jeff Squyres	176da51aec	mca_base_var: fix output bug about settable vars Fix the test that determined whether we output "writeable" or "read-only" for MCA vars (it was checking the wrong flag). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-22 07:18:45 -07:00
Nathan Hjelm	c6230657a2	opal: fix warning Fixes #5713 Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-09-21 00:51:40 -06:00
Brian Barrett	9344afd485	openib: Disable CUDA async by default Disable async receive for CUDA under OpenIB. While a performance optimization, it also causes incorrect results for transfers larger than the GPUDirect RDMA limit. This change has been validated and approved by Akshay. References #3972 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-09-20 07:58:10 -07:00
Howard Pritchard	dccf780546	Merge pull request #5737 from hppritcha/topic/remove_scif_support SCIF: remove it	2018-09-19 11:38:27 -06:00
Howard Pritchard	b9ac3d8931	SCIF: remove it KNC is effectively dead. Remove corresponding SCIF support in Open MPI. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-09-19 10:39:52 -06:00
bosilca	b0e2975f86	Merge pull request #5701 from ICLDisco/export/errors_ib Adding error handling in OpenIB BTL	2018-09-19 09:45:58 -04:00
bosilca	464e1abbab	Merge pull request #5700 from ICLDisco/export/tcp_errors Handle error cases in TCP BTL	2018-09-19 09:44:38 -04:00
Michael Kuron	17b0f1fcc3	Deal with EOPNOTSUPP returned from getsockopt() This can be returned when running on QEMU user-mode emulation, which does not support getsockopt with SO_RCVTIMEO. Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>	2018-09-16 14:55:28 +02:00
Jeff Squyres	3970b06134	misc: passing a bool to va_start() is undefined According to clang on MacOS, passing a bool parameter -- which undergoes default parameter promotion -- to va_start() results in undefined behavior. So just change these params to int and avoid the issue. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-15 06:04:13 -07:00
Jeff Squyres	8f2620d3af	misc: compiler warning fixes A variety of small compiler warning fixes. The 2 PMIx fixes are already committed upstream. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-15 06:04:13 -07:00
Nathan Hjelm	6363889cd0	Merge pull request #5696 from hjelmn/vader_5375 btl/vader: ensure that the send tag is always written last	2018-09-14 12:33:35 -06:00
Nathan Hjelm	1071d72130	Merge pull request #5445 from hjelmn/asm_type Update opal to use C11 atomics if available	2018-09-14 12:32:56 -06:00
Nathan Hjelm	fe6528b0d5	opal/atomic: always use C11 atomics if available This commit disables the use of both the builtin and hand-written atomics if proper C11 atomic support is detected. This is the first step towards requiring the availability of C11 atomics for the C compiler used to build Open MPI. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-14 10:51:05 -06:00
Nathan Hjelm	000f9eed4d	opal: add types for atomic variables This commit updates the entire codebase to use specific opal types for all atomic variables. This is a change from the prior atomic support which required the use of the volatile keyword. This is the first step towards implementing support for C11 atomics as that interface requires the use of types declared with the _Atomic keyword. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-14 10:48:55 -06:00
Nathan Hjelm	850fbff441	btl/vader: ensure that the send tag is always written last To ensure fast box entries are complete when processed by the receiving process the tag must be written last. This includes a zero header for the next fast box entry (in some cases). This commit fixes two instances where the tag was written too early. In one case, on 32-bit systems it is possible for the tag part of the header to be written before the size. The second instance is an ordering issue. The zero header was being written after the fastbox header. Fixes #5375, #5638 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-14 10:35:35 -06:00
Ralph Castain	7facb3f3e9	Pickup and deploy network-specific envars Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-13 09:24:19 -07:00
Ralph Castain	c7939aeebc	Pickup some minor debugging changes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-13 09:11:28 -07:00
Ralph Castain	466cad6cb2	Update master to PMIx v4 Retain ext3x for PMIx 3 compatibility Get the blasted permissions correct on config files Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-13 08:24:17 -07:00
Ralph Castain	fb67b1703f	Merge pull request #5689 from rhc54/topic/psm2 Silence warning of unused function	2018-09-12 16:40:23 -07:00
Ralph Castain	853dc96c5d	Silence warning of unused function Requires protection for HAVE___CLEAR_CACHE Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-12 15:03:59 -07:00
Jeff Squyres	78c1469f93	Merge pull request #5682 from selvintxavier/brcm_hcas Add support for different Broadcom HCAs	2018-09-12 11:59:21 -04:00
Selvin Xavier	a53a6f7650	Add support for different Broadcom HCAs Adds device ids of different Broadcom adapters from BCM57XXX and BCM58XXX family of HCAs. Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>	2018-09-12 02:22:34 -07:00
Ralph Castain	83ba589084	Silence warning Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-11 18:01:12 -07:00
Nathan Hjelm	9a514c6794	Merge pull request #5647 from hjelmn/clear_cache patcher/base: improve instruction cache flush for aarch64	2018-09-10 15:43:35 -06:00
Jeff Squyres	a1b879d176	Merge pull request #5651 from jsquyres/pr/dont-let-openib-yell-about-no-nics btl/openib: don't complain about no NICs	2018-09-07 15:00:24 -04:00
Howard Pritchard	dc02e54320	Merge pull request #5516 from thananon/ofi_send btl/ofi: Added 2 sided communication support.	2018-09-06 18:39:23 -06:00
Jeff Squyres	098ec55e37	btl/openib: don't complain about no NICs Since openib is on its long, slow way out the door, don't let it complain about not being able to find any NICs at run time. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-06 11:26:58 -07:00
Nathan Hjelm	1cdbceb095	patcher/base: improve instruction cache flush for aarch64 This commit updates the patcher component to either use the __clear_cache intrinsic or the correct assembly to flush the instruction cache. Fixes #5631 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-06 09:47:53 -06:00
Nathan Hjelm	5b9a24f13b	Merge pull request #5648 from hjelmn/btl_uct_fix btl/uct: add missing opal_mem_hooks_unregister_release call	2018-09-05 13:59:22 -06:00
Nathan Hjelm	36c206d2d6	btl/uct: add missing opal_mem_hooks_unregister_release call This commit fixes a bug when using the UCT btl with the UCX memory hooks disabled. We were missing a call to opal_mem_hooks_unregister_release to remove the btl memory hook callback. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-05 13:01:45 -06:00
Jeff Squyres	4173ac6dd0	mca/base: enforce max string lengths Ensure that the project, framework, component, and variable names are lower than max lengths. This is a follow-on to `992a8e8297`, per discussion on https://github.com/open-mpi/ompi/pull/5642. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-05 08:42:00 -07:00
George Bosilca	992a8e8297	Use less memory for the MCA variable names. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-08-31 11:36:40 +02:00
Jeff Squyres	05e5f61fe1	common/verbs-usnic: check that it will actually compile If someone specifies --with-verbs-usnic, actually do a configury check to ensure that it will compile (vs. assuming that it will compile if someone asks for it). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-30 13:10:49 -07:00
Gilles Gouaillardet	ad2c207a7e	mpool/memkind: plug a memory leak in mca_mpool_memkind_close() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-08-30 10:07:17 +09:00
Gilles Gouaillardet	7556dd0abb	opal/util: plug a memory leak in the opal_infosubscriber_t destructor Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-08-30 10:07:17 +09:00
Gilles Gouaillardet	aeddd2f249	pmix/pmix3x: plug a memory leak in external_register() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-08-30 10:07:16 +09:00
Gilles Gouaillardet	6e47c5708e	pmix/base: plug a memory leak in opal_pmix_base_select() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-08-30 10:07:16 +09:00
Yossi Itigin	68206a5635	Merge pull request #5569 from hoopoepg/topic/optimize-blocked-calls PML/UCX: blocked calls optimizations	2018-08-29 14:19:09 +03:00
Yossi Itigin	4bb6845888	Merge pull request #5570 from hoopoepg/topic/opal-mem-hooks-syno MCA/COMMON/UCX: added synonym to opal_mem_hook variable	2018-08-29 14:16:33 +03:00
Artem Polyakov	00d12c058f	Merge pull request #5596 from karasevb/fix_hwloc_numa_obj Fixed the NUMA obj detection for hwloc ver >= 2.0.0	2018-08-27 22:15:41 -07:00
Sergey Oblomov	2cd9e04166	PML/UCX: optimization of mprobe call - renamed vars - renamed internal variable names - used unsigned datatypes Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-27 09:50:39 +03:00
Sergey Oblomov	38e908f83e	PML/UCX: optimization of mprobe call - refactoring of opal/UCX progress calls Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-27 09:50:38 +03:00
Sergey Oblomov	b0f87f2235	PML/UCX: blocked calls optimizations - added UCX progress priority Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-27 09:50:38 +03:00
Boris Karasev	beb0697f24	Fixed copyrights of prev commit. Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-08-27 09:50:11 +03:00
Gilles Gouaillardet	1665b8db8f	Merge pull request #5519 from extrowerk/haiku_patches fcntl include bugfix	2018-08-27 09:46:43 +09:00
Sergey Oblomov	b72dd83f05	MCA/COMMON/UCX: added synonyms for common ucx variables - added synonyms for atomic/osc modules Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-26 18:25:21 +03:00
Boris Karasev	e5291ccc34	Fixed the NUMA obj detection for hwloc ver >= 2.0.0 Since version 2.0.0, hwloc has a new organization of NUMA nodes in the topology tree. This commit adds the detection of the local NUMA object for hwloc >= 2.0.0, which fixes the procs binding policy for the rmaps mindist component. Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-08-24 19:11:52 +03:00
Jeff Squyres	fe0852bcb4	Miscellaneous compiler warning stomps. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-24 07:39:14 -07:00
Nathan Hjelm	0a9c358ef8	Merge pull request #5577 from hjelmn/btl_vader_sc_emu_warnings btl/vader: clean up debugging and squash warning	2018-08-23 09:41:13 -06:00
Nathan Hjelm	c74cf666a9	btl/vader: clean up debugging and squash warning References #5512 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-22 10:57:32 -06:00
Nathan Hjelm	a99dba6cb7	Merge pull request #5553 from markalle/info_snprintf2 snprintf() length fix for info	2018-08-22 10:27:14 -06:00
Sergey Oblomov	6a7f66d9c2	MCA/COMMON/UCX: renamed synonym to opal_mem_hook variable Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-22 14:12:33 +03:00
Sergey Oblomov	e00f7a68ba	MCA/COMMON/UCX: added synonym to opal_mem_hook variable - added synonym to opal_mem_hook variable to allow printing it in opal_info -a Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-08-21 15:05:12 +03:00
Ralph Castain	5cfa2a7fca	Complete integration of job_control Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-08-20 16:10:50 -07:00
Ralph Castain	9948084130	Update to PMIx v3.0.1 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-08-20 15:05:24 -07:00
Aurélien Bouteiller	e46c907468	Adding error handling in OpenIB BTL bugfix: major: openib send credits returned correctly after a fault for pending frags to dead processes; also tweak the default IB retry timeouts to make this happen faster Make it compile in non-debug builds Mark the IB endpoint as failed when invoking an error; this resolves UDCM connection deadlocks Changing the default IB retry timeouts is not a good idea. We'll need to find another way to speed up credit recovery in failure cases. Remove ULFM specific cases Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2018-08-16 16:53:17 -04:00
Mark Allen	ee6eb9b12b	snprintf() length fix for info The important part of this fix is a couple places 5 was hard-coded that needed to be strlen(OPAL_INFO_SAVE_PREFIX). But also this contains a fix for a gcc 7.3.0 compiler warning about snprintf(). There was an "if" statement making sure all the arguments had appropriate strlen(), but gcc still complained about the following snprintf() because the size of the struct element is iterator->ie_key[OPAL_MAX_INFO_KEY + 1]. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2018-08-16 15:32:04 -04:00
Aurelien Bouteiller	6acebc40a1	Handle error cases in TCP BTL When an error is returned by the socket operations, trigger the appropriate error path in the PML to give an opportunity for rerouting/error handling. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2018-08-14 15:35:24 -04:00
Nathan Hjelm	dca3516765	btl/vader: move memory barrier to where it belongs The write memory barrier was intended to precede setting a fast-box header but instead follows it. This commit moves the memory barrier to the intended location. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-13 10:14:34 -06:00
Nathan Hjelm	e0f73866ef	Merge pull request #5525 from hjelmn/event_threading opal/progress: protect against multiple threads in event base	2018-08-13 09:21:44 -06:00
Jeff Squyres	01e4570af7	hwloc201/configure.m4: make it safe when used with hwloc:external The Autoconf AC_CONFIG_* macros can only be instantiated exactly once for any given file, and they must be in a code execution path at run time for the target file to be generated at the end of configure. For example, if you want to generate file ABC at the end of configure, you must invoke the AC_CONFIG_FILES(ABC) macro in a code path that will get executed when configure is run. That's pretty straightforward. What's not straightforward is two corner cases: 1. You cannot invoke the AC_CONFIG_FILES(ABC) macro for the same file more than once. If you do, autoreconf will fail (even before you can run configure). 2. If AC_CONFIG_FILES(ABC) is not in a code path that is executed by configure, the file ABC is not registered properly, and ABC will not be generated at the end of configure. This applies to hwloc because hwloc's HWLOC_SETUP_CORE macro calls both AC_CONFIG_FILES and AC_CONFIG_HEADER to set up its Makefiles (etc.) so that targets like "make distclean" and "make distcheck" will work properly. Hence, we have to invoke HWLOC_SETUP_CORE. However, the MCA_opal_hwloc_hwloc201_CONFIG macro has a few side effects. It would be nice to be able to do something like this: ``` if hwloc:extern is going to be used: Invoke minimal HWLOC_SETUP_CORE (with no side effects) else Invoke full HWLOC_SETUP_CORE (with side effects) fi ``` But we can't, because autoreconf will detect that AC_CONFIG_FILES has been invoked on the same files more than once (regardless of whether those code paths will be executed at run time or not). Kaboom. 
Similarly, we can't do this: ``` if hwloc:extern is not going to be used: Invoke full HWLOC_SETUP_CORE (with side effects) fi ``` Because then hwloc's AC_CONFIG_FILES won't be registered properly when hwloc:external is used (i.e., when the HWLOC_SETUP_CORE macro is not in a code path that is executed at run time), and targets like "make distclean" will fail because hwloc's Makefiles won't have been set up. Kaboom. But remember that the hwloc framework is a bit special: there will only ever be 2 components: external and internal. External is guaranteed to be configured first because of its priority. So the internal component (i.e., this component) immediately knows if it is going to be used or not based on whether the external component configuration succeeded or failed. Specifically: regardless of whether the internal component (i.e., this component) is going to be used, we have to invoke HWLOC_SETUP_CORE. But we can manage the side effects: allow the side effects when this/internal component is going to be used, and avoid the side effects when this/internal component is not going to be used. This is a little less clean than I would have liked, but because of Autoconf's oddity about its AC_CONFIG_* macros, this is the only solution I could come up with. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-11 11:05:23 -07:00
Jeff Squyres	69aa46e167	libevent2022/configure.m4: always invoke sub-configure In order to make "make distclean" (and friends) work, we need to always invoke the embedded configure script -- even if we know that we're not going to use this component. But in cases where we know we're not going to use this component, we also need to avoid the side effects of the code path that is used when we do want to use this component. So split the two possibilities into two different macros: 1. MCA_opal_event_libevent2022_FAKE_CONFIG: which does almost nothing except invoke the underlying "configure" script. 2. MCA_opal_event_libevent2022_REAL_CONFIG: which does all the real work (including invoking the underlying "configure" script). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-11 06:03:46 -07:00
Jeff Squyres	80df3f040b	libevent2022/configure.m4: trivial cleanup Put argument to AM_CONDITIONAL inside []. No code or logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-10 10:30:45 -07:00
Jeff Squyres	17aa64e438	libevent2022/configure.m4: minor comment cleanup Change # -> dnl. No code or logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-10 10:30:04 -07:00
Jeff Squyres	b063cb6b0f	libevent2022: only configure if event:external fails We know that event:external will be configured first (because of its priority). Take advantage of that here in libevent2022 by having it refuse to configure / politely fail if event:external succeeded. Also print out some additional lines in configure output indicating what is going on (i.e., event:external succeeded, so this component will be skipped, or event:external failed, so this component will be used). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-08 10:22:38 -07:00
Jeff Squyres	4e5f432786	hwloc201: only configure if hwloc:external fails We know that hwloc:external will be configured first (because of its priority). Take advantage of that here in hwloc201 by having it refuse to configure / politely fail if hwloc:external succeeded. Also print out some additional lines in configure output indicating what is going on (i.e., hwloc:external succeeded, so this component will be skipped, or hwloc:external failed, so this component will be used). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-08 10:22:38 -07:00
Nathan Hjelm	bdbb853461	opal/progress: protect against multiple threads in event base libevent does not support multiple threads calling the event loop on the same event base. This causes external libevent's to print out re-entrant warning messages. This commit fixes the issue by protecting the call to the event loop with an atomic swap check. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-07 16:13:43 -06:00
Nathan Hjelm	eeae3f9b93	Merge pull request #5517 from bosilca/topic/treematch_warnings Remove few warnings identified by @rhc in #5514.	2018-08-06 13:25:07 -06:00
Zoltán Mizsei	ac3f8a16ed	fcntl include bugfix Signed-off-by: Zoltán Mizsei <zmizsei@extrowerk.com>	2018-08-06 19:45:59 +02:00
Boris Karasev	57683366ca	pmix: added check for pmix fence status Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-08-06 15:01:57 +06:00
George Bosilca	6d11a45f44	Remove few warnings identified by @rhc in #5514 . Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-08-03 16:21:06 -04:00
Thananon Patinyasakdikul	080115d440	btl/ofi: Added 2 side communication support. The 2 sided communication support is added for non-tagmatching provider to take advantage of this BTL and PML OB1. The current state is "functional" and not optimized for performance. Two sided support is disabled by default and can be turned on by mca parameter: "mca_btl_ofi_mode". Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-08-03 12:30:03 -07:00
Ralph Castain	f7a537cf04	Fix typo Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-25 20:05:33 -07:00
Ralph Castain	bcdb1f45ac	Fix the multiple pe/proc option Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up. Thanks to @tonyreina for the error report Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-25 18:47:39 -07:00
Ralph Castain	55cefedf9b	Cleanup pmix selection check Allow for versions > 3 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-25 16:11:32 -07:00
Nathan Hjelm	47ed8e8830	btl/uct: fix compile warnings/errors Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-23 14:04:38 -06:00
Ralph Castain	5cab823979	Leave opal_event_external_support exposed as global var Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-23 09:18:48 -07:00
Jeff Squyres	a70ecf5267	event/external: prefer external event component Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-07-23 09:20:27 +09:00
Jeff Squyres	83e4a45a9f	event: trivial comment change Switch from #-style to dnl-style. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-07-23 09:20:27 +09:00
Gilles Gouaillardet	ce2c9fffd4	hwloc: prefer external hwloc component Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-07-23 09:20:27 +09:00
Ralph Castain	8a6dc0b211	Merge pull request #5456 from rhc54/topic/px Default to internal PMIx if newer than external	2018-07-19 11:53:47 -07:00
Ralph Castain	1e6aaf7f22	Default to internal PMIx if newer than external Per https://github.com/open-mpi/ompi/issues/5031, if the user didn't specify a particular PMIx installation, then default back to the internal version if it is newer than the discovered external one. PMIx doesn't yet provide a full signature so we have to just get as close as possible for now. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-19 11:02:03 -07:00
Yossi Itigin	29812494f2	Merge pull request #5402 from hoopoepg/topic/common-del-procs MCA/COMMON/UCX: del_procs calls are unified to common module	2018-07-19 11:19:45 +03:00
Sergey Oblomov	920cc2e0d9	MCA/COMMON/UCX: del_procs calls are unified to common module Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-18 07:37:25 +03:00
Howard Pritchard	4447738098	Merge pull request #5414 from hppritcha/topic/iwarp_only_by_default btl/openib: only look for iwarp/roce by default	2018-07-17 20:08:00 -06:00
Howard Pritchard	bc8134dae1	Merge pull request #5448 from thananon/ofi_context btl/ofi: Added FI_CONTEXT as requirement.	2018-07-17 19:51:13 -06:00
Howard Pritchard	6818272392	btl/openib: only look for iwarp/roce by default Due to decreasing support by vendors/other orgs for the OpenIB BTL, only look for iWarp/RoCE devices by default. Allow IB HCAs with ports configured for ethernet. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-07-17 19:11:37 -06:00
Thananon Patinyasakdikul	033c364ee0	btl/ofi: Added FI_CONTEXT as requirement. OFI BTL uses context for completion but never asks for it in fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT to eliminate any potential error. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-07-17 12:18:43 -07:00
Sergey Oblomov	a4b8253fa2	MCA/COMMON/UCX: fixed initialization of malloc hooks Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-17 20:09:50 +03:00
Sergey Oblomov	1c7ae22dfb	MCA/COMMON/UCX: shift opal memhooks into common UCX Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-17 13:46:38 +03:00
Ralph Castain	4a596d35f7	Remove the PMIx ext4x component Update configury to redirect anything at or above v3 to the ext3x component Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-13 19:51:50 -07:00
Nathan Hjelm	9d3a79925b	btl/vader: fix bugs in rma emulation This commit fixes two bugs in the RMA/atomic emulation code: 1) Fix a fragment leak when using AMO emulation. 2) Always initialize the single-copy emulation code. This is required to use the AMO support. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-12 15:50:50 -06:00
Ralph Castain	fdca304268	Default to external PMIx installation Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-10 16:12:52 -07:00
Nathan Hjelm	8b090103e2	opal/fifo: fix 128-bit atomic fifo on Power9 This commit updates the atomic fifo code to fix a consistency issue observed on Power9 systems when builtin atomics are used. The cause was two things: 1) a missing write memory barrier in fifo push, and 2) a read ordering issue when reading the fifo head non-atomically. This commit fixes both issues and appears to correct the inconsistency. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-10 15:37:11 -06:00
KAWASHIMA Takahiro	3e179ba95f	hwloc/external: Suppress missing-include-dirs warning If OMPI is configured with `--with-hwloc=external` or `--with-hwloc=DIR` and gfortran is used, I see a lot of warnings when compiling files under the `ompi/mpi/fortran` directory. ``` f951: Warning: Nonexistent include directory 'BUILD_DIR/opal/mca/hwloc/external/hwloc/include' [-Wmissing-include-dirs] ``` There is no such `include` directory in the source tree and `configure`- created tree. I think these lines in the `configure.m4` file are wrongly copied from that for the embedded `hwlocXXX` component in the past. The `-Wmissing-include-dirs` option is enabled in gfortran by default but it is not enabled by default (or even with `-Wall`) in gcc and g++. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2018-07-09 10:55:33 +09:00
Yossi Itigin	e77e31b50b	Merge pull request #5378 from hoopoepg/topic/unify-ucx-logging MCA/COMMON/UCX: unified logging across all UCX modules	2018-07-08 12:45:26 +03:00
Ralph Castain	17c4cf0db8	Install PMIx v3.0.0 release Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-07-06 06:38:02 -07:00
Sergey Oblomov	bef47b792c	MCA/COMMON/UCX: unified logging across all UCX modules - added common logging infrastructure for all UCX modules - all UCX modules are switched to new infra Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-05 16:25:39 +03:00
Sergey Oblomov	8080283b3d	MCA/COMMON/UCX: changed return type for wait_request - for now wait_request returns OMPI status - updated callers Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-04 23:29:38 +03:00
Sergey Oblomov	f574c14e3a	ATOMICS/UCX: redefine atomic module API - now it accepts integer values directly instead of pointers Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-04 14:41:45 +03:00
Yossi Itigin	4962651567	Merge pull request #5366 from hoopoepg/topic/mca-common-ucx-unify-2 MCA/COMMON/UCX: minor unification of del_procs calls	2018-07-04 14:38:37 +03:00
Nathan Hjelm	bd5cd62df9	btl/ugni: fix up some warnings Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-03 16:30:44 -06:00
Nathan Hjelm	d8916a4672	btl/ugni: fix race condition in completing frags The descriptor flags field in a fragment was being read after the fragment may have been freed. This commit reads the flags before calling the user callback. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-03 10:48:54 -06:00
Nathan Hjelm	87d41da62b	btl/vader: add support for atomics and emulated rdma This commit adds support for atomic operations as well as rdma for systems without rdma support. This support is implemented using an internal send tag. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-02 13:57:11 -06:00
Nathan T. Weeks	08f9ae97ee	btl/ugni: update BTL_VERBOSE argument list Signed-off-by: Nathan T. Weeks <weeks@iastate.edu>	2018-07-02 09:23:30 -06:00
Sergey Oblomov	c2bd6af9f2	MCA/COMMON/UCX: minor unification of del_procs calls - some common functionality of del_procs calls is moved into mca_common module - blocking ucp_put call is replaced by non-blocking routine Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-07-02 15:10:53 +03:00
Jeff Squyres	7b0dd03e92	tcp/btl: fix a cast The current cast is functional, but isn't really the way it should be done. This commit makes the cast the way it should be done. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-29 07:25:46 -07:00
Jeff Squyres	57bc657e7f	btl/tcp: fix hash map usage Fix two facepalms: 1. The "uint32" in the hash map functions refer to the key size, not the value size. The values are always 64 bits. 2. Pass the straight value to the "set" functions -- not the pointer to the value. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-28 15:29:41 -07:00
Yossi Itigin	3a7271ef4e	Merge pull request #5344 from hoopoepg/topic/mca-common-ucx-fixed-build MCA/COMMON/UCX: fixed build scripts	2018-06-28 15:14:04 +03:00
Sergey Oblomov	624d59604b	MCA/COMMON/UCX: minor optimization of build scripts Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-28 12:58:07 +03:00
Thananon Patinyasakdikul	304cf97ab5	Merge pull request #5334 from thananon/ofi_progress_fix btl/ofi: progress now happens after a threshold.	2018-06-27 12:51:33 -07:00
Sergey Oblomov	de8568c822	MCA/COMMON/UCX: enabled fallback into older UCX API Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-27 19:59:40 +03:00
Sergey Oblomov	1223b05811	MCA/COMMON/UCX: fixed build scripts - updated evaluation of UCX lib - used call from UCX v1.3 - updated makefile compilation flags Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-27 11:10:25 +03:00
Thananon Patinyasakdikul	be76896f7c	btl/ofi: progress now happens after a threshold. This commit changes the way btl/ofi calls progress. Previously, we forced progression with every rdma/atomic call, which gave a performance boost in some cases and a slowdown in others. Now we only force progression after some number of rdma calls, which results in better performance overall. Also added new MCA parameter 'mca_btl_ofi_progress_threshold' to set the threshold number. The new default is 64. Also: Added FI_DELIVERY_COMPLETE to tx_rtx flags to ensure that the completion is generated after the message has been received on the remote side. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-26 10:39:45 -07:00
Nathan Hjelm	b0ac6276a6	btl/ugni: improve multi-threaded RDMA performance This commit improves the injection rate and latency for RDMA operations. This is done by the following improvements: - If C11's _Thread_local keyword is available then always use the same virtual device index for the same thread when using RDMA. If the keyword is not available then attempt to use any device that isn't already in use. The binding support is enabled by default but can be disabled via the btl_ugni_bind_devices MCA variable. - When posting FMA and RDMA operations always attempt to reap completions after posting the operation. This allows us to better balance the work of reaping completions across all application threads. - Limit the total number of outstanding BTE transactions. This fixes a performance bug when using many threads. - Split out RDMA and local SMSG completion queue sizes. The RDMA queue size is better tuned for performance with RMA-MT. - Split out put and get FMA limits. The old btl_ugni_fma_limit MCA variable is deprecated. The new variable names are: btl_ugni_fma_put_limit and btl_ugni_fma_get_limit. - Change how post descriptors are handled. They are no longer allocated separately from the RDMA endpoints. - Some cleanup to move error code out of the critical path. - Disable the FMA sharing flag on the CDM when we detect that there should be enough FMA descriptors for the number of virtual devices we plan to create. If the user sets this flag we will not unset it. This change should improve the small-message RMA performance by ~ 10%. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-26 11:31:35 -06:00
Ralph Castain	0ddbc75ce5	Merge pull request #4930 from kizill/fix-ipv6 fixed ipv6 OOB connection problems (fix issue #1585)	2018-06-26 09:13:53 -07:00
Nathan Hjelm	abb87f9137	Merge pull request #5338 from ggouaillardet/topic/uct btl/uct: misc fixes	2018-06-26 08:56:40 -06:00
Yossi Itigin	ee873f4f79	Merge pull request #5322 from hoopoepg/topic/mca-ucx-common MCA/UCX: added common module	2018-06-26 13:54:12 +03:00
Gilles Gouaillardet	b40b835a70	btl/uct: remove debug code Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-06-26 16:03:16 +09:00
Gilles Gouaillardet	552d0809aa	btl/uct: add missing include file Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-06-26 14:53:02 +09:00
Nathan Hjelm	6c089518e7	btl/uct: make uct endpoints array a flexible array member Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-25 18:14:58 -06:00
Nathan Hjelm	c5c5b42307	btl: add a new btl for the UCT layer in OpenUCX This commit adds a new btl for one-sided and two-sided communication. This btl uses the uct layer in OpenUCX. This btl makes use of multiple uct contexts and per-thread device pinning to provide good performance when using threads and osc/rdma. This btl has been tested extensively with osc/rdma and passes all MTT tests on aries and IB hardware. For now this new component disables itself but can be enabled by setting the btl_uct_memory_domains MCA variable with a comma-delimited list of supported memory domains/transport layers. For example: --mca btl_uct_memory_domains ib/mlx5_0. The specific transports used can be selected using --mca btl_uct_transports. The default is to use any available transport. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-25 18:14:58 -06:00
Sergey Oblomov	bf7fd480e9	MCA/COMMON/UCX: added non-blocking implementations of atomics - added implementation of swap/cswap/fadd operations - blocking add64 is replaced by non-blocking routine Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-25 12:25:31 +03:00
Sergey Oblomov	63e7ba6843	MCA/COMMON/UCX: added parameter for UCX/opal progress - added parameter to set UCX/opal progresses - minor refactoring of request wait routines Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-25 11:00:12 +03:00
Jeff Squyres	3767ce27c0	btl/tcp: trivial whitespace clean No code/logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 08:04:12 -07:00
Jeff Squyres	9034717876	btl/tcp: use a hash map for kernel IP interface indexes The giant size of the TCP proc struct is causing a problem in some environments (because it is allocated on the stack), and it was too big, anyway. Instead, use a hash map. That way, it starts small and can grow if it needs to. It also makes no assumptions about the values of the kernel interface indexes. Fixes #5292. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 08:03:30 -07:00
Jeff Squyres	e3d6c5ce3a	pmix3/pmix_server.c: minor compiler warning stomp Submitted upstream https://github.com/pmix/pmix/pull/776. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-23 06:35:09 -07:00
Sergey Oblomov	d57ae62dee	MCA/UCX: added common module - implemented non-blocking routines for flush operations Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-06-22 16:41:09 +03:00
Jeff Squyres	e305e80aff	Merge pull request #5317 from jsquyres/pr/update-bind-to-cpulist-option orterun: use consistent CLI option name for --bind-to	2018-06-21 12:43:18 -04:00
Jeff Squyres	4603852740	orterun: use consistent CLI option name for --bind-to Since the new binding option is tied to the --cpu-list orterun CLI option, make the --bind-to option reflect the same name (vs. the --cpu-set CLI option, which is entirely different). For example: mpirun --bind-to cpu-list:ordered ... Note that "--bind-to cpulist:ordered" is accepted as a synonym, because people will be lazy. Also add some minor updates to the orterun.1in man page for clarification. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-21 08:22:00 -07:00
Edgar Gabriel	fb16d40775	Merge pull request #5196 from edgargabriel/topic/cuda io/ompio: introduce initial support for cuda buffers in ompio	2018-06-21 10:14:43 -05:00
Edgar Gabriel	8c2ea0ef49	opal/datatype: add additional interface to retrieve more details about cuda buffer The existing interface in opal_datatype_cuda does not allow distinguishing whether a buffer is a managed or unmanaged cuda buffer. Add an interface that allows retrieving this information through a convertor, since the information is actually available in the mca_common_cuda_* routines. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-06-21 09:25:50 -05:00
Ralph Castain	f17d47087a	Define a new binding method and qualifier Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-20 21:26:09 -07:00
Ralph Castain	5ac2ce6346	Cover all the PMIx data types Cover all data types for OPAL-to-PMIx conversion, generating error logs when we hit something we don't support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-20 09:06:19 -07:00
Boris Karasev	39c9cb12bb	pmix/ext2x: fixed detection PMIx v2.0 by pmix component Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-06-20 13:23:51 +03:00
Ralph Castain	08707c9762	Sync to updated PMIx v3.0.0rc Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-19 21:25:43 -07:00
Noah Evans	a64abadf97	Fix mca_base_var_files separator In the opal list parsing behavior paths should be separated by ':' while files are separated by ','. In the opal and pmix code (the pmix fix is in a separate commit) there was a mistake in the parsing such that files were being separated by ':' when they should be separated by ','s. This commit attempts to address this mismatch. Signed-off-by: Noah Evans <noah.evans@gmail.com>	2018-06-19 12:59:07 -06:00
Ralph Castain	f0a0d606a0	Correct accounting for tools Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 1be080f7b92bad39745f42628a8cb6afefad2d2a)	2018-06-18 13:24:25 -07:00
Jeff Squyres	cd8d169599	Merge pull request #5276 from jsquyres/pr/info-key-len-fix util/info: tighten up error detection on key length	2018-06-18 12:03:21 -04:00
Thananon Patinyasakdikul	13f58f3191	Merge pull request #5274 from thananon/ofi_sep btl/ofi: add scalable endpoint support.	2018-06-18 08:41:06 -07:00
Jeff Squyres	266d5b2110	Merge pull request #5277 from jsquyres/pr/cygwin-patch external libevent: fix for Cygwin	2018-06-18 11:08:30 -04:00
Ralph Castain	7981818b84	Update PMIx atomics Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-17 10:03:49 -07:00
Ralph Castain	fa18ba395d	Sync to latest PMIx v3.0rc Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-17 02:41:46 -07:00
Ralph Castain	ac7bb15505	Fix other typo in help message Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-16 16:30:52 -07:00
Ralph Castain	8cfce583c0	Correct typo to properly check for PMIx v4 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-16 16:29:05 -07:00
Jeff Squyres	07c8ec6a3c	external libevent: fix for Cygwin Fix from Marco Atzeri for building on Cygwin. Signed-off-by: Marco Atzeri <marco.atzeri@gmail.com> Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-16 09:08:58 -07:00
Jeff Squyres	2670a7f55c	util/info: tighten up error detection on key length Fix CID 1435996: use the proper % type to render the size. Also use opal_output(), not fprintf(). For debug builds, abort without dumping core (dumping core is very unfriendly when running thousands of automated tests) -- the stderr output is sufficient to find the coding error. For non-debug builds, truncate the key and emit a warning that it almost certainly will not work properly. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-16 08:45:03 -07:00
Ralph Castain	94794011f1	Silence warnings and ignore test binary Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-15 11:32:22 -07:00
Ralph Castain	e2e6da4379	Merge pull request #5258 from rhc54/topic/pmix4 Sync to PMIx v3.0rc and add ext4x	2018-06-15 09:41:08 -07:00
Thananon Patinyasakdikul	dae3c9447c	btl/ofi: add scalable endpoint support. This commit adds support for scalable endpoints to enhance multithreaded application performance. The BTL will detect support from the OFI provider and will fall back to normal usage if scalable endpoints are not supported. NEW MCA parameters: - mca_btl_ofi_disable_sep: force the btl to not use scalable endpoints. - mca_btl_ofi_num_contexts_per_module: number of communication contexts to create (should be the same as the number of threads). Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-14 15:44:29 -07:00
Thananon Patinyasakdikul	08cfacddee	Merge pull request #5272 from thananon/opal__thread opal/threads: Added keyword `opal_thread_local` for TLS	2018-06-14 14:45:47 -07:00
Thananon Patinyasakdikul	469a404c3b	opal/thread: Added keyword `opal_thread_local` for TLS. configure: add checks for `__thread` on top of the current check for `_Thread_local` and define OPAL_HAVE_THREAD_LOCAL if the compiler supports TLS. Added `opal_thread_local` keyword to unify the definition. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-14 13:25:04 -07:00
Nathan Hjelm	009a3f093b	opal/asm: fix 32-bit fetch and op atomics These functions were erroneously returning the new, not the old value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-14 13:54:52 -06:00
Howard Pritchard	7dcab6e4a4	Merge pull request #5269 from hppritcha/topic/squash_gcc7.3.0_warnings topo/treematch - quash compiler warning	2018-06-13 21:13:04 -05:00
Howard Pritchard	64de269cc3	topo/treematch - quash compiler warning quash a compiler warning showing up with gcc 7.3 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-06-13 16:34:17 -05:00
Thananon Patinyasakdikul	390d72addd	Merge pull request #4885 from davideberius/spc_pr Initial Software-based Performance Counters PR	2018-06-12 14:04:49 -07:00
Jeff Squyres	591b225527	Merge pull request #5245 from PeterGottesman/hwloc-fix Ensure required hwloc directories are in dist tarballs	2018-06-12 13:08:35 -04:00
Peter Gottesman	afe363b73b	Ensure required hwloc directories are in dist tarballs Fixes MTT failure when running autogen on a tarball Signed-off-by: Peter Gottesman <pgottesm@cisco.com>	2018-06-12 08:33:50 -07:00
Howard Pritchard	b43f94895f	Merge pull request #5261 from thananon/ofi_new_mr btl/ofi: change required mr mode bits.	2018-06-11 20:54:09 -06:00
David Eberius	d377a6b6f4	Added Software-based Performance Counters driver code along with several counters. This code is the implementation of Software-based Performance Counters as described in the paper 'Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI. All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations, which were a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example. Signed-off-by: David Eberius <deberius@vols.utk.edu>	2018-06-11 22:48:16 -04:00
Thananon Patinyasakdikul	b6e07bfac6	btl/ofi: change required mr mode bits. FI_MR_UNSPEC is not supposed to be used beyond ofi version 1.5. This commit replaces FI_MR_UNSPEC with the new FI_MR_BASIC mode bits (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR). The btl functionality remains the same. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-11 09:25:57 -07:00
Ralph Castain	48f27655a6	Sync to PMIx v3.0rc and add ext4x Sync to the draft rc for PMIx v3.0. Add an external component for PMIx master, which is at v4.0 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-11 05:54:23 -07:00
Jeff Squyres	84701cd2b0	Merge pull request #5204 from markalle/info_snprintf fix info-subscribe to use snprintf() and warn on long key	2018-06-08 15:22:55 -04:00
Nicolas Morey-Chaisemartin	038ceb0b61	btl: openib: Add support for Broadcom Cumulus RoCE HCA Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>	2018-06-07 15:16:58 +02:00
Nicolas Morey-Chaisemartin	d3416345ab	btl: openib: add support for QLogic RoCE HCA Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>	2018-06-07 15:16:41 +02:00
Arm	ce4d769aee	Merge pull request #5235 from thananon/ofi_einr_fix btl/ofi: handles FI_EINTR properly.	2018-06-06 11:30:48 -07:00
Ralph Castain	840fb42f93	PMIx rte component does support dynamics Minor cleanups Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-05 21:55:19 -07:00
Thananon Patinyasakdikul	f34a73af70	btl/ofi: handles FI_EINTR properly. The OFI sockets provider will sometimes return FI_EINTR. Apparently this flag does not exist in OFI version 1.5. This commit adds an ifdef check. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-05 18:40:00 -07:00
Nathan Hjelm	8192eb9aa6	patch/overwrite: add support for aarch64 This commit adds patcher support of aarch64. The current implementation uses mov, movk, and br to perform the jump. This uses 5 instructions. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-06-04 16:04:26 -06:00
Arm	555029b323	Merge pull request #5216 from thananon/btl_ofi btl/ofi: the component now compiles with libfabric < 1.5.	2018-06-04 13:47:00 -07:00
Thananon Patinyasakdikul	c18cf7315f	btl/ofi: configury: will not build if OFI ver < 1.5. This commit fixes a problem where users have libfabric version < 1.5 installed and the build failed because the new MR mode bits do not exist. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-06-04 10:35:50 -07:00
Ralph Castain	014bb3c8de	Fix external hwloc builds Remove spurious comma in header file definition. Remove unused variables Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-06-03 11:24:21 -07:00
Ralph Castain	55ac526a67	Enable the PMIx ompi/rte component Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using: * autogen --no-orte * with external libevent, external hwloc, and external PMIx master * configuring PMIx master with the same libevent and hwloc * execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately. Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)	2018-06-03 07:25:12 -07:00
Ralph Castain	4ceaed3a76	Merge https://github.com/bgoglin/ompi into topic/rte2	2018-06-03 07:24:12 -07:00
Howard Pritchard	623e36de8a	Merge pull request #5205 from thananon/btl_ofi new btl/ofi: RDMA only btl using libfabric.	2018-06-02 13:00:50 -06:00
Mark Allen	93fefc4d70	fix info-subscribe to use snprintf() and warn on long key This checkin mainly concerns our internal info keys that are registering for callbacks via opal_infosubscribe_subscribe(). Those keys need to have an extra __IN_<key>/val stored to preserve their pre-callback value. So that means our internal keys are limited to 5 chars shorter than the usual key length limit. The code previously would have been silently inactive if a large key happened to come in, now it warns and also uses snprintf() to avoid compiler warnings. I'm also making the top-level MPI_Info_set warn if the user uses our reserved "__IN_" prefix. I had wanted the feature to be more invisible than that, but it would require a more sophisticated approach to change that. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2018-06-01 18:31:32 -04:00

5575 commits