openmpi

Author	SHA1	Message	Date
Howard Pritchard	6564d3d217	remove some dead crs components Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-17 10:29:00 -06:00
Brian Barrett	4f19221af2	btl tcp: Simplify module address storage Today, a btl tcp module is associated with exactly one IP address (IPv4 or IPv6). There's no need to reserve space for both an IPv4 and IPv6 address in the module structure, since the module will only be associated with one or the other. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-17 16:21:17 +00:00
Jeff Squyres	290d1c0534	opal/os_path: fix minor string overrun This is a follow on to 54ca3310ea35b7dc857afc59e00ffbb8f57e15ec to fix a minor bug identified by Coverity. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-17 08:57:30 -07:00
Sergey Oblomov	df765595e3	COMMON/UCX: suppressed coverity warnings - suppressed coverity warnings - added log messages on failed calls Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-10-17 16:11:03 +03:00
Brian Barrett	2acc4b7e7f	btl tcp: Add workaround for "dropped connection" issue Work around a race condition in the TCP BTL's proc setup code. The Cisco MTT results have been failing on TCP tests due to a "dropped connection" message some percentage of the time. Some digging shows that the issue happens in a combination of multiple NICs and multiple threads. The race is detailed in https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032. This patch doesn't fix the race, but avoids it by forcing the MPI layer to complete all calls to add_procs across the entire job before any process leaves MPI_INIT. It also reduces the scalability of the TCP BTL by increasing start-up time, but that is better than hanging. The long-term fix is to do all endpoint setup in the first call to add_procs for a given remote proc, removing the race. This patch is a workaround until that patch can be developed. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-16 18:33:30 -07:00
Nathan Hjelm	37a3f32e47	Merge pull request #5913 from hjelmn/uct_btl_fixes btl/uct: fix deadlock in connection code	2018-10-16 19:11:54 -06:00
Nathan Hjelm	707d35deeb	btl/uct: fix deadlock in connection code This commit fixes a deadlock that can occur when using a TL that supports the connect-to-endpoint model. The deadlock was occurring while processing an incoming connection request. This was done from an active-message callback. For some unknown reason (at this time) this callback was sometimes hanging. To avoid the issue the connection active-message is saved for later processing. At the same time I cleaned up the connection code to eliminate duplicate messages when possible. This commit also fixes some bugs in the active-message send path: - Correctly set all fragment fields in prepare_src. - Fix bug when using buffered-send. We were not reading the return code correctly (which is in bytes). This resulted in a message getting sent multiple times. - Don't try to progress sends from the btl_send function when in an active-message callback. It could lead to deep recursion and an eventual crash if we get a trace like send->progress->am_complete->ob1_callback->send->am_complete... Closes #5820 Closes #5821 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 18:28:47 -06:00
Nathan Hjelm	1c001319d4	Merge pull request #5922 from hjelmn/opal_free_list_really_old_race_we_should_really_fix_now opal/free_list: fix race condition	2018-10-16 15:27:05 -06:00
Brian Barrett	da1189d771	Merge pull request #5916 from bwbarrett/revert/6acebc4 Revert "Handle error cases in TCP BTL"	2018-10-16 13:54:18 -07:00
Nathan Hjelm	5c770a7bec	opal/free_list: fix race condition There was a race condition in opal_free_list_get. Code throughout the Open MPI codebase was assuming that a NULL return from this function was due to an out-of-memory condition. In some cases this can lead to a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for example). Before this commit opal_free_list_get_mt looked like this: ```c static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist) { opal_free_list_item_t *item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); if (OPAL_UNLIKELY(NULL == item)) { opal_mutex_lock (&flist->fl_lock); opal_free_list_grow_st (flist, flist->fl_num_per_alloc); opal_mutex_unlock (&flist->fl_lock); item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); } return item; } ``` The problem is that in a multithreaded environment it is possible for the free list to be grown successfully but the thread calling opal_free_list_get_mt to be left without an item. This happens if, between the calls to opal_lifo_push_atomic in opal_free_list_grow_st and the call to opal_lifo_pop_atomic, other threads pop all the items added to the free list. This commit fixes the issue by ensuring the thread that successfully grew the free list always gets a free list item. Fixes #2921 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 13:17:09 -06:00
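Note on 5c770a7bec: the fix described above (the thread that grows the list always keeps one item) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the Open MPI implementation: `item_t`, `free_list_t`, and `free_list_get` are invented names, and a mutex-protected pop replaces the atomic LIFO for brevity.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not the real opal_free_list types. */
typedef struct item {
    struct item *next;
} item_t;

typedef struct {
    item_t *head;             /* LIFO of free items */
    pthread_mutex_t lock;     /* stands in for the atomic LIFO and fl_lock */
    size_t num_per_alloc;     /* how many items to allocate per growth */
} free_list_t;

/* Returns NULL only when memory for the caller's own item cannot be
 * allocated, so NULL really does mean "out of memory". */
static item_t *free_list_get(free_list_t *fl)
{
    item_t *it;
    item_t *reserved;
    size_t i;

    /* Fast path: pop an existing item. */
    pthread_mutex_lock(&fl->lock);
    it = fl->head;
    if (it != NULL) {
        fl->head = it->next;
    }
    pthread_mutex_unlock(&fl->lock);
    if (it != NULL) {
        return it;
    }

    /* Slow path: allocate one item that is never pushed onto the shared
     * list, so no other thread can take it away from the grower. */
    reserved = malloc(sizeof *reserved);
    if (reserved == NULL) {
        return NULL;
    }
    reserved->next = NULL;

    /* Publish the remaining new items for other threads. */
    pthread_mutex_lock(&fl->lock);
    for (i = 1; i < fl->num_per_alloc; ++i) {
        item_t *extra = malloc(sizeof *extra);
        if (extra == NULL) {
            break;
        }
        extra->next = fl->head;
        fl->head = extra;
    }
    pthread_mutex_unlock(&fl->lock);

    return reserved;
}
```

The design point is that the grower's item never becomes visible to other threads, which is what closes the window between the push and the second pop in the original code.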
Nathan Hjelm	9ea5dfa799	class/opal_fifo: fix warning Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-15 19:18:31 -06:00
Jeff Squyres	cef6cf0ac5	opal: update some string handling Make various string handling locations a little more robust. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:04:28 -07:00
Jeff Squyres	e6de42c379	opal/util/info.c: fix compiler warning Remove unused variable. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:03:28 -07:00
Brian Barrett	5162011428	Revert "Handle error cases in TCP BTL" This reverts commit 6acebc40a194c92ab38a28553c2c8b04eb391820. This patch is causing numerous "Socket closed" messages which are causing most of the failures on Cisco's MTT run. See https://github.com/open-mpi/ompi/issues/5849 for more information. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-12 15:01:54 -07:00
Brian Barrett	8feff49f6c	pmix: Fix filename typo Looks like a filename was missed when pmix sucked in the installdirs framework. Fixing the typo fixes "make ctags" and "make cscope". Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-12 17:05:01 +00:00
Nathan Hjelm	6ed68da870	btl/uct: use the correct tl interface attributes It is apparently possible for different instances of the same UCT transport to have different limits (max short put for example). To account for this we need to store the attributes per TL context not per TL. This commit fixes the issue. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-11 11:33:17 -06:00
Brian Barrett	902b57919c	btl tcp: Print number of endpoints on error While trying to debug #3035, it's not clear whether there is an issue with the modex data or printing the address list. Print the number of endpoints on the error, which will help determine which case is happening to Cisco. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-10 08:29:48 -07:00
Brian Barrett	bd07cc6707	btl tcp: Improve initialization debugging When creating TCP BTL modules, print more information about the module's ethernet association, including the first address associated with the device, as debug output. Fix a flipped output string for IPv4 and IPv6 addresses in the modex send code. Add the addresses being published in the modex to the debugging output in modex send, to help match failures in endpoint match. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-10 08:29:48 -07:00
Gilles Gouaillardet	ca9009e3d0	pmix/ext3x: fix minor typos refer to PMIx 3x instead of the previous 2x Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-10 13:18:36 +09:00
Gilles Gouaillardet	b3bd785ce0	pmix/ext3x: add missing header files into the dist tarball Refs. open-mpi/ompi#5871 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-10 13:17:54 +09:00
Gilles Gouaillardet	67402f5f19	pmix/ext2x: add missing header files into the dist tarball Refs. open-mpi/ompi#5871 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-10 13:17:42 +09:00
Nathan Hjelm	39be6ec15c	btl/uct: bug fixes and general improvements This commit updates the uct btl to change the transports parameter into a priority list. It adds the dc_mlx5, rc_mlx5, and ud transports to the priority list. This will give better out-of-the-box performance for multi-threaded codes because the *_mlx5 transports can avoid the mlx5 lock inside libmlx5_rdmav2. This commit also fixes a number of leaks and a possible deadlock when using RDMA. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-09 15:15:45 -06:00
Brian Barrett	e9e4d2a4bc	Handle asprintf errors with opal_asprintf wrapper The Open MPI code base assumed that asprintf always behaved like the FreeBSD variant, where ptr is set to NULL on error. However, the C standard (and Linux) only guarantee that the return code will be -1 on error and leave ptr undefined. Rather than fix all the usage in the code, we use opal_asprintf() wrapper instead, which guarantees the BSD-like behavior of ptr always being set to NULL. In addition to being correct, this will fix many, many warnings in the Open MPI code base. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-08 16:43:53 -07:00
Brian Barrett	c087365ead	opal/util: Always support BSD behavior of asprintf Open MPI's developers like to assume that asprintf() always sets the ptr to NULL on error, but the standard (and Linux glibc) do not guarantee this. As a result, we're making opal_asprintf() always available for developers, which will guarantee that ptr is set to NULL on error. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-08 16:43:53 -07:00
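Note on c087365ead and e9e4d2a4bc: a minimal sketch of a wrapper that gives the BSD-like guarantee described above (the output pointer is NULL on every error path). The name `bsd_asprintf` is hypothetical and this is not the `opal_asprintf()` source; it is built on the standard C99 `vsnprintf` rather than on `asprintf` itself, which is an assumption made to keep the example portable.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: on success stores a malloc'ed string in *ptr and
 * returns its length; on any failure stores NULL in *ptr and returns -1. */
static int bsd_asprintf(char **ptr, const char *fmt, ...)
{
    va_list ap, ap_copy;
    int len;
    char *buf;

    *ptr = NULL;                        /* NULL unless we fully succeed */

    va_start(ap, fmt);
    va_copy(ap_copy, ap);               /* the arguments are consumed twice */
    len = vsnprintf(NULL, 0, fmt, ap);  /* first pass: measure the length */
    va_end(ap);

    if (len < 0) {
        va_end(ap_copy);
        return -1;
    }

    buf = malloc((size_t) len + 1);
    if (buf == NULL) {
        va_end(ap_copy);
        return -1;
    }

    vsnprintf(buf, (size_t) len + 1, fmt, ap_copy);  /* second pass: format */
    va_end(ap_copy);

    *ptr = buf;
    return len;
}
```

With this guarantee a caller can test either the return value or the pointer, and calling `free()` on the pointer after a failure is harmless.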
Nathan Hjelm	8291f6722d	btl/vader: fix race condition in writing header Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-05 16:30:06 -06:00
Ralph Castain	76c11a1496	Remove build product - fix autogen.pl mode Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-05 13:38:39 -07:00
Ralph Castain	c498a7e77a	Protect PMIx from bad configure entry Ignore with-hwloc=internal or external as those are meaningless to pmix (will upstream) Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-05 07:07:05 -07:00
Ralph Castain	f379ba9c8e	Fail configure if pmix won't build If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-05 06:29:36 -07:00
Geoff Paulsen	56a55cdb4b	Merge pull request #5786 from jsquyres/pr/string-madness Replace strcpy() and strncpy() with (new) opal_string_copy()	2018-10-04 16:12:46 -05:00
Ralph Castain	86702b71bc	Correctly notify upon process failure We only need to pass a custom range if the target is a single process. Otherwise, we let the range be "session". Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-03 20:19:42 -07:00
Nathan Hjelm	dfa8d3a81a	btl/vader: work around Oracle compiler bug This commit works around an Oracle C compiler bug in 5.15 (not sure when it was introduced). The bug is triggered when we chain assignments of atomic variables. Ex: _Atomic intptr_t x, y; intptr_t z = 0; x = y = z; Will produce a compiler error of the form: operand cannot have void type: op "=" assignment type mismatch: long "=" void To work around the issue we are removing the chain assignment and setting the head and tail on different lines. Fixes #5814 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-03 08:55:51 -06:00
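Note on dfa8d3a81a: the workaround amounts to splitting a chained assignment of atomic variables into separate statements. A small sketch using C11 atomics follows; `box_queue_t` and `box_queue_reset` are hypothetical names, not the vader data structures.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical queue with atomic head and tail indices. */
typedef struct {
    _Atomic intptr_t head;
    _Atomic intptr_t tail;
} box_queue_t;

static void box_queue_reset(box_queue_t *q)
{
    intptr_t empty = 0;

    /* Chained form that the commit message reports as failing to compile:
     *     q->head = q->tail = empty;
     * Equivalent form with one assignment per statement: */
    q->head = empty;
    q->tail = empty;
}
```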
Nathan Hjelm	66a7dc4c72	btl/vader: ensure the fast box tag is always read first On some platforms reading a 64-bit value is non-atomic and it is possible that the two 32-bit values are read in the wrong order. To ensure the tag is always read first this commit reads the tag before reading the full 64-bit value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-03 07:17:32 -06:00
Ralph Castain	31f6c75498	Merge pull request #5819 from bosilca/fix/local_bind Fix/local bind	2018-10-02 11:27:36 -07:00
Brian Barrett	a25df3f29e	opal: Remove outdated MacOS workaround Remove the pack/unpack pragma around net/if.h on MacOS, which was added to fix a bug in MacOS X 10.4.x on 64-bit platforms. The bug was fixed in Mac OS X 10.5.0 and, sometime in the last 11 years, compilers started emitting warnings about the fact that the Apple header stomped over the pragma pack settings from the workaround. We already don't support versions of MacOS earlier than 10.5, so there's no point in keeping the workaround. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-02 13:35:46 -04:00
Brian Barrett	19e16d5fd0	opal: Disable memory patcher component on MacOS Open MPI doesn't support any transports on MacOS which require memory manager hooks. The memory patcher component uses the syscall interface, which has been deprecated in recent versions of MacOS. Since we don't need it and it emits warnings about deprecation, disable the memory patcher component on MacOS. Fixes #5671 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-02 13:35:15 -04:00
George Bosilca	a3a492b42c	Small pedantic fixes. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-02 12:08:18 -04:00
George Bosilca	9164e26e2f	Provide the correct socklen to bind. Get Brian's patch from #5825 and his log message: Fix a failure in binding the initiating side of a connection on MacOS. MacOS doesn't like passing the size of the storage structure (sockaddr_storage) instead of the expected size of the structure (sockaddr_in or sockaddr_in6), which was causing bind() failures. This patch simply changes the structure size to the expected size. Add a more clear error message in debug mode. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-02 12:06:40 -04:00
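Note on 9164e26e2f: a minimal sketch of choosing the length to pass to `bind()` from the address family instead of using `sizeof(struct sockaddr_storage)`. The helper name `sockaddr_len_for` is hypothetical and is not taken from the patch.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: return the size of the concrete address structure
 * held in the storage, or 0 if the family is not handled. */
static socklen_t sockaddr_len_for(const struct sockaddr_storage *ss)
{
    switch (ss->ss_family) {
    case AF_INET:
        return (socklen_t) sizeof(struct sockaddr_in);
    case AF_INET6:
        return (socklen_t) sizeof(struct sockaddr_in6);
    default:
        return 0;
    }
}
```

A caller would then use something like `bind(sd, (struct sockaddr *) &ss, sockaddr_len_for(&ss))`, after checking that the returned length is non-zero.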
Ralph Castain	fcc1d30ab3	Merge pull request #5775 from karasevb/check_old_topo_key pmix: check the old topo key to keep compatibility with old RMs	2018-10-02 08:12:42 -07:00
Jeff Squyres	5dae086f7e	btl/tcp: output the IP address correctly Per https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673, it looks like the IP address for a given interface is being stashed in two places: on the endpoint and on the module. 1. On the endpoint, it is storing the moral equivalent of a (struct sockaddr_in.sin_addr). 2. On the module, it is storing a full (struct sockaddr_storage). The call to opal_net_get_hostname() expects a full (struct sockaddr*) -- not just the stripped-down (struct sockaddr_in.sin_addr). Hence, when the original code was passing in the endpoint's (struct sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it like a (struct sockaddr), hilarity ensued (i.e., we got the wrong output). This commit eliminates the call to opal_net_get_hostname() and just calls inet_ntop() directly to convert the (struct sockaddr_in.sin_addr) to a string. NOTE: Per the github comment cited above, there can be a disparity between the IP address cached on the endpoint vs. the IP address cached on the module. This only happens with interfaces that have more than one IP address. This commit does not fix that issue. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-01 16:12:57 -07:00
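Note on 5dae086f7e: a minimal sketch of converting a bare IPv4 address (a `struct in_addr`, which is what `sin_addr` holds) to text with `inet_ntop()`, rather than passing it to a function that expects a full `struct sockaddr`. The wrapper name `ipv4_to_string` is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical wrapper: format an IPv4 address into buf.
 * Returns buf on success or NULL on failure (for example, buf too small). */
static const char *ipv4_to_string(const struct in_addr *addr,
                                  char *buf, socklen_t buf_len)
{
    return inet_ntop(AF_INET, addr, buf, buf_len);
}
```

A buffer of `INET_ADDRSTRLEN` bytes is large enough for any IPv4 address in dotted-decimal form.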
Jeff Squyres	293a938d29	string_copy: assert() fail if the copy is too long Add a heuristic: if the string copy is "too long", fail an assert(). This is based on the premise that Open MPI doesn't do large string copies. So if we see a dest_len that is over a certain threshold (currently set at 128K), this is likely a programmer error, and on debug builds, we should fail an assert(). In production builds, it will work just fine (assuming that it's not a programmer error). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-01 13:34:15 -07:00
Ralph Castain	172e770154	Sync to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-28 12:43:32 -07:00
Ralph Castain	c836c4282c	Update to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-27 20:57:23 -07:00
Jeff Squyres	28e51ad4d4	opal: convert from strncpy() -> opal_string_copy() In many cases, this was a simple string replace. In a few places, it entailed: 1. Updating some comments and removing now-redundant foo[size-1]='\0' statements. 2. Updating passing (size-1) to (size) (because opal_string_copy() wants the entire destination buffer length). This commit actually fixes a bunch of potential (yet quite unlikely) bugs where we could have ended up with non-null-terminated strings. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-27 11:56:18 -07:00
Jeff Squyres	e1cf8757ae	opal/util/string_copy: add "safe" string copy function Do essentially the same thing as strncpy(3), but a) ensure to always terminate the destination buffer with a \0, and b) do not \0-pad to the right. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-27 10:48:04 -07:00
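Note on e1cf8757ae: a minimal sketch of a copy routine with the two properties listed above (the destination is always terminated with `\0`, and the rest of the buffer is not `\0`-padded). The name `string_copy` and its exact behavior for a zero-length buffer are assumptions; this is not the `opal_string_copy()` source.

```c
#include <stddef.h>

/* Hypothetical sketch: copy src into dest, where dest_len is the size of
 * the whole destination buffer, including room for the terminator. */
static void string_copy(char *dest, const char *src, size_t dest_len)
{
    size_t i = 0;

    if (dest_len == 0) {
        return;                 /* no room even for the terminator */
    }

    /* Copy at most dest_len - 1 characters, stopping at the end of src. */
    while (i < dest_len - 1 && src[i] != '\0') {
        dest[i] = src[i];
        ++i;
    }

    dest[i] = '\0';             /* always terminate; bytes after it are untouched */
}
```

Because the function takes the full buffer length, callers pass `sizeof(buf)` rather than `sizeof(buf) - 1`.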
Jeff Squyres	38fefcf1cf	opal/util/Makefile.am: fix whitespace (remove tabs) No code or logic changes Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-27 09:43:03 -07:00
Jeff Squyres	6bb356ab87	Squash a bunch of harmless compiler warnings. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-26 12:15:21 -07:00
Boris Karasev	ed42f568ae	pmix: check the old topo key to keep compatibility with old RMs Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2018-09-25 18:13:54 +03:00
Gilles Gouaillardet	db65dbd9a8	ucx: use the c99 __func__ macro instead __FUNCTION__ macro was never standardized and should not be used. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-09-25 11:19:18 +09:00
Jeff Squyres	176da51aec	mca_base_var: fix output bug about settable vars Fix the test that determined whether we output "writeable" or "read-only" for MCA vars (it was checking the wrong flag). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-22 07:18:45 -07:00
Nathan Hjelm	c6230657a2	opal: fix warning Fixes #5713 Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-09-21 00:51:40 -06:00


5549 commits