
5549 Commits

Author SHA1 Message Date
Howard Pritchard
6564d3d217 remove some dead crs components
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-10-17 10:29:00 -06:00
Brian Barrett
4f19221af2 btl tcp: Simplify module address storage
Today, a btl tcp module is associated with exactly one IP
address (IPv4 or IPv6).  There's no need to reserve space
for both an IPv4 and IPv6 address in the module structure,
since the module will only be associated with one or the
other.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-17 16:21:17 +00:00
Jeff Squyres
290d1c0534 opal/os_path: fix minor string overrun
This is a follow on to 54ca3310ea35b7dc857afc59e00ffbb8f57e15ec to fix
a minor bug identified by Coverity.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-17 08:57:30 -07:00
Sergey Oblomov
df765595e3 COMMON/UCX: suppressed coverity warnings
- suppressed coverity warnings
- added log messages on failed calls

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-17 16:11:03 +03:00
Brian Barrett
2acc4b7e7f btl tcp: Add workaround for "dropped connection" issue
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads.  The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.

This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT.  It also
reduces the scalability of the TCP BTL by increasing start-up
time, but that is better than hanging.

The long term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race.  This patch is a workaround until that patch can
be developed.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-16 18:33:30 -07:00
Nathan Hjelm
37a3f32e47
Merge pull request #5913 from hjelmn/uct_btl_fixes
btl/uct: fix deadlock in connection code
2018-10-16 19:11:54 -06:00
Nathan Hjelm
707d35deeb btl/uct: fix deadlock in connection code
This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.

At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.

This commit also fixes some bugs in the active-message send path:

 - Correctly set all fragment fields in prepare_src.

 - Fix bug when using buffered-send. We were not reading the return
   code correctly (which is in bytes). This resulted in a message
   getting sent multiple times.

 - Don't try to progress sends from the btl_send function when in an
   active-message callback. It could lead to deep recursion and an
   eventual crash if we get a trace like
   send->progress->am_complete->ob1_callback->send->am_complete...

Closes #5820
Closes #5821

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 18:28:47 -06:00
Nathan Hjelm
1c001319d4
Merge pull request #5922 from hjelmn/opal_free_list_really_old_race_we_should_really_fix_now
opal/free_list: fix race condition
2018-10-16 15:27:05 -06:00
Brian Barrett
da1189d771
Merge pull request #5916 from bwbarrett/revert/6acebc4
Revert "Handle error cases in TCP BTL"
2018-10-16 13:54:18 -07:00
Nathan Hjelm
5c770a7bec opal/free_list: fix race condition
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:

```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```

The problem is that in a multithreaded environment it *is* possible for the
free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. This happens if
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic other threads pop all the items
added to the free list.

This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.
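
A minimal sketch of that idea, assuming a simplified stand-in list protected by a single pthread mutex instead of Open MPI's atomic LIFO. All names here (`sketch_list_t`, `sketch_pop`, `sketch_grow_and_take`, `sketch_get`) are hypothetical and this is not the actual patch; it only illustrates that the growing thread can reserve one of the new items before the rest become visible to other threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a free list item. */
typedef struct sketch_item {
    struct sketch_item *next;
} sketch_item_t;

/* Hypothetical stand-in for the free list: a mutex-protected stack. */
typedef struct {
    sketch_item_t *head;
    pthread_mutex_t lock;
    size_t num_per_alloc;
} sketch_list_t;

static void sketch_init (sketch_list_t *list, size_t num_per_alloc)
{
    assert (NULL != list && num_per_alloc > 0);
    list->head = NULL;
    list->num_per_alloc = num_per_alloc;
    pthread_mutex_init (&list->lock, NULL);
}

/* Pop one item, or return NULL if the list is currently empty. */
static sketch_item_t *sketch_pop (sketch_list_t *list)
{
    pthread_mutex_lock (&list->lock);
    sketch_item_t *item = list->head;
    if (NULL != item) {
        list->head = item->next;
    }
    pthread_mutex_unlock (&list->lock);
    return item;
}

/* Grow the list by num_per_alloc items, but keep the first new item for
 * the caller so no other thread can take it. */
static sketch_item_t *sketch_grow_and_take (sketch_list_t *list)
{
    sketch_item_t *reserved = malloc (sizeof (*reserved));
    if (NULL == reserved) {
        return NULL;  /* genuine out-of-memory */
    }
    reserved->next = NULL;

    pthread_mutex_lock (&list->lock);
    for (size_t i = 1 ; i < list->num_per_alloc ; ++i) {
        sketch_item_t *item = malloc (sizeof (*item));
        if (NULL == item) {
            break;  /* partial growth is still usable */
        }
        item->next = list->head;
        list->head = item;
    }
    pthread_mutex_unlock (&list->lock);

    return reserved;
}

/* A NULL return now means allocation failed, not that a race was lost. */
static sketch_item_t *sketch_get (sketch_list_t *list)
{
    sketch_item_t *item = sketch_pop (list);
    if (NULL == item) {
        item = sketch_grow_and_take (list);
    }
    return item;
}
```

With this shape, callers that treat a NULL return as out-of-memory are correct again, which is the assumption the commit message says the rest of the code base was already making.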

Fixes #2921

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 13:17:09 -06:00
Nathan Hjelm
9ea5dfa799 class/opal_fifo: fix warning
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-15 19:18:31 -06:00
Jeff Squyres
cef6cf0ac5 opal: update some string handling
Make various string handling locations a little more robust.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:04:28 -07:00
Jeff Squyres
e6de42c379 opal/util/info.c: fix compiler warning
Remove unused variable.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:03:28 -07:00
Brian Barrett
5162011428 Revert "Handle error cases in TCP BTL"
This reverts commit 6acebc40a194c92ab38a28553c2c8b04eb391820.

This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run.  See
https://github.com/open-mpi/ompi/issues/5849 for more information.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 15:01:54 -07:00
Brian Barrett
8feff49f6c pmix: Fix filename typo
Looks like a filename was missed when pmix sucked in the installdirs
framework.  Fixing the typo fixes "make ctags" and "make cscope".

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 17:05:01 +00:00
Nathan Hjelm
6ed68da870 btl/uct: use the correct tl interface attributes
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-11 11:33:17 -06:00
Brian Barrett
902b57919c btl tcp: Print number of endpoints on error
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Brian Barrett
bd07cc6707 btl tcp: Improve initialization debugging
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.

Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.

Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Gilles Gouaillardet
ca9009e3d0 pmix/ext3x: fix minor typos
refer to PMIx 3x instead of the previous 2x

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:18:36 +09:00
Gilles Gouaillardet
b3bd785ce0 pmix/ext3x: add missing header files into the dist tarball
Refs. open-mpi/ompi#5871

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:17:54 +09:00
Gilles Gouaillardet
67402f5f19 pmix/ext2x: add missing header files into the dist tarball
Refs. open-mpi/ompi#5871

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:17:42 +09:00
Nathan Hjelm
39be6ec15c btl/uct: bug fixes and general improvements
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports are added
to the priority list. This will give better out of the box performance for
multi-threaded codes because the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.

This commit also fixes a number of leaks and a possible deadlock when
using RDMA.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-09 15:15:45 -06:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
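
A minimal sketch of such a wrapper, assuming it is built on `vsnprintf()` and `malloc()` so that it does not depend on any platform `asprintf()`. The name `sketch_asprintf` is hypothetical; this is not the actual `opal_asprintf()` implementation, only an illustration of the "ptr is always NULL on error" guarantee.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: on any failure *ptr is set to NULL and -1 is
 * returned; on success *ptr owns a malloc()ed string and the formatted
 * length (without the terminator) is returned. */
static int sketch_asprintf (char **ptr, const char *fmt, ...)
{
    va_list ap;
    int length;

    assert (NULL != ptr && NULL != fmt);
    *ptr = NULL;  /* the BSD-like guarantee: never leave ptr undefined */

    /* First pass: measure the formatted length. */
    va_start (ap, fmt);
    length = vsnprintf (NULL, 0, fmt, ap);
    va_end (ap);
    if (length < 0) {
        return -1;
    }

    char *buffer = malloc ((size_t) length + 1);
    if (NULL == buffer) {
        return -1;
    }

    /* Second pass: format into the freshly allocated buffer. */
    va_start (ap, fmt);
    length = vsnprintf (buffer, (size_t) length + 1, fmt, ap);
    va_end (ap);
    if (length < 0) {
        free (buffer);
        return -1;
    }

    *ptr = buffer;
    return length;
}
```

A caller can then test either the return code or the pointer, for example `if (NULL == str) { /* handle error */ }`, without relying on platform-specific behavior.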

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Brian Barrett
c087365ead opal/util: Always support BSD behavior of asprintf
Open MPI's developers like to assume that asprintf() always sets
the ptr to NULL on error, but the standard (and Linux glibc) do
not guarantee this.  As a result, we're making opal_asprintf()
always available for developers, which will guarantee that
ptr is set to NULL on error.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Nathan Hjelm
8291f6722d btl/vader: fix race condition in writing header
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-05 16:30:06 -06:00
Ralph Castain
76c11a1496 Remove build product - fix autogen.pl mode
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 13:38:39 -07:00
Ralph Castain
c498a7e77a Protect PMIx from bad configure entry
Ignore with-hwloc=internal or external as those are meaningless to pmix
(will upstream)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 07:07:05 -07:00
Ralph Castain
f379ba9c8e Fail configure if pmix won't build
If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 06:29:36 -07:00
Geoff Paulsen
56a55cdb4b
Merge pull request #5786 from jsquyres/pr/string-madness
Replace strcpy() and strncpy() with (new) opal_string_copy()
2018-10-04 16:12:46 -05:00
Ralph Castain
86702b71bc Correctly notify upon process failure
We only need to pass a custom range if the target is a single process.
Otherwise, we let the range be "session".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-03 20:19:42 -07:00
Nathan Hjelm
dfa8d3a81a btl/vader: work around Oracle compiler bug
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:

_Atomic intptr_t x, y;
intptr_t z = 0;

x = y = z;

Will produce a compiler error of the form:

operand cannot have void type: op "="
assignment type mismatch:
	long "=" void

To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.
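
A minimal sketch of the workaround pattern, assuming a hypothetical structure `sketch_fifo_t` with atomic `head` and `tail` fields (the real vader structures are not shown in this log). Each atomic variable gets its own assignment statement instead of a chained one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure with two atomic fields. */
typedef struct {
    _Atomic intptr_t head;
    _Atomic intptr_t tail;
} sketch_fifo_t;

/* Set both fields without chaining: "fifo->head = fifo->tail = value"
 * is the form the commit message reports as failing to compile. */
static void sketch_fifo_reset (sketch_fifo_t *fifo, intptr_t value)
{
    assert (NULL != fifo);
    fifo->head = value;
    fifo->tail = value;
}
```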

Fixes #5814

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-03 08:55:51 -06:00
Nathan Hjelm
66a7dc4c72 btl/vader: ensure the fast box tag is always read first
On some platforms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-03 07:17:32 -06:00
Ralph Castain
31f6c75498
Merge pull request #5819 from bosilca/fix/local_bind
Fix/local bind
2018-10-02 11:27:36 -07:00
Brian Barrett
a25df3f29e opal: Remove outdated MacOS workaround
Remove the pack/unpack pragma around net/if.h on MacOS, which
was added to fix a bug in MacOS X 10.4.x on 64-bit platforms.
The bug was fixed in Mac OS X 10.5.0 and, sometime in the last
11 years, compilers started emitting warnings about the fact
that the Apple header stomped over the pragma pack settings
from the workaround.  We already don't support versions of MacOS
earlier than 10.5, so there's no point in keeping the workaround.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:46 -04:00
Brian Barrett
19e16d5fd0 opal: Disable memory patcher component on MacOS
Open MPI doesn't support any transports on MacOS which require
memory manager hooks.  The memory patcher component uses the
syscall interface, which has been deprecated in recent versions
of MacOS.  Since we don't need it and it emits warnings about
deprecation, disable the memory patcher component on MacOS.

Fixes #5671

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:15 -04:00
George Bosilca
a3a492b42c
Small pedantic fixes.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:08:18 -04:00
George Bosilca
9164e26e2f
Provide the correct socklen to bind.
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.
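
A minimal sketch of choosing the length from the address family, assuming the address is kept in a `struct sockaddr_storage`. The helper name `sketch_addr_len` is hypothetical and is not taken from the patch.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: return the length that matches the family stored
 * in the address, rather than sizeof(struct sockaddr_storage). */
static socklen_t sketch_addr_len (const struct sockaddr_storage *addr)
{
    assert (NULL != addr);

    switch (addr->ss_family) {
    case AF_INET:
        return (socklen_t) sizeof (struct sockaddr_in);
    case AF_INET6:
        return (socklen_t) sizeof (struct sockaddr_in6);
    default:
        return 0;  /* unsupported family; caller should treat as an error */
    }
}
```

The result would then be passed as the third argument to `bind()`, for example `bind(sd, (struct sockaddr *) &addr, sketch_addr_len(&addr))`.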

Add a clearer error message in debug mode.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:06:40 -04:00
Ralph Castain
fcc1d30ab3
Merge pull request #5775 from karasevb/check_old_topo_key
pmix: check the old topo key to keep compatibility with old RMs
2018-10-02 08:12:42 -07:00
Jeff Squyres
5dae086f7e btl/tcp: output the IP address correctly
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.

1. On the endpoint, it is storing the moral equivalent of a
   (struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).

The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr).  Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).

This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
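
A minimal sketch of that conversion, assuming an IPv4 address held as a bare `struct in_addr`. The helper name `sketch_ipv4_to_string` is hypothetical; only `inet_ntop()` itself comes from the commit message.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: format a bare IPv4 address. inet_ntop() takes a
 * pointer to the in_addr itself, not to an enclosing struct sockaddr. */
static const char *sketch_ipv4_to_string (const struct in_addr *addr,
                                          char *buf, socklen_t buf_len)
{
    assert (NULL != addr && NULL != buf);
    /* Returns buf on success, NULL on failure (for example, buffer too small). */
    return inet_ntop (AF_INET, addr, buf, buf_len);
}
```

A buffer of `INET_ADDRSTRLEN` bytes is large enough for any IPv4 address in dotted-decimal form.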

NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module.  This only happens with interfaces that have
more than one IP address.  This commit does not fix that issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 16:12:57 -07:00
Jeff Squyres
293a938d29 string_copy: assert() fail if the copy is too long
Add a heuristic: if the string copy is "too long", fail an assert().
This is based on the premise that Open MPI doesn't do large string
copies.  So if we see a dest_len that is over a certain threshold
(currently set at 128K), this is likely a programmer error, and on
debug builds, we should fail an assert().  In production builds, it
will work just fine (assuming that it's not a programmer error).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 13:34:15 -07:00
Ralph Castain
172e770154 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-28 12:43:32 -07:00
Ralph Castain
c836c4282c Update to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-27 20:57:23 -07:00
Jeff Squyres
28e51ad4d4 opal: convert from strncpy() -> opal_string_copy()
In many cases, this was a simple string replace.  In a few places, it
entailed:

1. Updating some comments and removing now-redundant foo[size-1]='\0'
   statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
   wants the entire destination buffer length).

This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 11:56:18 -07:00
Jeff Squyres
e1cf8757ae opal/util/string_copy: add "safe" string copy function
Do essentially the same thing as strncpy(3), but a) ensure to always
terminate the destination buffer with a \0, and b) do not \0-pad to
the right.
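
A minimal sketch of those two properties, assuming a hypothetical function named `sketch_string_copy` that takes the full destination buffer length. It is not the actual `opal_string_copy()` code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: copy at most dest_len - 1 characters from src,
 * always write a terminating \0, and never pad the rest of dest. */
static void sketch_string_copy (char *dest, const char *src, size_t dest_len)
{
    assert (NULL != dest && NULL != src && dest_len > 0);

    size_t i = 0;
    while (i + 1 < dest_len && '\0' != src[i]) {
        dest[i] = src[i];
        ++i;
    }
    dest[i] = '\0';  /* bytes after the terminator are left untouched */
}
```

Unlike `strncpy(3)`, a source longer than the buffer is truncated but still terminated, and a short source does not cause the remainder of the buffer to be zero-filled.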

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 10:48:04 -07:00
Jeff Squyres
38fefcf1cf opal/util/Makefile.am: fix whitespace (remove tabs)
No code or logic changes

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 09:43:03 -07:00
Jeff Squyres
6bb356ab87 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-26 12:15:21 -07:00
Boris Karasev
ed42f568ae pmix: check the old topo key to keep compatibility with old RMs
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-09-25 18:13:54 +03:00
Gilles Gouaillardet
db65dbd9a8 ucx: use the c99 __func__ macro instead
__FUNCTION__ macro was never standardized and should not be used.
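
A minimal sketch of using the C99 `__func__` identifier in an error message. The helper names `sketch_format_error` and `sketch_failing_call` are hypothetical and are not taken from the UCX component.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format "<function>: error <code>" into buf. */
static int sketch_format_error (char *buf, size_t len,
                                const char *func, int code)
{
    assert (NULL != buf && NULL != func && len > 0);
    return snprintf (buf, len, "%s: error %d", func, code);
}

/* __func__ expands to the name of the enclosing function in any C99
 * compiler, so no compiler-specific __FUNCTION__ is needed. */
static int sketch_failing_call (char *buf, size_t len)
{
    return sketch_format_error (buf, len, __func__, -1);
}
```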

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-09-25 11:19:18 +09:00
Jeff Squyres
176da51aec mca_base_var: fix output bug about settable vars
Fix the test that determined whether we output "writeable" or
"read-only" for MCA vars (it was checking the wrong flag).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-22 07:18:45 -07:00
Nathan Hjelm
c6230657a2 opal: fix warning
Fixes #5713

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-09-21 00:51:40 -06:00