1
1
Граф коммитов

5442 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
1c001319d4
Merge pull request #5922 from hjelmn/opal_free_list_really_old_race_we_should_really_fix_now
opal/free_list: fix race condition
2018-10-16 15:27:05 -06:00
Brian Barrett
da1189d771
Merge pull request #5916 from bwbarrett/revert/6acebc4
Revert "Handle error cases in TCP BTL"
2018-10-16 13:54:18 -07:00
Nathan Hjelm
5c770a7bec opal/free_list: fix race condition
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:

```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```

The problem is in a multithreaded environment is *is* possible for the
free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. The happens if
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic other threads pop all the items
added to the free list.

This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.

Fixes #2921

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 13:17:09 -06:00
Nathan Hjelm
9ea5dfa799 class/opal_fifo: fix warning
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-15 19:18:31 -06:00
Jeff Squyres
cef6cf0ac5 opal: update some string handling
Make various string handling locations a little more robust.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:04:28 -07:00
Jeff Squyres
e6de42c379 opal/util/info.c: fix compiler warning
Remove unused variable.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:03:28 -07:00
Brian Barrett
5162011428 Revert "Handle error cases in TCP BTL"
This reverts commit 6acebc40a1.

This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run.  See
https://github.com/open-mpi/ompi/issues/5849 for more information.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 15:01:54 -07:00
Brian Barrett
8feff49f6c pmix: Fix filename typo
Looks like a filename was missed when pmix sucked in the installdirs
framework.  Fixing the typo fixes "make ctags" and "make cscope".

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 17:05:01 +00:00
Nathan Hjelm
6ed68da870 btl/uct: use the correct tl interface attributes
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-11 11:33:17 -06:00
Brian Barrett
902b57919c btl tcp: Print number of endpoints on error
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Brian Barrett
bd07cc6707 btl tcp: Improve initialization debugging
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.

Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.

Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Gilles Gouaillardet
ca9009e3d0 pmix/ext3x: fix minor typos
refer PMIx 3x instead of previous 2x

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:18:36 +09:00
Gilles Gouaillardet
b3bd785ce0 pmix/ext3x: add missing header files into the dist tarball
Refs. open-mpi/ompi#5871

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:17:54 +09:00
Gilles Gouaillardet
67402f5f19 pmix/ext2x: add missing header files into the dist tarball
Refs. open-mpi/ompi#5871

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:17:42 +09:00
Nathan Hjelm
39be6ec15c btl/uct: bug fixes and general improvements
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the
priority list. This will give better out of the box performance for
multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.

This commit also fixes a number of leaks and a possible deadlock when
using RDMA.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-09 15:15:45 -06:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Brian Barrett
c087365ead opal/util: Always support BSD behavior of asprintf
Open MPI's developers like to assume that asprintf() always sets
the ptr to NULL on error, but the standard (and Linux glibc) do
not guarantee this.  As a result, we're making opal_asprintf()
always available for developers, which will guarantee that
ptr is set to NULL on error.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Nathan Hjelm
8291f6722d btl/vader: fix race condition in writing header
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-05 16:30:06 -06:00
Ralph Castain
76c11a1496 Remove build product - fix autogen.pl mode
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 13:38:39 -07:00
Ralph Castain
c498a7e77a Protect PMIx from bad configure entry
Ignore with-hwloc=internal or external as those are meaningless to pmix
(will upstream)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 07:07:05 -07:00
Ralph Castain
f379ba9c8e Fail configure if pmix won't build
If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 06:29:36 -07:00
Geoff Paulsen
56a55cdb4b
Merge pull request #5786 from jsquyres/pr/string-madness
Replace strcpy() and strncpy() with (new) opal_string_copy()
2018-10-04 16:12:46 -05:00
Ralph Castain
86702b71bc Correctly notify upon process failure
We only need to pass a custom range if the target is a single process.
Otherwise, we let the range be "session".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-03 20:19:42 -07:00
Nathan Hjelm
dfa8d3a81a btl/vader: work around Oracle compiler bug
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:

_Atomic intptr x, y;
intptr_t z = 0;

x = y = z;

Will produce a compiler error of the form:

operand cannot have void type: op "="
assignment type mismatch:
	long "=" void

To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.

Fixes #5814

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-03 08:55:51 -06:00
Nathan Hjelm
66a7dc4c72 btl/vader: ensure the fast box tag is always read first
On some platfoms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-03 07:17:32 -06:00
Ralph Castain
31f6c75498
Merge pull request #5819 from bosilca/fix/local_bind
Fix/local bind
2018-10-02 11:27:36 -07:00
Brian Barrett
a25df3f29e opal: Remove outdated MacOS workaround
Remove the pack/unpack pragma around net/if.h on MacOS, which
was added to fix a bug in MacOS X 10.4.x on 64-bit platforms.
The bug was fixed in Mac OS X 10.5.0 and, sometime in the last
11 years, compilers started emitting warnings about the fact
that the Apple header stomped over the pragma pack settings
from the workaround.  We already don't support versions of MacOS
earlier than 10.5, so there's no point in keeping the workaround.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:46 -04:00
Brian Barrett
19e16d5fd0 opal: Disable memory patcher component on MacOS
Open MPI doesn't support any transports on MacOS which require
memory manager hooks.  The memory patcher component uses the
syscall interface, which has been deprecated in recent versions
of MacOS.  Since we don't need it and it emits warnings about
deprecation, disable the memory patcher component on MacOS.

Fixes #5671

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:15 -04:00
George Bosilca
a3a492b42c
Small pedantic fixes.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:08:18 -04:00
George Bosilca
9164e26e2f
Provide the correct socklen to bind.
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.

Add a more clear error message in debug mode.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:06:40 -04:00
Ralph Castain
fcc1d30ab3
Merge pull request #5775 from karasevb/check_old_topo_key
pmix: check the old topo key to keep compatibility with old RMs
2018-10-02 08:12:42 -07:00
Jeff Squyres
5dae086f7e btl/tcp: output the IP address correctly
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.

1. On the endpoint, it is storing the moral equivalent of a
   (struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).

The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr).  Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).

This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.

NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module.  This only happens with interfaces that have
more than one IP address.  This commit does not fix that issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 16:12:57 -07:00
Jeff Squyres
293a938d29 string_copy: assert() fail if the copy is too long
Add a hueristic: if the string copy is "too long", fail an assert().
This is based on the premise that Open MPI doesn't do large string
copies.  So if we see a dest_len that is over a certain threshhold
(currently set at 128K), this is likely a programmer error, and on
debug builds, we should fail an assert().  In production builds, it
will work just fine (assuming that it's not a programmer error).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 13:34:15 -07:00
Ralph Castain
172e770154 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-28 12:43:32 -07:00
Ralph Castain
c836c4282c Update to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-27 20:57:23 -07:00
Jeff Squyres
28e51ad4d4 opal: convert from strncpy() -> opal_string_copy()
In many cases, this was a simple string replace.  In a few places, it
entailed:

1. Updating some comments and removing now-redundant foo[size-1]='\0'
   statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
   wants the entire destination buffer length).

This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 11:56:18 -07:00
Jeff Squyres
e1cf8757ae opal/util/string_copy: add "safe" string copy function
Do essentially the same thing as strncpy(3), but a) ensure to always
terminate the destination buffer with a \0, and b) do not \0-pad to
the right.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 10:48:04 -07:00
Jeff Squyres
38fefcf1cf opal/util/Makefile.am: fix whitespace (remove tabs)
No code or logic changes

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 09:43:03 -07:00
Jeff Squyres
6bb356ab87 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-26 12:15:21 -07:00
Boris Karasev
ed42f568ae pmix: check the old topo key to keep compatibility with old RMs
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-09-25 18:13:54 +03:00
Gilles Gouaillardet
db65dbd9a8 ucx: use the c99 __func__ macro instead
__FUNCTION__ macro was never standardized and should not be used.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-09-25 11:19:18 +09:00
Jeff Squyres
176da51aec mca_base_var: fix output bug about settable vars
Fix the test that determined whether we output "writeable" or
"read-only" for MCA vars (it was checking the wrong flag).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-22 07:18:45 -07:00
Nathan Hjelm
c6230657a2 opal: fix warning
Fixes #5713

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-09-21 00:51:40 -06:00
Brian Barrett
9344afd485 openib: Disable CUDA async by default
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References #3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-09-20 07:58:10 -07:00
Howard Pritchard
dccf780546
Merge pull request #5737 from hppritcha/topic/remove_scif_support
SCIF: remove it
2018-09-19 11:38:27 -06:00
Howard Pritchard
b9ac3d8931 SCIF: remove it
KNC is effectively dead.  Remove corresponding SCIF
support in Open MPI.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-09-19 10:39:52 -06:00
bosilca
b0e2975f86
Merge pull request #5701 from ICLDisco/export/errors_ib
Adding error handling in OpenIB BTL
2018-09-19 09:45:58 -04:00
bosilca
464e1abbab
Merge pull request #5700 from ICLDisco/export/tcp_errors
Handle error cases in TCP BTL
2018-09-19 09:44:38 -04:00
Michael Kuron
17b0f1fcc3 Deal with EOPNOTSUPP returned from getsockopt()
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.

Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
2018-09-16 14:55:28 +02:00
Jeff Squyres
3970b06134 misc: passing a bool to va_start() is undefined
According to clang on MacOS, passing a bool parameter -- which
undergoes default parameter promotion -- to va_start() results in
undefined behavior.  So just change these params to int and avoid the
issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-15 06:04:13 -07:00
Jeff Squyres
8f2620d3af misc: compiler warning fixes
A variety of small compiler warning fixes.  The 2 PMIx fixes are
already committed upstream.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-15 06:04:13 -07:00
Nathan Hjelm
6363889cd0
Merge pull request #5696 from hjelmn/vader_5375
btl/vader: ensure that the send tag is always written last
2018-09-14 12:33:35 -06:00
Nathan Hjelm
1071d72130
Merge pull request #5445 from hjelmn/asm_type
Update opal to use C11 atomics if available
2018-09-14 12:32:56 -06:00
Nathan Hjelm
fe6528b0d5 opal/atomic: always use C11 atomics if available
This commit disables the use of both the builtin and hand-written
atomics if proper C11 atomic support is detected. This is the first
step towards requiring the availability of C11 atomics for the C
compiler used to build Open MPI.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:51:05 -06:00
Nathan Hjelm
000f9eed4d opal: add types for atomic variables
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:48:55 -06:00
Nathan Hjelm
850fbff441 btl/vader: ensure that the send tag is always written last
To ensure fast box entries are complete when processed by the
receiving process the tag must be written last. This includes a zero
header for the next fast box entry (in some cases). This commit fixes
two instances where the tag was written too early. In one case, on
32-bit systems it is possible for the tag part of the header to be
written before the size. The second instance is an ordering issue. The
zero header was being written after the fastbox header.

Fixes #5375, #5638

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:35:35 -06:00
Ralph Castain
7facb3f3e9 Pickup and deploy network-specific envars
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 09:24:19 -07:00
Ralph Castain
c7939aeebc Pickup some minor debugging changes
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 09:11:28 -07:00
Ralph Castain
466cad6cb2 Update master to PMIx v4
Retain ext3x for PMIx 3  compatibility
Get the blasted permissions correct on config files

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 08:24:17 -07:00
Ralph Castain
fb67b1703f
Merge pull request #5689 from rhc54/topic/psm2
Silence warning of unused function
2018-09-12 16:40:23 -07:00
Ralph Castain
853dc96c5d Silence warning of unused function
Requires protection for HAVE___CLEAR_CACHE

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-12 15:03:59 -07:00
Jeff Squyres
78c1469f93
Merge pull request #5682 from selvintxavier/brcm_hcas
Add support for different Broadcom HCAs
2018-09-12 11:59:21 -04:00
Selvin Xavier
a53a6f7650 Add support for different Broadcom HCAs
Adds device ids of different Broadcom adapters from
BCM57XXX and BCM58XXX family of HCAs.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
2018-09-12 02:22:34 -07:00
Ralph Castain
83ba589084 Silence warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-11 18:01:12 -07:00
Nathan Hjelm
9a514c6794
Merge pull request #5647 from hjelmn/clear_cache
patcher/base: improve instruction cache flush for aarch64
2018-09-10 15:43:35 -06:00
Jeff Squyres
a1b879d176
Merge pull request #5651 from jsquyres/pr/dont-let-openib-yell-about-no-nics
btl/openib: don't complain about no NICs
2018-09-07 15:00:24 -04:00
Howard Pritchard
dc02e54320
Merge pull request #5516 from thananon/ofi_send
btl/ofi: Added 2 sided communication support.
2018-09-06 18:39:23 -06:00
Jeff Squyres
098ec55e37 btl/openib: don't complain about no NICs
Since openib is on its long, slow way out the door, don't let it
complain about not being able to find any NICs at run time.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-06 11:26:58 -07:00
Nathan Hjelm
1cdbceb095 patcher/base: improve instruction cache flush for aarch64
This commit updates the patcher component to either use the
__clear_cache intrinsic or the correct assembly to flush the
instruction cache.

Fixes #5631

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-06 09:47:53 -06:00
Nathan Hjelm
5b9a24f13b
Merge pull request #5648 from hjelmn/btl_uct_fix
btl/uct: ad missing opal_mem_hooks_unregister_release call
2018-09-05 13:59:22 -06:00
Nathan Hjelm
36c206d2d6 btl/uct: add missing opal_mem_hooks_unregister_release call
This commit fixes a bug when using the UCT btl with the UCX memory
hooks disabled. We were misssing a call to
opal_mem_hooks_unregister_release to remove the btl memory hook
callback.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-05 13:01:45 -06:00
Jeff Squyres
4173ac6dd0 mca/base: enforce max string lengths
Ensure that the project, framework, component, and variable names are
lower than max lengths.  This is a follow-on to
992a8e8297, per discussion on
https://github.com/open-mpi/ompi/pull/5642.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-05 08:42:00 -07:00
George Bosilca
992a8e8297
Use less memory for the MCA variable names.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-08-31 11:36:40 +02:00
Jeff Squyres
05e5f61fe1 common/verbs-usnic: check that it will actually compile
If someone specifies --with-verbs-usnic, actually do a configury check
to ensure that it will compile (vs. assuming that it will compile if
someone asks for it).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-30 13:10:49 -07:00
Gilles Gouaillardet
ad2c207a7e mpool/memkind: plug a memory leak in mca_mpool_memkind_close()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:17 +09:00
Gilles Gouaillardet
7556dd0abb opal/util: plug a memory leak in the opal_infosubscriber_t destructor
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:17 +09:00
Gilles Gouaillardet
aeddd2f249 pmix/pmix3x: plug a memory leak in external_register()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:16 +09:00
Gilles Gouaillardet
6e47c5708e pmix/base: plug a memory leak in opal_pmix_base_select()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:16 +09:00
Yossi Itigin
68206a5635
Merge pull request #5569 from hoopoepg/topic/optimize-blocked-calls
PML/UCX: blocked calls optimizations
2018-08-29 14:19:09 +03:00
Yossi Itigin
4bb6845888
Merge pull request #5570 from hoopoepg/topic/opal-mem-hooks-syno
MCA/COMMON/UCX: added synonym to opal_mem_hook variable
2018-08-29 14:16:33 +03:00
Artem Polyakov
00d12c058f
Merge pull request #5596 from karasevb/fix_hwloc_numa_obj
Fixed the NUMA obj detection for hwloc ver >= 2.0.0
2018-08-27 22:15:41 -07:00
Sergey Oblomov
2cd9e04166 PML/UCX: optimization of mprobe call - renamed vars
- renamed of internal variable names
- used unsigned datatypes

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:39 +03:00
Sergey Oblomov
38e908f83e PML/UCX: optimization of mprobe call
- refactoring of opal/UCX progress calls

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:38 +03:00
Sergey Oblomov
b0f87f2235 PML/UCX: blocked calls optimizations
- added UCX progress priority

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:38 +03:00
Boris Karasev
beb0697f24 Fixed copyrights of prev commit.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-27 09:50:11 +03:00
Gilles Gouaillardet
1665b8db8f
Merge pull request #5519 from extrowerk/haiku_patches
fcntl include bugfix
2018-08-27 09:46:43 +09:00
Sergey Oblomov
b72dd83f05 MCA/COMMON/UCX: added synonims for common ucx variables
- added synonims for atomic/osc modules

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-26 18:25:21 +03:00
Boris Karasev
e5291ccc34 Fixed the NUMA obj detection for hwloc ver >= 2.0.0
Since version hwloc 2.0.0 has a new organization of NUMA nodes on the
topology tree. This commit adds the detection of local NUMA object for
hwloc => 2.0.0, which fixes the procs bindings policy for rmaps mindist
component.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-24 19:11:52 +03:00
Jeff Squyres
fe0852bcb4 Miscellaneous compiler warning stomps.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-24 07:39:14 -07:00
Nathan Hjelm
0a9c358ef8
Merge pull request #5577 from hjelmn/btl_vader_sc_emu_warnings
btl/vader: clean up debuging and squash warning
2018-08-23 09:41:13 -06:00
Nathan Hjelm
c74cf666a9 btl/vader: clean up debuging and squash warning
References #5512

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-22 10:57:32 -06:00
Nathan Hjelm
a99dba6cb7
Merge pull request #5553 from markalle/info_snprintf2
snprintf() length fix for info
2018-08-22 10:27:14 -06:00
Sergey Oblomov
6a7f66d9c2 MCA/COMMON/UCX: renamed synonim to opal_mem_hook variable
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-22 14:12:33 +03:00
Sergey Oblomov
e00f7a68ba MCA/COMMON/UCX: added synonim to opal_mem_hook variable
- added synonim to opal_mem_hook variable to allow
  to print it in opal_info -a

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-21 15:05:12 +03:00
Ralph Castain
5cfa2a7fca Complete integration of job_control
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-08-20 16:10:50 -07:00
Ralph Castain
9948084130 Update to PMIx v3.0.1
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-08-20 15:05:24 -07:00
Aurélien Bouteiller
e46c907468
Adding error handling in OpenIB BTL
bugfix: major: openib send credits returned correctly after a fault for pending frags to dead processes; also tweak the default IB retry timeouts tomake this happen faster

Make it compile in non-debug builds

Mark the IB endpoint as failed when invoking an error; this resolves UDCM connection deadlocks

Changing the default IB retry timeouts is not a good idea.
We'll need to find another way to speedup credit recovery in failure cases.

Remove ULFM specific cases

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-08-16 16:53:17 -04:00
Mark Allen
ee6eb9b12b snprintf() length fix for info
The important part of this fix is a couple places 5 was hard-coded that needed to be
strlen(OPAL_INFO_SAVE_PREFIX).

But also this contains a fix for a gcc 7.3.0 compiler warning about snprintf(). There
was an "if" statement making sure all the arguments had appropriate strlen(), but gcc
still complained about the following snprintf() because the size of the struct element
is iterator->ie_key[OPAL_MAX_INFO_KEY + 1].

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2018-08-16 15:32:04 -04:00
Aurelien Bouteiller
6acebc40a1
Handle error cases in TCP BTL
When an error is returned by the socket operations, trigger the
appropriate error path in the PML to give an opportunity for
rerouting/error handling.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-08-14 15:35:24 -04:00
Nathan Hjelm
dca3516765 btl/vader: move memory barrier to where it belongs
The write memory barrier was intended to precede setting a fast-box
header but instead follows it. This commit moves the memory barrier to
the intended location.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-13 10:14:34 -06:00
Nathan Hjelm
e0f73866ef
Merge pull request #5525 from hjelmn/event_threading
opal/progress: protect against multiple threads in event base
2018-08-13 09:21:44 -06:00
Jeff Squyres
01e4570af7 hwloc201/configure.m4: make it safe when used with hwloc:external
The Autoconf AC_CONFIG_* macros can only be instantiated exacly once
for any given file, *and* they must be in a code execution path at run
time for the target file to be generated at the end of configure.

For example, if you want to generate file ABC at the end of configure,
you must invoke the AC_CONFIG_FILES(ABC) macro in a code path that
will get executed when configure is run.

That's pretty straightforward.

What's not straightforward is two corner cases:

1. You cannot invoke the AC_CONFIG_FILES(ABC) macro for the same file
   more than once.  If you do, autoreconf will fail (even before you
   can run configure).
2. If AC_CONFIG_FILES(ABC) is not in a code path that is executed by
   configure, the file ABC is not registered properly, and ABC will
   not be generated at the end of configure.

This applies to hwloc because hwloc's HWLOC_SETUP_CORE macro calls
both AC_CONFIG_FILES and AC_CONFIG_HEADER to setup its Makefiles
(etc.) so that targets like "make distclean" and "make distcheck" will
work properly.  Hence, we *have* to invoke HWLOC_SETUP_CORE.

However, the MCA_opal_hwloc_hwloc201_CONFIG macro has a few side
effects.  It would be nice to do able to do something like this:

```
    if hwloc:extern is going to be used:
        Invoke minimal HWLOC_SETUP_CORE (with no side effects)
    else
        Invoke full HWLOC_SETUP_CORE (with side effects)
    fi
```

But we can't, because autoreconf will detect that AC_CONFIG_FILES has
been invoked on the same files more than once (regardless of whether
those code paths will be executed at run time or not).  Kaboom.

Similarly, we can't do this:

```
    if hwloc:extern is not going to be used:
        Invoke full HWLOC_SETUP_CORE (with side effects)
    fi
```

Because then hwloc's AC_CONFIG_FILES won't be registered properly when
hwloc:external *is* used (i.e., when the HWLOC_SETUP_CORE macro is not
in a code path that is executed at run time), and targets like "make
distclean" will fail because hwloc's Makefiles won't have been setup.
Kaboom.

But remember that the hwloc framework is a bit special: there will
only ever be 2 comoponents: external and internal.  External is
guaranteed to be configured first because of its priority.  So the
internal component (i.e., this component) immediately knows if it is
going to be used or not based on whether the external component
configuration succeeded or failed.

Specifically: regardless of whether the internal component (i.e., this
component) is going to be used, we have to invoke HWLOC_SETUP_CORE.
But we can manage the side effects: allow the side effects when
this/internal component is going to be used, and avoid the side
effects when this/internal component is not going to be used.

This is a little less clean than I would have liked, but because of
Autoconf's oddity about its AC_CONFIG_* macros, this is the only
solution I could come up with.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-11 11:05:23 -07:00
Jeff Squyres
69aa46e167 libevent2022/configure.m4: always invoke sub-configure
In order to make "make distclean" (and friends) work, we need to
*always* invoke the embedded configure script -- even if we know that
we're not going to use this component.

But in cases where we know we're not going to use this component, we
also need to avoid the side effects of the code path that is used when
we *do* want to use this component.  So split the two possibilities
into two different macros:

1. MCA_opal_event_libevent2022_FAKE_CONFIG: which does almost nothing
   except invoke the underlying "configure" script.
2. MCA_opal_event_libevent2022_REAL_CONFIG: which does all the real
   work (including invoking the underlying "configure" script).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-11 06:03:46 -07:00
Jeff Squyres
80df3f040b libevent2022/configure.m4: trivial cleanup
Put argument to AM_CONDITIONAL inside [].  No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-10 10:30:45 -07:00
Jeff Squyres
17aa64e438 libevent2022/configure.m4: minor comment cleanup
Change # -> dnl.  No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-10 10:30:04 -07:00
Jeff Squyres
b063cb6b0f libevent2022: only configure if event:external fails
We know that event:external will be configured first (because of its
priority).  Take advantage of that here in libevent2022 by having it
refuse to configure / politely fail if event:external succeeded.

Also print out some additional lines in configure output indicating
what is going on (i.e., event:external succeeded, so this component
will be skipped, or event:external failed, so this component will be
used).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-08 10:22:38 -07:00
Jeff Squyres
4e5f432786 hwloc201: only configure if hwloc:external fails
We know that hwloc:external will be configured first (because of its
priority).  Take advantage of that here in hwloc201 by having it
refuse to configure / politely fail if hwloc:external succeeded.

Also print out some additional lines in configure output indicating
what is going on (i.e., hwloc:external succeeded, so this component
will be skipped, or hwloc:external failed, so this component will be
used).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-08 10:22:38 -07:00
Nathan Hjelm
bdbb853461 opal/progress: protect against multiple threads in event base
libevent does not support multiple threads calling the event loop on
the same event base. This causes external libevent's to print out
re-entrant warning messages. This commit fixes the issue by protecting
the call to the event loop with an atomic swap check.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-07 16:13:43 -06:00
Nathan Hjelm
eeae3f9b93
Merge pull request #5517 from bosilca/topic/treematch_warnings
Remove few warnings identified by @rhc in #5514.
2018-08-06 13:25:07 -06:00
Zoltán Mizsei
ac3f8a16ed fcntl include bugfix
Signed-off-by: Zoltán Mizsei <zmizsei@extrowerk.com>
2018-08-06 19:45:59 +02:00
Boris Karasev
57683366ca pmix: added check for pmix fence status
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-06 15:01:57 +06:00
George Bosilca
6d11a45f44
Remove few warnings identified by @rhc in #5514.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-08-03 16:21:06 -04:00
Thananon Patinyasakdikul
080115d440 btl/ofi: Added 2 side communication support.
The 2 sided communication support is added for non-tagmatching provider
to take advantage of this BTL and PML OB1. The current state is
"functional" and not optimized for performance.

Two sided support is disabled by default and can be turned on by mca
parameter: "mca_btl_ofi_mode".

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-08-03 12:30:03 -07:00
Ralph Castain
f7a537cf04 Fix typo
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-25 20:05:33 -07:00
Ralph Castain
bcdb1f45ac Fix the multiple pe/proc option
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.

Thanks to @tonyreina for the error report

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-25 18:47:39 -07:00
Ralph Castain
55cefedf9b Cleanup pmix selection check
Allow for versions > 3

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-25 16:11:32 -07:00
Nathan Hjelm
47ed8e8830 btl/uct: fix compile warnings/errors
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-23 14:04:38 -06:00
Ralph Castain
5cab823979 Leave opal_event_external_support exposed as global var
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-23 09:18:48 -07:00
Jeff Squyres
a70ecf5267 event/external: prefer external event component
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-07-23 09:20:27 +09:00
Jeff Squyres
83e4a45a9f event: trivial comment change
Switch from #-style to dnl-style.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-07-23 09:20:27 +09:00
Gilles Gouaillardet
ce2c9fffd4 hwloc: prefer external hwloc component
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-07-23 09:20:27 +09:00
Ralph Castain
8a6dc0b211
Merge pull request #5456 from rhc54/topic/px
Default to internal PMIx if newer than external
2018-07-19 11:53:47 -07:00
Ralph Castain
1e6aaf7f22 Default to internal PMIx if newer than external
Per https://github.com/open-mpi/ompi/issues/5031, if the user didn't specify a particular PMIx installation, then default back to the internal version if it is newer than the discovered external one. PMIx doesn't yet provide a full signature so we have to just get as close as possible for now.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-19 11:02:03 -07:00
Yossi Itigin
29812494f2
Merge pull request #5402 from hoopoepg/topic/common-del-procs
MCA/COMMON/UCX: del_procs calls are unified to common module
2018-07-19 11:19:45 +03:00
Sergey Oblomov
920cc2e0d9 MCA/COMMON/UCX: del_procs calls are unified to common module
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-18 07:37:25 +03:00
Howard Pritchard
4447738098
Merge pull request #5414 from hppritcha/topic/iwarp_only_by_default
btl/openib: only look for iwarp/roce by default
2018-07-17 20:08:00 -06:00
Howard Pritchard
bc8134dae1
Merge pull request #5448 from thananon/ofi_context
btl/ofi: Added FI_CONTEXT as requirement.
2018-07-17 19:51:13 -06:00
Howard Pritchard
6818272392 btl/openib: only look for iwarp/roce by default
Due to decreasing support by vendors/other orgs for the OpenIB BTL,
only look for iWarp/RoCE devices by default.  Allow IB HCAs
with ports configured for ethernet.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-07-17 19:11:37 -06:00
Thananon Patinyasakdikul
033c364ee0 btl/ofi: Added FI_CONTEXT as requirement.
OFI BTL uses context for completion but never ask for it in
fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT
to eliminate any potential error.

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-07-17 12:18:43 -07:00
Sergey Oblomov
a4b8253fa2 MCA/COMMON/UCX: fixed initialization of malloc hooks
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-17 20:09:50 +03:00
Sergey Oblomov
1c7ae22dfb MCA/COMMON/UCX: shift opal memhooks into common UCX
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-17 13:46:38 +03:00
Ralph Castain
4a596d35f7 Remove the PMIx ext4x component
Update configury to redirect anything at or above v3 to the ext3x component

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-13 19:51:50 -07:00
Nathan Hjelm
9d3a79925b btl/vader: fix bugs in rma emulation
This commit fixes two bugs in the RMA/atomic emulation code:

 1) Fix a fragment leak when using AMO emulation.

 2) Always initialize the single-copy emulation code. This is required
 to use the AMO support.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-12 15:50:50 -06:00
Ralph Castain
fdca304268 Default to external PMIx installation
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-10 16:12:52 -07:00
Nathan Hjelm
8b090103e2 opal/fifo: fix 128-bit atomic fifo on Power9
This commit updates the atomic fifo code to fix a consistency issue
observed on Power9 systems when builtin atomics are used. The cause
was two things: 1) a missing write memory barrier in fifo push, and 2)
a read ordering issue when reading the fifo head non-atomically. This
commit fixes both issues and appears to correct then inconsistency.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-10 15:37:11 -06:00
KAWASHIMA Takahiro
3e179ba95f hwloc/external: Suppress missing-include-dirs warning
If OMPI is configured with `--with-hwloc=external` or `--with-hwloc=DIR`
and gfortran is used, I see a lot of warnings when compiling files
under the `ompi/mpi/fortran` directory.

```
f951: Warning: Nonexistent include directory
'BUILD_DIR/opal/mca/hwloc/external/hwloc/include' [-Wmissing-include-dirs]
```

There is no such `include` directory in the source tree and `configure`-
created tree. I think these lines in the `configure.m4` file are wrongly
copied from that for the embedded `hwlocXXX` component in the past.

The `-Wmissing-include-dirs` option is enabled in gfortran by default
but it is not enabled by default (or even with `-Wall`) in gcc and g++.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2018-07-09 10:55:33 +09:00
Yossi Itigin
e77e31b50b
Merge pull request #5378 from hoopoepg/topic/unify-ucx-logging
MCA/COMMON/UCX: unified logging across all UCX modules
2018-07-08 12:45:26 +03:00
Ralph Castain
17c4cf0db8 Install PMIx v3.0.0 release
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-06 06:38:02 -07:00
Sergey Oblomov
bef47b792c MCA/COMMON/UCX: unified logging across all UCX modules
- added common logging infrastructure for all
  UCX modules
- all UCX modules are switched to new infra

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-05 16:25:39 +03:00
Sergey Oblomov
8080283b3d MCA/COMMON/UCX: changed return type for wait_request
- for now wait_request returns OMPI status
- updated callers

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-04 23:29:38 +03:00
Sergey Oblomov
f574c14e3a ATOMICS/UCX: redefine atomic module API
- now it accepts integer values directily instead of
  pointers

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-04 14:41:45 +03:00
Yossi Itigin
4962651567
Merge pull request #5366 from hoopoepg/topic/mca-common-ucx-unify-2
MCA/COMMON/UCX: minor unification of del_proces calls
2018-07-04 14:38:37 +03:00
Nathan Hjelm
bd5cd62df9 btl/ugni: fix up some warnings
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-03 16:30:44 -06:00
Nathan Hjelm
d8916a4672 btl/ugni: fix race condition in completing frags
The descriptor flags field in a fragment were being ready after the
fragment may have been freed. This commit reads the flags before
calling the user callback.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-03 10:48:54 -06:00
Nathan Hjelm
87d41da62b btl/vader: add support for atomics and emulated rdma
This commit adds support for atomic operations as well as rdma for
systems without rdma support. This support is implemented using an
internal send tag.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-02 13:57:11 -06:00
Nathan T. Weeks
08f9ae97ee btl/ugni: update BTL_VERBOSE argument list
Signed-off-by: Nathan T. Weeks <weeks@iastate.edu>
2018-07-02 09:23:30 -06:00
Sergey Oblomov
c2bd6af9f2 MCA/COMMON/UCX: minor unification of del_proces calls
- some common functionality of del_procs calls is moved into
  mca_common module
- blocking ucp_put call is replaced by non-blocking routine

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-02 15:10:53 +03:00
Jeff Squyres
7b0dd03e92 tcp/btl: fix a cast
The current cast is *functional*, but isn't really the way it should
be done.  This commit makes the cast the way it should be done.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-29 07:25:46 -07:00
Jeff Squyres
57bc657e7f btl/tcp: fix hash map usage
Fix two facepalms:

1. The "uint32" in the hash map functions refer to the *key* size, not
   the *value* size.  The values are always 64 bits.
2. Pass the straight value to the "set" functions -- not the pointer
   to the value.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-28 15:29:41 -07:00
Yossi Itigin
3a7271ef4e
Merge pull request #5344 from hoopoepg/topic/mca-common-ucx-fixed-build
MCA/COMMON/UCX: fixed build scripts
2018-06-28 15:14:04 +03:00