Commit graph

5442 commits

Author SHA1 Message Date
bosilca
f96305f96b
Merge pull request #6199 from aravindksg/opal-threads-fix
opal/sync: Fix assert during multi-threaded progress invocation
2018-12-20 13:27:07 -05:00
Nathan Hjelm
cf49957af6
Merge pull request #6207 from hjelmn/opal_cleanup
opal: fix common symbols introduced in opal cleanup
2018-12-19 15:39:39 -07:00
Jeff Squyres
9b88e60fc8 info_get: ensure to copy all requested characters
When querying an info value, copy out exactly as many characters as
the caller asked for -- do not artificially truncate the target just
to ensure that it is \0-terminated.

Specifically: do not use opal_string_copy() to copy info values,
because opal_string_copy() will guarantee to \0-terminate the target,
even if it means truncating the target.  E.g., if the caller calls
opal_info_get_nolock() with valuelen=5, opal_string_copy() will return
"1234\0" -- which is wrong.  This commit fixes the behavior to return
"12345".

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-12-19 13:01:45 -08:00
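The behavior change described in 9b88e60fc8 can be sketched as below. This is only an illustration: `copy_terminated` and `copy_exact` are hypothetical helpers standing in for the two copy policies, not Open MPI functions, and the real `opal_info_get_nolock()` code may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: always \0-terminates, so at most dest_len - 1
 * characters of the value survive. */
void copy_terminated(char *dest, const char *src, size_t dest_len)
{
    assert(dest != NULL && src != NULL && dest_len > 0);
    size_t n = strlen(src);
    if (n > dest_len - 1) {
        n = dest_len - 1;
    }
    memcpy(dest, src, n);
    dest[n] = '\0';
}

/* Hypothetical helper: copies exactly valuelen characters when the value
 * is at least that long, and does not sacrifice one of them for a \0. */
void copy_exact(char *dest, const char *src, size_t valuelen)
{
    assert(dest != NULL && src != NULL);
    size_t n = strlen(src);
    if (n >= valuelen) {
        memcpy(dest, src, valuelen);   /* caller asked for valuelen chars */
    } else {
        memcpy(dest, src, n);
        dest[n] = '\0';                /* short value: room for the \0 */
    }
}
```

With a stored value of "1234567" and valuelen=5, the first helper yields "1234\0" and the second yields "12345", which matches the example in the commit message.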
Nathan Hjelm
fd852c8f63 opal: fix common symbols introduced in opal cleanup
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-19 12:27:14 -07:00
Nathan Hjelm
4944508603
Merge pull request #6136 from hjelmn/opal_cleanup
opal: clean up init/finalize
2018-12-18 15:23:32 -07:00
Nathan Hjelm
0edfd328f8 opal: clean up init/finalize
This commit contains the following changes:

 - Remove the unused opal_test_init/opal_test_finalize
   functions. These functions are not used by anything in the code
   base or MTT. Tests use opal_init_util/opal_finalize_util instead.

 - Get rid of gotos in opal_init_util and opal_init. Replaced them
   with a cleaner solution.

 - Automatically register cleanup functions in init functions. The
   cleanup functions are executed in the reverse order of the
   initialization functions. The cleanup functions are run in
   opal_finalize_util() before tearing down the class system.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-18 14:37:04 -07:00
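The third bullet of 0edfd328f8 (cleanup functions registered during init and run in reverse order) can be illustrated with a small stack of callbacks. All names here (`register_cleanup`, `run_cleanups`, `init_all`, the `fini_*` functions) are hypothetical; this is a sketch of the idea, not the OPAL implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLEANUPS 32

typedef void (*cleanup_fn_t)(void);

static cleanup_fn_t cleanup_stack[MAX_CLEANUPS];
static size_t cleanup_count = 0;

/* Called by each init step once it has succeeded. */
void register_cleanup(cleanup_fn_t fn)
{
    assert(fn != NULL && cleanup_count < MAX_CLEANUPS);
    cleanup_stack[cleanup_count++] = fn;
}

/* Called from finalize: runs callbacks last-registered-first. */
void run_cleanups(void)
{
    while (cleanup_count > 0) {
        cleanup_stack[--cleanup_count]();
    }
}

/* Two toy subsystems that record the order in which they are torn down. */
static char trace[8];
static size_t trace_len = 0;

static void fini_a(void) { trace[trace_len++] = 'a'; }
static void fini_b(void) { trace[trace_len++] = 'b'; }

int init_all(void)
{
    /* ... initialize subsystem a ... */
    register_cleanup(fini_a);
    /* ... initialize subsystem b ... */
    register_cleanup(fini_b);
    return 0;
}
```

Because each init step registers its own teardown only after it succeeds, a failure partway through can simply call `run_cleanups()`, which is one way to avoid the goto-based unwinding the commit removes.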
Gilles Gouaillardet
b0a668457c pmix/pmi4x: refresh to latest PMIx
refresh to pmix/pmix@2d4c2874fd

Refs. open-mpi/ompi#6222

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-18 17:11:01 +09:00
Ralph Castain
ece7696c2c Fix external PMIx v3 support
Don't erase the source files during "make clean"!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-12-17 15:54:56 -08:00
Aravind Gopalakrishnan
2f3c5218ff opal/sync: Fix assert during multi-threaded progress invocation
PR #5241 provided an MCA variable to allow multi-threaded opal_progress.
However, it allowed the linked list to be updated even when multiple threads
were allowed to call opal_progress. This caused a scenario in which a more
recent thread could complete its progress and fail the assert(sync ==
wait_sync_list).

Updating the linked list only when the number of threads
exceeds the threshold fixes the problem.

Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
2018-12-17 12:14:27 -08:00
Gilles Gouaillardet
b89deeb1bb btl/uct: fix a typo in configure.m4
remove whitespace around '=' when setting btl_uct_LIBS

Thanks Ake Sandgren for reporting this

Refs. open-mpi/ompi#6173

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-11 14:23:53 +09:00
Gilles Gouaillardet
78aa6fdd1d btl/uct: fix a warning
Use the PRIsize_t macro to correctly print a size_t

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-07 16:16:35 +09:00
George Bosilca
88a693bf71 Add a test for very large data.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-12-06 13:30:58 +09:00
Gilles Gouaillardet
fbb5bb8860 opal/datatype: correctly handle large datatypes
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi/ompi#6016

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-06 13:30:58 +09:00
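The kind of truncation that fbb5bb8860 addresses can be shown with a byte-count computation. The function names are hypothetical and the sketch assumes a platform where `size_t` is 64 bits wide; it is not taken from the opal/datatype code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Problem pattern: the byte count is narrowed to 32 bits and wraps. */
uint32_t total_bytes_u32(size_t count, size_t elem_size)
{
    return (uint32_t)(count * elem_size);
}

/* Fixed pattern: keep the whole computation in size_t. */
size_t total_bytes(size_t count, size_t elem_size)
{
    /* Guard against overflow of size_t itself. */
    assert(elem_size == 0 || count <= SIZE_MAX / elem_size);
    return count * elem_size;
}
```

For 2^30 elements of 8 bytes each, the 32-bit variant wraps to 0 while the `size_t` variant returns the full 2^33 bytes.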
Nathan Hjelm
e07a64c52d btl/uct: fix some issues when using UCX over ugni
Though not a recommended configuration it is possible to use Open MPI
over UCX over uGNI. This configuration had some issues related to the
connection management and tl selection. This commit fixes those
issues.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-05 16:30:54 -07:00
Gilles Gouaillardet
fccb3e7514
Merge pull request #6137 from ggouaillardet/topic/ucx_warning
btl/openib: delay UCX warning to add_procs()
2018-12-05 11:50:10 +09:00
Yossi Itigin
83cca9d52a ucx: add owner.txt for components
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-12-01 17:14:03 +02:00
Gilles Gouaillardet
0a2ce58040 btl/openib: delay UCX warning to add_procs()
If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting InfiniBand.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-11-29 16:20:55 +09:00
Ralph Castain
f609542bbf Implement process set name support
Add the --pset option for app_contexts so the user can provide a string
name for each app_context. Use the new PMIx pset attribute to store the
names in the PMIx local storage for retrieval

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-27 11:07:58 -08:00
Josh Hursey
3339e2aa33
Merge pull request #5986 from jjhursey/vpid-unpack
Add OPAL_VPID to unpacking
2018-11-21 11:42:02 -06:00
KAWASHIMA Takahiro
cacd6f389c datatype: Remove #if HAVE_[TYPE] for C99 types
Open MPI now requires a C99 compiler. Checking the availability of
the following types is no longer needed.

- `long long` (`signed` and `unsigned`)
- `long double`
- `float _Complex`
- `double _Complex`
- `long double _Complex`

Furthermore, the `#if HAVE_[TYPE]` style checking is not correct.
Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`.
`AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h`
if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]`
(instead of defining as `0`) if it is not available. So even if we
need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`.

I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac`
since someone may use `HAVE_[TYPE]` macros somewhere.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2018-11-14 09:32:52 +09:00
Thananon Patinyasakdikul
3dc1629771
Merge pull request #5241 from thananon/opal_progress
Add MCA param for multithread opal_progress().
2018-11-09 12:30:07 -05:00
Gilles Gouaillardet
efe72d3d92
Merge pull request #6055 from ggouaillardet/topic/c11_atomics
Misc C11 atomics related fixes
2018-11-08 09:11:42 +09:00
Gilles Gouaillardet
07970f192a atomic/c11: fix include header path
simply #include "opal_stdint.h" in order to work with --devel-headers

Refs open-mpi/ompi#6053

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-11-07 13:23:28 +09:00
Thananon Patinyasakdikul
d9bd54c628 btl/ofi: fixed compiler warning on OSX.
This commit closes #6049

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
2018-11-06 15:37:25 -05:00
Nathan Hjelm
30119ee339 opal/asm: work around possible gcc compiler bug
It seems in some cases (gcc older than v6.0.0) the __atomic_thread_fence is a
no-op with __ATOMIC_ACQUIRE. This appears to be the case with X86_64 so go
ahead and use __ATOMIC_SEQ_CST for the x86_64 read memory barrier. This should
not cause any performance issues as it is equivalent to the memory barrier
in the hand-written atomics.

References #6014

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-11-05 11:45:42 -07:00
Thananon Patinyasakdikul
4e23fedccb opal progress: Added MCA for multithreaded opal_progress.
This commit added MCA param `opal_max_thread_in_progress` to set the
number of threads allowed to do opal_progress concurrently. The default
value is 1.

Components with a multithreaded design can benefit from this change to
parallelize their component progress function.

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
2018-11-01 13:34:20 -04:00
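One way to picture the limit introduced in 4e23fedccb is an atomic counter that admits at most N threads into the progress loop. This is a hedged sketch using C11 atomics: `progress_enter`, `progress_leave`, and both variables are hypothetical names, and the actual opal_progress() logic is not shown in this log.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the value of the MCA parameter; 1 mirrors the
 * default stated in the commit message. */
static int max_threads_in_progress = 1;
static atomic_int threads_in_progress = 0;

/* Returns true if the calling thread may run the progress loop. */
bool progress_enter(void)
{
    int prev = atomic_fetch_add(&threads_in_progress, 1);
    if (prev >= max_threads_in_progress) {
        /* Too many threads already inside: undo and back off. */
        atomic_fetch_sub(&threads_in_progress, 1);
        return false;
    }
    return true;
}

/* Must be called once for every successful progress_enter(). */
void progress_leave(void)
{
    int prev = atomic_fetch_sub(&threads_in_progress, 1);
    assert(prev > 0);
    (void)prev;
}
```

A thread that gets `false` would typically skip progressing and return to waiting, so raising the limit lets more threads drive component progress functions at once.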
Howard Pritchard
8126779a35 btl/openib: fix a problem with ib query
Under certain circumstances, ibv_exp_query_device was
returning an error due to uninitialized fields in the
extended attributes struct.

Fixes: #5810
Fixes: #5914

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-10-31 14:50:15 -06:00
Brian Barrett
a1e85b03aa btl tcp: Fix compile error in IPv6
In 457f058 I broke the TCP BTL with --enable-ipv6.  This patch
fixes the compile error, so IPv6 works again.

Fixed #5996

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-30 12:31:04 -07:00
Joshua Hursey
a557c4130c Add OPAL_VPID to unpacking
* Needed to properly read PMIx job data like the following
   - `OPAL_PMIX_LOCALLDR`
   - `OPAL_PMIX_RANK`
   - `OPAL_PMIX_GLOBAL_RANK`
   - `OPAL_PMIX_APPLDR`
   - `OPAL_PMIX_APP_RANK`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2018-10-29 14:31:47 -05:00
Gilles Gouaillardet
b205039205 event/external: fix version requirement
Only default to the external component if its version is
greater than or equal to that of the internal libevent (2.0.22)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-29 10:12:56 +09:00
Gilles Gouaillardet
35e77a286c event/external: misc configury fixes
 - Always use the external component when configure'd with --with-libevent=external
 - Fix the external libevent library version detection
   by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros
 - Use the event2/event.h header (event.h is deprecated since libevent 2.0)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-29 10:12:56 +09:00
Aurelien Bouteiller
f2e6d7891e
Merge pull request #5975 from ICLDisco/export/pmix-fini-threadinterlock
Avoid a double lock interlock when calling pmix_finalize
2018-10-26 10:01:23 -04:00
Gilles Gouaillardet
b715dd2657 btl/uct: fix AC_CHECK_DECLS usage
AC_CHECK_DECLS takes a comma-separated list of macros/symbols,
so replace the whitespace separator with a comma.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-26 15:36:02 +09:00
Aurelien Bouteiller
50cf707f3f
Avoid a double lock interlock when calling pmix_finalize
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-10-25 16:36:58 -04:00
Nathan Hjelm
7b34c9b3fe
Merge pull request #5958 from hjelmn/btl_uct_sync
btl/uct: update for UCT_CB_FLAG_SYNC removal
2018-10-22 23:17:27 -06:00
Yossi Itigin
4c442f2601
Merge pull request #5934 from hoopoepg/topic/suppressed-cov-warn-added-log-msg
COMMON/UCX: suppressed coverity warnings
2018-10-22 11:00:47 +03:00
Nathan Hjelm
1b37328ba8 btl/uct: update for UCT_CB_FLAG_SYNC removal
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-21 18:57:42 -06:00
bosilca
690024917d
Merge pull request #5938 from bwbarrett/bugfix/tcp-multi-address
Always use same IP address for module and modex
2018-10-21 09:19:43 -07:00
Sergey Oblomov
1099d5f023 COMMON/UCX: added error code to log output
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-21 11:37:25 +03:00
George Bosilca
dc972f0b92
Fix the PML monitoring.
The monitoring PML hides its existence from the OMPI infrastructure by
removing itself from the list of PML loaded components, remaining hidden
until MPI_Finalize.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-18 00:29:23 -04:00
Brian Barrett
457f058e73 btl tcp: Simplify modex address selection
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code.  Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure.  This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.

Refs #5818

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-18 02:23:21 +00:00
Howard Pritchard
7730db9982
Merge pull request #5937 from hppritcha/topic/remove_crs_components
remove some dead crs components
2018-10-17 16:06:36 -06:00
Jeff Squyres
c2608fb597
Merge pull request #5939 from jsquyres/pr/coverity-cid-1440332
opal/os_path: fix minor string overrun
2018-10-17 14:24:42 -07:00
Howard Pritchard
6564d3d217 remove some dead crs components
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-10-17 10:29:00 -06:00
Brian Barrett
4f19221af2 btl tcp: Simplify module address storage
Today, a btl tcp module is associated with exactly one IP
address (IPv4 or IPv6).  There's no need to reserve space
for both an IPv4 and IPv6 address in the module structure,
since the module will only be associated with one or the
other.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-17 16:21:17 +00:00
Jeff Squyres
290d1c0534 opal/os_path: fix minor string overrun
This is a follow on to 54ca3310ea to fix
a minor bug identified by Coverity.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-17 08:57:30 -07:00
Sergey Oblomov
df765595e3 COMMON/UCX: suppressed coverity warnings
- suppressed coverity warnings
- added log messages on failed calls

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-17 16:11:03 +03:00
Brian Barrett
2acc4b7e7f btl tcp: Add workaround for "dropped connection" issue
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads.  The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.

This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT.  It also
reduces the scalability of the TCP BTL by increasing start-up
time, but better than hanging.

The long term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race.  This patch is a work around until that patch can
be developed.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-16 18:33:30 -07:00
Nathan Hjelm
37a3f32e47
Merge pull request #5913 from hjelmn/uct_btl_fixes
btl/uct: fix deadlock in connection code
2018-10-16 19:11:54 -06:00
Nathan Hjelm
707d35deeb btl/uct: fix deadlock in connection code
This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.

At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.

This commit also fixes some bugs in the active-message send path:

 - Correctly set all fragment fields in prepare_src.

 - Fix bug when using buffered-send. We were not reading the return
   code correctly (which is in bytes). This resulted in a message
   getting sent multiple times.

 - Don't try to progress sends from the btl_send function when in an
   active-message callback. It could lead to deep recursion and an
   eventual crash if we get a trace like
   send->progress->am_complete->ob1_callback->send->am_complete...

Closes #5820
Closes #5821

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 18:28:47 -06:00
Nathan Hjelm
1c001319d4
Merge pull request #5922 from hjelmn/opal_free_list_really_old_race_we_should_really_fix_now
opal/free_list: fix race condition
2018-10-16 15:27:05 -06:00
Brian Barrett
da1189d771
Merge pull request #5916 from bwbarrett/revert/6acebc4
Revert "Handle error cases in TCP BTL"
2018-10-16 13:54:18 -07:00
Nathan Hjelm
5c770a7bec opal/free_list: fix race condition
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:

```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```

The problem is that in a multithreaded environment it *is* possible for the
free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. This happens if
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic other threads pop all the items
added to the free list.

This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.

Fixes #2921

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 13:17:09 -06:00
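The fix described in 5c770a7bec (the thread that grows the list always gets an item) can be sketched with a toy free list. Everything here is an assumption for illustration: the types and the names `grow_and_reserve` and `free_list_get` are hypothetical, and a single mutex replaces Open MPI's atomic LIFO, so this shows the "reserve one item for the grower" idea rather than the real code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct item {
    struct item *next;
} item_t;

typedef struct {
    item_t *head;            /* stack of free items */
    pthread_mutex_t lock;    /* protects head in this toy version */
    size_t num_per_alloc;    /* items allocated by one grow step */
} free_list_t;

/* Allocate up to num_per_alloc items. The first is kept aside for the
 * caller; the rest are pushed onto the shared stack. Returns NULL only if
 * not even one item could be allocated. Caller must hold fl->lock. */
static item_t *grow_and_reserve(free_list_t *fl)
{
    item_t *reserved = NULL;

    for (size_t i = 0; i < fl->num_per_alloc; ++i) {
        item_t *it = malloc(sizeof(*it));
        if (it == NULL) {
            break;
        }
        if (reserved == NULL) {
            it->next = NULL;
            reserved = it;       /* never visible to other threads */
        } else {
            it->next = fl->head;
            fl->head = it;
        }
    }
    return reserved;
}

item_t *free_list_get(free_list_t *fl)
{
    assert(fl != NULL);
    pthread_mutex_lock(&fl->lock);
    item_t *it = fl->head;
    if (it != NULL) {
        fl->head = it->next;
    } else {
        it = grow_and_reserve(fl);
    }
    pthread_mutex_unlock(&fl->lock);
    return it;
}
```

With this shape, a NULL return can only mean that allocation failed, which is the assumption the commit message says callers were already making.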
Nathan Hjelm
9ea5dfa799 class/opal_fifo: fix warning
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-15 19:18:31 -06:00
Jeff Squyres
cef6cf0ac5 opal: update some string handling
Make various string handling locations a little more robust.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:04:28 -07:00
Jeff Squyres
e6de42c379 opal/util/info.c: fix compiler warning
Remove unused variable.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:03:28 -07:00
Brian Barrett
5162011428 Revert "Handle error cases in TCP BTL"
This reverts commit 6acebc40a1.

This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run.  See
https://github.com/open-mpi/ompi/issues/5849 for more information.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 15:01:54 -07:00
Brian Barrett
8feff49f6c pmix: Fix filename typo
Looks like a filename was missed when pmix sucked in the installdirs
framework.  Fixing the typo fixes "make ctags" and "make cscope".

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 17:05:01 +00:00
Nathan Hjelm
6ed68da870 btl/uct: use the correct tl interface attributes
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-11 11:33:17 -06:00
Brian Barrett
902b57919c btl tcp: Print number of endpoints on error
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Brian Barrett
bd07cc6707 btl tcp: Improve initialization debugging
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.

Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.

Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Gilles Gouaillardet
ca9009e3d0 pmix/ext3x: fix minor typos
refer to PMIx 3x instead of the previous 2x

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:18:36 +09:00
Gilles Gouaillardet
b3bd785ce0 pmix/ext3x: add missing header files into the dist tarball
Refs. open-mpi/ompi#5871

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:17:54 +09:00
Gilles Gouaillardet
67402f5f19 pmix/ext2x: add missing header files into the dist tarball
Refs. open-mpi/ompi#5871

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:17:42 +09:00
Nathan Hjelm
39be6ec15c btl/uct: bug fixes and general improvements
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports are added to the
priority list. This will give better out of the box performance for
multi-threaded codes because the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.

This commit also fixes a number of leaks and a possible deadlock when
using RDMA.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-09 15:15:45 -06:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Brian Barrett
c087365ead opal/util: Always support BSD behavior of asprintf
Open MPI's developers like to assume that asprintf() always sets
the ptr to NULL on error, but the standard (and Linux glibc) do
not guarantee this.  As a result, we're making opal_asprintf()
always available for developers, which will guarantee that
ptr is set to NULL on error.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
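The guarantee described in c087365ead and e9e4d2a4bc (the output pointer is always NULL on error) can be sketched with a portable wrapper built on `vsnprintf`. `checked_asprintf` is a hypothetical name; this is not the source of `opal_asprintf()`, only an illustration of the contract.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Formats into a freshly allocated string. On success returns the length
 * and stores the buffer in *ptr; on any failure returns -1 and leaves
 * *ptr set to NULL. */
int checked_asprintf(char **ptr, const char *fmt, ...)
{
    va_list ap, ap2;

    assert(ptr != NULL && fmt != NULL);
    *ptr = NULL;                     /* defined value on every error path */

    va_start(ap, fmt);
    va_copy(ap2, ap);
    int len = vsnprintf(NULL, 0, fmt, ap);   /* measure the output */
    va_end(ap);
    if (len < 0) {
        va_end(ap2);
        return -1;
    }

    char *buf = malloc((size_t)len + 1);
    if (buf == NULL) {
        va_end(ap2);
        return -1;
    }

    int written = vsnprintf(buf, (size_t)len + 1, fmt, ap2);
    va_end(ap2);
    if (written < 0 || written > len) {
        free(buf);
        return -1;
    }

    *ptr = buf;
    return written;
}
```

Callers can then test either the return value or the pointer, which is the BSD-like behavior the commit messages say the code base had been assuming.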
Nathan Hjelm
8291f6722d btl/vader: fix race condition in writing header
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-05 16:30:06 -06:00
Ralph Castain
76c11a1496 Remove build product - fix autogen.pl mode
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 13:38:39 -07:00
Ralph Castain
c498a7e77a Protect PMIx from bad configure entry
Ignore with-hwloc=internal or external as those are meaningless to pmix
(will upstream)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 07:07:05 -07:00
Ralph Castain
f379ba9c8e Fail configure if pmix won't build
If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 06:29:36 -07:00
Geoff Paulsen
56a55cdb4b
Merge pull request #5786 from jsquyres/pr/string-madness
Replace strcpy() and strncpy() with (new) opal_string_copy()
2018-10-04 16:12:46 -05:00
Ralph Castain
86702b71bc Correctly notify upon process failure
We only need to pass a custom range if the target is a single process.
Otherwise, we let the range be "session".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-03 20:19:42 -07:00
Nathan Hjelm
dfa8d3a81a btl/vader: work around Oracle compiler bug
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:

_Atomic intptr_t x, y;
intptr_t z = 0;

x = y = z;

Will produce a compiler error of the form:

operand cannot have void type: op "="
assignment type mismatch:
	long "=" void

To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.

Fixes #5814

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-03 08:55:51 -06:00
Nathan Hjelm
66a7dc4c72 btl/vader: ensure the fast box tag is always read first
On some platforms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-03 07:17:32 -06:00
Ralph Castain
31f6c75498
Merge pull request #5819 from bosilca/fix/local_bind
Fix/local bind
2018-10-02 11:27:36 -07:00
Brian Barrett
a25df3f29e opal: Remove outdated MacOS workaround
Remove the pack/unpack pragma around net/if.h on MacOS, which
was added to fix a bug in MacOS X 10.4.x on 64-bit platforms.
The bug was fixed in Mac OS X 10.5.0 and, sometime in the last
11 years, compilers started emitting warnings about the fact
that the Apple header stomped over the pragma pack settings
from the workaround.  We already don't support versions of MacOS
earlier than 10.5, so there's no point in keeping the workaround.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:46 -04:00
Brian Barrett
19e16d5fd0 opal: Disable memory patcher component on MacOS
Open MPI doesn't support any transports on MacOS which require
memory manager hooks.  The memory patcher component uses the
syscall interface, which has been deprecated in recent versions
of MacOS.  Since we don't need it and it emits warnings about
deprecation, disable the memory patcher component on MacOS.

Fixes #5671

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:15 -04:00
George Bosilca
a3a492b42c
Small pedantic fixes.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:08:18 -04:00
George Bosilca
9164e26e2f
Provide the correct socklen to bind.
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.

Add a clearer error message in debug mode.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:06:40 -04:00
Ralph Castain
fcc1d30ab3
Merge pull request #5775 from karasevb/check_old_topo_key
pmix: check the old topo key to keep compatibility with old RMs
2018-10-02 08:12:42 -07:00
Jeff Squyres
5dae086f7e btl/tcp: output the IP address correctly
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.

1. On the endpoint, it is storing the moral equivalent of a
   (struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).

The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr).  Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).

This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.

NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module.  This only happens with interfaces that have
more than one IP address.  This commit does not fix that issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 16:12:57 -07:00
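The approach taken in 5dae086f7e, converting a bare `struct in_addr` with `inet_ntop()` rather than treating it as a full `struct sockaddr`, can be sketched as follows. `ipv4_to_string` is a hypothetical wrapper for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Converts an IPv4 address (the sin_addr member, not a whole sockaddr)
 * to dotted-decimal text. Returns buf on success, NULL on failure. */
const char *ipv4_to_string(const struct in_addr *addr, char *buf, size_t buflen)
{
    assert(addr != NULL && buf != NULL && buflen >= INET_ADDRSTRLEN);
    return inet_ntop(AF_INET, addr, buf, (socklen_t)buflen);
}
```

Passing the address of `sin_addr` directly avoids any ambiguity about which bytes of the structure hold the address.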
Jeff Squyres
293a938d29 string_copy: assert() fail if the copy is too long
Add a heuristic: if the string copy is "too long", fail an assert().
This is based on the premise that Open MPI doesn't do large string
copies.  So if we see a dest_len that is over a certain threshold
(currently set at 128K), this is likely a programmer error, and on
debug builds, we should fail an assert().  In production builds, it
will work just fine (assuming that it's not a programmer error).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 13:34:15 -07:00
Ralph Castain
172e770154 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-28 12:43:32 -07:00
Ralph Castain
c836c4282c Update to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-27 20:57:23 -07:00
Jeff Squyres
28e51ad4d4 opal: convert from strncpy() -> opal_string_copy()
In many cases, this was a simple string replace.  In a few places, it
entailed:

1. Updating some comments and removing now-redundant foo[size-1]='\0'
   statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
   wants the entire destination buffer length).

This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 11:56:18 -07:00
Jeff Squyres
e1cf8757ae opal/util/string_copy: add "safe" string copy function
Do essentially the same thing as strncpy(3), but a) ensure to always
terminate the destination buffer with a \0, and b) do not \0-pad to
the right.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 10:48:04 -07:00
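The two properties listed in e1cf8757ae (always \0-terminate, never \0-pad to the right) can be sketched as below. `safe_string_copy` is a hypothetical name and this is not the source of `opal_string_copy()`; as in the later conversion commit, `dest_len` is assumed to be the full size of the destination buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src into dest, writing at most dest_len bytes including the
 * terminator. Bytes after the terminator are left untouched. */
void safe_string_copy(char *dest, const char *src, size_t dest_len)
{
    size_t i;

    assert(dest != NULL && src != NULL && dest_len > 0);
    for (i = 0; i + 1 < dest_len && src[i] != '\0'; ++i) {
        dest[i] = src[i];
    }
    dest[i] = '\0';
}
```

Compared with `strncpy(3)`, a too-long source is truncated but still terminated, and a short source does not cause the rest of the buffer to be zero-filled.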
Jeff Squyres
38fefcf1cf opal/util/Makefile.am: fix whitespace (remove tabs)
No code or logic changes

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 09:43:03 -07:00
Jeff Squyres
6bb356ab87 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-26 12:15:21 -07:00
Boris Karasev
ed42f568ae pmix: check the old topo key to keep compatibility with old RMs
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-09-25 18:13:54 +03:00
Gilles Gouaillardet
db65dbd9a8 ucx: use the c99 __func__ macro instead
__FUNCTION__ macro was never standardized and should not be used.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-09-25 11:19:18 +09:00
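The replacement made in db65dbd9a8 relies on `__func__`, the predefined identifier standardized in C99. A small sketch, with a hypothetical `LOG_HERE` macro and function name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical logging macro: prefixes the message with the name of the
 * enclosing function using the standard __func__ identifier. */
#define LOG_HERE(buf, len, msg) \
    snprintf((buf), (len), "%s: %s", __func__, (msg))

int describe_failure(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return LOG_HERE(buf, len, "failed");
}
```

Because `__func__` is expanded where the macro is used, the text produced inside `describe_failure` starts with that function's name.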
Jeff Squyres
176da51aec mca_base_var: fix output bug about settable vars
Fix the test that determined whether we output "writeable" or
"read-only" for MCA vars (it was checking the wrong flag).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-22 07:18:45 -07:00
Nathan Hjelm
c6230657a2 opal: fix warning
Fixes #5713

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-09-21 00:51:40 -06:00
Brian Barrett
9344afd485 openib: Disable CUDA async by default
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References #3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-09-20 07:58:10 -07:00
Howard Pritchard
dccf780546
Merge pull request #5737 from hppritcha/topic/remove_scif_support
SCIF: remove it
2018-09-19 11:38:27 -06:00
Howard Pritchard
b9ac3d8931 SCIF: remove it
KNC is effectively dead.  Remove corresponding SCIF
support in Open MPI.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-09-19 10:39:52 -06:00
bosilca
b0e2975f86
Merge pull request #5701 from ICLDisco/export/errors_ib
Adding error handling in OpenIB BTL
2018-09-19 09:45:58 -04:00
bosilca
464e1abbab
Merge pull request #5700 from ICLDisco/export/tcp_errors
Handle error cases in TCP BTL
2018-09-19 09:44:38 -04:00
Michael Kuron
17b0f1fcc3 Deal with EOPNOTSUPP returned from getsockopt()
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.

Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
2018-09-16 14:55:28 +02:00
Jeff Squyres
3970b06134 misc: passing a bool to va_start() is undefined
According to clang on MacOS, passing a bool parameter -- which
undergoes default argument promotion -- to va_start() results in
undefined behavior.  So just change these params to int and avoid the
issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-15 06:04:13 -07:00
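The change in 3970b06134 can be illustrated with a variadic function whose last named parameter is a flag. Declaring it as `int` instead of `bool` keeps its type unchanged under the default argument promotions, which is what `va_start()` requires. `sum_ints` is a hypothetical example function.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` int arguments; `negate` is an int (not a bool) because it
 * is the parameter handed to va_start(). */
int sum_ints(int count, int negate, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);
    va_start(ap, negate);
    for (int i = 0; i < count; ++i) {
        total += va_arg(ap, int);
    }
    va_end(ap);
    return negate ? -total : total;
}
```

Callers can still pass `true` or `false`; the values are simply received as an `int`.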