5322 Commits

Author SHA1 Message Date
Howard Pritchard
fb39c7f7e6
Merge pull request #6271 from rhc54/cmr401/pmix3
v4.0.1: Update to PMIx 3.1.2
2019-01-28 15:09:03 -06:00
Ralph Castain
335f8c5100 Update to PMIx 3.1.2
Update the OPAL glue configure code to correctly link the opal/pmix3
component to the hwloc used by OMPI instead of defaulting to the
system-level hwloc. Required a corresponding update to the PMIx hwloc
configure code so we treat hwloc the same way we handle libevent in
embedded scenarios. Roll to PMIx v3.1.2 for plugging of memory leaks and
addition of faster PMIx_Get response

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-25 15:58:53 +09:00
Howard Pritchard
3ef8a8b253
Merge pull request #6280 from hppritcha/topic/mpool_comp_fix_v40x
Update mpool_hugepage_component.c
2019-01-18 07:34:02 -06:00
heasterday
2fc5ab7c8f Update mpool_hugepage_component.c
Signed-off-by: Hunter Easterday <heasterday@lanl.gov>
(cherry picked from commit ad0d2c451e63301e5a3b595f9df67bd5c813955e)
(cherry picked from commit 509380d99fc7e293f18a2dbb495ef73f9f4cbfef)
2019-01-15 08:10:46 -07:00
Gilles Gouaillardet
61108b6228 pmix/ext3x: fix support for external PMIx v3.1
The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard.
Open MPI does not really need them (they are only used to be reported as not supported),
so simply #ifdef protect them to support an external PMIx v3.1

The change only needs to be made in ext3x/ext3x.c.
But since this file is automatically generated from pmix3x/pmix3x.c, we have to update
the latter file.

Refs. open-mpi/ompi#6247

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(back-ported from commit open-mpi/ompi@950ba16aa1)
2019-01-07 20:27:34 +09:00
Howard Pritchard
fd2259774d
Merge pull request #6167 from ggouaillardet/topic/v4.0.x/btl_uct_fix_warning
btl/uct: fix misc warnings
2018-12-21 15:31:16 -07:00
Howard Pritchard
4cce2b84fa
Merge pull request #6176 from ggouaillardet/topic/v4.0.x/uct_configury
v4.0.x: btl/uct: fix a typo in configure.m4
2018-12-11 09:19:06 -07:00
Howard Pritchard
ebd3421e53
Merge pull request #6165 from ggouaillardet/topic/v4.0.x/pmix_atomics
pmix/pmix3x: fix macros usage in embedded pmix3x
2018-12-11 09:17:17 -07:00
Gilles Gouaillardet
f446472f06 btl/uct: fix a typo in configure.m4
remove whitespace around '=' when setting btl_uct_LIBS

Thanks Ake Sandgren for reporting this

Refs. open-mpi/ompi#6173

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@b89deeb1bb)
2018-12-11 15:15:12 +09:00
Gilles Gouaillardet
dd2b1ce49b btl/uct: fix a warning
Use the PRIsize_t macro to correctly print a size_t

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@78aa6fdd1d)
2018-12-07 17:02:16 +09:00
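As an illustration of the fix in commit dd2b1ce49b above, here is a minimal sketch of printing a `size_t` through a format macro. `PRIsize_t` is Open MPI's own macro and its definition is not shown in this log; `MY_PRIsize_t` and `format_size` below are hypothetical stand-ins that assume a C99 `printf` family with `%zu` support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a PRIsize_t-style macro: it expands to the
 * length modifier and conversion specifier for size_t (C99 "zu"). */
#define MY_PRIsize_t "zu"

/* Format a size_t into buf. Returns what snprintf returns: the number of
 * characters the full output needs, excluding the terminating NUL. */
static int format_size(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, "size = %" MY_PRIsize_t, value);
}
```

Keeping the conversion in one macro means a platform without `%zu` only needs a different macro definition, not a change at every call site.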
Gilles Gouaillardet
ffbe85c65f btl/uct: fix AC_CHECK_DECLS usage
AC_CHECK_DECLS takes a comma-separated list of macros/symbols,
so replace the whitespace separator with a comma.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@b715dd2657)
2018-12-07 16:27:57 +09:00
Gilles Gouaillardet
195a07d03d pmix/pmix3x: fix macros usage in embedded pmix3x
Use PMIX_* macros instead of OPAL_* macros
master does things differently, so this is a one-off commit

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-07 16:07:07 +09:00
Nathan Hjelm
0957861689 btl/uct: fix some issues when using UCX over ugni
Though not a recommended configuration it is possible to use Open MPI
over UCX over uGNI. This configuration had some issues related to the
connection management and tl selection. This commit fixes those
issues.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit e07a64c52d92adf51732ea78e17b679f6deffa12)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-06 10:57:59 -07:00
Gilles Gouaillardet
057118dbe6 btl/uct: fix AC_CHECK_DECLS usage
AC_CHECK_DECLS takes a comma-separated list of macros/symbols,
so replace the whitespace separator with a comma.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit b715dd26572ab18fdba92f06143456c0f9d6380a)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-06 10:56:45 -07:00
Joshua Hursey
e1f75d5ff1 Add OPAL_VPID to unpacking
* Needed to properly read PMIx job data like the following
   - `OPAL_PMIX_LOCALLDR`
   - `OPAL_PMIX_RANK`
   - `OPAL_PMIX_GLOBAL_RANK`
   - `OPAL_PMIX_APPLDR`
   - `OPAL_PMIX_APP_RANK`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit a557c4130c42a5a41aba5c08e606e7129d0bcb6d)
2018-11-21 11:48:58 -06:00
Howard Pritchard
9275b692b7
Merge pull request #6043 from hjelmn/v4.0.x_fix_this_damn_memory_barrier_bug_that_is_referenced_in_github_bug_6014_now_lets_get_this_release_out_the_door
v4.0.x: opal/asm: work around possible gcc compiler bug
2018-11-08 08:01:33 -07:00
Nathan Hjelm
5efc76ef44 pmix3x: fix potential memory barrier bug with __atomic builtin atomics
See open-mpi/ompi#6014 for more information.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-11-06 10:37:14 -07:00
Nathan Hjelm
e57c3fb3c9 opal/asm: work around possible gcc compiler bug
It seems in some cases (gcc older than v6.0.0) the __atomic_thread_fence is a
no-op with __ATOMIC_ACQUIRE. This appears to be the case with X86_64 so go
ahead and use __ATOMIC_SEQ_CST for the x86_64 read memory barrier. This should
not cause any performance issues as it is equivalent to the memory barrier
in the hand-written atomics.

References #6014

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 30119ee339eea086f43e3392352899187a4a73c7)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-11-06 10:28:13 -07:00
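The workaround in commit e57c3fb3c9 above could be sketched roughly as follows. This is not the Open MPI source: `my_read_barrier` and `read_published` are hypothetical names, and the sketch assumes a compiler that provides the GCC-style `__atomic` builtins.

```c
#include <assert.h>

/* Read memory barrier sketch. Following the commit's choice, x86_64
 * requests a sequentially consistent fence; other targets keep the
 * acquire fence. */
static inline void my_read_barrier(void)
{
#if defined(__x86_64__)
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#else
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
#endif
}

/* Typical consumer pattern: check a ready flag, issue the read barrier,
 * then read the payload the producer published before setting the flag.
 * Returns 1 and stores the payload in *out when ready, 0 otherwise. */
static int read_published(const int *ready, const int *payload, int *out)
{
    if (0 == __atomic_load_n(ready, __ATOMIC_RELAXED)) {
        return 0;
    }
    my_read_barrier();
    *out = *payload;
    return 1;
}
```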
Howard Pritchard
bbfde1533b btl/openib: fix a problem with ib query
Under certain circumstances, ibv_exp_query_device was
returning an error due to uninitialized fields in the
extended attributes struct.

Fixes: #5810
Fixes: #5914

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 8126779a354b3e0c720d3e1790f7b936dd5b93b2)
2018-11-02 10:42:17 -06:00
Sergey Oblomov
0846c9d112 COMMON/UCX: added error code to log output
Also fixes a PGI compilation error with --enable-debug.

Signed-off-by: Geoff Paulsen <gpaulsen@users.noreply.github.com>
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 1099d5f02327329e0c58d9403e3e0a7f1e1d1920)
2018-10-30 09:55:25 -05:00
Gilles Gouaillardet
7f36dce348 event/external: fix version requirement
Only default to the external component if its version is
greater than or equal to that of the internal libevent (2.0.22)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit b2050392051ed3d9d842f105326f7ea2223aafc4)
2018-10-29 15:20:48 -06:00
Gilles Gouaillardet
40a1f0c214 event/external: misc configury fixes
 - Always use the external component when configure'd with --with-libevent=external
 - Fix the external libevent library version detection
   by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros
 - Use the event2/event.h header (event.h is deprecated since libevent 2.0)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 35e77a286c369c1be66ee7ef9ad5ec2faef47edb)
2018-10-29 15:20:32 -06:00
Nathan Hjelm
270ae73611 btl/uct: update for UCT_CB_FLAG_SYNC removal
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
(cherry picked from commit 1b37328ba8e8cbd30a5dcdae8332dc879aea48d7)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-23 10:18:43 -06:00
Howard Pritchard
e8f28a5506
Merge pull request #5882 from hjelmn/btl_uct_updates_for_the_v4_branch
btl/uct: bug fixes and general improvements
2018-10-22 13:12:34 -06:00
Howard Pritchard
210b4c60aa remove some dead crs components
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 6564d3d217c3ebff24d0e1fd72929756dc498dfe)
2018-10-17 16:16:36 -06:00
Nathan Hjelm
e6f84e79de btl/uct: fix deadlock in connection code
This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.

At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.

This commit also fixes some bugs in the active-message send path:

 - Correctly set all fragment fields in prepare_src.

 - Fix bug when using buffered-send. We were not reading the return
   code correctly (which is in bytes). This resulted in a message
   getting sent multiple times.

 - Don't try to progress sends from the btl_send function when in an
   active-message callback. It could lead to deep recursion and an
   eventual crash if we get a trace like
   send->progress->am_complete->ob1_callback->send->am_complete...

Closes #5820
Closes #5821

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 707d35deeb62a93ea8a3806d07e07e3a96c51d19)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-16 19:16:11 -06:00
Nathan Hjelm
eaa98af52c opal/free_list: fix race condition
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:

```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```

The problem is that in a multithreaded environment it *is* possible for the
free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. This happens if,
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic, other threads pop all the items
added to the free list.

This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.

Fixes #2921

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 5c770a7becc496f63b9f9a59151206236416f4f4)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:28:20 -06:00
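The idea behind the fix in commit eaa98af52c above can be sketched with a toy free list. This is an illustration only, not the Open MPI code: every `toy_*` name is hypothetical, and the list itself is not thread safe (the real code uses an atomic LIFO and a lock). What it shows is the hand-off: the grow routine gives one of the newly allocated items directly to the caller instead of pushing it onto the shared list, so no other thread can take it first.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_item { struct toy_item *next; } toy_item_t;

typedef struct {
    toy_item_t *head;       /* stack of free items */
    size_t num_per_alloc;   /* how many items to add per grow */
} toy_free_list_t;

static void toy_push(toy_free_list_t *fl, toy_item_t *item)
{
    item->next = fl->head;
    fl->head = item;
}

static toy_item_t *toy_pop(toy_free_list_t *fl)
{
    toy_item_t *item = fl->head;
    if (NULL != item) {
        fl->head = item->next;
    }
    return item;
}

/* Grow the list by n items. If item_out is non-NULL and still empty, the
 * first new item is reserved for the caller rather than pushed.
 * Returns 0 if at least one item was allocated or n was 0, -1 otherwise. */
static int toy_grow(toy_free_list_t *fl, size_t n, toy_item_t **item_out)
{
    for (size_t i = 0; i < n; ++i) {
        toy_item_t *item = malloc(sizeof(*item));
        if (NULL == item) {
            return (i > 0) ? 0 : -1;
        }
        if (NULL != item_out && NULL == *item_out) {
            item->next = NULL;
            *item_out = item;       /* reserved for the growing caller */
        } else {
            toy_push(fl, item);
        }
    }
    return 0;
}

static toy_item_t *toy_get(toy_free_list_t *fl)
{
    toy_item_t *item = toy_pop(fl);
    if (NULL == item) {
        /* A real implementation would hold the list lock around the grow. */
        (void) toy_grow(fl, fl->num_per_alloc, &item);
    }
    return item;    /* NULL now only means the allocation itself failed */
}
```

With this shape, a NULL return from `toy_get` can again be read as an out-of-memory condition, which is what callers were assuming all along.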
Nathan Hjelm
0c4ba45af2 btl/uct: use the correct tl interface attributes
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 6ed68da870c391d88575dc027a3de4826a77f57e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-11 11:34:33 -06:00
Geoff Paulsen
c8ff7e3ef2
Merge pull request #5874 from rhc54/cmr40/config
v4.0.0: Fix configury for internal PMIx
2018-10-10 10:47:26 -05:00
Nathan Hjelm
1153082a0f btl/uct: bug fixes and general improvements
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports are added to the
priority list. This will give better out of the box performance for
multi-threaded codes because the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.

This commit also fixes a number of leaks and a possible deadlock when
using RDMA.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 39be6ec15c202d31423476f09e70199453d25adc)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-09 16:16:50 -06:00
Ralph Castain
376e2e4d98 Add missing file to tarball
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-09 06:46:09 -07:00
Ralph Castain
40270bd24b Minor cleanups to the pmix/ext2x component
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-09 06:42:08 -07:00
Ralph Castain
12790e8ec6 Protect PMIx from bad configure entry
Ignore with-hwloc=internal or external as those are meaningless to pmix
(will upstream)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit c498a7e77a377ddc3a7bcc26ea072627a33cb470)
2018-10-09 03:45:58 -07:00
Ralph Castain
3e2cc6f46a Fail configure if pmix won't build
If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit f379ba9c8e5ce17641937c351ab46e4b4a82446c)
2018-10-09 03:45:37 -07:00
Nathan Hjelm
fba5eda436 btl/vader: fix race condition in writing header
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
(cherry picked from commit 8291f6722d890efd15333bf7b26f0d07952fa41e)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-08 08:49:01 -06:00
Nathan Hjelm
fa768d748f btl/vader: work around Oracle compiler bug
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:

_Atomic intptr_t x, y;
intptr_t z = 0;

x = y = z;

Will produce a compiler error of the form:

operand cannot have void type: op "="
assignment type mismatch:
	long "=" void

To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.

Fixes #5814

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit dfa8d3a81ac64e32d2bfb13a9afb20b83d747e03)
2018-10-03 11:36:17 -04:00
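The workaround in commit fa768d748f above amounts to splitting the chained assignment. A minimal sketch, assuming a C11 compiler with `<stdatomic.h>`; the `toy_fifo_t` type and `toy_fifo_reset` function are hypothetical and only stand in for the head and tail fields the commit mentions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical queue header with atomic head and tail fields. */
typedef struct {
    _Atomic intptr_t head;
    _Atomic intptr_t tail;
} toy_fifo_t;

static void toy_fifo_reset(toy_fifo_t *fifo, intptr_t value)
{
    /* One assignment per statement, instead of the chained form
     *     fifo->head = fifo->tail = value;
     * which the commit reports the affected compiler rejects. */
    fifo->head = value;
    fifo->tail = value;
}
```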
Nathan Hjelm
df6dd69db8 btl/vader: ensure the fast box tag is always read first
On some platforms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 66a7dc4c72cb25df67e7f872bee7a20b5fa9c763)
2018-10-03 11:36:17 -04:00
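The read ordering described in commit df6dd69db8 above might look roughly like the sketch below. The header layout and the `toy_fbox_*` names are hypothetical, not the vader definitions, and the sketch assumes the writer stores the tag last. The reader looks at the tag alone first and reads the full 64-bit header only once the tag is set, so a 64-bit load that is split into two 32-bit loads cannot observe a set tag together with stale remaining fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit fast box header; a tag of 0 means "empty". */
typedef union {
    struct {
        uint32_t size;
        uint16_t tag;
        uint16_t seq;
    } data;
    uint64_t ival;
} toy_fbox_hdr_t;

/* Poll a shared header. Returns 1 and copies the header into *out when a
 * message is present, 0 when the box is empty. */
static int toy_fbox_poll(const volatile toy_fbox_hdr_t *shared,
                         toy_fbox_hdr_t *out)
{
    /* Step 1: read only the tag. */
    uint16_t tag = shared->data.tag;
    if (0 == tag) {
        return 0;
    }

    /* Step 2: order the tag read before the full header read. */
    atomic_thread_fence(memory_order_acquire);

    /* Step 3: read the whole 64-bit value. */
    out->ival = shared->ival;
    return 1;
}
```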
Geoff Paulsen
593d652077
Merge pull request #5823 from jsquyres/pr/v4.0.x/fix-tcp-btl-show-help-ip-address
v4.0.x: btl/tcp: output the IP address correctly
2018-10-03 08:29:37 -05:00
George Bosilca
b63bee5da4 Small pedantic fixes.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a3a492b42cd7b114b435fafa7cd46222dc565dd1)
2018-10-02 14:37:31 -04:00
George Bosilca
d450e460d6 Provide the correct socklen to bind.
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.

Add a clearer error message in debug mode.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 9164e26e2f323c43c9a671cb510bb4df03e45628)
2018-10-02 14:37:31 -04:00
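The length selection described in commit d450e460d6 above can be sketched as a small helper. `toy_addr_len` is a hypothetical name, not the function used in the TCP BTL; the sketch assumes a POSIX sockets environment.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Return the address length matching the family stored in ss, or 0 for a
 * family this sketch does not handle. The point is to avoid passing
 * sizeof(struct sockaddr_storage) to bind(). */
static socklen_t toy_addr_len(const struct sockaddr_storage *ss)
{
    switch (ss->ss_family) {
    case AF_INET:
        return (socklen_t) sizeof(struct sockaddr_in);
    case AF_INET6:
        return (socklen_t) sizeof(struct sockaddr_in6);
    default:
        return 0;
    }
}
```

A caller would then write something like `bind(fd, (struct sockaddr *) &ss, toy_addr_len(&ss))` and treat a length of 0 as an unsupported address family.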
Jeff Squyres
81f2f19398 btl/tcp: output the IP address correctly
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.

1. On the endpoint, it is storing the moral equivalent of a
   (struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).

The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr).  Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).

This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.

NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module.  This only happens with interfaces that have
more than one IP address.  This commit does not fix that issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 5dae086f7e4aee28fbb5a7282a2661286a5f68fe)
2018-10-02 10:49:51 -04:00
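The direct conversion described in commit 81f2f19398 above can be sketched as follows. `toy_ipv4_to_string` is a hypothetical wrapper; the only library call it relies on is the POSIX `inet_ntop()`, which takes the bare `struct in_addr` rather than a full `struct sockaddr`.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Convert a bare IPv4 address (the kind stored in sockaddr_in.sin_addr)
 * to dotted-quad text. Returns buf on success, NULL on failure. */
static const char *toy_ipv4_to_string(const struct in_addr *addr,
                                      char *buf, socklen_t buflen)
{
    return inet_ntop(AF_INET, addr, buf, buflen);
}
```

A buffer of `INET_ADDRSTRLEN` bytes is large enough for any IPv4 address in this form.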
Sergey Oblomov
68d3baffd5 OPAL/COMMON/UCX: used __func__ macro instead of __FUNCTION__
- used __func__ macro instead of __FUNCTION__ to unify
  macro usage with other components

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 9a51e257d162e724845024e3505880256194ebe2)
2018-09-27 12:04:07 +03:00
Jeff Squyres
8bdf4553d9 mca_base_var: fix output bug about settable vars
Fix the test that determined whether we output "writeable" or
"read-only" for MCA vars (it was checking the wrong flag).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 176da51aec0955a51f21157b33b21f60b6f28092)
2018-09-24 14:14:51 -07:00
Geoff Paulsen
3d4164e1e1
Merge pull request #5752 from gpaulsen/misc-warnings-fixes
Miscellaneous compiler warning stomps.
2018-09-22 15:01:53 -05:00
Geoff Paulsen
556367af31
Merge pull request #5754 from gpaulsen/event_threading
opal/progress: protect against multiple threads in event base
2018-09-22 15:00:32 -05:00
Mark Allen
5ac3fac6c2 snprintf() length fix for info
The important part of this fix is a couple of places where 5 was hard-coded but needed to be
strlen(OPAL_INFO_SAVE_PREFIX).

But also this contains a fix for a gcc 7.3.0 compiler warning about snprintf(). There
was an "if" statement making sure all the arguments had appropriate strlen(), but gcc
still complained about the following snprintf() because the size of the struct element
is iterator->ie_key[OPAL_MAX_INFO_KEY + 1].

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2018-09-21 14:47:11 -05:00
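The pattern behind the fix in commit 5ac3fac6c2 above can be sketched like this. The prefix value, the key limit, and the `toy_make_saved_key` helper are all hypothetical; the real `OPAL_INFO_SAVE_PREFIX` and `OPAL_MAX_INFO_KEY` values are not shown in this log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_SAVE_PREFIX "_SAVED_"   /* hypothetical prefix */
#define TOY_MAX_KEY     36          /* hypothetical key length limit */

/* Build "<prefix><key>" into out, which must hold TOY_MAX_KEY + 1 bytes.
 * Returns 0 on success, -1 if the combined key would not fit. */
static int toy_make_saved_key(char *out, const char *key)
{
    /* Derive the prefix length from the prefix itself instead of
     * hard-coding a number that silently goes stale. */
    size_t prefix_len = strlen(TOY_SAVE_PREFIX);

    if (strlen(key) + prefix_len > TOY_MAX_KEY) {
        return -1;
    }
    snprintf(out, TOY_MAX_KEY + 1, "%s%s", TOY_SAVE_PREFIX, key);
    return 0;
}
```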
Nathan Hjelm
cd88e307fd opal/progress: protect against multiple threads in event base
libevent does not support multiple threads calling the event loop on
the same event base. This causes external libevent's to print out
re-entrant warning messages. This commit fixes the issue by protecting
the call to the event loop with an atomic swap check.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-21 14:40:08 -05:00
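The guard described in commit cd88e307fd above can be sketched with a C11 atomic exchange. All `toy_*` names are hypothetical, and `toy_event_loop` merely counts calls where the real code would enter the libevent loop.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int toy_event_loop_active = 0;  /* 1 while a thread is in the loop */
static int toy_loop_calls = 0;                /* how many times the loop ran */

/* Stand-in for the call into the event library. */
static void toy_event_loop(void)
{
    ++toy_loop_calls;
}

/* Returns 1 if this caller ran the event loop, 0 if it skipped the call
 * because another thread was already inside it. */
static int toy_progress_events(void)
{
    /* The exchange returns the previous value: only the caller that
     * flips the flag from 0 to 1 may enter the loop. */
    if (0 != atomic_exchange(&toy_event_loop_active, 1)) {
        return 0;
    }
    toy_event_loop();
    atomic_store(&toy_event_loop_active, 0);
    return 1;
}
```

Skipping rather than blocking keeps the progress function cheap: a thread that loses the race returns and tries again on its next progress call.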
Jeff Squyres
2e37f97a38 Miscellaneous compiler warning stomps.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit fe0852bcb4d14a6aaf5a3e1021f60b5be32dd42d)
2018-09-21 14:35:51 -05:00
Ralph Castain
192f0f6fff Remove the OFI/BTL component
Remove this component pending re-architecture of the overall OFI
components. We have had similar issues before when multiple components
use the same library - typical issues are race conditions, initialize
and finalize errors, etc. We are seeing similar problems here as we get
broader exposure to different library version and environment
combinations.

The correct fix in the past has been to centralize the library
interactions in a "common" component. We will pursue that here by moving
some additional functions (e.g., endpoint creation) into the existing
opal/mca/common/ofi component. We can't do that and thoroughly test it
in time for the v4.0.0 release, so we'll simply remove this component
from the release.

Once we have things correctly fixed, we'll submit a PR to restore the
component plus the related fixes to some future v4.x release.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-20 18:09:15 -07:00
Geoff Paulsen
4688da0631
Merge pull request #5736 from hoopoepg/topic/topic/common-del-procs-v4.0
MCA/COMMON/UCX: del_procs calls are unified to common module - v4.0
2018-09-20 18:12:25 -05:00