1
1

29110 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
eaa98af52c opal/free_list: fix race condition
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:

```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```

The problem is in a multithreaded environment is *is* possible for the
free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. The happens if
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic other threads pop all the items
added to the free list.

This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.

Fixes #2921

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 5c770a7becc496f63b9f9a59151206236416f4f4)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:28:20 -06:00
Howard Pritchard
e2cf1e3ec5
Merge pull request #5887 from yosefe/topic/osc-ucx-fix-finalize-hang-v4.0.x
osc_ucx: fix hang/timeout in component finalize - v4.0
2018-10-16 09:21:50 -06:00
Howard Pritchard
753087ab17
Merge pull request #5888 from hoopoepg/topic/fixed-zero-size-window-v4.0
OSC/UCX: fixed zero-size window processing - v4.0.x
2018-10-16 09:21:08 -06:00
Howard Pritchard
e20284ac4d
Merge pull request #5908 from ggouaillardet/topic/v4.0.x/mpi_sizeof_misc_additions
fortran: add CHARACTER and LOGICAL support to MPI_Sizeof()
2018-10-16 09:16:00 -06:00
Gilles Gouaillardet
2b5a7ca816 fortran: add CHARACTER and LOGICAL support to MPI_Sizeof()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@e4001040b4)
2018-10-12 14:10:45 +09:00
Yossi Itigin
eabc94cab0 osc_ucx: add worker flush before osc module free
Make sure all pending communications are done on all ranks before
closing the window. This way it will be safe to close the endpoints when
closing the component.

(picked from master b8e1af6)

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 23:02:19 +03:00
Geoff Paulsen
c8ff7e3ef2
Merge pull request #5874 from rhc54/cmr40/config
v4.0.0: Fix configury for internal PMIx
2018-10-10 10:47:26 -05:00
Sergey Oblomov
274cbc3c03 OSC/UCX: fixed zero-size window processing
- added processing of zero-size MPI window

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit ae6f81983fe354de812ebe2532120fb20ae24d3b)
2018-10-10 16:49:02 +03:00
Howard Pritchard
d18ea98263
Merge pull request #5843 from kawashima-fj/pr/v4.0.x/correct-f08-signatures
v4.0.x: fortran/use-mpi-f08: Correct f08 routine signatures
2018-10-09 10:22:07 -05:00
Ralph Castain
376e2e4d98 Add missing file to tarball
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-09 06:46:09 -07:00
Ralph Castain
40270bd24b Minor cleanups to the pmix/ext2x component
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-09 06:42:08 -07:00
Geoff Paulsen
c2e99c3f40
Merge pull request #5867 from ggouaillardet/topic/v4.0.x/hostfile_double_free
util/hostfile: fix a double free error
2018-10-09 08:00:03 -05:00
Geoff Paulsen
9be650c7b9
Merge pull request #5862 from hjelmn/v4.0.x_vader_fix_for_real_this_time
btl/vader: fix race condition in writing header
2018-10-09 07:59:18 -05:00
Ralph Castain
226aee42fd Ignore --with-foo=external arguments in subdirs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 08109acf8cf1e3d5a268da0b73210910fd738cfe)
2018-10-09 03:46:54 -07:00
Jeff Squyres
71b828eb9e opal_config_subdir_args.m4: fix typo
A typo inadvertantly crept in to e836dbd506.  Add the extra '-' to
make it correctly search for --with-*=internal.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7675956b8fd739b150d3bdd14c265fd728786201)
2018-10-09 03:46:37 -07:00
Ralph Castain
12790e8ec6 Protect PMIx from bad configure entry
Ignore with-hwloc=internal or external as those are meaningless to pmix
(will upstream)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit c498a7e77a377ddc3a7bcc26ea072627a33cb470)
2018-10-09 03:45:58 -07:00
Ralph Castain
3e2cc6f46a Fail configure if pmix won't build
If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit f379ba9c8e5ce17641937c351ab46e4b4a82446c)
2018-10-09 03:45:37 -07:00
Ralph Castain
4aa11ec763 Strip --with-foo=internal from opal_subdir_args
Our components that have a --with-foo configure option won't know what
to do with a value of "internal". This scenario only occurs with hwloc
and libevent, both of which are statically contained in libopen-pal

Thanks to @jsquyres for the diff

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit e836dbd506502c797f5bff2f9761510fad4858cd)
2018-10-09 03:41:48 -07:00
Gilles Gouaillardet
2e2366d193 util/hostfile: fix a double free error
As reported at https://stackoverflow.com/questions/52707242/mpirun-segmentation-fault-whenever-i-use-a-hostfile
mpirun crashes when the hostfile contains a "user@host" line.
The root cause is username was not strdup'ed and free'd twice by opal_argv_free() and free()

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@5803385d44)
2018-10-09 13:06:55 +09:00
Geoff Paulsen
212419290e
Merge pull request #5859 from amaslenn/mlnx-no-verbs-v4
platform/mellanox: disable openib/verbs — v4.0.x
2018-10-08 14:13:12 -05:00
Nathan Hjelm
fba5eda436 btl/vader: fix race condition in writing header
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
(cherry picked from commit 8291f6722d890efd15333bf7b26f0d07952fa41e)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-08 08:49:01 -06:00
Andrey Maslennikov
7a930039fb platform/mellanox: disable openib/verbs
Signed-off-by: Andrey Maslennikov <andreyma@mellanox.com>
(cherry picked from commit 7180ab144a52136f8c0ec0c63a61b1a31dfed023)
2018-10-08 15:36:56 +03:00
Geoff Paulsen
499ddedd7c
Merge pull request #5844 from kawashima-fj/pr/v4.0.x/pcollreq-f08-signatures
v4.0.x: mpiext/pcollreq: Correct f08 routine signatures
2018-10-05 13:42:35 -05:00
Geoff Paulsen
ab7cf1095d
Merge pull request #5845 from kawashima-fj/pr/v4.0.x/pcollreq-man
v4.0.x: mpiext/pcollreq: Add Fortran bindings in man
2018-10-05 13:40:03 -05:00
KAWASHIMA Takahiro
4dd21111f0 mpiext/pcollreq: Add Fortran bindings in man
Fortran bindings were added to persistent collectives in 9e0115c980
but man was not updated.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit 43d85dbc815bc5fd02c12810fb7c9f2ddd294f12)
2018-10-05 09:43:39 +09:00
KAWASHIMA Takahiro
092cf1937d man: Correct markup of MPI_Neighbor_allgather
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit 994b34525399e2524adeef6933e0bc3f673caf5f)
2018-10-05 09:43:39 +09:00
KAWASHIMA Takahiro
080c52f906 mpiext/pcollreq: Add missing f08 asynchronous
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit be91a26fd8b5c0fb38a0a5bbaeb32e7d4bafeafc)
2018-10-05 09:33:17 +09:00
KAWASHIMA Takahiro
fcc698f27f mpiext/pcollreq: Correct f08 routine signatures
Changes of nonblocking collectives in e98d794e8b and f750c6932c
are applied to persistent collectives.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit 357531847e819bf8c33224e0cc9004ed47b42445)
2018-10-05 09:33:16 +09:00
KAWASHIMA Takahiro
b9316d3136 fortran/use-mpi-f08: Correct f08 routine signatures
Following the commit f750c6932c, I compared
`ompi/mpi/fortran/use-mpi-f08/*.F90` and
`ompi/mpi/fortran/use-mpi-f08/profile/p*.F90`, and
`ompi/mpi/fortran/use-mpi-f08/mod/mpi-f08-interfaces.F90` and
`ompi/mpi/fortran/use-mpi-f08/mod/pmpi-f08-interfaces.F90`.

There are many differences. Some are bugs of `MPI_*`, some are
bugs of `PMPI_*`. I'm not sure how these bugs affect applications.

To make it easy to compare these files future, I also removed
editorial differences.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit cf6d28cb66981cf315f84e5efb8f8256b6c6a3a4)
2018-10-05 09:04:17 +09:00
Geoff Paulsen
c0796664b1
Merge pull request #5780 from jsquyres/pr/v4.0.x/moar-fortran-fixes
v4.0.x: Fortran 08 bindings fixes
2018-10-04 16:08:30 -05:00
Geoff Paulsen
704349746d
Merge pull request #5835 from gpaulsen/topic/v4.0.x/rc4
Updating VERSION to v4.0.0rc4
2018-10-04 15:20:33 -05:00
Geoff Paulsen
4e2d26e08b
Merge pull request #5834 from jsquyres/pr/v4.0.x/2-more-vader-fixes
v4.0.x: 2 more vader fixes
2018-10-03 14:09:45 -05:00
Geoffrey Paulsen
ed2bd82075 Updating VERSION to v4.0.0rc4
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2018-10-03 11:49:32 -05:00
Nathan Hjelm
fa768d748f btl/vader: work around Oracle compiler bug
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:

_Atomic intptr x, y;
intptr_t z = 0;

x = y = z;

Will produce a compiler error of the form:

operand cannot have void type: op "="
assignment type mismatch:
	long "=" void

To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.

Fixes #5814

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit dfa8d3a81ac64e32d2bfb13a9afb20b83d747e03)
2018-10-03 11:36:17 -04:00
Nathan Hjelm
df6dd69db8 btl/vader: ensure the fast box tag is always read first
On some platfoms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 66a7dc4c72cb25df67e7f872bee7a20b5fa9c763)
2018-10-03 11:36:17 -04:00
Geoff Paulsen
5cae0ec25b
Merge pull request #5794 from bwbarrett/v4.0.x-ofi-mtl-selection
mtl ofi: Change from opt-in to opt-out provider selection
2018-10-03 08:31:07 -05:00
Geoff Paulsen
1844d87e4c
Merge pull request #5802 from jsquyres/pr/v4.0.x/misc-updates
mpi.h: remove MPI_UB/MPI_LB when not enabling MPI-1 compat
2018-10-03 08:30:46 -05:00
Geoff Paulsen
593d652077
Merge pull request #5823 from jsquyres/pr/v4.0.x/fix-tcp-btl-show-help-ip-address
v4.0.x: btl/tcp: output the IP address correctly
2018-10-03 08:29:37 -05:00
Geoff Paulsen
9d1a6db1a0
Merge pull request #5826 from jsquyres/pr/v4.0.x/tcp-btl-socklen-fix
v4.0.x: TCP BTL socklen fix
2018-10-02 16:14:13 -05:00
George Bosilca
b63bee5da4 Small pedantic fixes.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a3a492b42cd7b114b435fafa7cd46222dc565dd1)
2018-10-02 14:37:31 -04:00
George Bosilca
d450e460d6 Provide the correct socklen to bind.
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.

Add a more clear error message in debug mode.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 9164e26e2f323c43c9a671cb510bb4df03e45628)
2018-10-02 14:37:31 -04:00
Jeff Squyres
81f2f19398 btl/tcp: output the IP address correctly
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.

1. On the endpoint, it is storing the moral equivalent of a
   (struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).

The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr).  Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).

This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.

NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module.  This only happens with interfaces that have
more than one IP address.  This commit does not fix that issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 5dae086f7e4aee28fbb5a7282a2661286a5f68fe)
2018-10-02 10:49:51 -04:00
Geoff Paulsen
1ae1f7d3c6
Merge pull request #5790 from yosefe/topic/shmem-lock-progress-v4.0.x
shmem/lock: progress communications while waiting for shmem_lock
2018-10-01 09:13:19 -05:00
Geoff Paulsen
e6b8738132
Merge pull request #5806 from gpaulsen/topic/v4.0.x/NEWS/mtl_ofi
Topic/v4.0.x/news/mtl ofi
2018-10-01 09:06:24 -05:00
Geoff Paulsen
a4666dc008
Merge pull request #5805 from gpaulsen/topic/v4.0.x/NEWS/vers
Topic/v4.0.x/news/vers
2018-10-01 09:05:40 -05:00
Geoffrey Paulsen
0f984be381 NEWS: PR5794 - change MTL OFI selection
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2018-09-28 16:54:04 -05:00
Geoffrey Paulsen
8138b5b04a NEWS: updated versions of included hwloc and PMIx
Updated versions of included hwloc and PMIx to match
   corresponding VERSION files.

   Updated the spelling of "Open SHMEM" to "OpenSHMEM".

Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2018-09-28 16:47:26 -05:00
Geoff Paulsen
a7e275cf3e
Merge pull request #5771 from jsquyres/pr/v4.0.x/readme-update-configure-cli-with-options
v4.0.x: README: Add note about --with-foo and RPATH
2018-09-28 16:20:47 -05:00
Jeff Squyres
46dd266e45 mpi.h: remove MPI_UB/MPI_LB when not enabling MPI-1 compat
When --enable-mpi1-compatibility was specified, the ompi_mpi_ub/lb
symbols were #if'ed out of mpi.h.  But the #defines for MPI_UB/LB
still remained.  This commit also #if's out the MPI_UB/LB macros when
--enable-mpi1-compatibility is specified.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7223334d4dc1225d49cd2c63714870c3a04ad953)
2018-09-28 10:01:48 -07:00
Brian Barrett
10d0a430c4 mtl ofi: Change from opt-in to opt-out provider selection
Change default provider selection logic for the OFI MTL.  The
old logic was whitelist-only, so any new HPC NIC provider would
have to ask users to do extra work or wait for an OMPI release
to be whitelisted.  The reason for the logic was to avoid
selecting a "generic" provider like sockets or shm that would
frequently have worse performance than the optimized BTL options
Open MPI supports.

With the change, we blacklist the (small, relatively static) list
of providers that duplicate internal capabilities.  Users can use
one of thse blacklisted providers in two ways: first, they can
explicitly request the provider in the include list (which will
override the default exclude list) and second, the can set a new
empty exclude list.

Since most HPC networks require special libraries and therefore
an explicit build of libfabric, it is highly unlikely that this
change will cause users to use libfabric when they didn't want to
do so.  It does, however, solve the whitelisting problem.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit c5eaa38491c7197f7dbc74c299ade18e09bf5f64)
2018-09-27 18:41:47 +00:00