1
1

29252 Коммитов

Автор SHA1 Сообщение Дата
Yossi Itigin
8a329a797c SCOLL/BASIC: Fix invalid pSync pointer passed to barrier func
mca_scoll_basic_alltoall() passed (pSync + 1) to barrier function, but
the value of _SHMEM_ALLTOALL_SYNC_SIZE is 1, which made the barrier
function use an invalid memory location. In particular, this location
was not initialized to _SHMEM_SYNC_VALUE, which broke the barrier
algorithm and it did not complete: One PE could read 0 from its peer and
assume the peer already started the barrier, and then write 1 to the
peer. Then, the peer entered the barrier and overwrote the 1 with 0, and
then it waited forever to see '1' in its pSync.

Found with shmem_verifier test suite.

(picked from master 6754bf1)

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-31 12:22:19 +02:00
Geoff Paulsen
cd1e927e1b
Merge pull request #5995 from gpaulsen/topic/v4.0.x/fix_pgi
COMMON/UCX: added error code to log output
2018-10-30 15:11:37 -05:00
Ralph Castain
ba6ad9fe42 Remove stale defunct tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 05ac8fa71c0833eeeaa878b72a31503d361e145e)
2018-10-30 08:51:25 -07:00
Sergey Oblomov
0846c9d112 COMMON/UCX: added error code to log output
Also fixes a PGI compilation error with --enable-debug.

Signed-off-by: Geoff Paulsen <gpaulsen@users.noreply.github.com>
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 1099d5f02327329e0c58d9403e3e0a7f1e1d1920)
2018-10-30 09:55:25 -05:00
Ralph Castain
712ddd326f Remove the stale orte-dvm code
Users should migrate to https://github.com/pmix/prrte

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1bd772e8ebf66f705537b9a6e1af2b6093ef8471)
2018-10-30 07:54:35 -07:00
Geoff Paulsen
660743cd3a
Merge pull request #5988 from hppritcha/topic/libevent_configure_4.0.x
Topic/libevent configure 4.0.x
2018-10-30 07:49:46 -05:00
Geoff Paulsen
21d2597afd
Merge pull request #5974 from rhc54/cmr4/mpir
v4.0: Provide deprecation warning of MPIR debugger
2018-10-29 16:29:15 -05:00
Gilles Gouaillardet
7f36dce348 event/external: fix version requirement
Only default to the external component if its version is
greater or equal than the internal libevent (2.0.22)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit b2050392051ed3d9d842f105326f7ea2223aafc4)
2018-10-29 15:20:48 -06:00
Gilles Gouaillardet
40a1f0c214 event/external: misc configury fixes
- Always use the external component when configure'd with --with-libevent=external
 - Fix the external libevent library version detection
   by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros
 - Use the event2/event.h header (event.h is deprecated since libevent 2.0

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 35e77a286c369c1be66ee7ef9ad5ec2faef47edb)
2018-10-29 15:20:32 -06:00
Geoff Paulsen
d0968f516a
Merge pull request #5963 from amaslenn/mlnx-no-uct-v4
platform/mellanox: disable btl-uct by default — v4
2018-10-26 13:48:15 -05:00
Howard Pritchard
32e8592c5c
Merge pull request #5966 from hjelmn/v4.x_btl_uct_adjust_for_changes_in_the_openucx_uct_interface_that_break_things
btl/uct: update for UCT_CB_FLAG_SYNC removal
2018-10-26 12:39:35 -06:00
Ralph Castain
8aae7943ab Provide deprecation warning of MPIR debugger
If we detect that we are being debugged by an MPIR-based debugger, then
print a warning that OMPI's MPIR support has been deprecated and will be
removed in a subsequent release.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 2cb271716beae08b5e5e20f30a5a2fe3e5c50c5e)
2018-10-25 08:49:00 -07:00
Nathan Hjelm
270ae73611 btl/uct: update for UCT_CB_FLAG_SYNC removal
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
(cherry picked from commit 1b37328ba8e8cbd30a5dcdae8332dc879aea48d7)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-23 10:18:43 -06:00
Andrey Maslennikov
79abcc2e93 platform/mellanox: disable btl-uct by default
(cherry picked from commit 074e9cc92c79e6b9f8c492b2613eac3f27e1c522)
Signed-off-by: Andrey Maslennikov <andreyma@mellanox.com>
2018-10-23 10:33:35 +03:00
Howard Pritchard
e8f28a5506
Merge pull request #5882 from hjelmn/btl_uct_updates_for_the_v4_branch
btl/uct: bug fixes and general improvements
2018-10-22 13:12:34 -06:00
Howard Pritchard
f9d2f3b912
Merge pull request #5941 from hppritcha/topic/remove_bfo_pml_v4.0.x
v4.0.x: remove the bfo pml
2018-10-22 09:50:05 -06:00
Howard Pritchard
837f7eb1dd
Merge pull request #5942 from hppritcha/topic/new_for_4.0.0rc5
NEWS: updates for v4.0.0rc5
2018-10-22 09:49:35 -06:00
Howard Pritchard
6c18cb179d
Merge pull request #5945 from hppritcha/topic/remove_crs_for_v4.0.0
v4.0.x: remove some dead crs components
2018-10-18 06:55:44 -06:00
Howard Pritchard
210b4c60aa remove some dead crs components
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 6564d3d217c3ebff24d0e1fd72929756dc498dfe)
2018-10-17 16:16:36 -06:00
Howard Pritchard
48e12cf766 NEWS: updates for v4.0.0rc5
[skip ci]

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-10-17 14:36:27 -06:00
Howard Pritchard
a806d09450 remove the bfo pml
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 7d6774acf89558c05f415c96c00502429e26e502)
2018-10-17 14:00:11 -06:00
Geoff Paulsen
b8e040c704
Merge pull request #5904 from gpaulsen/topic/v4.0.0rc5
Reving to v4.0.0rc5
2018-10-17 12:59:13 -07:00
Edgar Gabriel
278ecf2205 io/ompio: add verification for data representations.
check for providing a data representation that is actually supported
by ompio.

Add also one check for a non-NULL pointer in mpi/c/file_set_view
for the data representation.

Also fixes parts of issue #5643

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-10-17 11:22:48 -05:00
Edgar Gabriel
a07c9e96b1 io/ompio: execute barrier before sync
this ensures that all processes are done modifying a file
before syncing. Fixes an error in the testmpio testsuite.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-10-17 11:22:35 -05:00
Edgar Gabriel
96c1a5b9dc common/ompio: check datatypes when setting file view
return MPI_ERR_ARG if the size of the fileview is not a
multiple of the size of the etype provided.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-10-17 11:22:19 -05:00
Edgar Gabriel
425a71799e common/ompio: return correct error code for improper access
return MPI_ERR_ACCESS if the user tries to read from  a file
that was opened using MPI_MODE_WRONLY

return MPI_ERR_READ_ONLY if the user tries to write a file
that was opened using MPI_MODE_RDONLY

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-10-17 11:22:04 -05:00
Edgar Gabriel
c65dda6f5f io/ompio: fix seek position calculation for SEEK_CUR
This commit fixes the calculation of the position where to
seek to, in case SEEK_CUR is used.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-10-17 11:21:47 -05:00
Howard Pritchard
e74443a8b8
Merge pull request #5929 from hjelmn/v4.0.x_opal_free_list_really_old_race_we_should_really_fix_now
opal/free_list: fix race condition
2018-10-17 10:08:11 -06:00
Nathan Hjelm
e6f84e79de btl/uct: fix deadlock in connection code
This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing an incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.

At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.

This commit also fixes some bugs in the active-message send path:

 - Correctly set all fragment fields in prepare_src.

 - Fix bug when using buffered-send. We were not reading the return
   code correctly (which is in bytes). This resulted in a message
   getting sent multiple times.

 - Don't try to progress sends from the btl_send function when in an
   active-message callback. It could lead to deep recursion and an
   eventual crash if we get a trace like
   send->progress->am_complete->ob1_callback->send->am_complete...

Closes #5820
Closes #5821

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 707d35deeb62a93ea8a3806d07e07e3a96c51d19)
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-16 19:16:11 -06:00
Howard Pritchard
cd7d70156c
Merge pull request #5899 from jsquyres/pr/v4.0.x/fix-c99-comments-in-mpih
v4.0.x: mpi.h.in: remove C99-style comments
2018-10-16 16:33:08 -06:00
Howard Pritchard
d2fb9949a5
Merge pull request #5785 from jsquyres/pr/v4.0.x/more-compiler-warnings-fixes
v4.0.x: Squash a bunch of harmless compiler warnings.
2018-10-16 16:30:21 -06:00
Howard Pritchard
2752d43f65
Merge pull request #5875 from kawashima-fj/pr/v4.0.x/javadoc-tag
v4.0.x: java: Fix javadoc build failure with OpenJDK 11
2018-10-16 16:29:34 -06:00
Howard Pritchard
7ceb508b93
Merge pull request #5889 from yosefe/topic/pml-ucx-fix-datatype-leak-v4.0.x
pml_ucx: add ompi datatype attribute to release ucp_datatype - v4.0.x
2018-10-16 16:29:01 -06:00
Nathan Hjelm
eaa98af52c opal/free_list: fix race condition
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:

```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```

The problem is in a multithreaded environment is *is* possible for the
free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. The happens if
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic other threads pop all the items
added to the free list.

This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.

Fixes #2921

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 5c770a7becc496f63b9f9a59151206236416f4f4)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:28:20 -06:00
Ralph Castain
05e0545581 Ensure SIGCHLD is unblocked
Thanks to @hjelmn for debugging it and providing the patch

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)
(cherry picked from commit 647a760b7e24194b37571a8245d8d39ed202e75b)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:21:18 -06:00
Howard Pritchard
e2cf1e3ec5
Merge pull request #5887 from yosefe/topic/osc-ucx-fix-finalize-hang-v4.0.x
osc_ucx: fix hang/timeout in component finalize - v4.0
2018-10-16 09:21:50 -06:00
Howard Pritchard
753087ab17
Merge pull request #5888 from hoopoepg/topic/fixed-zero-size-window-v4.0
OSC/UCX: fixed zero-size window processing - v4.0.x
2018-10-16 09:21:08 -06:00
Howard Pritchard
e20284ac4d
Merge pull request #5908 from ggouaillardet/topic/v4.0.x/mpi_sizeof_misc_additions
fortran: add CHARACTER and LOGICAL support to MPI_Sizeof()
2018-10-16 09:16:00 -06:00
Gilles Gouaillardet
2b5a7ca816 fortran: add CHARACTER and LOGICAL support to MPI_Sizeof()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@e4001040b4)
2018-10-12 14:10:45 +09:00
Geoffrey Paulsen
d936752c17 Reving to v4.0.0rc5
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2018-10-11 16:35:08 -05:00
Nathan Hjelm
0c4ba45af2 btl/uct: use the correct tl interface attributes
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 6ed68da870c391d88575dc027a3de4826a77f57e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-11 11:34:33 -06:00
Jeff Squyres
600967d2ed mpi.h.in: remove C99-style comments
While we require C99 to build Open MPI, we do not require C99 to build
user MPI applications.  As such, we shouldn't have C99-style comments
(i.e., "//"-style) in mpi.h.in.

Thanks to @AdamSimpson for reporting the issue.

This commit simply converts a //-style comment to a /**/-style
comment.  No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f4b3ccabf726eaec6d39cbcff809882da55ae1e5)
2018-10-11 11:54:30 -04:00
Yossi Itigin
eabc94cab0 osc_ucx: add worker flush before osc module free
Make sure all pending communications are done on all ranks before
closing the window. This way it will be safe to close the endpoints when
closing the component.

(picked from master b8e1af6)

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 23:02:19 +03:00
Yossi Itigin
4a97d6b9fa pml_ucx: fix return code from mca_pml_ucx_init()
(picked from master 40ac9e4)

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 20:23:49 +03:00
Yossi Itigin
1bffd196ef pml_ucx: add ompi datatype attribute to release ucp_datatype
(picked from master 4763822)

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-10-10 20:23:26 +03:00
Geoff Paulsen
c8ff7e3ef2
Merge pull request #5874 from rhc54/cmr40/config
v4.0.0: Fix configury for internal PMIx
2018-10-10 10:47:26 -05:00
Sergey Oblomov
274cbc3c03 OSC/UCX: fixed zero-size window processing
- added processing of zero-size MPI window

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit ae6f81983fe354de812ebe2532120fb20ae24d3b)
2018-10-10 16:49:02 +03:00
Nathan Hjelm
1153082a0f btl/uct: bug fixes and general improvements
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the
priority list. This will give better out of the box performance for
multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.

This commit also fixes a number of leaks and a possible deadlock when
using RDMA.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 39be6ec15c202d31423476f09e70199453d25adc)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-09 16:16:50 -06:00
Howard Pritchard
d18ea98263
Merge pull request #5843 from kawashima-fj/pr/v4.0.x/correct-f08-signatures
v4.0.x: fortran/use-mpi-f08: Correct f08 routine signatures
2018-10-09 10:22:07 -05:00
Ralph Castain
376e2e4d98 Add missing file to tarball
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-09 06:46:09 -07:00