29110 Commits

Author SHA1 Message Date
Howard Pritchard
7b6a2da71a
Merge pull request #5504 from rhc54/cmr40/ofi
MTL OFI: send/isend split into blocking/non-blocking paths
2018-08-13 14:18:05 -06:00
Jeff Squyres
72e5766a56 hwloc201/configure.m4: make it safe when used with hwloc:external
The Autoconf AC_CONFIG_* macros can only be instantiated exactly once
for any given file, *and* they must be in a code execution path at run
time for the target file to be generated at the end of configure.

For example, if you want to generate file ABC at the end of configure,
you must invoke the AC_CONFIG_FILES(ABC) macro in a code path that
will get executed when configure is run.

That's pretty straightforward.

What's not straightforward is two corner cases:

1. You cannot invoke the AC_CONFIG_FILES(ABC) macro for the same file
   more than once.  If you do, autoreconf will fail (even before you
   can run configure).
2. If AC_CONFIG_FILES(ABC) is not in a code path that is executed by
   configure, the file ABC is not registered properly, and ABC will
   not be generated at the end of configure.

This applies to hwloc because hwloc's HWLOC_SETUP_CORE macro calls
both AC_CONFIG_FILES and AC_CONFIG_HEADER to set up its Makefiles
(etc.) so that targets like "make distclean" and "make distcheck" will
work properly.  Hence, we *have* to invoke HWLOC_SETUP_CORE.

However, the MCA_opal_hwloc_hwloc201_CONFIG macro has a few side
effects.  It would be nice to be able to do something like this:

```
    if hwloc:external is going to be used:
        Invoke minimal HWLOC_SETUP_CORE (with no side effects)
    else
        Invoke full HWLOC_SETUP_CORE (with side effects)
    fi
```

But we can't, because autoreconf will detect that AC_CONFIG_FILES has
been invoked on the same files more than once (regardless of whether
those code paths will be executed at run time or not).  Kaboom.

Similarly, we can't do this:

```
    if hwloc:external is not going to be used:
        Invoke full HWLOC_SETUP_CORE (with side effects)
    fi
```

Because then hwloc's AC_CONFIG_FILES won't be registered properly when
hwloc:external *is* used (i.e., when the HWLOC_SETUP_CORE macro is not
in a code path that is executed at run time), and targets like "make
distclean" will fail because hwloc's Makefiles won't have been set up.
Kaboom.

But remember that the hwloc framework is a bit special: there will
only ever be 2 components: external and internal.  External is
guaranteed to be configured first because of its priority.  So the
internal component (i.e., this component) immediately knows if it is
going to be used or not based on whether the external component
configuration succeeded or failed.

Specifically: regardless of whether the internal component (i.e., this
component) is going to be used, we have to invoke HWLOC_SETUP_CORE.
But we can manage the side effects: allow the side effects when
this/internal component is going to be used, and avoid the side
effects when this/internal component is not going to be used.

This is a little less clean than I would have liked, but because of
Autoconf's oddity about its AC_CONFIG_* macros, this is the only
solution I could come up with.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 01e4570af759b113b965b63df7bfc72a78d69654)
2018-08-11 12:00:35 -07:00
Jeff Squyres
714f203985 libevent2022/configure.m4: always invoke sub-configure
In order to make "make distclean" (and friends) work, we need to
*always* invoke the embedded configure script -- even if we know that
we're not going to use this component.

But in cases where we know we're not going to use this component, we
also need to avoid the side effects of the code path that is used when
we *do* want to use this component.  So split the two possibilities
into two different macros:

1. MCA_opal_event_libevent2022_FAKE_CONFIG: which does almost nothing
   except invoke the underlying "configure" script.
2. MCA_opal_event_libevent2022_REAL_CONFIG: which does all the real
   work (including invoking the underlying "configure" script).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 69aa46e1676c00bb41e54b95bbcf1df3b00dc9c1)
2018-08-11 12:00:35 -07:00
Jeff Squyres
5c5246f655 libevent2022/configure.m4: trivial cleanup
Put argument to AM_CONDITIONAL inside [].  No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 80df3f040be2b26162df9cc73a0cfd9e3d11732d)
2018-08-11 12:00:35 -07:00
Jeff Squyres
63d68ded48 libevent2022/configure.m4: minor comment cleanup
Change # -> dnl.  No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 17aa64e43825dd27148fc47d3f8f5694fc091d38)
2018-08-11 12:00:35 -07:00
Jeff Squyres
6cb3d61dd1 libevent2022: only configure if event:external fails
We know that event:external will be configured first (because of its
priority).  Take advantage of that here in libevent2022 by having it
refuse to configure / politely fail if event:external succeeded.

Also print out some additional lines in configure output indicating
what is going on (i.e., event:external succeeded, so this component
will be skipped, or event:external failed, so this component will be
used).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit b063cb6b0f251052dc72d2496e5745ce83d6b869)
2018-08-09 06:47:08 -07:00
Jeff Squyres
eca16720de hwloc201: only configure if hwloc:external fails
We know that hwloc:external will be configured first (because of its
priority).  Take advantage of that here in hwloc201 by having it
refuse to configure / politely fail if hwloc:external succeeded.

Also print out some additional lines in configure output indicating
what is going on (i.e., hwloc:external succeeded, so this component
will be skipped, or hwloc:external failed, so this component will be
used).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 4e5f432786d1ab99304e40dadec40ac45e93e76f)
2018-08-09 06:47:08 -07:00
Todd Kordenbrock
36369f9133 coll-portals4: retry PtlMEUnlink() if PTL_IN_USE
In the cleanup phase, it is possible for PtlMEUnlink() to return
PTL_IN_USE if the NIC is not done with the ME.  This should not
be considered an error.  This commit adds a retry loop around
PtlMEUnlink().

In some cases, the return value of PtlMEUnlink() and PtlCTFree()
was not checked at all.  Check them with the same retry loop as
above.
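
A minimal sketch of such a retry loop follows. It does not use the real Portals4 header: `fake_PtlMEUnlink()`, the numeric values of the return codes, and the `busy_polls` knob are hypothetical stand-ins so the example is self-contained.

```c
/* Hypothetical stand-ins for the Portals4 return codes; the real
 * definitions (and their values) come from <portals4.h>. */
enum { PTL_OK = 0, PTL_IN_USE = 1, PTL_ARG_INVALID = 2 };

/* Test knob: pretend the NIC still owns the ME for 3 more calls. */
static int busy_polls = 3;

/* Hypothetical replacement for PtlMEUnlink(). */
static int fake_PtlMEUnlink(int me_handle)
{
    (void)me_handle;
    return (busy_polls-- > 0) ? PTL_IN_USE : PTL_OK;
}

/* Retry while the ME is still in use; any other return code is
 * handed back to the caller unchanged. */
static int unlink_with_retry(int me_handle)
{
    int ret;
    do {
        ret = fake_PtlMEUnlink(me_handle);
    } while (PTL_IN_USE == ret);
    return ret;
}
```

Only `PTL_IN_USE` is treated as "try again"; a genuine error such as `PTL_ARG_INVALID` would leave the loop immediately.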

Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
(cherry picked from commit f3f2a826b40cc0d4a45a63614835162ec6eef78e)
2018-08-07 11:23:51 -05:00
Howard Pritchard
9a6f6e61f0
Merge pull request #5499 from nrspruit/ns_cancel_fix_4.0
MTL OFI: Fix Deadlock in fi_cancel given completion during cancel
2018-08-07 09:16:56 -06:00
Howard Pritchard
2386994c9d
Merge pull request #5495 from hoopoepg/topic/ucx-init-c99-v4.0
PML/SPML/UCX: init global objects using C99 style - v4.0
2018-08-04 16:03:56 -06:00
Spruit, Neil R
1fbbae1907 MTL OFI: send/isend split into blocking/non-blocking paths
-Updated blocking send to directly call functionality and
set completion events expected to 0 initially. This allows for optimization for
providers that support fi_tinject up to larger sizes. This also reduces
latency on running the OFI mtl with smaller sizes without requiring
calls to progress given fi_tinject is required to complete the messaging
before returning and will not create any events in the Completion Queue.

-Updated non-blocking send to directly call fi_tsend and avoid calling
fi_tinject as the functionality should not wait on completions. This
resolves a bug where applications calling MPI_Isend can overrun the
TX buffer with small (inject) messages causing a deadlock. In addition
this improves performance in message rates by preventing
waiting on any size message to complete in non-blocking send messages.

-Created common ompi_mtl_ofi_ssend_recv function to post the ssend recv
which is common between isend and send code paths.

Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 7dc8c8ba3fa630df8c5c7ab36fcf25249a82bfe7)
2018-08-01 06:45:48 -07:00
Ralph Castain
7830d9971e
Merge pull request #5467 from rhc54/cmr40/ofi
MTL OFI: MTL_OFI_RETRY_UNTIL_DONE support for Resource overflow
2018-07-31 13:08:03 -07:00
Mark Allen
e2b6e9ee09 apply romio314 patch to romio321
When romio314 was first pulled in an extra patch was applied to it, see commit
92f6c7c1e210c559471a05aaac9b19e0bd3d71bb. Most of that patch is already present
in vanilla romio321, but the fix for MPIO_DATATYPE_ISCOMMITTED() isn't.

If that macro doesn't set err_ then some paths end up with a variable being used
uninitialized. In particular you can trace through romio321/romio/mpi-io/read.c
to see what happens with error_code. It's an uninitialized stack variable that goes
through three MPIO_CHECK_* macros none of which set it. The macros consistently set
error_code to a failure if they see something wrong, but they don't consistently
set it to success when things are fine.

And then in the last macro MPIO_CHECK_DATATYPE it tries to look at the value
of error_code that was never set.
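
The failure mode can be sketched as below. The macro and constant names are hypothetical simplifications, not the actual ROMIO definitions; the point is only the difference between a check that writes `err_` on failure alone and one that always writes it.

```c
/* Hypothetical error codes for the sketch. */
#define SKETCH_SUCCESS  0
#define SKETCH_ERR_TYPE 3

/* Buggy shape: err_ is written only when the check fails, so an
 * uninitialized variable stays uninitialized on the success path. */
#define CHECK_COMMITTED_BUGGY(committed_, err_) \
    do { if (!(committed_)) { (err_) = SKETCH_ERR_TYPE; } } while (0)

/* Fixed shape: err_ is written on both paths. */
#define CHECK_COMMITTED_FIXED(committed_, err_) \
    do { (err_) = (committed_) ? SKETCH_SUCCESS : SKETCH_ERR_TYPE; } while (0)

static int read_sketch(int committed)
{
    int error_code;   /* deliberately not initialized, as in the report */

    CHECK_COMMITTED_FIXED(committed, error_code);

    /* A later check may now safely inspect error_code. */
    if (SKETCH_SUCCESS != error_code) {
        return error_code;
    }
    return SKETCH_SUCCESS;
}
```

Swapping in `CHECK_COMMITTED_BUGGY` would make the `if` read an indeterminate value whenever the datatype is committed.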

Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit f413ef6b142fc3498bdb575d270f9afd0a840863)
2018-07-30 17:23:59 -04:00
Spruit, Neil R
9cc6bc1ea6 MTL OFI: Fix Deadlock in fi_cancel given completion during cancel
- If a message for a recv that is being cancelled gets completed after
the call to fi_cancel, then the OFI mtl will enter a deadlock state
waiting for ofi_req->super.ompi_req->req_status._cancelled which will
never happen since the recv was successfully finished.

- To resolve this issue, the OFI mtl now checks ofi_req->req_started
to see if the request has been started within the loop waiting for the
event to be cancelled. If the request is being completed, then the loop
is broken and fi_cancel exits setting
ofi_req->super.ompi_req->req_status._cancelled = false;
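
The loop described above might look roughly like this. The request struct, the `poll_sketch()` helper, and the two countdown fields are hypothetical and exist only to make the race reproducible; they are not the OFI MTL's real data structures.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down request. */
struct req_sketch {
    bool req_started;   /* a matching message has begun completing */
    bool cancelled;     /* the cancel event was delivered */
    int  start_after;   /* test knob: polls until req_started, -1 = never */
    int  cancel_after;  /* test knob: polls until cancelled, -1 = never */
};

/* Hypothetical progress step that flips the flags on schedule. */
static void poll_sketch(struct req_sketch *r)
{
    if (r->start_after >= 0 && r->start_after-- == 0) {
        r->req_started = true;
    }
    if (r->cancel_after >= 0 && r->cancel_after-- == 0) {
        r->cancelled = true;
    }
}

/* Wait for the cancel event, but stop waiting once the receive has
 * started completing, because the event will then never arrive. */
static bool cancel_sketch(struct req_sketch *r)
{
    while (!r->cancelled) {
        poll_sketch(r);
        if (r->req_started) {
            r->cancelled = false;
            break;
        }
    }
    return r->cancelled;
}
```

Without the `req_started` check, the first scenario in which the message arrives during the cancel would spin forever.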

Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 767135c580f75d3dde9cb9c88601dd18afda949a)
2018-07-30 07:17:40 -07:00
Sergey Oblomov
b64502977a PML/SPML/UCX: init global objects using C99 style
- use C99-style object initialization to avoid mixing up field values
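
For illustration, C99 designated initializers bind each value to a field name rather than a position. The struct below is hypothetical and far smaller than the real UCX module objects.

```c
/* Hypothetical module struct for the sketch. */
struct module_sketch {
    int         priority;
    int         num_disconnect;
    const char *name;
};

/* Positional style depends on field order and breaks silently if the
 * struct is reordered:
 *     struct module_sketch m = { 10, 1, "ucx" };
 */

/* C99 style: order does not matter, and fields that are not named
 * are zero-initialized. */
static struct module_sketch module = {
    .name     = "ucx",
    .priority = 10,
};
```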

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 2806504290ff2fdd10f254f878e92cae6e90c854)
2018-07-28 16:47:43 +03:00
Howard Pritchard
5704d4fab5
Merge pull request #5474 from mkurnosov/coll-base-allgather-fix-mpi-in-place
coll/base/allgather: fix MPI_IN_PLACE processing
2018-07-26 12:32:45 -06:00
Ralph Castain
0cdf49ed8a Fix typo
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit f7a537cf040d7192f7a86cfe24de14cff754ca29)
2018-07-25 21:37:03 -07:00
Ralph Castain
511319c316 Fix the multiple pe/proc option
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.

Thanks to @tonyreina for the error report

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit bcdb1f45aca3f6dfab2646bfdba99775f728ca3b)
2018-07-25 19:55:28 -07:00
Ralph Castain
04054d63eb Cleanup pmix selection check
Allow for versions > 3

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 55cefedf9b8d97c056f7b027d759f32190c9c23e)
2018-07-25 19:55:18 -07:00
Mikhail Kurnosov
c540dfb18c coll-base-allgather: fix MPI_IN_PLACE processing
Calling MPI_Allgather with the sendbuf and sendtype parameters set to MPI_IN_PLACE and NULL, respectively, causes a segmentation fault.

The problem is that sendtype is used even when sendbuf value is MPI_IN_PLACE. But according to the standard, sendtype and sendcount parameters should be ignored in this case.
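
A sketch of the guard, written without MPI so that it stands alone: `IN_PLACE_SKETCH`, `struct dtype_sketch`, and `local_copy_bytes()` are hypothetical stand-ins for `MPI_IN_PLACE`, the datatype object, and the local-copy step of the collective.

```c
#include <stddef.h>

/* Hypothetical sentinel playing the role of MPI_IN_PLACE. */
static char in_place_marker;
#define IN_PLACE_SKETCH ((void *)&in_place_marker)

/* Hypothetical datatype carrying only an extent in bytes. */
struct dtype_sketch { size_t extent; };

/* Number of bytes to copy from sendbuf into the caller's own slot of
 * the receive buffer, or 0 when the data is already in place. */
static size_t local_copy_bytes(const void *sendbuf, int sendcount,
                               const struct dtype_sketch *sendtype)
{
    if (IN_PLACE_SKETCH == sendbuf) {
        /* sendcount and sendtype are ignored here; sendtype may be NULL. */
        return 0;
    }
    return (size_t)sendcount * sendtype->extent;
}
```

The in-place test has to come before any dereference of `sendtype`, which is the ordering the fix restores.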

Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
(cherry picked from commit 540c2d1)
2018-07-25 08:11:28 +07:00
Howard Pritchard
12f391d435
Merge pull request #5475 from hoopoepg/topic/atomic-init-conflict-v4.0
MCA/ATOMIC: atomic_init renamed to atomic_startup - v4.0
2018-07-24 14:55:35 -06:00
Sergey Oblomov
58b7786b70 MCA/ATOMIC: atomic_init renamed to atomic_startup
- there is a C11 naming conflict: atomic_init is a C macro,
  which causes build issues
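
The conflict can be illustrated as follows. C11 leaves it unspecified whether the generic functions in `<stdatomic.h>`, `atomic_init` among them, are macros, so a struct member with that name may be rewritten by the preprocessor. The struct and functions here are hypothetical, not the real MCA atomic module.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical module struct. A member called atomic_init could be
 * macro-expanded once <stdatomic.h> is included; atomic_startup
 * cannot clash with the C11 name. */
struct atomic_module_sketch {
    int (*atomic_startup)(void *ctx);
};

static int startup_impl(void *ctx)
{
    (void)ctx;
    return 0;
}

static struct atomic_module_sketch atomic_module = {
    .atomic_startup = startup_impl,
};

static int start_module(void)
{
    atomic_int counter;
    atomic_init(&counter, 1);   /* the C11 name remains usable */
    return atomic_module.atomic_startup(NULL) + atomic_load(&counter);
}
```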

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 3295b238008a6b2e27941c60d757e64ad796fea2)
2018-07-24 17:23:42 +03:00
Ralph Castain
2c2f9b8169 Leave opal_event_external_support exposed as global var
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

(cherry picked from commit open-mpi/ompi@5cab823979)
2018-07-24 09:53:00 +09:00
Jeff Squyres
1ea021933f event/external: prefer external event component
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit open-mpi/ompi@a70ecf5267)
2018-07-24 09:52:58 +09:00
Jeff Squyres
6f5a453492 event: trivial comment change
Switch from #-style to dnl-style.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit open-mpi/ompi@83e4a45a9f)
2018-07-24 09:52:56 +09:00
Gilles Gouaillardet
aa7a4d0f6f hwloc: prefer external hwloc component
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit open-mpi/ompi@ce2c9fffd4)
2018-07-24 09:52:53 +09:00
Nathan Hjelm
b6bd3d33f1 btl/uct: fix compile warnings/errors
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 47ed8e8830749b6b59c84592c15b7576ea164f0c)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-23 14:05:17 -06:00
Geoff Paulsen
c46600fe6d
Merge pull request #5462 from hoopoepg/topic/fixed-hang-pt2pt-over-pml-ucx-v4.0
PML/UCX: fixed ucp request free on persistent request completion - v4.0
2018-07-23 14:21:34 -05:00
Spruit, Neil R
ac8d2e01f9 MTL OFI: MTL_OFI_RETRY_UNTIL_DONE support for Resource overflow
- Added support in MTL_OFI_RETRY_UNTIL_DONE to handle -FI_EAGAIN
  from the provider and correctly attempt to progress the OFI Completion
  queue by calling ompi_mtl_ofi_progress.

- If events were pending that blocked OFI operations from being enqueued
  they will be completed and the OFI operation will be retried once
  ompi_mtl_ofi_progress has successfully completed.

- Updated MTL_OFI_RETRY_UNTIL_DONE to take a RETURN variable instead of
  requiring the existence of a "ret" variable to pass back the return
  value from completing the OFI operation.
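
A rough sketch of a retry macro with this shape is shown below. It is not the real `MTL_OFI_RETRY_UNTIL_DONE`: `EAGAIN_SKETCH` stands in for `FI_EAGAIN`, `cq_progress_sketch()` for `ompi_mtl_ofi_progress()`, and `post_sketch()` for an OFI operation; all three are hypothetical.

```c
/* Hypothetical stand-in for libfabric's FI_EAGAIN. */
#define EAGAIN_SKETCH 11

static int queue_full_for = 2;   /* test knob: attempts that fail */
static int progress_calls = 0;

/* Hypothetical stand-in for ompi_mtl_ofi_progress(). */
static int cq_progress_sketch(void)
{
    progress_calls++;
    return 0;
}

/* Hypothetical OFI operation that reports "try again" at first. */
static int post_sketch(void)
{
    return (queue_full_for-- > 0) ? -EAGAIN_SKETCH : 0;
}

/* Retry FUNC while it reports "try again", progressing the completion
 * queue between attempts. The result is stored in the variable the
 * caller names as RETURN. */
#define RETRY_UNTIL_DONE_SKETCH(FUNC, RETURN)       \
    do {                                            \
        do {                                        \
            (RETURN) = (FUNC);                      \
            if (-EAGAIN_SKETCH == (RETURN)) {       \
                cq_progress_sketch();               \
            }                                       \
        } while (-EAGAIN_SKETCH == (RETURN));       \
    } while (0)

static int send_sketch(void)
{
    int rc;   /* any name works; nothing depends on a variable called "ret" */
    RETRY_UNTIL_DONE_SKETCH(post_sketch(), rc);
    return rc;
}
```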

Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit d4f408a7f867b2f7bab84b9c966e1eba59f59e0e)
2018-07-23 11:14:42 -07:00
Howard Pritchard
4b840789e9
Merge pull request #5458 from rhc54/cmr40/px
Default to internal PMIx if newer than external
2018-07-22 14:01:35 -06:00
Sergey Oblomov
af0e7b190e PML/UCX: fixed ucp request free on persistent request completion
- in some cases the persistent request was deleted during the completion
  callback, which causes a double free of the linked UCX request (assert
  in debug build or hang in release build)
- the UCX request is now freed prior to the completion callback

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 6fe0a73861b53efac12e740c831802057391525d)
2018-07-20 22:20:14 +03:00
Ralph Castain
50e6e14020 Protect against infinite loops
Flag that we provided a notification and ignore it if it attempts to come back up.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit ea0d70bc9396def61545e2ce492a55c4c3aa7772)
2018-07-19 13:57:52 -07:00
Geoff Paulsen
c13c320651
Merge pull request #5455 from hoopoepg/topic/osc-ucx-fox-hang-v4.0
OSC/UCX: fixed hang on OSC UCX init - v4.0
2018-07-19 14:00:28 -05:00
Ralph Castain
508c3f391f Default to internal PMIx if newer than external
Per https://github.com/open-mpi/ompi/issues/5031, if the user didn't specify a particular PMIx installation, then default back to the internal version if it is newer than the discovered external one. PMIx doesn't yet provide a full signature so we have to just get as close as possible for now.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1e6aaf7f226f5a4d940e544079e3977229746c11)
2018-07-19 11:59:17 -07:00
Sergey Oblomov
74d6ad09bc OSC/UCX: fixed hang on OSC init
- worker progress was missing on startup, which caused a hang
  on one of the ranks

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit a081fba0465e0e03472fc45c9a4a7154f539e3f6)
2018-07-19 15:23:01 +03:00
Edgar Gabriel
b6b9552ca9
Merge pull request #5444 from gbossu/fix-file-delete
io/ompio: Call component-specific file_delete function instead of POSIX unlink
2018-07-18 08:45:57 -05:00
Howard Pritchard
4447738098
Merge pull request #5414 from hppritcha/topic/iwarp_only_by_default
btl/openib: only look for iwarp/roce by default
2018-07-17 20:08:00 -06:00
Howard Pritchard
bc8134dae1
Merge pull request #5448 from thananon/ofi_context
btl/ofi: Added FI_CONTEXT as requirement.
2018-07-17 19:51:13 -06:00
Howard Pritchard
6818272392 btl/openib: only look for iwarp/roce by default
Due to decreasing support by vendors/other orgs for the OpenIB BTL,
only look for iWarp/RoCE devices by default.  Allow IB HCAs
with ports configured for ethernet.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-07-17 19:11:37 -06:00
Gilles Gouaillardet
fed1e7766e
Merge pull request #5430 from ggouaillardet/pr/pcollreq-fort
mpiext/pcollreq: add Fortran bindings
2018-07-18 09:52:59 +09:00
Howard Pritchard
9ac84a92a2
Merge pull request #5449 from jladd-mlnx/topic/oshmem-bump-to-v1.4
OSHMEM Specification version: Bump to v1.4
2018-07-17 16:39:21 -06:00
Joshua Ladd
7bfcd4545d OSHMEM Specification version: Bump to v1.4
Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>
2018-07-18 00:23:24 +03:00
Joshua Ladd
3add13c72e
Merge pull request #5441 from hoopoepg/topic/ucx-memhooks-to-common-module
MCA/COMMON/UCX: shift opal memhooks into common UCX
2018-07-17 15:52:44 -04:00
Thananon Patinyasakdikul
033c364ee0 btl/ofi: Added FI_CONTEXT as requirement.
OFI BTL uses context for completion but never asks for it in
fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT
to eliminate any potential error.

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-07-17 12:18:43 -07:00
Matias Cabral
be3cb01cb4
Merge pull request #5397 from nrspruit/ns_ofi_mtl_ssend
MTL OFI: Redesign sync send with reduced tag bits and quick ack
2018-07-17 10:14:33 -07:00
Sergey Oblomov
a4b8253fa2 MCA/COMMON/UCX: fixed initialization of malloc hooks
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-07-17 20:09:50 +03:00
Gaëtan Bossu
8522ba112c MCA/IO/OMPIO: fix MPI_File_delete implementation.
OMPIO now uses the correct delete function depending on the fs

mca_common_ompio_file_delete now works this way instead
of calling POSIX unlink:
 - create a minimal file handle with the given file name
 - select the best fs component using this file handle
 - call the component-specific file delete function
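
The three steps above can be sketched as follows. Every type, component name, and priority value here is hypothetical, and the delete callbacks only record which component was chosen instead of touching the file system.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal file handle: just the file name. */
struct fh_sketch { const char *filename; };

/* Hypothetical fs component interface. */
struct fs_component_sketch {
    const char *name;
    int (*query)(const struct fh_sketch *fh);   /* priority, < 0 = unusable */
    int (*file_delete)(const char *filename);
};

static const char *last_deleter = NULL;

static int ufs_query(const struct fh_sketch *fh) { (void)fh; return 10; }
static int ufs_delete(const char *f) { (void)f; last_deleter = "ufs"; return 0; }

static int pfs_query(const struct fh_sketch *fh)
{
    return (0 == strncmp(fh->filename, "/pfs/", 5)) ? 50 : -1;
}
static int pfs_delete(const char *f) { (void)f; last_deleter = "pfs"; return 0; }

static const struct fs_component_sketch components[] = {
    { "ufs", ufs_query, ufs_delete },
    { "pfs", pfs_query, pfs_delete },
};

static int file_delete_sketch(const char *filename)
{
    struct fh_sketch fh = { filename };               /* 1. minimal handle */
    const struct fs_component_sketch *best = NULL;
    int best_prio = -1;

    for (size_t i = 0; i < sizeof components / sizeof components[0]; i++) {
        int prio = components[i].query(&fh);          /* 2. pick the best fs */
        if (prio > best_prio) {
            best_prio = prio;
            best = &components[i];
        }
    }
    return best ? best->file_delete(filename) : -1;   /* 3. delegate */
}
```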

Signed-off-by: Gaëtan Bossu <gbossu@ddn.com>
2018-07-17 18:17:13 +02:00
Gaëtan Bossu
ac6f75e3d1 MCA/FS: check communicator validity in query functions
This is needed because the fs components might be queried as the result of an
MPI_File_delete call, and in that case there is no communicator value.

Signed-off-by: Gaëtan Bossu <gbossu@ddn.com>
2018-07-17 18:16:21 +02:00
Josh Hursey
9aa5168795
Merge pull request #5353 from ggouaillardet/topic/romio321_grequests
io/romio321: make grequest extensions internal
2018-07-17 10:53:53 -05:00
Gilles Gouaillardet
1a41482720 coll/libnbc: do not recursively call opal_progress()
Instead of invoking ompi_request_test_all(), which would end up
calling opal_progress() recursively, manually check the status
of the requests.

the same method is used in ompi_comm_request_progress()
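
In outline, the manual check reads completion flags without driving progress. The request type and field name below are hypothetical simplifications of the real request objects.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request carrying only a completion flag. */
struct request_sketch { bool req_complete; };

/* Returns true only if every request is already complete. The function
 * only reads flags and never calls into the progress engine, so it can
 * be used from inside a progress callback without recursion. */
static bool all_complete_sketch(const struct request_sketch *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!reqs[i].req_complete) {
            return false;
        }
    }
    return true;
}
```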

Refs open-mpi/ompi#3901

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-07-17 09:45:08 -06:00