
30737 Commits

Author SHA1 Message Date
Ralph Castain
95dacd2086
Fix singletons and ensure adequate PMIx version
OMPI can only support PMIx v3 and above. PRRTE requires at least PMIx
v4, so protect against the case where OMPI is built against an external
PMIx v3.

Fix check of PMIx_Init return code for singleton operations.

Ensure that the PMIx framework gets properly opened.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 10:29:42 -07:00
Austen Lauria
48b52478ef
Merge pull request #7558 from awlauria/revert_hwloc
Revert "Ensure we get our local topology"
2020-03-23 11:54:11 -04:00
Austen Lauria
ecd20ddcac Revert "Ensure we get our local topology"
Per the devel mailing list, we discussed the need/desirability of this change. Here is the logic behind not including it:

If you call "hwloc_topology_load", then hwloc merrily does its discovery and slams many-core systems. If you call "opal_hwloc_get_topology", then that is fine - it checks if we already have it, tries to get it from PMIx (using shared mem for hwloc 2.x), and only does the discovery if no other method is available.

We previously decided to let those who need the topology call "opal_hwloc_get_topology" to ensure the topo was available so that we don't load it unless someone actually needs it - in the case where it isn't available via PMIx, this avoids paying the startup time and memory footprint penalties for no reason. The function is protected so it will simply return SUCCESS if the topology is already defined.

After discussion, it was decided to stick with that "only setup the topology if someone actually needs it" approach. Hence, we will not blanket init the topology, and the mtl/ofi component will call opal_hwloc_get_topology to ensure the topo has been defined prior to using it.
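
The "only set up the topology if someone actually needs it" approach can be sketched as an idempotent getter. This is a minimal illustration, not the actual opal_hwloc_get_topology implementation: topo_t, fetch_from_runtime, discover_locally, and get_topology are hypothetical stand-ins, and the core count is a made-up value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names are the real OPAL/hwloc API. */
typedef struct { int ncores; } topo_t;

static topo_t  topo_storage;
static topo_t *cached_topo = NULL;  /* NULL until a caller asks for it */
static int     discovery_runs = 0;  /* counts expensive local discoveries */

/* Pretend to obtain the topology from the runtime (e.g. via shared memory).
 * Returns 0 on success, -1 if the runtime cannot provide it. */
static int fetch_from_runtime(topo_t *out, int runtime_has_topo)
{
    assert(out != NULL);
    if (!runtime_has_topo) {
        return -1;
    }
    out->ncores = 64;  /* made-up value for the sketch */
    return 0;
}

/* Expensive local discovery, used only as a last resort. */
static void discover_locally(topo_t *out)
{
    assert(out != NULL);
    discovery_runs++;
    out->ncores = 64;  /* made-up value for the sketch */
}

/* Idempotent getter: returns 0 immediately if the topology is already
 * defined, so callers may invoke it as often as they like. */
int get_topology(int runtime_has_topo)
{
    if (cached_topo != NULL) {
        return 0;
    }
    if (fetch_from_runtime(&topo_storage, runtime_has_topo) != 0) {
        discover_locally(&topo_storage);
    }
    cached_topo = &topo_storage;
    return 0;
}
```

In this sketch a component that needs the topology calls get_topology() before first use; processes that never call it never pay for discovery.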

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-03-23 11:15:47 -04:00
Austen Lauria
cf5ca14f7a
Merge pull request #7547 from rhc54/topic/hwloc
Ensure we get our local topology
2020-03-23 10:13:02 -04:00
Austen Lauria
391370a05c
Merge pull request #7546 from devreal/egrequest-obj-retain
grequestx: fix racy initialization and premature destruction
2020-03-23 10:12:32 -04:00
Austen Lauria
b560fc5fae
Merge pull request #7505 from hkuno/john.l.byrne/btl_ofi
Fix btl ofi clean-up logic
2020-03-23 10:10:33 -04:00
Jeff Squyres
758c3f8d05
Merge pull request #7534 from artemry-mlnx/artemry-mlnx/cleanup_workspace_wa
Implemented a W/A for workspace cleanup issue
2020-03-21 08:35:11 -04:00
Ralph Castain
973d10159a
Merge pull request #7548 from jsquyres/pr/usnic-typo
usnic: remove typo
2020-03-20 14:55:46 -07:00
Ralph Castain
a523e9b74e
Merge pull request #7549 from rhc54/topic/mpirun
Update PMIx and PRRTE to reduce mpirun complexity
2020-03-20 14:55:25 -07:00
Ralph Castain
2979bb2ce8
Update PMIx and PRRTE to reduce mpirun complexity
Use "prte" instead of "prun" for proxy execution of cmds like mpirun.
This avoids the fork/exec-rendezvous complexities and should result in
more reliable operation.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-20 13:49:12 -07:00
Jeff Squyres
1870b04017 usnic: remove typo
Remove an amusing -- but harmless -- typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-03-20 11:16:52 -07:00
Ralph Castain
98893a530b
Ensure we get our local topology
Restore missing call to get_topology - others were doing it in their
components as repeated calls just return success, but let's ensure it is
always present.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-20 09:28:20 -07:00
Joseph Schuchart
dabdfe7153 grequestx: fix race condition in initialization
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-03-20 14:53:28 +01:00
Joseph Schuchart
4a39a34bab grequestx: retain request object until it is removed from the list
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-03-20 14:52:42 +01:00
Ralph Castain
b3f0bc5490
Merge pull request #7544 from rhc54/topic/buf
Re-enable stream buffering option
2020-03-18 18:25:02 -07:00
Ralph Castain
757621e199
Re-enable stream buffering option
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-18 16:20:57 -07:00
Ralph Castain
4bb13bc733
Merge pull request #7543 from rhc54/topic/up
Update PMIx and PRRTE
2020-03-18 14:03:27 -07:00
Ralph Castain
0dccd3378b
Update PMIx and PRRTE
PMIx
- fix several race conditions

PRRTE
- fix race condition
- extend prun-to-prte connection tries
- pass correct nspace to job ctrl in response to ctrl-c

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-18 11:46:38 -07:00
Ralph Castain
c3ff5958a5
Merge pull request #7540 from jsquyres/pr/fix-pmix-resetting-top-level-flags
opal_check_pmi.m4: properly save top-level flags
2020-03-17 20:22:42 -07:00
Jeff Squyres
cb1e424359 opal_check_pmi.m4: properly save top-level flags
CPPFLAGS, LDFLAGS, and LIBS were only being saved conditionally, but
restored unconditionally.  This could result in wiping out
CPPFLAGS/LDFLAGS/LIBS.

Make sure to *always* save these flags so that when they are restored,
they are restored to their proper value.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-03-17 22:15:41 -04:00
Ralph Castain
b35d714131
Merge pull request #7537 from rhc54/topic/pxup
Update PMIx
2020-03-17 10:34:54 -07:00
Ralph Castain
972f6aea7f
Update PMIx
- Silence a few (valid) warnings

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-17 08:53:43 -07:00
Ralph Castain
ddc19559af
Merge pull request #7535 from rhc54/topic/rte
Cleanup singleton detection and data retrieval
2020-03-16 15:49:20 -07:00
Ralph Castain
6b4fb509e9
Cleanup singleton detection and data retrieval
Extend the PMIx modex recv macros to cover the full set of
immediate/optional combinations. If PMIx_Init cannot reach a server,
then declare the MPI proc to be a singleton.

Provide full support for info values via PMIx

Catch all the values used in the "info" area of OMPI using data
available from PMIx instead of via envars. Update PMIx and PRRTE to sync
with their capabilities.

PMIx
- ensure cleanup of fork/exec children
- fix bug in gds/hash that left app info off of list

PRRTE
- fix multi-app bugs
- port setup_child logic from orte
- OMPI env changes
- set app->first_rank
- ensure common hostname across prun, prte, and pmix
- Fix "nolocal" support

Silence a warning from btl/vader

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-16 12:25:28 -07:00
Artem Ryabov
15e05ebe56 Implemented a W/A for workspace cleanup issue
Signed-off-by: Artem Ryabov <artemry@mellanox.com>
2020-03-15 00:36:13 +03:00
Artem Polyakov
9ffee9859f
Merge pull request #7512 from artpol84/topic/master/timings_update
Topic/master/timings update
2020-03-12 17:37:42 -07:00
Austen Lauria
f973381a46
Merge pull request #7524 from awlauria/update_info_for_prrte
Update two orte specific env's to be more generic
2020-03-12 20:18:42 -04:00
Ralph Castain
c296dada2c
Merge pull request #7525 from rhc54/topic/fence
Correct fence logic in MPI_Init
2020-03-11 16:55:01 -07:00
Austen Lauria
7c31586c6d
Merge pull request #7501 from awlauria/finalize_leaks_ggouaillardet_awlauria
Finalize memchecker calls and one memory leak
2020-03-11 13:04:50 -04:00
Ralph Castain
dd623cec34
Correct fence logic in MPI_Init
The fence logic in MPI_Init got messed up somehow such that we were
always executing a fence, which is not desirable. The logic is supposed
to be:

* if async fence is requested and we are not collecting data, then do
not fence at all

* if async fence is requested and we are collecting data, then execute
the fence in the background - wait for completion at the end of MPI_Init.

* if async fence is not requested, then execute a blocking fence at that
point, collecting data as directed. Note that we cannot actually do a
blocking fence as we need to cycle the event library via opal_progress
as the PMIx progress thread is tied to the OMPI event base.
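
The three cases above can be sketched as a small decision function plus a wait loop that keeps cycling progress. This is an illustration under assumptions, not the code in MPI_Init: fence_mode_t, choose_fence, wait_with_progress, and demo_progress are hypothetical names, and the progress callback merely stands in for opal_progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three outcomes described above. */
typedef enum { FENCE_NONE, FENCE_BACKGROUND, FENCE_BLOCKING } fence_mode_t;

/* Decide what kind of fence to run from the two inputs. */
fence_mode_t choose_fence(bool async_requested, bool collecting_data)
{
    if (async_requested) {
        /* Async with no data to collect: skip the fence entirely. */
        return collecting_data ? FENCE_BACKGROUND : FENCE_NONE;
    }
    return FENCE_BLOCKING;
}

/* A "blocking" wait that still cycles a progress function until the
 * completion flag is set. Returns how many times progress was called. */
int wait_with_progress(volatile bool *done, void (*progress)(void))
{
    assert(done != NULL && progress != NULL);
    int spins = 0;
    while (!*done) {
        progress();
        spins++;
    }
    return spins;
}

/* Demo progress function: completes the fake fence on the third call. */
static volatile bool fence_done = false;
static int ticks = 0;
static void demo_progress(void)
{
    if (++ticks >= 3) {
        fence_done = true;
    }
}
```

For FENCE_BACKGROUND the same wait loop would run at the end of initialization rather than at the fence point.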

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-11 09:25:07 -07:00
Austen Lauria
b675a76361 Update two orte specific env's to be more generic.
orte is gone, and we don't want to require other
RM's to either use a prrte specific env, or to
set their own.

OMPI_MCA_orte_ess_num_procs -> OMPI_MCA_num_procs
OMPI_MCA_orte_cpu_type -> OMPI_MCA_cpu_type
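
As a rough illustration of what a resource manager's client might do with the renamed variable, the sketch below reads OMPI_MCA_num_procs directly from the environment. Reading it with getenv and the read_num_procs helper are assumptions made for this example only; Open MPI itself resolves such values through its own parameter handling.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns the process count from OMPI_MCA_num_procs,
 * or `fallback` when the variable is unset, empty, or not a positive
 * integer. */
long read_num_procs(long fallback)
{
    const char *val = getenv("OMPI_MCA_num_procs");
    if (val == NULL || *val == '\0') {
        return fallback;
    }
    char *end = NULL;
    long n = strtol(val, &end, 10);
    assert(end != NULL);  /* strtol always sets the end pointer */
    if (*end != '\0' || n <= 0) {
        return fallback;  /* trailing junk or non-positive value */
    }
    return n;
}
```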

See PRRTE PR's:

https://github.com/openpmix/prrte/pull/443
https://github.com/openpmix/prrte/pull/440

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-03-11 10:09:35 -04:00
Mikhail Kurnosov
66b6b8d34e Fix Bcast scatter_allgather
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
2020-03-11 12:47:47 +07:00
Harumi Kuno
ab4875ddc2 set ep to NULL to avoid double close
Per suggestion of @awlauria
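
The idiom named in the title can be sketched as follows. This is not the btl/ofi code: module_t and module_close_ep are hypothetical, and free() stands in for whatever call actually closes the endpoint.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical module holding an endpoint handle. */
typedef struct {
    int *ep;           /* stand-in for an endpoint resource */
    int  close_calls;  /* counts real close operations, for the demo */
} module_t;

/* Close the endpoint at most once: after closing, the pointer is reset
 * to NULL so a second cleanup pass becomes a harmless no-op. */
void module_close_ep(module_t *m)
{
    assert(m != NULL);
    if (m->ep != NULL) {
        free(m->ep);       /* stand-in for the real close call */
        m->close_calls++;
        m->ep = NULL;      /* prevents a double close */
    }
}
```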

Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
2020-03-10 17:39:59 -06:00
Ralph Castain
11028d0322
Merge pull request #7518 from rhc54/topic/prup
Update PRRTE and PMIx
2020-03-10 05:44:45 -07:00
Ralph Castain
18b06430d3
Update PRRTE and PMIx
- Avoid modifying single-dash options of applications
- Fix fetch of node/app-level info
- Return correct status code

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-09 18:23:43 -07:00
Nathan Hjelm
a6212fe767 configure: use -iquote for non-system include paths
This fixes a bug in the configuration system identified by a
change in the C++ standard. C++20 adds a new mandatory header
named version. This (and any system) header will always be
included with <> and not "". On case-insensitive filesystems
this new header conflicts with the VERSION file at the top
level of the build tree.

To fix this issue Open MPI needs to use -iquote instead of -I
for non-system include paths to ensure that these paths are
only searched for the quote form of include. This commit also
adds a check to ensure that if the compiler does not support
-iquote that it falls back to -I until support can be added.

References #7155

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-03-09 21:18:40 -04:00
Gilles Gouaillardet
69bc2e8372 misc: fix <> vs "" includes throughout the ompi codebase
This commit fixes an issue with the include usage in some
ompi source files. These source files are using the <> form
of include when the "" form is correct (as these are internal,
**not** system headers).

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-03-09 21:13:49 -04:00
Ralph Castain
4c7890dec4
Merge pull request #7517 from rhc54/topic/prte
Report PRRTE build status if autogen'd --no-prrte
2020-03-09 14:39:38 -07:00
Ralph Castain
727bd8a60d
Report PRRTE build status if autogen'd --no-prrte
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-08 11:02:42 -07:00
Harumi Kuno
1bc3dab118 Add comments about order of close ops
Per suggestion of @awlauria, added some comments about
the need to free ep before resources it points to.

Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
2020-03-07 14:08:39 -07:00
Artem Polyakov
0f51ea3fe5 timings: Update/extend OSHMEM timings
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2020-03-07 09:36:43 -08:00
Artem Polyakov
7c17a38c96 timings: Fix timings when 'prefix' is used
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2020-03-07 09:36:43 -08:00
Ralph Castain
836cc5b6a0
Merge pull request #7498 from rhc54/topic/again
Update PRRTE and PMIx
2020-03-06 11:31:33 -08:00
Ralph Castain
d454bf1f20
Update PRRTE and PMIx
PMIx:
- Ensure that launchers open all required frameworks
- Pass back the tool's ID
- Fix race condition in IOF

PRRTE:
- Begin conversion to use of nspace in place of numeric jobid
- Restore support:
    --report-bindings
    --display-map
    --display-devel-map
    --display-topo
    --do-not-launch
    --xml-output
    --display-allocation

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-06 10:04:41 -08:00
Howard Pritchard
8d59512a9e
Merge pull request #7506 from hppritcha/topic/address_issue7458
check for external libevent and hwloc
2020-03-06 09:49:15 -07:00
Austen Lauria
04a3a28a74 Some memchecker cleanup and others.
- Port memchecker call from a1d502c.
- Remove unused memcheck macro variables.
- Some code readability improvements.
- Remove some stray +1's in dynamic comm cleanup.
- Re-add OPAL_ENABLE_DEBUG macro to osc header.
- Cleanup some printf's, and includes.
- Refactor cleanup of dpm_disconnect_objs.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-03-05 16:44:18 -05:00
Howard Pritchard
2990d8d98b check for external libevent and hwloc
when building with external PMIx.

Related #7458

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-03-05 14:30:57 -07:00
Gilles Gouaillardet
e2ad184db5 pml/ob1: silence valgrind errors
always define and initialize padding in various structs
when OPAL_ENABLE_DEBUG is set
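
A minimal sketch of the technique, assuming a made-up header layout: hdr_t, hdr_init, and DEMO_ENABLE_DEBUG are hypothetical and only mirror the idea of defining and zeroing padding in debug builds so that memory checkers do not see uninitialized bytes when the struct is copied or sent.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical switch standing in for a debug-build macro. */
#define DEMO_ENABLE_DEBUG 1

/* Made-up header: without the explicit field, the compiler would insert
 * three unnamed, uninitialized padding bytes before `tag`. */
typedef struct {
    uint8_t  type;
#if DEMO_ENABLE_DEBUG
    uint8_t  padding[3];  /* padding made explicit so it can be set */
#endif
    uint32_t tag;
} hdr_t;

void hdr_init(hdr_t *h, uint8_t type, uint32_t tag)
{
    assert(h != NULL);
    h->type = type;
#if DEMO_ENABLE_DEBUG
    /* Initialize the padding so every byte of the header is defined. */
    memset(h->padding, 0, sizeof(h->padding));
#endif
    h->tag = tag;
}
```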

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-03-05 16:10:43 -05:00
Gilles Gouaillardet
5751dfe91a mpi/c: fix memchecker invocation
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-03-05 16:10:42 -05:00
Gilles Gouaillardet
fc2516457b osc/pt2pt: silence valgrind warnings
explicitly add and initialize padding to keep valgrind happy

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-03-05 16:10:42 -05:00