Commit graph

30571 commits

Author SHA1 Message Date
Jeff Squyres
92b58987d6
Merge pull request #7584 from ggouaillardet/topic/configury_fpp
configury: try if -fpp flag is needed to preprocess .F90 files
2020-04-01 12:53:57 -04:00
Jeff Squyres
4c89999266
Merge pull request #7587 from jsquyres/pr/resolve-prrte-and-iquote-conflict
configure: fix PRRTE conflict with -iquote
2020-04-01 12:38:07 -04:00
Jeff Squyres
a0e34785e3 configure: fix PRRTE conflict with -iquote
PRRTE needs hwloc and libevent, so it needs to be setup "late" in
configure.ac.  However, we don't want to do it at the absolute bottom
of configure.ac, because right near the bottom, we setup CPPFLAGS (and
others) with values that are expected to be used only in
Makefile[.am]'s -- i.e., "$(foo)" values.  Such values are not able to
be used here in configure.

Hence, move the PRRTE setup up above where these "final"/
only-relevant-to-Makefile[.am] CPPFLAGS (etc.) updates occur.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-04-01 08:58:06 -07:00
Josh Hursey
288f6ebb61
Merge pull request #7585 from rlangefe/pr/README-Fix
Grammatical Fix in README
2020-04-01 10:13:57 -05:00
rlangefe
24a45f90ae Grammatical fix in README
Signed-off-by: rlangefe <langrc18@wfu.edu>

Fixed misplaced closing parenthesis on line 184 of the README file
2020-04-01 10:18:59 -04:00
Gilles Gouaillardet
a2c711b54b configury: try if -fpp flag is needed to preprocess .F90 files
.F90 files are preprocessed by gfortran and other compilers.
NAG compilers only preprocess .{ff,ff90,ff95} files, and the -fpp
flag is required to process .F90 files.

Fixes open-mpi/ompi#7583

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-04-01 16:26:47 +09:00
Ralph Castain
538d2de860
Merge pull request #7566 from rhc54/topic/up2
Update PMIx and PRRTE
2020-03-31 10:29:19 -07:00
Ralph Castain
556b3fcc00
PRRTE: Return non-zero status on timeout
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-31 07:03:40 -07:00
Ralph Castain
1aabbe456d
Add extra libs to PRRTE binaries for external deps
libevent, hwloc, and pmix can be external and may require that their
libs be explicitly linked into the PRRTE binaries

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-30 16:06:40 -07:00
Yossi Itigin
5dcd1f4e6c
Merge pull request #7575 from yosefe/topic/pml-ucx-fix-usage-of-mca-pml
pml/ucx: Fix usage of mca_pml_base_pml_check_selected()
2020-03-30 20:06:12 +03:00
Nathan Hjelm
160ff188b8
Merge pull request #7169 from hjelmn/fix_what_wg21_calls_our_problem_not_theirs_seriously__in_some_ways_they_are_correct_but_wtf
configure: use -iquote for non-system include paths
2020-03-30 09:22:54 -07:00
Ralph Castain
f88f271054
Cleanup few errors associated with tool support
Properly mark/detect that a daemon sourced the event broadcast to avoid
reinjecting it into the PMIx server library. Correct the source field
for the event notify call on launcher ready.

Update event notification for tool support
Deal with a variety of race conditions related to tool reconnection to a
different server.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-29 11:58:43 -07:00
Yossi Itigin
124f0c0d1f pml/ucx: Fix usage of mca_pml_base_pml_check_selected()
Pass the correct ompi_proc_t and array length to
mca_pml_base_pml_check_selected() during dynamic modex.

Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2020-03-29 17:46:45 +03:00
Ralph Castain
96759e8d9f
Merge pull request #7573 from rhc54/topic/usfoo
Fix typo in usnic btl
2020-03-28 14:42:50 -07:00
Ralph Castain
95db66d0c8
Fix typo in usnic btl
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-27 20:27:45 -07:00
Howard Pritchard
f136a20cae
Merge pull request #6578 from hppritcha/topic/thread_framework2
Implement a MCA framework for threads
2020-03-27 15:55:48 -06:00
Austen Lauria
8a624ab613
Merge pull request #7523 from mkurnosov/fix-bcast-scatterallgather
Fix Bcast scatter_allgather
2020-03-27 14:17:53 -04:00
Shintaro Iwasaki
d7fba60de8 mca/threads: remove libevent hack
Argobots/Qthreads-aware libevent should be used instead.

Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-03-27 10:16:04 -06:00
Shintaro Iwasaki
a7ea0d9bd7 ompi/request: move REQUEST constants from mca/threads to ompi/request
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-03-27 10:16:04 -06:00
Shintaro Iwasaki
69e8af536a mca/threads: fix tsd management
To suppress Valgrind warnings, opal_tsd_keys_destruct() needs to explicitly
release TSD values of the main thread.  However, they were not freed if keys are
created by non-main threads.  This patch fixes it.

This patch also optimizes allocation of opal_tsd_key_values by doubling its size
when count >= length instead of increasing the size by one.
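
The growth strategy described above can be illustrated with a minimal sketch. The `tsd_table_t` type and the `tsd_table_push()` / `tsd_table_free()` helpers are hypothetical names invented for this illustration; they are not the actual `opal_tsd_key_values` code. The sketch only shows why doubling the capacity when `count >= length` keeps the number of reallocations logarithmic in the number of entries, instead of reallocating on every insertion.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a table of thread-specific-data entries. */
typedef struct {
    void **values;   /* stored entries */
    size_t count;    /* entries in use */
    size_t length;   /* allocated capacity */
} tsd_table_t;

/* Append a value, doubling the capacity when the table is full.
 * Returns 0 on success, -1 on allocation failure. */
static int tsd_table_push(tsd_table_t *t, void *value)
{
    if (t->count >= t->length) {
        size_t new_length = (t->length == 0) ? 1 : t->length * 2;
        void **tmp = realloc(t->values, new_length * sizeof(*tmp));
        if (tmp == NULL) {
            return -1;   /* the old storage is still valid */
        }
        t->values = tmp;
        t->length = new_length;
    }
    assert(t->count < t->length);
    t->values[t->count++] = value;
    return 0;
}

/* Release the table storage; the stored values belong to the caller. */
static void tsd_table_free(tsd_table_t *t)
{
    free(t->values);
    t->values = NULL;
    t->count = 0;
    t->length = 0;
}
```

Starting from an empty table, five pushes trigger four reallocations (capacities 1, 2, 4, 8); growing by one would reallocate on every push.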

Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-03-27 10:16:03 -06:00
Shintaro Iwasaki
8cab081770 test/class: fix opal_fifo and opal_lifo
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
2020-03-27 10:16:03 -06:00
Noah Evans
ee3517427e Add threads framework
Add a framework to support different types of threading models including
user space thread packages such as Qthreads and Argobots:

https://github.com/pmodels/argobots

https://github.com/Qthreads/qthreads

The default threading model is pthreads.  Alternate thread models are
specified at configure time using the --with-threads=X option.

The framework is static.  The threading model to use is selected at
Open MPI configure/build time.

mca/threads: implement Argobots threading layer

config: fix thread configury

- Add double quotations
- Change Argobot to Argobots
config: implement Argobots check

If the poll time is too long, MPI hangs.

This quick fix just sets it to 0, but it is not good for the
Pthreads version. Need to find a good way to abstract it.

Note that even 1 (= 1 millisecond) causes disastrous performance
degradation.

rework threads MCA framework configury

It now works more like the ompi/mca/rte configury,
modulo some edge items that are special for threading package
linking, etc.

qthreads module
some argobots cleanup

Signed-off-by: Noah Evans <noah.evans@gmail.com>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-03-27 10:15:45 -06:00
Brian Barrett
36e5863a9a
Merge pull request #7571 from bwbarrett/bugfix/mtl-add-procs
ofi: Call add_procs through PML
2020-03-27 08:24:35 -07:00
Ralph Castain
43e4f6c84b
Merge pull request #7572 from rhc54/topic/auth
Add .mailmap entry for Harumi Kuno
2020-03-27 08:19:06 -07:00
Jeff Squyres
5964f25748 .mailmap: Add entry for Harumi Kuno
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-03-27 10:41:08 -04:00
Austen Lauria
8128c309c8
Merge pull request #6080 from ggouaillardet/topic/configury_lustre
configure: fix lustre detection
2020-03-27 09:24:08 -04:00
Brian Barrett
64d70b3076 ofi: Call add_procs through PML
Change ompi_mtl_ofi_get_endpoint() to call the active PML's
add_procs() rather than the OFI MTL add_procs() directly when
discovering a new process during operation.

Functionally, this has no impact on correct operation.  However,
the current behavior means that the heterogeneous and active PML
checks are not being executed in the dynamic discovery case.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-03-27 06:06:42 -07:00
Ralph Castain
1cf972dcaf
Update PMIx and PRRTE
Deprecate --am and --amca options

Avoid default param files on backend nodes
Any parameters in the PRRTE default or user param files will have been
picked up by prte and included in the environment sent to the prted, so
don't open those files on the backend.

Avoid picking up MCA param file info on backend
Avoid the scaling problem at PRRTE startup by only reading the system
and user param files on the frontend.

Complete revisions to cmd line parser for OMPI
Per specification, enforce following precedence order:

1. system-level default parameter file
2. user-level default parameter file
3. Anything found in the environment
4. "--tune" files. Note that "--amca" goes away and becomes equivalent to "--tune". Okay if it is provided more than once on a cmd line (we will aggregate the list of files, retaining order), but an error if a parameter is referenced in more than one file with a different value
5. "--mca" options. Again, error if the same option appears more than once with a different value. Allowed to override a parameter referenced in a "tune" file
6. "-x" options. Allowed to overwrite options given in a "tune" file, but cannot conflict with an explicit "--mca" option
7. all other options
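
The precedence rules above can be sketched as a small resolver. This is an illustration only, not the PRRTE command-line parser: the `param_source_t` enum, the `param_t` struct, and `param_apply()` are hypothetical names. The sketch reads the list as running from lowest to highest precedence and assumes the caller applies sources in that order. How two differing "-x" values for the same variable are handled is not stated in the message; here the later one simply wins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical source tags, in ascending precedence per the list above. */
typedef enum {
    SRC_NONE = 0,
    SRC_SYSTEM_FILE,   /* system-level default parameter file */
    SRC_USER_FILE,     /* user-level default parameter file */
    SRC_ENVIRONMENT,   /* anything found in the environment */
    SRC_TUNE_FILE,     /* "--tune" files */
    SRC_MCA_OPTION,    /* "--mca" options */
    SRC_X_OPTION       /* "-x" options */
} param_source_t;

typedef struct {
    const char *value;       /* current value, not owned by this struct */
    param_source_t source;   /* where the current value came from */
} param_t;

/* Apply one setting. Returns 0 on success, -1 on a conflict.
 * Assumes sources are applied in ascending precedence order. */
static int param_apply(param_t *p, const char *value, param_source_t source)
{
    assert(p != NULL && value != NULL);
    int same = (p->value != NULL && strcmp(p->value, value) == 0);

    /* Two tune files, or two --mca options, must agree on a parameter. */
    if (p->source == source && !same &&
        (source == SRC_TUNE_FILE || source == SRC_MCA_OPTION)) {
        return -1;
    }
    /* -x may override a tune file but may not contradict --mca. */
    if (source == SRC_X_OPTION && p->source == SRC_MCA_OPTION && !same) {
        return -1;
    }
    /* An equal or higher-precedence source replaces the current value. */
    if (source >= p->source) {
        p->value = value;
        p->source = source;
    }
    return 0;
}
```

With this sketch, a value from a "--tune" file replaces one from the system file, a second tune file with a different value is rejected, and a "-x" setting that contradicts an explicit "--mca" value is reported as a conflict.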

Fix special handling of "-np"

Get agreement on jobid across the layers
Need all three pieces (PRRTE, PMIx, and OPAL) to agree on the nspace
conversion to jobid method

Ensure prte show_help messages get output
Print abnormal termination messages
Cleanup error reporting in persistent operations

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-26 16:01:11 -07:00
Ralph Castain
c704ed4cc5
Merge pull request #7554 from rhc54/topic/proc1
ompi_proc_t size reduction: part 1
2020-03-26 13:23:06 -07:00
Jeff Squyres
f9575ed026
Merge pull request #7545 from rhc54/topic/readme
README: Document what compiler atomics support is necessary
2020-03-25 16:47:31 -04:00
Jeff Squyres
342cd3c13f README: Remove Solaris 10/11 from tested platforms
Clarify exactly what compiler atomics support is needed to build Open
MPI.

For example, the following configuration now fails to configure on
Solaris:

- Studio 12.5 Sun C 5.14 SunOS_sparc 2016/05/31
- Also tried with GCC 4.0.2

with the following error:

configure: error: No atomic primitives available for sparc-sun-solaris2.11

Thanks to @jdelsign for the report.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-03-25 16:06:12 -04:00
Ralph Castain
33ab928e1b ompi_proc_t size reduction: part 1
We currently save the hostname of a proc when we create the ompi_proc_t for it. This was originally done because the only method we had for discovering the host of a proc was to include that info in the modex, and we had to therefore store it somewhere proc-local. Obviously, this carried a memory penalty for storing all those strings, and so we added a "cutoff" parameter so that we wouldn't collect hostnames above a certain number of procs.

Unfortunately, this still results in an 8-byte/proc memory cost as we have a char* pointer in the opal_proc_t that is contained in the ompi_proc_t so that we can store the hostname of the other procs if we fall below the cutoff. At scale, this can consume a fair amount of memory.

With the switch to relying on PMIx, there is no longer a need to cache the proc hostnames. Using the "optional" feature of PMIx_Get, we restrict the retrieval to be purely proc-local - i.e., we retrieve the info either via shared memory or from within the proc-internal hash storage (depending upon the active PMIx components). Thus, the retrieval of a hostname is purely a local operation involving no communication.

All RM's are required to provide a complete hostname map of all procs at startup. Thus, we have full access to all hostnames without including them in a modex or having to cache them on each proc. This allows us to remove the char* pointer from the opal_proc_t, saving us 8-bytes/proc.

Unfortunately, PMIx_Get does not currently support the return of a static pointer to memory. Thus, even though PMIx has the hostname in its memory, it can only return a malloc'd version of it. I have therefore ensured that the return from opal_get_proc_hostname is consistently malloc'd and free'd wherever used. This shouldn't be a burden as the hostname is only used in one of two circumstances:

(a) in an error message
(b) in a verbose output for debugging purposes

Thus, there should be no performance penalty associated with the malloc/free requirement. PMIx will eventually be returning static pointers, and so we can eventually simplify this method and return a "const char*" - but as noted, this really isn't an issue even today.
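
The ownership convention described above (the lookup returns a malloc'd copy, and the caller frees it after a single use) can be sketched as follows. Everything here is hypothetical: `host_map`, `lookup_hostname_copy()`, and `format_peer_error()` are invented for the illustration and do not reproduce the signature of `opal_get_proc_hostname` or any PMIx call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical host map standing in for the data provided at startup. */
static const char *const host_map[] = { "node0", "node0", "node1" };
static const size_t host_map_len = sizeof(host_map) / sizeof(host_map[0]);

/* Return a heap-allocated copy of the hostname for 'rank', or NULL if the
 * rank is unknown or allocation fails. The caller must free() the result. */
static char *lookup_hostname_copy(size_t rank)
{
    if (rank >= host_map_len) {
        return NULL;
    }
    size_t n = strlen(host_map[rank]) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, host_map[rank], n);
    }
    return copy;
}

/* Build an error message: fetch the hostname, use it once, free it. */
static int format_peer_error(char *buf, size_t buflen, size_t rank)
{
    assert(buf != NULL && buflen > 0);
    char *host = lookup_hostname_copy(rank);
    int n = snprintf(buf, buflen, "connection to rank %zu on %s failed",
                     rank, host != NULL ? host : "unknown");
    free(host);   /* free(NULL) is a no-op */
    return n;
}
```

Because the copy lives only for the duration of one error or debug message, the extra malloc/free stays off the communication fast path.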

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 12:49:44 -07:00
Ralph Castain
9bb06d0077
Merge pull request #7559 from rhc54/topic/fixes
Bunch of fixes plus PMIx/PRRTE updates
2020-03-23 12:49:18 -07:00
Ralph Castain
43f79be2e3
Update PMIx and PRRTE
Fix singleton operations and ensure notification upon tool connection.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 11:18:23 -07:00
Ralph Castain
a608e053a6
Silence compiler warning
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 11:18:13 -07:00
Ralph Castain
95dacd2086
Fix singletons and ensure adequate PMIx version
OMPI can only support PMIx v3 and above. PRRTE requires at least PMIx
v4, so protect against the case where OMPI is built against an external
PMIx v3.

Fix check of PMIx_Init return code for singleton operations.

Ensure that the PMIx framework gets properly opened.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 10:29:42 -07:00
Austen Lauria
48b52478ef
Merge pull request #7558 from awlauria/revert_hwloc
Revert "Ensure we get our local topology"
2020-03-23 11:54:11 -04:00
Austen Lauria
ecd20ddcac Revert "Ensure we get our local topology"
Per the devel mailing list, we discussed the need/desirability of this change. Here is the logic behind not including it:

If you call "hwloc_topology_load", then hwloc merrily does its discovery and slams many-core systems. If you call "opal_hwloc_get_topology", then that is fine - it checks if we already have it, tries to get it from PMIx (using shared mem for hwloc 2.x), and only does the discovery if no other method is available.

We previously decided to let those who need the topology call "opal_hwloc_get_topology" to ensure the topo was available so that we don't load it unless someone actually needs it - in the case where it isn't available via PMIx, this avoids paying the startup time and memory footprint penalties for no reason. The function is protected so it will simply return SUCCESS if the topology is already defined.

After discussion, it was decided to stick with that "only setup the topology if someone actually needs it" approach. Hence, we will not blanket init the topology, and the mtl/ofi component will call opal_hwloc_get_topology to ensure the topo has been defined prior to using it.
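
The "only set up the topology if someone actually needs it" approach can be sketched as an idempotent getter. This is a simplified, single-threaded illustration with hypothetical names (`topo_t`, `fetch_shared_topology()`, `discover_topology()`, `get_topology()`); it does not use the hwloc or PMIx APIs and is not the implementation of `opal_hwloc_get_topology`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified topology description. */
typedef struct {
    int num_cores;
} topo_t;

static topo_t cached_topo;
static topo_t *topo = NULL;       /* NULL until someone asks for it */
static int discovery_runs = 0;    /* counts expensive local discoveries */

/* Stand-in for obtaining the topology from the runtime; in this sketch
 * it always reports "not available" by returning -1. */
static int fetch_shared_topology(topo_t *out)
{
    (void)out;
    return -1;
}

/* Stand-in for the expensive local discovery. */
static int discover_topology(topo_t *out)
{
    discovery_runs++;
    out->num_cores = 4;
    return 0;
}

/* Ensure the topology is defined. Returns 0 on success, -1 on failure.
 * Repeated calls return immediately once the topology exists. */
static int get_topology(void)
{
    if (topo != NULL) {
        return 0;                 /* already defined: nothing to do */
    }
    if (fetch_shared_topology(&cached_topo) != 0 &&
        discover_topology(&cached_topo) != 0) {
        return -1;                /* neither source worked */
    }
    topo = &cached_topo;
    assert(topo != NULL);
    return 0;
}
```

A component that needs the topology calls `get_topology()` before first use; processes that never call it pay no discovery cost.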

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-03-23 11:15:47 -04:00
Austen Lauria
cf5ca14f7a
Merge pull request #7547 from rhc54/topic/hwloc
Ensure we get our local topology
2020-03-23 10:13:02 -04:00
Austen Lauria
391370a05c
Merge pull request #7546 from devreal/egrequest-obj-retain
grequestx: fix racy initialization and premature destruction
2020-03-23 10:12:32 -04:00
Austen Lauria
b560fc5fae
Merge pull request #7505 from hkuno/john.l.byrne/btl_ofi
Fix btl ofi clean-up logic
2020-03-23 10:10:33 -04:00
Jeff Squyres
758c3f8d05
Merge pull request #7534 from artemry-mlnx/artemry-mlnx/cleanup_workspace_wa
Implemented a W/A for workspace cleanup issue
2020-03-21 08:35:11 -04:00
Ralph Castain
973d10159a
Merge pull request #7548 from jsquyres/pr/usnic-typo
usnic: remove typo
2020-03-20 14:55:46 -07:00
Ralph Castain
a523e9b74e
Merge pull request #7549 from rhc54/topic/mpirun
Update PMIx and PRRTE to reduce mpirun complexity
2020-03-20 14:55:25 -07:00
Ralph Castain
2979bb2ce8
Update PMIx and PRRTE to reduce mpirun complexity
Use "prte" instead of "prun" for proxy execution of cmds like mpirun.
This avoids the fork/exec-rendezvous complexities and should result in
more reliable operation.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-20 13:49:12 -07:00
Jeff Squyres
1870b04017 usnic: remove typo
Remove an amusing -- but harmless -- typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-03-20 11:16:52 -07:00
Ralph Castain
98893a530b
Ensure we get our local topology
Restore missing call to get_topology - others were doing it in their
components as repeated calls just return success, but let's ensure it is
always present.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-20 09:28:20 -07:00
Joseph Schuchart
dabdfe7153 grequestx: fix race condition in initialization
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-03-20 14:53:28 +01:00
Joseph Schuchart
4a39a34bab grequestx: retain request object until it is removed from the list
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-03-20 14:52:42 +01:00
Ralph Castain
b3f0bc5490
Merge pull request #7544 from rhc54/topic/buf
Re-enable stream buffering option
2020-03-18 18:25:02 -07:00