1
1

5707 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
6b4fb509e9
Cleanup singleton detection and data retrieval
Extend the PMIx modex recv macros to cover the full set of
immediate/optional combinations. If PMIx_Init cannot reach a server,
then declare the MPI proc to be a singleton.

Provide full support for info values via PMIx

Catch all the values used in the "info" area of OMPI using data
available from PMIx instead of via envars. Update PMIx and PRRTE to sync
with their capabilities.

PMIx
- ensure cleanup of fork/exec children
- fix bug in gds/hash that left app info off of list

PRRTE
- fix multi-app bugs
- port setup_child logic from orte
- OMPI env changes
- set app->first_rank
- ensure common hostname across prun, prte, and pmix
- Fix "nolocal" support

Silence a warning from btl/vader

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-16 12:25:28 -07:00
Artem Polyakov
9ffee9859f
Merge pull request #7512 from artpol84/topic/master/timings_update
Topic/master/timings update
2020-03-12 17:37:42 -07:00
Austen Lauria
7c31586c6d
Merge pull request #7501 from awlauria/finalize_leaks_ggouaillardet_awlauria
Finalize memchecker calls and one memory leak
2020-03-11 13:04:50 -04:00
Ralph Castain
18b06430d3
Update PRRTE and PMIx
- Avoid modifying single-dash options of applications
- Fix fetch of node/app-level info
- Return correct status code

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-09 18:23:43 -07:00
Artem Polyakov
7c17a38c96 timings: Fix timings when 'prefix' is used
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2020-03-07 09:36:43 -08:00
Ralph Castain
836cc5b6a0
Merge pull request #7498 from rhc54/topic/again
Update PRRTE and PMIx
2020-03-06 11:31:33 -08:00
Ralph Castain
d454bf1f20
Update PRRTE and PMIx
PMIx:
- Ensure that launchers open all required frameworks
- Pass back the tool's ID
- Fix race condition in IOF

PRRTE:
- Begin conversion to use of nspace in place of numeric jobid
- Restore support:
    --report-bindings
    --display-map
    --display-devel-map
    --display-topo
    --do-not-launch
    --xml-output
    --display-allocation

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-06 10:04:41 -08:00
Howard Pritchard
8d59512a9e
Merge pull request #7506 from hppritcha/topic/address_issue7458
check for external libevent and hwloc
2020-03-06 09:49:15 -07:00
Austen Lauria
04a3a28a74 Some memchecker cleanup and others.
- Port memchecker call from a1d502c.
- Remove unused memcheck macro variables.
- Some code readability improvements.
- Remove some stray +1's in dynamic comm cleanup.
- Re-add OPAL_ENABLE_DEBUG macro to osc header.
- Cleanup some printf's, and includes.
- Refactor cleanup of dpm_disconnect_objs.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-03-05 16:44:18 -05:00
Howard Pritchard
2990d8d98b check for external libevent and hwloc
when building with external PMIx.

Related #7458

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-03-05 14:30:57 -07:00
Gilles Gouaillardet
ff746153d7 mpool/base: silence a valgrind warning
by adding a constructor to mca_mpool_base_tree_item_t

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-03-05 16:10:42 -05:00
Austen Lauria
f69c8d6819 Fix segv in btl/vader.
Keep track of the connected procs in vader_add_procs().
Otherwise, the same rank will reconnect the same shmem
segment (rank 0+...) multiple times instead of the next
one as intended.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-03-04 09:32:58 -05:00
Ralph Castain
c537bef7d5
Update the PRRTE and PMIx pointers
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-28 19:55:42 -08:00
Ralph Castain
cbbe67eff9
Merge pull request #7487 from bosilca/topic/pml_from_vpid0
Make sure the PML selection is consistent across the world.
2020-02-28 17:19:26 -08:00
bosilca
c4d36859ec
Merge pull request #7228 from devreal/progress-returns
Harmonize return values of progress callbacks
2020-02-28 20:15:37 -05:00
George Bosilca
21d743393f
Make sure the PML is consistent across the world.
Temporary solution for the PML inconsistency issue discussed in #7475.
This patch address 2 things: first it make the PMIx key optional so that
if we are not in a full modex mode we don't do a direct modex, and
second it get the PML info from the vpid 0 instead of from the local
rank.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-02-28 17:53:48 -05:00
Ralph Castain
c79c95039e
Merge pull request #7474 from rhc54/topic/up
Update the PMIx and PRRTE pointers
2020-02-27 19:36:08 -08:00
Ralph Castain
9e2db26732
Fix vader local modex
Restrict the search to the "immediate" range so at worst we check with
our local server and don't go up to the host daemon.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-27 18:15:34 -08:00
Nathan Hjelm
11f23865e9
Merge pull request #7485 from awlauria/purge_more_atomics
Purge some leftover OPAL atomics.
2020-02-27 12:44:41 -08:00
Howard Pritchard
31d7748afd
Merge pull request #7434 from hppritcha/topic/fix_a_config_with_ext_pmix_prob
fix an issue with configuring with external pmix
2020-02-27 12:19:02 -07:00
Ralph Castain
0054de0de7
Update the PMIx and PRRTE pointers
- Deal with deprecated cmd line options and rndz files
- Protect against DVM collisions
- Update atomics
- Fix fork/exec and provide better tool support (PMIx)
- Ensure we cleanup completely upon terminating a tool/server that has
  dropped rendezvous files
- Fix multithreaded launch race on remote node
- Fix bug where multiple threads can be modifying app->env
- Add a --no-ready-msg option to prte
- Utilize the PMIx_Spawn ability to do a fork/exec on our behalf to better
  setup the "prte" DVM when running in proxy mode
- Provide better isolation between DVM instances
- Fix race condition in shutdown of PMIx fork/exec framework
- Ensure ompi personality gets added in proxy scenarios

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-27 08:54:35 -08:00
Austen Lauria
1a27555eec Purge some leftover OPAL atomics.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-02-27 11:30:00 -05:00
Jeff Squyres
85db54969e
Merge pull request #7422 from bgoglin/hwloc-cleanup
minor hwloc configure fixes
2020-02-27 06:20:29 -05:00
KAWASHIMA Takahiro
a9b16299c9
Merge pull request #7479 from yanagibashi/pr/fix-opal-initialized-ref-counter
opal: Fix opal_initialized reference counter
2020-02-27 08:49:02 +09:00
Nathan Hjelm
9cc0f6348d
Merge pull request #7453 from hjelmn/purge_sparc_v9_atomic_support_in_favor_of_just_builtins_for_this_platform
Purge Sparc v9 and sync atomics.
2020-02-26 14:39:46 -08:00
Nathan Hjelm
489c0840d1 asm: cleanup
Remove ASM formats as they have not been used in some time.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-26 13:04:04 -08:00
Nathan Hjelm
d2dd27b008 asm: remove support for Sparc v9
This commit removes the specialized support for Sparc v9 as the
architecture is unsupported. The architecture will continue to
work without CMA and using the GCC built-in atomic support.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-26 13:04:04 -08:00
Nathan Hjelm
038dcad8b5 asm: remove support for __sync built-in atomics
This commit removes the unsupported __sync built-in atomics in
favor of the GCC built-ins. The priority order (if not modified
by configure flags) is: C11, custom atomics
(opal/include/opal/sys/*), then GCC built-ins.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-26 06:30:34 -08:00
Nathan Hjelm
65a096116f opal: remove remaining atomic references to IA64
IA64 atomic support was deleted some time ago. Some of the references
to the architecture were not removed when the atomic support was. This
commit removes those lingering references. IA64 will continue to work
unsupported with the built-in atomics.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-26 06:22:37 -08:00
Tsubasa Yanagibashi
7d5fbcfd76 opal: Fix opal_initialized reference counter
Before this change, the reference counters `opal_util_initialized`
and `opal_initialized` were incremented at the beginning of the
`opal_init_util` and the `opal_init` functions respectively.
In other words, they were incremented before fully initialized.

This causes the following program to abort by SIGFPE if
`--enable-timing` is enabled on `configure`.

```c
// need -lm option on link

int main(int argc, char *argv[])
{
    // raise SIGFPE on division-by-zero
    feenableexcept(FE_DIVBYZERO);
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}
```

The logic of the SIGFPE is:

1. `MPI_Init` calls `opal_init` through `ompi_rte_init`.
2. `opal_init` changes the value of `opal_initialized` to 1.
3. `opal_init` calls `opal_init_util`.
4. `opal_init_util` calls `opal_timing_ts_func` through
   `OPAL_TIMING_ENV_INIT`, and `opal_timing_ts_func` returns
   `get_ts_cycle` instead of `get_ts_gettimeofday` because
   `opal_initialized` to 1.
   (This is the problem)
5. `opal_init_util` calls `get_ts_cycle` through
   `OPAL_TIMING_ENV_INIT`.
6. `get_ts_cycle` executes
   `opal_timer_base_get_cycles()) / opal_timer_base_get_freq()`
   and it raises SIGFPE (division-by-zero) because the OPAL TIMER
   framework is not initialized yet and `opal_timer_base_get_freq`
   returns 0.

This commit changes the increment timing of `opal_util_initialized`
and `opal_initialized` to the end of `opal_init_util` and the
`opal_init` functions respectively.

Signed-off-by: Tsubasa Yanagibashi <fj2505dt@aa.jp.fujitsu.com>
2020-02-26 14:09:19 +09:00
Jeff Squyres
cdf478e963 btl/sm: remove the deprecation-notice shell
The SM BTL was effectively removed a long time ago.  All that was left
was a shell that warned people if they tried to use the SM BTL.  For
v5.0, we plan to finally remove this ancient shell (and possibly
replace it with vader).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-02-25 11:48:42 -05:00
Howard Pritchard
488f656c11 fix an issue with configuring with external pmix
External pmix installs are frequently in non-standard locations and
the path to their shared libraries are not ldconfig'd in because
there may be multiple pmix installs.

The way the configury was set up prior to this patch, the configuration
would fail soon after the PMIX config stuff was called because it
added some pmix lib stuff to the LDFLAGS, resulting in configury tests
for things that require running the configure test to fail.

This patch avoids this problem by resetting the LDFLAGS and LIBS back
to what they were prior to the run of the external PMIX detection.

The CFLAGS setting is left because there are many places in the ompi
and opal source code where pmix_common.h needs to be included.

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2020-02-23 13:04:24 -07:00
Ralph Castain
86de81baca
Silence a bunch of warnings
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-22 13:05:28 -08:00
Jeff Squyres
7c76237e0d
Merge pull request #7449 from jsquyres/pr/update-hwloc-to-fix-make-dist
Update hwloc submodule to fix "make distcheck"
2020-02-22 05:26:04 -08:00
Ralph Castain
b1fff49f50
Merge pull request #7451 from rhc54/topic/slurm
Fix Slurm process name
2020-02-21 16:55:49 -08:00
Ralph Castain
4c56a7744a
Fix Slurm process name
Ensure that we truncate the local jobid to 15-bits

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-21 15:46:53 -08:00
Ralph Castain
4960b6b76a
Remove stale reference to MIPS
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-21 15:35:17 -08:00
Jeff Squyres
cdd3a9fbcc Update hwloc submodule to fix "make distcheck"
Hwloc upstream has fixed a problem with embedded "make distcheck" that
was breaking that surfaced when you ran autogen in an Open MPI
tarball.

This submodule update takes in the upstream hwloc fixes for this
issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-02-21 13:07:08 -08:00
Ralph Castain
f9643b84b9
Merge pull request #7441 from rhc54/topic/hack
Create a hack to protect against non-integer jobids
2020-02-21 11:28:51 -08:00
Jeff Squyres
66da0c6361 Remove "compress" OPAL framework
This framework is no longer used.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-02-21 06:28:16 -08:00
Ralph Castain
829fd478b3
Create a hack to protect against non-integer jobids
If someone gives us a namespace that doesn't easily translate to an
integer, we have to create a mechanism for working around the
disconnect. PRRTE has been updated to give us a flag so we know we were
"natively" launched. If we don't see it, then fall back to generating a
hash of the nspace as our jobid. We then have to translate back/forth
between nspace and jobid using a lookup table.

Probably not the right long-term solution, but hopefully helps get us
thru for a bit.

Includes update of PRRTE pointer

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-21 06:04:55 -08:00
Ralph Castain
254dd2288a
Remove the unused opal/pstat framework
ORTE was the only one who used it, and ORTE is...gone!

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-20 09:23:40 -08:00
Ralph Castain
b13c697d53
Update PRRTE and PMIx pointers
- remove stale s390 and MIPS atomics
- ensure envars from spawn are propagated
- fix make tarball
- ensure cleanup of default hostfile

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-19 22:51:19 -08:00
Brice Goglin
a0ea5abec8 hwloc: clarify the error message when infiniband/verbs.h is missing but hwloc's verbs support is requested
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2020-02-19 19:01:36 +01:00
Brice Goglin
b5df92a201 hwloc: remove a stale configure hack for hwloc 1.3.2 vs suse pci issues
opal_hwloc_hwloc132_save_enable_pci doesn't exist anymore

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2020-02-19 19:01:11 +01:00
Nathan Hjelm
b2f17241b3
Merge branch 'master' into opal_atomics_clean_out_sparcv9_as_it_is_not_supported 2020-02-19 08:16:46 -08:00
Nathan Hjelm
9eb5ef92da opal/asm: remove MIPS
This commit removes the code specific to MIPS. This architecture
has been unsupported for some time. Open MPI will continue to work
on MIPS with C11 and __atomic but will not longer use CMA for
shared memory.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-18 21:35:40 -08:00
Nathan Hjelm
3d1495510c opal/atomic: clean out s390(x)
This commit removes the CMA support for s390 and s390x. These
architectures have been unsupported for awhile and no one has
verified that CMA actually works with Open MPI on these systems.

s390 and s390x will continue to work with Open MPI without CMA.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-18 21:14:10 -08:00
Brice Goglin
06219648fc hwloc: remove unused xml configure-time check
It's not used, and XML is always enabled anyway

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2020-02-18 22:05:29 +01:00
Ralph Castain
82c71fae78
Update PRRTE and PMIx pointers
- Fix VPATH installs
- Protect against NULL home directories

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-17 21:45:22 -08:00