1
1

5868 Коммитов

Автор SHA1 Сообщение Дата
Austen Lauria
b560fc5fae
Merge pull request #7505 from hkuno/john.l.byrne/btl_ofi
Fix btl ofi clean-up logic
2020-03-23 10:10:33 -04:00
Ralph Castain
973d10159a
Merge pull request #7548 from jsquyres/pr/usnic-typo
usnic: remove typo
2020-03-20 14:55:46 -07:00
Ralph Castain
2979bb2ce8
Update PMIx and PRRTE to reduce mpirun complexity
Use "prte" instead of "prun" for proxy execution of cmds like mpirun.
This avoids the fork/exec-rendezvous complexities and should result in
more reliable operation.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-20 13:49:12 -07:00
Jeff Squyres
1870b04017 usnic: remove typo
Remove an amusing -- but harmless -- typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-03-20 11:16:52 -07:00
Ralph Castain
0dccd3378b
Update PMIx and PRRTE
PMIx
- fix several race conditions

PRRTE
- fix race condition
- extend prun-to-prte connection tries
- pass correct nspace to job ctrl in response to ctrl-c

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-18 11:46:38 -07:00
Ralph Castain
972f6aea7f
Update PMIx
- Silence a few (valid) warnings

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-17 08:53:43 -07:00
Ralph Castain
6b4fb509e9
Cleanup singleton detection and data retrieval
Extend the PMIx modex recv macros to cover the full set of
immediate/optional combinations. If PMIx_Init cannot reach a server,
then declare the MPI proc to be a singleton.

Provide full support for info values via PMIx

Catch all the values used in the "info" area of OMPI using data
available from PMIx instead of via envars. Update PMIx and PRRTE to sync
with their capabilities.

PMIx
- ensure cleanup of fork/exec children
- fix bug in gds/hash that left app info off of list

PRRTE
- fix multi-app bugs
- port setup_child logic from orte
- OMPI env changes
- set app->first_rank
- ensure common hostname across prun, prte, and pmix
- Fix "nolocal" support

Silence a warning from btl/vader

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-16 12:25:28 -07:00
Artem Polyakov
9ffee9859f
Merge pull request #7512 from artpol84/topic/master/timings_update
Topic/master/timings update
2020-03-12 17:37:42 -07:00
Austen Lauria
7c31586c6d
Merge pull request #7501 from awlauria/finalize_leaks_ggouaillardet_awlauria
Finalize memchecker calls and one memory leak
2020-03-11 13:04:50 -04:00
Harumi Kuno
ab4875ddc2 set ep to NULL to avoid double close
Per suggestion of @awlauria

Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
2020-03-10 17:39:59 -06:00
Ralph Castain
18b06430d3
Update PRRTE and PMIx
- Avoid modifying single-dash options of applications
- Fix fetch of node/app-level info
- Return correct status code

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-09 18:23:43 -07:00
Gilles Gouaillardet
69bc2e8372 misc: fix <> vs "" includes throught the ompi codebase
This commit fixes an issue with the include usage in some
ompi source files. These source files are using the <> form
of include when the "" form is correct (as these are internal,
**not** system headers).

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-03-09 21:13:49 -04:00
Harumi Kuno
1bc3dab118 Add comments about order of close ops
Per suggestion of @awlauria, added some comments about
the need to free ep before resources it points to.

Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
2020-03-07 14:08:39 -07:00
Artem Polyakov
7c17a38c96 timings: Fix timings when 'prefix' is used
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2020-03-07 09:36:43 -08:00
Ralph Castain
836cc5b6a0
Merge pull request #7498 from rhc54/topic/again
Update PRRTE and PMIx
2020-03-06 11:31:33 -08:00
Ralph Castain
d454bf1f20
Update PRRTE and PMIx
PMIx:
- Ensure that launchers open all required frameworks
- Pass back the tool's ID
- Fix race condition in IOF

PRRTE:
- Begin conversion to use of nspace in place of numeric jobid
- Restore support:
    --report-bindings
    --display-map
    --display-devel-map
    --display-topo
    --do-not-launch
    --xml-output
    --display-allocation

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-06 10:04:41 -08:00
Howard Pritchard
8d59512a9e
Merge pull request #7506 from hppritcha/topic/address_issue7458
check for external libevent and hwloc
2020-03-06 09:49:15 -07:00
Austen Lauria
04a3a28a74 Some memchecker cleanup and others.
- Port memchecker call from a1d502c.
- Remove unused memcheck macro variables.
- Some code readability improvements.
- Remove some stray +1's in dynamic comm cleanup.
- Re-add OPAL_ENABLE_DEBUG macro to osc header.
- Cleanup some printf's, and includes.
- Refactor cleanup of dpm_disconnect_objs.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-03-05 16:44:18 -05:00
Howard Pritchard
2990d8d98b check for external libevent and hwloc
when building with external PMIx.

Related #7458

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-03-05 14:30:57 -07:00
Gilles Gouaillardet
ff746153d7 mpool/base: silence a valgrind warning
by adding a constructor to mca_mpool_base_tree_item_t

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-03-05 16:10:42 -05:00
harumi kuno
3095fabf94 Fix mca_btl_ofi_finalize clean-up logic
This fix is from John L. Byrne (john.l.byrne@hpe.com).

When OFI Libfabric binds objects to endpoints, before the object can
be successfully closed, the endpoint must first be freed.  For scalable
endpoints, objects can also be bound to transmit and receive contexts,
and for objects that are bound to contexts, we need to first free the
contexts before freeing the endpoint. We also need to clear the memory
registration cache.

If we don't clean up properly, then fi\_close may not be able to close
the domain because the dom will have a non-zero ref count.

Signed-off-by: harumi kuno <harumi.kuno@hpe.com>
2020-03-04 17:51:08 -07:00
Austen Lauria
f69c8d6819 Fix segv in btl/vader.
Keep track of the connected procs in vader_add_procs().
Otherwise, the same rank will reconnect the same shmem
segment (rank 0+...) multiple times instead of the next
one as intended.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-03-04 09:32:58 -05:00
Ralph Castain
c537bef7d5
Update the PRRTE and PMIx pointers
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-28 19:55:42 -08:00
Ralph Castain
cbbe67eff9
Merge pull request #7487 from bosilca/topic/pml_from_vpid0
Make sure the PML selection is consistent across the world.
2020-02-28 17:19:26 -08:00
bosilca
c4d36859ec
Merge pull request #7228 from devreal/progress-returns
Harmonize return values of progress callbacks
2020-02-28 20:15:37 -05:00
George Bosilca
21d743393f
Make sure the PML is consistent across the world.
Temporary solution for the PML inconsistency issue discussed in #7475.
This patch address 2 things: first it make the PMIx key optional so that
if we are not in a full modex mode we don't do a direct modex, and
second it get the PML info from the vpid 0 instead of from the local
rank.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-02-28 17:53:48 -05:00
Ralph Castain
c79c95039e
Merge pull request #7474 from rhc54/topic/up
Update the PMIx and PRRTE pointers
2020-02-27 19:36:08 -08:00
Ralph Castain
9e2db26732
Fix vader local modex
Restrict the search to the "immediate" range so at worst we check with
our local server and don't go up to the host daemon.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-27 18:15:34 -08:00
Nathan Hjelm
11f23865e9
Merge pull request #7485 from awlauria/purge_more_atomics
Purge some leftover OPAL atomics.
2020-02-27 12:44:41 -08:00
Howard Pritchard
31d7748afd
Merge pull request #7434 from hppritcha/topic/fix_a_config_with_ext_pmix_prob
fix an issue with configuring with external pmix
2020-02-27 12:19:02 -07:00
Ralph Castain
0054de0de7
Update the PMIx and PRRTE pointers
- Deal with deprecated cmd line options and rndz files
- Protect against DVM collisions
- Update atomics
- Fix fork/exec and provide better tool support (PMIx)
- Ensure we cleanup completely upon terminating a tool/server that has
  dropped rendezvous files
- Fix multithreaded launch race on remote node
- Fix bug where multiple threads can be modifying app->env
- Add a --no-ready-msg option to prte
- Utilize the PMIx_Spawn ability to do a fork/exec on our behalf to better
  setup the "prte" DVM when running in proxy mode
- Provide better isolation between DVM instances
- Fix race condition in shutdown of PMIx fork/exec framework
- Ensure ompi personality gets added in proxy scenarios

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-27 08:54:35 -08:00
Austen Lauria
1a27555eec Purge some leftover OPAL atomics.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-02-27 11:30:00 -05:00
Jeff Squyres
85db54969e
Merge pull request #7422 from bgoglin/hwloc-cleanup
minor hwloc configure fixes
2020-02-27 06:20:29 -05:00
KAWASHIMA Takahiro
a9b16299c9
Merge pull request #7479 from yanagibashi/pr/fix-opal-initialized-ref-counter
opal: Fix opal_initialized reference counter
2020-02-27 08:49:02 +09:00
Nathan Hjelm
9cc0f6348d
Merge pull request #7453 from hjelmn/purge_sparc_v9_atomic_support_in_favor_of_just_builtins_for_this_platform
Purge Sparc v9 and sync atomics.
2020-02-26 14:39:46 -08:00
Nathan Hjelm
489c0840d1 asm: cleanup
Remove ASM formats as they have not been used in some time.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-26 13:04:04 -08:00
Nathan Hjelm
d2dd27b008 asm: remove support for Sparc v9
This commit removes the specialized support for Sparc v9 as the
architecture is unsupported. The architecture will continue to
work without CMA and using the GCC built-in atomic support.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-26 13:04:04 -08:00
Nathan Hjelm
038dcad8b5 asm: remove support for __sync built-in atomics
This commit removes the unsupported __sync built-in atomics in
favor of the GCC built-ins. The priority order (if not modified
by configure flags) is: C11, custom atomics
(opal/include/opal/sys/*), then GCC built-ins.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-26 06:30:34 -08:00
Nathan Hjelm
65a096116f opal: remove remaining atomic references to IA64
IA64 atomic support was deleted some time ago. Some of the references
to the architecture were not removed when the atomic support was. This
commit removes those lingering references. IA64 will continue to work
unsupported with the built-in atomics.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-02-26 06:22:37 -08:00
Tsubasa Yanagibashi
7d5fbcfd76 opal: Fix opal_initialized reference counter
Before this change, the reference counters `opal_util_initialized`
and `opal_initialized` were incremented at the beginning of the
`opal_init_util` and the `opal_init` functions respectively.
In other words, they were incremented before fully initialized.

This causes the following program to abort by SIGFPE if
`--enable-timing` is enabled on `configure`.

```c
// need -lm option on link

int main(int argc, char *argv[])
{
    // raise SIGFPE on division-by-zero
    feenableexcept(FE_DIVBYZERO);
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}
```

The logic of the SIGFPE is:

1. `MPI_Init` calls `opal_init` through `ompi_rte_init`.
2. `opal_init` changes the value of `opal_initialized` to 1.
3. `opal_init` calls `opal_init_util`.
4. `opal_init_util` calls `opal_timing_ts_func` through
   `OPAL_TIMING_ENV_INIT`, and `opal_timing_ts_func` returns
   `get_ts_cycle` instead of `get_ts_gettimeofday` because
   `opal_initialized` to 1.
   (This is the problem)
5. `opal_init_util` calls `get_ts_cycle` through
   `OPAL_TIMING_ENV_INIT`.
6. `get_ts_cycle` executes
   `opal_timer_base_get_cycles()) / opal_timer_base_get_freq()`
   and it raises SIGFPE (division-by-zero) because the OPAL TIMER
   framework is not initialized yet and `opal_timer_base_get_freq`
   returns 0.

This commit changes the increment timing of `opal_util_initialized`
and `opal_initialized` to the end of `opal_init_util` and the
`opal_init` functions respectively.

Signed-off-by: Tsubasa Yanagibashi <fj2505dt@aa.jp.fujitsu.com>
2020-02-26 14:09:19 +09:00
Jeff Squyres
cdf478e963 btl/sm: remove the deprecation-notice shell
The SM BTL was effectively removed a long time ago.  All that was left
was a shell that warned people if they tried to use the SM BTL.  For
v5.0, we plan to finally remove this ancient shell (and possibly
replace it with vader).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-02-25 11:48:42 -05:00
Howard Pritchard
488f656c11 fix an issue with configuring with external pmix
External pmix installs are frequently in non-standard locations and
the path to their shared libraries are not ldconfig'd in because
there may be multiple pmix installs.

The way the configury was set up prior to this patch, the configuration
would fail soon after the PMIX config stuff was called because it
added some pmix lib stuff to the LDFLAGS, resulting in configury tests
for things that require running the configure test to fail.

This patch avoids this problem by resetting the LDFLAGS and LIBS back
to what they were prior to the run of the external PMIX detection.

The CFLAGS setting is left because there are many places in the ompi
and opal source code where pmix_common.h needs to be included.

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2020-02-23 13:04:24 -07:00
Ralph Castain
86de81baca
Silence a bunch of warnings
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-22 13:05:28 -08:00
Jeff Squyres
7c76237e0d
Merge pull request #7449 from jsquyres/pr/update-hwloc-to-fix-make-dist
Update hwloc submodule to fix "make distcheck"
2020-02-22 05:26:04 -08:00
Ralph Castain
b1fff49f50
Merge pull request #7451 from rhc54/topic/slurm
Fix Slurm process name
2020-02-21 16:55:49 -08:00
Ralph Castain
4c56a7744a
Fix Slurm process name
Ensure that we truncate the local jobid to 15-bits

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-21 15:46:53 -08:00
Ralph Castain
4960b6b76a
Remove stale reference to MIPS
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-21 15:35:17 -08:00
Jeff Squyres
cdd3a9fbcc Update hwloc submodule to fix "make distcheck"
Hwloc upstream has fixed a problem with embedded "make distcheck" that
was breaking that surfaced when you ran autogen in an Open MPI
tarball.

This submodule update takes in the upstream hwloc fixes for this
issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-02-21 13:07:08 -08:00
Ralph Castain
f9643b84b9
Merge pull request #7441 from rhc54/topic/hack
Create a hack to protect against non-integer jobids
2020-02-21 11:28:51 -08:00
Jeff Squyres
66da0c6361 Remove "compress" OPAL framework
This framework is no longer used.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-02-21 06:28:16 -08:00