1
1

63 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
f61774fb83
Update PMIx
Pickup fixes in the OMPI envar setting support

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-15 13:02:20 -07:00
Ralph Castain
3252d26183
Sync to PMIx and PRRTE master
- fix potential hang in direct modex
- add support for reachable framework in PRRTE/oob

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-14 10:05:47 -07:00
Ralph Castain
4ab74450d4
Update PRRTE and PMIx
- ensure we return timeout error status
- lots of various bug fixes

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-12 07:05:12 -07:00
Ralph Castain
f32febd7f7
Update PMIx and PRRTE
PMIx:
- restore OPA support

PRRTE:
Restore support for several options
* -N for ppr:N:node
* INHERIT modifier for --map-by option, indicating that
  the spawned job should inherit the placement options
  of its parent. Only applicable to dynamically spawned
  jobs

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-08 09:24:44 -07:00
Ralph Castain
80568bb388
Update for support of PMIX_NUMA_RANK values
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-06 13:06:26 -07:00
Ralph Castain
0d52c2dad7
Sync updates for PMIx and PRRTE
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-06 08:42:13 -07:00
Ralph Castain
3fbfeabff2
Update PRRTE schizo framework
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-03 11:37:19 -07:00
Ralph Castain
44c97e3842
Update again and hope to fix the integration points
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-02 11:58:58 -07:00
Ralph Castain
50d05e7b64
Revert "Add extra libs to PRRTE binaries for external deps"
This reverts commit 1aabbe456d5de8fc3c3d6f91eabb8851db7d63eb.

Update PMIx and PRRTE, plus PRRTE config integration

Cleanup how we pass the extra libs and LDFLAGS for linking against
external libevent, hwloc, and pmix installs.

Catch the flag indicating that PMIx provided the user-level default MCA
params so we don't go looking for them ourselves.

Cleanup misc config warnings

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-01 20:17:32 -07:00
Ralph Castain
f88f271054
Cleanup few errors associated with tool support
Properly mark/detect that a daemon sourced the event broadcast to avoid
reinjecting it into the PMIx server library. Correct the source field
for the event notify call on launcher ready.

Update event notification for tool support
Deal with a variety of race conditions related to tool reconnection to a
different server.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-29 11:58:43 -07:00
Ralph Castain
1cf972dcaf
Update PMIx and PRRTE
Deprecate --am and --amca options

Avoid default param files on backend nodes
Any parameters in the PRRTE default or user param files will have been
picked up by prte and included in the environment sent to the prted, so
don't open those files on the backend.

Avoid picking up MCA param file info on backend
Avoid the scaling problem at PRRTE startup by only reading the system
and user param files on the frontend.

Complete revisions to cmd line parser for OMPI
Per specification, enforce following precedence order:

1. system-level default parameter file
1. user-level default parameter file
1. Anything found in the environment
1. "--tune" files. Note that "--amca" goes away and becomes equivalent to "--tune". Okay if it is provided more than once on a cmd line (we will aggregate the list of files, retaining order), but an error if a parameter is referenced in more than one file with a different value
1. "--mca" options. Again, error if the same option appears more than once with a different value. Allowed to override a parameter referenced in a "tune" file
1. "-x" options. Allowed to overwrite options given in a "tune" file, but cannot conflict with an explicit "--mca" option
1. all other options

Fix special handling of "-np"

Get agreement on jobid across the layers
Need all three pieces (PRRTE, PMIx, and OPAL) to agree on the nspace
conversion to jobid method

Ensure prte show_help messages get output
Print abnormal termination messages
Cleanup error reporting in persistent operations

Signed-off-by: Ralph Castain <rhc@pmix.org>

dd

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-26 16:01:11 -07:00
Ralph Castain
43f79be2e3
Update PMIx and PRRTE
Fix singleton operations and ensure notification upon tool connection.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 11:18:23 -07:00
Ralph Castain
2979bb2ce8
Update PMIx and PRRTE to reduce mpirun complexity
Use "prte" instead of "prun" for proxy execution of cmds like mpirun.
This avoids the fork/exec-rendezvous complexities and should result in
more reliable operation.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-20 13:49:12 -07:00
Ralph Castain
0dccd3378b
Update PMIx and PRRTE
PMIx
- fix several race conditions

PRRTE
- fix race condition
- extend prun-to-prte connection tries
- pass correct nspace to job ctrl in response to ctrl-c

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-18 11:46:38 -07:00
Ralph Castain
972f6aea7f
Update PMIx
- Silence a few (valid) warnings

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-17 08:53:43 -07:00
Ralph Castain
6b4fb509e9
Cleanup singleton detection and data retrieval
Extend the PMIx modex recv macros to cover the full set of
immediate/optional combinations. If PMIx_Init cannot reach a server,
then declare the MPI proc to be a singleton.

Provide full support for info values via PMIx

Catch all the values used in the "info" area of OMPI using data
available from PMIx instead of via envars. Update PMIx and PRRTE to sync
with their capabilities.

PMIx
- ensure cleanup of fork/exec children
- fix bug in gds/hash that left app info off of list

PRRTE
- fix multi-app bugs
- port setup_child logic from orte
- OMPI env changes
- set app->first_rank
- ensure common hostname across prun, prte, and pmix
- Fix "nolocal" support

Silence a warning from btl/vader

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-16 12:25:28 -07:00
Ralph Castain
18b06430d3
Update PRRTE and PMIx
- Avoid modifying single-dash options of applications
- Fix fetch of node/app-level info
- Return correct status code

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-09 18:23:43 -07:00
Ralph Castain
d454bf1f20
Update PRRTE and PMIx
PMIx:
- Ensure that launchers open all required frameworks
- Pass back the tool's ID
- Fix race condition in IOF

PRRTE:
- Begin conversion to use of nspace in place of numeric jobid
- Restore support:
    --report-bindings
    --display-map
    --display-devel-map
    --display-topo
    --do-not-launch
    --xml-output
    --display-allocation

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-06 10:04:41 -08:00
Ralph Castain
c537bef7d5
Update the PRRTE and PMIx pointers
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-28 19:55:42 -08:00
Ralph Castain
9e2db26732
Fix vader local modex
Restrict the search to the "immediate" range so at worst we check with
our local server and don't go up to the host daemon.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-27 18:15:34 -08:00
Ralph Castain
0054de0de7
Update the PMIx and PRRTE pointers
- Deal with deprecated cmd line options and rndz files
- Protect against DVM collisions
- Update atomics
- Fix fork/exec and provide better tool support (PMIx)
- Ensure we cleanup completely upon terminating a tool/server that has
  dropped rendezvous files
- Fix multithreaded launch race on remote node
- Fix bug where multiple threads can be modifying app->env
- Add a --no-ready-msg option to prte
- Utilize the PMIx_Spawn ability to do a fork/exec on our behalf to better
  setup the "prte" DVM when running in proxy mode
- Provide better isolation between DVM instances
- Fix race condition in shutdown of PMIx fork/exec framework
- Ensure ompi personality gets added in proxy scenarios

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-27 08:54:35 -08:00
Ralph Castain
b13c697d53
Update PRRTE and PMIx pointers
- remove stale s390 and MIPS atomics
- ensure envars from spawn are propagated
- fix make tarball
- ensure cleanup of default hostfile

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-19 22:51:19 -08:00
Ralph Castain
82c71fae78
Update PRRTE and PMIx pointers
- Fix VPATH installs
- Protect against NULL home directories

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-17 21:45:22 -08:00
Ralph Castain
274fba3126
Update PRRTE and PMIx
Correct platform file support
Fix configure cli capture to silence warning

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-15 19:38:34 -08:00
Ralph Castain
344346f27e
Provide hooks for PRRTE and PMIx platform files
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-15 08:40:56 -08:00
Ralph Castain
133e8eba22
Resolve the PMIx v3 incompatibility
Fix a couple of spots in OMPI to resolve warnings. The one in comm_cid
in particular may be responsible for some/all of the comm_spawn issues
as it was passing an incorrect pointer to a macro, thus causing memory
corruption.

Update PRRTE and PMIx to deal with v3/v4 differences.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-14 21:01:10 -08:00
Ralph Castain
e0141f10e6
Fix potential hangs for fast-fail jobs
When a job fails very quickly, it is possible that the spawning tool
won't get notified that the spawn completed prior to be told that the
job terminated. This can cause the tool to "hang" in PMIx_Spawn. Ensure
that PRRTE handles this case by guaranteeing we notify the spawner.

Track both PRRTE and PMIx masters as both have changed, though only the
PRRTE one is involved in this particular fix.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-13 10:59:18 -08:00
Ralph Castain
1a5647ddbe
Update PRRTE
- Ensure we accurately handle node name aliases
- Apply the local launch environ to apps prior to spawn
- Add error check if PMIx_Spawn fails
- Fix compiler warning
- Fix PGI vendor check
- Prevent mpirun from hanging if prte segfaults
- Fix absolute/relative path names to ensure "prte" and
  "prted" are taken from same distribution as "mpirun"

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-12 12:05:41 -08:00
Ralph Castain
2f7338f682
Update PMIx and PRRTE pointers
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-09 07:43:15 -08:00
Gilles Gouaillardet
174e967dbc
Remove ORTE project
Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.

remove orte: misc fixes

 - UCX fixes
 - VPATH issue
 - oshmem fixes
 - remove useless definition
 - Add PRRTE submodule
 - Get autogen.pl to traverse PRRTE submodule
 - Remove stale orcm reference
 - Configure embedded PRRTE
 - Correctly pass the prefix to PRRTE
 - Correctly set the OMPI_WANT_PRRTE am_conditional
 - Move prrte configuration to the end of OMPI's configure.ac
 - Make mpirun a symlink to prun, when available
 - Fix makedist with --no-orte/--no-prrte option
 - Add a `--no-prrte` option which is the same as the legacy
   `--no-orte` option.
 - Remove embedded PMIx tarball. Replace it with new submodule
   pointing to OpenPMIx master repo's master branch
 - Some cleanup in PRRTE integration and add config summary entry
 - Correctly set the hostname
 - Fix locality
 - Fix singleton operations
 - Fix support for "tune" and "am" options

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-02-07 18:20:06 -08:00
Eisuke Kawashima
d26d4e1d63
Fix typo and update URLs (https, redirection) [skip ci]
Signed-off-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
2020-01-07 03:52:25 +09:00
Gilles Gouaillardet
1c4a3598d0 pmix/pmix4x: refresh to the latest open PMIx master
refresh to openpmix/openpmix@ea3b29b1a4

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-01 14:27:22 +09:00
Ralph Castain
9a2902c047
Force use of pmix/preg/native component
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-03 17:54:44 -07:00
Ralph Castain
2ebc1fa1b7
Update to current PMIx 4.0.0 (master)
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-03 14:46:55 -07:00
Ralph Castain
c5c93e3391
Update to PMIx master
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-07-29 12:20:20 -07:00
Ralph Castain
d202e10c14
Provide locality for all procs on node
Update PMIx to latest master to get supporting updates. For
connect/accept (part of comm_spawn as well), lookup locality for all
participating procs on the node and compute the relative locality so it
can be used for MPI operations.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-07-22 09:23:38 -07:00
Gilles Gouaillardet
4510711e95 pmix/pmix4x: refresh to the latest PMIx master
refresh to pmix/pmix@03a8b5daab

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-07-16 11:31:20 +09:00
Gilles Gouaillardet
63aa156bb0 pmix/pmix4x: refresh to the latest PMIx master
refresh to pmix/pmix@99971222ce

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-06-27 09:35:49 +09:00
Gilles Gouaillardet
5679a88867 pmix/pmix4x: refresh to the latest PMIx master
refresh to pmix/pmix@f67efc835c

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-06-24 10:17:23 +09:00
Gilles Gouaillardet
d9326ff2ca pmix/pmix4x: refresh to the latest PMIx master
refresh to pmix/pmix@186dca196c

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-06-10 15:17:43 +09:00
Gilles Gouaillardet
562809fca1 pmix/pmix4x: refresh to the latest PMIx master
refrest pmi4x to pmix/pmix@bde4a8a54f

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-04-23 09:31:43 +09:00
Ralph Castain
f4aa783848 Remove all the linkages back to libpmix in pmix components
This link-back seems to be breaking OMPI for some reason. I'm not sure we need it in PMIx anyway, but we'll investigate over there.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-04-16 13:30:20 -07:00
Gilles Gouaillardet
9ce8d7b568 pmix/pmix4x: refresh to the latest PMIx
refrest pmi4x to pmix/pmix@2531c0c3d1

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-04-09 14:03:00 +09:00
Gilles Gouaillardet
e844f76725 pmix/pmix4x: refresh to the latest PMIx
refrest pmi4x to pmix/pmix@20cc9c041e

Fixes open-mpi/ompi#6513

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-03-25 13:33:18 +09:00
Josh Hursey
53cd31ed7e
Merge pull request #6504 from jjhursey/rm-hash-pmix4
Do not force 'hash' gds on direct modex in pmix4x
2019-03-19 20:35:12 -05:00
Ralph Castain
c4be211741 Sync to latest PMIx master
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-03-19 10:27:12 -07:00
Joshua Hursey
1314cf2640 Do not force 'hash' gds on direct modex in pmix4x
* Forcing the 'hash' gds component should not be necessary any more.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-03-19 11:53:26 -05:00
Ralph Castain
677ce0a69f Update PMIx configure logic in the embedded component
PMIx is removing the --enable-embedded-libevent and
--enable-embedded-hwloc flags as they are confusing users. Instead, we
will use the --enable-embedded-mode to handle both of these options.
Update the embedded configury to handle it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-06 17:15:44 -08:00
Ralph Castain
baef25338a Update to latest PRI master
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-04 10:10:58 -08:00
Ralph Castain
1e70ea73f4 Update to latest PMIx master (v4.0)
Update the OPAL glue configure code to correctly link the opal/pmix4
component to the hwloc used by OMPI instead of defaulting to the
system-level hwloc. Required a corresponding update to the PMIx hwloc
configure code so we treat hwloc the same way we handle libevent in
embedded scenarios.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-10 09:37:56 -08:00