1
1

685 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
1cf972dcaf
Update PMIx and PRRTE
Deprecate --am and --amca options

Avoid default param files on backend nodes
Any parameters in the PRRTE default or user param files will have been
picked up by prte and included in the environment sent to the prted, so
don't open those files on the backend.

Avoid picking up MCA param file info on backend
Avoid the scaling problem at PRRTE startup by only reading the system
and user param files on the frontend.

Complete revisions to cmd line parser for OMPI
Per specification, enforce following precedence order:

1. system-level default parameter file
1. user-level default parameter file
1. Anything found in the environment
1. "--tune" files. Note that "--amca" goes away and becomes equivalent to "--tune". Okay if it is provided more than once on a cmd line (we will aggregate the list of files, retaining order), but an error if a parameter is referenced in more than one file with a different value
1. "--mca" options. Again, error if the same option appears more than once with a different value. Allowed to override a parameter referenced in a "tune" file
1. "-x" options. Allowed to overwrite options given in a "tune" file, but cannot conflict with an explicit "--mca" option
1. all other options

Fix special handling of "-np"

Get agreement on jobid across the layers
Need all three pieces (PRRTE, PMIx, and OPAL) to agree on the nspace
conversion to jobid method

Ensure prte show_help messages get output
Print abnormal termination messages
Cleanup error reporting in persistent operations

Signed-off-by: Ralph Castain <rhc@pmix.org>

dd

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-26 16:01:11 -07:00
Ralph Castain
43f79be2e3
Update PMIx and PRRTE
Fix singleton operations and ensure notification upon tool connection.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 11:18:23 -07:00
Ralph Castain
2979bb2ce8
Update PMIx and PRRTE to reduce mpirun complexity
Use "prte" instead of "prun" for proxy execution of cmds like mpirun.
This avoids the fork/exec-rendezvous complexities and should result in
more reliable operation.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-20 13:49:12 -07:00
Ralph Castain
0dccd3378b
Update PMIx and PRRTE
PMIx
- fix several race conditions

PRRTE
- fix race condition
- extend prun-to-prte connection tries
- pass correct nspace to job ctrl in response to ctrl-c

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-18 11:46:38 -07:00
Ralph Castain
972f6aea7f
Update PMIx
- Silence a few (valid) warnings

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-17 08:53:43 -07:00
Ralph Castain
6b4fb509e9
Cleanup singleton detection and data retrieval
Extend the PMIx modex recv macros to cover the full set of
immediate/optional combinations. If PMIx_Init cannot reach a server,
then declare the MPI proc to be a singleton.

Provide full support for info values via PMIx

Catch all the values used in the "info" area of OMPI using data
available from PMIx instead of via envars. Update PMIx and PRRTE to sync
with their capabilities.

PMIx
- ensure cleanup of fork/exec children
- fix bug in gds/hash that left app info off of list

PRRTE
- fix multi-app bugs
- port setup_child logic from orte
- OMPI env changes
- set app->first_rank
- ensure common hostname across prun, prte, and pmix
- Fix "nolocal" support

Silence a warning from btl/vader

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-16 12:25:28 -07:00
Ralph Castain
18b06430d3
Update PRRTE and PMIx
- Avoid modifying single-dash options of applications
- Fix fetch of node/app-level info
- Return correct status code

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-09 18:23:43 -07:00
Ralph Castain
836cc5b6a0
Merge pull request #7498 from rhc54/topic/again
Update PRRTE and PMIx
2020-03-06 11:31:33 -08:00
Ralph Castain
d454bf1f20
Update PRRTE and PMIx
PMIx:
- Ensure that launchers open all required frameworks
- Pass back the tool's ID
- Fix race condition in IOF

PRRTE:
- Begin conversion to use of nspace in place of numeric jobid
- Restore support:
    --report-bindings
    --display-map
    --display-devel-map
    --display-topo
    --do-not-launch
    --xml-output
    --display-allocation

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-06 10:04:41 -08:00
Howard Pritchard
2990d8d98b check for external libevent and hwloc
when building with external PMIx.

Related #7458

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-03-05 14:30:57 -07:00
Ralph Castain
c537bef7d5
Update the PRRTE and PMIx pointers
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-28 19:55:42 -08:00
Ralph Castain
cbbe67eff9
Merge pull request #7487 from bosilca/topic/pml_from_vpid0
Make sure the PML selection is consistent across the world.
2020-02-28 17:19:26 -08:00
George Bosilca
21d743393f
Make sure the PML is consistent across the world.
Temporary solution for the PML inconsistency issue discussed in #7475.
This patch address 2 things: first it make the PMIx key optional so that
if we are not in a full modex mode we don't do a direct modex, and
second it get the PML info from the vpid 0 instead of from the local
rank.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-02-28 17:53:48 -05:00
Ralph Castain
c79c95039e
Merge pull request #7474 from rhc54/topic/up
Update the PMIx and PRRTE pointers
2020-02-27 19:36:08 -08:00
Ralph Castain
9e2db26732
Fix vader local modex
Restrict the search to the "immediate" range so at worst we check with
our local server and don't go up to the host daemon.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-27 18:15:34 -08:00
Howard Pritchard
31d7748afd
Merge pull request #7434 from hppritcha/topic/fix_a_config_with_ext_pmix_prob
fix an issue with configuring with external pmix
2020-02-27 12:19:02 -07:00
Ralph Castain
0054de0de7
Update the PMIx and PRRTE pointers
- Deal with deprecated cmd line options and rndz files
- Protect against DVM collisions
- Update atomics
- Fix fork/exec and provide better tool support (PMIx)
- Ensure we cleanup completely upon terminating a tool/server that has
  dropped rendezvous files
- Fix multithreaded launch race on remote node
- Fix bug where multiple threads can be modifying app->env
- Add a --no-ready-msg option to prte
- Utilize the PMIx_Spawn ability to do a fork/exec on our behalf to better
  setup the "prte" DVM when running in proxy mode
- Provide better isolation between DVM instances
- Fix race condition in shutdown of PMIx fork/exec framework
- Ensure ompi personality gets added in proxy scenarios

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-27 08:54:35 -08:00
Howard Pritchard
488f656c11 fix an issue with configuring with external pmix
External pmix installs are frequently in non-standard locations and
the path to their shared libraries are not ldconfig'd in because
there may be multiple pmix installs.

The way the configury was set up prior to this patch, the configuration
would fail soon after the PMIX config stuff was called because it
added some pmix lib stuff to the LDFLAGS, resulting in configury tests
for things that require running the configure test to fail.

This patch avoids this problem by resetting the LDFLAGS and LIBS back
to what they were prior to the run of the external PMIX detection.

The CFLAGS setting is left because there are many places in the ompi
and opal source code where pmix_common.h needs to be included.

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2020-02-23 13:04:24 -07:00
Ralph Castain
4c56a7744a
Fix Slurm process name
Ensure that we truncate the local jobid to 15-bits

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-21 15:46:53 -08:00
Ralph Castain
829fd478b3
Create a hack to protect against non-integer jobids
If someone gives us a namespace that doesn't easily translate to an
integer, we have to create a mechanism for working around the
disconnect. PRRTE has been updated to give us a flag so we know we were
"natively" launched. If we don't see it, then fall back to generating a
hash of the nspace as our jobid. We then have to translate back/forth
between nspace and jobid using a lookup table.

Probably not the right long-term solution, but hopefully helps get us
thru for a bit.

Includes update of PRRTE pointer

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-21 06:04:55 -08:00
Ralph Castain
b13c697d53
Update PRRTE and PMIx pointers
- remove stale s390 and MIPS atomics
- ensure envars from spawn are propagated
- fix make tarball
- ensure cleanup of default hostfile

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-19 22:51:19 -08:00
Ralph Castain
82c71fae78
Update PRRTE and PMIx pointers
- Fix VPATH installs
- Protect against NULL home directories

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-17 21:45:22 -08:00
Ralph Castain
274fba3126
Update PRRTE and PMIx
Correct platform file support
Fix configure cli capture to silence warning

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-15 19:38:34 -08:00
Ralph Castain
edaf9160ae
Enable build against PMIx v2.2 without internal PRRTE
If you autogen.pl --without-prrte, we wouldn't configure or build PRRTE
support. However, configuring with --disable-internal-rte wasn't working
as it was being ignored. This led to some false errors when compiling
with an earlier PMIx v2.2 release.

That said, there were a couple of places that needed protection against
PMIx v2.2.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-15 16:28:16 -08:00
Ralph Castain
344346f27e
Provide hooks for PRRTE and PMIx platform files
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-15 08:40:56 -08:00
Ralph Castain
133e8eba22
Resolve the PMIx v3 incompatibility
Fix a couple of spots in OMPI to resolve warnings. The one in comm_cid
in particular may be responsible for some/all of the comm_spawn issues
as it was passing an incorrect pointer to a macro, thus causing memory
corruption.

Update PRRTE and PMIx to deal with v3/v4 differences.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-14 21:01:10 -08:00
Ralph Castain
e0141f10e6
Fix potential hangs for fast-fail jobs
When a job fails very quickly, it is possible that the spawning tool
won't get notified that the spawn completed prior to be told that the
job terminated. This can cause the tool to "hang" in PMIx_Spawn. Ensure
that PRRTE handles this case by guaranteeing we notify the spawner.

Track both PRRTE and PMIx masters as both have changed, though only the
PRRTE one is involved in this particular fix.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-13 10:59:18 -08:00
Ralph Castain
1a5647ddbe
Update PRRTE
- Ensure we accurately handle node name aliases
- Apply the local launch environ to apps prior to spawn
- Add error check if PMIx_Spawn fails
- Fix compiler warning
- Fix PGI vendor check
- Prevent mpirun from hanging if prte segfaults
- Fix absolute/relative path names to ensure "prte" and
  "prted" are taken from same distribution as "mpirun"

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-12 12:05:41 -08:00
Ralph Castain
2f7338f682
Update PMIx and PRRTE pointers
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-09 07:43:15 -08:00
Gilles Gouaillardet
174e967dbc
Remove ORTE project
Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.

remove orte: misc fixes

 - UCX fixes
 - VPATH issue
 - oshmem fixes
 - remove useless definition
 - Add PRRTE submodule
 - Get autogen.pl to traverse PRRTE submodule
 - Remove stale orcm reference
 - Configure embedded PRRTE
 - Correctly pass the prefix to PRRTE
 - Correctly set the OMPI_WANT_PRRTE am_conditional
 - Move prrte configuration to the end of OMPI's configure.ac
 - Make mpirun a symlink to prun, when available
 - Fix makedist with --no-orte/--no-prrte option
 - Add a `--no-prrte` option which is the same as the legacy
   `--no-orte` option.
 - Remove embedded PMIx tarball. Replace it with new submodule
   pointing to OpenPMIx master repo's master branch
 - Some cleanup in PRRTE integration and add config summary entry
 - Correctly set the hostname
 - Fix locality
 - Fix singleton operations
 - Fix support for "tune" and "am" options

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-02-07 18:20:06 -08:00
Eisuke Kawashima
d26d4e1d63
Fix typo and update URLs (https, redirection) [skip ci]
Signed-off-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
2020-01-07 03:52:25 +09:00
Gilles Gouaillardet
1c4a3598d0 pmix/pmix4x: refresh to the latest open PMIx master
refresh to openpmix/openpmix@ea3b29b1a4

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-01 14:27:22 +09:00
Ralph Castain
9a2902c047
Force use of pmix/preg/native component
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-03 17:54:44 -07:00
Ralph Castain
2ebc1fa1b7
Update to current PMIx 4.0.0 (master)
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-03 14:46:55 -07:00
Ralph Castain
c5c93e3391
Update to PMIx master
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-07-29 12:20:20 -07:00
Ralph Castain
d202e10c14
Provide locality for all procs on node
Update PMIx to latest master to get supporting updates. For
connect/accept (part of comm_spawn as well), lookup locality for all
participating procs on the node and compute the relative locality so it
can be used for MPI operations.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-07-22 09:23:38 -07:00
Gilles Gouaillardet
4510711e95 pmix/pmix4x: refresh to the latest PMIx master
refresh to pmix/pmix@03a8b5daab

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-07-16 11:31:20 +09:00
Gilles Gouaillardet
63aa156bb0 pmix/pmix4x: refresh to the latest PMIx master
refresh to pmix/pmix@99971222ce

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-06-27 09:35:49 +09:00
Gilles Gouaillardet
5679a88867 pmix/pmix4x: refresh to the latest PMIx master
refresh to pmix/pmix@f67efc835c

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-06-24 10:17:23 +09:00
Ralph Castain
d4070d5f58
Fix finalize of flux component
Per patches from @SteVwonder and @garlick

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-06-18 21:14:04 -07:00
Gilles Gouaillardet
d9326ff2ca pmix/pmix4x: refresh to the latest PMIx master
refresh to pmix/pmix@186dca196c

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-06-10 15:17:43 +09:00
Gilles Gouaillardet
562809fca1 pmix/pmix4x: refresh to the latest PMIx master
refrest pmi4x to pmix/pmix@bde4a8a54f

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-04-23 09:31:43 +09:00
Ralph Castain
f4aa783848 Remove all the linkages back to libpmix in pmix components
This link-back seems to be breaking OMPI for some reason. I'm not sure we need it in PMIx anyway, but we'll investigate over there.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-04-16 13:30:20 -07:00
Gilles Gouaillardet
9ce8d7b568 pmix/pmix4x: refresh to the latest PMIx
refrest pmi4x to pmix/pmix@2531c0c3d1

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-04-09 14:03:00 +09:00
Gilles Gouaillardet
e844f76725 pmix/pmix4x: refresh to the latest PMIx
refrest pmi4x to pmix/pmix@20cc9c041e

Fixes open-mpi/ompi#6513

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-03-25 13:33:18 +09:00
Josh Hursey
53cd31ed7e
Merge pull request #6504 from jjhursey/rm-hash-pmix4
Do not force 'hash' gds on direct modex in pmix4x
2019-03-19 20:35:12 -05:00
Ralph Castain
c4be211741 Sync to latest PMIx master
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-03-19 10:27:12 -07:00
Joshua Hursey
1314cf2640 Do not force 'hash' gds on direct modex in pmix4x
* Forcing the 'hash' gds component should not be necessary any more.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-03-19 11:53:26 -05:00
Joshua Hursey
c2581d0e33 Do not force 'hash' gds on direct modex
* Forcing the 'hash' gds component should not be necessary any more.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-03-18 21:52:32 -05:00
Ralph Castain
cd1b5641be Update slurm pmi configury to account for pmix
When Slurm is built against PMIx, some installations place a copy of the
PMIx library that Slurm is linking against in the Slurm PMI location.
Current configury ignores that location. The desired behavior is to look
for a PMIx lib in that location when --with-pmi is given. If the user
also specifies --with-pmix and gives a different location, then override
anything previously found and look for it where the user directed.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-21 11:33:35 -08:00