Commit graph

4406 commits

Author SHA1 Message Date
Geoff Paulsen
0cd5a5afb8
Merge pull request #6714 from rhc54/cmr40/routed
Fix tree spawn at scale
2019-06-07 14:13:22 -05:00
perrynzhou
5acaf006ae regx/base: fix an integer overflow
use strtol() instead of atoi() in order to handle hostnames
containing a large number.

This is a one-off commit for the release branches since
the regx framework has already been removed from master.

Refs. open-mpi/ompi#6729

Signed-off-by: perrynzhou <perrynzhou@gmail.com>
2019-06-06 14:37:33 +09:00
Ralph Castain
6c2cd10d68
Fix tree spawn at scale
Remove the debruijn component as it changes the daemon's parent
process ID, thus breaking the other routed components

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-06-04 09:49:01 -07:00
Jordan Hayes
e00d0abe56 plm_slurm_module: adjust for new SLURM CLI options
SLURM 19 discontinued the use of --cpu_bind (and changed it to
--cpu-bind).  There's no easy way to test at run time which one is
accepted, so set the environment variable SLURM_CPU_BIND to "none",
which should do the same thing as the srun CLI parameter.

Signed-off-by: Jordan Hayes <jhayes@ucr.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7dad74032e)
2019-05-16 09:13:28 -07:00
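The workaround described in commit e00d0abe56 (setting `SLURM_CPU_BIND` instead of choosing between `--cpu_bind` and `--cpu-bind`) can be sketched roughly as follows. The actual change lives in the C code of plm_slurm_module; the helper name `srun_environment` is hypothetical and only illustrates the idea of exporting the variable before `srun` is launched.

```python
import os


def srun_environment(base_env=None):
    """Return a copy of the environment with CPU binding disabled for srun.

    Hypothetical helper: it only shows the idea from the commit message,
    not the real plm_slurm_module code.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Setting the variable sidesteps the --cpu_bind / --cpu-bind rename,
    # because no version-specific CLI option has to be passed to srun.
    env["SLURM_CPU_BIND"] = "none"
    return env


if __name__ == "__main__":
    env = srun_environment({"PATH": "/usr/bin"})
    print(env["SLURM_CPU_BIND"])
```

The returned dictionary could then be passed as the `env` argument of `subprocess.run` when starting `srun`.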
Geoff Paulsen
811dfc63e0
Merge pull request #6550 from rhc54/cmr402/clnup
v4.0.x: Cleanup race condition in finalize that leads to incomplete vader cleanup
2019-04-09 10:13:15 -05:00
James Clark
d8dc69feb5 Add a compilation flag that adds unwind info to all files that are present in the stack starting from MPI_Init.
This is so when a debugger attaches using MPIR, it can step out of this stack back into main.
This cannot be done with certain aggressive optimisations and missing debug information.

Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Co-authored-by: Jeff Squyres <jsquyres@cisco.com>

(cherry-picked from 20f5840)
2019-04-01 11:10:04 +01:00
Ralph Castain
2536b4f869 Remove stale ORTE code
Functionality moved to PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit cfdd08d309)
2019-03-31 11:26:18 -07:00
Ralph Castain
861016c3b2 Cleanup race condition in finalize
See https://github.com/open-mpi/ompi/issues/5798#issuecomment-426545893
for a lengthy explanation

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 57f6b94fa5)
2019-03-31 11:23:27 -07:00
Ralph Castain
b5d46494cd Fix cross-mpirun connect/accept operations
Ensure we publish all the info required to be returned to the other
mpirun when executing this operation. We need to know the daemon (and
its URI) that is hosting each of the other procs so we can do a direct
modex operation and retrieve their connection info.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 60961ceb41)
2019-03-01 08:41:23 -08:00
Howard Pritchard
55915c3885 rml/ofi: remove
per discussion at the 2/19/19 devel-core meeting,
remove rml/ofi from 4.0.x

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-02-19 10:27:47 -07:00
Geoff Paulsen
c593b2004e
Merge pull request #6380 from hppritcha/ggouaillardet-topic/oob_tcp_cross_version_compatibility
v4.0.x: oob/tcp: add cross version compatibility support
2019-02-15 13:39:12 -06:00
Howard Pritchard
de1dd1c2b0 oob/tcp: hardwire oob_tcp version string to 4.0.0
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-02-13 12:54:03 -07:00
Gilles Gouaillardet
dd750795ee oob/tcp: add cross version compatibility support
Since we intend to provide cross version compatibility
between versions with the same major and minor, use
MAJOR.MINOR.0 instead of orte_version_string
(e.g. MAJOR.MINOR.RELEASEGREEK).

Open MPI 4.0.0 has already been released, so in order to make
it compatible with future 4.0.x releases, we have to use 4.0.0
as the version string; that is why we use MAJOR.MINOR.0 instead
of MAJOR.MINOR.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-13 10:21:32 -07:00
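The version-string rule from commit dd750795ee can be illustrated with a small sketch. It assumes the full version looks like `MAJOR.MINOR.RELEASEGREEK` (for example `4.0.2rc1`); the function names are hypothetical and are not part of oob/tcp.

```python
import re


def compat_version(full_version):
    """Reduce a full version string to MAJOR.MINOR.0 for the handshake."""
    match = re.match(r"(\d+)\.(\d+)", full_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {full_version!r}")
    major, minor = match.groups()
    return f"{major}.{minor}.0"


def versions_compatible(local, remote):
    """Two peers are treated as compatible when major and minor match."""
    return compat_version(local) == compat_version(remote)


if __name__ == "__main__":
    print(compat_version("4.0.2rc1"))
    print(versions_compatible("4.0.0", "4.0.3"))
```

Using `MAJOR.MINOR.0` rather than `MAJOR.MINOR` means the string sent by any later 4.0.x release is exactly `4.0.0`, which is what the already released 4.0.0 compares against.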
Boris Karasev
87c90866cb regx: fixed the order of hosts for ranges with different prefixes
Example:
For the list of hosts `a01,b00,a00` a regex is generated:
`a[2:1.0],b[2:0]`, where the `a`-host prefixes are moved to the beginning,
which breaks the host ordering.
This commit fixes the regex for that case to `a[2:1],b[2:0],a[2:0]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 46e38b9193)
2019-02-11 12:06:49 +02:00
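An order-preserving compression like the one described in commit 87c90866cb might look like the sketch below. The `prefix[width:values]` notation is inferred from the examples in the commit messages; the code is a simplified assumption (no `first-last` ranges, hypothetical function names) and not the regx component itself.

```python
import re
from itertools import groupby

# Split a hostname into a prefix and an optional run of trailing digits.
_HOST = re.compile(r"^(.*?)(\d*)$")


def split_host(name):
    prefix, digits = _HOST.match(name).groups()
    return prefix, digits


def compress_hosts(hosts):
    """Compress hostnames while keeping their original order.

    Only *consecutive* hosts with the same prefix and digit width are
    merged, so a prefix that reappears later gets its own entry.
    """
    parts = []
    keyed = [split_host(h) for h in hosts]
    for (prefix, width), run in groupby(keyed, key=lambda pd: (pd[0], len(pd[1]))):
        run = list(run)
        if width == 0:
            # Hostnames without a numeric suffix are emitted as-is.
            parts.extend(prefix for _ in run)
        else:
            values = ",".join(str(int(digits)) for _, digits in run)
            parts.append(f"{prefix}[{width}:{values}]")
    return ",".join(parts)


if __name__ == "__main__":
    print(compress_hosts(["a01", "b00", "a00"]))
```

Grouping only adjacent entries is the design choice that keeps `a00` after `b00` instead of folding it into the first `a` group.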
Boris Karasev
62044da5d9 regx/reverse: fixed adding an empty range for non-numeric hostnames
Example:
For the nodelist `jjss,jjss0000001,jjss0000003,jjss0000002` the regular
expression was `jjss[0:0],jjss[7:1,3,2]`, which led to the first host being
incorrectly unpacked as `jjs0`. This commit stops adding an empty range for
non-numeric hostnames. Here is the fixed regex for this example:
`jjss,jjss[7:1,3,2]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 1967e41a71)
2019-02-11 12:06:34 +02:00
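To see why the fixed expression from commit 62044da5d9 unpacks correctly, here is a minimal decoder sketch. It assumes the `prefix[width:values]` notation means "zero-pad each value to `width` digits", as the examples in the commit message suggest; ranges such as `1-3` are deliberately not handled, and `expand_hosts` is a hypothetical name.

```python
import re

# One entry is either a bare hostname or prefix[width:v1,v2,...].
_TOKEN = re.compile(r"([^,\[\]]+)(?:\[(\d+):([\d,]+)\])?")


def expand_hosts(regex):
    """Expand a compressed node list back into hostnames, in order."""
    hosts = []
    for prefix, width, values in _TOKEN.findall(regex):
        if not width:
            # Bare entry: a hostname with no numeric suffix.
            hosts.append(prefix)
        else:
            hosts.extend(
                f"{prefix}{int(value):0{int(width)}d}"
                for value in values.split(",")
            )
    return hosts


if __name__ == "__main__":
    print(expand_hosts("jjss,jjss[7:1,3,2]"))
```

With the bare `jjss` entry there is no range to expand, so the first host comes back unchanged.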
Howard Pritchard
d84322076f
Merge pull request #6307 from uberlinuxguy/v4.0.x-fix-for-6303
Adding changes for issue #6303 for branch v4.0.x.
2019-02-08 14:41:11 -07:00
Ralph Castain
dae71d3a75 Correct parsing of ppr directives
Needed to apply commit from PR #5778 to get this commit
from PR #6238 to apply cleanly.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit b19e5edf76)
2019-01-29 11:34:44 -07:00
Ralph Castain
18afb8e8a6 Update mapping system
Correctly transfer job-level mapping directives for dynamically spawned
jobs to the mapping system.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 45f23ca5c9)
2019-01-29 10:04:30 -07:00
Jason Williams
3d8ddbc136 Adding changes for issue #6303 for branch v4.0.x.
Signed-off-by: Jason Williams <uberlinuxguy@gmail.com>

(cherry picked from commit 98d81a5f7a)
2019-01-29 08:46:59 -05:00
Jeff Squyres
20d231defa odls_base_default_fns.c: put the free() in the right place
Fixes CID 1441826.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f96c04244d)
2018-12-22 06:41:33 -08:00
Howard Pritchard
af7a7f5da1
Merge pull request #6216 from abouteiller/export4x/overspawn
v4.0.x: Correctly propagate the oversubscribe flag to the spawnees
2018-12-21 15:33:05 -07:00
Aurélien Bouteiller
d9b0dad828
Correctly propagate the oversubscribe flag to the spawnees
This is a cherry-pick of master (2820aef). The propagation is intended to resolve issue #6130

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2018-12-21 14:53:25 -05:00
Ralph Castain
2d9c936082 If job is fully described, there will be no ppn string to unpack
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit d728380741)
2018-12-18 08:27:39 -08:00
Ralph Castain
98c8492057 Fix typo for rmaps_base_oversubscribe
Causes the MCA param to be ignored, while the cmd line option still
works.

Thanks to @iassiour for the report!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-29 07:40:51 -08:00
Howard Pritchard
24dae8609e
Merge pull request #5926 from hjelmn/v4.0.x_need_to_unblock_sigchld_in_some_cases
v4.0.x: Ensure SIGCHLD is unblocked
2018-11-19 13:13:10 -07:00
Jeff Squyres
8be14b9b07 orte-rmaps-base: slightly amend help message
Follow on to 430c659908: clarify the help message and fix one typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit e9bf318dcb)
2018-11-08 18:20:28 -05:00
Jeff Squyres
76d4c1843e orte-rmaps-base: update out-of-slots show_help message
Update the show_help message for when there are not enough slots to
run an application.

Also, remove a bunch of copies of this message in various show_help
text files that aren't used/referred to anywhere in the code.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 430c659908)
2018-11-08 16:03:28 -05:00
Ralph Castain
712ddd326f Remove the stale orte-dvm code
Users should migrate to https://github.com/pmix/prrte

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1bd772e8eb)
2018-10-30 07:54:35 -07:00
Ralph Castain
05e0545581 Ensure SIGCHLD is unblocked
Thanks to @hjelmn for debugging it and providing the patch

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)
(cherry picked from commit 647a760b7e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:21:18 -06:00
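Signal masks are inherited across fork and exec, so a launcher started with SIGCHLD blocked would never be notified when its children exit. The idea behind commit 05e0545581 can be sketched as below. The actual fix is C code in ORTE; this is only an illustration using Python's standard `signal` module on a POSIX system.

```python
import signal


def ensure_sigchld_unblocked():
    """Remove SIGCHLD from the calling thread's blocked-signal mask."""
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGCHLD})


def sigchld_is_blocked():
    # Blocking an empty set changes nothing and returns the current mask.
    return signal.SIGCHLD in signal.pthread_sigmask(signal.SIG_BLOCK, set())


if __name__ == "__main__":
    # Simulate a parent that left SIGCHLD blocked before starting us.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    ensure_sigchld_unblocked()
    print(sigchld_is_blocked())
```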
Jeff Squyres
37a9cf5c82 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 6bb356ab87)
2018-09-26 14:42:13 -07:00
Gilles Gouaillardet
229ec82cf0 orte: send error messages to stderr.
When a job terminates normally but with a non zero exit code,
display the error message to stderr.

Thanks Emre Brookes for the bug report.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@893270caee)
2018-09-13 10:39:57 +09:00
Gilles Gouaillardet
959aeab5d9 ess/hnp: plug a memory leak in rte_finalize()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@d234caef74)
2018-09-10 09:20:16 +09:00
Geoff Paulsen
954692f06e
Merge pull request #5614 from karasevb/v4.0.x_fix_hwloc_numa_obj
v4.0.x: Fixed the NUMA obj detection for hwloc ver >= 2.0.0
2018-09-07 14:49:26 -05:00
Nathan Hjelm
4eeb41506c odls/alps: resolve hang when launching with mpirun on Crays
This commit removes some code that protected the odls/alps component
from closing alps file descriptors. For some unknown reason, leaving
these file descriptors open can cause an orted to hang when
launching apps.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 98172163e6)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-29 09:22:37 -06:00
Boris Karasev
31ca3842da Fixed copyrights of prev commit.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit beb0697f24)
2018-08-28 12:29:16 +03:00
Boris Karasev
d995fb1b3f Fixed the NUMA obj detection for hwloc ver >= 2.0.0
hwloc 2.0.0 introduced a new organization of NUMA nodes in the
topology tree. This commit adds detection of the local NUMA object for
hwloc >= 2.0.0, which fixes the process binding policy for the rmaps mindist
component.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit e5291ccc34)
2018-08-28 12:29:08 +03:00
Howard Pritchard
c2cc336135
Merge pull request #5486 from rhc54/cmr40/maps
Cleanup pmix selection and map-by modifiers
2018-08-22 04:40:00 -04:00
Boris Karasev
8873d901e8 pmix: added check for pmix fence status
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 57683366ca)

Conflicts:
	opal/mca/common/ucx/common_ucx.c
	opal/mca/common/ucx/common_ucx.h

Modified:
	ompi/mca/pml/ucx/pml_ucx.c
	oshmem/mca/spml/ucx/spml_ucx.c
2018-08-17 21:33:50 +06:00
Ralph Castain
511319c316 Fix the multiple pe/proc option
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.

Thanks to @tonyreina for the error report

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit bcdb1f45ac)
2018-07-25 19:55:28 -07:00
Ralph Castain
6b6e63a346 Control inheritance of launch directives by child jobs
Do not have child jobs inherit launch directives unless requested to do so. This affects the map-by, rank-by, bind-to, npernode, pernode, npersocket, persocket, and cpus-per-rank directives. Values provided in the spawn call always take precedence - if a particular value isn't specified, then the ORTE defaults will be used if inheritance is not requested, and the values specified by MCA param will be used if inheritance is set.

Always inherit oversubscribe for now as otherwise MTT will break

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-10 15:12:05 -07:00
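The precedence rules described in commit 6b6e63a346 can be summarized in a short sketch. The function name, the dictionaries, and the placeholder values are all hypothetical. Falling back to the defaults when inheritance is requested but no MCA param is set is an assumption, since the commit message does not cover that case.

```python
def resolve_directive(name, spawn_values, mca_params, orte_defaults, inherit):
    """Pick the value of a launch directive for a child job."""
    # Values provided in the spawn call always take precedence.
    if name in spawn_values:
        return spawn_values[name]
    # Per the commit message, oversubscribe is always inherited for now.
    if name == "oversubscribe":
        inherit = True
    # With inheritance, use the value specified by MCA param.
    if inherit and name in mca_params:
        return mca_params[name]
    # Otherwise (or if no MCA value exists: an assumption) use the defaults.
    return orte_defaults[name]


if __name__ == "__main__":
    defaults = {"map-by": "default-map", "oversubscribe": False}
    mca = {"map-by": "mca-map", "oversubscribe": True}
    print(resolve_directive("map-by", {}, mca, defaults, inherit=False))
    print(resolve_directive("map-by", {}, mca, defaults, inherit=True))
```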
Ralph Castain
c6ee8d2f5c
Merge pull request #5300 from hjelmn/goodbye_oobud
oob/ud: remove as it has bitrotted
2018-07-09 12:40:04 -07:00
Ralph Castain
0ddbc75ce5
Merge pull request #4930 from kizill/fix-ipv6
fixed ipv6 OOB connection problems (fix issue #1585)
2018-06-26 09:13:53 -07:00
Ralph Castain
3b2390e5d5 Silence coverity warnings, remove/ignore build product
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-25 08:01:28 -07:00
Jeff Squyres
4603852740 orterun: use consistent CLI option name for --bind-to
Since the new binding option is tied to the --cpu-list orterun CLI
option, make the --bind-to option reflect the same name (vs. the
--cpu-set CLI option, which is entirely different).  For example:

    mpirun --bind-to cpu-list:ordered ...

Note that "--bind-to cpulist:ordered" is accepted as a synonym,
because people will be lazy.

Also add some minor updates to the orterun.1in man page for
clarification.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-21 08:22:00 -07:00
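Accepting `cpulist` as a synonym for `cpu-list`, as commit 4603852740 describes, amounts to normalizing the policy name before it is interpreted. The sketch below shows one possible way to do that. `parse_bind_to` is a hypothetical helper, not orterun's actual option parser.

```python
def parse_bind_to(value):
    """Split a --bind-to argument into (policy, qualifier)."""
    policy, _, qualifier = value.partition(":")
    # "cpulist" is accepted as a synonym for the canonical "cpu-list".
    if policy == "cpulist":
        policy = "cpu-list"
    return policy, qualifier or None


if __name__ == "__main__":
    print(parse_bind_to("cpu-list:ordered"))
    print(parse_bind_to("cpulist:ordered"))
```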
Ralph Castain
d2838139e4 Update man and help output for new binding option
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-21 06:36:11 -07:00
Ralph Castain
f17d47087a Define a new binding method and qualifier
Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-20 21:26:09 -07:00
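Binding by local rank, as introduced in commit f17d47087a, can be pictured as indexing into the user's cpu list. The sketch assumes a plain comma-separated list without ranges and raises an error when there are more local procs than cpus. Both choices are illustrative assumptions, and the function name is hypothetical.

```python
def cpu_for_local_rank(cpu_list, local_rank):
    """Return the cpu that the proc with this local rank would be bound to."""
    cpus = [int(item) for item in cpu_list.split(",")]
    if not 0 <= local_rank < len(cpus):
        raise ValueError(
            f"local rank {local_rank} has no entry in cpu list {cpu_list!r}"
        )
    # The n-th local proc gets the n-th cpu in the order the user gave.
    return cpus[local_rank]


if __name__ == "__main__":
    print(cpu_for_local_rank("4,2,6", 1))
```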
Ralph Castain
98b4ed9a3a Fix the no-disconnect test
A race condition exists based on whether or not the userdata object attached to a hwloc_obj_t has been initialized. These objects are set up whenever we scan for resources under that location. You therefore must not set a variable to the pointer to the userdata object and then call a function that will initialize the data in it - you need to set the variable after the function call, and protect against a NULL pointer.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-19 13:52:34 -07:00
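The ordering problem described in commit 98b4ed9a3a can be shown with a small stand-in model. `TopoObject`, `scan_resources`, and the stored fields are hypothetical and do not mirror hwloc's API. The point is only that the reference to the user data is taken after the call that may create it, with a guard for the uninitialized case.

```python
class TopoObject:
    """Stand-in for a topology object with lazily created user data."""

    def __init__(self):
        self.userdata = None  # stays None until the object is scanned


def scan_resources(obj):
    """Initialize obj.userdata the first time the object is scanned."""
    if obj.userdata is None:
        obj.userdata = {"available": [0, 1]}


def available_cpus(obj, scan=True):
    # Wrong order: reading obj.userdata here, before scanning, would
    # capture None and miss the data created by scan_resources().
    if scan:
        scan_resources(obj)
    data = obj.userdata  # read the pointer only after the call
    if data is None:     # protect against the uninitialized case
        return []
    return data["available"]


if __name__ == "__main__":
    print(available_cpus(TopoObject()))
```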
Nathan Hjelm
845516ca11 oob/ud: remove as it has bitrotted
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-19 13:06:26 -06:00
Ralph Castain
f0a0d606a0 Correct accounting for tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1be080f7b92bad39745f42628a8cb6afefad2d2a)
2018-06-18 13:24:25 -07:00
Ralph Castain
795140e590 Make use of "instant-on" feature optional
The PMIx support for "instant on" remains experimental, so disable it by default. Provide an MCA param and corresponding command line option to enable it at runtime.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-17 02:42:00 -07:00