1
1
Граф коммитов

5833 Коммитов

Автор SHA1 Сообщение Дата
Geoff Paulsen
811dfc63e0
Merge pull request #6550 from rhc54/cmr402/clnup
v4.0.x: Cleanup race condition in finalize that leads to incomplete vader cleanup
2019-04-09 10:13:15 -05:00
James Clark
d8dc69feb5 Add a compilation flag that adds unwind info to all files that are present in the stack starting from MPI_Init.
This is so when a debugger attaches using MPIR, it can step out of this stack back into main.
This cannot be done with certain aggressive optimisations and missing debug information.

Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Co-authored-by: Jeff Squyres <jsquyres@cisco.com>

(cherry-picked from 20f5840)
2019-04-01 11:10:04 +01:00
Ralph Castain
2536b4f869 Remove stale ORTE code
Functionality moved to PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit cfdd08d309)
2019-03-31 11:26:18 -07:00
Ralph Castain
861016c3b2 Cleanup race condition in finalize
See https://github.com/open-mpi/ompi/issues/5798#issuecomment-426545893
for a lengthy explanation

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 57f6b94fa5)
2019-03-31 11:23:27 -07:00
Ralph Castain
b5d46494cd Fix cross-mpirun connect/accept operations
Ensure we publish all the info required to be returned to the other
mpirun when executing this operation. We need to know the daemon (and
its URI) that is hosting each of the other procs so we can do a direct
modex operation and retrieve their connection info.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 60961ceb41)
2019-03-01 08:41:23 -08:00
Howard Pritchard
55915c3885 rml/ofi: remove
per discussion at the 2/19/19 devel-core meeting,
remove rml/ofi from 4.0.x

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-02-19 10:27:47 -07:00
Geoff Paulsen
c593b2004e
Merge pull request #6380 from hppritcha/ggouaillardet-topic/oob_tcp_cross_version_compatibility
v4.0.x: oob/tcp: add cross version compatibility support
2019-02-15 13:39:12 -06:00
Howard Pritchard
de1dd1c2b0 oob/tcp: hardwire oob_tcp version string to 4.0.0
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-02-13 12:54:03 -07:00
Gilles Gouaillardet
dd750795ee oob/tcp: add cross version compatibility support
Since we intend to provide cross version compatibility
between versions with the same major and minor, use
MAJOR.MINOR.0 instead of orte_version_string
(e.g. MAJOR.MINOR.RELEASEGREEK).

Open MPI 4.0.0 has already been released, so in order to make
it compatible with future 4.0.x releases, we have to use 4.0.0
as the version string, that is why we use MAJOR.MINOR.0 instead
of MAJOR.MINOR

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-13 10:21:32 -07:00
Boris Karasev
87c90866cb regx: fixed the order of hosts for ranges with different prefixes
Example:
For the list of hosts `a01,b00,a00` a regex is generated:
`a[2:1.0],b[2:0]`, where `a`-hosts prefixes moved to the begining,
it breaks the hosts ordering.
This commit fixes regex for that case to `a[2:1],b[2:0],a[2:0]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 46e38b9193)
2019-02-11 12:06:49 +02:00
Boris Karasev
62044da5d9 regx/reverse: fixed adding an empty range for no numerical hostnames
Example:
For the nodelist `jjss,jjss0000001,jjss0000003,jjss0000002` a regular
expression was `jjss[0:0],jjss[7:1,3,2]` that led to incorrect unpacking
the first host as `jjs0`. This commit fixes an adding empty range for
not numeric hostnames. Here is the fixed regex for this exapmle:
`jjss,jjss[7:1,3,2]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 1967e41a71)
2019-02-11 12:06:34 +02:00
Boris Karasev
c154631879 regx/test: update regex test
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit d1ad90f47e)
2019-02-11 12:05:50 +02:00
Howard Pritchard
d84322076f
Merge pull request #6307 from uberlinuxguy/v4.0.x-fix-for-6303
Adding changes for issue #6303 for branch v4.0.x.
2019-02-08 14:41:11 -07:00
Ralph Castain
dae71d3a75 Correct parsing of ppr directives
Needed to apply commit from PR #5778 to get this commit
from PR #6238 to apply cleanly.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit b19e5edf76)
2019-01-29 11:34:44 -07:00
Ralph Castain
18afb8e8a6 Update mapping system
Correctly transfer job-level mapping directives for dynamically spawned
jobs to the mapping system.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 45f23ca5c9)
2019-01-29 10:04:30 -07:00
Jason Williams
3d8ddbc136 Adding changes for issue #6303 for branch v4.0.x.
Signed-off-by: Jason Williams <uberlinuxguy@gmail.com>

(cherry picked from commit 98d81a5f7a)
2019-01-29 08:46:59 -05:00
Jeff Squyres
20d231defa odls_base_default_fns.c: put the free() in the right place
Fixes CID 1441826.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f96c04244d)
2018-12-22 06:41:33 -08:00
Howard Pritchard
af7a7f5da1
Merge pull request #6216 from abouteiller/export4x/overspawn
v4.0.x: Correctly propagate the oversubscribe flag to the spawnees
2018-12-21 15:33:05 -07:00
Aurélien Bouteiller
d9b0dad828
Correctly propagate the oversubscribe flag to the spawnees
This is a cherry-pick of master (2820aef). The propagation is intended to resolve issue #6130

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2018-12-21 14:53:25 -05:00
Ralph Castain
2d9c936082 If job is fully described, there will be no ppn string to unpack
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit d728380741)
2018-12-18 08:27:39 -08:00
Ralph Castain
98c8492057 Fix typo for rmaps_base_oversubscribe
Causes the MCA param to be ignored, while the cmd line option still
works.

Thanks to @iassiour for the report!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-29 07:40:51 -08:00
Howard Pritchard
24dae8609e
Merge pull request #5926 from hjelmn/v4.0.x_need_to_unblock_sigchld_in_some_cases
v4.0.x: Ensure SIGCHLD is unblocked
2018-11-19 13:13:10 -07:00
Jeff Squyres
8be14b9b07 orte-rmaps-base: slightly amend help message
Follow on to 430c659908: clarify the help message and fix one typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit e9bf318dcb)
2018-11-08 18:20:28 -05:00
Jeff Squyres
76d4c1843e orte-rmaps-base: update out-of-slots show_help message
Update the show_help message for when there are not enough slots to
run an application.

Also, remove a bunch of copies of this message in various show_help
text files that aren't used/referred to anywhere in the code.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 430c659908)
2018-11-08 16:03:28 -05:00
Ralph Castain
ba6ad9fe42 Remove stale defunct tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 05ac8fa71c)
2018-10-30 08:51:25 -07:00
Ralph Castain
712ddd326f Remove the stale orte-dvm code
Users should migrate to https://github.com/pmix/prrte

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1bd772e8eb)
2018-10-30 07:54:35 -07:00
Ralph Castain
8aae7943ab Provide deprecation warning of MPIR debugger
If we detect that we are being debugged by an MPIR-based debugger, then
print a warning that OMPI's MPIR support has been deprecated and will be
removed in a subsequent release.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 2cb271716b)
2018-10-25 08:49:00 -07:00
Howard Pritchard
d2fb9949a5
Merge pull request #5785 from jsquyres/pr/v4.0.x/more-compiler-warnings-fixes
v4.0.x: Squash a bunch of harmless compiler warnings.
2018-10-16 16:30:21 -06:00
Ralph Castain
05e0545581 Ensure SIGCHLD is unblocked
Thanks to @hjelmn for debugging it and providing the patch

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)
(cherry picked from commit 647a760b7e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:21:18 -06:00
Gilles Gouaillardet
2e2366d193 util/hostfile: fix a double free error
As reported at https://stackoverflow.com/questions/52707242/mpirun-segmentation-fault-whenever-i-use-a-hostfile
mpirun crashes when the hostfile contains a "user@host" line.
The root cause is username was not strdup'ed and free'd twice by opal_argv_free() and free()

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@5803385d44)
2018-10-09 13:06:55 +09:00
Jeff Squyres
37a9cf5c82 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 6bb356ab87)
2018-09-26 14:42:13 -07:00
Jeff Squyres
2e37f97a38 Miscellaneous compiler warning stomps.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit fe0852bcb4)
2018-09-21 14:35:51 -05:00
Gilles Gouaillardet
229ec82cf0 orte: send error messages to stderr.
When a job terminates normally but with a non zero exit code,
display the error message to stderr.

Thanks Emre Brookes for the bug report.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@893270caee)
2018-09-13 10:39:57 +09:00
Gilles Gouaillardet
959aeab5d9 ess/hnp: plug a memory leak in rte_finalize()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@d234caef74)
2018-09-10 09:20:16 +09:00
Geoff Paulsen
954692f06e
Merge pull request #5614 from karasevb/v4.0.x_fix_hwloc_numa_obj
v4.0.x: Fixed the NUMA obj detection for hwloc ver >= 2.0.0
2018-09-07 14:49:26 -05:00
Geoff Paulsen
325e8d26e8
Merge pull request #5608 from hppritcha/topic/pr5600_to_v4.0.x
Deal with special case during cleanup
2018-08-31 14:00:04 -05:00
Geoff Paulsen
b2daa0001f
Merge pull request #5565 from rhc54/cmr40/pmix301
Update to PMIx 3.0.1
2018-08-31 13:58:41 -05:00
Nathan Hjelm
4eeb41506c odls/alps: resolve hang when launching with mpirun on Crays
This commit removes some code that protected the odls/alps component
from closing alps file descriptors. For some unknown reason leaving
these file descriptors open causes can cause an orted to hang when
launching apps.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 98172163e6)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-29 09:22:37 -06:00
Boris Karasev
31ca3842da Fixed copyrights of prev commit.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit beb0697f24)
2018-08-28 12:29:16 +03:00
Boris Karasev
d995fb1b3f Fixed the NUMA obj detection for hwloc ver >= 2.0.0
Since version hwloc 2.0.0 has a new organization of NUMA nodes on the
topology tree. This commit adds the detection of local NUMA object for
hwloc => 2.0.0, which fixes the procs bindings policy for rmaps mindist
component.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit e5291ccc34)
2018-08-28 12:29:08 +03:00
Ralph Castain
5725e91c1a Deal with special case during cleanup
In some scenarios, we can have a daemon sharing the node with mpirun. In
those cases, we need to avoid race conditions in cleanup

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 8d1be27a1e)
2018-08-27 13:20:11 -06:00
Ralph Castain
b4ae5d005f Allow run-as-root if 2 envars are set
Per suggestion by @bangerth, allow mpirun to execute as root if two
envars are set to specific values

Per conversation with @jsquyres, name the envars OMPI_ALLOW_RUN_AS_ROOT
and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM

Fixes #4451

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7f1444d5f9)
2018-08-24 20:24:22 -07:00
Howard Pritchard
c2cc336135
Merge pull request #5486 from rhc54/cmr40/maps
Cleanup pmix selection and map-by modifiers
2018-08-22 04:40:00 -04:00
Ralph Castain
e27e945d9a Complete job control integration
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-08-20 16:08:54 -07:00
Boris Karasev
8873d901e8 pmix: added check for pmix fence status
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 57683366ca)

Conflicts:
	opal/mca/common/ucx/common_ucx.c
	opal/mca/common/ucx/common_ucx.h

Modified:
	ompi/mca/pml/ucx/pml_ucx.c
	oshmem/mca/spml/ucx/spml_ucx.c
2018-08-17 21:33:50 +06:00
Ralph Castain
511319c316 Fix the multiple pe/proc option
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.

Thanks to @tonyreina for the error report

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit bcdb1f45ac)
2018-07-25 19:55:28 -07:00
Ralph Castain
50e6e14020 Protect against infinite loops
Flag that we provided a notification and ignore it if it attempts to come back up.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit ea0d70bc9396def61545e2ce492a55c4c3aa7772)
2018-07-19 13:57:52 -07:00
Ralph Castain
6b6e63a346 Control inheritance of launch directives by child jobs
Do not have child jobs inherit launch directives unless requested to do so. This affects the map-by, rank-by, bind-to, npernode, pernode, npersocket, persocket, and cpus-per-rank directives. Values provided in the spawn call always take precedence - if a particular value isn't specified, then the ORTE defaults will be used if inheritance is not requested, and the values specified by MCA param will be used if inheritance is set.

Always inherit oversubscribe for now as otherwise MTT will break

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-10 15:12:05 -07:00
Ralph Castain
c6ee8d2f5c
Merge pull request #5300 from hjelmn/goodbye_oobud
oob/ud: remove as it has bitrotted
2018-07-09 12:40:04 -07:00
Ralph Castain
0ddbc75ce5
Merge pull request #4930 from kizill/fix-ipv6
fixed ipv6 OOB connection problems (fix issue #1585)
2018-06-26 09:13:53 -07:00