Commit Graph

5853 Commits

Author SHA1 Message Date
Scott Miller
8eae54fd27 plm/rsh: Add chdir option to change directory before orted exec
Signed-off-by: Scott Miller <scott.miller1@ibm.com>
(cherry picked from commit c1b8599528)

Conflicts:
	orte/mca/plm/rsh/plm_rsh_module.c
2019-10-29 15:49:41 -04:00
Joshua Hursey
c6fab32137
Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion.
* The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait
   30 seconds before sending SIGTERM then another 30 seconds before sending
   SIGKILL to remaining processes. This usually happens on an abnormal
   termination. Sometimes the user wants to delay the cleanup to give the
   system time to write out corefile or run other diagnostics.
 * The problem is that child processes may be completing while ORTE is
   in this loop. The SIGCHLD will interrupt the `sleep` system call.
   Without the loop the sleep could effectively be ignored in this case.
   - Sleep returns the amount of time remaining to sleep. If it was
     interrupted by a signal then it is a positive number less than or
     equal to the parameter passed to it. If it slept the whole time
     then it returns 0.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 0e8a97c598)
2019-10-02 14:49:47 -05:00
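The retry loop described in the sigkill-timeout commit above can be sketched as follows. This is a minimal illustration, not the ORTE code: the helper name `sleep_full_interval` is hypothetical, and it relies only on the POSIX behavior that `sleep()` returns the number of unslept seconds when a signal handler interrupts it.

```c
#include <unistd.h>

/* Hypothetical helper (not ORTE's actual function): sleep for the whole
 * interval even if signals such as SIGCHLD interrupt sleep() early.
 * Returns how many times the sleep was interrupted. */
unsigned int sleep_full_interval(unsigned int seconds)
{
    unsigned int interruptions = 0;
    unsigned int remaining = seconds;

    while (remaining > 0) {
        /* sleep() returns 0 after sleeping the full time, or the number
         * of seconds left if a signal handler cut it short. */
        remaining = sleep(remaining);
        if (remaining > 0) {
            ++interruptions;
        }
    }
    return interruptions;
}
```

A single `sleep(30)` call without the loop could return almost immediately if a child exits during the wait, which is the failure mode the commit message describes.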
Ralph Castain
edbfcf090a
Cleanup stale code in ORTE/OOB
Remove code for multiple OOB progress threads as it is an optimization
nobody uses. Also turns out to have a race condition that can cause
segfault on finalize, so maybe good that nobody is using it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 41eb41c3f2)
(cherry picked from commit a2f35c1834ab2fcb216285621d177a179e33dfe7)
2019-09-26 15:21:15 -07:00
KAWASHIMA Takahiro
e5be033c14 ess/pmi: Fix --enable-timing compilation error
This commit fixes a compilation error when configured
with `--enable-timing`.

Procedures in the function `orte_ess_base_app_setup`
in `orte/mca/ess/base/ess_base_std_app.c` were moved
to `orte/mca/ess/pmi/ess_pmi_module.c`
and `orte/mca/ess/singleton/ess_singleton_module.c`
in the recent commit 57f6b94fa5.

In `ess_pmi_module.c`, the first argument of the
`OPAL_TIMING_ENV_NEXT` macro should have been adapted
to the destination function but was not.

In `ess_singleton_module.c`, `OPAL_TIMING_ENV_INIT`
was not used in the destination function originally.
So `OPAL_TIMING_ENV_NEXT` cannot be used in the function.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit 8e7d874e14)
2019-09-19 18:14:06 -04:00
Austen Lauria
1430df3c0f Add 'orte_' prefix to noop_mpir_breakpoint_ptr.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit 77144689f0)
2019-09-19 08:47:17 -04:00
Austen Lauria
3eb7b27d3a Conform MPIR_Breakpoint to MPIR standard.
- Fix MPIR_Breakpoint standard violation by returning void
  instead of a void*.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit 067adfa417)
2019-09-18 09:53:15 -04:00
Geoff Paulsen
a482edc14e
Merge pull request #6944 from jjhursey/v4/fix-tree-launch
Fix tree spawn routed component issue
2019-09-09 13:10:36 -05:00
Ralph Castain
95cc53e331
Be a little less restrictive on interface requirements
If both types of interfaces are enabled, don't error out if one of them
isn't able to open listener sockets. Only one interface family may be
available on some machines, but someone might want to build the code to
run more generally.

Refs https://github.com/pmix/prrte/pull/249

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 06d188ebf3)
2019-09-09 07:52:03 -07:00
Joshua Hursey
4c1160e257 Fix tree spawn routed component issue
* Fix #6618
   - See comments on Issue #6618 for finer details.
 * The `plm/rsh` component uses the highest priority `routed` component
   to construct the launch tree. The remote orteds activate all
   available `routed` components when updating routes. As a result, the
   parent vpid on a remote `orted` may not match the one expected
   in the tree launch. The remote orted then tries to contact its
   parent with the wrong contact information, and orted wireup fails.
 * This fix forces the orteds to use the same `routed` component as
   the HNP used when constructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-08-29 16:26:43 -04:00
Scott Miller
1b0cfdf264 v4.0.x: regx/naive: add regx/naive component
Signed-off-by: Scott Miller <scott.miller1@ibm.com>
2019-08-26 11:37:07 -04:00
Jeff Squyres
549abeaa87 orterun: remove duplicate code
https://github.com/open-mpi/ompi/pull/6895 fixed the code in orterun.c
to allow running as root if both OMPI_ALLOW_RUN_AS_ROOT and
OMPI_ALLOW_RUN_AS_ROOT_CONFIRM env vars are set.  However, this
env-var-checking code already exists in
orte_submit.c:orte_submit_init() -- it looks like the
geteuid()/getenv()-checking code here in orterun is now duplicate
code.

So let's just get rid of the duplicate code.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 197beb30d5)
2019-08-19 15:49:57 -04:00
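The run-as-root check that the two commits around this point refer to can be sketched like this. It is an assumption-laden illustration rather than the code in `orte_submit.c`: the function names are hypothetical, the effective UID is passed in as a parameter so the logic can be exercised without root, and requiring the value `"1"` is an assumption (the commit message only says both variables must be set).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: true only when BOTH confirmation variables are
 * present. Comparing against "1" is an assumption made for this sketch. */
bool run_as_root_confirmed(void)
{
    const char *allow = getenv("OMPI_ALLOW_RUN_AS_ROOT");
    const char *confirm = getenv("OMPI_ALLOW_RUN_AS_ROOT_CONFIRM");

    return allow != NULL && confirm != NULL
        && strcmp(allow, "1") == 0 && strcmp(confirm, "1") == 0;
}

/* Hypothetical helper: non-root users may always launch; root needs
 * the explicit double confirmation. */
bool may_launch(uid_t euid)
{
    return euid != 0 || run_as_root_confirmed();
}
```

Keeping the check in one function is the point of the cleanup commit: a second copy of the same `geteuid()`/`getenv()` logic is easy to let drift out of sync.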
Simon Byrne
f49c22af6d Run-as-root env vars in orterun.c
I found that I needed to apply the same change as #5597 to orterun.c for the environment variables to work correctly.

Signed-off-by: Simon Byrne <simonbyrne@gmail.com>
(cherry picked from commit 9c8671c48b)
2019-08-19 15:34:20 -04:00
Ralph Castain
f0f25b60a8
Fix typos
Provide a missing header and paren

Thanks to @zerothi for the assistance

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit bd5a1765ee)
2019-08-07 05:51:29 -07:00
Ralph Castain
9898332ae0
Allow individual jobs to set their map/rank/bind policies
Override the defaults when provided. Ignore LSF binding file if user
overrides by specifying a policy.

Fixes #6631

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit ea0dfc3218)
2019-08-07 05:51:06 -07:00
Austen Lauria
0422b23f35 Try to prevent the compiler from optimizing out MPIR_Breakpoint().
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit 00106f5ac9)
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2019-07-24 14:08:56 -04:00
Orivej Desh
667fe3f3f3 Fix oob_tcp tcp_component_close segfault with active listeners
oob_tcp in non-HNP mode shares libevent event_base with oob_base [1].
orte_oob_base_close calls:
(1) oob_tcp component_shutdown, then
(2) opal_progress_thread_finalize, then
(3) oob_tcp tcp_component_close [2].
opal_progress_thread_finalize calls tracker_destructor [3] that frees the
event_base [4]. If any oob_tcp event listeners are active at this time, oob_tcp
will crash trying to delete them at [5] [6].

This change moves oob_tcp event listener cleanup from component_close to
component_shutdown so that it happens before the event_base is freed.

[1] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L160
[2] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/base/oob_base_frame.c#L95
[3] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L232
[4] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L65
[5] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_component.c#L192
[6] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L955

Signed-off-by: Orivej Desh <orivej@gmx.fr>
(cherry picked from commit 78b7e342bd)
2019-07-08 15:14:43 -07:00
Geoff Paulsen
0cd5a5afb8
Merge pull request #6714 from rhc54/cmr40/routed
Fix tree spawn at scale
2019-06-07 14:13:22 -05:00
perrynzhou
5acaf006ae regx/base: fix an integer overflow
use strtol() instead of atoi() in order to handle hostnames
containing a large number.

This is a one-off commit for the release branches since
the regx framework has already been removed from master.

Refs. open-mpi/ompi#6729

Signed-off-by: perrynzhou <perrynzhou@gmail.com>
2019-06-06 14:37:33 +09:00
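To illustrate why the commit above prefers `strtol()` over `atoi()`: `atoi()` has undefined behavior when the value does not fit in an `int`, while `strtol()` works on the wider `long` type (on common 64-bit platforms) and reports overflow through `errno`. The helper below is a hypothetical sketch, not the regx code; its name and return convention are assumptions.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse the trailing decimal digits of a hostname
 * (e.g. "node0042" -> 42). Returns 0 on success, or -1 if the name has
 * no numeric suffix or the number does not fit in a long. */
int parse_host_suffix(const char *hostname, long *out)
{
    size_t len = strlen(hostname);
    size_t start = len;

    /* Walk backwards over the trailing digits. */
    while (start > 0 && isdigit((unsigned char)hostname[start - 1])) {
        --start;
    }
    if (start == len) {
        return -1; /* no digits at the end */
    }

    errno = 0;
    char *end = NULL;
    long value = strtol(hostname + start, &end, 10);
    if (errno == ERANGE || *end != '\0') {
        return -1; /* overflow is reported instead of silently wrapping */
    }

    *out = value;
    return 0;
}
```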
Ralph Castain
6c2cd10d68
Fix tree spawn at scale
Remove the debruijn component as it changes the daemon's parent
process ID, thus breaking the other routed components

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-06-04 09:49:01 -07:00
Jordan Hayes
e00d0abe56 plm_slurm_module: adjust for new SLURM CLI options
SLURM 19 discontinued the use of --cpu_bind (and changed it to
--cpu-bind).  There's no easy way to test at run time which one is
accepted, so set the environment variable SLURM_CPU_BIND to "none",
which should do the same thing as the srun CLI parameter.

Signed-off-by: Jordan Hayes <jhayes@ucr.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7dad74032e)
2019-05-16 09:13:28 -07:00
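The workaround described in the SLURM commit above amounts to setting an environment variable before the launcher runs. A minimal sketch, assuming a hypothetical helper name and showing only the `setenv()` call (the real change lives in `plm_slurm_module` and is not reproduced here):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/* Hypothetical helper: request no CPU binding through the environment
 * instead of a command-line flag whose spelling differs between SLURM
 * versions. Returns 0 on success, -1 on failure (as setenv() does). */
int disable_slurm_cpu_binding(void)
{
    /* The final argument 1 overwrites any value inherited from the user. */
    return setenv("SLURM_CPU_BIND", "none", 1);
}
```

Using the environment avoids having to detect at run time whether the installed `srun` accepts `--cpu_bind` or `--cpu-bind`.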
Geoff Paulsen
811dfc63e0
Merge pull request #6550 from rhc54/cmr402/clnup
v4.0.x: Cleanup race condition in finalize that leads to incomplete vader cleanup
2019-04-09 10:13:15 -05:00
James Clark
d8dc69feb5 Add a compilation flag that adds unwind info to all files that are present in the stack starting from MPI_Init.
This is so when a debugger attaches using MPIR, it can step out of this stack back into main.
This cannot be done with certain aggressive optimisations and missing debug information.

Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Co-authored-by: Jeff Squyres <jsquyres@cisco.com>

(cherry-picked from 20f5840)
2019-04-01 11:10:04 +01:00
Ralph Castain
2536b4f869 Remove stale ORTE code
Functionality moved to PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit cfdd08d309)
2019-03-31 11:26:18 -07:00
Ralph Castain
861016c3b2 Cleanup race condition in finalize
See https://github.com/open-mpi/ompi/issues/5798#issuecomment-426545893
for a lengthy explanation

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 57f6b94fa5)
2019-03-31 11:23:27 -07:00
Ralph Castain
b5d46494cd Fix cross-mpirun connect/accept operations
Ensure we publish all the info required to be returned to the other
mpirun when executing this operation. We need to know the daemon (and
its URI) that is hosting each of the other procs so we can do a direct
modex operation and retrieve their connection info.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 60961ceb41)
2019-03-01 08:41:23 -08:00
Howard Pritchard
55915c3885 rml/ofi: remove
per discussion at the 2/19/19 devel-core meeting,
remove rml/ofi from 4.0.x

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-02-19 10:27:47 -07:00
Geoff Paulsen
c593b2004e
Merge pull request #6380 from hppritcha/ggouaillardet-topic/oob_tcp_cross_version_compatibility
v4.0.x: oob/tcp: add cross version compatibility support
2019-02-15 13:39:12 -06:00
Howard Pritchard
de1dd1c2b0 oob/tcp: hardwire oob_tcp version string to 4.0.0
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-02-13 12:54:03 -07:00
Gilles Gouaillardet
dd750795ee oob/tcp: add cross version compatibility support
Since we intend to provide cross version compatibility
between versions with the same major and minor, use
MAJOR.MINOR.0 instead of orte_version_string
(e.g. MAJOR.MINOR.RELEASEGREEK).

Open MPI 4.0.0 has already been released, so in order to make
it compatible with future 4.0.x releases, we have to use 4.0.0
as the version string, which is why we use MAJOR.MINOR.0 instead
of MAJOR.MINOR.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-13 10:21:32 -07:00
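The MAJOR.MINOR.0 convention from the commit above can be sketched as a small formatting routine. The function name is hypothetical and this is not the oob/tcp code; it only shows how dropping the release component yields the same string (for example `4.0.0`) for every release in a series.

```c
#include <stdio.h>

/* Hypothetical helper: write "MAJOR.MINOR.0" into buf.
 * Returns 0 on success, -1 if the buffer is too small or on error. */
int format_compat_version(char *buf, size_t len, int major, int minor)
{
    int n = snprintf(buf, len, "%d.%d.0", major, minor);

    /* snprintf reports the length it wanted; n >= len means truncation. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With this scheme, 4.0.1 and 4.0.2 would both advertise `4.0.0`, matching what the already released 4.0.0 sends.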
Boris Karasev
87c90866cb regx: fixed the order of hosts for ranges with different prefixes
Example:
For the list of hosts `a01,b00,a00`, the generated regex was
`a[2:1.0],b[2:0]`, where the `a`-prefixed hosts were moved to the beginning,
which breaks the host ordering.
This commit fixes the regex for that case to `a[2:1],b[2:0],a[2:0]`.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 46e38b9193)
2019-02-11 12:06:49 +02:00
Boris Karasev
62044da5d9 regx/reverse: fixed adding an empty range for non-numerical hostnames
Example:
For the nodelist `jjss,jjss0000001,jjss0000003,jjss0000002`, the regular
expression was `jjss[0:0],jjss[7:1,3,2]`, which led to the first host being
incorrectly unpacked as `jjs0`. This commit stops adding an empty range for
non-numeric hostnames. Here is the fixed regex for this example:
`jjss,jjss[7:1,3,2]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 1967e41a71)
2019-02-11 12:06:34 +02:00
Boris Karasev
c154631879 regx/test: update regex test
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit d1ad90f47e)
2019-02-11 12:05:50 +02:00
Howard Pritchard
d84322076f
Merge pull request #6307 from uberlinuxguy/v4.0.x-fix-for-6303
Adding changes for issue #6303 for branch v4.0.x.
2019-02-08 14:41:11 -07:00
Ralph Castain
dae71d3a75 Correct parsing of ppr directives
Needed to apply commit from PR #5778 to get this commit
from PR #6238 to apply cleanly.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit b19e5edf76)
2019-01-29 11:34:44 -07:00
Ralph Castain
18afb8e8a6 Update mapping system
Correctly transfer job-level mapping directives for dynamically spawned
jobs to the mapping system.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 45f23ca5c9)
2019-01-29 10:04:30 -07:00
Jason Williams
3d8ddbc136 Adding changes for issue #6303 for branch v4.0.x.
Signed-off-by: Jason Williams <uberlinuxguy@gmail.com>

(cherry picked from commit 98d81a5f7a)
2019-01-29 08:46:59 -05:00
Jeff Squyres
20d231defa odls_base_default_fns.c: put the free() in the right place
Fixes CID 1441826.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f96c04244d)
2018-12-22 06:41:33 -08:00
Howard Pritchard
af7a7f5da1
Merge pull request #6216 from abouteiller/export4x/overspawn
v4.0.x: Correctly propagate the oversubscribe flag to the spawnees
2018-12-21 15:33:05 -07:00
Aurélien Bouteiller
d9b0dad828
Correctly propagate the oversubscribe flag to the spawnees
This is a cherry-pick of master (2820aef). The propagation is intended to resolve issue #6130

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2018-12-21 14:53:25 -05:00
Ralph Castain
2d9c936082 If job is fully described, there will be no ppn string to unpack
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit d728380741)
2018-12-18 08:27:39 -08:00
Ralph Castain
98c8492057 Fix typo for rmaps_base_oversubscribe
Causes the MCA param to be ignored, while the cmd line option still
works.

Thanks to @iassiour for the report!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-29 07:40:51 -08:00
Howard Pritchard
24dae8609e
Merge pull request #5926 from hjelmn/v4.0.x_need_to_unblock_sigchld_in_some_cases
v4.0.x: Ensure SIGCHLD is unblocked
2018-11-19 13:13:10 -07:00
Jeff Squyres
8be14b9b07 orte-rmaps-base: slightly amend help message
Follow on to 430c659908: clarify the help message and fix one typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit e9bf318dcb)
2018-11-08 18:20:28 -05:00
Jeff Squyres
76d4c1843e orte-rmaps-base: update out-of-slots show_help message
Update the show_help message for when there are not enough slots to
run an application.

Also, remove a bunch of copies of this message in various show_help
text files that aren't used/referred to anywhere in the code.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 430c659908)
2018-11-08 16:03:28 -05:00
Ralph Castain
ba6ad9fe42 Remove stale defunct tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 05ac8fa71c)
2018-10-30 08:51:25 -07:00
Ralph Castain
712ddd326f Remove the stale orte-dvm code
Users should migrate to https://github.com/pmix/prrte

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1bd772e8eb)
2018-10-30 07:54:35 -07:00
Ralph Castain
8aae7943ab Provide deprecation warning of MPIR debugger
If we detect that we are being debugged by an MPIR-based debugger, then
print a warning that OMPI's MPIR support has been deprecated and will be
removed in a subsequent release.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 2cb271716b)
2018-10-25 08:49:00 -07:00
Howard Pritchard
d2fb9949a5
Merge pull request #5785 from jsquyres/pr/v4.0.x/more-compiler-warnings-fixes
v4.0.x: Squash a bunch of harmless compiler warnings.
2018-10-16 16:30:21 -06:00
Ralph Castain
05e0545581 Ensure SIGCHLD is unblocked
Thanks to @hjelmn for debugging it and providing the patch

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)
(cherry picked from commit 647a760b7e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:21:18 -06:00
Gilles Gouaillardet
2e2366d193 util/hostfile: fix a double free error
As reported at https://stackoverflow.com/questions/52707242/mpirun-segmentation-fault-whenever-i-use-a-hostfile
mpirun crashes when the hostfile contains a "user@host" line.
The root cause is that username was not strdup'ed and was therefore freed twice, by opal_argv_free() and free().

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@5803385d44)
2018-10-09 13:06:55 +09:00
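The ownership problem behind the double free above can be illustrated with a small sketch. All names here are hypothetical, and `free_string_array` is only a stand-in for `opal_argv_free()`: the array owns its strings, so any pointer that must outlive the array needs its own `strdup()` copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Stand-in for opal_argv_free(): frees every string, then the array. */
void free_string_array(char **argv)
{
    if (argv == NULL) {
        return;
    }
    for (char **p = argv; *p != NULL; ++p) {
        free(*p);
    }
    free(argv);
}

/* Hypothetical helper: split "user@host" into a NULL-terminated array
 * {user, host, NULL}. Returns NULL if there is no '@' or on allocation
 * failure. */
char **split_user_host(const char *entry)
{
    const char *at = strchr(entry, '@');
    if (at == NULL) {
        return NULL;
    }
    char **parts = calloc(3, sizeof(*parts));
    if (parts == NULL) {
        return NULL;
    }
    parts[0] = strndup(entry, (size_t)(at - entry));
    parts[1] = strdup(at + 1);
    if (parts[0] == NULL || parts[1] == NULL) {
        free(parts[0]);
        free(parts[1]);
        free(parts);
        return NULL;
    }
    return parts;
}

/* Take a private copy of the username. Returning parts[0] directly would
 * alias memory that free_string_array() later frees, so a second free()
 * by the caller would be a double free. */
char *copy_username(char **parts)
{
    return strdup(parts[0]);
}
```

The caller can then release the array and the copy independently, in either order.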