* The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait
30 seconds before sending SIGTERM then another 30 seconds before sending
SIGKILL to remaining processes. This usually happens on an abnormal
termination. Sometimes the user wants to delay the cleanup to give the
system time to write out corefile or run other diagnostics.
* The problem is that child processes may be completing while ORTE is
in this loop. The SIGCHLD will interrupt the `sleep` system call.
Without the loop the sleep could effectively be ignored in this case.
- Sleep returns the amount of time remaining to sleep. If it was
interrupted by a signal then it is a positive number less than or
equal to the parameter passed to it. If it slept the whole time
then it returns 0.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 0e8a97c598)
Remove code for multiple OOB progress threads as it is an optimization
nobody uses. Also turns out to have a race condition that can cause
segfault on finalize, so maybe good that nobody is using it.
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 41eb41c3f2)
(cherry picked from commit a2f35c1834ab2fcb216285621d177a179e33dfe7)
This commit fixes an compilation error when configured
with `--enable-timing`.
Procedures in the function `orte_ess_base_app_setup`
in `orte/mca/ess/base/ess_base_std_app.c` are moved
to `orte/mca/ess/pmi/ess_pmi_module.c`
and `orte/mca/ess/singleton/ess_singleton_module.c`
in the recent commit 57f6b94fa5.
In `ess_pmi_module.c`, the first argument of the
`OPAL_TIMING_ENV_NEXT` macro should have been adapted
to the destination function but was not.
In `ess_singleton_module.c`, `OPAL_TIMING_ENV_INIT`
was not used in the destination function originally.
So `OPAL_TIMING_ENV_NEXT` cannot be used in the function.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit 8e7d874e14)
If both types of interfaces are enabled, don't error out if one of them
isn't able to open listener sockets. Only one interface family may be
available on some machines, but someone might want to build the code to
run more generally.
Refs https://github.com/pmix/prrte/pull/249
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 06d188ebf3)
* Fix#6618
- See comments on Issue #6618 for finer details.
* The `plm/rsh` component uses the highest priority `routed` component
to construct the launch tree. The remote orted's will activate all
available `routed` components when updating routes. This allows the
opportunity for the parent vpid on the remote `orted` to not match
that which was expected in the tree launch. The result is that the
remote orted tries to contact their parent with the wrong contact
information and orted wireup will fail.
* This fix forces the orteds to use the same `routed` component as
the HNP used when contructing the tree, if tree launch is enabled.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Provide a missing header and paren
Thanks to @zerothi for the assistance
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit bd5a1765ee)
Override the defaults when provided. Ignore LSF binding file if user
overrides by specifying a policy.
Fixes#6631
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit ea0dfc3218)
use strtol() instead of atoi() in order to handle hostnames
containing a large number.
This is a one-off commit for the release branches since
the regx framework has already been removed from master.
Refs. open-mpi/ompi#6729
Signed-off-by: perrynzhou <perrynzhou@gmail.com>
Remove the debruijn component as it changes the daemon's parent
process ID, thus breaking the other routed components
Signed-off-by: Ralph Castain <rhc@pmix.org>
SLURM 19 discontinued the use of --cpu_bind (and changed it to
--cpu-bind). There's no easy way to test at run time which one is
accepted, so set the environment variable SLURM_CPU_BIND to "none",
which should do the same thing as the srun CLI parameter.
Signed-off-by: Jordan Hayes <jhayes@ucr.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7dad74032e)
This is so when a debugger attaches using MPIR, it can step out of this stack back into main.
This cannot be done with certain aggressive optimisations and missing debug information.
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Co-authored-by: Jeff Squyres <jsquyres@cisco.com>
(cherry-picked from 20f5840)
Ensure we publish all the info required to be returned to the other
mpirun when executing this operation. We need to know the daemon (and
its URI) that is hosting each of the other procs so we can do a direct
modex operation and retrieve their connection info.
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 60961ceb41)
Since we intend to provide cross version compatibility
between versions with the same major and minor, use
MAJOR.MINOR.0 instead of orte_version_string
(e.g. MAJOR.MINOR.RELEASEGREEK).
Open MPI 4.0.0 has already been released, so in order to make
it compatible with future 4.0.x releases, we have to use 4.0.0
as the version string, that is why we use MAJOR.MINOR.0 instead
of MAJOR.MINOR
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Example:
For the list of hosts `a01,b00,a00` a regex is generated:
`a[2:1.0],b[2:0]`, where `a`-hosts prefixes moved to the begining,
it breaks the hosts ordering.
This commit fixes regex for that case to `a[2:1],b[2:0],a[2:0]`
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 46e38b9193)
Example:
For the nodelist `jjss,jjss0000001,jjss0000003,jjss0000002` a regular
expression was `jjss[0:0],jjss[7:1,3,2]` that led to incorrect unpacking
the first host as `jjs0`. This commit fixes an adding empty range for
not numeric hostnames. Here is the fixed regex for this exapmle:
`jjss,jjss[7:1,3,2]`
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 1967e41a71)
Needed to apply commit from PR #5778 to get this commit
from PR #6238 to apply cleanly.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit b19e5edf76)
Correctly transfer job-level mapping directives for dynamically spawned
jobs to the mapping system.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 45f23ca5c9)
This is a cherry-pick of master (2820aef). The propagation is intended to resolve issue #6130
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
Causes the MCA param to be ignored, while the cmd line option still
works.
Thanks to @iassiour for the report!
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Follow on to 430c659908: clarify the help message and fix one typo.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit e9bf318dcb)
Update the show_help message for when there are not enough slots to
run an application.
Also, remove a bunch of copies of this message in various show_help
text files that aren't used/referred to anywhere in the code.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 430c659908)
Thanks to @hjelmn for debugging it and providing the patch
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)
(cherry picked from commit 647a760b7e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
When a job terminates normally but with a non zero exit code,
display the error message to stderr.
Thanks Emre Brookes for the bug report.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@893270caee)
This commit removes some code that protected the odls/alps component
from closing alps file descriptors. For some unknown reason leaving
these file descriptors open causes can cause an orted to hang when
launching apps.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 98172163e6)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Since version hwloc 2.0.0 has a new organization of NUMA nodes on the
topology tree. This commit adds the detection of local NUMA object for
hwloc => 2.0.0, which fixes the procs bindings policy for rmaps mindist
component.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit e5291ccc34)
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.
Thanks to @tonyreina for the error report
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit bcdb1f45ac)
Do not have child jobs inherit launch directives unless requested to do so. This affects the map-by, rank-by, bind-to, npernode, pernode, npersocket, persocket, and cpus-per-rank directives. Values provided in the spawn call always take precedence - if a particular value isn't specified, then the ORTE defaults will be used if inheritance is not requested, and the values specified by MCA param will be used if inheritance is set.
Always inherit oversubscribe for now as otherwise MTT will break
Signed-off-by: Ralph Castain <rhc@open-mpi.org>