openmpi/orte/mca
Joshua Hursey c6fab32137
Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion.
 * The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait
   30 seconds before sending SIGTERM, then another 30 seconds before sending
   SIGKILL to any remaining processes. This usually happens on an abnormal
   termination. Sometimes the user wants to delay the cleanup to give the
   system time to write out a corefile or run other diagnostics.
 * The problem is that child processes may be completing while ORTE is
   waiting out this timeout. Each SIGCHLD interrupts the `sleep` call, so
   without a loop that resumes sleeping, the delay could effectively be
   skipped in this case.
   - `sleep` returns the number of seconds remaining to sleep. If it was
     interrupted by a signal, the return value is a positive number less
     than or equal to the parameter passed to it. If it slept the whole
     time, it returns 0.
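
   The re-sleep loop described above can be sketched roughly as follows.
   This is a minimal illustration, not the actual ORTE change: the helper
   name `sleep_full` and its return value (the number of interrupted
   sleeps) are assumptions made for the example.

```c
#include <unistd.h>

/* Hypothetical helper (not the ORTE implementation): sleep for the full
 * requested duration even when signals such as SIGCHLD interrupt sleep().
 * Returns how many times the sleep was cut short and had to be resumed. */
static unsigned int sleep_full(unsigned int seconds)
{
    unsigned int remaining = seconds;
    unsigned int interruptions = 0;

    while (remaining > 0) {
        /* sleep() returns 0 after a full sleep, or the number of
         * seconds left if a signal handler interrupted it. */
        remaining = sleep(remaining);
        if (remaining > 0) {
            interruptions++;  /* interrupted early; loop to finish the rest */
        }
    }
    return interruptions;
}
```

   A caller would use `sleep_full(timeout)` where it previously called
   `sleep(timeout)` once. Because `sleep` reports whole seconds, the total
   delay can end up slightly off when many signals arrive.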

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 0e8a97c598)
2019-10-02 14:49:47 -05:00
..
common pmix/cray: fix disable-dlopen problem 2016-11-21 13:45:10 -06:00
errmgr Remove the stale orte-dvm code 2018-10-30 07:54:35 -07:00
ess ess/pmi: Fix --enable-timing compilation error 2019-09-19 18:14:06 -04:00
filem mca: Dynamic components link against project lib 2017-08-24 11:56:16 -04:00
grpcomm Update orte 2018-01-25 08:53:43 -08:00
iof Update ORTE to support PMIx v3 2018-03-02 02:00:31 -08:00
odls Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion. 2019-10-02 14:49:47 -05:00
oob Cleanup stale code in ORTE/OOB 2019-09-26 15:21:15 -07:00
plm Fix tree spawn routed component issue 2019-08-29 16:26:43 -04:00
ras Fix typos 2019-08-07 05:51:29 -07:00
regx v4.0.x: regx/naive: add regx/naive component 2019-08-26 11:37:07 -04:00
rmaps Allow individual jobs to set their map/rank/bind policies 2019-08-07 05:51:06 -07:00
rml rml/ofi: remove 2019-02-19 10:27:47 -07:00
routed Fix tree spawn at scale 2019-06-04 09:49:01 -07:00
rtc orte-rmaps-base: update out-of-slots show_help message 2018-11-08 16:03:28 -05:00
schizo orterun: use consistent CLI option name for --bind-to 2018-06-21 08:22:00 -07:00
snapc pmix: added check for pmix fence status 2018-08-17 21:33:50 +06:00
sstore sstore/stage: fix parameter handling in sstore_stage_local_compress_waitpid_cb() 2018-01-04 09:33:46 +09:00
state Fix cross-mpirun connect/accept operations 2019-03-01 08:41:23 -08:00
Makefile.am Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
mca.h Purge whitespace from the repo 2015-06-23 20:59:57 -07:00