From 696bb4a0c016fd7eb7a3dfca544bd521d7e489be Mon Sep 17 00:00:00 2001
From: Josh Hursey
Date: Wed, 5 Jul 2006 19:37:29 +0000
Subject: [PATCH] A partial fix for the hanging orted bugs (Ticket #177)

When we force an application to terminate (via CTRL-C to mpirun) we send
an out-of-band message to the orted to reap its children. The fork PLS
was doing an internal waitpid but never releasing or updating the
information and signaling the condition variable.

So the fork PLS callback for SIGCHLD registered with the event library
and this waitpid are in a bit of a race to 'waitpid' for the children.
Since the PLS callback was the only one that handled the signal
properly, things were great when it 'won' -- as in the normal
termination case. But when it 'lost' -- as in the abnormal termination
case -- the orted never received the proper signal that its children
had gone away.

We want to preserve the internal fork PLS callback since it allows for a
timeout while waiting for the child, which the event library won't do.
This commit allows both to exist and behave properly.

This was introduced in r9068.

The ticket is still open since the orteds still hang in other
situations. This is a fix for one of the causes.

This commit was SVN r10662.

The following SVN revision numbers were found above:
  r9068 --> open-mpi/ompi@c2c2daa96657a186719f3cde74159028c3da5a16
---
 orte/mca/pls/fork/pls_fork_module.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/orte/mca/pls/fork/pls_fork_module.c b/orte/mca/pls/fork/pls_fork_module.c
index 3b20a3b24d..52031a0ac1 100644
--- a/orte/mca/pls/fork/pls_fork_module.c
+++ b/orte/mca/pls/fork/pls_fork_module.c
@@ -137,7 +137,7 @@ static void orte_pls_fork_kill_processes(opal_value_array_t *pids)
            the reaping of it.  If we get any other error back, just
            skip it and go on to the next process. */
         if (0 != kill(pid, SIGTERM) && ESRCH != errno) {
-            continue;
+            goto next_pid;
         }
 
         /* The kill succeeded.  Wait up to timeout_before_sigkill
@@ -154,7 +154,15 @@ static void orte_pls_fork_kill_processes(opal_value_array_t *pids)
                                "orte-pls-fork:could-not-kill",
                                true, hostname, pid);
             }
+            goto next_pid;
         }
+
+    next_pid:
+        /* Release any waiting threads from this process */
+        OPAL_THREAD_LOCK(&mca_pls_fork_component.lock);
+        mca_pls_fork_component.num_children--;
+        opal_condition_signal(&mca_pls_fork_component.cond);
+        OPAL_THREAD_UNLOCK(&mca_pls_fork_component.lock);
     }
 }
 
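
The release step in the patch above (decrement the child count and signal the condition variable on every exit path of the kill loop) can be sketched outside of Open MPI roughly as follows. This is only an illustration under stated assumptions: it uses plain POSIX threads rather than the OPAL_THREAD_LOCK/opal_condition_signal wrappers, a blocking waitpid rather than the timeout-based wait in the fork PLS, and the names child_tracker, wait_for_children, kill_and_release, run_demo and MAX_CHILDREN are hypothetical, not taken from the Open MPI tree.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 16   /* hypothetical cap for this sketch */

/* Hypothetical stand-in for mca_pls_fork_component: a lock, a condition
 * variable, and a count of children not yet accounted for. */
struct child_tracker {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             num_children;
};

/* Waiter: blocks until every child has been accounted for. */
static void *wait_for_children(void *arg)
{
    struct child_tracker *t = arg;
    pthread_mutex_lock(&t->lock);
    while (t->num_children > 0) {
        pthread_cond_wait(&t->cond, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* Kill and reap each pid, then update the shared count on every path. */
static void kill_and_release(struct child_tracker *t, const pid_t *pids, int n)
{
    for (int i = 0; i < n; ++i) {
        if (0 == kill(pids[i], SIGTERM)) {
            /* Blocking reap; the patched code waits with a timeout instead. */
            (void)waitpid(pids[i], NULL, 0);
        }
        /* Whether or not the kill worked, release any waiting thread. */
        pthread_mutex_lock(&t->lock);
        t->num_children--;
        pthread_cond_signal(&t->cond);
        pthread_mutex_unlock(&t->lock);
    }
}

/* Forks n idle children, terminates them, and returns the final count. */
static int run_demo(int n)
{
    struct child_tracker t;
    pid_t pids[MAX_CHILDREN];
    pthread_t waiter;
    int started = 0;

    assert(n >= 0 && n <= MAX_CHILDREN);
    pthread_mutex_init(&t.lock, NULL);
    pthread_cond_init(&t.cond, NULL);

    for (int i = 0; i < n; ++i) {
        pid_t pid = fork();
        if (0 == pid) {               /* child: idle until terminated */
            signal(SIGTERM, SIG_DFL);
            pause();
            _exit(0);
        }
        if (pid > 0) {                /* record only successful forks */
            pids[started++] = pid;
        }
    }
    t.num_children = started;

    int have_waiter = (0 == pthread_create(&waiter, NULL, wait_for_children, &t));
    kill_and_release(&t, pids, started);
    if (have_waiter) {
        pthread_join(waiter, NULL);   /* returns only once the count hits 0 */
    }

    int remaining = t.num_children;
    pthread_cond_destroy(&t.cond);
    pthread_mutex_destroy(&t.lock);
    return remaining;
}
```

Calling run_demo(3) from a main function (built with -pthread) should return 0 once the waiter thread has been released. If the decrement and signal were skipped on any path through the loop, as with the original `continue`, the waiter would block indefinitely, which is the kind of hang the commit message describes.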