A partial fix for the hanging orted bugs (Ticket #177)
When we force an application to terminate (via CTRL-C to mpirun), we send an out-of-band message to the orted telling it to reap its children. The fork PLS was doing an internal waitpid but never released or updated the bookkeeping information, and never signaled the condition variable. As a result, the fork PLS callback for SIGCHLD (registered with the event library) and this internal waitpid race to 'waitpid' for the children. The PLS callback was the only one of the two that handled the signal properly, so when it 'won' (as in the normal termination case) things were fine. When it 'lost' (as in the abnormal termination case), the orted never received the proper signal that its children had gone away.

We want to preserve the internal fork PLS callback since it allows for a timeout while waiting for the child, which the event library won't do. This commit allows both to exist and behave properly.

The problem was introduced in r9068. The ticket remains open because orteds still hang in other situations; this commit fixes only one of the causes.

This commit was SVN r10662.

The following SVN revision numbers were found above: r9068 --> open-mpi/ompi@c2c2daa966
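The pattern described in the commit message can be sketched in plain POSIX C. This is only an illustration under stated assumptions, not the Open MPI code: `child_tracker`, `wait_with_timeout`, and `kill_and_account` are hypothetical names, pthread primitives stand in for the OPAL lock and condition variable, and the escalation to SIGKILL is inferred from the `timeout_before_sigkill` name in the diff. The point it shows is that every exit path funnels through one label that decrements the child count and signals the condition variable, whichever waiter reaped the child.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for mca_pls_fork_component: a lock, a condition
 * variable, and a count of children that have not been accounted for. */
struct child_tracker {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             num_children;
};

/* Poll waitpid() with WNOHANG for up to timeout_ms milliseconds.
 * Returns 1 if there is nothing left to wait for, 0 on timeout. */
static int wait_with_timeout(pid_t pid, int timeout_ms)
{
    const struct timespec step = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    int waited_ms = 0;

    assert(timeout_ms >= 0);
    for (;;) {
        pid_t rc = waitpid(pid, NULL, WNOHANG);
        if (rc == pid) {
            return 1;   /* we reaped the child ourselves */
        }
        if (rc < 0 && errno != EINTR) {
            return 1;   /* e.g. ECHILD: another waiter already reaped it */
        }
        if (waited_ms >= timeout_ms) {
            return 0;   /* still running after the timeout */
        }
        nanosleep(&step, NULL);
        waited_ms += 10;
    }
}

/* Terminate one child, then always update the shared bookkeeping. */
static void kill_and_account(struct child_tracker *t, pid_t pid, int timeout_ms)
{
    if (0 != kill(pid, SIGTERM) && ESRCH != errno) {
        goto done;      /* could not signal it; skip the wait */
    }
    if (!wait_with_timeout(pid, timeout_ms)) {
        kill(pid, SIGKILL);        /* assumed escalation after the timeout */
        waitpid(pid, NULL, 0);
    }

done:
    /* Release any threads waiting on this child, on every path. */
    pthread_mutex_lock(&t->lock);
    t->num_children--;
    pthread_cond_signal(&t->cond);
    pthread_mutex_unlock(&t->lock);
}
```

A caller would initialize the tracker with the number of forked children and call `kill_and_account` once per pid; a thread blocked on `cond` until `num_children` reaches zero is then woken even when the SIGCHLD handler loses the race.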
Parent: fdba2c9df0
Commit: 696bb4a0c0
@@ -137,7 +137,7 @@ static void orte_pls_fork_kill_processes(opal_value_array_t *pids)
                the reaping of it. If we get any other error back, just
                skip it and go on to the next process. */
             if (0 != kill(pid, SIGTERM) && ESRCH != errno) {
-                continue;
+                goto next_pid;
             }
 
             /* The kill succeeded. Wait up to timeout_before_sigkill
@@ -154,7 +154,15 @@ static void orte_pls_fork_kill_processes(opal_value_array_t *pids)
                                "orte-pls-fork:could-not-kill",
                                true, hostname, pid);
                 }
+                goto next_pid;
             }
+
+        next_pid:
+            /* Release any waiting threads from this process */
+            OPAL_THREAD_LOCK(&mca_pls_fork_component.lock);
+            mca_pls_fork_component.num_children--;
+            opal_condition_signal(&mca_pls_fork_component.cond);
+            OPAL_THREAD_UNLOCK(&mca_pls_fork_component.lock);
         }
     }
 