From 3eb6c985420a6da9a53d0a7d2153500623216e7d Mon Sep 17 00:00:00 2001
From: Artem Polyakov
Date: Fri, 6 Jan 2017 10:24:26 +0200
Subject: [PATCH] orte/odls: Fix ORTE state machine for the non-zero exit case

This commit fixes a rare race condition that occurs when a process calling
`exit(-1)` has a delay between fd cleanup and the actual OS-level exit.
This may happen if the process has work to do in callbacks registered via
`on_exit()`.

**Problem description**:
Consider an application process that has called `exit(nonzero)`. Its fds
have been closed, but its actual termination at the OS level is delayed by
some cleanup (e.g. in callbacks registered via `on_exit()`).

The observed sequence of events was the following:
* orted detects the stdio disconnection and activates the `IOF COMPLETE`
  state.
* A parallel OOB disconnection causes the `COMMUNICATION FAILURE` state to
  be activated.
* During `COMMUNICATION FAILURE` processing,
  `odls_base_default_wait_local_proc` is called even though the real waitpid
  has not fired yet (the code mentions that waitpid might not be called for
  an unspecified reason). Because of that, the real exit code is unknown and
  is set to 0. The `odls_base_default_wait_local_proc` callback sees the
  `IOF COMPLETE` flag and, in conjunction with the 0 exit code, activates
  the `WAITPID FIRED` state.
* Processing of `WAITPID FIRED` leads to `NORMALLY TERMINATED` being
  activated.
* The `NORMALLY TERMINATED` state, in particular, causes the
  `ORTE_PROC_FLAG_ALIVE` flag to be dropped for this proc.
* When the application process finally exits, `wait_signal_callback` is
  launched. It sets the real exit code and calls
  `odls_base_default_wait_local_proc` again, but this time, because the
  process has had its `ORTE_PROC_FLAG_ALIVE` flag dropped, the
  `WAITPID FIRED` state is activated (instead of `EXITED WITH NON-ZERO`),
  leading to the hang that was observed.

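The status decoding that the fix relies on can be illustrated outside of ORTE. The following is a minimal sketch, not the ORTE code: the `sketch_state_t` enum and the `classify_status` and `run_child` helpers are hypothetical names invented for illustration, and it assumes a POSIX system. It shows that a raw `waitpid()` status has to be decoded with `WIFEXITED`/`WEXITSTATUS`, and that a non-zero decoded exit code should select a distinct state rather than fall through to the default one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for the ORTE proc states; NOT the real ORTE enum. */
typedef enum {
    SKETCH_STATE_WAITPID_FIRED,   /* default: nothing abnormal detected */
    SKETCH_STATE_TERM_NON_ZERO,   /* exited normally with a non-zero code */
    SKETCH_STATE_ABORTED_BY_SIG   /* terminated by a signal */
} sketch_state_t;

/* Decode a raw waitpid() status. The decoded exit code (or the signal
 * number) is stored in *code. */
sketch_state_t classify_status(int status, int *code)
{
    if (WIFEXITED(status)) {
        *code = WEXITSTATUS(status);
        /* The check this patch adds: a non-zero exit gets its own state. */
        if (0 != *code) {
            return SKETCH_STATE_TERM_NON_ZERO;
        }
        return SKETCH_STATE_WAITPID_FIRED;
    }
    if (WIFSIGNALED(status)) {
        *code = WTERMSIG(status);
        return SKETCH_STATE_ABORTED_BY_SIG;
    }
    *code = -1;  /* stopped/continued: not expected without WUNTRACED */
    return SKETCH_STATE_WAITPID_FIRED;
}

/* Fork a child that exits immediately with exit_code, reap it, and
 * classify the resulting status. */
sketch_state_t run_child(int exit_code, int *code)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (0 == pid) {
        _exit(exit_code);  /* child: skip atexit/on_exit handlers */
    }
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        exit(EXIT_FAILURE);
    }
    return classify_status(status, code);
}
```

In this sketch, `run_child(-1, &code)` reports `SKETCH_STATE_TERM_NON_ZERO` with `code == 255`, because only the low 8 bits of the exit value reach the parent.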
Signed-off-by: Artem Polyakov
---
 orte/mca/odls/base/odls_base_default_fns.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/orte/mca/odls/base/odls_base_default_fns.c b/orte/mca/odls/base/odls_base_default_fns.c
index b84321e239..c84ceceb1c 100644
--- a/orte/mca/odls/base/odls_base_default_fns.c
+++ b/orte/mca/odls/base/odls_base_default_fns.c
@@ -17,6 +17,7 @@
  * Copyright (c) 2013-2016 Intel, Inc. All rights reserved.
  * Copyright (c) 2014 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
+ * Copyright (c) 2017 Mellanox Technologies Ltd. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -1200,6 +1201,9 @@ void odls_base_default_wait_local_proc(orte_proc_t *proc, void* cbdata)
                              ORTE_NAME_PRINT(&proc->name),proc->exit_code));
         if (WIFEXITED(proc->exit_code)) {
             proc->exit_code = WEXITSTATUS(proc->exit_code);
+            if (0 != proc->exit_code) {
+                state = ORTE_PROC_STATE_TERM_NON_ZERO;
+            }
         } else {
             if (WIFSIGNALED(proc->exit_code)) {
                 state = ORTE_PROC_STATE_ABORTED_BY_SIG;