openmpi/orte/mca
Artem Polyakov 3eb6c98542 orte/odls: Fix ORTE state machine for the non-zero exit case
This commit fixes a rare race condition that occurs when a process
calling `exit(-1)` has a delay between fd cleanup and the actual
OS-level exit. This may happen if the process has work to do in
`on_exit()` callbacks.

**Problem description**:
Consider an application process that has called `exit(nonzero)`: its
fds have been closed, but its actual termination at the OS level is
delayed by cleanup work (e.g. in callbacks registered via `on_exit()`).
The observed sequence of events was the following:

* orted detects the stdio disconnection and activates the `IOF COMPLETE` state.
* A parallel OOB disconnection causes the `COMMUNICATION FAILURE` state to be
activated.
* During `COMMUNICATION FAILURE` processing, `odls_base_default_wait_local_proc`
is called even though the real waitpid has not yet been called (the code mentions
that waitpid might not be called for an unspecified reason). Because of that, the
real exit code is unknown and is set to 0. The `odls_base_default_wait_local_proc`
callback sees the `IOF COMPLETE` flag and, in conjunction with the 0 exit code,
activates the `WAITPID FIRED` state.
* Processing of `WAITPID FIRED` leads to `NORMALLY TERMINATED` being
activated.
* The `NORMALLY TERMINATED` state, among other things, drops the
`ORTE_PROC_FLAG_ALIVE` flag for this proc.
* When the application process finally exits, `wait_signal_callback` is
launched. It sets the real exit code and calls `odls_base_default_wait_local_proc`
again, but this time, since the process has the `ORTE_PROC_FLAG_ALIVE` flag
dropped, the `WAITPID FIRED` state is activated (instead of `EXITED WITH NON-ZERO`),
leading to the observed hang.
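
The sequence above can be replayed with a small, self-contained model. This is only a sketch: the types and functions below (`model_proc_t`, `model_wait_local_proc`, `replay_race`, `replay_in_order`) are hypothetical stand-ins named after the states in this message, not ORTE's real data structures or API, and the order of checks inside `model_wait_local_proc` is an assumption made to reproduce the described behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model states; names follow the commit message only. */
typedef enum {
    ST_RUNNING,        /* nothing activated yet */
    ST_WAITPID_FIRED,  /* leads on to NORMALLY TERMINATED */
    ST_EXITED_NONZERO  /* the state a failing process should reach */
} model_state_t;

/* Hypothetical per-process record. */
typedef struct {
    bool alive;        /* stands in for ORTE_PROC_FLAG_ALIVE */
    bool iof_complete; /* stands in for the IOF COMPLETE flag */
    int  exit_code;    /* 0 until a real exit code is recorded */
} model_proc_t;

/* Assumed decision logic of the wait callback, as described in the text:
 * a proc that is no longer marked alive never reaches the non-zero path. */
model_state_t model_wait_local_proc(const model_proc_t *p)
{
    assert(p != NULL);
    if (!p->alive) {
        return ST_WAITPID_FIRED;   /* real exit code is ignored here */
    }
    if (p->exit_code != 0) {
        return ST_EXITED_NONZERO;
    }
    return p->iof_complete ? ST_WAITPID_FIRED : ST_RUNNING;
}

/* Replays the racy ordering: the communication failure is processed
 * before the real waitpid result is available. */
model_state_t replay_race(int real_exit_code)
{
    model_proc_t p = { true, false, 0 };

    p.iof_complete = true;  /* stdio disconnection: IOF COMPLETE */
    p.exit_code = 0;        /* comm failure: exit code unknown, set to 0 */
    if (model_wait_local_proc(&p) == ST_WAITPID_FIRED) {
        p.alive = false;    /* NORMALLY TERMINATED drops the alive flag */
    }

    p.exit_code = real_exit_code;       /* the process finally exits */
    return model_wait_local_proc(&p);   /* second call, flag already gone */
}

/* Replays the ordering without the race: the real exit code is known
 * before the wait callback runs. */
model_state_t replay_in_order(int real_exit_code)
{
    model_proc_t p = { true, true, real_exit_code };
    return model_wait_local_proc(&p);
}
```

In this model, `replay_race(255)` ends in `ST_WAITPID_FIRED` even though the exit code is non-zero, while `replay_in_order(255)` ends in `ST_EXITED_NONZERO`. The difference comes entirely from the alive flag having been dropped by the earlier call made with the placeholder exit code of 0.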

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-01-06 11:12:55 +02:00
| Name | Last commit | Date |
| --- | --- | --- |
| .. | | |
| common | pmix/cray: fix disable-dlopen problem | 2016-11-21 13:45:10 -06:00 |
| dfs | Remove stale global variables | 2017-01-02 14:04:24 -08:00 |
| errmgr | Ensure jobs that fail always return a non-zero exit code. | 2016-12-14 09:41:06 -08:00 |
| ess | Remove stale global variables | 2017-01-02 14:04:24 -08:00 |
| filem | Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). | 2016-10-23 21:52:39 -07:00 |
| grpcomm | Sync the internal OMPI component to PMIx master | 2016-12-19 19:14:16 -08:00 |
| iof | Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node. | 2017-01-05 10:32:17 -08:00 |
| notifier | ORTE: update for the new opal_progress_thread API | 2015-08-07 10:13:40 -07:00 |
| odls | orte/odls: Fix ORTE state machine for the non-zero exit case | 2017-01-06 11:12:55 +02:00 |
| oob | Fix IOF when outputing to files - the remote orteds were failing to output stdout/err from their procs. | 2016-12-01 14:12:47 -08:00 |
| plm | Remove stale global variables | 2017-01-02 14:04:24 -08:00 |
| ras | pmix/cray: fix disable-dlopen problem | 2016-11-21 13:45:10 -06:00 |
| rmaps | Only instantiate the HWLOC topology in an MPI process if it actually will be used. | 2016-12-29 10:33:29 -08:00 |
| rml | Resolve a duplicate symbol issue when the rml/ofi component is enabled | 2016-12-05 13:41:38 -08:00 |
| routed | Fix the radix routed component to correctly handle connected tools - in such cases, the route must be direct to the tool. | 2016-11-01 19:03:26 -07:00 |
| rtc | Clean out old cruft from the ORCM project | 2016-09-21 00:13:30 -07:00 |
| schizo | Remove stale global variables | 2017-01-02 14:04:24 -08:00 |
| snapc | Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). | 2016-10-23 21:52:39 -07:00 |
| sstore | mca/base: add priority output to mca_base_select | 2015-10-19 12:32:41 -06:00 |
| state | Transfer some minor cleanups back from the PMIx reference server | 2016-12-23 08:46:04 -08:00 |
| Makefile.am | Purge whitespace from the repo | 2015-06-23 20:59:57 -07:00 |
| mca.h | Purge whitespace from the repo | 2015-06-23 20:59:57 -07:00 |