openmpi/orte/orted
Ralph Castain 2940309613
Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS:

1. a direct callback from waitpid - this set the waitpid_fired flag

2. a notify event callback from the IOF - this set the iof complete flag

3. a message via the daemon cmd processor from the proc "de-registering" the sync, thus indicating it was going through MPI_Finalize.

The problem is that these could overlap, with the first two allowing the orted to declare the proc complete before the daemon had responded to #3.

This change forces all three events to flow through the daemon cmd processor, thus ensuring they are handled in order. I'm not certain this will solve the problem, and will await further MTT reports to find out. Unfortunately, the problem doesn't show up in any manual or script-based tests I have been able to run, even when I duplicate the exact cmd that fails under MTT.
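
The ordering idea described above can be illustrated with a minimal sketch. It is not the ORTE code: every name in it (term_event_t, proc_state_t, cmd_queue_push, cmd_queue_drain) is hypothetical, and it assumes a proc that registered a sync, so completion waits for all three events. The three entry points only enqueue an event; a single drain loop, standing in for the daemon cmd processor, is the only place that updates flags and declares the proc complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event types, one per entry point named in the commit message. */
typedef enum {
    EV_WAITPID_FIRED,      /* 1. waitpid callback */
    EV_IOF_COMPLETE,       /* 2. IOF notify event */
    EV_SYNC_DEREGISTERED   /* 3. proc de-registered its sync */
} term_event_t;

/* Hypothetical per-proc state; the real ODLS child structure differs. */
typedef struct {
    bool waitpid_fired;
    bool iof_complete;
    bool deregistered;
    bool declared_complete;
} proc_state_t;

#define QUEUE_CAP 16

/* Fixed-size FIFO standing in for the daemon cmd processor's input. */
typedef struct {
    term_event_t events[QUEUE_CAP];
    size_t head;  /* count of events consumed so far */
    size_t tail;  /* count of events enqueued so far */
} cmd_queue_t;

/* Entry points call this instead of touching proc state directly.
 * Returns false if the queue is full. */
static bool cmd_queue_push(cmd_queue_t *q, term_event_t ev)
{
    assert(q != NULL);
    if (q->tail - q->head == QUEUE_CAP) {
        return false;
    }
    q->events[q->tail % QUEUE_CAP] = ev;
    q->tail++;
    return true;
}

/* Single consumer: handles events strictly in arrival order and is the
 * only code allowed to declare the proc complete. */
static void cmd_queue_drain(cmd_queue_t *q, proc_state_t *p)
{
    assert(q != NULL && p != NULL);
    while (q->head != q->tail) {
        term_event_t ev = q->events[q->head % QUEUE_CAP];
        q->head++;
        switch (ev) {
        case EV_WAITPID_FIRED:     p->waitpid_fired = true; break;
        case EV_IOF_COMPLETE:      p->iof_complete = true;  break;
        case EV_SYNC_DEREGISTERED: p->deregistered = true;  break;
        }
        /* Assumption: completion requires all three events. */
        if (p->waitpid_fired && p->iof_complete && p->deregistered) {
            p->declared_complete = true;
        }
    }
}
```

In this sketch, pushing only the waitpid and IOF events and then draining leaves declared_complete false; it becomes true only after the de-register event has also been drained, which is the overlap the commit is trying to rule out.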

This commit was SVN r20074.
2008-12-05 04:20:00 +00:00
help-orted.txt Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. 2007-07-12 19:53:18 +00:00
Makefile.am Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. 2008-06-18 03:15:56 +00:00
orted_comm.c Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS: 2008-12-05 04:20:00 +00:00
orted_main.c Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes: 2008-12-03 17:45:42 +00:00
orted.h Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message. 2008-02-28 19:58:32 +00:00