openmpi/orte
Ralph Castain 2940309613
Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS:
1. a direct callback from waitpid - this set the waitpid_fired flag

2. a notify event callback from the IOF - this set the iof complete flag

3. a message via the daemon cmd processor from the proc "de-registering" the sync, thus indicating it was going through MPI_Finalize.

The problem is that these could overlap, with the first two allowing the orted to declare the proc complete before the daemon had responded to #3.

This change forces all three events to flow through the daemon cmd processor, thus ensuring an ordered handling. I'm not certain this will solve the problem, but will await further MTT reports to see. Unfortunately, the problem doesn't show up on any manual or script-based tests I have been able to run, even when I duplicate the exact cmd that fails under MTT.
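The ordering idea described above can be illustrated with a minimal, language-neutral sketch in Python. It is not the ORTE code (which is C): the names `ProcState`, `CmdProcessor`, `post`, and `drain`, the event strings, and the rule that a proc is declared complete once both the waitpid and IOF flags are set are all assumptions made for illustration. The only point it shows is that when every event is posted to one FIFO processor instead of acting directly from its callback, the events are handled strictly in arrival order.

```python
from collections import deque


class ProcState:
    """Hypothetical per-proc bookkeeping; field names are illustrative."""

    def __init__(self):
        self.waitpid_fired = False
        self.iof_complete = False
        self.sync_answered = False
        self.declared_complete = False
        self.log = []  # order in which visible actions happened


class CmdProcessor:
    """Single FIFO queue standing in for the daemon cmd processor."""

    def __init__(self):
        self.queue = deque()

    def post(self, event, proc):
        # Callbacks only enqueue; they never touch proc state directly.
        self.queue.append((event, proc))

    def drain(self):
        # Events are handled one at a time, in the order they were posted.
        while self.queue:
            event, proc = self.queue.popleft()
            if event == "sync_deregister":
                proc.sync_answered = True
                proc.log.append("answered sync")
            elif event == "waitpid":
                proc.waitpid_fired = True
            elif event == "iof_complete":
                proc.iof_complete = True
            else:
                raise ValueError(f"unknown event: {event}")

            # Assumed completion rule: both termination flags are set.
            if (proc.waitpid_fired and proc.iof_complete
                    and not proc.declared_complete):
                proc.declared_complete = True
                proc.log.append("declared complete")


proc = ProcState()
processor = CmdProcessor()
processor.post("sync_deregister", proc)  # proc entering MPI_Finalize
processor.post("waitpid", proc)
processor.post("iof_complete", proc)
processor.drain()
print(proc.log)
```

Because the sync message was posted first, it is answered before the two termination events can trigger the completion check. With three independent callbacks acting on the state directly, no such ordering would be guaranteed.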

This commit was SVN r20074.
2008-12-05 04:20:00 +00:00
| Name | Last commit | Date |
|---|---|---|
| .. | | |
| etc | Many thanks to Ralf W. for finding a subtle bug in these Makefile.am's | 2008-06-04 01:28:03 +00:00 |
| include | Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code: | 2008-10-18 00:00:49 +00:00 |
| mca | Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS: | 2008-12-05 04:20:00 +00:00 |
| orted | Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS: | 2008-12-05 04:20:00 +00:00 |
| runtime | Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes: | 2008-12-03 17:45:42 +00:00 |
| test | This is a first step towards supporting fully-routed OOB communications: | 2008-10-31 21:10:00 +00:00 |
| tools | Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes: | 2008-12-03 17:45:42 +00:00 |
| util | Per request from IBM/Eclipse, provide MCA param to request output when nodes are resolved to a different nodename. This really only happens for the node that mpirun executes on, but they need the alert so they can do string matching of node names. | 2008-11-24 19:57:08 +00:00 |
| Doxyfile | Fix the broken Doxyfile so people can generate what little code base documentation we have :-) | 2006-04-13 12:52:17 +00:00 |
| Makefile.am | Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte. | 2008-03-05 22:44:35 +00:00 |