openmpi/orte/mca/state
Ralph Castain 34e5573988 Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination.
This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose.

Refs trac:4717

This commit was SVN r32063.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-21 17:09:02 +00:00
..
app The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees. 2014-05-08 02:01:35 +00:00
base Attempt to cleanup the race condition Rolf keeps encountering in MTT by adding some protection to ensure orted's try to terminate once their local procs die. Also, fix a problem whereby a failure to comm_spawn would result in a hang of the parent process. 2014-06-16 20:46:35 +00:00
hnp The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees. 2014-05-08 02:01:35 +00:00
novm Per RFC: 2014-06-01 16:14:10 +00:00
orted Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination. 2014-06-21 17:09:02 +00:00
staged_hnp Per RFC: 2014-06-01 16:14:10 +00:00
staged_orted This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces *very* long delays at scale. Just kill the procs and move on. 2014-06-17 17:57:51 +00:00
tool The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees. 2014-05-08 02:01:35 +00:00
Makefile.am Use the correct abstraction layer name for the data dirs 2014-05-08 14:32:24 +00:00
state_types.h Sorry for mid-day commit, but I had promised on the call to do this upon my return. 2012-04-06 14:23:13 +00:00
state.h Continue to resolve priority issues. Cleanup the case of forced termination in mpirun during launch processing by ensuring we can respond to socket closures, and ensuring that the remote daemons correctly close their sockets when terminating. 2014-03-13 04:02:24 +00:00