1
1
openmpi/opal/mca/crs/blcr
Josh Hursey 66af515061 Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized.
Short Version:
--------------
Event engine needs to be flushed so it does not use old/stale file descriptors.

Long Version:
-------------
The problem was that the restarted process was waiting for the socket to the local daemon to finish establishing during the 'sync' operation. The core problem was that the daemon was sending a header of 36 bytes, but the restarted process only received 35 bytes of the message. So the restarted process became stuck waiting for the last byte to arrive.

After many hours of digging, I figured out that the event engine was using the same file descriptor for its evsig_cb functionality (to signal itself when a signal arrives). So when the daemon wrote in to the new fd the event engine was stealing the first byte (*shakes fist at event engine*) before the recv() could be posted.

The solution is to use the event_reinit() function on restart to re-establish the now-stale file descriptors in the event engine. This seems to have fixed the problem.


A few other minor things:
-------------------------
 * Add a check to make sure the event engine is balanced in its init/finalize
 * Add the opal_event_base_close() to the BLCR restart exec function (still not 100% sure it is needed, but there it is).

This commit was SVN r24296.
2011-01-25 22:43:47 +00:00
..
configure.m4 WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. 2010-09-17 23:04:06 +00:00
crs_blcr_component.c After talking through the patch with Jeff, we have a couple more fixes to r21766 that should also go over to v1.3 in Ticket #1987. 2009-08-05 22:07:37 +00:00
crs_blcr_module.c Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized. 2011-01-25 22:43:47 +00:00
crs_blcr.h Add support for sending SIGSTOP the MPI job after the checkpoint is taken (uses a BLCR feature for the option). 2009-09-22 18:26:12 +00:00
help-opal-crs-blcr.txt Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). 2007-03-16 23:11:45 +00:00
Makefile.am WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. 2010-09-17 23:04:06 +00:00