openmpi/opal/runtime
Josh Hursey 66af515061 Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized.
Short Version:
--------------
Event engine needs to be flushed so it does not use old/stale file descriptors.

Long Version:
-------------
The problem was that the restarted process was waiting for the socket to the local daemon to finish establishing during the 'sync' operation. The core problem was that the daemon was sending a header of 36 bytes, but the restarted process only received 35 bytes of the message. So the restarted process became stuck waiting for the last byte to arrive.

After many hours of digging, I figured out that the event engine was using the same file descriptor for its evsig_cb functionality (to signal itself when a signal arrives). So when the daemon wrote to the new fd, the event engine was stealing the first byte (*shakes fist at event engine*) before the recv() could be posted.
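
The byte-stealing effect described above can be reproduced in isolation. The following is a minimal sketch, not Open MPI or libevent code: it assumes a plain POSIX socketpair stands in for the daemon connection, and the name bytes_left_after_steal is hypothetical. A second reader that believes the descriptor is its own wake-up channel consumes one byte, so the intended receiver sees one byte fewer than the 36-byte header.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define HEADER_LEN 36 /* header size quoted in the commit message */

/*
 * Hypothetical demo. Sends a HEADER_LEN-byte header over a socketpair,
 * lets a stand-in "signal callback" read one byte from the receiving end,
 * then reports how many bytes remain for the real receiver.
 * Returns that count, or -1 on error.
 */
int bytes_left_after_steal(void)
{
    int fds[2];
    char header[HEADER_LEN] = {0};
    char buf[HEADER_LEN];
    char stolen;
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }

    /* The "daemon" writes the complete header. */
    if (write(fds[0], header, sizeof(header)) == (ssize_t)sizeof(header) &&
        /* A reader sharing the descriptor drains a single byte first. */
        read(fds[1], &stolen, 1) == 1) {
        /* The intended receiver now finds a short header; it does not block
         * here only because MSG_DONTWAIT is set. */
        n = recv(fds[1], buf, sizeof(buf), MSG_DONTWAIT);
    }

    close(fds[0]);
    close(fds[1]);
    return (int)n;
}
```

A blocking receiver that insists on all 36 bytes would wait indefinitely for the missing one, which matches the hang described in the long version.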

The solution is to use the event_reinit() function on restart to re-establish the now-stale file descriptors in the event engine. This seems to have fixed the problem.
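
To illustrate the idea behind re-establishing descriptors on restart, here is a minimal sketch that does not call libevent at all. It assumes a toy engine whose only internal state is a wake-up socketpair; struct toy_engine and the toy_engine_* functions are hypothetical names, and the real event_reinit() may do considerably more than this.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for an event engine's internal wake-up channel. */
struct toy_engine {
    int wake_fds[2]; /* [0] is the read end, [1] is the write end */
};

/* Open a fresh wake-up channel. Returns 0 on success, -1 on error. */
int toy_engine_init(struct toy_engine *e)
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, e->wake_fds);
}

/*
 * Discard the old descriptors and open new ones, so the engine no longer
 * reads from descriptor numbers that other code may have reused.
 * This sketch assumes the old numbers still belong to the engine.
 */
int toy_engine_reinit(struct toy_engine *e)
{
    close(e->wake_fds[0]);
    close(e->wake_fds[1]);
    return toy_engine_init(e);
}

/* Signal the engine by writing one byte. Returns 0 on success. */
int toy_engine_wake(struct toy_engine *e)
{
    char b = 1;
    return write(e->wake_fds[1], &b, 1) == 1 ? 0 : -1;
}

/* Consume one pending wake-up byte without blocking. Returns 1 if one was read. */
int toy_engine_drain(struct toy_engine *e)
{
    char b;
    return recv(e->wake_fds[0], &b, 1, MSG_DONTWAIT) == 1 ? 1 : 0;
}
```

After toy_engine_reinit(), wake-ups travel only over the newly created pair, so a connection opened later cannot end up sharing a descriptor with the engine's private channel.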


A few other minor things:
-------------------------
 * Add a check to make sure the event engine is balanced in its init/finalize
 * Add the opal_event_base_close() to the BLCR restart exec function (still not 100% sure it is needed, but there it is).
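
One common way to check that init and finalize calls are balanced is a reference counter. The sketch below only illustrates that general technique; the toy_event_* names are hypothetical and the commit does not say how the actual check is implemented.

```c
#include <stdio.h>

/* Number of init calls not yet matched by a finalize call. */
static int toy_event_refcount = 0;

/* Record one more user of the engine. Real setup would run on the first call. */
int toy_event_init(void)
{
    toy_event_refcount++;
    return 0;
}

/*
 * Release one user of the engine. Returns -1 and leaves the counter
 * untouched if there is no matching init, which flags an unbalanced caller.
 */
int toy_event_finalize(void)
{
    if (toy_event_refcount <= 0) {
        fprintf(stderr, "toy_event_finalize: called without matching init\n");
        return -1;
    }
    toy_event_refcount--;
    return 0;
}

/* Returns 1 when every init has been matched by a finalize. */
int toy_event_is_balanced(void)
{
    return toy_event_refcount == 0;
}
```

A restart path could consult toy_event_is_balanced() before tearing the engine down, and treat a -1 from toy_event_finalize() as a bug in the caller.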

This commit was SVN r24296.
2011-01-25 22:43:47 +00:00
help-opal-runtime.txt Remove some old references to ft_enable parameter that no longer exists. 2007-03-17 20:02:42 +00:00
Makefile.am Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). 2007-03-16 23:11:45 +00:00
opal_cr.c Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized. 2011-01-25 22:43:47 +00:00
opal_cr.h Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. 2010-10-24 18:35:54 +00:00
opal_finalize.c Add a missing call to opal_sos_finalize in opal_finalize_util. 2011-01-20 23:18:02 +00:00
opal_init.c Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. 2010-10-24 18:35:54 +00:00
opal_params.c Remove stale mca param. Ensure that verbosity gets properly set for event framework debug 2010-11-13 15:37:17 +00:00
opal_progress.c Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. 2010-10-28 15:22:46 +00:00
opal_progress.h - Replace combinations of 2009-08-20 11:42:18 +00:00
opal.h Remove include/opal/sys/cache.h -- its only purpose in life was to 2010-07-06 14:33:36 +00:00