openmpi/opal
Josh Hursey 66af515061 Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized.
Short Version:
--------------
Event engine needs to be flushed so it does not use old/stale file descriptors.

Long Version:
-------------
The problem was that the restarted process was waiting for the socket to the local daemon to finish establishing during the 'sync' operation. The core problem was that the daemon was sending a header of 36 bytes, but the restarted process only received 35 bytes of the message. So the restarted process became stuck waiting for the last byte to arrive.

After many hours of digging, I figured out that the event engine was using the same file descriptor for its evsig_cb functionality (to signal itself when a signal arrives). So when the daemon wrote to the new fd, the event engine was stealing the first byte (*shakes fist at event engine*) before the recv() could be posted.

The solution is to use the event_reinit() function on restart to re-establish the now-stale file descriptors in the event engine. This seems to have fixed the problem.
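
The 35-of-36-byte symptom described above can be reproduced in isolation. The following is a minimal sketch, not code from Open MPI or libevent: `bytes_left_after_steal()` is a hypothetical helper that uses a plain socketpair to stand in for the daemon connection, and a one-byte read to stand in for the event engine's stale signal handler reading from the reused descriptor.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only, not an Open MPI or libevent API).
 *
 * Simulates two readers sharing one descriptor: a stale "signal" reader
 * consumes a single byte before the real receiver reads the header.
 * Returns the number of bytes the real receiver gets, or -1 on error
 * (including the case where nothing is left to read).
 */
ssize_t bytes_left_after_steal(size_t header_len)
{
    int sv[2];
    char buf[256];
    char stolen;
    ssize_t got = -1;

    if (header_len == 0 || header_len > sizeof(buf)) {
        return -1;
    }
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }

    memset(buf, 'h', header_len);

    /* The "daemon" side writes the complete header. */
    if (write(sv[0], buf, header_len) == (ssize_t)header_len) {
        /* The stale reader grabs the first byte off the same descriptor. */
        if (read(sv[1], &stolen, 1) == 1) {
            /* The real receiver takes whatever is left, without blocking. */
            got = recv(sv[1], buf, header_len, MSG_DONTWAIT);
        }
    }

    close(sv[0]);
    close(sv[1]);
    return got;
}
```

With a 36-byte header, the receiver in this sketch sees only 35 bytes, so a blocking receive for the full header would wait indefinitely for the last byte. Re-establishing the event engine's internal descriptors on restart (in libevent, `event_reinit()` on the event base) removes the second reader.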


A few other minor things:
-------------------------
 * Add a check to make sure the event engine's init/finalize calls are balanced
 * Add opal_event_base_close() to the BLCR restart exec function (still not 100% sure it is needed, but there it is).

This commit was SVN r24296.
2011-01-25 22:43:47 +00:00
asm Pushing the Debian patch (based on Manuel Prinz modifications). 2010-11-17 02:36:03 +00:00
class Update the multicast subsystem - ported from Cisco branch 2011-01-13 01:54:05 +00:00
config The fix for ticket #2560 was somehow removed in the 2010-11-15 21:41:56 +00:00
datatype Reshape the datatype engine. The basic types are built down in OPAL. MPI types are 2011-01-13 06:08:54 +00:00
dss The system headers are supposed to be protected by #ifdef and not by #if. 2009-07-16 18:27:33 +00:00
etc Ensure that platform-specific mca param files get installed with the correct default mca param filename. Platform-specific mca param files overwrite any pre-existing default mca param file as they are considered to be the "gold" standard if a platform file was provided. 2008-08-27 02:40:02 +00:00
include Fix the Sparc and Sparcv9 atomics based on Nicolai Stange 2010-12-03 19:16:53 +00:00
mca Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized. 2011-01-25 22:43:47 +00:00
memoryhooks - Replace combinations of 2009-08-20 11:42:18 +00:00
runtime Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized. 2011-01-25 22:43:47 +00:00
threads Add a "name" field to the condition wait object to help with debugging 2010-11-24 23:20:06 +00:00
tools Emit an error (instead of a SEGV) if the "compiler" parameter is not set 2010-12-21 19:01:39 +00:00
util Decode SOS error code before checking it with the native error code. 2011-01-20 23:21:38 +00:00
win32 Add support for nanosleep function using Sleep on Windows. The accuracy of the sleep function on Windows is 1 millisecond mentioned in MSDN doc. 2010-12-15 15:43:25 +00:00
CMakeLists.txt Convert the bad dos line endings to unix style for all windows related files. 2010-12-02 12:08:08 +00:00
Makefile.am Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. 2010-10-24 18:35:54 +00:00
win_makefile Move all win32 related files in opal, and modify all the Makefiles.am to 2005-12-08 06:17:15 +00:00