66af515061
Short Version: -------------- Event engine needs to be flushed so it does not use old/stale file descriptors. Long Version: ------------- The problem was that the restarted process was waiting for the socket to the local daemon to finish establishing during the 'sync' operation. The core problem was that the daemon was sending a header of 36 bytes, but the restarted process only received 35 bytes of the message. So the restarted process became stuck waiting for the last byte to arrive. After many hours of digging, I figured out that the event engine was using the same file descriptor for its evsig_cb functionality (to signal itself when a signal arrives). So when the daemon wrote in to the new fd the event engine was stealing the first byte (*shakes fist at event engine*) before the recv() could be posted. The solution is to use the event_reinit() function on restart to re-establish the now-stale file descriptors in the event engine. This seems to have fixed the problem. A few other minor things: ------------------------- * Add a check to make sure the event engine is balanced in its init/finalize * Add the opal_event_base_close() to the BLCR restart exec function (still not 100% sure it is needed, but there it is). This commit was SVN r24296.
Last updated: 15 Sep 2010 How to update the Libevent embedded in OPAL ------------------------------------------- OPAL requires some modification of the Libevent build system in order to properly operate. In addition, OPAL accesses the Libevent functions through a set of wrappers - this is done for three reasons: 1. Hide the Libevent functions. Some applications directly call libevent APIs and expect to operate against a locally installed library. Since the library used by OPAL may differ in version, and to avoid linker errors for multiply-defined symbols, it is important that the libevent functions included in OPAL be "hidden" from external view. Thus, OPAL's internal copy of libevent is built with visibility set to "hidden" and all access from the OPAL code base is done through "opal_xxx" wrapper API calls. In those cases where the system is built against a compiler that doesn't support visibility, conflicts can (unfortunately) arise. However, since only a very few applications would be affected, and since most compilers support visibility, we do not worry about this possibility. 2. Correct some deficiencies in the distributed Libevent configuration tests. Specifically, the distributed tests for kqueue and epoll support provide erroneous results on some platforms (as determined by our empirical testing). OPAL therefore provides enhanced tests to correctly assess those environments. 3. Enable greater flexibility in configuring Libevent for the specific environment. In particular, OPAL has no need of Libevent's dns, http, and rpc events, so configuration options to remove that code from Libevent have been added. The procedure for updating Libevent has been greatly simplified compared to prior versions in the OPAL code base by replacing file-by-file edits with configuration logic. Thus, updating the included libevent code can generally be accomplished by: 1. create a new opal/mca/event component for the updated version, using a name "libeventxxx", where xxx = libevent version. For example, libevent 2.0.7 => component libevent207 2. create a subdirectory "libevent" in the new component and unpack the new libevent code tarball into it. 3. copy the configure.m4, autogen.subdirs, component.c, moduule.c, Makefile.am, and .h files from a prior version to the new component. You will need to customize them for the new version, but they can serve as a good template. In many cases, you will just have to update the component name. 4. edit libevent/configure.in to add OMPI specific options and modes. Use the corresponding file from a prior version as a guide. The necessary changes are marked with "OMPI" comments. These changes have been pushed upstream to libevent, and so edits may no longer be required in the new version. 5. Modify libevent/Makefile.am. Here again, you should use the file from a prior version as an example. Hopefully, you can just use the file without change - otherwise, the changes will have to be done by hand. Required changes reflect the need for OMPI to turn "off" unused subsystems such as http. These changes have been pushed upstream to libevent, and so edits may no longer be required in the new version. 6. in your new component Makefile.am, note that the libevent headers are listed by name when WITH_INSTALL_HEADERS is given. This is required to support the OMPI --with-devel-headers configure option. Please review the list and update it to include all libevent headers for the new version.