openmpi/orte/mca
Josh Hursey 8f45fcb429 More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now.
 * Fix the checkpoint-restart-checkpoint case, which would previously reject the checkpoint of the newly restarted process. Re-enabling checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation).
 * In the case of an invalid checkpoint, do not try to access the SStore datastore, as it will be using a dummy handler and will return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer.
 * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process.
 * If ompi-checkpoint loses its connection to the HNP/mpirun, the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that prevents the ompi-checkpoint command from segfaulting in this error scenario.

This commit was SVN r24306.
2011-01-26 14:56:35 +00:00
..
db Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system. 2010-11-07 23:29:52 +00:00
debugger removed c99 test code 2011-01-25 23:02:35 +00:00
errmgr More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. 2011-01-26 14:56:35 +00:00
ess Convert the bad dos line endings to unix style for all windows related files. 2010-12-02 12:08:08 +00:00
filem Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. 2010-10-24 18:35:54 +00:00
grpcomm Convert the bad dos line endings to unix style for all windows related files. 2010-12-02 12:08:08 +00:00
iof Convert the bad dos line endings to unix style for all windows related files. 2010-12-02 12:08:08 +00:00
notifier Few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) 2011-01-13 20:13:49 +00:00
odls corrected a couple places in orte where it said cpu_model when it should have been cpu_type. 2011-01-11 19:56:26 +00:00
oob Few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) 2011-01-13 20:13:49 +00:00
plm Convert the bad dos line endings to unix style for all windows related files. 2010-12-02 12:08:08 +00:00
ras Convert the bad dos line endings to unix style for all windows related files. 2010-12-02 12:08:08 +00:00
rmaps Convert the bad dos line endings to unix style for all windows related files. 2010-12-02 12:08:08 +00:00
rmcast Update the multicast subsystem - ported from Cisco branch 2011-01-13 01:54:05 +00:00
rml Update the multicast subsystem - ported from Cisco branch 2011-01-13 01:54:05 +00:00
routed Convert the bad dos line endings to unix style for all windows related files. 2010-12-02 12:08:08 +00:00
sensor Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system. 2010-11-07 23:29:52 +00:00
snapc More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. 2011-01-26 14:56:35 +00:00
sstore Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. 2010-10-24 18:35:54 +00:00