Ralph Castain
97cfbdfc8c
Add debug
This commit was SVN r23514.
2010-07-27 16:55:16 +00:00
Ralph Castain
cfbfbb75a2
Don't abort if a race condition causes a nid not to be found
This commit was SVN r23513.
2010-07-27 16:20:58 +00:00
Ralph Castain
ff2d573f7e
Adjust some default values, and ensure we don't start sending too soon
This commit was SVN r23492.
2010-07-23 19:37:16 +00:00
Ralph Castain
95a908cb27
Provide run-time control over heartbeat
This commit was SVN r23490.
2010-07-23 17:11:12 +00:00
Ralph Castain
e7719f0aa4
Update platform files, adjust sensor heartbeat module selection rules
This commit was SVN r23477.
2010-07-22 21:50:46 +00:00
Ralph Castain
099c3aad97
Fix a major faux pas that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0.
This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all.
Thanks to Dong Ahn (LLNL) for catching this problem!
Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm.
This commit was SVN r23300.
2010-06-24 05:13:53 +00:00
Ralph Castain
7ce34223f1
Per off-list discussion, implement the new OMPI exit status policy (soon to be on wiki) and further clean up error reporting to cover new cases.
Implement process migration when failed nodes are detected. Some testing is still required.
This commit was SVN r23121.
2010-05-12 18:11:58 +00:00
Ralph Castain
2ff1ae13e1
Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted, as this is now done automatically if the --enable-heartbeat configure option is set.
This commit was SVN r23102.
2010-05-05 00:48:43 +00:00