
8 Commits

Author SHA1 Message Date
Wesley Bland
5fde3e0e00 Move the resilient orte errmgr code into a separate errmgr for now while it's
still unstable. Reverted errmgr modules back to the original errmgr (with the
updates since the resilient code was brought into the trunk).

This commit was SVN r24958.
2011-07-28 21:24:34 +00:00
Wesley Bland
e1ba09ad51 Add resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Josh Hursey
0eb3b3b7b0 Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits.
Per RFC:
  http://www.open-mpi.org/community/lists/devel/2011/06/9335.php

This commit was SVN r24775.
2011-06-15 13:10:13 +00:00
Ralph Castain
b98a2917ff Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it.
This commit was SVN r24412.
2011-02-18 02:48:12 +00:00
Josh Hursey
fabd5cc153 Simplification of the ErrMgr framework by removing the 'stack'/composite functionality.
The composite functionality was becoming difficult to maintain, so we removed it for now which simplifies the framework design considerably.

Since the 'crmig' and 'autor' components were -very- similar to the 'hnp' component, this commit also merges them together. By moving the 'crmig' and 'autor' to a separate file under the 'hnp' component we are able to isolate the C/R logic to a large extent, thus being only minimally hooked into the previous 'hnp' component.

So other than some name changes, the functionality is all still in place. I will update the C/R documentation later this morning.

This commit was SVN r23628.
2010-08-19 13:09:20 +00:00
Ralph Castain
099c3aad97 Fix a major faux pas that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0.
This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all.

Thanks to Dong Ahn (LLNL) for catching this problem!

Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm.

This commit was SVN r23300.
2010-06-24 05:13:53 +00:00
Ralph Castain
05e05089b8 Ignore failed comm connections if it is our connection that failed
This commit was SVN r23184.
2010-05-20 03:13:09 +00:00
Ralph Castain
4bd25f587c Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework. Add an "app" component to the errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so.

Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort.

This commit was SVN r23112.
2010-05-11 00:34:12 +00:00