Ralph Castain
|
247ba7e90d
|
Use the base function to claim a slot when fault groups are not defined
This commit was SVN r21681.
|
2009-07-15 11:28:58 +00:00 |
|
Ralph Castain
|
b97f885c00
|
Restore the original API to terminate individual processes instead of the entire job. It was originally removed because, at the time, we did not know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this API is now required. Modify the odls, plm, and orted functions to support this new functionality.
Continue work on the resilient mapper, completing support for fault groups.
This commit was SVN r21639.
|
2009-07-13 02:29:17 +00:00 |
|
Ralph Castain
|
b96a71b62e
|
Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected.
Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes.
This commit was SVN r21545.
|
2009-06-26 20:54:58 +00:00 |
|
Ralph Castain
|
e9fc0a74fb
|
Silence compiler warnings
This commit was SVN r21445.
|
2009-06-16 13:34:31 +00:00 |
|
Ralph Castain
|
d1dd8c2653
|
Ensure we accurately count the number of new daemons to be launched, especially if we are restarting processes.
Have the resilient mapper also set up for new daemons in case the PLM needs them.
This commit was SVN r21437.
|
2009-06-15 13:55:01 +00:00 |
|
Ralph Castain
|
c0c56e30c9
|
Add a missing function to the resilient mapper so it defines daemons in case they are needed
This commit was SVN r21428.
|
2009-06-12 19:48:13 +00:00 |
|
Ralph Castain
|
170327e575
|
Reorganize the rmaps components to collect the shared code for byslot and bynode mapping in the base, so it is no longer duplicated in every mapper
This commit was SVN r21424.
|
2009-06-12 17:52:17 +00:00 |
|
Ralph Castain
|
0a67bcb653
|
Minor cleanups
This commit was SVN r21387.
|
2009-06-06 15:44:00 +00:00 |
|
Ralph Castain
|
0336460b0a
|
Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes.
This commit was SVN r21385.
|
2009-06-06 01:08:47 +00:00 |
|
Ralph Castain
|
303e3a1d39
|
Add a resilient mapping capability. It currently maps by fault groups (if provided); the remapping capability for failed procs still needs to be added.
This commit was SVN r21350.
|
2009-06-02 03:23:20 +00:00 |
|