Josh Hursey
e4f2d03d28
ErrMgr Framework redesign to better support fault tolerance development activities.
Explained in more detail in the following RFC:
http://www.open-mpi.org/community/lists/devel/2010/03/7589.php
This commit was SVN r22872.
2010-03-23 21:28:02 +00:00
Ralph Castain
f66b6cae23
Enable the boot of an orted "virtual machine". Modify the mapper framework to allow mapping of only daemons. Remove the cm ras module as no longer required. Modify the orted code to always send back node arch info. Remove the "--enable-bootstrap" configure option as this feature will now always be available.
This commit was SVN r22480.
2010-01-25 22:25:13 +00:00
Ralph Castain
cec840f6b9
The ability to add procs to a running job was unfortunately borked when we added the detection of a proc exiting before calling init. Re-enable it here, ensuring that procs that are being restarted and/or added to a job do -not- call barrier during orte_init.
This commit was SVN r22404.
2010-01-14 17:59:42 +00:00
Ralph Castain
5e031d9ded
Let a restarted process have access to all known nodes instead of only those already in its prior job map
This commit was SVN r22225.
2009-11-19 19:45:11 +00:00
Ralph Castain
40e2299fa7
Test to ensure that num_procs was provided for the resilient mapper - it cannot be used with options like npernode.
Clean up the show_help text file.
This commit was SVN r22082.
2009-10-09 15:26:23 +00:00
Ralph Castain
dcab61ad83
Restore the prior default rank assignment scheme for round-robin mappers. Ensure that each app_context has sequential vpids.
This commit was SVN r22048.
2009-10-02 03:16:18 +00:00
Ralph Castain
646a3500a7
Correctly account for number of procs in the job
This commit was SVN r21843.
2009-08-20 00:07:38 +00:00
Ralph Castain
1dc12046f1
Modify the OMPI paffinity and mapping system to support socket-level mapping and binding. Mostly refactors existing code, with modifications to the odls_default module to support the new capabilities.
Adds several new mpirun options:
* -bysocket - assign ranks on a node by socket. Effectively load balances the procs assigned to a node across the available sockets. Note that ranks can still be bound to a specific core within the socket, or to the entire socket - the mapping is independent of the binding.
* -bind-to-socket - bind each rank to all the cores on the socket to which it is assigned.
* -bind-to-core - currently the default behavior (maintained from prior default)
* -npersocket N - launch N procs for every socket on a node. Note that this implies we know how many sockets are on a node. Mpirun will determine its local values. These can be overridden by provided values, either via an MCA param or in a hostfile.
Similar features/options are provided at the board level for multi-board nodes.
Documentation to follow...
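As a rough illustration of the -bysocket and -npersocket behavior described above, here is a minimal Python sketch. It is not the Open MPI implementation: the function names are hypothetical, and round-robin placement is only an assumed way of balancing the procs on a node across its sockets.

```python
# Hypothetical sketch of by-socket mapping on a single node.
# Not Open MPI code: names and the round-robin policy are assumptions.


def map_by_socket(ranks, num_sockets):
    """Spread the ranks assigned to one node across its sockets.

    Assumes simple round-robin placement, which balances the load so
    that socket counts differ by at most one.
    """
    if num_sockets < 1:
        raise ValueError("num_sockets must be at least 1")
    placement = {socket: [] for socket in range(num_sockets)}
    for index, rank in enumerate(ranks):
        placement[index % num_sockets].append(rank)
    return placement


def procs_for_npersocket(n, num_sockets):
    """Number of procs launched on a node when N procs go on every socket."""
    if n < 0 or num_sockets < 1:
        raise ValueError("n must be >= 0 and num_sockets >= 1")
    return n * num_sockets


if __name__ == "__main__":
    # Five ranks on a two-socket node.
    print(map_by_socket([0, 1, 2, 3, 4], 2))
    # Two procs per socket on a two-socket node.
    print(procs_for_npersocket(2, 2))
```

The sketch covers mapping only. Binding (to a core or to the whole socket) is a separate decision, as the commit message notes.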
This commit was SVN r21791.
2009-08-11 02:51:27 +00:00
Ralph Castain
c0e85a492c
Deleted one too many lines...might be good to set the value of oldnode!
Thanks George.
This commit was SVN r21702.
2009-07-16 18:49:24 +00:00
Ralph Castain
247ba7e90d
Use the base function to claim a slot when fault groups are not defined
This commit was SVN r21681.
2009-07-15 11:28:58 +00:00
Ralph Castain
b97f885c00
Restore the original API to terminate individual processes instead of the entire job. This was originally removed as we didn't at that time know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality.
Continue work on the resilient mapper, completing support for fault groups.
This commit was SVN r21639.
2009-07-13 02:29:17 +00:00
Ralph Castain
b96a71b62e
Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected.
Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes.
This commit was SVN r21545.
2009-06-26 20:54:58 +00:00
Ralph Castain
e9fc0a74fb
Silence compiler warnings
This commit was SVN r21445.
2009-06-16 13:34:31 +00:00
Ralph Castain
d1dd8c2653
Ensure we accurately count the number of new daemons to be launched, especially if we are restarting processes.
Have the resilient mapper also set up for new daemons in case the PLM needs them.
This commit was SVN r21437.
2009-06-15 13:55:01 +00:00
Ralph Castain
c0c56e30c9
Add a missing function to the resilient mapper so it defines daemons in case they are needed
This commit was SVN r21428.
2009-06-12 19:48:13 +00:00
Ralph Castain
170327e575
Reorg the rmaps components to collect shared code for byslot and bynode mapping in the base so we quit duplicating it in every mapper
This commit was SVN r21424.
2009-06-12 17:52:17 +00:00
Ralph Castain
0a67bcb653
Minor cleanups
This commit was SVN r21387.
2009-06-06 15:44:00 +00:00
Ralph Castain
0336460b0a
Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes.
This commit was SVN r21385.
2009-06-06 01:08:47 +00:00
Ralph Castain
303e3a1d39
Add a resilient mapping capability. It currently maps by fault groups (if provided); the remapping capability for failed procs still needs to be added.
This commit was SVN r21350.
2009-06-02 03:23:20 +00:00