Ralph Castain
646a3500a7
Correctly account for number of procs in the job
...
This commit was SVN r21843.
2009-08-20 00:07:38 +00:00
Ralph Castain
1dc12046f1
Modify the OMPI paffinity and mapping system to support socket-level mapping and binding. Mostly refactors existing code, with modifications to the odls_default module to support the new capabilities.
...
Adds several new mpirun options:
* -bysocket - assign ranks on a node by socket. Effectively load balances the procs assigned to a node across the available sockets. Note that ranks can still be bound to a specific core within the socket, or to the entire socket - the mapping is independent of the binding.
* -bind-to-socket - bind each rank to all the cores on the socket to which it is assigned.
* -bind-to-core - currently the default behavior (maintained from prior default)
* -npersocket N - launch N procs for every socket on a node. Note that this implies we know how many sockets are on a node. Mpirun will determine its local values. These can be overridden by provided values, either via MCA param or in a hostfile.
Similar features/options are provided at the board level for multi-board nodes.
Documentation to follow...
This commit was SVN r21791.
2009-08-11 02:51:27 +00:00
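To illustrate how mapping and binding stay independent in the commit above, here is a minimal Python sketch (not the ORTE code, which is C). It assumes ranks are dealt round-robin across sockets and that cores are numbered contiguously within each socket; the commit message confirms neither detail. The names map_by_socket, cpus_for, and procs_per_node are hypothetical.

```python
def map_by_socket(num_procs, num_sockets, cores_per_socket):
    """Assign local ranks to sockets round-robin (assumed policy for -bysocket)."""
    placement = []
    for rank in range(num_procs):
        socket = rank % num_sockets
        # Position of this rank among the ranks already placed on its socket.
        slot = rank // num_sockets
        # Wrap around if the socket holds more ranks than cores.
        core = slot % cores_per_socket
        placement.append((rank, socket, core))
    return placement


def cpus_for(entry, cores_per_socket, bind_to="core"):
    """Return the CPU ids a rank is bound to; binding is chosen independently of mapping."""
    _rank, socket, core = entry
    first = socket * cores_per_socket  # assumes contiguous core numbering per socket
    if bind_to == "socket":
        return list(range(first, first + cores_per_socket))
    return [first + core]


def procs_per_node(n_per_socket, num_sockets):
    """With -npersocket N, a node with S sockets hosts N * S procs."""
    return n_per_socket * num_sockets


if __name__ == "__main__":
    placement = map_by_socket(num_procs=4, num_sockets=2, cores_per_socket=2)
    for entry in placement:
        print(entry, cpus_for(entry, 2, "core"), cpus_for(entry, 2, "socket"))
```

The same placement list feeds either binding mode, which is the point of keeping the two steps separate.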
Ralph Castain
c0e85a492c
Deleted one too many lines...might be good to set the value of oldnode!
...
Thanks George.
This commit was SVN r21702.
2009-07-16 18:49:24 +00:00
Ralph Castain
247ba7e90d
Use the base function to claim a slot when fault groups are not defined
...
This commit was SVN r21681.
2009-07-15 11:28:58 +00:00
Ralph Castain
b97f885c00
Restore the original API to terminate individual processes instead of the entire job. This was originally removed because at the time we didn't know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality.
...
Continue work on the resilient mapper, completing support for fault groups.
This commit was SVN r21639.
2009-07-13 02:29:17 +00:00
Ralph Castain
b96a71b62e
Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected.
...
Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes.
This commit was SVN r21545.
2009-06-26 20:54:58 +00:00
Ralph Castain
e9fc0a74fb
Silence compiler warnings
...
This commit was SVN r21445.
2009-06-16 13:34:31 +00:00
Ralph Castain
d1dd8c2653
Ensure we accurately count the number of new daemons to be launched, especially if we are restarting processes.
...
Have the resilient mapper also set up for new daemons in case the PLM needs them.
This commit was SVN r21437.
2009-06-15 13:55:01 +00:00
Ralph Castain
c0c56e30c9
Add a missing function to the resilient mapper so it defines daemons in case they are needed
...
This commit was SVN r21428.
2009-06-12 19:48:13 +00:00
Ralph Castain
170327e575
Reorg the rmaps components to collect shared code for byslot and bynode mapping in the base so we quit duplicating it in every mapper
...
This commit was SVN r21424.
2009-06-12 17:52:17 +00:00
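For readers unfamiliar with the two policies named in the commit above, the following Python sketch contrasts them. It is only an illustration under simplifying assumptions: no oversubscription, and nodes described by a plain dict of slot counts. The function names are hypothetical and do not correspond to the rmaps base functions.

```python
def map_byslot(num_procs, slots):
    """Fill every slot on a node before moving on to the next node."""
    if num_procs > sum(slots.values()):
        raise ValueError("not enough slots (oversubscription not modeled)")
    mapping = []
    for node, count in slots.items():
        for _ in range(count):
            if len(mapping) == num_procs:
                return mapping
            mapping.append((len(mapping), node))
    return mapping


def map_bynode(num_procs, slots):
    """Place one rank per node in turn, skipping nodes whose slots are full."""
    if num_procs > sum(slots.values()):
        raise ValueError("not enough slots (oversubscription not modeled)")
    used = {node: 0 for node in slots}
    mapping = []
    while len(mapping) < num_procs:
        for node, count in slots.items():
            if len(mapping) == num_procs:
                break
            if used[node] < count:
                mapping.append((len(mapping), node))
                used[node] += 1
    return mapping


if __name__ == "__main__":
    nodes = {"node0": 2, "node1": 2}
    print(map_byslot(3, nodes))
    print(map_bynode(3, nodes))
```

Both functions share the same input shape and slot check, which hints at why collecting the logic in one place avoids duplicating it in every mapper.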
Ralph Castain
0a67bcb653
Minor cleanups
...
This commit was SVN r21387.
2009-06-06 15:44:00 +00:00
Ralph Castain
0336460b0a
Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes.
...
This commit was SVN r21385.
2009-06-06 01:08:47 +00:00
Ralph Castain
303e3a1d39
Add a resilient mapping capability. It currently maps by fault groups (if provided); the remapping capability for failed procs still needs to be added.
...
This commit was SVN r21350.
2009-06-02 03:23:20 +00:00