Ralph Castain
|
8ab962411c
|
Detect the scenario where one or more procs fail to call orte/ompi_init while others in the job do. This scenario can cause the job to hang, because MPI_Init contains a barrier operation that will not complete. Although ORTE does not contain such a barrier, the scenario is still treated as an error so that we can detect the MPI case; otherwise, ORTE has no knowledge of OMPI and wouldn't know how to differentiate the use-cases.
Take advantage of the changes to update the routed_base_receive code to avoid message overlap.
This commit was SVN r22329.
|
2009-12-17 19:39:53 +00:00 |
|
Ralph Castain
|
bc869636be
|
Reset the verbosity levels to suppress debug output
This commit was SVN r22095.
|
2009-10-13 15:29:38 +00:00 |
|
Ralph Castain
|
84cc847be8
|
Next phase of auto-wireup using multicast. Enable use of multicast groups to separate comm from different application groups. Have the orted bootstrap message go to a different rml tag so the node can be added to the pool.
This commit was SVN r22083.
|
2009-10-10 01:19:56 +00:00 |
|
Ralph Castain
|
709b36efb4
|
Cleanup auto-wireup and enable tools to "discover" the HNP via multicast
This commit was SVN r22012.
|
2009-09-25 01:00:09 +00:00 |
|
Ralph Castain
|
e554fc282d
|
Add some diagnostic output when daemons die
This commit was SVN r21960.
|
2009-09-09 18:16:50 +00:00 |
|
Ralph Castain
|
08e17b72cf
|
Break a circular logic loop in the cm routed module.
This commit was SVN r21708.
|
2009-07-17 18:07:35 +00:00 |
|
Ralph Castain
|
90a2db25e9
|
Modify the errmgr callback function so it passes the proc that failed instead of only the jobid.
Update the cm routed module to detect and pass orted failures.
This commit was SVN r21682.
|
2009-07-15 11:43:33 +00:00 |
|
Ralph Castain
|
b97f885c00
|
Restore the original API to terminate individual processes instead of the entire job. It was originally removed because, at that time, we didn't know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality.
Continue work on the resilient mapper, completing support for fault groups.
This commit was SVN r21639.
|
2009-07-13 02:29:17 +00:00 |
|
Ralph Castain
|
4a31c65126
|
Add initial support for CMs
This commit was SVN r21334.
|
2009-05-30 20:55:55 +00:00 |
|