This required modifying the errmgr.update_state API so that the pid could be passed to the function that updates the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all.
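As a rough illustration of the kind of change described above, here is a minimal sketch of a state-update call that carries the pid along with the new state. It is not the actual errmgr.update_state signature: the type and function names (toy_proc_record_t, toy_update_state) and the rule that a non-positive pid leaves the stored pid untouched are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* pid_t */

/* Hypothetical process states, not the real ORTE state constants. */
typedef enum { TOY_PROC_INIT, TOY_PROC_RUNNING, TOY_PROC_TERMINATED } toy_proc_state_t;

/* Hypothetical per-process data record. */
typedef struct {
    int              rank;
    pid_t            pid;
    toy_proc_state_t state;
} toy_proc_record_t;

/* Update the record for `rank` with the new state and, when the caller
 * knows it (pid > 0), the pid. Returns 0 on success, -1 if no record
 * matches the rank. */
static int toy_update_state(toy_proc_record_t *procs, size_t nprocs,
                            int rank, toy_proc_state_t state, pid_t pid)
{
    assert(procs != NULL || 0 == nprocs);
    for (size_t i = 0; i < nprocs; i++) {
        if (procs[i].rank == rank) {
            procs[i].state = state;
            if (pid > 0) {
                procs[i].pid = pid;   /* assumed rule: keep the old pid otherwise */
            }
            return 0;
        }
    }
    return -1;
}
```

Passing the pid in the same call lets the function that owns the records fill it in, so callers do not have to touch the records themselves.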
Thanks to Dong Ahn (LLNL) for catching this problem!
Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm.
This commit was SVN r23300.
(OMPI_ERR_* = OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be an
SOS-encoded error. OPAL_SOS_GET_ERR_CODE() takes an SOS error and returns
the native error code.
* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
(OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
decode 'ret' to get the native error code.
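A minimal sketch of why the comparison style matters, using a toy encoding rather than the real OPAL SOS macros. Every name here (TOY_SUCCESS, TOY_ERROR, toy_sos_encode, toy_sos_get_err_code) and the bit layout are hypothetical; the only property taken from the text above is that success survives encoding unchanged while other codes do not.

```c
#include <assert.h>

/* Hypothetical native codes standing in for OPAL_SUCCESS / OPAL_ERROR. */
enum { TOY_SUCCESS = 0, TOY_ERROR = -1 };

/* Toy encoding (an assumption, not the real SOS layout): keep the native
 * code's magnitude in the low 8 bits and pack a tag above it.
 * Success passes through unchanged. */
static int toy_sos_encode(int native, int tag)
{
    assert(native <= 0 && native > -256);
    assert(tag > 0 && tag < 1024);
    if (TOY_SUCCESS == native) {
        return TOY_SUCCESS;
    }
    return -((tag << 8) | (-native));
}

/* Recover the native code from a possibly encoded return value. */
static int toy_sos_get_err_code(int ret)
{
    if (ret >= 0) {
        return ret;
    }
    return -((-ret) & 0xFF);
}

/* Old style: misses encoded errors because ret no longer equals TOY_ERROR. */
static int failed_old_style(int ret) { return TOY_ERROR == ret; }

/* New style: works without decoding because success is preserved. */
static int failed_new_style(int ret) { return TOY_SUCCESS != ret; }
```

With this toy layout, an encoded TOY_ERROR is a different integer from TOY_ERROR, so only the second check catches it without a decode step.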
This commit was SVN r23162.
the errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so.
Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort.
This commit was SVN r23112.
Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status.
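The selection rule described above can be sketched as a scan in rank order. This is an illustration only, not the check_complete code: the function name and the plain array of per-rank exit statuses are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* statuses[i] is assumed to hold the exit status of rank i for a job
 * that terminated normally. Returns the first non-zero status in rank
 * order (i.e., the one from the lowest rank), or 0 if every process
 * exited with status 0. */
static int first_nonzero_exit_status(const int *statuses, size_t nprocs)
{
    assert(statuses != NULL || 0 == nprocs);
    for (size_t rank = 0; rank < nprocs; rank++) {
        if (0 != statuses[rank]) {
            return statuses[rank];
        }
    }
    return 0;
}
```

Scanning from rank 0 upward makes the result deterministic when several ranks exit with different non-zero codes.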
This commit was SVN r23071.
1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis
2. orte_max_local_restarts - default max number of local restarts, can be overridden
3. orte_max_global_restarts - default max number of relocates, can be overridden
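One way to picture "default, can be overridden on a per-job basis" is a resolver that starts from the global defaults and applies only the fields a job has set. This is a sketch under assumptions: the struct layouts, the TOY_UNSET sentinel, and toy_resolve_policy are hypothetical and do not reflect how ORTE stores these parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_UNSET (-1)   /* hypothetical marker for "job did not override" */

/* Effective policy; also used to hold the global defaults. */
typedef struct {
    bool enable_recovery;
    int  max_local_restarts;
    int  max_global_restarts;
} toy_recovery_policy_t;

/* Per-job overrides; each field is TOY_UNSET or an explicit value. */
typedef struct {
    int enable_recovery;      /* TOY_UNSET, 0 (off) or 1 (on) */
    int max_local_restarts;
    int max_global_restarts;
} toy_job_overrides_t;

/* Start from the defaults and apply only the fields the job has set. */
static toy_recovery_policy_t
toy_resolve_policy(const toy_recovery_policy_t *defaults,
                   const toy_job_overrides_t *job)
{
    assert(defaults != NULL);
    toy_recovery_policy_t p = *defaults;
    if (NULL == job) {
        return p;
    }
    if (TOY_UNSET != job->enable_recovery) {
        p.enable_recovery = (0 != job->enable_recovery);
    }
    if (TOY_UNSET != job->max_local_restarts) {
        p.max_local_restarts = job->max_local_restarts;
    }
    if (TOY_UNSET != job->max_global_restarts) {
        p.max_global_restarts = job->max_global_restarts;
    }
    return p;
}
```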
Implement the restart_proc API for the ODLS framework, and reorganize the default functions a little to avoid duplicating code.
This commit was SVN r23057.
Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly.
This commit was SVN r23045.
1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified.
2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set.
This support must be enabled by configuring with --enable-sensors.
ompi_info and orte-info have been updated to include the new framework.
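The file-activity check described above (no change over a specified number of sampling iterations) can be sketched as a small counter over successive samples. This is not the sensor component's code: the struct names, the fields sampled, and the idea that the caller supplies each sample (for example from stat()) are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* One sample of the monitored file (hypothetical layout). */
typedef struct {
    long long size;
    time_t    atime;
    time_t    mtime;
} toy_file_sample_t;

/* Monitor state: last sample seen plus a count of unchanged samples. */
typedef struct {
    toy_file_sample_t last;
    bool              have_last;
    int               unchanged;  /* consecutive samples with no change */
    int               limit;      /* unchanged samples before reporting */
} toy_file_monitor_t;

static void toy_monitor_init(toy_file_monitor_t *m, int limit)
{
    assert(m != NULL && limit > 0);
    m->have_last = false;
    m->unchanged = 0;
    m->limit = limit;
}

/* Feed one sample. Returns true when size, access time and modification
 * time have all stayed the same for `limit` consecutive samples, i.e.
 * when the error manager would be notified. */
static bool toy_monitor_sample(toy_file_monitor_t *m, toy_file_sample_t s)
{
    assert(m != NULL);
    if (m->have_last && s.size == m->last.size &&
        s.atime == m->last.atime && s.mtime == m->last.mtime) {
        m->unchanged++;
    } else {
        m->unchanged = 0;   /* any change resets the count */
    }
    m->last = s;
    m->have_last = true;
    return m->unchanged >= m->limit;
}
```

In a real sensor the sampling function would run once per sampling interval; here the interval is left to the caller.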
Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and the various ERRMGR framework globals are now organized into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized.
Implementation continues.
This commit was SVN r23043.
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.
* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.
* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.
* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons
* update the routed modules to reflect that process state is updated in the errmgr
* ensure that the orteds open the errmgr and select their appropriate module
* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place
* define a global orte_comm function to send messages from orteds to the HNP so that others can overlay the standard RML methods, if desired.
* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios
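The module split in the list above can be pictured as a table of function pointers, with the local launcher reporting state straight into whichever module is active. This is a simplified sketch, not the errmgr framework's code: all names are hypothetical, and the specific reactions (the HNP aborting the job on an aborted process, the daemon forwarding every update to the HNP) are assumptions chosen to show the dispatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_STATE_RUNNING, TOY_STATE_ABORTED } toy_state_t;
typedef enum {
    TOY_ACTION_NONE,
    TOY_ACTION_ABORT_JOB,
    TOY_ACTION_FORWARD_TO_HNP
} toy_action_t;

/* A module is a named set of entry points (hypothetical layout). */
typedef struct {
    const char   *name;
    toy_action_t (*update_state)(int rank, toy_state_t state);
} toy_errmgr_module_t;

/* HNP module: decides the job-level response itself. */
static toy_action_t hnp_update_state(int rank, toy_state_t state)
{
    (void)rank;
    return (TOY_STATE_ABORTED == state) ? TOY_ACTION_ABORT_JOB
                                        : TOY_ACTION_NONE;
}

/* Daemon module: assumed here to pass every update on to the HNP. */
static toy_action_t orted_update_state(int rank, toy_state_t state)
{
    (void)rank;
    (void)state;
    return TOY_ACTION_FORWARD_TO_HNP;
}

static const toy_errmgr_module_t toy_hnp_module   = { "hnp",   hnp_update_state };
static const toy_errmgr_module_t toy_orted_module = { "orted", orted_update_state };

/* Each process type selects its own module once at startup. */
static const toy_errmgr_module_t *toy_errmgr_select(bool is_hnp)
{
    return is_hnp ? &toy_hnp_module : &toy_orted_module;
}

/* Launcher side: report the state directly into the active module
 * instead of looping a message back through a command processor. */
static toy_action_t toy_odls_report(const toy_errmgr_module_t *active,
                                    int rank, toy_state_t state)
{
    assert(active != NULL && active->update_state != NULL);
    return active->update_state(rank, state);
}
```

The caller never needs to know which process type it is running in; the selected module decides what to do at each state.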
This commit was SVN r23023.
Remove the orcm errmgr module - moving to the orcm code base so it can utilize orcm communications and not interfere with ompi-related operations.
This commit was SVN r22931.
1. fix a bug that caused an infinite loop in configure when specifying want-ft but not want-ft-thread by removing a stale reference to the opal-progress-thread option
2. add want-ft=orcm so we can build the orcm errmgr component
3. clean up the use of "ompi_want_ft_xxx" and replace it with "opal_want_ft_xxx" so that naming conventions are preserved
This commit was SVN r22885.
Clean up the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits.
This commit was SVN r22783.
In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically. In CMake 2.8 this behavior has changed: it only adds the dependencies and no longer links, which causes link errors at build time.
This commit was SVN r22405.