Add a new function to opal_progress that tells us our recursion depth to support that solution.
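A minimal sketch of the idea, for illustration only: a counter is bumped on entry to the progress function and dropped on exit, and a query function returns it. Every name below (sketch_progress, sketch_progress_recursion_depth, sketch_nested_callback) is hypothetical and is not the actual opal_progress code; the sketch also assumes a single thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the progress engine; not the real opal_progress code. */
static int sketch_depth = 0;    /* sketch_progress() calls currently active */
static int sketch_max_seen = 0; /* deepest nesting observed by the callback */

typedef void (*sketch_callback_t)(void);

/* Report how many sketch_progress() frames are on the stack right now. */
int sketch_progress_recursion_depth(void)
{
    return sketch_depth;
}

/* Run one callback, tracking nesting on the way in and on the way out. */
void sketch_progress(sketch_callback_t cb)
{
    sketch_depth++;
    if (cb != NULL) {
        cb();
    }
    sketch_depth--;
    assert(sketch_depth >= 0);
}

/* Example callback: records the current depth and re-enters progress once. */
void sketch_nested_callback(void)
{
    int depth = sketch_progress_recursion_depth();
    if (depth > sketch_max_seen) {
        sketch_max_seen = depth;
    }
    if (depth < 2) {
        sketch_progress(sketch_nested_callback);
    }
}
```

A caller that must not act while nested could check that the depth is at most 1 before proceeding; that threshold is an assumption of this sketch.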
Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death...
Also ensure that when daemons fail we declare the daemon job, not the application job, as "failed". This is important so that orte knows that it cannot use xcast to tell daemons to "exit", and that it should not expect all daemons to respond. Otherwise, it is possible to hang.
After lots of testing, decided to default (again) to having slurm detect failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them.
Thank you, Jeff! :-/
This commit was SVN r18611.
Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher.
Neither detection is enabled by default: each is active only if its setting (heartrate or launch timeout, respectively) is set. SLURM is the exception, as orted failure is always detected and reported there.
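The opt-in behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the ORTE implementation: the struct, its field names, the use of 0 to mean "not set", and the rule that a launch has timed out once the timeout has elapsed with daemons still missing are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical settings; 0 means "not set", so that detection stays off. */
typedef struct {
    int heartrate_sec;              /* heartbeat interval, 0 = disabled */
    int launch_timeout_sec;         /* launch timeout, 0 = disabled */
    bool launcher_reports_failures; /* true for a launcher that reports orted failure itself */
} sketch_detect_cfg_t;

/* Heartbeat monitoring is active only when a heartrate was set. */
bool sketch_heartbeat_enabled(const sketch_detect_cfg_t *cfg)
{
    return cfg->heartrate_sec > 0;
}

/* Failed-to-start detection is active only when a launch timeout was set. */
bool sketch_launch_timeout_enabled(const sketch_detect_cfg_t *cfg)
{
    return cfg->launch_timeout_sec > 0;
}

/* True if the timeout is enabled, has elapsed, and daemons are still missing. */
bool sketch_launch_timed_out(const sketch_detect_cfg_t *cfg, int elapsed_sec,
                             int daemons_reported, int daemons_expected)
{
    assert(daemons_reported <= daemons_expected);
    return sketch_launch_timeout_enabled(cfg)
        && daemons_reported < daemons_expected
        && elapsed_sec >= cfg->launch_timeout_sec;
}

/* Some path exists for noticing a dead daemon after launch. */
bool sketch_orted_failure_detectable(const sketch_detect_cfg_t *cfg)
{
    return cfg->launcher_reports_failures || sketch_heartbeat_enabled(cfg);
}
```

With both values left at 0 and no launcher-side reporting, every check above returns false, which mirrors the "off unless set" default.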
More info to come on devel list.
This commit was SVN r18555.
Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occurs outside of a recv.
Created two new macros to assist:
ORTE_MESSAGE_EVENT: creates the zero-time event, passing info in a new orte_message_event_t object
ORTE_PROGRESSED_WAIT: while waiting for specified conditions, just calls progress so messages can be recv'd.
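To illustrate the pattern, here is a self-contained sketch that replaces the event library with a small ring buffer. It is not the ORTE code: all sketch_ names, the SKETCH_PROGRESSED_WAIT macro and its single-argument form, and the fields of sketch_message_event_t are assumptions made for the example. The recv callback only queues the work; the handler runs later, from progress, after the recv has returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_EVENTS 16

/* Hypothetical analogue of the message event object: just enough to process later. */
typedef struct {
    int sender;
    int tag;
} sketch_message_event_t;

typedef void (*sketch_handler_t)(const sketch_message_event_t *ev);

typedef struct {
    sketch_message_event_t ev;
    sketch_handler_t handler;
} sketch_pending_t;

static sketch_pending_t sketch_queue[SKETCH_MAX_EVENTS];
static size_t sketch_head = 0;
static size_t sketch_count = 0;
static bool sketch_in_recv = false;
static bool sketch_handler_ran_inside_recv = false;
static int sketch_daemons_reported = 0;

/* Stand-in for the zero-time event: queue the handler instead of calling it. */
bool sketch_message_event(int sender, int tag, sketch_handler_t handler)
{
    if (sketch_count == SKETCH_MAX_EVENTS) {
        return false; /* queue full; the caller decides what to do */
    }
    size_t slot = (sketch_head + sketch_count) % SKETCH_MAX_EVENTS;
    sketch_queue[slot].ev.sender = sender;
    sketch_queue[slot].ev.tag = tag;
    sketch_queue[slot].handler = handler;
    sketch_count++;
    return true;
}

/* One progress pass: run the handlers that were pending on entry. */
void sketch_progress_once(void)
{
    size_t n = sketch_count;
    while (n-- > 0) {
        sketch_pending_t p = sketch_queue[sketch_head];
        sketch_head = (sketch_head + 1) % SKETCH_MAX_EVENTS;
        sketch_count--;
        p.handler(&p.ev);
    }
}

/* The recv callback does no processing itself; it only schedules it. */
void sketch_recv_callback(int sender, int tag, sketch_handler_t handler)
{
    sketch_in_recv = true;
    (void)sketch_message_event(sender, tag, handler);
    sketch_in_recv = false;
}

/* Example handler: counts a daemon report and notes if it ran inside a recv. */
void sketch_process_report(const sketch_message_event_t *ev)
{
    (void)ev;
    if (sketch_in_recv) {
        sketch_handler_ran_inside_recv = true;
    }
    sketch_daemons_reported++;
}

/* Spin on progress until the condition holds, so queued messages get handled. */
#define SKETCH_PROGRESSED_WAIT(cond)      \
    do {                                  \
        while (!(cond)) {                 \
            sketch_progress_once();       \
        }                                 \
    } while (0)
```

Because the handler runs from sketch_progress_once rather than from the recv callback, it is free to post new receives or send messages without re-entering a recv. Note that this sketch would spin forever if the awaited condition never became true; a real wait would also need a failure or timeout exit.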
Also fixed the failed_launch function, as we no longer block in the orted callback function. Updated the error messages to reflect the revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output.
This has been tested on Mac, TM, and SLURM.
This commit was SVN r17647.