openmpi/orte/runtime
Ralph Castain 27a73ad9ee Fix a race condition between the orteds and HNP that can cause the orteds to output the "lost lifeline" message.
This has been a long-standing problem. I tried to reduce it by having the orteds tell the HNP they were finalizing, and having the HNP wait until all orteds had reported or we timed out.

What was observed was that all the orteds were correctly reporting that they were leaving, but the HNP was able to exit before the orteds, thus closing the orteds' lifeline socket and generating the error output. This is caused by the fact that the orteds have to whack all remaining session directories, which includes that blasted monster shared memory file! Cleaning up the SM file can take quite a while.

The HNP doesn't have that problem as there is no SM file there! So it gets out first.

What we had done in the past to resolve that problem was put a little test in the OOB that checks to see if we are finalizing. If we are, then we ignore the lifeline connection being lost. That check was still in the code - however, we had lost the line in orte_finalize that set the flag!!
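
The guard described above can be sketched roughly as follows. This is only an illustration of the pattern, not the actual ORTE code: the names `finalizing`, `on_lifeline_lost`, `runtime_finalize`, and the `lifeline_action_t` enum are all hypothetical stand-ins, since the commit message does not name the real flag or the OOB routine.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the "we are finalizing" flag.
 * The real ORTE variable name is not given in the commit message. */
bool finalizing = false;

/* Hypothetical result of handling a dropped lifeline connection. */
typedef enum { LIFELINE_IGNORED, LIFELINE_ERROR } lifeline_action_t;

/* Sketch of the check in the OOB: a lost lifeline is only an error
 * if this process has not already started to finalize. */
lifeline_action_t on_lifeline_lost(void)
{
    if (finalizing) {
        /* We are shutting down anyway, so the HNP exiting first is harmless. */
        return LIFELINE_IGNORED;
    }
    fprintf(stderr, "lost lifeline\n");
    return LIFELINE_ERROR;
}

/* Sketch of the finalize path: the flag must be set before the slow
 * cleanup (session directories, shared memory file) begins. This is
 * the line the commit message says had gone missing. */
void runtime_finalize(void)
{
    finalizing = true;
    /* ... slow cleanup of session directories would happen here ... */
}
```

In this sketch, a lifeline loss before `runtime_finalize()` is called is reported as an error, while the same event afterwards is silently ignored. Without the assignment in `runtime_finalize()`, the check in `on_lifeline_lost()` can never take the ignore path, which matches the failure described above.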

This commit was SVN r17893.
2008-03-20 13:30:51 +00:00
data_type_support Fix comm_spawn (maybe). 2008-03-06 21:56:00 +00:00
help-orte-runtime.txt Complete modifications for failed-to-start of applications. Modifications for failed-to-start of orteds coming next. 2007-04-24 20:53:54 +00:00
Makefile.am Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. 2008-03-17 17:58:59 +00:00
orte_cr.c This commit fixes the checkpoint/restart functionality on the trunk. Included in this commit are: 2008-03-05 04:57:23 +00:00
orte_cr.h Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
orte_data_server.c Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message. 2008-02-28 19:58:32 +00:00
orte_data_server.h Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
orte_finalize.c Fix a race condition between the orteds and HNP that can cause the orteds to output the "lost lifeline" message. 2008-03-20 13:30:51 +00:00
orte_globals_class_instances.h Support for new RMAPS rank mapping component 2008-03-18 09:39:07 +00:00
orte_globals.c local declaration instead of using global variable 2008-03-19 13:04:40 +00:00
orte_globals.h Support for new RMAPS rank mapping component 2008-03-18 09:39:07 +00:00
orte_init.c Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. 2008-03-17 17:58:59 +00:00
orte_locks.c Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. 2008-03-17 17:58:59 +00:00
orte_locks.h Add the declspec's in here so that they're visible. 2008-03-17 18:37:03 +00:00
orte_wait.c Restore r17703; it was accidentally removed as part of r17704. 2008-03-05 12:01:37 +00:00
orte_wait.h First cut at direct launch for TM. Able to launch non-ORTE procs and detect their completion for a clean shutdown. 2008-03-05 13:51:32 +00:00
orte_wakeup.c Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize. 2008-03-19 19:00:51 +00:00
orte_wakeup.h Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize. 2008-03-19 19:00:51 +00:00
runtime.h Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00