openmpi/ompi/mca
Josh Hursey 2c736873bb Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors.
The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shut down and restart the OOB completely before the OMPI level restarted its tcp connections. As a result, a socket descriptor used by the OMPI level at checkpoint could be assigned to the ORTE level on restart. The OMPI level had no knowledge that the socket descriptor it was previously using had been recycled, so it closed it on restart. This broke the ORTE level, as its newly created socket descriptor was closed without its knowledge.

The fix is to have the OMPI level shut down its tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.
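
The descriptor recycling described above can be reproduced outside Open MPI. The following is a minimal sketch, not code from this commit: `demo_stale_close` and `fd_is_open` are hypothetical names, the "upper" and "lower" layers merely stand in for the OMPI and ORTE levels, and it assumes a single-threaded process on a POSIX system, where a newly opened descriptor takes the lowest free number.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd refers to an open descriptor, 0 otherwise. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/*
 * Hypothetical demo, not Open MPI code.
 * Returns 1 if the stale close destroyed the lower layer's new socket,
 * 0 if the descriptor numbers did not collide, -1 if socket() failed.
 */
int demo_stale_close(void)
{
    /* The upper layer (stand-in for OMPI) owns a socket at checkpoint time. */
    int upper_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (upper_fd < 0) {
        return -1;
    }
    int stale = upper_fd;   /* the number the upper layer still remembers */
    close(upper_fd);        /* stands in for the descriptor becoming invalid */

    /* The lower layer (stand-in for ORTE) restarts first and opens a socket. */
    int lower_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (lower_fd < 0) {
        return -1;
    }
    if (lower_fd != stale) {
        close(lower_fd);
        return 0;           /* no collision, nothing to demonstrate */
    }

    /* The upper layer now cleans up the number it remembers. */
    close(stale);

    /* The lower layer's socket is gone without its knowledge. */
    return fd_is_open(lower_fd) ? 0 : 1;
}
```

Calling `demo_stale_close()` from a small `main` is enough to try it. Any later use of `lower_fd` would fail with `EBADF` or, worse, act on whatever descriptor is next assigned that number.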

Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.

 * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like), which is useful when you want to close, but not reopen, a component set for fear of interfering with a lower level.
 * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
 * Update ft_event functions in PML and BML to handle the new restart state.
 * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.
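
The ordering that the new state makes possible can be sketched as below. This is an illustrative model only, assuming a two-layer stack: the enum, `upper_ft_event`, `lower_ft_event`, and `coordinate_restart` are hypothetical stand-ins, and only the name `OPAL_CRS_RESTART_PRE` comes from this commit. The real ft_event functions in the PML, BML, and OOB are not reproduced here.

```c
#include <string.h>

/* Hypothetical stand-ins for the checkpoint/restart states. */
typedef enum {
    SKETCH_RESTART_PRE,  /* down slope: close only, do not reopen */
    SKETCH_RESTART       /* up slope: safe to reopen */
} sketch_state_t;

/* Records the order in which the layers act. */
static char trace[128];

static void note(const char *step)
{
    strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
}

/* Stand-in for the OMPI level (BTL tcp connections). */
static void upper_ft_event(sketch_state_t state)
{
    if (state == SKETCH_RESTART_PRE) {
        note("upper:close;");    /* drop descriptors before anyone reuses them */
    } else {
        note("upper:reopen;");
    }
}

/* Stand-in for the ORTE level (OOB tcp connections). */
static void lower_ft_event(sketch_state_t state)
{
    if (state == SKETCH_RESTART) {
        note("lower:restart;");  /* full shutdown and restart of the lower layer */
    }
}

/* Upper layer closes, lower layer restarts, upper layer reopens. */
const char *coordinate_restart(void)
{
    trace[0] = '\0';
    upper_ft_event(SKETCH_RESTART_PRE);
    lower_ft_event(SKETCH_RESTART);
    upper_ft_event(SKETCH_RESTART);
    return trace;
}
```

Because the upper layer releases its descriptors before the lower layer opens new ones, a recycled descriptor number can no longer be closed by a layer that does not own it.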

This commit was SVN r18276.
2008-04-24 17:54:22 +00:00
allocator Per long threads on the mailing list and much confusion discussion 2007-12-15 13:32:02 +00:00
bml Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors. 2008-04-24 17:54:22 +00:00
btl reverted r18169,r18170 due to connection reset by peer on odin/sif 2008-04-23 15:26:15 +00:00
coll accidentally made a change in the wrong place. 2008-04-23 17:32:05 +00:00
common Cleanup shared file creation on unix/linux. 2008-03-30 13:41:47 +00:00
crcp Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex. 2008-04-17 20:43:56 +00:00
dpm the port name is only relevant at the root, so only look at it there. 2008-04-17 12:37:10 +00:00
io Restore a placeholder to make non-SVN SCM's happy. 2008-02-28 20:19:22 +00:00
mpool Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. 2008-03-23 23:10:15 +00:00
mtl nasty memory bug... 2008-04-18 03:01:53 +00:00
osc Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex. 2008-04-17 20:43:56 +00:00
pml Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors. 2008-04-24 17:54:22 +00:00
pubsub Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory. 2008-04-16 14:27:42 +00:00
rcache Don't call free(), or library functions that may call free() inside (such as 2008-01-08 08:55:42 +00:00
topo Present state of MPI debugger work: 2008-03-05 12:22:34 +00:00