The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge.
The fix is to have the OMPI level shutdown tcp connections, allow the ORTE level to restart, and then allow the OMPi level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.
Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.
* Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level.
* Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
* Update ft_event functions in PML and BML to handle the new restart state.
* Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.
This commit was SVN r18276.
{{{
svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch .
}}}
Contains:
* Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart.
* Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff
* Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P.
* Cleanup mpi interface with missplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry
* Some other sundry cleanup items all dealing with C/R functionality in the trunk.
This commit was SVN r18241.
Fix the ompi-server -h cmd line option so it actually tells you something!
Add two new testing codes to the orte/test/mpi area: accept and connect.
This commit was SVN r18176.
Rational (taken from the code):
/* This is PITA. We never know which source address an
* incoming/outgoing packet will have, so even with
* btl_tcp_if_include/exclude on the remote end, we
* might get a different source address.
*
* If this address isn't included in btl_proc->proc_addrs,
* we would erroneously drop the connection
*/
merge -r18165:18167 to the trunk.
This commit was SVN r18169.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r18165
r18167
mca_btl_openib_endpoint_post_rr_nolock is freeing the endpoint lock on
the error case, but most/all of the functions calling this free the lock
regardless of its error case. Thus resulting is a double free of the
lock.
This commit was SVN r18131.
wring directory. The UH copyrights do belong into this file (i.e. because of
the fix which is in the 1.2 branch, the UH copyright notes are in the header
there alreary), but I want to have the proper log for that.
This commit was SVN r18124.
inter-communicator scatter, since the root (root==MPI_ROOT) might very well
have recvcount=0. The same fix has been applied to gather.c just the other way
round.
Fixes the bug reported on the mainling list by Martin Audet. If there is a
1.2.7 this fix might be worthwhile porting it over.
Please note, that while the test works now for basic and for inter, we get a
0byte malloc warning from the inter module, which we still have to fix in a
separate patch.
This commit was SVN r18122.