still-broken trunk build on common platforms (e.g., 64-bit Linux
RHEL4U4), I think it's clear that this code is not ready for
prime-time.
I'm backing out all the commits in the trunk/ompi/op tree from r17901
onwards. This code can be re-committed when it compiles and runs on
common platforms.
cd ompi/op
svn merge -r 17907:17900 https://svn.open-mpi.org/svn/ompi/trunk/ompi/op .
This commit was SVN r17908.
The following SVN revision numbers were found above:
r17901 --> open-mpi/ompi@b9520e61dc
operations. Added to the reduction operations a set of reduction
functions that take 2 input buffers and one output buffer to avoid
some extra memory copies. These can't be used with user-defined
operations. The Intel C collective suite passes with both the original
and the new functions (for the new ones, excluding the user-defined
operations).
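
To illustrate the difference between the two calling conventions, here is a minimal sketch for an integer sum. The function names and signatures are hypothetical and are not taken from the ompi/op tree; they only show why a separate output buffer can save a copy.

```c
#include <stddef.h>

/* Hypothetical two-buffer form: the result overwrites the second input.
 * A caller that must keep both inputs has to copy one of them first. */
static void sum_int_2buf(const int *in, int *inout, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
}

/* Hypothetical three-buffer form: two read-only inputs and a separate
 * output, so neither input needs to be copied beforehand. */
static void sum_int_3buf(const int *in1, const int *in2, int *out,
                         size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        out[i] = in1[i] + in2[i];
    }
}
```

With the three-buffer form both inputs are left untouched, which is the property that lets a collective skip the intermediate copy.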
This commit was SVN r17901.
portals btl has ownership and therefore didn't free the frag as it should); this
causes leaks and hangs in MPI_Finalize.
Also added a bit more debugging.
This commit was SVN r17900.
This has been a long-time problem. I tried to reduce the problem by having the orteds tell the HNP they were finalizing, and having the HNP wait until all orteds had reported or we timed out.
What was observed was that all the orteds were correctly reporting that they were leaving, but the HNP was able to exit before the orteds, thus closing the orteds' lifeline socket and generating the error output. This is caused by the fact that the orteds have to whack all remaining session directories, which includes that blasted monster shared memory file! Cleaning up the SM file can take quite a while.
The HNP doesn't have that problem as there is no SM file there! So it gets out first.
What we had done in the past to resolve that problem was put a little test in the OOB that checks to see if we are finalizing. If we are, then we ignore the lifeline connection being lost. That check was still in the code - however, we had lost the line in orte_finalize that set the flag!!
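
A minimal sketch of that kind of check, assuming a single process-wide flag. The names below are hypothetical stand-ins, not the actual OOB or orte_finalize code:

```c
#include <stdbool.h>

/* Hypothetical flag: set once when finalize begins. */
static bool finalizing = false;

/* Stand-in for the line in finalize that sets the flag. */
static void begin_finalize(void)
{
    finalizing = true;
}

/* Stand-in for the check made when the lifeline connection drops:
 * the loss is only treated as an error if we are not finalizing. */
static bool lifeline_loss_is_error(void)
{
    return !finalizing;
}
```

If the call that sets the flag goes missing, the check still compiles and runs but always reports an error, which matches the symptom described above.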
This commit was SVN r17893.
* The opal_sys_timer_get_cycles() call was implemented for
Sparc v9 using inline assembly, but not in the assembly files.
This would only currently matter on Linux Sparc systems using
a compiler that didn't support inline assembly (not many of
those), but it should be there for completeness.
* The linux timer component would always build on non-Alpha
platforms, rather than only building on platforms where
opal_sys_timer_get_cycles() was implemented. This would
only matter on a very narrow set of platforms that we don't
really support, but still, it could be more right. We now
only build the component on platforms where we have the
assembly call to get the cycle counter.
* Added a comment to opal/sys/timer.h to note that the linux
timer component needed to be updated if another platform was
added.
This should be harmless to commit. It will only really change
behaviors on platforms we don't have assembly support for, which
currently won't make it through configure. It really only matters
when (if?) we support atomic operations through libatomic_ops.
This commit was SVN r17887.
* Fix an error message to correctly display if we were before
MPI_INIT or after MPI_FINALIZE (refs trac:1243)
This commit was SVN r17873.
The following Trac tickets were found above:
Ticket 1243 --> https://svn.open-mpi.org/trac/ompi/ticket/1243
without calling a get or put. So, just keep it here until a better solution is
found.
This commit was SVN r17872.
The following SVN revision numbers were found above:
r17857 --> open-mpi/ompi@d460ccfbf9
Fix race conditions in abnormal terminations. We had done a first cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway.
So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked.
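
The guard pattern can be sketched as follows. This uses the C11 atomic_flag type as a stand-in and hypothetical names; it is not the lock type or the code actually committed here:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical guard: clear until the first caller claims it. */
static atomic_flag finalize_guard = ATOMIC_FLAG_INIT;

/* Counts how often the protected body actually ran. */
static int finalize_runs = 0;

/* Returns true only for the first caller. Every later caller, from
 * any code path or thread, sees the flag already set and backs off. */
static bool finalize_once(void)
{
    if (atomic_flag_test_and_set(&finalize_guard)) {
        return false;   /* someone else already started finalizing */
    }
    ++finalize_runs;    /* stand-in for the real teardown work */
    return true;
}
```

Because the test-and-set is a single atomic step, two paths racing into finalize_once() cannot both see the flag as clear.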
Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion.
This commit was SVN r17843.
This commit lowers the priority of the ''darwin'' backtrace component
below that of the ''execinfo'' and ''stackprint'' components, which
will cause OS X Leopard to use the ''execinfo'' component. execinfo
utilizes a public API for printing the stacktrace. The ''darwin''
component uses some evil hacks and a not-so-supported package from
Apple to print the stack trace.
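
As a rough illustration of why lowering a priority changes which component is used, here is a sketch of highest-priority selection among the components usable on a platform. The struct, the availability field, and the selection function are assumptions made for the example, not the real component selection code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical component descriptor. */
struct component {
    const char *name;
    int priority;      /* higher wins; values are illustrative */
    bool available;    /* whether it can be used on this platform */
};

/* Return the available component with the highest priority,
 * or NULL if none is available. */
static const struct component *
select_component(const struct component *list, size_t n)
{
    const struct component *best = NULL;
    for (size_t i = 0; i < n; ++i) {
        if (!list[i].available) {
            continue;
        }
        if (best == NULL || list[i].priority > best->priority) {
            best = &list[i];
        }
    }
    return best;
}
```

Under this model, once ''darwin'' has the lowest priority it is only chosen when nothing else is available.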
This commit was SVN r17840.