when CHECK_AND_RECYCLE detects an error, a message is displayed
if the error occurs on an intrinsic communicator, then abort
the program (instead of trying to free the communicator)
cmr=v1.8.3:reviewer=hjelmn
This commit was SVN r32659.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
It is essential to call mca_base_framework_close for every framework
that is opened. coll/ml was not doing this so neither bcol nor sbgp
were getting cleaned up. This commit fixes this omission.
Also fixed a leak caused by calling OBJ_DESTRUCT for something created
with OBJ_NEW. With these changes coll/ml appears to be valgrind clean.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31743.
The file coll_ml_ibarrier.c wasn't included in coll/ml's Makefile.am
and the setup code from coll_ml_hier_algorithms_ibarrier.c was not
being called. It looks like this code is stale and has long since been
replaced by the code in coll_ml_barrier.c
Once all these little CMRs are approved I may make it into one roll-up
CMR to make it easier on the RM.
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31418.
a segmentation fault in the reduce cleanup
Some of the changes address false warnings produced by scan-build. I
added asserts and changed some malloc calls to calloc to silence these
warnings.
The was one issue in cleanup for reduce since the component_functions
member is changed by the allreduce call. There may be other issues
with how this code works but releasing the allocated
component_functions after setting up the static functions addresses
the primary issue (SIGSEGV).
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31417.
Discussed this with Manju and we decided to back this one out until a later time.
This reverts commit r31188 and closes trac:4435
This commit was SVN r31282.
The following SVN revision numbers were found above:
r31188 --> open-mpi/ompi@f1dd589092
The following Trac tickets were found above:
Ticket 4435 --> https://svn.open-mpi.org/trac/ompi/ticket/4435
There were a couple of issues with the memory leak fixes and several more verbose
issues. This fixes those issues.
cmr=v1.8.1:ticket=trac:4473
This commit was SVN r31273.
The following Trac tickets were found above:
Ticket 4473 --> https://svn.open-mpi.org/trac/ompi/ticket/4473
Thanks to ggouaillardet for finding and fixing these issues.
Closes trac:4460
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31264.
The following Trac tickets were found above:
Ticket 4460 --> https://svn.open-mpi.org/trac/ompi/ticket/4460
The error doesn't prevent the user from running so there is no reason
to display it unless the user requested it (through coll_ml_verbose).
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31242.
a hierarchy actually matches a bcol that is in use.
There was a bug in one of the paths to calculate the ml buffer size. I fixed
the bug and squashed all the paths together to avoid further issues (the
result was correct in another path that calculated the same value).
Additionally, the i_hier was being used as the bcol_index. This is not
correct in a couple of cases so I added a variable to keep track of the
real bcol_index.
cmr=v1.8:reviewer=pasha
This commit was SVN r31189.
bound.
This case is correctly handled by coll/ml so remove the check that diables
coll/ml in the not bound case.
cmr=v1.8:reviewer=manjugv
This commit was SVN r31188.
This patch fixes two leaks:
- Fix typo in fallback collective code that caused coll/ml to retain
the ibcast module twice but only release it once. One of those ibcast
saves was supposed to be bcast.
- Do not check for module initialization in the module destructor. It
is possible to destruct a module that is partially setup.
cmr=v1.8:reviewer=manjugv
This commit was SVN r31187.
This isn't causing any errors that I know about but it does fix an
annoying valgrind warning. Simple fix, no review required.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31130.
There are situations where coll/ml does not initialize properly. These will
eventually need to be fixed but in the meantime it is better to not always
print an error message because the collective framework can still fall back
on another collective module. This commit reduces the verbose output.
cmr=v1.7.5:reviewer=manjugv
This commit was SVN r31129.
It is usually not a good idea to assert when something is not implemented
or something goes wrong. Replace asserts with debug output and return.
cmr=v1.7.5:reviewer=manjugv
This commit was SVN r31128.
Also fixed spelling: IS_NOT_RECHABLE -> IS_NOT_REACHABLE.
Also mark a few places where opal_show_help() should have been used;
Manju will take care of these.
This commit was SVN r31104.
In r31071 I modified the logic to not increment the hierarchy level if
no processes were selected by that sbgp. That fixed a problem seen on
systems where we don't support process binding. The problem is there
is a case where we actually did select processes yet the number of
selected processes is 0. We need to increment the hierarchy in this case
as well.
This should fix the segmentation fault found by recent MTT runs. Once
this is committed to 1.7.5 remove the .ompi_ignore's from coll/ml and
bcol/ptpcoll. Tested with ompi-tests/ibm.
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31081.
The following SVN revision numbers were found above:
r31071 --> open-mpi/ompi@1911d97044
This was causing JVMs to run out of stack space, and all manner of
badness ensued.
Instead, use the heap -- that's what it's there for.
cmr=v1.7.5:reviewer=rhc:subject=make coll/ml use the heap for large debug array
This commit was SVN r31073.
fails to select any processes on any nodes.
Also modified basesmsocket to only print debugging info to the framework
output.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31071.
This is hot-fix patch for the issue reported by Ralph.
In future we plan to restructure ml data structure layout.
Tested by Nathan.
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30619.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
This commit was SVN r30605.
The following SVN revision numbers were found above:
r30600 --> open-mpi/ompi@7d2c4cb468
r30602 --> open-mpi/ompi@9e751a0302
r30604 --> open-mpi/ompi@3012c280cf
Revision number ranges (suitable for "git log"):
r30602-30604 --> open-mpi/ompi@9e751a03^..3012c280
them, but it's going to take a little time (at least one day). So
Nathan says it's ok to .ompi_ignore coll ml until he's able to fix it.
This commit was SVN r30600.
Several changes are contained in this commit:
- Clean up tabs and trailing whitespaces
- Use consistent indentation in changed files
- Remove unused code. None of the removed code will ever have been
used in a trunk build.
- Clean up the smcm code quite a bit
- Do not fflush stderr and use opal_output instead of fprintf.
These changes have been tested on Cray XE-6 and PSM systems.
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30533.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
This commit fixes one warning that should have caused coll/ml to segfault
on reduce. The fix should be correct but we will continue to investigate.
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30477.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
After IM with Nathan, apply patch from ticket after verification by Paul Hargrove that it fixes the problem on non-x86 32-bit platforms
Verified by Paul, RM-approved
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30411.
The following Trac tickets were found above:
Ticket 4143 --> https://svn.open-mpi.org/trac/ompi/ticket/4143