WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communication are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with a few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
We were still leaking 1) file descriptors for data files, and 2) some
control files. I fixed both of these leaks and everything is looking
good. This should fix the bug where we are running out of file
descriptors when running the loop_spawn test. I also took the
opportunity to refactor the code a bit to make the mapping/unmapping
simpler. This should help avoid these sorts of issues in the future.
Depends on #4678
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31893.
If in_ptr is NULL, the MAP_FIXED flag cannot be passed to mmap.
This caused a hang in topology/cart and topology/sub from the ibm
test suite on trunk.
cmr=v1.8.2:reviewer=hjelmn
This commit was SVN r31890.
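A minimal sketch of the rule described in the commit above, assuming a hypothetical helper named map_region (not the actual basesmuma function): MAP_FIXED is requested only when the caller supplies a non-NULL address, otherwise the kernel chooses where to place the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `len` bytes of the open file `fd`.
 * `in_ptr` is an optional target address. MAP_FIXED is added only when
 * `in_ptr` is non-NULL; with a NULL address the kernel picks the
 * location. Returns MAP_FAILED on error, exactly like mmap(). */
static void *map_region(void *in_ptr, size_t len, int fd)
{
    int flags = MAP_SHARED;

    assert(len > 0);
    if (NULL != in_ptr) {
        flags |= MAP_FIXED;
    }
    return mmap(in_ptr, len, PROT_READ | PROT_WRITE, flags, fd, 0);
}
```

Callers still need to compare the result against MAP_FAILED and munmap the region when done.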
Basesmuma was vallocing space for control data then mmapping over that
data. Nothing in the code suggests any need for mmapping a specific
address so I did the following to remove the leak:
- Removed the valloc of the buffer space
- ftruncate the mmapped file to ensure there is sufficient memory to
allocate space for the control data.
Ideally this code should be using opal/shmem but that is a larger
change. Keeping it simple for now.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31822.
netpatterns_setup_narray_knomial_tree.
Fix a bug in ptpcoll that caused memory allocated by
netpatterns_setup_narray_knomial_tree to leak.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31781.
The items in the available bcol list were getting leaked. This commit
fixes this leak. I also cleaned up the code a bit. This includes
making use of the opal_argv_free function.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31744.
We were leaking file descriptors when coll/ml was in use. It turns out
this was because basesmuma was failing to unmap files it had previously
mapped. This commit cleans up the setup code to ensure that we only
attempt to map the control files once per module and then ensures the
files are unmapped when the module is released.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31737.
algorithm
Per suggestion from Manju make sure there isn't a gap in the size ranges
for the available algorithms.
cmr=v1.8.2:ticket=trac:4437:reviewer=ompi-rm1.8
This commit was SVN r31728.
The following Trac tickets were found above:
Ticket 4437 --> https://svn.open-mpi.org/trac/ompi/ticket/4437
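A gap check of the kind this commit implies can be sketched as follows. The size_range_t type and the function name are hypothetical and are not taken from the coll/ml sources; the sketch only assumes that each algorithm is registered for an inclusive range of message sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one algorithm handles messages whose size
 * falls in [min_size, max_size] bytes. */
typedef struct {
    size_t min_size;
    size_t max_size;
} size_range_t;

/* Return true when the ranges, given in ascending order, cover
 * [ranges[0].min_size, ranges[n-1].max_size] with no gap or overlap. */
static bool ranges_are_contiguous(const size_range_t *ranges, size_t n)
{
    assert(NULL != ranges && n > 0);

    for (size_t i = 0; i < n; ++i) {
        if (ranges[i].min_size > ranges[i].max_size) {
            return false;   /* malformed entry */
        }
        if (i > 0 && (ranges[i - 1].max_size == SIZE_MAX ||
                      ranges[i].min_size != ranges[i - 1].max_size + 1)) {
            return false;   /* gap or overlap between neighbours */
        }
    }
    return true;
}
```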
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
The only thing left is to treat ompilibdir similarly to what we did for srcdir/builddir. Coming soon.
This commit was SVN r31678.
Not closing this file descriptor will cause us to leak file
descriptors. It is safe to close the file after it has been mmapped.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31579.
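The pattern behind this fix can be sketched as below. The helper name map_file_and_close is hypothetical and this is not the basesmuma code; it only illustrates that a mapping outlives the descriptor used to create it, so the descriptor can be closed right after mmap().

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of the file at `path` (created
 * if missing) and close the descriptor before returning.
 * Returns MAP_FAILED on any error. */
static void *map_file_and_close(const char *path, size_t len)
{
    void *addr;
    int fd;

    assert(NULL != path && len > 0);

    fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        return MAP_FAILED;
    }

    /* Size the file first; touching pages past end-of-file raises SIGBUS. */
    if (0 != ftruncate(fd, (off_t) len)) {
        close(fd);
        return MAP_FAILED;
    }

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The mapping keeps its own reference to the file, so the descriptor
     * is closed here whether or not mmap() succeeded. */
    close(fd);
    return addr;
}
```

The region stays usable until munmap() is called; only the descriptor is released early.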
The algorithm was failing ibm/collective/allgather and iallgather. I
cleaned up the code to eliminate duplicate code paths and tracked the
issue down to an error in the way extra nodes in the knomial exchange
are handled. The new code is more compact and has been tested with up
to 64 ranks with the ibm test suite.
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31419.
Thanks to ggouaillardet for finding and fixing these issues.
Closes trac:4460
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31264.
The following Trac tickets were found above:
Ticket 4460 --> https://svn.open-mpi.org/trac/ompi/ticket/4460
This commit should finish the work started for #869. Closing that ticket
with this commit.
Closes trac:869
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31257.
The following Trac tickets were found above:
Ticket 869 --> https://svn.open-mpi.org/trac/ompi/ticket/869
When we are only using local ranks basesmuma needs to provide an allreduce
function for both large and small message or else the coll/ml selection
logic will fail. In the future this logic should probably be updated to
just disable allreduce in coll/ml instead of disabling coll/ml. For now
it should be correct to say the basesmuma allgather works for larger
messages.
cmr=v1.8:reviewer=manjugv
This commit was SVN r31190.
After discussion with Manju we decided to update the process count
limits of the shared memory collectives to an arbitrarily large number.
cmr=v1.7.5:ticket=trac:4405
This commit was SVN r31126.
The following SVN revision numbers were found above:
r31096 --> open-mpi/ompi@3f469d08e7
The following Trac tickets were found above:
Ticket 4405 --> https://svn.open-mpi.org/trac/ompi/ticket/4405
The initialization code did several allgathers on void *'s using
MPI_LONG_LONG_INT. This will produce the wrong result on 32-bit
platforms. Instead use MPI_BYTE with count = sizeof (void *).
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30627.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
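The sizing idea from this commit, emulated locally so it compiles without an MPI installation. The function allgather_bytes_local is a hypothetical stand-in, not an MPI or Open MPI call; the MPI_Allgather call in the comment shows the shape the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for an allgather, NOT an MPI call: rank r contributes
 * `count` bytes from send[r], stored at recv + r * count. With real MPI
 * each rank would instead call something like
 *   MPI_Allgather(&my_ptr, sizeof(void *), MPI_BYTE,
 *                 all_ptrs, sizeof(void *), MPI_BYTE, comm);
 * so the per-rank slot is exactly one pointer wide on both 32-bit and
 * 64-bit platforms. */
static void allgather_bytes_local(const void *const *send, size_t count,
                                  void *recv, size_t nranks)
{
    assert(NULL != send && NULL != recv && count > 0);

    for (size_t r = 0; r < nranks; ++r) {
        memcpy((char *) recv + r * count, send[r], count);
    }
}
```

Passing count = sizeof(void *) fills a void * array correctly regardless of pointer width, which a fixed 64-bit integer type does not guarantee.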
Found two bugs in basesmuma:
- Release all resources when tearing down the bcol module.
- Always call the allreduce in the smcm code. We do not know
beforehand whether all procs have all the files mapped.
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30623.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
This is a hot-fix patch for the issue reported by Ralph.
In the future we plan to restructure the ml data structure layout.
Tested by Nathan.
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30619.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
This commit was SVN r30605.
The following SVN revision numbers were found above:
r30600 --> open-mpi/ompi@7d2c4cb468
r30602 --> open-mpi/ompi@9e751a0302
r30604 --> open-mpi/ompi@3012c280cf
Revision number ranges (suitable for "git log"):
r30602-30604 --> open-mpi/ompi@9e751a03^..3012c280
opal does not always define MB. It is recommended that opal_atomic_[rw]mb is
called instead. We will need to address the cases where these functions are
no-ops on weak-memory ordered cpus.
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30534.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
Several changes are contained in this commit:
- Clean up tabs and trailing whitespaces
- Use consistent indentation in changed files
- Remove unused code. None of the removed code will ever have been
used in a trunk build.
- Clean up the smcm code quite a bit
- Do not fflush stderr and use opal_output instead of fprintf.
These changes have been tested on Cray XE-6 and PSM systems.
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30533.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
This commit fixes one warning that should have caused coll/ml to segfault
on reduce. The fix should be correct but we will continue to investigate.
cmr=v1.7.5:ticket=trac:4158
This commit was SVN r30477.
The following Trac tickets were found above:
Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158
After IM with Nathan, apply patch from ticket after verification by Paul Hargrove that it fixes the problem on non-x86 32-bit platforms
Verified by Paul, RM-approved
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30411.
The following Trac tickets were found above:
Ticket 4143 --> https://svn.open-mpi.org/trac/ompi/ticket/4143
allgather.
The new collectives provide a significant performance increase over tuned for
small and medium messages. We are initially setting the priority lower than
tuned until this has had some time to soak in the trunk. Please set
coll_ml_priority to 90 for MTT runs.
Credit for this work goes to Manjunath Gorentla Venkata (ORNL), Pavel Shamis (ORNL),
and Nathan Hjelm (LANL).
Commit details (for reference):
Import ORNL's collectives for MPI_Allreduce, MPI_Reduce, and MPI_Allgather.
We need to take the basesmuma header into account when calculating the
ptpcoll small message thresholds. Add a define to bcol.h indicating the
maximum header size so we can take the header into account while not
making ptpcoll dependent on information from basesmuma.
This resolves an issue with allreduce where ptpcoll overwrites the
header of the next buffer in the basesmuma bank.
Fix reduce and make a sequential collective launcher in coll_ml_inlines.h
The root calculation for reduce was wrong for any root != 0. There are
four possibilities for the root:
- The root is not the current process but is in the current hierarchy. In
this case the root is the index of the global root as specified in the
root vector.
- The root is not the current process and is not in the next level of the
hierarchy. In this case 0 must be the local root since this process will
never communicate with the real root.
- The root is not the current process but will be in next level of the
hierarchy. In this case the current process must be the root.
- I am the root. The root is my index.
Tested with IMB which rotates the root on every call to MPI_Reduce. Consider
IMB the reproducer for the issue this commit solves.
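The four cases above can be sketched as a single selection function. The level_view_t structure, its fields, and the function name are hypothetical and only mirror the wording of the list; the real coll/ml code derives this information from its own hierarchy data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one hierarchy level as seen by this process. */
typedef struct {
    int  my_index;            /* this process's index in the subgroup       */
    int  root_index;          /* global root's index in the subgroup, or -1 */
    bool root_in_next_level;  /* the global root will be in the next level  */
} level_view_t;

/* Pick the local root for one level, following the four cases listed
 * in the commit message. */
static int select_local_root(const level_view_t *lv)
{
    assert(NULL != lv && lv->my_index >= 0);

    if (lv->root_index == lv->my_index) {
        return lv->my_index;    /* I am the root */
    }
    if (lv->root_index >= 0) {
        return lv->root_index;  /* root is another member of this level */
    }
    if (lv->root_in_next_level) {
        return lv->my_index;    /* this process carries the data upward */
    }
    return 0;                   /* never talks to the real root */
}
```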
Make the bcast algorithm decision an enumerated variable
Resolve various assert failures when destructing coll ml requests.
Two issues:
- Always reset the request to be invalid before returning it to the
free list. This will avoid an assert in ompi_request_t's destructor.
OMPI_REQUEST_FINI does this (and also releases the fortran handle
index).
- Never explicitly construct or destruct the superclass of an opal
object. This screws up the class function tables and will cause
either an assert failure or a segmentation fault when destructing
coll ml requests.
Cleanup allgather.
I removed the duplicate non-blocking and blocking functions and modeled
the cleanup after what I found in allreduce. Also cleaned up the code
somewhat.
Don't bother copying from the send to the receive buffer in
bcol_basesmuma_allreduce_intra_fanin_fanout if the pointers are the
same.
This eliminates a warning about memcpy and aliasing and avoids an
unnecessary call to memcpy.
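A minimal sketch of the guard, using a hypothetical helper name rather than the code in bcol_basesmuma_allreduce_intra_fanin_fanout:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `len` bytes from the send buffer to the receive buffer unless the
 * operation is in place (both pointers identical). memcpy() has undefined
 * behaviour for overlapping regions, so the identical-pointer case is
 * skipped entirely; partially overlapping buffers are NOT handled here. */
static void copy_unless_in_place(void *rbuf, const void *sbuf, size_t len)
{
    assert(NULL != rbuf && NULL != sbuf);

    if (sbuf != rbuf) {
        memcpy(rbuf, sbuf, len);
    }
}
```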
Always call CHECK_AND_RELEASE on memsync collectives.
There was a call to OBJ_RETAIN on the collective communicator but
because CHECK_AND_RECYCLE was never called there was no matching call
to OBJ_RELEASE. This caused coll ml to leak communicators.
Make allreduce use the sequential collective launcher in coll_ml_inlines.h
Just launch the next collective in the component progress.
I am a little unsure about this patch. There appears to be some sort
of race between collectives that causes buffer exhaustion in some cases
(IMB Allreduce is a reproducer). Changing progress to only launch the
next bcol seems to resolve the issue but might not be the best fix.
Note that I see little to no performance penalty for this change.
Fix allreduce when there are extra sources.
There was an issue with the buffer offset calculation when there are
extra sources. In the case of extra sources == 1 the offset was set
to buffer_size (just past the header of the next buffer). I adjusted
the buffer size to take into account the maximum header size (see the
earlier commit that added this) and simplified the offset calculation.
Make reduce/allreduce non-blocking. This is required for MPI_Comm_idup
to work correctly.
This has been tested with various layouts using the ibm testsuite and
imb and appears to have the same performance as the old blocking version.
Fix allgather for non-contiguous layouts and simplify parsing the
topology.
Some things in this patch:
- There were several comments to the effect that level 0 of the
hierarchy MUST contain all of the ranks. At least one function
made this assumption but it was not true. I changed the sbgp
components and the coll ml initialization code to enforce this
requirement.
- Ensure that hierarchy level 0 has the ranks in the correct
scatter gather order. This removes the need for a separate
sort list and fixes the offset calculation for allgather.
- There were several passes over the hierarchy to determine
properties of the hierarchy. I eliminated these extra passes
and the memory allocation associated with them and calculate the
tree properties on the fly. The same DFS recursion also handles
the re-order of level 0.
All these changes have been verified with MPI_Allreduce, MPI_Reduce, and
MPI_Allgather. All functions now pass all IBM/Open MPI and IMB tests.
coll/ml: correct pointer usage for MPI_BOTTOM
Since contiguous datatypes are copied via memcpy (bypassing the convertor) we
need to adjust for the lb of the datatype. This corrects problems found testing
code that uses MPI_BOTTOM (NULL) as the send pointer.
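The adjustment can be sketched as below. The function and its `lb` parameter are hypothetical: `lb` stands for the datatype's lower bound in bytes, which the real code would obtain from the datatype rather than take as an argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `len` bytes of a contiguous datatype into `dst`. The source
 * address is the user buffer plus the datatype's lower bound `lb`.
 * When the user passes MPI_BOTTOM (NULL), `lb` carries the absolute
 * address of the data, so skipping this adjustment would read from
 * address 0. */
static void pack_contiguous(void *dst, const void *user_buf,
                            ptrdiff_t lb, size_t len)
{
    const char *src = (const char *) ((intptr_t) user_buf + (intptr_t) lb);

    assert(NULL != dst && NULL != src);
    memcpy(dst, src, len);
}
```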
Add fallback collectives for allreduce and reduce.
cmr=v1.7.5:reviewer=pasha
This commit was SVN r30363.