WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
Let's not make the move to OPAL any harder than it has to be; this
commit can wait until after the BTL move.
This commit was SVN r32316.
The following SVN revision numbers were found above:
r32315 --> open-mpi/ompi@7b7ed8ed97
CMR'ing just to (try to) keep the differences between trunk and v1.8
branch (somewhat) small.
Reviewed by Dave Goodell
cmr=v1.8.3:reviewer=ompi-rm1.8
This commit was SVN r32315.
Previously, we were only checking connectivity upon first ''send'' to
a peer. But this ignores the case where the first communication to a
peer is actually an ACK -- i.e., we successfully received something
from the peer and we need to send an ACK back. So we need to verify
that the ACK will actually get there.
Specifically, certain asymmetric routing cases can lead to a hang if
we don't check the connectivity in both directions. E.g., if the
sender is able to get traffic to the receiver, but the receiver is
unable to get traffic back to the sender because it made a different
routing decision than the sender.
In this case, the connectivity checker from the sender could succeed
(because the connectivity checker will ACK along the same path in
which the ping was received), but sending a BTL ACK could fail
(because the BTL ACK will be sent back along the path chosen by the
graph algorithm, which, in an erroneous asymmetric routing scenario,
may be different/wrong).
Hence, we want to trigger the connectivity checker at the first
communication from A->B, which may either be a BTL send or an ACK.
Reviewed by Dave Goodell.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32309.
Ensure that target directories exists before creating symlinks.
cmr=v1.8.2:reviewer=jsquyres
Thanks Jeff to step up as an reviewer.
This commit was SVN r32305.
- Portals4/OSC was unable to acquire an exclusive lock due to an invalid
local address in the atomic operation. This caused the reported hang.
- After fixing the hang, the test continued to fail because
ompi_datatype_is_contiguous_memory_layout() reports that MPI_EMPTY (the
origin datatype) is noncontiguous and Portals4/OSC does not support
noncontiguous datatypes at this time. However, in this case the origin
count is zero so contiguous/noncontiguous is irrelevant. Now we skip
the contiguous check if the count is zero.
cmr=v1.8.3:reviewer=regrant:subject=Fix for "Portals4/MTL hangs in c_get_accumulate test"
This commit was SVN r32295.
The following Trac tickets were found above:
Ticket 4662 --> https://svn.open-mpi.org/trac/ompi/ticket/4662
Fix a copy-n-paste error: the ompi/pompi interfaces should not have
optional ierror arguments. Optional ierror arguments are only used in
the MPI_<foo> interfaces. The ompi/pompi interfaces are the actual
underlying routines (in C, incidentally, which is why they're declared
as BIND(C)), and do not have optional ierror arguments.
Also fix a typo in the BIND(C) name for pompi_win_shared_query_f().
cmr=v1.8.2:reviewer=ggouaillardet
This commit was SVN r32287.
Rever r32246, r32254, and 32255 -- they were fixing side-effects of
the real bug. Real fix coming after this one.
This commit was SVN r32286.
The following SVN revision numbers were found above:
r32246 --> open-mpi/ompi@08d2a1a48d
r32254 --> open-mpi/ompi@232d4dbb7b
QA ran across the case where the user can't write to the target
directory for the connectivity map file. In this case, we silently
continued. They requested that we at least warn in this case.
Fixes Cisco bug CSCup62821
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32283.
Description:
This mod fixes a regression in the ugni btl eager get
path introduced in changeset 32196.
References:4800
Closes:4800
cmr=v1.8.2:reviewer=hjelmn
This commit was SVN r32264.
The logic was mishandling the case of a newer kernel and an older
libusnic_verbs. Simplify usnic_transport() to return constants in the
2 known cases (not a usNIC device and the TRANSPORT_USNIC_UDP case),
and call the magic probe in all other cases.
Reviewed-by: Dave Goodell <dgoodell@cisco.com>
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32260.
If we don't explicitly declare that (a == NULL && b == NULL) is
equivalent to qsort, we could end up with wonky sorting order. I.e.,
it's *possible* that some NULLs could end up in the middle of the
array.
Regardless of whether it will ever happen in practice, it makes the
code more clear to also handle the "both are NULL" case.
Also fix the 2-spacing indents.
Reviewed by Dave Goodell.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32259.
Simplify and fix the r32246
cmr=v1.8.2:ticket=trac:4792
This commit was SVN r32254.
The following SVN revision numbers were found above:
r32246 --> open-mpi/ompi@08d2a1a48d
The following Trac tickets were found above:
Ticket 4792 --> https://svn.open-mpi.org/trac/ompi/ticket/4792
ABSoft compilers cannot compile a fortran subroutine
with the BIND(C, NAME="name") modifier *and* argument(s)
with the OPTIONAL modifier
This patch detects this unsupported feature and use
adhoc wrappers if it is missing
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r32246.
The wrong descriptor field was used when calculating the size received when
using the RDMA rendevous protcol.
This commit was SVN r32232.
The following SVN revision numbers were found above:
r32196 --> open-mpi/ompi@a14e0f10d4
new automake requires subdirs-object directive, to resolve this:
09:43:37 automake: warning: possible forward-incompatibility.
09:43:37 automake: At least a source file is in a subdirectory, but the 'subdir-objects'
09:43:37 automake: automake option hasn't been enabled. For now, the corresponding output
09:43:37 automake: object file(s) will be placed in the top-level directory. However,
09:43:37 automake: this behaviour will change in future Automake versions: they will
09:43:37 automake: unconditionally cause object files to be placed in the same subdirectory
09:43:37 automake: of the corresponding sources.
09:43:37 automake: You are advised to start using 'subdir-objects' option throughout your
09:43:37 automake: project, to avoid future incompatibilities.
09:43:37 tools/otfmerge/Makefile.common:13: warning: source file '$(OTFMERGESRCDIR)/otfmerge.c' is in a subdirectory,
09:43:37 tools/otfmerge/Makefile.common:13: but option 'subdir-objects' is disabled
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32225.
Always include into the tarball (aka 'make dist') :
- mpi-f90-interfaces.h
- mpi-f90-cptr-interfaces.F90
cmr=v1.8.2:ticket=trac:4736
This commit was SVN r32215.
The following Trac tickets were found above:
Ticket 4736 --> https://svn.open-mpi.org/trac/ompi/ticket/4736
Handle OMPI_REQUEST_NOOP in MPI_Startall rather than PML
cmr=v1.8.2:reviewer=bosilca:ticket=4764
This commit was SVN r32213.
The following Trac tickets were found above:
Ticket 4764 --> https://svn.open-mpi.org/trac/ompi/ticket/4764
The connectivity map output routine needs to handle the case where
entries in the endpoints array are NULL (e.g., if one process has 2
endpoints and another process has only 1 endpoint).
Fixes Cisco bug CSCup83649.
cmr=v1.8.2
This commit was SVN r32211.
Older gfortran compilers (e.g., the gfortran that ships in RHEL5) do
not support ISO_C_BINDING, and therefore do not support the
TYPE(C_PTR) type. As such, they cannot support the overloaded
interfaces for MPI_WIN_ALLOCATE_SHARED and MPI_SHARED_QUERY that are
mandated in MPI-3.
So we separate those interfaces out into a separate .F90 file that is
#include'd in the tkr mpi.F90 file. In this separate .F90 file, we
use an #if to determine whether the compiler supports ISO_C_BINDING or
not.
Also re-jiggered the order of testing in ompi_setup_mpi_fortran.m4: we
now need to test whether the compiler supports ISO_C_BINDING even when
we're only building the mpi module (not strictly when we're building
the mpi_f08 module).
Finally, tweaked the use-mpi-tkr/Makefile.am to:
* Add some proper dependencies for mpi.F90
* Allow the general AM compilation to be used instead of supplying a
specific rule for compiling mpi.F90
cmr=v1.8.2:ticket=trac:4736
This commit was SVN r32204.
The following Trac tickets were found above:
Ticket 4736 --> https://svn.open-mpi.org/trac/ompi/ticket/4736
We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch. which required modification
s to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunti
ng.
This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't curren
tly have a way of tagging the collective to a specific thread. From the comment in the code:
* In order to allow scalable
* generation of collective id's, they are formed as:
*
* top 32-bits are the jobid of the procs involved in
* the collective. For collectives across multiple jobs
* (e.g., in a connect_accept), the daemon jobid will
* be used as the id will be issued by mpirun. This
* won't cause problems because daemons don't use the
* collective_id
*
* bottom 32-bits are a rolling counter that recycles
* when the max is hit. The daemon will cleanup each
* collective upon completion, so this means a job can
* never have more than 2**32 collectives going on at
* a time. If someone needs more than that - they've got
* a problem.
*
* Note that this means (for now) that RTE-level collectives
* cannot be done by individual threads - they must be
* done at the overall process level. This is required as
* there is no guaranteed ordering for the collective id's,
* and all the participants must agree on the id of the
* collective they are executing. So if thread A on one
* process asks for a collective id before thread B does,
* but B asks before A on another process, the collectives will
* be mixed and not result in the expected behavior. We may
* find a way to relax this requirement in the future by
* adding a thread context id to the jobid field (maybe taking the
* lower 16-bits of that field).
This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives.
This commit was SVN r32203.
mca_btl_base_segment_t and replace them with des_local and des_remote
This change also updates the BTL version to 3.0.0. This commit does
not represent the final version of BTL 3.0.0. More changes are coming.
In making this change I updated all of the BTLs as well as BTL user's
to use the new structure members. Please evaluate your component to
ensure the changes are correct.
RFC text:
This is the first of several BTL interface changes I am proposing for
the 1.9/2.0 release series.
What: Change naming of btl descriptor members. I propose we change
des_src and des_dst (and their associated counts) to be des_local and
des_remote. For receive callbacks the des_local member will be used to
communicate the segment information to the callback. The proposed change
will include updating all of the doxygen in btl.h as well as updating
all BTLs and BTL users to use the new naming scheme.
Why: My btl usage makes use of both put and get operations on the same
descriptor. With the current naming scheme I need to ensure that there
is consistency beteen the segments described in des_src and des_dst
depending on whether a put or get operation is executed. Additionally,
the current naming prevents BTLs that do not require prepare/RMA matched
operations (do not set MCA_BTL_FLAGS_RDMA_MATCHED) from executing
multiple simultaneous put AND get operations. At the moment the
descriptor can only be used with one or the other. The naming change
makes it easier for BTL users to setup/modify descriptors for RMA
operations as the local segment and remote segment are always in the
same member field. The only issue I forsee with this change is that it
will require a little more work to move BTL fixes to the 1.8 release
series.
This commit was SVN r32196.