* General TCP cleanup for OPAL / ORTE
* Simplifying the OOB by moving much of the logic into the RML
* Allowing the OOB RML component to do routing of messages
* Adding a component framework for handling routing tables
* Moving the xcast functionality from the OOB base to its own framework
Includes merge from tmp/bwb-oob-rml-merge revisions:
r15506, r15507, r15508, r15510, r15511, r15512, r15513
This commit was SVN r15528.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15506
r15507
r15508
r15510
r15511
r15512
r15513
Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point.
Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings.
This commit was SVN r15517.
It will prevent the error failure in openib finalize
but it doesn't resolve the actual issue. I guess that
oneside tests some how allocates memory (mpool?) and doesn't
release it. Need to check it.
This commit was SVN r15488.
* bml.h had a change that introduced a variable named "_order" to
avoid a conflict with a local variable. The namespace starting
with _ belongs to the os/compiler/kernel/not us. So we can't start
symbols with _. So I replaced it with arg_order, and also updated
the threaded equivalent of the macro that was modified.
* in btl_openib_proc.c, one opal_output accidentally had its string
reverted from "ompi_modex_recv..." to
"mca_pml_base_modex_recv....". This was fixed.
* The change to ompi/runtime/ompi_preconnect.c was entirely
reverted; it was an artifact of debugging.
This commit was SVN r15475.
The following SVN revision numbers were found above:
r15474 --> open-mpi/ompi@8ace07efed
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asychronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artifical changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
threads to create new communicators. There is nothing in the standard about threading
and communicaotr functions, but as they include collective communications I expect the
same rules have to be applied. As such, on an incorrect MPI program we deadlock (!).
This commit was SVN r15456.
switching:
0 0
/ \ \ / \ \
1 \ \ --> 4 \ \
/ \ \ / \ \
3 2 \ 3 2 \
4 1
(duh). The first form is the bmtree suitable for bcast, but the latter is better for reduce.
Updating default decision function accordingly.
This commit was SVN r15422.
It solve the problem with the MPI_Aint alignment that showed up on Solaris
Sparc and on heterogeneous environments when dealing with the data-type description.
The solution is to move the displacement array from the packed array if we
detect that the local architecture required MPI_Aint to be aligned to an
MPI_Aint boundary (which is not the case for x86 architectures if MPI_Aint
is a 64 bits type).
This commit was SVN r15395.
instead of just the procs for MCW (in MCW order). Should make resolving
ptl_process_id_t structures for arbitrary communicators easier for
applications that need it.
This commit was SVN r15393.
Short description: major changes include -
1. singletons now fork/exec a local daemon to manage their operations.
2. the orte daemon code now resides in libopen-rte
3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided.
I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine.
This commit was SVN r15390.
an override here in ompi_info to force the loading of all components.
This is ok because we *only* call opal_init_util() (not orte_init() or
ompi_mpi_init()).
This commit was SVN r15386.
that exactly describes the buffer to be used as the target of the
operation
* Use the above flag to disable components setting the flag from being
used for real RDMA operations for the one-sided component (the
BTLs will still be used for RDMA transfers for the PML and for
send/receive communication for the OSC component)
This commit was SVN r15375.
have to construct/destruct only once. Therefore, the construction will
happens before digging for a PML, while the destruction just before
finalizing the component.
Add some OPAL_LIKELY/OPAL_UNLIKELY.
This commit was SVN r15347.
receive queues are shared among all PMLs, they are declared in the base PML,
and the selected PML is in charge of initializing and releasing them.
The CM PML is slightly different compared with OB1 or DR. Internally it use
2 different types of requests: light and heavy. However, now with this patch
both types of requests are stored in the same queue, and cast appropriately
on the allocation macro. This means we might use less memory than we allocate,
but in exchange we got full support for most of the parallel debuggers.
Another thing with this patch, is that now for all PML (CM included) the basic
PML requests start with the same fields, and they are declared in the same order
in the request structure. Moreover, the fields have been moved in such a way
that only one volatile/atomic will exist per line of cache (hopefully).
This commit was SVN r15346.
VxWorks. Still some issues remaining, I'm sure.
Refs trac:1010
This commit was SVN r15320.
The following Trac tickets were found above:
Ticket 1010 --> https://svn.open-mpi.org/trac/ompi/ticket/1010
than just the PML/BTLs these days. Also clean up the code so that it
handles the situation where not all nodes register information for a given
node (rather than just spinning until that node sends information, like
we do today).
Includes r15234 and r15265 from the /tmp/bwb-modex branch.
This commit was SVN r15310.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15234
r15265
The problem is that in the case of threaded builds for every fifo
a head and tail lock will be allocated inside the shared memory
segment and the ptr is stored inside the fifo. In the case that the sm backend
file will be mapped in all processes at the same address (mostly the
case for non-thread builds) this is fine, but in the cases when the
processes map the file at different addresses this addresses cause big
trouble in other processes than the one that allocted the locks.
Therefore the send lock addresses have to be recalculated to match
the local mapping of the processes that use them.
This commit was SVN r15291.
* Make orted.1 man page be non-descriptive because it's really an
internal command.
* Re-work the opal_wrapper man page logic a bit so that we can have a
real opal_wrapper.1 installed that says "don't look here -- look at
mpicc (etc.)"
This commit was SVN r15264.
Eddelbuettel, one of the Debian/GNU Linux maintainers of the Open MPI
package.
* Updated the contributed man page with some examples and
updated option descriptions, and other small things.
* Made the --hostname option work again.
* Made the --version option work like it's supposed to.
* Updated help strings that are displayed via --help to be a bit more
descriptive in the parameters that various options accept.
This commit was SVN r15260.
The default odls has been updated and works fine. The process odls has been updated, but I could not verify its operation. The bproc ODLS has not been updated yet. Ralph will look at it soon.
This commit was SVN r15257.
relative bandwidths of each BTL. Precalculate what part of a message should
be send via each BTL in advance instead of doing it during scheduling.
This commit was SVN r15248.
each BTL. Precalculate what part of a message should be send via each BTL in
advance instead of doing it during scheduling.
This commit was SVN r15247.
argument to the query for the line speed. This function is still not
documented, and it really look strange that we have to respecify the
nic_id (it's already attached to the endpoint).
This commit was SVN r15241.
* Fix potential race condition with starting a new lock epoch if we
were releasing a lock
* Increment the shared counter if we start a shared lock session during
the unlock code
This commit was SVN r15186.
a WIN_FREE
* Fix race condition in threaded builds with pending unlocks and
finishing an epoch
* Fix memory leak due to use of OBJ_DESTRUCT instead of OBJ_RELEASE
* Fix race condition between releasing multiple shared locks and
starting a new lock
* Need to incremement the shared count if starting a new shared
lock once an exclusive lock finishes
This commit was SVN r15185.
OBJ_NEW
* Need to single when the passive unlock has left an expose epoch for
the win_free case
* Clean up some debugging output
* fix missing variable initialization
This commit was SVN r15167.
flex (which, incidentally, emit ''more'' warnings than earlier
versions). Grumble.
This commit was SVN r15166.
The following SVN revision numbers were found above:
r15158 --> open-mpi/ompi@57d09c10f7
- adding linear algorithm with synchronization for gather.
This algorithm prevents congestion at root process, but introduces
synchronization (serializes non-root processes, but allows messages
to arrive from two processes at the same time).
It performed better than binomial and linear algorithms for large message,
and intermediate and large communicator sizes.
- Updating MPI_Gather decision function to reflect performance results
from MX. I will perform more measurements though - so this one can
change.
This commit was SVN r15165.
of Fortran datatypes (mpif-common.h) and the list of registered
datatypes: MOOG(REAL2).
Configure and Compilation with ia32/gcc just finished, naturally
without real2.
This commit was SVN r15137.
to make checks for MPI-implementations fail in the right way ,-]
- check in configure.ac
- BINARY INCOMPATIBLE change to mpif-common.h
(if implemented the *right* way)
Actually OMPI_F90_CHECK takes two arguments, not three.
- Only have corresponding C-Type, if the opt. Fortran
type is really supported,
Otherwise pass ompi_mpi_unavailable to DECLARE_MPI_SYNONYM_DDT;
- Reviewed by George and Jeff
This commit was SVN r15133.
interface:
- Fix the handling of MPI_BOTTOM in various places
Update of r15030
- While being at it, handle MPI_IN_PLACE in the same way.
Convert OMPI_ADDR -> OMPI_F2C_BOTTOM
Convert OMPI_IN_PLACE -> OMPI_F2C_IN_PLACE
and have them converted before the actual call.
- Approved by George and tested with icc and simple f77 mpi-program and
with program by Daniel.
This commit was SVN r15129.
The following SVN revision numbers were found above:
r15030 --> open-mpi/ompi@15f9e58c68
reason it's that we don't have the nice configure stuff, so detecting
when to enable the CR PML it's kind of hard. Keep it defined and at
least it compile smoothly.
This commit was SVN r15116.
branch:
* Support btl_openib_if_include and btl_openib_if_exclude MCA
parameters, similar to those supported by other BTLs. Each take a
comma-delimited lists of identifiers. Identifiers can be HCA
interface names (e.g., ipath0, mthca1, etc.) or an HCA interface
name and port numbers (e.g., ipath0:1, mthca1:2, etc.). It is an
error to specify both _include and _exclude. If you specify a
non-existant (or non-ACTIVE) HCA and/or port, you'll get a warning
unless you disable the warning by setting the MCA parameter
btl_openib_warn_nonexistent_if to 0.
* Start updating to use BEGIN_C_DECLS and END_C_DECLS
* A few other minor fixes that were picked up along the way.
This commit was SVN r15063.
Set bandwidth for all ports of mthca0:
--mca btl_openib_bandwidth_mthca0 1000
Set bandwidth for port 1 of mthca1:
--mca btl_openib_bandwidth_mthca1:1 1000
Set latency for port 2 lid 123 on mthca0:
--mca btl_openib_latency_mthca0:2:123 20
This commit was SVN r15041.
even look at the status code and basically guarantee that the aio
function was never called, so there's really no point in AC_TRY_RUN
over AC_COMPILE_IFELSE...
This commit was SVN r15033.
single threaded builds. In its default configuration, all this does
is ensure that there's at least a good chance of threads building
based on non-threaded development (since the variable names will be
checked). There is also code to make sure that a "mutex" is never
"double locked" when using the conditional macro mutex operations.
This is off by default because there are a number of places in both
ORTE and OMPI where this alarm spews mega bytes of errors on a
simple test. So we have some work to do on our path towards
thread support.
Also removed the macro versions of the non-conditional thread locks,
as the only places they were used, the author of the code intended
to use the conditional thread locks. So now you have upper-case
macros for conditional thread locks and lowercase functions for
non-conditional locks. Simple, right? :).
This commit was SVN r15011.
structures in the system. Instead of using memcmp, use the ns function.
This won't cause a problem as long as all three elements of the name are
ints, but if they have different sizes, alignment and padding rules
can cause memcmp() to compare padding space, which rarely holds a sane
value.
This commit was SVN r14998.
id based on the last half of the mapper MAC. This allow us to figure out how
to connect peers. This allow the MX BTL to be used in a cluster of cluster
configuration where each cluster have MX internally as well as on a multi
rail MX system.
This commit was SVN r14932.
Changes paffinity interface to use a cpu mask for available/preferred cpus
rather than the current coarse grained paffinity that lets the OS choose
which processor.
Macros for setting and clearing masks are provided.
Solaris and windows changes have not been made. Solaris subdirectory has some
suggested changes - however the relevant man pages for the Solaris 10 APIs
have some ambiguity regarding order in which one create and sets a processor
set. As we did not have access to a solaris 10 machine we could not test to
see the correct way to do the work under solaris.
This commit was SVN r14887.
symbols in them and environ is defined only in the final application
(probably in crt1.o). Apple provides a function for getting at the
environment, so use that instead if it's available.
This commit was SVN r14857.
have the SRQ interface.
* Instead of setting AC_DEFINEs per MCA component, set per test. THe
answers can never be difference, and this will speed sed just a teeny
bit
This commit was SVN r14856.
-I for ${includedir}/openmpi. Solves many problems, and with just a tad
bit of hackery. Don't know why I didn't just do this earlier.
Refs trac:542
This commit was SVN r14853.
The following Trac tickets were found above:
Ticket 542 --> https://svn.open-mpi.org/trac/ompi/ticket/542
that allows to send any range of a request by send/recv instaed of RDMA
and use it to send data from the end of a request in pipeline protocol.
This commit was SVN r14841.
Do not initialize them, if not.
If initializing them, check for the correct C-equivalent type
to copy from...
Issue a warning, when a type (e.g. REAL*16) is not available to
build the type (here COMPLEX*32).
This fixes issues with ompi and pacx.
Works with intel-compiler and FCFLAGS="-i8 -r8" on ia32.
This commit was SVN r14818.
The following SVN revision numbers were found above:
r8 --> open-mpi/ompi@e952ab1f88
compile failed because of the wrong variable name.
This commit was SVN r14807.
The following SVN revision numbers were found above:
r14806 --> open-mpi/ompi@7e57bbb0ef
This commit moves the initalization/finalization of opal_event and opal_progress
to opal_init/finalize. These were previously init/final in ORTE which is an
abstraction violation. After talking about it we concluded that there are no
ordering issues that require these to be init/final in ORTE instead of OPAL.
I ran the IBM test suite against this commit and it didn't turn up any new
failures so I think it is good to go.
Let us know if this causes problems.
This commit was SVN r14773.
environment variables corresopnding to framework MCA parameters so
that the opal MCA base loads '''all''' components (not just the ones
specified in the environment variables). This has the side-effect of
not showing the user's value when displaying the framework MCA
parameters via --param output. For example:
{{{
shell% setenv OMPI_MCA_btl foo
shell% ompi_info --param btl base
}}}
The above sequence would show a "<none>" value for the "btl" parameter
instead of "foo".
This commit restores the environment after we munge it to make the
loader load all components. Hence, the above command sequence will
show "foo" for the "btl" parameter value, not "<none>".
This commit was SVN r14771.
This is required to tighten up the BTL semantics. Ordering is not guaranteed,
but, if the BTL returns a order tag in a descriptor (other than
MCA_BTL_NO_ORDER) then we may request another descriptor that will obey
ordering w.r.t. to the other descriptor.
This will allow sane behavior for RDMA networks, where local completion of an
RDMA operation on the active side does not imply remote completion on the
passive side. If we send a FIN message after local completion and the FIN is
not ordered w.r.t. the RDMA operation then badness may occur as the passive
side may now try to deregister the memory and the RDMA operation may still be
pending on the passive side.
Note that this has no impact on networks that don't suffer from this
limitation as the ORDER tag can simply always be specified as
MCA_BTL_NO_ORDER.
This commit was SVN r14768.