compile failed because of the wrong variable name.
This commit was SVN r14807.
The following SVN revision numbers were found above:
r14806 --> open-mpi/ompi@7e57bbb0ef
This is required to tighten up the BTL semantics. Ordering is not guaranteed,
but, if the BTL returns an order tag in a descriptor (other than
MCA_BTL_NO_ORDER) then we may request another descriptor that will obey
ordering w.r.t. the other descriptor.
This will allow sane behavior for RDMA networks, where local completion of an
RDMA operation on the active side does not imply remote completion on the
passive side. If we send a FIN message after local completion and the FIN is
not ordered w.r.t. the RDMA operation then badness may occur as the passive
side may now try to deregister the memory and the RDMA operation may still be
pending on the passive side.
Note that this has no impact on networks that don't suffer from this
limitation as the ORDER tag can simply always be specified as
MCA_BTL_NO_ORDER.
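A minimal sketch of the ordering rule described above. All names here
(sketch_descriptor_t, sketch_alloc, SKETCH_NO_ORDER and its value) are
hypothetical stand-ins, not the real BTL definitions. The FIN descriptor is
requested on the same order tag that the BTL attached to the RDMA descriptor,
so the FIN cannot overtake the RDMA operation. If the BTL never assigns a
tag, the no-order value is simply passed through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "no ordering requested" value, standing in for
 * MCA_BTL_NO_ORDER; the number itself is an assumption. */
#define SKETCH_NO_ORDER 255

typedef struct {
    uint8_t order;   /* order tag assigned by the BTL, or SKETCH_NO_ORDER */
} sketch_descriptor_t;

/* Hypothetical allocator: returns a descriptor bound to the requested
 * order tag. A real BTL would pick a channel/queue here. */
static sketch_descriptor_t sketch_alloc(uint8_t requested_order)
{
    sketch_descriptor_t des = { requested_order };
    return des;
}

/* Allocate the FIN descriptor so that it is ordered with respect to the
 * RDMA descriptor that precedes it. */
static sketch_descriptor_t sketch_alloc_fin(const sketch_descriptor_t *rdma_des)
{
    assert(rdma_des != NULL);
    return sketch_alloc(rdma_des->order);
}
```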
This commit was SVN r14768.
fix (r14749) and then backed it out (r14753).
As we are unable to send more than a 32-bit length over TCP in one go, there
is no reason to have a uint64 length in the header. This reduces the size
of the TCP header.
This commit was SVN r14755.
The following SVN revision numbers were found above:
r14749 --> open-mpi/ompi@48c026ce6b
r14753 --> open-mpi/ompi@28ed850b4c
r14703 for the point-to-point component.
* Associate the list of long message requests to poll with the
component, not the individual modules
* add progress thread that sits on the OMPI request structure
and wakes up at the appropriate time to poll the message
list to move long messages asynchronously.
* Instead of calling opal_progress() all over the place, move
to using the condition variables like the rest of the project.
Has the advantage of moving it slightly further along in the
becoming thread safe thing.
* Fix a problem with the passive side of unlock where it could
go recursive and cause all kinds of problems, especially
when progress threads are used. Instead, have two parts of
passive unlock -- one to start the unlock, and another to
complete the lock and send the ack back. The data moving
code trips the second at the right time.
This commit was SVN r14751.
The following SVN revision numbers were found above:
r14703 --> open-mpi/ompi@2b4b754925
mca_btl_tcp_hdr_t struct and remove the need for the heterogeneous
padding by changing the type of the "size" member to be uint32_t
(vs. uint64_t). The value would never be greater than 32 bits anyway,
so having the type be uint64_t was wasteful.
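For illustration, a sketch of why narrowing the "size" member shrinks the
header. The struct layouts below are hypothetical (the real
mca_btl_tcp_hdr_t has other members); only the uint64_t to uint32_t change
of "size" comes from the message above. The 64-bit member typically forces
extra padding in front of it, which the 32-bit member avoids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header with a 64-bit size member. */
typedef struct {
    uint8_t  type;
    uint16_t count;
    uint64_t size;
} sketch_hdr_wide_t;

/* Same hypothetical header with the size member narrowed to 32 bits. */
typedef struct {
    uint8_t  type;
    uint16_t count;
    uint32_t size;
} sketch_hdr_narrow_t;

/* The narrow header can never be larger than the wide one. */
static_assert(sizeof(sketch_hdr_narrow_t) <= sizeof(sketch_hdr_wide_t),
              "narrowing a member must not grow the struct");

/* Bytes saved per header on this platform (compiler/ABI dependent). */
static size_t sketch_hdr_bytes_saved(void)
{
    return sizeof(sketch_hdr_wide_t) - sizeof(sketch_hdr_narrow_t);
}
```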
This commit was SVN r14749.
* Remove unused declaration
* remove unused variable warning when not using progress threads
* If we're using progress threads, we want to lock, not trylock
when in progress, since it was called from the wakeup thread
and not the progress function
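A rough sketch of the lock vs. trylock distinction from the last item, using
plain pthreads. The names sketch_lock and sketch_poll are made up for this
example and the real code uses the project's own mutex wrappers. The wakeup
thread blocks until it gets the lock, while the progress-function path gives
up immediately if the lock is busy.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;
static int sketch_polls = 0;   /* number of times the poll body ran */

/* Returns true if the poll body ran.
 * from_wakeup_thread: block on the lock, the work must not be skipped.
 * otherwise:          try the lock and skip the poll on contention. */
static bool sketch_poll(bool from_wakeup_thread)
{
    if (from_wakeup_thread) {
        int rc = pthread_mutex_lock(&sketch_lock);
        assert(0 == rc);
        (void) rc;
    } else if (0 != pthread_mutex_trylock(&sketch_lock)) {
        return false;
    }
    ++sketch_polls;   /* stand-in for polling the request list */
    pthread_mutex_unlock(&sketch_lock);
    return true;
}
```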
This commit was SVN r14739.
to do while(...) { } then we can't change the variables in the ...
atomically, but should do it while holding the module lock.
* Fix dumb communicator creation error when we don't create the progress
stuff (because a window already exists), where we would accidentally
jump to the error case.
This commit was SVN r14715.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
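As a sketch of the new send behavior: the lookup happens first, and an
unknown recipient produces an error instead of a wait. The table layout, the
function names and the numeric error value are assumptions made for this
example; only the ADDRESSEE_UNKNOWN idea comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; the real constants live in ORTE. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_ADDRESSEE_UNKNOWN = -1 };

/* Hypothetical contact-info entry (the real OOB uses hash tables). */
typedef struct {
    int         peer_id;
    const char *uri;
} sketch_contact_t;

/* Linear lookup standing in for the hash-table lookup. */
static const char *sketch_lookup(const sketch_contact_t *table, size_t n,
                                 int peer_id)
{
    for (size_t i = 0; i < n; ++i) {
        if (table[i].peer_id == peer_id) {
            return table[i].uri;
        }
    }
    return NULL;
}

/* Fail fast when the recipient's contact info is not known yet. */
static int sketch_send(const sketch_contact_t *table, size_t n, int peer_id)
{
    assert(table != NULL || 0 == n);
    if (NULL == sketch_lookup(table, n, peer_id)) {
        return SKETCH_ERR_ADDRESSEE_UNKNOWN;
    }
    /* a real implementation would transmit the message here */
    return SKETCH_SUCCESS;
}
```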
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
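A minimal sketch of what an atomic "arith" operation amounts to, assuming a
single integer value guarded by a mutex. The types and function names are
invented for this example and are not the GPR API. The get, modify and put
steps all happen while the lock is held, so no other update can slip in
between them.

```c
#include <assert.h>
#include <pthread.h>

typedef enum { SKETCH_ADD, SKETCH_SUB, SKETCH_MUL, SKETCH_DIV } sketch_op_t;

/* Hypothetical registry location holding one integer value. */
typedef struct {
    pthread_mutex_t lock;
    long            value;
} sketch_entry_t;

static void sketch_entry_init(sketch_entry_t *e, long initial)
{
    assert(e != NULL);
    pthread_mutex_init(&e->lock, NULL);
    e->value = initial;
}

/* Apply "value = value <op> operand" atomically.
 * Returns 0 on success, -1 on division by zero or an unknown op. */
static int sketch_arith(sketch_entry_t *e, sketch_op_t op, long operand)
{
    int rc = 0;
    pthread_mutex_lock(&e->lock);
    switch (op) {
    case SKETCH_ADD: e->value += operand; break;
    case SKETCH_SUB: e->value -= operand; break;
    case SKETCH_MUL: e->value *= operand; break;
    case SKETCH_DIV:
        if (0 == operand) { rc = -1; } else { e->value /= operand; }
        break;
    default: rc = -1; break;
    }
    pthread_mutex_unlock(&e->lock);
    return rc;
}
```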
This commit was SVN r14711.
* Combine polling of the long requests and buffer requests into
one type, and in one place
* Associate the list of requests to poll with the component, not
the individual modules
* add progress thread that sits on the OMPI request structure
and wakes up at the appropriate time to poll the message
list. Not the best, but without some asynch notification
from the PML that a given set of requests has completed, there
isn't much better
* Instead of calling opal_progress() all over the place, move
to using the condition variables like the rest of the project.
Has the advantage of moving it slightly further along in the
becoming thread safe thing
* Fix a problem with the passive side of unlock where it could
go recursive and cause all kinds of problems, especially
when progress threads are used. Instead, have two parts of
passive unlock -- one to start the unlock, and another to
complete the lock and send the ack back. The data moving
code trips the second at the right time.
This commit was SVN r14703.
We eagerly send data up to btl_*_eager_limit with the match
Upon ACK of the MATCH we start using send/receives of size
btl_*_max_send_size up to the btl_*_rdma_pipeline_offset
After the btl_*_rdma_pipeline_offset we begin using RDMA writes of
size btl_*_rdma_pipeline_frag_size.
Now, on a per message basis we only use the above protocol if the
message is larger than btl_*_min_rdma_pipeline_size
btl_*_eager_limit -> same
btl_*_max_send_size -> same
btl_*_rdma_pipeline_offset -> btl_*_min_rdma_size
btl_*_rdma_pipeline_frag_size -> btl_*_max_rdma_size
btl_*_min_rdma_pipeline_size is new.
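To make the protocol above concrete, here is a sketch that splits one
message into its eager, send/receive and RDMA portions. The struct and
function names are hypothetical. It assumes the pipeline offset is measured
from the start of the message, which the text does not state explicitly, and
it does not model what happens to messages at or below
btl_*_min_rdma_pipeline_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the btl_*_ parameters named above. */
typedef struct {
    size_t eager_limit;
    size_t max_send_size;
    size_t rdma_pipeline_offset;     /* assumed: offset from message start */
    size_t rdma_pipeline_frag_size;
    size_t min_rdma_pipeline_size;
} sketch_btl_params_t;

typedef struct {
    int    use_pipeline;   /* 0: message too small, other fields are 0 */
    size_t eager_bytes;    /* sent with the match */
    size_t send_bytes;     /* sent with send/receive after the ACK */
    size_t send_frags;
    size_t rdma_bytes;     /* written with RDMA */
    size_t rdma_frags;
} sketch_plan_t;

static size_t sketch_min(size_t a, size_t b) { return a < b ? a : b; }
static size_t sketch_max(size_t a, size_t b) { return a > b ? a : b; }
static size_t sketch_ceil_div(size_t a, size_t b) { return (a + b - 1) / b; }

static sketch_plan_t sketch_plan(const sketch_btl_params_t *p, size_t msg_len)
{
    sketch_plan_t plan = { 0, 0, 0, 0, 0, 0 };
    assert(p->max_send_size > 0 && p->rdma_pipeline_frag_size > 0);

    plan.use_pipeline = msg_len > p->min_rdma_pipeline_size;
    if (!plan.use_pipeline) {
        return plan;
    }
    plan.eager_bytes = sketch_min(msg_len, p->eager_limit);
    /* send/receive covers the range [eager_bytes, send_end) */
    size_t send_end = sketch_min(
        msg_len, sketch_max(p->rdma_pipeline_offset, plan.eager_bytes));
    plan.send_bytes = send_end - plan.eager_bytes;
    plan.send_frags = sketch_ceil_div(plan.send_bytes, p->max_send_size);
    plan.rdma_bytes = msg_len - send_end;
    plan.rdma_frags = sketch_ceil_div(plan.rdma_bytes,
                                      p->rdma_pipeline_frag_size);
    return plan;
}
```

With made-up values of a 12 KiB eager limit, 64 KiB sends, a 1 MiB offset and
1 MiB RDMA fragments, a 4 MiB message is split into 12 KiB eager, 16
send/receive fragments and 3 RDMA writes.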
This patch also moves all BTL common parameters initialisation into
btl_base_mca.c file.
This commit was SVN r14681.
* Move ipv6compat.h code into opal_config_bottom.h and change into some
more intelligent testing of structures
* Change opal's if interface to use sockaddr instead of sockaddr_storage,
as the RFCs suggest we do
* Move the networking code in opal that isn't directly related to if
detection into net.h
* Add quicky function to get the port out of either a sockaddr_in
or sockaddr_in6, saving a bunch of code in the oob.
* Update TCP oob and btl with new interface
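The port helper mentioned above could look roughly like the sketch below.
The function name is hypothetical (not the one added to opal) and the -1
return for other address families is an assumption. It dispatches on
sa_family so that callers do not need separate IPv4 and IPv6 code paths.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Return the port in host byte order for an IPv4 or IPv6 socket address,
 * or -1 if the address family is neither. */
static int sketch_get_port(const struct sockaddr *addr)
{
    assert(addr != NULL);
    switch (addr->sa_family) {
    case AF_INET:
        return ntohs(((const struct sockaddr_in *) addr)->sin_port);
    case AF_INET6:
        return ntohs(((const struct sockaddr_in6 *) addr)->sin6_port);
    default:
        return -1;
    }
}
```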
This commit was SVN r14679.
without any context so we need to save a context somewhere so it can be
retrieved given only buffer pointer. This patch saves context (pointer to
frag) just before the start of the buffer so it can be easily retrieved.
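A small sketch of the technique: over-allocate by one pointer, store the
context in front of the payload, and recover it later from the payload
pointer alone. The helper names are made up for this example. Note that the
payload here is only pointer-aligned, so real code may need more care with
alignment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate payload_len bytes and stash `context` right before them.
 * Returns the payload pointer, or NULL on allocation failure. */
static void *sketch_buf_alloc(size_t payload_len, void *context)
{
    unsigned char *raw = malloc(sizeof(void *) + payload_len);
    if (NULL == raw) {
        return NULL;
    }
    memcpy(raw, &context, sizeof(void *));
    return raw + sizeof(void *);
}

/* Recover the context given only the payload pointer. */
static void *sketch_buf_context(const void *payload)
{
    void *context;
    assert(payload != NULL);
    memcpy(&context, (const unsigned char *) payload - sizeof(void *),
           sizeof(void *));
    return context;
}

/* Release a buffer obtained from sketch_buf_alloc(). */
static void sketch_buf_free(void *payload)
{
    if (NULL != payload) {
        free((unsigned char *) payload - sizeof(void *));
    }
}
```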
This commit was SVN r14664.
The following SVN revision numbers were found above:
r13921 --> open-mpi/ompi@90fb58de4f
* Require Autoconf 2.60 or higher and remove some cruft
required for AC 2.59 or the AC 2.59 / AC 2.60 mix
* Remove a bunch of now unnecessary AC_SUBST calls
* Use the libtool-provided variables for the -I and
library to use when compiling against ltdl
Fixes trac:1000
This commit was SVN r14652.
The following Trac tickets were found above:
Ticket 1000 --> https://svn.open-mpi.org/trac/ompi/ticket/1000
avoid alignment issues. This commit fixes trac:1009.
This commit was SVN r14608.
The following Trac tickets were found above:
Ticket 1009 --> https://svn.open-mpi.org/trac/ompi/ticket/1009
via the visibility feature that is provided by some compilers.
This feature is disabled by default; to enable it you need to
configure with --enable-visibility, and obviously you need a compiler
with visibility support. Please refer to the wiki for more information.
https://svn.open-mpi.org/trac/ompi/wiki/Visibility
This commit was SVN r14582.
with RDMAing the rest of it. Also, more than one RDMA write can be performed
simultaneously by different threads. To make this code thread safe, this patch
clones the original request convertor for each RDMA fragment.
This commit was SVN r14574.
- Removing "small" message size limit because it really does not relate to the eager size
across the board.
Now, the leaf nodes in generalized reduce will use blocking send (DEFAULT/ORIGINAL BEHAVIOR)
either when the maximum number of outstanding requests is 0 or
when the total number of segments is less than the maximum number of outstanding requests.
Otherwise, it will send messages using non-blocking synchronized send operation.
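The selection rule above, written out as a sketch. The enum and function
names are hypothetical; only the two conditions come from the text.

```c
#include <assert.h>

typedef enum {
    SKETCH_BLOCKING_SEND,          /* default / original behavior */
    SKETCH_NONBLOCKING_SYNC_SEND   /* non-blocking synchronized send */
} sketch_send_mode_t;

/* Decide how a leaf node in generalized reduce sends its segments. */
static sketch_send_mode_t sketch_leaf_send_mode(int max_outstanding_reqs,
                                                int num_segments)
{
    assert(max_outstanding_reqs >= 0 && num_segments >= 0);
    if (0 == max_outstanding_reqs || num_segments < max_outstanding_reqs) {
        return SKETCH_BLOCKING_SEND;
    }
    return SKETCH_NONBLOCKING_SYNC_SEND;
}
```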
This commit was SVN r14572.
during the IPv6 patch. The most important is the multi BTL support. There
was a quite interesting bug. Instead of setting up the multiple connections
over different physical devices, based on the time when these connections
were created most of the time they were all using the same physical network.
Which, of course, was not the intended goal, as we top out at the maximum
bandwidth available over one device instead of gathering all available
bandwidth from all devices.
Second, the IPv6 RFC suggests using sockaddr_storage as a holder for the
IP information, but use a sockaddr* when we pass it to functions. This is
only partially corrected by this patch.
Some other minor cleanups.
This commit was SVN r14544.
This "feature" is disabled by default and it should not affect the current performance.
When the message size is large and the segment size is smaller than the eager size for a particular interface,
the leaf nodes in the generalized reduce function can flood parent nodes by sending all segments without
any synchronization. This can cause the parent to have a HIGH number of unexpected messages (think 16MB
message with 1KB segments for example). In case of binomial algorithm root node always has at least one
child which is leaf, so this can potentially affect the root's performance significantly [Especially in
large communicators where root may have quite a few children (binomial tree for example)].
When the segment size is bigger than the eager size, rendezvous protocol ensures that this does
not happen so it is not necessary.
Originally, the problem was exposed in "infinite" bucket allocator clean up time for "small" segment sizes
(which may explain some "deadlocks" on Thunderbird tests).
To prevent this, we allow the user to specify the mca parameter "--mca coll_tuned_reduce_algorithm_max_requests NUM";
this limits the number of outstanding messages from a leaf node in generalized reduce to the parent to NUM.
Messages are sent as non-blocking synchronous messages, so synchronization happens at "wait" time.
The synchronization actually improved performance of pipeline and binomial algorithm for large message sizes
with 1KB segments over MX, but I need to test it some more to make sure it is consistent.
Since there is no easy way to find out what is "the eager" size for particular btl, I set the limit to 4000B.
If message/individual segment size is greater than 4000B - we will not use this feature. This variable may
or may not be exposed as mca parameter later...
I did not have any problems running it and both "default" and "synchronous" tests passed Intel Reduce* tests
up to 80 processes (over MX).
This commit was SVN r14518.
- make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6
so that it works for IPv4 and IPv6 addresses, and remove a whole bunch
of #ifs in the OOB code.
- Fix a compiler warning in the TCP BTL due to run-time determined
array size by making it a dynamically allocated array.
- Fix the unpacking code of IPv4 addresses when using IPv6 support, so
that the address is in the correct location (instead of in an IPv6
structure, use an IPv4 structure). Refs trac:1005.
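For the first item, a sketch of how a single function can print either
address family when it is handed a sockaddr_storage. This is not the actual
opal_sockaddr2str(); the name, signature and caller-provided buffer are
assumptions made for the example. It relies on the standard inet_ntop().

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Write the textual form of an IPv4 or IPv6 address into buf.
 * Returns buf on success, NULL for other families or on failure. */
static const char *sketch_sockaddr2str(const struct sockaddr_storage *addr,
                                       char *buf, socklen_t len)
{
    assert(addr != NULL && buf != NULL);
    switch (addr->ss_family) {
    case AF_INET:
        return inet_ntop(AF_INET,
                         &((const struct sockaddr_in *) addr)->sin_addr,
                         buf, len);
    case AF_INET6:
        return inet_ntop(AF_INET6,
                         &((const struct sockaddr_in6 *) addr)->sin6_addr,
                         buf, len);
    default:
        return NULL;
    }
}
```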
This commit was SVN r14514.
The following Trac tickets were found above:
Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005
Make sure that the wrapper selection is compiled out if not enabling FT. Before the
logic would skip over it since the conditional if statements would not be satisfied,
now there are no additional if statements when compiled out.
With this modification the selection logic looks nearly identical to pre-r14051
with the exception of the non-FT related improvements.
This commit was SVN r14491.
The following SVN revision numbers were found above:
r14051 --> open-mpi/ompi@dadca7da88
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
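For readers unfamiliar with the log-2 relay pattern, here is a generic
sketch of how a node in a binomial tree rooted at rank 0 finds the nodes it
relays to. This is the textbook scheme, not the (disabled) implementation in
the tree, and the function name is made up.

```c
#include <assert.h>

/* Fill `children` with the ranks that `rank` relays to in a binomial tree
 * of `nprocs` nodes rooted at rank 0. Returns the number of children
 * written (at most max_children). */
static int sketch_binomial_children(int rank, int nprocs,
                                    int *children, int max_children)
{
    int mask = 1;
    int count = 0;

    assert(rank >= 0 && rank < nprocs);
    /* Find the lowest set bit of rank (or pass nprocs for the root). */
    while (mask < nprocs && 0 == (rank & mask)) {
        mask <<= 1;
    }
    /* Every smaller power of two yields one potential child. */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (rank + mask < nprocs && count < max_children) {
            children[count++] = rank + mask;
        }
    }
    return count;
}
```

With 8 nodes, rank 0 relays to ranks 4, 2 and 1, rank 4 relays to 6 and 5,
and the message reaches everyone in three steps.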
This also involved a slight change to the oob.xcast API, so propagated that as required.
Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)
Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.
This commit was SVN r14475.
It's a bit longer, but much more clear in its implementation I believe.
Fundamentally it is the same, but is much more solid in the implementation.
I created quite a few directed tests that this version of the implementation
now passes.
This commit was SVN r14457.
finally brings in functionality that is already on the 1.2 branch, and
was developed and tested in the v1.2ofed branch (and other places).
Short version of new features:
* Support for ibv_fork_init()
* Automatically fill in the openib BTL bandwidth value by
querying the HCA port
* Installdirs functionality
* Fixes to always use -I in the Fortran wrapper compilers (#924)
* Gleb's mpool updates
* Remove some cruft in btl/openib/configure.m4, therefore
fixing the harmless warnings noted in #665
* Bunches of updates to the Linux RPM spec file
I.e., effectively the same thing that r14411 brought to the v1.2
branch.
Also effectively brought in r14432 and r14433 (some fixes on top of
the original r14411 commit to v1.2). Still need to bring in the moral
equivalent of r14445 after this commit (fixes to installdirs).
This commit was SVN r14449.
The following SVN revision numbers were found above:
r14411 --> open-mpi/ompi@83b31314ae
r14432 --> open-mpi/ompi@a48f160595
r14433 --> open-mpi/ompi@68f346d2bc
r14445 --> open-mpi/ompi@13d366b827
- Move the PML Modex stuff out of the BML -- Abstraction violation.
- Also fix the location of the add_procs with respect to the stage gates.
This commit was SVN r14422.
to move it into the unexpected queue. Instead pack the data in
only one buffer. Now the code looks more optimized and clear, but
I have a doubt about who's using this functionality. I think that
all BTLs always return only one memory segment attached to the
matching fragment (i.e. there is no unexpected iov type receive).
This commit was SVN r14416.
function to make it more scalable. The memory fragmentation is still high, but
at least in most of the cases (where all resources are correctly released
before the cleanup) the code is now highly efficient. Before, the code executed
in (N * (N-1))!, which takes a while when the number of allocated resources
increases (which is the case when a lot of unexpected messages are created).
The fix consists of checking if all items are freed and if it's the case
then do not recreate the free items list (as we know that everything will
be released). If this condition is not true, we fall back on the
original execution path (which is still sub-sub-sub ... optimal).
This commit was SVN r14406.
same "restrict" check as the top-level OMPI configure.ac script so
that it will guarantee to always get the same result. Therefore, the
#define for restrict will always have the same value in both
opal_config.h and romioconf.h, and we get 7 fewer warnings (6 in the IO
ROMIO component, 1 in ROMIO itself) when compiling with icc on Linux
(because PAC_C_RESTRICT and AC_C_RESTRICT would get different values
for the "restrict" #define in this case).
This commit was SVN r14387.
Fix for memory corruption in the restarted process stack. This stemmed from
the brute force method we were previously using. This commit fixes this by
using a lighter weight solution focused in the r2 BML instead of above the PML.
This is a more efficient and flexible solution, and it solves the original
problem.
In the process I pulled out the ft_event function in the tcp BTL and r2 BML
into a set of *_ft.[c|h] files just to keep any updates to these code paths
as isolated as possible to make merging easier on everyone.
This commit was SVN r14371.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
The following Trac tickets were found above:
Ticket 977 --> https://svn.open-mpi.org/trac/ompi/ticket/977
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search path is checking
$pkglibdir).
This commit was SVN r14345.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r14289
the real interest is for small to middle size unexpected messages. The unexpected messages are copied
by the PML in its own unexpected buffers. Therefore, there is no reason to make a first copy in the
TCP BTL. The BTL can hand its own buffer to the PML, and can be sure that once the callback
has completed it can reuse the buffer, no matter what happened with the fragment.
This commit was SVN r14320.