The smsg_mboxes free list was not getting destructed. The construct
has been moved to module initialization and a matching destruct is now
in the module destruct.
This commit was SVN r31746.
There is no reason to cancel the listening thread. It should die
automatically when the file descriptor is closed. It is sufficient
to just wait for the thread to exit with pthread_join().
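
A minimal sketch of the join-without-cancel idea, not the actual BTL code.
It assumes a hypothetical listener that reads from a pipe, where closing
the write end delivers EOF to the reader. The names sketch_listener_main
and sketch_stop_listener are made up for illustration.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical listener: loops in read() until the pipe reports EOF. */
static void *sketch_listener_main(void *arg)
{
    int fd = *(int *)arg;
    char buf[64];

    /* read() returns 0 once every write end of the pipe is closed. */
    while (read(fd, buf, sizeof(buf)) > 0) {
        /* a real listener would handle the incoming bytes here */
    }
    return NULL;
}

/* Start the listener, close the descriptor feeding it, then join.
 * Returns 0 on success, -1 on failure. */
static int sketch_stop_listener(void)
{
    int fds[2];
    int rc;
    pthread_t tid;

    if (pipe(fds) != 0) {
        return -1;
    }
    if (pthread_create(&tid, NULL, sketch_listener_main, &fds[0]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    close(fds[1]);                 /* no pthread_cancel(): EOF ends the loop */
    rc = pthread_join(tid, NULL);  /* just wait for the thread to exit */
    close(fds[0]);
    return rc == 0 ? 0 : -1;
}
```

The sketch only models the EOF case; how a thread blocked on a descriptor
reacts when that same descriptor is closed can differ between platforms.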
cmr=v1.8.2:ticket=trac:4616:reviewer=jsquyres
This commit was SVN r31738.
The following Trac tickets were found above:
Ticket 4616 --> https://svn.open-mpi.org/trac/ompi/ticket/4616
In preparation for moving the BTLs down to OPAL, discontinue the use
of the RML for connectivity client/agent communication. Instead, use
local unix domain sockets in the job session directory (all
communication is between processes on the same server, so unix domain
sockets are fine).
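
A rough sketch of agent/client communication over a local unix domain
socket. The function names and the socket path are hypothetical; the real
code places its socket in the job session directory and has its own
message format.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill a sockaddr_un for `path`; returns -1 if the path does not fit. */
static int sketch_make_addr(struct sockaddr_un *addr, const char *path)
{
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    if (strlen(path) >= sizeof(addr->sun_path)) {
        return -1;
    }
    strcpy(addr->sun_path, path);
    return 0;
}

/* Agent side: create a listening socket at `path`. Returns fd or -1. */
static int sketch_agent_listen(const char *path)
{
    struct sockaddr_un addr;
    int fd;

    if (sketch_make_addr(&addr, path) != 0) {
        return -1;
    }
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    unlink(path);  /* remove a stale socket file, if any */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, 8) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Client side: connect to the agent's socket. Returns fd or -1. */
static int sketch_client_connect(const char *path)
{
    struct sockaddr_un addr;
    int fd;

    if (sketch_make_addr(&addr, path) != 0) {
        return -1;
    }
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The agent would then accept() on the listening descriptor and exchange
requests with each client over the returned connection.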
This commit was SVN r31710.
Add the component use_udp value into the modex. If my component's
use_udp value doesn't agree with the use_udp value from a peer's modex
data, print a helpful message and disqualify the usnic BTL (the usnic
BTL will not be used). This prevents accidental customer
misconfigurations.
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31689.
Trivial struct re-ordering to eliminate holes in the middle of the
struct (although there's still a hole at the end) and reduce the
overall size of the struct from 64 to 56 bytes. Also change mtu from
int to uint16_t; there was no need for it to be that large.
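
To illustrate the kind of re-ordering described here, a small sketch with
made-up structs (these are not the usnic structs, and the sizes differ
from the 64 and 56 bytes above). Sorting members from largest to smallest
alignment removes the interior holes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with holes: each small member is followed by
 * padding so that the next 8-byte member is aligned. */
struct sketch_port_padded {
    uint8_t  link_up;
    uint64_t guid;
    int      mtu;
    uint64_t bandwidth;
    uint8_t  port_num;
};

/* Same information, largest members first, and mtu narrowed to
 * uint16_t. Only trailing padding remains. */
struct sketch_port_reordered {
    uint64_t guid;
    uint64_t bandwidth;
    uint16_t mtu;
    uint8_t  link_up;
    uint8_t  port_num;
};

/* Bytes saved by the re-ordering on the current ABI. */
static size_t sketch_bytes_saved(void)
{
    return sizeof(struct sketch_port_padded) -
           sizeof(struct sketch_port_reordered);
}
```

On a typical x86-64 ABI the first struct should occupy 40 bytes and the
second 24, but the exact numbers depend on the compiler and platform.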
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31688.
Fix mismatch between the MCA param (which expresses the timeout in
*milli*seconds) and the struct timeval timeout (which expresses the
timeout in *micro*seconds).
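
A sketch of the unit conversion involved, assuming a hypothetical helper
name (sketch_ms_to_timeval); the actual fix may be written differently.

```c
#include <sys/time.h>

/* Convert a timeout given in milliseconds into a struct timeval,
 * whose sub-second field (tv_usec) counts microseconds. */
static struct timeval sketch_ms_to_timeval(long timeout_ms)
{
    struct timeval tv;

    tv.tv_sec  = timeout_ms / 1000;           /* whole seconds */
    tv.tv_usec = (timeout_ms % 1000) * 1000;  /* remaining ms -> us */
    return tv;
}
```

For example, a 2500 ms timeout becomes tv_sec = 2 and tv_usec = 500000.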
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31687.
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
The only thing left is for ompilibdir to be treated similarly to what we did for srcdir/builddir. Coming soon.
This commit was SVN r31678.
In abusive MPI communication patterns, sending a UDP ping only once a
second may not be sufficient -- all the UDP pings may be dropped. So
increase the frequency of the pings to every quarter second, and allow
more total pings to be sent.
Total timeout time is still the same (10 seconds) -- we'll just now
try 40 times (i.e., once every quarter second) as opposed to 10 times
(i.e., once a second). Testing has shown that this frequency allows
the connectivity checker to always succeed even in the many-to-one
abusive communication patterns.
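
The arithmetic above, as a tiny sketch (the function name is hypothetical
and not part of the connectivity checker):

```c
/* Number of pings that fit in the total timeout at a given interval.
 * Returns 0 for a non-positive interval. */
static int sketch_ping_attempts(int total_timeout_ms, int interval_ms)
{
    if (interval_ms <= 0) {
        return 0;
    }
    return total_timeout_ms / interval_ms;
}
```

With a 10000 ms total timeout, a 1000 ms interval gives 10 attempts and a
250 ms interval gives 40.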
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r31602.
We're passing a char foo[x] into PACK_BYTES, so we don't need to take
its address in the macro. This is parallel to the UNPACK_BYTES macro
(where we pass a char bar[x] into it, and don't take its address in
the macro).
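
One way an extra address-of can go wrong, as a hedged sketch. It assumes
the buffer reaches the macro as a function parameter; the commit does not
say how foo is declared, and SKETCH_PACK_BYTES is a stand-in, not the real
macro.

```c
#include <string.h>

/* Stand-in pack macro: `src` is used as-is, with no address-of. */
#define SKETCH_PACK_BYTES(dst, src, len) memcpy((dst), (src), (len))

/* A parameter written as `char name[8]` is really a `char *`, so
 * `&name` inside a macro would be the address of the pointer itself,
 * not the address of the bytes it points to. */
static void sketch_pack_name(char *out, char name[8])
{
    SKETCH_PACK_BYTES(out, name, 8);
}
```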
The value we're packing is only used to output in a show_help message,
which is why this wasn't noticed before (i.e., it's not used in
network or addressing that would have caused a failure).
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r31594.
Use the new opal dstore API (vs. the old RTE DB API).
(dstore is not going to the v1.8 series, so there's no need to CMR
this to v1.8)
This commit was SVN r31580.
This commit will improve the message rate when using the sendi function
by not waiting for the send to get to the remote process.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31526.
feature
This commit should fix a hang seen when running some of the one-sided
tests. The downside of this fix is it reduces the maximum size of the
messages that use the fast boxes. I will fix this in a later commit.
To improve performance under a heavy load I introduced sequencing to
ensure messages are given to the pml in order. I have seen little to no
impact on the message rate or latency with this change and there is a
clear improvement to the heavy message rate case.
Let's let this sit in the trunk for a couple of days to ensure that
everything is working correctly.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31522.
Two things to note:
- This change will allow us to expand the BTL interface without
having to worry about modifying BTLs that will not support the new
interfaces. More on this will come later this year as part of the
1.9 series.
- C99 guarantees that uninitialized members of structs declared outside
of functions (DATA binary section) will be initialized with
0's. This allows us to drop stuff like .btl_flags = 0, or .btl_get
= NULL.
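
A small sketch of the second point, using a made-up module struct (not the
real BTL module type). Members that are not named in the initializer of a
file-scope object start out as 0 or NULL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced module struct. */
struct sketch_btl_module {
    uint32_t flags;
    int (*send)(const void *buf, size_t len);
    int (*get)(void *buf, size_t len);
};

static int sketch_send(const void *buf, size_t len)
{
    (void)buf;
    return (int)len;
}

/* Only .send is named; .flags and .get are implicitly zero/NULL, so
 * there is no need to spell out .flags = 0 or .get = NULL. */
static struct sketch_btl_module sketch_module = {
    .send = sketch_send,
};
```

A caller can then test sketch_module.get against NULL to see whether the
optional interface is provided.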
This commit was SVN r31388.
This commit fixes two nasty races:
- One can occur if the connection request message and connection completion
message arrive out of order. This can happen normally when adaptive routing
is used and also in a timeout situation where a UD message is lost.
- One occurs when handling an ack at the same time as we are handling the
message timeout. In this case we cannot free the message or the timeout
will be operating on invalid data. This fix is a band-aid until I can come
up with a better approach. Instead of freeing the message it is marked
as inactive and the event callback is triggered immediately (this has no
effect if the callback is already active). The callback then frees the
message if it is inactive.
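
A single-threaded sketch of the mark-inactive approach from the second
item. The struct and function names are hypothetical, and the real code
also needs whatever synchronization and event handling the BTL uses.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical in-flight message tracked by a timeout callback. */
struct sketch_msg {
    bool active;
    int  retries;
};

/* Ack path: do not free here, because the timeout callback may still
 * touch the message. Just mark it inactive. */
static void sketch_handle_ack(struct sketch_msg *msg)
{
    msg->active = false;
}

/* Timeout callback: the only place that frees the message.
 * Returns true if the message was freed. */
static bool sketch_timeout_cb(struct sketch_msg *msg)
{
    if (!msg->active) {
        free(msg);
        return true;
    }
    msg->retries++;  /* a real callback would resend here */
    return false;
}
```

Because only the callback frees, the ack handler can never pull the
message out from under a timeout that is already running.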
cmr=v1.8.1:reviewer=pasha
This commit was SVN r31305.
directory not the job's
This bug didn't affect the correctness of the vader results just the
cleanup. This commit removes an error message about removing a non-existent
file.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31265.
If ompi_modex_recv() fails with OPAL_ERR_DATA_VALUE_NOT_FOUND, it
simply means that the peer process did not put any usnic BTL modex
info -- it is not an error. So have the usnic BTL simply ignore that
peer (vs. disqualifying itself / treating this like a real error).
Refs trac:4442.
This commit was SVN r31258.
The following Trac tickets were found above:
Ticket 4442 --> https://svn.open-mpi.org/trac/ompi/ticket/4442
This commit should finish the work started for #869. Closing that ticket
with this commit.
Closes trac:869
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31257.
The following Trac tickets were found above:
Ticket 869 --> https://svn.open-mpi.org/trac/ompi/ticket/869
Fix a one line bug when dealing with non-contiguous sends in prepare_src. Bug was
identified by the intel test suite.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31232.
In most cases, bad messages received by the connectivity checker are
just dropped. However, in one specific code path, a bad packet caused
an abort. Doh!
This commit does two things:
1. Improve verbose messages for all these cases
2. Simply drop incoming messages that cannot be identified as ACKs or PINGs
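
The drop-instead-of-abort behavior can be sketched as below. The message
type values and the one-byte type field are assumptions for illustration;
the real wire format is not described in this log.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire values for the two known message types. */
enum sketch_msg_type { SKETCH_MSG_PING = 1, SKETCH_MSG_ACK = 2 };

enum sketch_action { SKETCH_HANDLE_PING, SKETCH_HANDLE_ACK, SKETCH_DROP };

/* Decide what to do with an incoming buffer. Anything that is not
 * clearly a PING or an ACK is dropped rather than treated as fatal. */
static enum sketch_action sketch_classify(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len < 1) {
        return SKETCH_DROP;
    }
    switch (buf[0]) {
    case SKETCH_MSG_PING:
        return SKETCH_HANDLE_PING;
    case SKETCH_MSG_ACK:
        return SKETCH_HANDLE_ACK;
    default:
        return SKETCH_DROP;  /* unknown type: log verbosely and ignore */
    }
}
```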
Submitted by Jeff Squyres, reviewed by Dave Goodell.
cmr=v1.8:reviewer=ompi-rm1.8
This commit was SVN r31225.
* Ensure that all endpoints[x] values are initialized to NULL
* If ibv_create_ah fails, remove each endpoint from the
module->all_endpoints list so that the endpoint can be destructed
properly.
Submitted by Jeff Squyres, reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31111.
* clang warning stomp
* memory barrier for volatile variable use
These can go to 1.7.5 or can slip to v1.8 -- RM decision.
Submitted by Jeff Squyres, reviewed by Dave Goodell
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30944.
* Older versions of libusnic_verbs actually return 0 when querying for
an unknown port. So also check for a magic ID in the returned data
to *really* know if the usnic extensions are supported.
* Use a union (in the common_verbs area) and memcpy (in the btl) to
avoid undefined C type aliasing behavior.
* Be sure to memset the function table to 0 if the usnic extensions
are not supported.
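
A generic sketch of the memcpy technique from the second bullet (the
helper names are hypothetical and unrelated to the usnic code). Copying
bytes in and out of a properly typed object avoids casting a byte pointer
to another pointer type and dereferencing it.

```c
#include <stdint.h>
#include <string.h>

/* Read a uint32_t from an arbitrary (possibly unaligned) byte buffer
 * without dereferencing a cast pointer. */
static uint32_t sketch_load_u32(const unsigned char *bytes)
{
    uint32_t v;

    memcpy(&v, bytes, sizeof(v));
    return v;
}

/* Write a uint32_t into a byte buffer the same way. */
static void sketch_store_u32(unsigned char *bytes, uint32_t v)
{
    memcpy(bytes, &v, sizeof(v));
}
```

Compilers commonly turn such fixed-size memcpy calls into plain loads and
stores, so the safer form need not cost anything at run time.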
Submitted by Jeff Squyres, reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30935.
Realistically, the usnic BTL doesn't need to know anything about the
underlying transport except for its header length (so that it knows
where the payload begins in a received buffer). So remove the use of
the specific transport prefix union and just rely on the usnic verbs
extension to tell us what the header length is if we're using the
usNIC/UDP transport, or sizeof(struct ibv_grh) if we're using usNIC/L2
transport.
This commit was SVN r30914.
If they exist, call the usnic verbs extensions to both enable UDP
support and get the UD receiver header length that should be used
(rather than assume 40/struct GRH).
This commit was SVN r30912.
We don't use this functionality any more; we use the transport_type
and device name to identify usnic devices. It's slightly easier
because we can get transport_type+name from ibv_device_open() and don't
have to do an additional ibv_query_device() to get its attributes.
Reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30882.
Follow on to SVN trunk r30850: consolidate the ibv_create_ah() calls
into a single loop, MPI_WAITALL-style. That is, call the (effectively
non-blocking) ibv_create_ah() for each endpoint. If we get
NULL+EAGAIN, it means that the UDP ARP is still ongoing down in the
kernel, so just try again later. We put these all into a single loop
because it allows us to parallelize the ARP progress in the kernel.
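
A sketch of the single-loop retry pattern, with a stand-in for the
non-blocking create call. sketch_try_create() and struct sketch_endpoint
are hypothetical; they only simulate a call that fails with EAGAIN until
some background work (here a countdown) has finished.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical endpoint: not ready until it has been polled
 * `polls_needed` more times. */
struct sketch_endpoint {
    int  polls_needed;
    bool ready;
};

/* Stand-in for a non-blocking create call. */
static bool sketch_try_create(struct sketch_endpoint *ep)
{
    if (ep->polls_needed > 0) {
        ep->polls_needed--;
        errno = EAGAIN;  /* still in progress, try again later */
        return false;
    }
    ep->ready = true;
    return true;
}

/* Keep sweeping over all endpoints until each one is ready, so that
 * the pending operations make progress concurrently.
 * Returns the number of sweeps, or -1 on a non-EAGAIN failure. */
static int sketch_create_all(struct sketch_endpoint *eps, size_t n)
{
    size_t pending = 0;
    int passes = 0;

    for (size_t i = 0; i < n; i++) {
        if (!eps[i].ready) {
            pending++;
        }
    }
    while (pending > 0) {
        passes++;
        for (size_t i = 0; i < n; i++) {
            if (eps[i].ready) {
                continue;
            }
            if (sketch_try_create(&eps[i])) {
                pending--;
            } else if (errno != EAGAIN) {
                return -1;
            }
        }
    }
    return passes;
}
```

A real loop would likely also bound the total wait time and yield or
sleep between sweeps rather than spin.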
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30879.
The following SVN revision numbers were found above:
r30850 --> open-mpi/ompi@3641500442
r30852 --> open-mpi/ompi@4e282a3295
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253