This would be a really, really weird case if it ever happens (i.e.,
you have usnics but the agent process failed somewhere in MPI_INIT
such that the agent never appears), but having an infinite loop
doesn't seem like a good idea.
(does not need to go to v1.8 because v1.8 still uses RML for
communication for the connectivity checker)
This commit was SVN r31932.
If eager RDMA is used, the endpoint reference_count is greater than one.
This commit is a temporary fix that calls OBJ_RELEASE on the endpoint as many times as needed.
Though this is likely correct, it can be suboptimal and hence needs to be reviewed.
cmr=v1.8.2:reviewer=hjelmn
This commit was SVN r31922.
Now that the infrastructure is calling BTL del_procs() before the BTL
finalize(), the usnic BTL had to re-order some of its teardown
sequence to avoid assert() failing.
This is part of a larger conversation involving #4669. Since
MPI_FINALIZE and MPI_COMM_DISCONNECT currently use an
oob/grpcomm-based barrier, the usnic BTL can ''absolutely know'' that
these endpoints and procs will no longer be used. If the ORTE DPM
goes back to a PML-based barrier, the usnic BTL will need to grow more
complex teardown semantics (a la TCP socket FIN/ACK/FIN_WAIT states).
Refs trac:4669
This commit was SVN r31871.
The following Trac tickets were found above:
Ticket 4669 --> https://svn.open-mpi.org/trac/ompi/ticket/4669
Make sure to clean up the registration cache when removing an
endpoint. This leak only affects systems with XPMEM installed.
Since this code is specific to XPMEM, I am not sure who could review
it, so I am sending it directly to the RM.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31821.
Valgrind correctly indicated that the mpool is needed when tearing down
the free lists. Reordered the teardown code to correct this.
My code so no review required.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31780.
Two leaks are fixed in this commit:
- Do not leak btl component list items.
- Do not leak the nodename when decoding the pidmap.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31779.
This commit fixes three leaks:
- bml/r2: fix leak of del_procs in mca_bml_r2_del_procs
- Release the modex data in btl/scif, btl/ugni, and btl/vader
- ompi_mpi_finalize: close the allocator framework
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31778.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
A brand new dummy connection has to be used to properly
trigger the listening thread's termination and avoid a timeout in
the main thread.
cmr=v1.8.2:reviewer=hjelmn
This commit was SVN r31772.
scif_close is not causing scif_poll in the listening thread to return
as expected. To ensure the thread exits, attempt to make a local
connection to wake up the thread before calling pthread_join.
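
The wake-then-join pattern can be sketched with plain POSIX calls. This is an illustration only, not the btl/scif code: a pipe stands in for the SCIF listening endpoint, writing one byte plays the role of the dummy local connection, and `struct listener`, `start_listener`, and `stop_listener` are hypothetical names.

```c
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical listener state; not the btl/scif structures. */
struct listener {
    int wake_fds[2];     /* pipe: [0] is polled by the thread, [1] is written by main */
    atomic_bool exiting; /* set before the wake-up is sent */
};

static void *listen_thread(void *arg)
{
    struct listener *l = arg;
    struct pollfd pfd = { .fd = l->wake_fds[0], .events = POLLIN };

    while (!atomic_load(&l->exiting)) {
        /* Block with no timeout, like a listener waiting for connections. */
        if (poll(&pfd, 1, -1) > 0) {
            char byte;
            if (read(pfd.fd, &byte, 1) != 1) {
                break; /* pipe closed or failed: nothing more to wait for */
            }
        }
    }
    return NULL;
}

static int start_listener(struct listener *l, pthread_t *tid)
{
    atomic_init(&l->exiting, false);
    if (pipe(l->wake_fds) != 0) {
        return -1;
    }
    return pthread_create(tid, NULL, listen_thread, l);
}

/* Returns 0 once the thread has been woken and joined. */
static int stop_listener(struct listener *l, pthread_t tid)
{
    atomic_store(&l->exiting, true);
    /* Wake the thread explicitly instead of relying on close() to
     * interrupt the blocking poll(). */
    if (write(l->wake_fds[1], "x", 1) != 1) {
        return -1;
    }
    int rc = pthread_join(tid, NULL);
    close(l->wake_fds[0]);
    close(l->wake_fds[1]);
    return rc;
}
```

Setting the flag before sending the wake-up means the thread sees the exit request on its next loop check, so the join cannot block on a poll that never returns.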
cmr=v1.8.2:reviewer=ggouaillardet
This commit was SVN r31756.
The smsg_mboxes free list was not getting destructed. The construct
has been moved to module initialization and a matching destruct is now
in the module destruct.
This commit was SVN r31746.
There is no reason to cancel the listening thread. It should die
automatically when the file descriptor is closed. It is sufficient
to just wait for the thread to exit with pthread_join.
cmr=v1.8.2:ticket=trac:4616:reviewer=jsquyres
This commit was SVN r31738.
The following Trac tickets were found above:
Ticket 4616 --> https://svn.open-mpi.org/trac/ompi/ticket/4616
In preparation for moving the BTLs down to OPAL, discontinue the use
of the RML for connectivity client/agent communication. Instead, use
local unix domain sockets in the job session directory (all
communication is between processes on the same server, so unix domain
sockets are fine).
This commit was SVN r31710.
Add the component use_udp value into the modex. If my component's
use_udp value doesn't agree with the use_udp value from a peer's modex
data, print a helpful message and disqualify the usnic BTL (the usnic
BTL will not be used). This prevents accidental customer
misconfigurations.
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31689.
Trivial struct re-ordering to eliminate holes in the middle of the
struct (although there's still a hole at the end) and reduce the
overall size of the struct from 64 to 56 bytes. Also change mtu from
int to uint16_t; there was no need for it to be that large.
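
A small illustration of the effect, assuming a made-up set of fields (this is not the actual usnic struct): ordering members from largest to smallest alignment removes the interior padding and leaves at most a hole at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fields, declared in an order that forces interior padding. */
struct with_holes {
    uint8_t  flag;   /* followed by padding so that addr is aligned */
    uint64_t addr;
    uint16_t mtu;    /* followed by padding so that key is aligned */
    uint64_t key;
    uint32_t ipv4;
};

/* Same fields, largest alignment first: no holes in the middle. */
struct reordered {
    uint64_t addr;
    uint64_t key;
    uint32_t ipv4;
    uint16_t mtu;
    uint8_t  flag;   /* any remaining padding sits after this member */
};
```

On a typical x86-64 ABI, `sizeof(struct with_holes)` is 40 and `sizeof(struct reordered)` is 24; the exact numbers depend on the platform's alignment rules.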
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31688.
Fix mismatch between the MCA param (which expresses the timeout in
*milli*seconds) and the struct timeval timeout (which expresses the
timeout in *micro*seconds).
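
For illustration, a conversion along these lines avoids the unit mismatch. The helper name `ms_to_timeval` is hypothetical and is not taken from the usnic BTL.

```c
#include <sys/time.h>

/* Convert a timeout expressed in milliseconds into a struct timeval,
 * whose tv_usec member is expressed in microseconds. */
static struct timeval ms_to_timeval(long timeout_ms)
{
    struct timeval tv;
    tv.tv_sec  = timeout_ms / 1000;          /* whole seconds */
    tv.tv_usec = (timeout_ms % 1000) * 1000; /* remaining ms -> us */
    return tv;
}
```

Assigning the millisecond value directly to `tv_usec` would make the timeout 1000 times shorter than intended.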
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31687.
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
The only thing left is to treat ompilibdir similarly to what we did for srcdir/builddir. Coming soon.
This commit was SVN r31678.
In abusive MPI communication patterns, sending a UDP ping only once a
second may not be sufficient -- all the UDP pings may be dropped. So
increase the frequency of the pings to every quarter second, and allow
more total pings to be sent.
Total timeout time is still the same (10 seconds) -- we'll just now
try 40 times (i.e., once every quarter second) as opposed to 10 times
(i.e., once a second). Testing has shown that this frequency allows
the connectivity checker to always succeed even in the many-to-one
abusive communication patterns.
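
The arithmetic can be sketched with a simulated retry loop. Everything here is an assumption made for illustration: the names `ping_fn`, `drop_first_n`, and `ping_until_ack` are hypothetical, the loss model (the first N pings are dropped) is deliberately simplistic, and no real sleeping or networking takes place.

```c
#include <stdbool.h>

/* Hypothetical callback: sends one ping, returns true if an ACK came back. */
typedef bool (*ping_fn)(int attempt, void *ctx);

/* Simulated lossy link: drops the first *ctx pings, then delivers. */
static bool drop_first_n(int attempt, void *ctx)
{
    int drops = *(int *) ctx;
    return attempt > drops;
}

/* Returns the 1-based attempt that was ACKed, or 0 if every try failed.
 * *waited_ms accumulates the simulated wait between attempts. */
static int ping_until_ack(ping_fn ping, void *ctx,
                          int max_tries, int interval_ms, int *waited_ms)
{
    *waited_ms = 0;
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (ping(attempt, ctx)) {
            return attempt;
        }
        *waited_ms += interval_ms;
    }
    return 0;
}
```

With 12 dropped pings, 10 tries at 1000 ms exhaust the 10-second budget without success, while 40 tries at 250 ms succeed on the 13th attempt within the same budget.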
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r31602.
We're passing a char foo[x] into PACK_BYTES, so we don't need to take
its address in the macro. This is parallel to the UNPACK_BYTES macro
(where we pass a char bar[x] into it, and don't take its address in
the macro).
The value we're packing is only used to output in a show_help message,
which is why this wasn't noticed before (i.e., it's not used in
network or addressing that would have caused a failure).
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r31594.
Use the new opal dstore API (vs. the old RTE DB API).
(dstore is not going to the v1.8 series, so there's no need to CMR
this to v1.8)
This commit was SVN r31580.
This commit will improve the message rate when using the sendi function
by not waiting for the send to get to the remote process.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31526.
feature
This commit should fix a hang seen when running some of the one-sided
tests. The downside of this fix is it reduces the maximum size of the
messages that use the fast boxes. I will fix this in a later commit.
To improve performance under a heavy load I introduced sequencing to
ensure messages are given to the pml in order. I have seen little to no
impact on the message rate or latency with this change and there is a
clear improvement to the heavy message rate case.
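
One generic way to deliver messages in order with sequence numbers is sketched below. This is not the vader implementation; the window size, `struct reorder`, `reorder_recv`, and the logging callback are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW 8 /* hypothetical reorder window; must divide 65536 */

struct reorder {
    uint16_t next_seq;        /* next sequence number to deliver */
    bool     present[WINDOW]; /* slot holds a parked message */
    int      payload[WINDOW];
};

typedef void (*deliver_fn)(int payload, void *ctx);

/* Example callback that records the delivery order. */
struct delivery_log { int items[16]; size_t count; };

static void log_deliver(int payload, void *ctx)
{
    struct delivery_log *log = ctx;
    if (log->count < 16) {
        log->items[log->count++] = payload;
    }
}

/* Accept one message. It is delivered immediately if it is the next one
 * expected; otherwise it is parked until the gap is filled. Returns false
 * if seq falls outside the window (too far ahead, or already delivered). */
static bool reorder_recv(struct reorder *r, uint16_t seq, int payload,
                         deliver_fn deliver, void *ctx)
{
    uint16_t ahead = (uint16_t) (seq - r->next_seq); /* wraps modulo 2^16 */
    if (ahead >= WINDOW) {
        return false;
    }
    r->present[seq % WINDOW] = true;
    r->payload[seq % WINDOW] = payload;

    /* Drain every consecutive message starting at next_seq. */
    while (r->present[r->next_seq % WINDOW]) {
        uint16_t slot = r->next_seq % WINDOW;
        r->present[slot] = false;
        deliver(r->payload[slot], ctx);
        r->next_seq++;
    }
    return true;
}
```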
Let's let this sit in the trunk for a couple of days to ensure that
everything is working correctly.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31522.
Two things to note:
- This change will allow us to expand the BTL interface without
having to worry about modifying BTLs that will not support the new
interfaces. More on this will come later this year as part of the
1.9 series.
- C99 guarantees that uninitialized members of structs declared outside
of functions (DATA binary section) will be initialized with
0's. This allows us to drop stuff like .btl_flags = 0, or .btl_get
= NULL.
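
A minimal sketch of the second point, using a trimmed-down, hypothetical module struct (`struct demo_module` is not the real BTL interface):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical module struct with a mix of data and function pointers. */
struct demo_module {
    uint32_t flags;
    size_t   eager_limit;
    int    (*send)(const void *buf, size_t len);
    int    (*get)(void *buf, size_t len);
};

static int demo_send(const void *buf, size_t len)
{
    (void) buf;
    return (int) len;
}

/* File-scope object: members that are not named in the initializer are
 * zero-initialized, so there is no need for .flags = 0 or .get = NULL. */
static struct demo_module demo = {
    .eager_limit = 4096,
    .send        = demo_send,
};
```

A caller can then test `demo.get != NULL` to find out whether the optional interface is provided, which is what makes it possible to add new members without touching every module.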
This commit was SVN r31388.
This commit fixes two nasty races:
- One can occur if the connection request message and connection completion
message arrive out of order. This can happen normally when adaptive routing
is used and also in a timeout situation where a UD message is lost.
- One occurs when handling an ack at the same time as we are handling the
message timeout. In this case we cannot free the message or the timeout
will be operating on invalid data. This fix is a band-aid until I can come
up with a better approach. Instead of freeing the message it is marked
as inactive and the event callback is triggered immediately (this has no
effect if the callback is already active). The callback then frees the
message if it is inactive.
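
The band-aid for the second race can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names (`struct pending_msg`, `handle_ack`, `timeout_cb`); the real code triggers an event callback rather than calling a function directly.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical in-flight message tracked by a timeout. */
struct pending_msg {
    bool active;  /* false once the ack has been handled */
    int  retries; /* number of timeouts seen while still active */
};

/* Ack path: never frees the message. It only marks it inactive, leaving
 * the timeout callback as the single owner responsible for the free. */
static void handle_ack(struct pending_msg *msg)
{
    msg->active = false;
    /* The real code would trigger the event callback immediately here. */
}

/* Timeout callback: the only place a message is freed.
 * Returns true if the message was freed. */
static bool timeout_cb(struct pending_msg *msg)
{
    if (!msg->active) {
        free(msg);
        return true;
    }
    msg->retries++; /* still waiting for the ack: a resend would go here */
    return false;
}
```

Because only the callback frees, an ack that arrives while the timeout is being processed can no longer leave the callback operating on freed memory.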
cmr=v1.8.1:reviewer=pasha
This commit was SVN r31305.
directory not the job's
This bug didn't affect the correctness of the vader results, just the
cleanup. This commit removes an error message about removing a non-existent
file.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31265.
If ompi_modex_recv() fails with OPAL_ERR_DATA_VALUE_NOT_FOUND, it
simply means that the peer process did not put any usnic BTL modex
info -- it is not an error. So have the usnic BTL simply ignore that
peer (vs. disqualifying itself / treating this like a real error).
Refs trac:4442.
This commit was SVN r31258.
The following Trac tickets were found above:
Ticket 4442 --> https://svn.open-mpi.org/trac/ompi/ticket/4442
This commit should finish the work started for #869. Closing that ticket
with this commit.
Closes trac:869
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31257.
The following Trac tickets were found above:
Ticket 869 --> https://svn.open-mpi.org/trac/ompi/ticket/869
Fix a one-line bug when dealing with non-contiguous sends in prepare_src. The bug was
identified by the Intel test suite.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31232.