This commit fixes leaks of bml endpoints in finalize. A summary of the
bugs/fixes is below.
1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
never released the reference to them (ompi_proc_all called
OBJ_RETAIN on all the procs returned). When calling del_procs at
finalize it should suffice to call ompi_proc_world which does not
increment the reference count.
2) del_procs is called BEFORE ompi_comm_finalize. This leaves the
references to the procs from calling the pml_add_comm
function. The fix is to reorder the calls to do omp_comm_finalize,
del_procs, pml_finalize instead of del_procs, pml_finalize,
ompi_comm_finalize.
3) The check in del_procs in r2 checked for a reference count of
1. This is incorrect. At this point there should be 2 references:
1 from ompi_proc, and another from the add_procs. The fix is to
change this check to look for a reference count of 22. This check
makes me extremely uncomforable as nothing will call del_procs if
the reference count of a procs is not 2 when del_procs is
called. Maybe there should be an assert since this is a developer
error IMHO.
cmr=v1.8.2:reviewer=bosilca
This commit was SVN r31782.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
netpatterns_setup_narray_knomial_tree.
Fix a bug in ptpcoll that caused memory allocated by
netpatterns_setup_narray_knomial_tree to leak.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31781.
vagrind correctly indicated that the mpool is needed when tearing down
the free lists. Reordered the teardown code to correct this.
My code so no review required.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31780.
Two leaks are fixed in this commit:
- Do not leak btl component list items.
- Do not leak the nodename when decoding the pidmap.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31779.
This commit fixes three leaks:
- bml/r2: fix leak of del_procs in mca_bml_r2_del_procs
- Release the modex data in btl/scif, btl/ugni, and btl/vader
- ompi_mpi_finalize: close the allocator framework
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31778.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
a brand new dummy connection has to be used to properly
trigger the main thread termination and avoid timeout in
the main thread
cmr=v1.8.2:reviewer=hjelmn
This commit was SVN r31772.
The local buffer object is only used by the root of the group,
so move down the buffer declaration and initialization.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31771.
An OBJ_DESTRUCT was missing for mca_pml_ob1.send_ranges causing a
memory leak. Identified by valgrind.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31768.
This commit fixes two leaks:
- We never destructed the attributes on MPI_COMM_WORLD. All other
communicators that have attributes are released through
ompi_comm_free which does the attribute destruction. For
MPI_COMM_WORLD this is now done before the destructor is called.
- Add missing OBJ_RELEASE for ompi_comm_f_to_c_table.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31766.
scif_close is not causing scif_poll in the listening thread to return
as expected. To ensure the thread exits attempt to make a local
connection to wake up the thread before calling pthread_join.
cmr=v1.8.2:reviewer=ggouaillardet
This commit was SVN r31756.
closed
We were leaving the selected component open. This commit should
eliminate a leak detected by valgrind.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31749.
The smsg_mboxes free list was not getting destructed. The construct
has been moved to module initialization and a matching destruct is now
in the module destruct.
This commit was SVN r31746.
This commit fixes memory leaks discovered in the sbgp setup code. We
were leaking an opal_argv as well as some list items. I took the
opportunity to clean up the code a little which includes making use of
the opal_argv_free function.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31745.
The items in the available bcol list were getting leaked. This commit
fixes this leak. I also cleaned up the code a bit. This includes
making use of the opal_argv_free function.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31744.
It is essential to call mca_base_framework_close for every framework
that is opened. coll/ml was not doing this so neither bcol nor sbgp
were getting cleaned up. This commit fixes this omission.
Also fixed a leak caused by calling OBJ_DESTRUCT for something created
with OBJ_NEW. With these changes coll/ml appears to be valgrind clean.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31743.
Simple bug. The dist_graph pointer must be a constructed object. The
change from malloc to OBJ_NEW was missing from r31716. Tested with MTT
and everything looks ok now.
This commit was SVN r31739.
The following SVN revision numbers were found above:
r31716 --> open-mpi/ompi@e3df77548d
There is no reason to cancel the listening thread. It should die
automatically when the file descriptor is closed. It is sufficient
to just wait for the thread to exit with pthread join.
cmr=v1.8.2:ticket=trac:4616:reviewer=jsquyres
This commit was SVN r31738.
The following Trac tickets were found above:
Ticket 4616 --> https://svn.open-mpi.org/trac/ompi/ticket/4616
We were leaking file descriptors when coll/ml was in use. It turn out
this was because basesmuma was failing to unmap files it had previously
mapped. This commit cleans up the setup code to ensure that we only
attempt to map the control files once per module and then ensures the
files are unmapped when the module is released.
cmr=v1.8.2:reviewer=manjugv
This commit was SVN r31737.
algorithm
Per suggestion from Manju make sure there isn't a gap in the size ranges
for the available algorithms.
cmr=v1.8.2:ticket=trac:4437:reviewer=ompi-rm1.8
This commit was SVN r31728.
The following Trac tickets were found above:
Ticket 4437 --> https://svn.open-mpi.org/trac/ompi/ticket/4437
MPI_Cart_Create/MPI_Graph_create/MPI_Dist_Graph
Fixes trac:4581
This commit was SVN r31716.
The following Trac tickets were found above:
Ticket 4581 --> https://svn.open-mpi.org/trac/ompi/ticket/4581
In preparation for moving the BTLs down to OPAL, discontinue the use
of the RML for connectivity client/agent communication. Instead, use
local unix domain sockets in the job session directory (all
communication is between processes on the same server, so unix domain
sockets are fine).
This commit was SVN r31710.
Add the component use_udp value into the modex. If my component's
use_udp value doesn't agree with the use_udp value from a peer's modex
data, print a helpful message and disqualify the usnic BTL (the usnic
BTL will not be used). This prevents accidental customer
misconfigurations.
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31689.
Trivial struct re-ordering to eliminate holes in the middle of the
struct (although there's still a hole at the end) and reduce the
overall size of the struct from 64 to 56 bytes. Also change mtu from
int to uint16_t; there was no need for it to be that large.
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31688.
Fix mismatch between the MCA param (which expresses the timeout in
*mili*seconds) and the struct timeval timeout (which expresses the
timeout in *micro*seconds).
Reviewed by Dave Goodell
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r31687.
So track that the rte has reached that point, and only emit the new message if it is accurate.
Note that we still generate a TON of output for a minor error:
Ralphs-iMac:examples rhc$ mpirun -n 3 -mca btl sm ./hello_c
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[50239,1],2]) is on host: Ralphs-iMac
Process 2 ([[50239,1],2]) is on host: Ralphs-iMac
BTLs attempted: sm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[50239,1],2]
Exit code: 1
--------------------------------------------------------------------------
[Ralphs-iMac.local:23227] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[Ralphs-iMac.local:23227] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[Ralphs-iMac.local:23227] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
Ralphs-iMac:examples rhc$
Hopefully, we can agree on a way to reduce this verbage!
This commit was SVN r31686.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
Only thing left is ompilibdir being treated similar to what we dif for srcdir/builddir. Coming soon.
This commit was SVN r31678.
In abusive MPI communication patterns, sending a UDP ping only once a
second may not be sufficient -- all the UDP pings may be dropped. So
increase the frequency of the pings to every quarter second, and allow
more total pings to be sent.
Total timeout time is still the same (10 seconds) -- we'll just now
try 40 times (i.e., once every quarter second) as opposed to 10 times
(i.e., once a second). Testing has shown that this frequency allows
the connectivity checker to always succeed even in the many-to-one
abusive communication patterns.
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r31602.
We're passing a char foo[x] into PACK_BYTES, so we don't need to take
its address in the macro. This is parallel to the UNPACK_BYTES macro
(where we pass a char bar[x] into it, and don't take its address in
the macro).
The value we're packing is only used to output in a show_help message,
which is why this wasn't noticed before (i.e., it's not used in
network or addressing that would have caused a failure).
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r31594.