
7437 Commits

Author SHA1 Message Date
George Bosilca
750c6c7861 Update the UTK copyright on the topology related files.
This commit was SVN r31805.
2014-05-16 22:23:52 +00:00
Gilles Gouaillardet
96ae38823d Fix a memory leak in ompi_osc_base_finalize()
cmr=v1.8.2:reviewer=rhc

This commit was SVN r31788.
2014-05-16 05:25:24 +00:00
Gilles Gouaillardet
fc96b0a7b8 Fix a typo in mca_bml_r2_del_procs()
Use bml_endpoint->btl_eager instead of bml_endpoint->btl_send.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31786.
2014-05-16 04:43:18 +00:00
Nathan Hjelm
e97e4cf924 Add missing include.
cmr=v1.8.2:ticket=trac:4639

This commit was SVN r31784.

The following Trac tickets were found above:
  Ticket 4639 --> https://svn.open-mpi.org/trac/ompi/ticket/4639
2014-05-15 19:52:06 +00:00
Nathan Hjelm
572507f451 btl/vader: fix del_procs bug exposed by the fact we now actually call
del_procs

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31783.
2014-05-15 19:26:49 +00:00
Nathan Hjelm
faf008f527 Fix bugs that were causing leaks in finalize.
This commit fixes leaks of bml endpoints in finalize. A summary of the
bugs/fixes is below.

 1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
    never released the reference to them (ompi_proc_all called
    OBJ_RETAIN on all the procs returned). When calling del_procs at
    finalize it should suffice to call ompi_proc_world which does not
    increment the reference count.

 2) del_procs is called BEFORE ompi_comm_finalize. This leaves the
    references to the procs from calling the pml_add_comm
    function. The fix is to reorder the calls to do ompi_comm_finalize,
    del_procs, pml_finalize instead of del_procs, pml_finalize,
    ompi_comm_finalize.

 3) The check in del_procs in r2 checked for a reference count of
    1. This is incorrect. At this point there should be 2 references:
    1 from ompi_proc, and another from the add_procs. The fix is to
    change this check to look for a reference count of 2. This check
    makes me extremely uncomfortable as nothing will call del_procs if
    the reference count of a proc is not 2 when del_procs is
    called. Maybe there should be an assert since this is a developer
    error IMHO.
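
The reference-count reasoning in item 3 can be sketched with a toy object model. Every name below (toy_proc_t, toy_proc_retain, toy_proc_release, toy_add_proc, toy_del_proc) is hypothetical; the code only mimics the retain/release idea and is not the OPAL object system or the r2 code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted proc object. */
typedef struct toy_proc {
    int refcount;
    int endpoint_attached;   /* 1 while a (toy) endpoint hangs off the proc */
} toy_proc_t;

/* The proc list holds the first reference. */
static toy_proc_t *toy_proc_new(void)
{
    toy_proc_t *p = calloc(1, sizeof(*p));
    if (p != NULL) {
        p->refcount = 1;
    }
    return p;
}

static void toy_proc_retain(toy_proc_t *p)
{
    assert(p != NULL);
    p->refcount++;
}

/* Drops one reference; returns 1 if the object was freed. */
static int toy_proc_release(toy_proc_t *p)
{
    assert(p != NULL && p->refcount > 0);
    if (--p->refcount == 0) {
        free(p);
        return 1;
    }
    return 0;
}

/* add_procs takes its own reference and attaches an endpoint. */
static void toy_add_proc(toy_proc_t *p)
{
    toy_proc_retain(p);
    p->endpoint_attached = 1;
}

/* del_procs expects exactly two references (proc list + add_procs).
 * Returns 0 on success, -1 if some other holder still has a reference. */
static int toy_del_proc(toy_proc_t *p)
{
    if (p->refcount != 2) {
        return -1;
    }
    p->endpoint_attached = 0;
    toy_proc_release(p);     /* drop the add_procs reference */
    return 0;
}
```

In this sketch an extra, unreleased reference makes toy_del_proc() refuse to tear down the endpoint, which is the kind of silent leak the commit message worries about.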

cmr=v1.8.2:reviewer=bosilca

This commit was SVN r31782.

The following SVN revision numbers were found above:
  r2 --> open-mpi/ompi@58fdc18855
2014-05-15 18:28:03 +00:00
Nathan Hjelm
55f0dcb81a Add netpatterns_cleanup_narray_knomial_tree function to cleanup after
netpatterns_setup_narray_knomial_tree.

Fix a bug in ptpcoll that caused memory allocated by
netpatterns_setup_narray_knomial_tree to leak.

cmr=v1.8.2:reviewer=manjugv

This commit was SVN r31781.
2014-05-15 17:36:26 +00:00
Nathan Hjelm
d3dc2c9b0b btl/ugni: reorder teardown to avoid using freed memory
valgrind correctly indicated that the mpool is needed when tearing down
the free lists. Reordered the teardown code to correct this.

My code so no review required.

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31780.
2014-05-15 16:58:28 +00:00
Nathan Hjelm
73bfecd650 More leak fixes.
Two leaks are fixed in this commit:

 - Do not leak btl component list items.

 - Do not leak the nodename when decoding the pidmap.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31779.
2014-05-15 16:38:13 +00:00
Nathan Hjelm
e4db2c3ebb ompi: fix various small leaks
This commit fixes three leaks:

 - bml/r2: fix leak of del_procs in mca_bml_r2_del_procs

 - Release the modex data in btl/scif, btl/ugni, and btl/vader

 - ompi_mpi_finalize: close the allocator framework

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31778.

The following SVN revision numbers were found above:
  r2 --> open-mpi/ompi@58fdc18855
2014-05-15 15:59:51 +00:00
Gilles Gouaillardet
e836abfc51 btl/scif: fix hang at shutdown
a brand new dummy connection has to be used to properly
trigger the main thread termination and avoid timeout in
the main thread

cmr=v1.8.2:reviewer=hjelmn

This commit was SVN r31772.
2014-05-15 09:55:50 +00:00
Gilles Gouaillardet
3e19041425 Fix memory leak in mca_common_sm_rml_info_bcast(...)
The local buffer object is only used by the root of the group,
so move down the buffer declaration and initialization.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31771.
2014-05-15 08:41:14 +00:00
Nathan Hjelm
4113cfa03a pml/ob1: add missing OBJ_DESTRUCT
An OBJ_DESTRUCT was missing for mca_pml_ob1.send_ranges causing a
memory leak. Identified by valgrind.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31768.
2014-05-14 21:15:45 +00:00
Nathan Hjelm
32a85c6d7d allocator/bucket: free all memory associated with a bucket allocator
when it is finalized.

Fixes a leak identified by valgrind.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31767.
2014-05-14 21:15:39 +00:00
Nathan Hjelm
279c0a3ca7 comm: fix communicator subsystem leaks
This commit fixes two leaks:

 - We never destructed the attributes on MPI_COMM_WORLD. All other
   communicators that have attributes are released through
   ompi_comm_free which does the attribute destruction. For
   MPI_COMM_WORLD this is now done before the destructor is called.

 - Add missing OBJ_RELEASE for ompi_comm_f_to_c_table.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31766.
2014-05-14 21:15:33 +00:00
Ralph Castain
a202139656 Looks like someone missed a '|' in this comparison, so correct it here
This commit was SVN r31760.
2014-05-14 16:53:22 +00:00
Nathan Hjelm
82934b173c btl/scif: make the exiting member volatile
cmr=v1.8.2:ticket=trac:4626

This commit was SVN r31757.

The following Trac tickets were found above:
  Ticket 4626 --> https://svn.open-mpi.org/trac/ompi/ticket/4626
2014-05-14 16:22:24 +00:00
Nathan Hjelm
97605a4002 btl/scif: fix hang at shutdown
scif_close is not causing scif_poll in the listening thread to return
as expected. To ensure the thread exits attempt to make a local
connection to wake up the thread before calling pthread_join.
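
A minimal sketch of the wake-up idea, assuming plain POSIX sockets instead of SCIF: a thread blocks in poll() on a listening socket, and shutdown sets a flag, makes a local connection to wake the thread, then joins it. The toy_listener_* names are hypothetical, TCP loopback stands in for the SCIF endpoint, and the flag uses C11 atomics rather than a volatile member.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct {
    int        listen_fd;
    atomic_int exiting;      /* set to 1 when shutdown has been requested */
    pthread_t  thread;
} toy_listener_t;

/* Listening thread: sleeps in poll() until a connection arrives. */
static void *toy_listen_loop(void *arg)
{
    toy_listener_t *l = arg;
    struct pollfd pfd = { .fd = l->listen_fd, .events = POLLIN };

    while (!atomic_load(&l->exiting)) {
        int rc = poll(&pfd, 1, -1);
        if (rc < 0) {
            if (errno == EINTR) continue;
            break;
        }
        if (!(pfd.revents & POLLIN)) break;
        int fd = accept(l->listen_fd, NULL, NULL);
        if (fd >= 0) close(fd);   /* real code would hand the connection off */
    }
    return NULL;
}

static int toy_listener_start(toy_listener_t *l)
{
    struct sockaddr_in addr;

    assert(l != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;            /* let the kernel pick a free port */

    atomic_init(&l->exiting, 0);
    l->listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (l->listen_fd < 0) return -1;
    if (bind(l->listen_fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(l->listen_fd, 4) != 0 ||
        pthread_create(&l->thread, NULL, toy_listen_loop, l) != 0) {
        close(l->listen_fd);
        return -1;
    }
    return 0;
}

/* Shutdown: flag, wake the thread with a local connection, then join.
 * On a failed wake-up this sketch returns without joining. */
static int toy_listener_stop(toy_listener_t *l)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd, rc;

    assert(l != NULL);
    atomic_store(&l->exiting, 1);
    if (getsockname(l->listen_fd, (struct sockaddr *)&addr, &len) != 0) return -1;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    if (connect(fd, (struct sockaddr *)&addr, len) != 0) {
        close(fd);
        return -1;
    }
    rc = pthread_join(l->thread, NULL);
    close(fd);
    close(l->listen_fd);
    return rc == 0 ? 0 : -1;
}
```

Calling toy_listener_start() followed by toy_listener_stop() should return promptly instead of hanging in the join, because the dummy connection forces poll() to return so the thread can see the flag.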

cmr=v1.8.2:reviewer=ggouaillardet

This commit was SVN r31756.
2014-05-14 16:14:00 +00:00
George Bosilca
f27123a20d Fix the add_proc issue identified by Jeff: the TCP BTL now discards a
peer proc without TCP support instead of completely dropping TCP support for the entire job.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31753.
2014-05-14 13:47:57 +00:00
Nathan Hjelm
518f188ad4 bml/base: ensure all components are closed when the framework is
closed

We were leaving the selected component open. This commit should
eliminate a leak detected by valgrind.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31749.
2014-05-13 23:04:40 +00:00
Nathan Hjelm
c15c3dee4c common/ugni: fix bugs in r31557
This commit was SVN r31748.

The following SVN revision numbers were found above:
  r31557 --> open-mpi/ompi@c4c9bc1573
2014-05-13 22:37:01 +00:00
Nathan Hjelm
dd8de4d6eb btl/ugni: fix memory leaks and silence some warnings
The smsg_mboxes free list was not getting destructed. The construct
has been moved to module initialization and a matching destruct is now
in the module destruct.

This commit was SVN r31746.
2014-05-13 21:22:33 +00:00
Nathan Hjelm
9c45e4152d sbgp/base: fix memory leaks
This commit fixes memory leaks discovered in the sbgp setup code. We
were leaking an opal_argv as well as some list items. I took the
opportunity to clean up the code a little which includes making use of
the opal_argv_free function.

cmr=v1.8.2:reviewer=manjugv

This commit was SVN r31745.
2014-05-13 21:22:25 +00:00
Nathan Hjelm
ddd501c0d9 bcol/base: cleanup code and fix memory leak
The items in the available bcol list were getting leaked. This commit
fixes this leak. I also cleaned up the code a bit. This includes
making use of the opal_argv_free function.

cmr=v1.8.2:reviewer=manjugv

This commit was SVN r31744.
2014-05-13 21:22:18 +00:00
Nathan Hjelm
c32d84154a coll/ml: fix leaks and close all the frameworks opened
It is essential to call mca_base_framework_close for every framework
that is opened. coll/ml was not doing this so neither bcol nor sbgp
were getting cleaned up. This commit fixes this omission.

Also fixed a leak caused by calling OBJ_DESTRUCT for something created
with OBJ_NEW. With these changes coll/ml appears to be valgrind clean.
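
Why pairing a destruct-only call with a heap-allocating "new" leaks can be shown with a toy model. This is a sketch under the assumption that "new" allocates and constructs while "destruct" only runs the destructor; the toy_obj_* names and the live-object counter are hypothetical and are not the OPAL macros.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts heap objects that have been created but not yet freed. */
static int toy_live_heap_objects = 0;

typedef struct {
    char *name;   /* resource owned by the object */
} toy_obj_t;

static void toy_obj_construct(toy_obj_t *o)
{
    assert(o != NULL);
    o->name = NULL;
}

/* Frees what the object owns, but NOT the object's own storage. */
static void toy_obj_destruct(toy_obj_t *o)
{
    assert(o != NULL);
    free(o->name);
    o->name = NULL;
}

/* "new": allocate storage on the heap, then construct. */
static toy_obj_t *toy_obj_new(void)
{
    toy_obj_t *o = malloc(sizeof(*o));
    if (o != NULL) {
        toy_obj_construct(o);
        toy_live_heap_objects++;
    }
    return o;
}

/* "release": destruct AND free the storage that "new" allocated. */
static void toy_obj_release(toy_obj_t *o)
{
    assert(o != NULL);
    toy_obj_destruct(o);
    free(o);
    toy_live_heap_objects--;
}
```

After toy_obj_new() followed only by toy_obj_destruct(), the counter still reports one live heap object; only toy_obj_release() brings it back to zero.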

cmr=v1.8.2:reviewer=manjugv

This commit was SVN r31743.
2014-05-13 21:22:12 +00:00
Nathan Hjelm
fc4f932cc2 Fix bug in r31716
Simple bug. The dist_graph pointer must be a constructed object. The
change from malloc to OBJ_NEW was missing from r31716. Tested with MTT
and everything looks ok now.

This commit was SVN r31739.

The following SVN revision numbers were found above:
  r31716 --> open-mpi/ompi@e3df77548d
2014-05-13 17:39:43 +00:00
Nathan Hjelm
a78519a2b2 btl/scif: remove call to pthread_cancel
There is no reason to cancel the listening thread. It should die
automatically when the file descriptor is closed. It is sufficient
to just wait for the thread to exit with pthread_join.

cmr=v1.8.2:ticket=trac:4616:reviewer=jsquyres

This commit was SVN r31738.

The following Trac tickets were found above:
  Ticket 4616 --> https://svn.open-mpi.org/trac/ompi/ticket/4616
2014-05-13 17:29:53 +00:00
Nathan Hjelm
c13c21d476 basesmuma: clean up the setup code and ensure mapped files are unmapped
We were leaking file descriptors when coll/ml was in use. It turns out
this was because basesmuma was failing to unmap files it had previously
mapped. This commit cleans up the setup code to ensure that we only
attempt to map the control files once per module and then ensures the
files are unmapped when the module is released.

cmr=v1.8.2:reviewer=manjugv

This commit was SVN r31737.
2014-05-13 17:00:31 +00:00
Nathan Hjelm
9e3a0d7b7a basesmuma: modify the minimum size for the large fan-in fan-out allreduce
algorithm

Per suggestion from Manju make sure there isn't a gap in the size ranges
for the available algorithms.

cmr=v1.8.2:ticket=trac:4437:reviewer=ompi-rm1.8

This commit was SVN r31728.

The following Trac tickets were found above:
  Ticket 4437 --> https://svn.open-mpi.org/trac/ompi/ticket/4437
2014-05-13 14:56:21 +00:00
Gilles Gouaillardet
209378efec btl/scif: prevent SIGSEGV from occurring when the module is unloaded
Fixes trac:4615

cmr=v1.8.2:reviewer=hjelmn

This commit was SVN r31717.

The following Trac tickets were found above:
  Ticket 4615 --> https://svn.open-mpi.org/trac/ompi/ticket/4615
2014-05-13 10:04:38 +00:00
Gilles Gouaillardet
e3df77548d Fix memory leak when releasing a communicator created by
MPI_Cart_Create/MPI_Graph_create/MPI_Dist_Graph

Fixes trac:4581

This commit was SVN r31716.

The following Trac tickets were found above:
  Ticket 4581 --> https://svn.open-mpi.org/trac/ompi/ticket/4581
2014-05-13 04:49:23 +00:00
Nathan Hjelm
0b8bb2339b btl/scif: update the size when preparing send fragments
Thanks to Gilles Gouaillardet for catching this.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31715.
2014-05-12 22:05:31 +00:00
Jeff Squyres
e37c7af0fb usnic: update cclient/cagent to use unix domain sockets (not RML)
In preparation for moving the BTLs down to OPAL, discontinue the use
of the RML for connectivity client/agent communication.  Instead, use
local unix domain sockets in the job session directory (all
communication is between processes on the same server, so unix domain
sockets are fine).

This commit was SVN r31710.
2014-05-09 20:35:36 +00:00
Jeff Squyres
184e4fc0ca usnic: ensure that procs agree on use_udp value
Add the component use_udp value into the modex.  If my component's
use_udp value doesn't agree with the use_udp value from a peer's modex
data, print a helpful message and disqualify the usnic BTL (the usnic
BTL will not be used).  This prevents accidental customer
misconfigurations.

Reviewed by Dave Goodell

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31689.
2014-05-08 16:43:50 +00:00
Jeff Squyres
e9c3df652e usnic: reduce sizeof(ompi_btl_usnic_addr_t) to 56 bytes
Trivial struct re-ordering to eliminate holes in the middle of the
struct (although there's still a hole at the end) and reduce the
overall size of the struct from 64 to 56 bytes.  Also change mtu from
int to uint16_t; there was no need for it to be that large.
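
The effect of member ordering on struct size can be illustrated with a made-up layout. The two structs below are hypothetical and do not reproduce the fields of ompi_btl_usnic_addr_t; exact sizes depend on the ABI, so the sketch only asserts that the reordered version is smaller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with holes: small members sit between 8-byte
 * members, so the compiler inserts padding to keep them aligned. */
struct toy_addr_holey {
    uint32_t ipv4;
    uint64_t gid;
    uint16_t port;
    uint64_t mac;
    int      mtu;    /* wider than an MTU value needs */
};

/* Same information, largest members first, and mtu narrowed to 16 bits. */
struct toy_addr_packed {
    uint64_t gid;
    uint64_t mac;
    uint32_t ipv4;
    uint16_t port;
    uint16_t mtu;
};

/* Compile-time check that the reordering actually pays off. */
static_assert(sizeof(struct toy_addr_packed) < sizeof(struct toy_addr_holey),
              "reordering should shrink the struct");

/* Number of bytes saved per instance on the current ABI. */
static size_t toy_bytes_saved(void)
{
    return sizeof(struct toy_addr_holey) - sizeof(struct toy_addr_packed);
}
```

On a typical 64-bit ABI the first struct is expected to occupy 40 bytes and the second 24, but only the relative ordering is checked here.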

Reviewed by Dave Goodell

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31688.
2014-05-08 16:38:59 +00:00
Jeff Squyres
a61e4d6425 usnic: fix connectivity checker timeout
Fix mismatch between the MCA param (which expresses the timeout in
*milli*seconds) and the struct timeval timeout (which expresses the
timeout in *micro*seconds).
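
A minimal sketch of the unit conversion involved, assuming a timeout given in milliseconds that has to be stored in a struct timeval. The helper name toy_ms_to_timeval is hypothetical and is not the function used in the usnic BTL.

```c
#include <assert.h>
#include <sys/time.h>

/* Convert a millisecond timeout into a struct timeval, whose tv_usec
 * field counts microseconds (1 ms = 1000 us). */
static struct timeval toy_ms_to_timeval(long timeout_ms)
{
    struct timeval tv;

    assert(timeout_ms >= 0);
    tv.tv_sec  = timeout_ms / 1000;            /* whole seconds */
    tv.tv_usec = (timeout_ms % 1000) * 1000;   /* remainder, in microseconds */
    return tv;
}
```

Storing the millisecond value directly in tv_usec would shorten the timeout by a factor of 1000, which is the kind of mismatch described above.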

Reviewed by Dave Goodell

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31687.
2014-05-08 16:36:07 +00:00
Ralph Castain
ab4f8585b0 When we abort during MPI_Init, we currently emit a totally incorrect error message stating that we were unable to aggregate error messages and cannot guarantee all other processes were killed. This simply isn't true IF the rte has been initialized.
So track that the rte has reached that point, and only emit the new message if it is accurate.

Note that we still generate a TON of output for a minor error:

Ralphs-iMac:examples rhc$ mpirun -n 3 -mca btl sm ./hello_c
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[50239,1],2]) is on host: Ralphs-iMac
  Process 2 ([[50239,1],2]) is on host: Ralphs-iMac
  BTLs attempted: sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[50239,1],2]
  Exit code:    1
--------------------------------------------------------------------------
[Ralphs-iMac.local:23227] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[Ralphs-iMac.local:23227] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[Ralphs-iMac.local:23227] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
Ralphs-iMac:examples rhc$ 

Hopefully, we can agree on a way to reduce this verbiage!

This commit was SVN r31686.

The following SVN revision numbers were found above:
  r2 --> open-mpi/ompi@58fdc18855
2014-05-08 15:48:16 +00:00
Ralph Castain
76f5991ab2 Couple of minor fixes
This commit was SVN r31680.
2014-05-08 02:26:45 +00:00
Ralph Castain
11faab1091 The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
a8e2d6c3a6 The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature:
top_ompi_srcdir  ->  OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR

We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.

Only thing left is ompilibdir being treated similar to what we did for srcdir/builddir. Coming soon.

This commit was SVN r31678.
2014-05-07 21:48:53 +00:00
Ralph Castain
c5d64a22df Fix romio configure to look for updated OMPI support file name
This commit was SVN r31670.
2014-05-07 03:19:45 +00:00
Ralph Castain
fdfb331e13 Per RFC, continue the renaming process
This commit was SVN r31663.
2014-05-06 20:53:55 +00:00
Ralph Castain
2b7a3ae601 Per RFC, continue pecking away at the build system renaming
OMPI_CONFIG_SUBDIR  -> OPAL_CONFIG_SUBDIR
   OMPI_CONFIG_SUBDIR_ARGS  ->  OPAL_CONFIG_SUBDIR_ARGS

This commit was SVN r31647.
2014-05-06 16:27:38 +00:00
Devendar Bureddy
dfaac7d29d Do not call into hcoll progress after MPI_Finalize
Reviewed by Mike
cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31639.
2014-05-05 22:46:39 +00:00
Ralph Castain
29609577d5 Per RFC:
ompi_show_title  -> opal_show_title
    ompi_show_subtitle -> opal_show_subtitle

This commit was SVN r31638.
2014-05-05 22:35:23 +00:00
Ralph Castain
4def94900a Per RFC: OMPI_INSTALL_BINARIES -> OPAL_INSTALL_BINARIES
This commit was SVN r31634.
2014-05-05 21:43:05 +00:00
Jeff Squyres
10e8ab493e btl_usnic_mca.c: Increase default connectivity checker frequency
In abusive MPI communication patterns, sending a UDP ping only once a
second may not be sufficient -- all the UDP pings may be dropped.  So
increase the frequency of the pings to every quarter second, and allow
more total pings to be sent.

Total timeout time is still the same (10 seconds) -- we'll just now
try 40 times (i.e., once every quarter second) as opposed to 10 times
(i.e., once a second).  Testing has shown that this frequency allows
the connectivity checker to always succeed even in the many-to-one
abusive communication patterns.
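
The arithmetic behind the new defaults can be written down as a small helper. The function name and parameters are hypothetical and are not the actual usnic MCA parameter names.

```c
#include <assert.h>

/* Number of pings that fit into a fixed total timeout when one ping is
 * sent every interval_ms milliseconds. */
static int toy_num_pings(int total_timeout_ms, int interval_ms)
{
    assert(total_timeout_ms >= 0);
    assert(interval_ms > 0);
    return total_timeout_ms / interval_ms;
}
```

With a 10-second total timeout, a 1000 ms interval gives 10 attempts and a 250 ms interval gives 40, matching the numbers quoted above.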

cmr=v1.8.2:reviewer=dgoodell

This commit was SVN r31602.
2014-05-02 11:06:18 +00:00
Jeff Squyres
bf82ee2a14 btl_usnic_connectivity.h: fix PACK_BYTES macro
We're passing a char foo[x] into PACK_BYTES, so we don't need to take
its address in the macro.  This is parallel to the UNPACK_BYTES macro
(where we pass a char bar[x] into it, and don't take its address in
the macro).

The value we're packing is only used to output in a show_help message,
which is why this wasn't noticed before (i.e., it's not used in
network or addressing that would have caused a failure).

cmr=v1.8.2:reviewer=dgoodell

This commit was SVN r31594.
2014-05-01 22:23:22 +00:00
Yossi Etigin
6aa5680059 Revert r30966.
cmr=v1.8.1:reviewer=ompi-gk1.8

This commit was SVN r31593.

The following SVN revision numbers were found above:
  r30966 --> open-mpi/ompi@280e96c99a
2014-05-01 22:17:09 +00:00
Ralph Castain
0c74d1fd6f Silence warning
This commit was SVN r31592.
2014-05-01 21:11:39 +00:00