Commit graph

4179 commits

Author SHA1 Message Date
Yossi Etigin
64d98e0438 Fix data corruption in MXM by registering to OPAL memory release hooks and removing any mappings created by mxm
This commit was SVN r28489.
2013-05-14 12:27:44 +00:00
Rolf vandeVaart
9d569f1487 Fix warning when compiling in CUDA aware code.
This commit was SVN r28476.
2013-05-10 21:29:08 +00:00
Nathan Hjelm
422331b4da btl/openib: fix unconnected datagram connection method (udcm)
The primary issue with udcm is that the immediate data in message
acks were often bogus. This caused the sender to keep trying even
though a message was received and acked. The fix is to use the
source LID and QP to determine which message is being acked. In
most cases this should work well since only one message will be
in flight to any peer.

This commit was SVN r28444.
2013-05-03 17:11:38 +00:00
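
The udcm fix in 422331b4da matches an ack to its message by the sender's source LID and QP rather than by the immediate data. A minimal sketch of that idea follows, assuming a small table of in-flight messages. The struct `pending_msg` and the function `match_ack` are hypothetical names for illustration and are not taken from the openib BTL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a message that is waiting for an ack. */
struct pending_msg {
    uint16_t peer_lid;   /* LID of the peer the message was sent to */
    uint32_t peer_qpn;   /* QP number of that peer */
    int in_flight;       /* nonzero until the ack arrives */
};

/* Find the in-flight message whose peer matches the ack's source LID
 * and QP number, mark it as acked, and return its index. Returns -1
 * when no in-flight message matches. */
int match_ack(struct pending_msg *msgs, size_t count,
              uint16_t src_lid, uint32_t src_qpn)
{
    for (size_t i = 0; i < count; ++i) {
        if (msgs[i].in_flight &&
            msgs[i].peer_lid == src_lid &&
            msgs[i].peer_qpn == src_qpn) {
            msgs[i].in_flight = 0;
            return (int)i;
        }
    }
    return -1;
}
```

As the commit message notes, this lookup is unambiguous only while at most one message is in flight to any given peer.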
Jeff Squyres
c8258c06e2 In coll_sm, we alloc a huge chunk of shared memory, divvy it into lots
of individual regions (each region is a multiple of page size in
length), and each process claims its own regions by binding it to its
local memory.  Each process would end up membining something like 16
individual regions in the overall shmem segment.

There were two errors in this code relating to the memory affinity
pinning.  Some combination of these two errors would lead to kernel
panics (!) on my RHEL 6.2 x86_64 machines when used with mmap'ed
shared memory (not posix or sysv shared memory, curiously enough):

1. The shared memory segment is initially divided into two regions:
control and data.  The control starts at the beginning of the shmem
segment, the data starts after that.  The data portion, unfortunately,
was ''not'' aligned to a page.  So all the multiple-of-page-size
regions that we divvy up were also not aligned on page boundaries.  And
therefore all the regions we tried to membind were not on page
boundaries.

The solution was to ensure that the data portion started on a page
boundary.  Then all of the individual regions were on page boundaries,
too.

That being said, in my tests, Linux mbind() fails gracefully when the
address is not on a page boundary.  So I'm not sure how this worked at
all / led to a kernel panic...

2. There was some bad pointer math that resulted in membinding regions
larger than they should have been, resulting in region overlaps.
There were definitely overlaps between regions in the same process;
it's likely that there were overlaps between regions of multiple
processes, too -- I'm not sure (and don't care to figure out :-) ).

The solution was to fix the pointer math so that each region membinds
exactly itself and no neighboring/overlapping regions.

cmr:v1.7.2:reviewer=samuel

This commit was SVN r28442.
2013-05-03 12:49:35 +00:00
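
The two fixes described in c8258c06e2 (start the data portion on a page boundary, and bind each region to exactly its own length) can be sketched roughly as below. This is an illustration under assumed names, not the coll_sm code: `align_up` and `region_start` are hypothetical helpers, and the layout (control area, padding, then equally sized regions) is a simplification of what the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr up to the next multiple of page_size.
 * page_size must be a nonzero power of two. */
uintptr_t align_up(uintptr_t addr, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr + page_size - 1) & ~((uintptr_t)page_size - 1);
}

/* Assumed layout: [control area][padding][region 0][region 1]...
 * Returns the start address of region `index`. Each region covers
 * exactly [start, start + region_len), so neighboring regions never
 * overlap, and every start lies on a page boundary because the data
 * base is page aligned and region_len is a multiple of the page size. */
uintptr_t region_start(uintptr_t seg_base, size_t control_len,
                       size_t page_size, size_t region_len, size_t index)
{
    assert(region_len % page_size == 0);
    uintptr_t data_base = align_up(seg_base + control_len, page_size);
    return data_base + index * region_len;
}
```

A caller would then pass `region_start(...)` and `region_len` (and nothing larger) to whatever memory-binding call it uses.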
Alex Mikheev
9e2fdc7d56 - correction of r28440
This commit was SVN r28441.

The following SVN revision numbers were found above:
  r28440 --> open-mpi/ompi@93ce233530
2013-05-02 12:52:58 +00:00
Alex Mikheev
93ce233530 - btl_openib: changed default SRQ settings:
- increase number of wqe to minimize number of RNRs
    - it is better to have high watermark and post relatively small number of wqes
    - increased TX queue size

This commit was SVN r28440.
2013-05-02 12:46:35 +00:00
Alex Mikheev
f76680fbd0 - btl_openib: fix total registered memory calculation for ConnectIB and Ofed 2.0
This commit was SVN r28432.
2013-05-01 13:39:29 +00:00
Jeff Squyres
d92a8e01f8 Use the _SAFE list traversal macro so that we can remove each item
from the list (just for good measure), and then free() it (without
using _SAFE, we were accessing memory that was just free()'d to get to
the next item).  Also be a little more thorough -- DESTRUCT the list
when we're all done.

This commit was SVN r28429.
2013-05-01 12:26:16 +00:00
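
The bug fixed in d92a8e01f8 is the classic use-after-free during list traversal: reading an item's next pointer after the item was already freed. The sketch below shows the safe pattern on a plain singly linked list. It does not use the OPAL list macros; `struct node`, `list_push`, and `list_free_all` are hypothetical names chosen for this example.

```c
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Push a new node onto the front of the list. Returns the new head,
 * or the unchanged head if allocation fails. */
struct node *list_push(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));
    if (NULL == n) {
        return head;
    }
    n->value = value;
    n->next = head;
    return n;
}

/* Free every node and reset the head. The next pointer is saved
 * BEFORE the current node is freed, which is the point of a "safe"
 * traversal. Returns the number of nodes freed. */
size_t list_free_all(struct node **head)
{
    size_t freed = 0;
    struct node *cur = *head;
    while (NULL != cur) {
        struct node *next = cur->next;  /* read before free() */
        free(cur);
        cur = next;
        ++freed;
    }
    *head = NULL;
    return freed;
}
```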
George Bosilca
8b0335380a Fix the error messages to reference the correct function.
This commit was SVN r28425.
2013-04-30 23:26:03 +00:00
George Bosilca
6a75c84fa8 Remove useless define.
This commit was SVN r28424.
2013-04-30 23:24:59 +00:00
Ralph Castain
9de82aba55 Revert r28417 - given the non-standard way vprotocol is implemented, I see no way to use the framework verbosity here. Best to just leave it alone as those who use it know what they need to do to get debug output
This commit was SVN r28418.

The following SVN revision numbers were found above:
  r28417 --> open-mpi/ompi@b00de5be8b
2013-04-30 16:37:17 +00:00
Nathan Hjelm
b00de5be8b vprotocol: remove the old output and use the framework output
This commit was SVN r28417.
2013-04-30 15:21:42 +00:00
Ralph Castain
ceb4061214 Fix BTL_VERBOSE - when the MCA param change was committed, it left the base verbosity variable declared so things compiled. Sadly, the verbosity was now being set to a new variable, so debug never was output.
This commit was SVN r28414.
2013-04-30 01:15:52 +00:00
Nathan Hjelm
f384263de7 btl/openib: fix typo
This commit was SVN r28413.
2013-04-29 22:21:25 +00:00
Ralph Castain
5d7a93c032 Add the ability to use an external version of libevent. Clearly not recommended at this time. I've verified that it works in limited scenarios, but more thorough testing and performance impacts need to be assessed.
Interesting how many includes had to be fixed here and there to fill in missing dependencies :-)

This commit was SVN r28411.
2013-04-29 17:02:37 +00:00
Ralph Castain
8996ecb128 Add missing include
This commit was SVN r28405.
2013-04-27 00:09:36 +00:00
Jeff Squyres
f55cea1a5b If there are no BTLs, do ''not'' actually shut down the fd listener,
because a) it may still be needed to shut down the CPCs, and b) it
will be shut down during component_close().

This commit was SVN r28402.
2013-04-26 15:31:50 +00:00
Jeff Squyres
99b7a0f20d Remove unused variables.
This commit was SVN r28401.
2013-04-26 15:29:42 +00:00
Vishwanath Venkatesan
c902624b59 Using ompi_type_destroy to free ompi_datatype. This had to be updated in all the collective algorithms.
Hopefully this will fix all warnings.

This commit was SVN r28385.
2013-04-24 19:27:26 +00:00
Nathan Hjelm
2edff7f784 btl/openib: don't free string handled by MCA variable system
This commit was SVN r28383.
2013-04-24 18:59:18 +00:00
Alex Margolin
aebd794bf6 Fixed macro definition order in MXM component headers
This commit was SVN r28378.
2013-04-24 16:51:43 +00:00
Vishwanath Venkatesan
bba4a93f63 Got this wrong while replacing MPI function with OMPI functions. Fixed it now.
This commit was SVN r28350.
2013-04-22 19:58:25 +00:00
Rolf vandeVaart
5e1dde419c Fix some compile errors in CUDA-aware code that have crept in.
This commit was SVN r28346.
2013-04-18 15:34:16 +00:00
Vishwanath Venkatesan
53753622d4 Changing some of the MPI_ functions to ompi_ equivalents.
This commit was SVN r28342.
2013-04-17 21:06:36 +00:00
Alex Margolin
0ab7675019 Fix MXM connection establishment flow
This commit was SVN r28329.
2013-04-12 16:37:42 +00:00
Steve Wise
134baaf2fa Add Chelsio T5 device. This fixes trac:3552 and should be added to cmr:v1.6:reviewer=jsquyres and cmr:v1.7:reviewer=jsquyres
This commit was SVN r28327.

The following Trac tickets were found above:
  Ticket 3552 --> https://svn.open-mpi.org/trac/ompi/ticket/3552
2013-04-11 19:30:53 +00:00
George Bosilca
2d33c9ee39 Stop complaining about an overwritten default parameter.
This commit was SVN r28322.
2013-04-10 19:44:37 +00:00
Jeff Squyres
8405975bf6 Be a little more conservative about initializing devices and modules
(i.e., ensure that more data items get zeroed out/set to NULL) so that
if something goes wrong during initialization, we don't try to clean
up something that isn't there (and segv).

The chance of this happening on the trunk is very low (and will also
be low once the verbs improvements are brought over to v1.7).  But it
can actually happen in the v1.6 branch (e.g., if no CPC is available,
we'll try to get the length of the endpoints list, but the endpoints
list is NULL).  

Hence, even though the real goal is to get this functionality over to
v1.6, I figured I'd commit to the trunk/CMR to v1.7 just to try to
keep commonality in the openib between all three where possible.

This commit was SVN r28317.
2013-04-09 21:55:31 +00:00
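
The defensive initialization described in 8405975bf6 can be illustrated with a small sketch: set every field to a known empty value up front so that cleanup after a failed initialization never touches something that was never created. The struct `example_device` and its functions are hypothetical and far simpler than the real openib device and module structures.

```c
#include <stdlib.h>

/* Hypothetical device record with resources that may or may not
 * have been allocated by the time cleanup runs. */
struct example_device {
    int *endpoints;
    size_t num_endpoints;
    char *name;
};

/* Put the device into a known empty state before any allocation. */
void device_init(struct example_device *dev)
{
    dev->endpoints = NULL;
    dev->num_endpoints = 0;
    dev->name = NULL;
}

/* Safe to call even if the endpoints list was never created. */
size_t device_endpoint_count(const struct example_device *dev)
{
    return (NULL == dev->endpoints) ? 0 : dev->num_endpoints;
}

/* Safe to call at any point after device_init(): free(NULL) is a
 * no-op, so partially initialized devices are handled as well. */
void device_cleanup(struct example_device *dev)
{
    free(dev->endpoints);
    free(dev->name);
    device_init(dev);
}
```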
Ralph Castain
45af6cf59e The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem.
This commit was SVN r28308.
2013-04-08 23:34:16 +00:00
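
The "memcpy trick" mentioned in 45af6cf59e avoids dereferencing a pointer that may not be suitably aligned for the type, which can fault on strict-alignment platforms such as SPARC. A minimal sketch follows. The type `example_identifier_t` is a hypothetical stand-in (assumed here to be 64 bits wide) and is not the actual definition of `opal_identifier_t`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an opaque identifier type. */
typedef uint64_t example_identifier_t;

/* Read an identifier from an address that may be misaligned.
 * memcpy copies byte by byte as far as the language is concerned,
 * so no aligned load of the wide type is required at src. */
example_identifier_t load_identifier(const void *src)
{
    example_identifier_t id;
    memcpy(&id, src, sizeof(id));
    return id;
}

/* Write an identifier to an address that may be misaligned. */
void store_identifier(void *dst, example_identifier_t id)
{
    memcpy(dst, &id, sizeof(id));
}
```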
Nathan Hjelm
4e95d691a7 pml/ob1: do not reset the convertor if one was not created (size = 0).
This macro is only used on the failure path so the additional if statement
should not have any effect on performance.

cmr:v1.7

This commit was SVN r28292.
2013-04-05 01:40:11 +00:00
Pavel Shamis
fed6e60131 Fixing OpenIB BTL compilation failure for cases when
BTL_OPENIB_MALLOC_HOOKS_ENABLED is disabled.

This commit was SVN r28290.
2013-04-04 20:17:18 +00:00
Pavel Shamis
aa1f5697b4 In order to prevent name conflicts in XRC (MOFED) enabled mode
OFACM's ib_address_t was renamed to ofacm_ib_address_t

This commit was SVN r28289.
2013-04-04 20:02:17 +00:00
Nathan Hjelm
e8d9944456 sbgp/ibnet: fix param -> var update errors
This commit was SVN r28284.
2013-04-03 20:17:18 +00:00
Nathan Hjelm
75093155ab bcol/iboffload: fix still more errors from param -> var updates
This commit was SVN r28283.
2013-04-03 19:57:03 +00:00
Nathan Hjelm
47a1897710 bcol/iboffload: fix more errors from param -> var updates
This commit was SVN r28281.
2013-04-03 18:55:46 +00:00
Nathan Hjelm
31a498c2a1 bcol/iboffload: fix errors from param -> var updates
This commit was SVN r28280.
2013-04-03 18:33:19 +00:00
Ralph Castain
66f3a81488 Cleanup warnings found when building v1.7
cmr:v1.7

This commit was SVN r28279.
2013-04-03 17:37:02 +00:00
Vishwanath Venkatesan
74c418b860 Adding typecasting with intptr_t to remove warnings.
This commit was SVN r28278.
2013-04-03 17:07:43 +00:00
Vishwanath Venkatesan
784337aab1 typecasting with intptr_t to remove warnings
This commit was SVN r28276.
2013-04-03 17:06:02 +00:00
Jeff Squyres
64d39a4e97 Technically speaking, we're creating a QP with 1 send WQE and 1
receive WQE, so it's good form to have a CQ with 2 entries, not 1.

This commit was SVN r28256.
2013-03-28 13:11:31 +00:00
George Bosilca
9c6374b515 Swap the open and register.
This commit was SVN r28253.
2013-03-27 22:19:57 +00:00
Nathan Hjelm
f1fa290157 btl/vader: add missing return statement
This commit was SVN r28252.
2013-03-27 22:16:21 +00:00
Nathan Hjelm
113fadd749 btl/vader: do not use common/sm for shared memory fragments
This commit was SVN r28250.
2013-03-27 22:10:02 +00:00
Nathan Hjelm
9d4a26f47d Update OMPI frameworks to use the MCA framework system.
Notes:
  - This commit also eliminates the need for an available components list in use
    in several frameworks. None of the code in question was making use of the
    priority field of the priority component list item so these extra lists were
    removed.
  - Cleaned up selection code in several frameworks to sort lists using opal_list_sort.
  - Cleans up the ompi/orte-info functions. Expose the functions that construct the
    list of params so they can be used elsewhere.

patches for mtl/portals4 from brian

missed a few output variables in openib

This commit was SVN r28241.
2013-03-27 21:17:31 +00:00
Nathan Hjelm
c041156f60 Update ORTE frameworks to use the MCA framework system.
This commit was SVN r28240.
2013-03-27 21:14:43 +00:00
Nathan Hjelm
cf377db823 MCA/base: Add new MCA variable system
Features:
 - Support for an override parameter file (openmpi-mca-param-override.conf).
   Variable values in this file cannot be overridden by any file or environment
   value.
 - Support for boolean, unsigned, and unsigned long long variables.
 - Support for true/false values.
 - Support for enumerations on integer variables.
 - Support for MPIT scope, verbosity, and binding.
 - Support for command line source.
 - Support for setting variable source via the environment using
   OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
 - Cleaner API.
 - Support for variable groups (equivalent to MPIT categories).

Notes:
 - Variables must be created with a backing store (char **, int *, or bool *)
   that must live at least as long as the variable.
 - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE flag enables the use of
   mca_base_var_set_value() to change the value.
 - String values are duplicated when the variable is registered. It is up to
   the caller to free the original value if necessary. The new value will be
   freed by the mca_base_var system and must not be freed by the user.
 - Variables with constant scope may not be settable.
 - Variable groups (and all associated variables) are deregistered when the
   component is closed or the component repository item is freed. This
   prevents a segmentation fault from accessing a variable after its component
   is unloaded.
 - After some discussion we decided we should remove the automatic registration
   of component priority variables. Few components actually made use of this
   feature.
 - The enumerator interface was updated to be general enough to handle
   future uses of the interface.
 - The code to generate ompi_info output has been moved into the MCA variable
   system. See mca_base_var_dump().

opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system

This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.

This commit was SVN r28236.
2013-03-27 21:09:41 +00:00
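
The ownership rules for string variables listed in the notes of cf377db823 (the backing store must outlive the variable, the value is duplicated at registration, the caller frees the original, and the system frees the copy) can be modeled with a toy registry. Everything below is an assumption made for illustration: `example_string_var`, `example_var_register`, and `example_var_deregister` are hypothetical names and do not reflect the real mca_base_var API or its signatures.

```c
#include <stdlib.h>
#include <string.h>

/* Toy variable record: it only remembers where the caller's
 * backing store lives. The backing store must outlive the variable. */
struct example_string_var {
    char **storage;
};

/* Register a string variable. The current value in *storage is
 * duplicated and *storage is pointed at the copy, which the registry
 * now owns. The caller still owns (and may free) the original.
 * Returns 0 on success, -1 if the copy cannot be allocated. */
int example_var_register(struct example_string_var *var, char **storage)
{
    char *copy = NULL;
    if (NULL != *storage) {
        size_t len = strlen(*storage) + 1;
        copy = malloc(len);
        if (NULL == copy) {
            return -1;
        }
        memcpy(copy, *storage, len);
    }
    var->storage = storage;
    *storage = copy;
    return 0;
}

/* Deregister: the registry frees its own copy. The caller must not
 * free the value stored in the backing store. */
void example_var_deregister(struct example_string_var *var)
{
    free(*var->storage);
    *var->storage = NULL;
    var->storage = NULL;
}
```

Under this model, a component that frees the registered value itself would cause a double free at deregistration, which is the kind of mistake the later commit 2edff7f784 appears to address.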
Ralph Castain
317915225c Finish the binding cleanup by removing the no-longer-used binding level scheme. This proved to be fallible as there is no guarantee that the hierarchy it used matched physical reality of the machine (e.g., is L3 "above" the socket or not). Still have to complete the ppr update, but get the rest of it correct.
This commit was SVN r28223.
2013-03-26 20:09:49 +00:00
Jeff Squyres
44e371a65d Remove (bogus) port number from the opal_output -- there's no port
number associated with creating a QP.

This commit was SVN r28222.
2013-03-26 19:48:50 +00:00
Vishwanath Venkatesan
e092cc34e0 Fixing the read all bugs discovered by Coverity
This commit was SVN r28189.
2013-03-20 20:27:09 +00:00
Samuel Gutierrez
8ce2041102 Cleanup in error path. Fixes CID 967211. Thanks, Jeff.
This commit was SVN r28183.
2013-03-19 20:00:08 +00:00