
1956 Commits

Author SHA1 Message Date
Nathan Hjelm
3262080391 Cleanup udcm structures to avoid issues with nesting structures with
flexible members.

UDCM is ready to go for 1.7.4 with this patch.

cmr=v1.7.4:ticket=3940

This commit was SVN r29861.

The following Trac tickets were found above:
  Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940
2013-12-12 05:24:37 +00:00
Nathan Hjelm
e0e94a6029 Fix warning caused by typo in r29815
This commit was SVN r29860.

The following SVN revision numbers were found above:
  r29815 --> open-mpi/ompi@d556b60b21
2013-12-11 21:45:39 +00:00
Nathan Hjelm
6ab69c758b Fix warnings in udcm.
cmr=v1.7.4:reviewer=rhc:ticket=3940

This commit was SVN r29859.

The following Trac tickets were found above:
  Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940
2013-12-11 21:40:06 +00:00
Rolf vandeVaart
3ae88f8a24 Ensure no fork support with GDR. CUDA-aware code only.
This commit was SVN r29854.
2013-12-10 18:08:53 +00:00
Rolf vandeVaart
1cc55f305f Add extra check for GDR. Adjust some names and replace opal_output with opal_show_help.
This commit was SVN r29853.
2013-12-10 16:04:08 +00:00
Jeff Squyres
0f61bb651e Technically, the PORT_ACTIVE is not a "bad" event.
Note that this event should never happen within a single OMPI job,
because OMPI will ignore usnic ports that are down.  The PORT_ACTIVE
event should only occur if a port ''was'' down and is now ''up''.  But
what the heck -- if we ever do get this event, it is harmless -- just
ignore it.

This commit was SVN r29852.
2013-12-09 20:45:55 +00:00
Jeff Squyres
3bd9c603ff Clean up variables used in configure with OPAL_VAR_SCOPE.
This is helpful in the work for #3694: ensure that many places that
eventually end up in configure don't overly-pollute the global shell
variable space (because debugging accidental shell variable pollution
can be a real pain).

Refs trac:3694

This commit was SVN r29830.

The following Trac tickets were found above:
  Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
2013-12-06 23:40:34 +00:00
Rolf vandeVaart
d556b60b21 Change some CUDA configure code and macro names per review request by jsquyres in ticket #3880.
Functionally, nothing changes.

This commit was SVN r29815.
2013-12-06 14:35:10 +00:00
Dave Goodell
da26226e3c usnic: add some extra debug-build sanity checks
On the off chance that the PML is twiddling fields that it really
shouldn't be...

Reviewed-by: Reese Faucette <rfaucett@cisco.com>

This commit was SVN r29804.
2013-12-05 00:28:11 +00:00
Jeff Squyres
ba018b3603 Protect the container_of #define.
MOFED apparently has a /usr/include/infiniband/verbs.h that also
defines a (slightly different but fully compatible) container_of
macro.  So put proper #ifndef protection around our definition of
container_of.
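
A minimal sketch of the kind of `#ifndef` guard described above. The macro body is a common `offsetof`-based definition and is an assumption, not necessarily the exact Open MPI or MOFED definition; `struct link`, `struct item`, and `item_from_link` are hypothetical names used only to exercise the macro.

```c
#include <stddef.h>

/* Only define container_of if no other header (for example a verbs.h
 * that ships its own version) has already defined it. */
#ifndef container_of
#define container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))
#endif

/* Hypothetical types used only to exercise the macro. */
struct link {
    struct link *next;
};

struct item {
    int id;
    struct link link;
};

/* Recover the enclosing item from a pointer to its embedded link. */
static struct item *item_from_link(struct link *l)
{
    return container_of(l, struct item, link);
}
```

With the guard in place, whichever definition the preprocessor sees first wins, which is safe only as long as the two definitions are compatible, as the message says they are.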

Thanks to Rolf vandeVaart for pointing out the issue.

Reviewed by Dave Goodell.

cmr=v1.7.4:reviewer=ompi-rm1.7

This commit was SVN r29799.
2013-12-04 14:24:56 +00:00
Jeff Squyres
c74c1e86d3 Per suggestion from Paul Kapinos, report in BTL verbosity if a device
is skipped because it is too far away.

(see thread starting here:
http://www.open-mpi.org/community/lists/devel/2013/06/12470.php)

This commit was SVN r29790.
2013-12-03 22:44:11 +00:00
Rolf vandeVaart
ab77435d9b Fix the CUDA-aware case where we are not sending any GPU data.
This commit was SVN r29788.
2013-12-03 20:25:58 +00:00
Nathan Hjelm
fe327d9859 udcm: cleanup code and improve the ack handling
Originally udcm acks used the immediate data to indicate which message was
being acknowledged. This data was (mysteriously) junk when using QLogic HCAs so I
updated udcm to use the source info (slid, qp, etc) to determine which message was being
acked. This works as long as we don't have two messages simultaneously in flight
to a particular peer and then lose the first of the two messages. The chances of this
happening are tiny. To fix this case I updated the udcm message header to include
a pointer to the in flight message. This pointer is then sent back to the sending
process to ack receipt.
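
The pointer-echo scheme in the last two sentences can be sketched as follows. This is an illustration under stated assumptions, not the udcm code: `struct pending_msg`, `struct msg_hdr`, and the three helper functions are hypothetical, and a 64-bit header field is assumed to be wide enough to carry a pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-flight message tracked by the sender. */
struct pending_msg {
    int acked;
};

/* Hypothetical wire header: the sender stores the address of its own
 * pending message; the receiver treats it as an opaque cookie. */
struct msg_hdr {
    uint64_t sender_ctx;
};

/* Sender side: stamp the outgoing header with the message address. */
static void hdr_set_ctx(struct msg_hdr *hdr, struct pending_msg *msg)
{
    hdr->sender_ctx = (uint64_t) (uintptr_t) msg;
}

/* Receiver side: build the ack by copying the cookie back unchanged. */
static struct msg_hdr make_ack(const struct msg_hdr *received)
{
    struct msg_hdr ack = { received->sender_ctx };
    return ack;
}

/* Sender side: the cookie identifies exactly one in-flight message,
 * even when several messages to the same peer are outstanding. */
static void handle_ack(const struct msg_hdr *ack)
{
    struct pending_msg *msg =
        (struct pending_msg *) (uintptr_t) ack->sender_ctx;
    msg->acked = 1;
}
```

A real implementation would presumably also check that the echoed pointer still refers to a live message before dereferencing it; that check is omitted here.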

cmr=v1.7.4:ticket=trac:3940

This commit was SVN r29775.

The following Trac tickets were found above:
  Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940
2013-12-02 20:18:46 +00:00
Nathan Hjelm
fb0b0442c4 openib/connect: re-enable xrc support in the openib btl
This commit updates the udcm cpc to support xrc. The steps followed by udcm
mimic those in the removed xoob cpc. This update has been tested with both XRC
and RC.

Mellanox, this is intended to go into 1.7.4. Please review carefully and let
me know if there are any issues.

cmr=v1.7.4:reviewer=miked

This commit was SVN r29767.
2013-11-27 22:28:04 +00:00
Devendar Bureddy
4a311ae9fd Continue searching the sorted openib device list if no btls are found with the nearest HCA.
cmr=v1.7.4:reviewer=jladd

This commit was SVN r29725.
2013-11-20 22:23:12 +00:00
Nathan Hjelm
24a7e7aa34 Add support for the udreg registration cache and dynamics on XE/XK/XC.
To support the new mpool two changes were made to the mpool infrastructure:

 1) Added an mpool flag to indicate that an mpool does not need the memory
    hooks to use the leave pinned protocols. This flag is checked in the
    mpool lookup.

 2) Add a mpool context to the base registration. This new member is used
    by the udreg mpool to store the udreg context associated with the
    particular registration. The new member will not break the ABI
    compatibility as the new member is only currently used by the udreg
    mpool.

Dynamics support for Cray systems makes use of the global rank provided by
orte to give the ugni library a unique rank for each process. Dynamics
support is not available under direct-launch (srun.)

cmr=v1.7.4

This commit was SVN r29719.
2013-11-18 04:58:37 +00:00
Jeff Squyres
5206e877be Help decrease conflicts between SVN trunk and Cisco git branch of OMPI v1.6 branch
This commit was SVN r29715.
2013-11-15 21:35:56 +00:00
Jeff Squyres
e6ed7c9f4d Avoid trivial "don't mix declarations and code" compiler warning
This commit was SVN r29714.
2013-11-15 21:31:10 +00:00
Ralph Castain
22e30a680d Given that the oob and xoob cpc's are no longer operable and haven't been since the OOB update, remove them to avoid confusion
cmr:v1.7.4:reviewer=hjelmn:subject=Remove stale cpcs from openib

This commit was SVN r29703.
2013-11-14 04:16:53 +00:00
Rolf vandeVaart
4964a5e98b Per this RFC from October 8, 2013 and as discussed in telecon.
http://www.open-mpi.org/community/lists/devel/2013/10/13072.php

Add support for pinning GPU Direct RDMA in openib BTL for better small message latency of GPU buffers. 
Note that none of this is compiled in unless CUDA-aware support is requested.

This commit was SVN r29680.
2013-11-13 13:22:39 +00:00
Jeff Squyres
98ff91cfeb Refs trac:3091
Gah!  The "device" variable isn't used at all in this loop (my eye
glossed over the next line and thought that "device" was used in the
free() statement, but it's actually "devices" -- not "device").

This commit was SVN r29665.

The following Trac tickets were found above:
  Ticket 3091 --> https://svn.open-mpi.org/trac/ompi/ticket/3091
2013-11-12 23:01:04 +00:00
Jeff Squyres
7cb31111a6 Refs trac:3901
Feedback from Dave's review.

This commit was SVN r29664.

The following Trac tickets were found above:
  Ticket 3901 --> https://svn.open-mpi.org/trac/ompi/ticket/3901
2013-11-12 22:51:20 +00:00
Jeff Squyres
5a940f5ee7 Arrgh -- remove debugging printf.
This commit was SVN r29657.
2013-11-11 22:44:28 +00:00
Jeff Squyres
e20217eccc Expand the "btl_usnic" MPI_T enumeration to have strings of the form:
<usnic device name>,<eth device>,<ip address>/<CIDR prefix>

For example:

   usnic_0,eth4,10.1.0.15/16

This is just handy for mapping the usnic_X device back to the IP
network to which it corresponds.
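
As a rough illustration of the string format above, the following sketch builds one such name with `snprintf`. The function name `format_usnic_enum_name` and its parameters are hypothetical and are not taken from the usnic BTL source.

```c
#include <stdio.h>

/* Format "<usnic device>,<eth device>,<ip address>/<CIDR prefix>".
 * Returns the snprintf result, i.e. the length the full string needs
 * (excluding the terminating NUL), so callers can detect truncation. */
static int format_usnic_enum_name(char *buf, size_t len,
                                  const char *usnic_dev,
                                  const char *eth_dev,
                                  const char *ip_addr,
                                  unsigned cidr_prefix)
{
    return snprintf(buf, len, "%s,%s,%s/%u",
                    usnic_dev, eth_dev, ip_addr, cidr_prefix);
}
```

Calling it with `"usnic_0"`, `"eth4"`, `"10.1.0.15"`, and `16` produces the example string shown in the message.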

This commit was SVN r29656.
2013-11-11 22:25:30 +00:00
Nathan Hjelm
3d3c29ae96 btl/scif: do not return resource busy if we started a connection attempt.
Resolves a hang when using scif for shared memory transfers. This is a
simple change and doesn't require a review.

cmr=v1.7.4:reviewer=ompi-rm1.7

This commit was SVN r29653.
2013-11-11 19:36:34 +00:00
Rolf vandeVaart
3290cde630 Various minor changes to bring smcuda up to date with sm.
This commit was SVN r29639.
2013-11-07 19:45:56 +00:00
Dave Goodell
82db913490 usnic: fix module_recv_buffers perf regression
Cisco v1.6 git commit 913ec6c and upstream trunk r29593 (segfault fix)
introduced a performance regression by inadvertently disabling the
`module_recv_buffers` functionality.  With those changes in place, the
`btl_usnic_recv.c` logic would end up mallocing a buffer that should
have otherwise come from a `module_recv_buffers` pool.  It also resulted
in a small, bounded memory leak (128 buffers at each power-of-two size
interval).

The new version just places the buffer after the free list item with a
flexible array member.  I bumped the pool to allocate all 128 elements
up front because the deferred allocation was modestly impacting IMB
Sendrecv performance at a few sizes.
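
The layout described here, a buffer placed directly after a list item by means of a flexible array member, can be sketched like this. The types `struct list_item` and `struct recv_buf` and the function `recv_buf_alloc` are hypothetical stand-ins, not the actual usnic or free list definitions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a free list item header. */
struct list_item {
    struct list_item *next;
};

/* The payload lives directly after the header in the same allocation,
 * via a C99 flexible array member (which must be the last member). */
struct recv_buf {
    struct list_item super;
    size_t len;
    unsigned char buf[];
};

/* Allocate header and payload with one malloc. Returns NULL on failure. */
static struct recv_buf *recv_buf_alloc(size_t len)
{
    struct recv_buf *rb = malloc(sizeof(*rb) + len);
    if (NULL == rb) {
        return NULL;
    }
    rb->super.next = NULL;
    rb->len = len;
    memset(rb->buf, 0, len);
    return rb;
}
```

Because header and payload share one allocation, a single `free()` releases both, so there is no separately malloced buffer left to leak. Note that standard C does not allow a struct with a flexible array member to be embedded as a member of another struct.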

Reviewed-by: Reese Faucette <rfaucett@cisco.com>

This commit was SVN r29631.

The following SVN revision numbers were found above:
  r29593 --> open-mpi/ompi@1ed9b8ff43
2013-11-07 01:27:31 +00:00
Jeff Squyres
e28261898d Per discussion on the devel list, rename the btl_usnic_devices MPI_T
state pvar to be btl_usnic (i.e., the best suggestion so far).

See http://www.open-mpi.org/community/lists/devel/2013/11/13188.php
for more detail.

This commit was SVN r29614.
2013-11-06 06:19:03 +00:00
Rolf vandeVaart
e57795f097 Revert r29594. That was just plain wrong. Sorry about workday configure change.
This commit was SVN r29605.

The following SVN revision numbers were found above:
  r29594 --> open-mpi/ompi@ed7ddcd9c7
2013-11-05 14:45:56 +00:00
Rolf vandeVaart
ed7ddcd9c7 Fix CUDA-aware compile error introduced with r29581.
This commit was SVN r29594.

The following SVN revision numbers were found above:
  r29581 --> open-mpi/ompi@ee7510b025
2013-11-05 00:08:33 +00:00
Dave Goodell
1ed9b8ff43 usnic: fix segfault at finalize time
Without this commit, if you run IMB pingpong between two nodes with only
one usnic selected (e.g., via `--mca btl_usnic_if_include usnic_0`) then
the run will seem fine but will segfault at MPI_Finalize time.

This behavior has happened since Cisco v1.6 git commit ec7ddf8, upstream
trunk r29484, and upstream v1.7 r29507.

Root cause was that the free list element was being used as the recv
buffer instead of the data buffer associated with the element.  So the
reassembly code would stomp all over the free list element, which would
cause the destructor to explode when the free list attempted to clean up
all of its elements.  This surprisingly did not cause any other problems
until now.

Reviewed-by: Reese Faucette <rfaucett@cisco.com>

This commit was SVN r29593.

The following SVN revision numbers were found above:
  r29484 --> open-mpi/ompi@a6ed232a10
  r29507 --> open-mpi/ompi@790d269ce8
2013-11-04 22:52:14 +00:00
Dave Goodell
73a943492c usnic: pack via convertor on the fly
If we need to use a convertor, go back to stashing that convertor in the
frag and populating segments "on the fly" (in
ompi_btl_usnic_module_progress_sends).  Previously we would pack into a
chain of chunk segments at prepare_src time, unnecessarily consuming
additional memory.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>

This commit was SVN r29592.
2013-11-04 22:52:03 +00:00
Dave Goodell
71d0d73575 usnic: refactor callback invocation
This makes it a little easier to see what's happening with callbacks to
the PML.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>

This commit was SVN r29591.
2013-11-04 22:51:48 +00:00
Dave Goodell
4c791e21d2 usnic: add MSGDEBUG1_OUT/MSGDEBUG2_OUT macros
This includes suppressing picky-mode warnings about __VA_ARGS__, which
we know are supported by any compilers we care about.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>

This commit was SVN r29590.
2013-11-04 22:51:35 +00:00
Dave Goodell
825686a205 usnic: certain send frag members are immutable
Ensure that they never are touched by checking in their destructors.
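
A minimal sketch of the destructor-time check described above, assuming a debug build where `assert()` is active. The enum, the struct, and both function names are hypothetical and only mirror the idea, not the usnic frag classes themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fragment types. */
enum frag_type { FRAG_SMALL_SEND = 1, FRAG_LARGE_SEND = 2 };

struct send_frag {
    enum frag_type type;   /* immutable: set once by the constructor */
    size_t bytes_left;     /* mutable: updated as data is sent */
};

static void small_send_frag_construct(struct send_frag *frag)
{
    frag->type = FRAG_SMALL_SEND;
    frag->bytes_left = 0;
}

/* In a debug build, the destructor verifies that nobody overwrote the
 * immutable field during the fragment's lifetime. With NDEBUG defined,
 * assert() compiles away and the check costs nothing. */
static void small_send_frag_destruct(struct send_frag *frag)
{
    assert(FRAG_SMALL_SEND == frag->type);
    frag->bytes_left = 0;
}
```

Checking at destruction time catches a stray write late, but it costs only one comparison per fragment and needs no instrumentation at the sites that use the fragment.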

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>

This commit was SVN r29589.
2013-11-04 22:51:24 +00:00
Rolf vandeVaart
ee7510b025 Remove redundant macro. This came from the review of an earlier ticket.
Fixes trac:3878.  Reviewed by jsquyres.

This commit was SVN r29581.

The following Trac tickets were found above:
  Ticket 3878 --> https://svn.open-mpi.org/trac/ompi/ticket/3878
2013-11-01 12:19:40 +00:00
Mike Dubman
b0e64427a9 ompi/mca/btl/openib: Fix memory leak and accessing free'd memory issues
Imagine that we have two btls in btl_openib_component_init() that both point to the same openib_btl->device and, as a result, share the same openib_btl->device->endpoints array.

The finalization phase calls mca_btl_openib_finalize()->mca_btl_openib_finalize_resources() twice.
mca_btl_openib_finalize_resources() frees the endpoints related to the btl, but the second call of mca_btl_openib_finalize_resources() checks an endpoint that was released by the previous call.
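
The message does not say how the problem was fixed. Purely as an illustration of the failure mode, the sketch below shows one common way to make a finalize routine safe when several modules share a device: count the users and release the shared array only once. All names (`struct device`, `struct module`, `module_finalize`) are hypothetical, and this is not claimed to be the change made in this commit.

```c
#include <stdlib.h>

/* Hypothetical device shared by several modules. */
struct device {
    int   refcount;    /* number of modules still using the device */
    void *endpoints;   /* shared endpoint array */
    int   frees;       /* times the array was released (demo only) */
};

struct module {
    struct device *device;
};

/* Release the shared endpoints only when the last module lets go,
 * and never touch the device again after this module has finalized. */
static void module_finalize(struct module *m)
{
    struct device *dev = m->device;
    if (NULL == dev) {
        return;                 /* this module was already finalized */
    }
    m->device = NULL;
    if (--dev->refcount == 0) {
        free(dev->endpoints);
        dev->endpoints = NULL;
        dev->frees++;
    }
}
```

With two modules sharing one device, the first `module_finalize` call leaves the endpoint array intact and the second one frees it exactly once.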

fixed by Igor, reviewed by miked/vasily
cmr=v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r29563.
2013-10-30 11:47:49 +00:00
Jeff Squyres
6569019b06 Move all usNIC stats to _stats.c|h and export them as MPI_T pvars.
This commit moves all the module stats into their own struct so that
the stats only need to appear as a single line in the module_t
definition, and then moves all the logic for reporting the stats into
btl_usnic_stats.c|h.

Further, the stats are now exported as MPI_T_BIND_NO_OBJECT entities
(i.e., not bound to any particular MPI handle), and are marked as
READONLY and CONTINUOUS.  They currently all default to verbose level
5 ("Application tuner / detailed", according to
https://svn.open-mpi.org/trac/ompi/wiki/MCAParamLevels).

Most of the statistics are counters, but a small number are high
watermark values.  Due to how counters are reported via MPI_T, none of
the counters are exported through MPI_T if the MCA param
btl_usnic_stats_relative=1 (i.e., the module resets the stats back to
zero at a given frequency).

When MPI_T_pvar_handle_alloc() is invoked on any of these pvars, it
will return a count that is equal to the number of active usnic BTL
modules.  The values returned for any given pvar (e.g.,
num_total_sends) are an array containing one value for each active
usnic BTL module.  The ordering of values in the array is both
consistent across all usnic pvars and stable throughout a single job:
array slot 0 corresponds to module X, array slot 1 corresponds to
module Y, etc.

Mapping which array slot corresponds to which underlying Linux usnic_X
device works as follows:

 * The btl_usnic_devices MPI_T state pvar is associated with a
   btl_usnic_device MPI_T enum, and can be obtained via
   MPI_T_pvar_get_info().
 * If all usNIC pvars are of length N, the values [0,N) in the
   btl_usnic_device enum are associated with strings of the
   corresponding underlying Linux device.

For example, to look up which Linux device is reported in all usNIC
pvars' array slot 1, look up the int value 1 in the btl_usnic_devices
enum.  Its corresponding string value is the underlying Linux device name
(e.g., "usnic_1").

cmr=v1.7.4:subject="usnic BTL MPI_T pvars"

This commit was SVN r29545.
2013-10-28 22:23:08 +00:00
Nathan Hjelm
404cceb9c4 Always check the return of [mc]alloc and fix a warning introduced by
r29479.

This fixes some issues reported a while ago in the openib btl. There
are a couple more unchecked mallocs but they are a bit more difficult
to fix since they are in void functions (btl_openib_endpoint.c).
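
A short sketch of the pattern, assuming a function that can report failure through its return value. The error constants and `alloc_counters` are hypothetical; Open MPI has its own error codes, which are not reproduced here.

```c
#include <stdlib.h>

/* Hypothetical status codes for this sketch. */
#define SKETCH_SUCCESS             0
#define SKETCH_ERR_OUT_OF_RESOURCE (-1)

/* Allocate a zeroed array of `count` ints, reporting failure through
 * the return value rather than handing back an unchecked pointer. */
static int alloc_counters(size_t count, int **out)
{
    int *p = calloc(count, sizeof(*p));
    if (NULL == p) {
        *out = NULL;
        return SKETCH_ERR_OUT_OF_RESOURCE;
    }
    *out = p;
    return SKETCH_SUCCESS;
}
```

A function declared `void` has no such return path, which is why the remaining unchecked allocations mentioned above are harder to fix without changing signatures.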

Refs trac:2401.

cmr=v1.7.4:reviewer=miked

This commit was SVN r29543.

The following SVN revision numbers were found above:
  r29479 --> open-mpi/ompi@d6ead2a3a5

The following Trac tickets were found above:
  Ticket 2401 --> https://svn.open-mpi.org/trac/ompi/ticket/2401
2013-10-28 20:04:49 +00:00
Nathan Hjelm
26f3a029d3 Fix scif configury.
cmr=v1.7.4:ticket=3862

This commit was SVN r29493.

The following Trac tickets were found above:
  Ticket 3862 --> https://svn.open-mpi.org/trac/ompi/ticket/3862
2013-10-23 17:04:20 +00:00
Nathan Hjelm
6186b5ed9d Remove extra file that made its way into r29490.
cmr=v1.7.4:ticket=3862

This commit was SVN r29491.

The following SVN revision numbers were found above:
  r29490 --> open-mpi/ompi@cde3b05ed3

The following Trac tickets were found above:
  Ticket 3862 --> https://svn.open-mpi.org/trac/ompi/ticket/3862
2013-10-23 16:17:51 +00:00
Nathan Hjelm
cde3b05ed3 Add support for the Intel scif interface.
Depends on #3847.

cmr=v1.7.4:reviewer=rhc

This commit was SVN r29490.
2013-10-23 15:59:14 +00:00
Dave Goodell
d969cfa513 usnic: correctly clean up verbs resources
Due to deallocation ordering (and an entirely missed deallocation), we
were leaking modest amounts of memory inside libusnic_verbs.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

This commit was SVN r29485.
2013-10-23 15:51:33 +00:00
Dave Goodell
a6ed232a10 usnic: fix several memory leaks
- some free lists simply were not being OBJ_DESTRUCTed, so they never
  freed their internal memory

- channel->recv_segs.ctx was being assigned in a way that got clobbered
  by ompi_free_list_init_new, so the cleanup code that relied on it
  being set never ran

- numerous other ".ctx" assignments were similarly ineffectual and were
  not being consumed, so I deleted them

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

This commit was SVN r29484.
2013-10-23 15:51:22 +00:00
Dave Goodell
c9b2343982 usnic: add ompi_btl_usnic_component_debug helper
This new routine can be called in exceptional situations, either
conditionally in BTL code or from a debugger, to help with debugging in
cases where MSGDEBUG1/2 or stats logging are impractical but more detail
is needed.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

This commit was SVN r29483.
2013-10-23 15:51:11 +00:00
Dave Goodell
d0b7d125b2 usnic: refactor usnic_stats_callback
Pull the bulk of the functionality out into a new routine,
ompi_btl_usnic_print_stats, which can be used in other debugging
contexts.  This also lets us eliminate the module->final_stats state
tracking.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

This commit was SVN r29482.
2013-10-23 15:50:57 +00:00
Jeff Squyres
0fb8edd720 Trivial comment change
This commit was SVN r29480.
2013-10-23 10:15:18 +00:00
Mike Dubman
d6ead2a3a5 Add support for routable ROCE where a different subnet_id is valid to proceed with MPI routing.
(can happen in the same LAN)
developed by vasily, reviewed by miked
cmr=v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r29479.
2013-10-23 06:08:54 +00:00
Nathan Hjelm
280a89448f Make btl/vader valgrind safe.
cmr=v1.7.4:reviewer=samuel

This commit was SVN r29464.
2013-10-22 15:33:32 +00:00
Rolf vandeVaart
0cd1e8dfd9 Add runtime support to turn off CUDA IPC support.
This commit was SVN r29444.
2013-10-16 16:48:18 +00:00