
2094 Commits

Author SHA1 Message Date
George Bosilca
f27123a20d Fix the add_proc issue identified by Jeff: the TCP BTL now discards a
peer proc without TCP support instead of completely dropping TCP support for the entire job.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31753.
2014-05-14 13:47:57 +00:00
Nathan Hjelm
dd8de4d6eb btl/ugni: fix memory leaks and silence some warnings
The smsg_mboxes free list was not getting destructed. The construct
has been moved to module initialization and a matching destruct is now
in the module destruct.

This commit was SVN r31746.
2014-05-13 21:22:33 +00:00
Nathan Hjelm
a78519a2b2 btl/scif: remove call to pthread_cancel
There is no reason to cancel the listening thread. It should die
automatically when the file descriptor is closed. It is sufficient
to just wait for the thread to exit with pthread join.

cmr=v1.8.2:ticket=trac:4616:reviewer=jsquyres

This commit was SVN r31738.

The following Trac tickets were found above:
  Ticket 4616 --> https://svn.open-mpi.org/trac/ompi/ticket/4616
2014-05-13 17:29:53 +00:00
Gilles Gouaillardet
209378efec btl/scif: prevent SIGSEGV from occurring when the module is unloaded
Fixes trac:4615

cmr=v1.8.2:reviewer=hjelmn

This commit was SVN r31717.

The following Trac tickets were found above:
  Ticket 4615 --> https://svn.open-mpi.org/trac/ompi/ticket/4615
2014-05-13 10:04:38 +00:00
Nathan Hjelm
0b8bb2339b btl/scif: update the size when preparing send fragments
Thanks to Gilles Gouaillardet for catching this.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31715.
2014-05-12 22:05:31 +00:00
Jeff Squyres
e37c7af0fb usnic: update cclient/cagent to use unix domain sockets (not RML)
In preparation for moving the BTLs down to OPAL, discontinue the use
of the RML for connectivity client/agent communication.  Instead, use
local unix domain sockets in the job session directory (all
communication is between processes on the same server, so unix domain
sockets are fine).

This commit was SVN r31710.
2014-05-09 20:35:36 +00:00
Jeff Squyres
184e4fc0ca usnic: ensure that procs agree on use_udp value
Add the component use_udp value into the modex.  If my component's
use_udp value doesn't agree with the use_udp value from a peer's modex
data, print a helpful message and disqualify the usnic BTL (the usnic
BTL will not be used).  This prevents accidental customer
misconfigurations.

Reviewed by Dave Goodell

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31689.
2014-05-08 16:43:50 +00:00
Jeff Squyres
e9c3df652e usnic: reduce sizeof(ompi_btl_usnic_addr_t) to 56 bytes
Trivial struct re-ordering to eliminate holes in the middle of the
struct (although there's still a hole at the end) and reduce the
overall size of the struct from 64 to 56 bytes.  Also change mtu from
int to uint16_t; there was no need for it to be that large.
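
The re-ordering idea can be illustrated with a minimal sketch. The two structs below are hypothetical and are not the real ompi_btl_usnic_addr_t layout; they only show how ordering members from largest to smallest (and narrowing an int field to uint16_t) removes interior padding on common ABIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with padding holes between members. */
struct addr_before {
    uint8_t  flag;   /* 1 byte, followed by padding before id */
    uint64_t id;
    uint16_t port;   /* 2 bytes, followed by padding before key */
    uint64_t key;
    int      mtu;    /* wider than the value needs */
};

/* Same information, ordered largest-first; mtu narrowed to 16 bits.
 * Only tail padding (if any) remains. */
struct addr_after {
    uint64_t id;
    uint64_t key;
    uint16_t port;
    uint16_t mtu;
    uint8_t  flag;
};
```

Comparing sizeof(struct addr_before) with sizeof(struct addr_after) on the target platform shows the saving; the exact numbers depend on the ABI's alignment rules.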

Reviewed by Dave Goodell

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31688.
2014-05-08 16:38:59 +00:00
Jeff Squyres
a61e4d6425 usnic: fix connectivity checker timeout
Fix mismatch between the MCA param (which expresses the timeout in
*milli*seconds) and the struct timeval timeout (which expresses the
timeout in *micro*seconds).
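
A minimal sketch of the unit conversion involved, assuming a hypothetical helper named timeout_from_msec (not a function from the usnic BTL): a millisecond value has to be split into whole seconds and leftover microseconds before it is stored in a struct timeval.

```c
#include <sys/time.h>

/* Hypothetical helper: convert a timeout given in milliseconds into a
 * struct timeval, whose tv_usec field counts microseconds. */
static struct timeval timeout_from_msec(long msec)
{
    struct timeval tv;
    tv.tv_sec  = msec / 1000;           /* whole seconds */
    tv.tv_usec = (msec % 1000) * 1000;  /* leftover milliseconds -> microseconds */
    return tv;
}
```

Assigning the millisecond value directly to tv_usec would make the timeout a thousand times shorter than intended, which is the kind of mismatch the commit message describes.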

Reviewed by Dave Goodell

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31687.
2014-05-08 16:36:07 +00:00
Ralph Castain
76f5991ab2 Couple of minor fixes
This commit was SVN r31680.
2014-05-08 02:26:45 +00:00
Ralph Castain
a8e2d6c3a6 The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to clean up a few things that were historical in nature:
top_ompi_srcdir  ->  OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR

We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.

The only thing left is ompilibdir being treated similarly to what we did for srcdir/builddir. Coming soon.

This commit was SVN r31678.
2014-05-07 21:48:53 +00:00
Ralph Castain
fdfb331e13 Per RFC, continue the renaming process
This commit was SVN r31663.
2014-05-06 20:53:55 +00:00
Jeff Squyres
10e8ab493e btl_usnic_mca.c: Increase default connectivity checker frequency
In abusive MPI communication patterns, sending a UDP ping only once a
second may not be sufficient -- all the UDP pings may be dropped.  So
increase the frequency of the pings to every quarter second, and allow
more total pings to be sent.

Total timeout time is still the same (10 seconds) -- we'll just now
try 40 times (i.e., once every quarter second) as opposed to 10 times
(i.e., once a second).  Testing has shown that this frequency allows
the connectivity checker to always succeed even in the many-to-one
abusive communication patterns.

cmr=v1.8.2:reviewer=dgoodell

This commit was SVN r31602.
2014-05-02 11:06:18 +00:00
Jeff Squyres
bf82ee2a14 btl_usnic_connectivity.h: fix PACK_BYTES macro
We're passing a char foo[x] into PACK_BYTES, so we don't need to take
its address in the macro.  This is parallel to the UNPACK_BYTES macro
(where we pass a char bar[x] into it, and don't take its address in
the macro).

The value we're packing is only used to output in a show_help message,
which is why this wasn't noticed before (i.e., it's not used in
network or addressing that would have caused a failure).

cmr=v1.8.2:reviewer=dgoodell

This commit was SVN r31594.
2014-05-01 22:23:22 +00:00
Jeff Squyres
49383f0aaa Oops: remove errant "v4" string.
This commit was SVN r31591.
2014-05-01 20:21:43 +00:00
Jeff Squyres
56ecb92b10 Per discussion with George and Ralph, change this BTL_ERROR message to
an opal_show_help() so that its output is deduplicated.

This commit was SVN r31590.
2014-05-01 20:15:33 +00:00
Ralph Castain
e20dae536c Last step under current RFC: OMPI_CHECK_WITHDIR -> OPAL_CHECK_WITHDIR
This commit was SVN r31585.
2014-05-01 15:38:07 +00:00
Ralph Castain
3b64c603b4 First stage of RFC to rename OMPI_foo build system support: change OMPI_CHECK_PACKAGE -> OPAL_CHECK_PACKAGE
This commit was SVN r31582.
2014-05-01 14:24:56 +00:00
Jeff Squyres
c4d85ec6ca btl_usnic_cclient.c: update to use the new opal dstore
Use the new opal dstore API (vs. the old RTE DB API).

(dstore is not going to the v1.8 series, so there's no need to CMR
this to v1.8)

This commit was SVN r31580.
2014-04-30 22:32:47 +00:00
Nathan Hjelm
c9a257f1a0 btl/ugni: always buffer sendi fragments
This commit will improve the message rate when using the sendi function
by not waiting for the send to get to the remote process.

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r31526.
2014-04-24 18:50:29 +00:00
Nathan Hjelm
0849d61e38 btl/vader: improve performance under heavy load and eliminate a racy
feature

This commit should fix a hang seen when running some of the one-sided
tests. The downside of this fix is it reduces the maximum size of the
messages that use the fast boxes. I will fix this in a later commit.

To improve performance under a heavy load I introduced sequencing to
ensure messages are given to the pml in order. I have seen little to no
impact on the message rate or latency with this change and there is a
clear improvement to the heavy message rate case.

Let's let this sit in the trunk for a couple of days to ensure that
everything is working correctly.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31522.
2014-04-24 17:36:03 +00:00
George Bosilca
024221f469 Initialize some fields (prevent valgrind complaints).
This commit was SVN r31503.
2014-04-23 13:38:30 +00:00
Jeff Squyres
5d17628823 Add in an opal_output_verbose() so that we'll see the case where there
are no usNICs found.

Refs trac:4549

This commit was SVN r31489.

The following Trac tickets were found above:
  Ticket 4549 --> https://svn.open-mpi.org/trac/ompi/ticket/4549
2014-04-22 18:59:10 +00:00
Jeff Squyres
a3acc49688 usnic_component.c: don't complain if there are no usNIC devices
cmr=v1.8.2:reviewer=dgoodell

This commit was SVN r31468.
2014-04-21 19:28:48 +00:00
Jeff Squyres
a28d7af262 Remove set-but-unused variable.
This commit was SVN r31457.
2014-04-19 12:51:51 +00:00
Rolf vandeVaart
1fab9bb37f Fixes per review by jsquyres. Piggyback on btl_base_verbose rather than using my own special MCA var.
This commit was SVN r31427.
2014-04-18 18:09:09 +00:00
Rolf vandeVaart
8897e2f5bb Fix typo error in commit r31388.
This commit was SVN r31398.

The following SVN revision numbers were found above:
  r31388 --> open-mpi/ompi@ccb33ff811
2014-04-15 19:50:54 +00:00
Nathan Hjelm
ccb33ff811 btl: Use C99 sub-object naming when initializing BTL components
Two things to note:

 - This change will allow us to expand the BTL interface without
   having to worry about modifying BTLs that will not support the new
   interfaces. More on this will come later this year as part of the
   1.9 series.

 - C99 guarantees that uninitialized members of structs declared outside
   of functions (DATA binary section) will be initialized with
   0's. This allows us to drop stuff like .btl_flags = 0, or .btl_get
   = NULL.
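
A minimal sketch of the initialization style described above. The struct demo_btl and its members are hypothetical stand-ins, far smaller than the real BTL module struct; the point is only that members left out of a designated initializer at file scope end up zeroed.

```c
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for a BTL module struct. */
struct demo_btl {
    const char *name;
    unsigned    flags;
    int       (*send)(const void *buf, size_t len);
    int       (*get)(void *buf, size_t len);
};

static int demo_send(const void *buf, size_t len)
{
    (void) buf;          /* payload is ignored in this sketch */
    return (int) len;
}

/* File-scope object: members not named here are zero-initialized,
 * so .flags is 0 and .get is NULL without being spelled out. */
static struct demo_btl demo_module = {
    .name = "demo",
    .send = demo_send,
};
```

Adding a new member to struct demo_btl later would not require touching this initializer; the new member would simply start out as zero or NULL.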

This commit was SVN r31388.
2014-04-14 19:29:26 +00:00
Jeff Squyres
16f90acbaf btl usnic: Add some SHOW_HELP: tokens and remove 2 unused help messages
This commit was SVN r31322.
2014-04-07 15:40:19 +00:00
Nathan Hjelm
9112977d86 btl/openib/udcm: fix two race conditions
This commit fixes two nasty races:

 - One can occur if the connection request message and connection completion
   message arrive out of order. This can happen normally when adaptive routing
   is used and also in a timeout situation where a UD message is lost.

 - One occurs when handling an ack at the same time as we are handling the
   message timeout. In this case we cannot free the message or the timeout
   will be operating on invalid data. This fix is a band-aid until I can come
   up with a better approach. Instead of freeing the message it is marked
   as inactive and the event callback is triggered immediately (this has no
   effect if the callback is already active). The callback then frees the
   message if it is inactive.

cmr=v1.8.1:reviewer=pasha

This commit was SVN r31305.
2014-04-02 15:09:50 +00:00
Nathan Hjelm
ecce211403 btl/vader: create the shared memory backing file in the proc's session
directory not the job's

This bug didn't affect the correctness of the vader results, just the
cleanup. This commit removes an error message about removing a non-existent
file.

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31265.
2014-03-28 00:38:19 +00:00
Jeff Squyres
cdb396697c usnic: do not disqualify if a peer does not put usnic modex info
If ompi_modex_recv() fails with OPAL_ERR_DATA_VALUE_NOT_FOUND, it
simply means that the peer process did not put any usnic BTL modex
info -- it is not an error.  So have the usnic BTL simply ignore that
peer (vs. disqualifying itself / treating this like a real error).

Refs trac:4442.

This commit was SVN r31258.

The following Trac tickets were found above:
  Ticket 4442 --> https://svn.open-mpi.org/trac/ompi/ticket/4442
2014-03-27 19:37:07 +00:00
Nathan Hjelm
b3bb90cf2d Do not include inttypes.h directly in Open MPI. Use opal_stdint.h instead.
This commit should finish the work started for #869. Closing that ticket
with this commit.

Closes trac:869

cmr=v1.8.1:reviewer=jsquyres

This commit was SVN r31257.

The following Trac tickets were found above:
  Ticket 869 --> https://svn.open-mpi.org/trac/ompi/ticket/869
2014-03-27 17:56:00 +00:00
Vasily Filipov
8ef2e746e6 BTL/OPENIB: fix for rdma cm AF_IB case - user private data pointer points to a lib RDMA CM header and not to a "Consumer Private Data".
This commit was SVN r31247.
2014-03-27 14:04:02 +00:00
Nathan Hjelm
b9da3ef462 btl/vader: actually set the correct send size in all cases
Fix a one line bug when dealing with non-contiguous sends in prepare_src. Bug was
identified by the intel test suite.

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31232.
2014-03-26 21:50:07 +00:00
Nathan Hjelm
5400e21688 btl/vader: unlink the shared memory segment when finished
cmr=v1.8:reviewer=jsquyres

This commit was SVN r31230.
2014-03-26 16:25:02 +00:00
Jeff Squyres
5a09ee5d8c usnic: drop unknown connection checker packets without erroring out
In most cases, bad messages received by the connectivity checker are
just dropped.  However, in one specific code path, a bad packet caused
an abort.  Doh!

This commit does two things:

1. Improve verbose messages for all these cases
2. Simply drop incoming messages that cannot be identified as ACKs or PINGs

Submitted by Jeff Squyres, reviewed by Dave Goodell.

cmr=v1.8:reviewer=ompi-rm1.8

This commit was SVN r31225.
2014-03-25 21:05:20 +00:00
Mike Dubman
b8dddabcfb add config section for upcoming ConnectX-4 card
cmr=v1.8:reviewer=ompi-rm1.8

This commit was SVN r31199.
2014-03-25 14:27:09 +00:00
Vasily Filipov
c424ad94f3 BTL/OPENIB: remove AC_RUN_IFELSE from configure and check AF_IB support by lib rdmacm during component_init.
This commit was SVN r31194.
2014-03-24 13:36:04 +00:00
Jeff Squyres
c6994adf66 Add missing show_help message.
Found via Cisco MTT (i.e., it complained of not being able to find
this show_help message).

cmr=v1.8:reviewer=dgoodell

This commit was SVN r31144.
2014-03-19 14:09:19 +00:00
Jeff Squyres
7933de4928 Fix segv when ibv_create_ah fails.
* Ensure that all endpoints[x] values are initialized to NULL
* If ibv_create_ah fails, remove each endpoint from the
  module->all_endpoints list so that the endpoint can be destructed
  properly.

Submitted by Jeff Squyres, reviewed by Dave Goodell.

cmr=v1.7.5:reviewer=ompi-rm1.7

This commit was SVN r31111.
2014-03-18 15:52:55 +00:00
Jeff Squyres
554303af08 Two minor usnic fixes
* clang warning stomp
* memory barrier for volatile variable use

These can go to 1.7.5 or can slip to v1.8 -- RM decision.

Submitted by Jeff Squyres, reviewed by Dave Goodell

cmr=v1.7.5:reviewer=ompi-rm1.7

This commit was SVN r30944.
2014-03-05 21:21:53 +00:00
Jeff Squyres
fbde50e7cd Fix ibv_port_query() usnic extension usage
* Older versions of libusnic_verbs actually return 0 when querying for
  an unknown port.  So also check for a magic ID in the returned data
  to *really* know if the usnic extensions are supported.
* Use a union (in the common_verbs area) and memcpy (in the btl) to
  avoid undefined C type aliasing behavior.
* Ensure to memset the function table to 0 if the usnic extensions
  are not supported.
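
The memcpy approach mentioned in the second bullet can be sketched as follows. The helper read_magic is hypothetical and does not reproduce the actual usnic extension check; it only shows how to pull a typed value out of a raw byte buffer without casting the buffer pointer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit magic value from raw bytes. */
static uint32_t read_magic(const unsigned char *raw)
{
    uint32_t magic;
    /* memcpy copies the object representation. Casting raw to
     * (uint32_t *) and dereferencing it instead would break the C
     * aliasing rules and could also be a misaligned access. */
    memcpy(&magic, raw, sizeof(magic));
    return magic;
}
```

Compilers typically turn a fixed-size memcpy like this into a plain load, so the well-defined form usually costs nothing at run time.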

Submitted by Jeff Squyres, reviewed by Dave Goodell.

cmr=v1.7.5:reviewer=ompi-rm1.7

This commit was SVN r30935.
2014-03-04 22:11:47 +00:00
Rolf vandeVaart
c2ae29d860 Adjust priorities in smcuda BTL so it is used when CUDA-aware compiled in.
cmr=v1.7.5:reviewer=hjelmn

This commit was SVN r30925.
2014-03-04 14:44:44 +00:00
Vasily Filipov
5acb96d11b BTL/OPENIB: always define BTL_OPENIB_RDMACM_IB_ADDR to 0 or 1.
This commit was SVN r30923.
2014-03-04 08:54:17 +00:00
Jeff Squyres
6710c2ef3f usnic: remove unnecessary header union
Realistically, the usnic BTL doesn't need to know anything about the
underlying transport except for its header length (so that it knows
where the payload begins in a received buffer).  So remove the use of
the specific transport prefix union and just rely on the usnic verbs
extension to tell us what the header length is if we're using the
usNIC/UDP transport, or sizeof(struct ibv_grh) if we're using usNIC/L2
transport.

This commit was SVN r30914.
2014-03-03 21:33:12 +00:00
Jeff Squyres
d61765cb2a usnic: use the new usnic verbs extensions
If they exist, call the usnic verbs extensions to both enable UDP
support and get the UD receiver header length that should be used
(rather than assume 40/struct GRH).

This commit was SVN r30912.
2014-03-03 21:31:42 +00:00
Vasily Filipov
f36d50d494 OPENIB BTL/CONNECT: warning fixes caused by r30875.
This commit was SVN r30905.

The following SVN revision numbers were found above:
  r30875 --> open-mpi/ompi@f2014b96e7
2014-03-03 06:41:46 +00:00
Jeff Squyres
e819b5a34a Remove the vendor_ids parsing.
We don't use this functionality any more; we use the transport_type
and device name to identify usnic devices.  It's slightly easier
because we can get transport_type+name from ibv_device_open() and don't
have to do an additional ibv_query_device() to get its attributes.

Reviewed by Dave Goodell.

cmr=v1.7.5:reviewer=ompi-rm1.7

This commit was SVN r30882.
2014-02-27 21:47:01 +00:00
Jeff Squyres
3cbdf33b88 This is what r30852 should have been: Consolidate into a single, outer loop of ibv_create_ah() calls
Follow on to SVN trunk r30850: consolidate the ibv_create_ah() calls
into a single loop, MPI_WAITALL-style.  That is, call the (effectively
non-blocking) ibv_create_ah() for each endpoint.  If we get
NULL+EAGAIN, it means that the UDP ARP is still ongoing down in the
kernel, so just try again later.  We put these all into a single loop
because it allows us to parallelize the ARP progress in the kernel.
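
The retry pattern described above can be sketched with a stand-in for the verbs call. Everything here is hypothetical: fake_create_ah simulates a non-blocking call that returns NULL with errno set to EAGAIN while address resolution is pending, and create_all_handles is not the code from the commit.

```c
#include <errno.h>
#include <stddef.h>

#define NUM_PEERS 3

/* Simulated state: how many more attempts each peer's address
 * resolution needs before a handle can be created. */
static int arp_ticks_left[NUM_PEERS] = { 2, 0, 1 };
static int handles[NUM_PEERS];

/* Hypothetical stand-in for a non-blocking ibv_create_ah(). */
static int *fake_create_ah(int peer)
{
    if (arp_ticks_left[peer] > 0) {
        --arp_ticks_left[peer];
        errno = EAGAIN;              /* resolution still in progress */
        return NULL;
    }
    handles[peer] = peer + 1;
    return &handles[peer];
}

/* Sweep over all peers until every handle exists. Returns the number
 * of sweeps needed, or -1 on a failure other than EAGAIN. */
static int create_all_handles(int *out[NUM_PEERS])
{
    int sweeps = 0;
    int pending = NUM_PEERS;

    for (int i = 0; i < NUM_PEERS; ++i) {
        out[i] = NULL;
    }
    while (pending > 0) {
        ++sweeps;
        for (int i = 0; i < NUM_PEERS; ++i) {
            if (out[i] != NULL) {
                continue;            /* this peer is already done */
            }
            out[i] = fake_create_ah(i);
            if (out[i] != NULL) {
                --pending;
            } else if (errno != EAGAIN) {
                return -1;           /* real error, give up */
            }
        }
    }
    return sweeps;
}
```

Because every pending peer is tried in each sweep, slow resolutions overlap instead of being waited on one at a time. A production loop would presumably also need an overall timeout so that an unreachable peer cannot spin forever.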

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30879.

The following SVN revision numbers were found above:
  r30850 --> open-mpi/ompi@3641500442
  r30852 --> open-mpi/ompi@4e282a3295

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-27 17:19:50 +00:00