1
1
Граф коммитов

22863 Коммитов

Автор SHA1 Сообщение Дата
Jeff Squyres
2f137ff151 openib: reset memalign threshhold properly
Now that open-mpi/ompi#638 is fixed, reset the openib BTL memalign
threshhold properly.

This effectively re-instates commit
open-mpi/ompi@ce915b5757.
2015-06-12 20:11:47 -07:00
Jeff Squyres
88c13adc8c openib: only set the memory hook if it is enabled
Instead of unconditionally setting the memory hook, only set it when
the memory hooks are both available and have been enabled (e.g.,
opal/mca/memory/linux has decided that it *can* be enabled, and when
the mpi_leave_pinned MCA param is set to 1, or is set to -1 and some
component requested the memory hooks be enabled).

If we set the memory hook when memory hooks are not enabled,
__malloc_hook will be NULL, which will cause problems when
btl_openib_malloc_hook() tries to invoke it.

Fixes open-mpi/ompi#638.
2015-06-12 20:11:47 -07:00
rhc54
2de42c873d Merge pull request #641 from rhc54/mq
Bring over George's message_queue repair
2015-06-12 15:07:38 -07:00
George Bosilca
67b70bb47a Add multi-threaded support. 2015-06-12 14:22:17 -07:00
George Bosilca
b2cf74cabc A first cut at a possible solution for the missing requests
from the message queues (a debugging feature). With this approach
all blocking (single threaded) requests are allocated from the main
freelist, so they will be accounted for during the message queues
investigation).
2015-06-12 14:22:17 -07:00
Ryan Grant
eec120678c Merge pull request #614 from tkordenbrock/topic/portals4.triggered.collectives
coll-portals4: implement collective operations using Portals4 triggered operations
2015-06-11 08:20:55 -06:00
Ralph Castain
12d3c9ca22 Revert "Fix a typo that incorrectly set the alignment threshold in the openib BTL."
This reverts commit ce915b5757.
2015-06-10 14:02:49 -07:00
Jeff Squyres
a6b308a722 Merge pull request #632 from jithinjosepkl/pr/pml-cm-opt
Initialize convertor in pml-cm-send/recv
2015-06-10 14:39:09 -04:00
Jithin Jose
7cfbfc4c89 Initialize convertor in pml-cm-send and recv.
Signed-off-by: Jithin Jose <jithin.jose@intel.com>
2015-06-10 09:39:31 -07:00
Gilles Gouaillardet
8885b34637 mca/base: fix a misc memory leak
as reported by Coverity with CID 1294415
2015-06-10 15:10:57 +09:00
Gilles Gouaillardet
9e278a21ce opal/crs: fix a string overflow
and revamp out of resource handling
fixes resource leak as reported by Coverity with CID 1304752
2015-06-10 14:23:25 +09:00
Ralph Castain
cebdf0b7c0 Add missing include 2015-06-09 22:08:05 -07:00
Jeff Squyres
fbaf6888f8 ompi/include/Makefile.am: rm mpi_portable_platform.h first
We've seen this a few times (e.g.,
http://www.open-mpi.org/community/lists/users/2015/06/27057.php
reported via @siegmargross).  I'm not entirely sure why it happens --
the best I can come up with is a poorly-synchronized network
filesystem and/or a bug in "make".  For example: this code hasn't
changed in forever, and it only happens to users *sometimes*.

Regardless, avoid the error altogether by removing the file before
making the sym link (it should be a sym link anyway -- if there's
something there, it should be safe to remove it before we re-create
the sym link that should be there in the first place).

(cherry picked from commit 0edd265ea045e649c9489e3cb8fdb657800d95c3)
2015-06-09 17:59:32 -07:00
Ryan Grant
8dd9183496 Merge pull request #630 from tkordenbrock/topic/portals4.verify.PtlPTAlloc.result
mtl-portals4: Verify the result of PtlPTAlloc()
2015-06-09 14:48:04 -06:00
Todd Kordenbrock
b725186768 mtl-portals4: Verify the result of PtlPTAlloc()
The Portals4 MTL allocates two Portals IDs requesting specific
well-known IDs and assumes that those IDs are allocated.  If those IDs
are in use, PtlPTAlloc() will allocate a different ID.  This commit
verifies that the requested IDs were allocated.
2015-06-09 14:43:50 -05:00
Nathan Hjelm
090922887b win_get_attr: fix coverity issues
CID 71734 Self assignment (NO_EFFECT)

This code has no effect. The original author of the offending code
does not remember why the self-assignment is there. Fortran
MPI_Win_get_attr tests are working with or without it so remove the
code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-09 09:34:26 -06:00
Nathan Hjelm
6772d32b85 opal/crs: silence clang warnings introduced by coverity fixes
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-08 09:16:13 -06:00
Gilles Gouaillardet
bcdb2d1380 add missing #include
sscanf requires stdio.h
fixes commit open-mpi/ompi@6ca57724c4
2015-06-08 09:13:11 +09:00
Jeff Squyres
4b59be4e4c btl tcp: cosmetic changes and updates
No logic changes.

Update some stale/incorrect comments, fix some indenting and style.
2015-06-06 10:17:20 -07:00
Jeff Squyres
d164fe9bc5 opal_params.c: fix typo in comment 2015-06-06 10:17:20 -07:00
Jeff Squyres
0acec2b676 opal/util/net.c: remove stale comment
Also wrap a long "if" statement -- but make no code logic changes.
2015-06-06 10:17:20 -07:00
Jeff Squyres
6ca57724c4 opal/util/net.c: remove superflous #include 2015-06-06 10:17:20 -07:00
Jeff Squyres
cddc8945e0 btl_tcp_proc.c: add missing "continue"
Also add another (superflous but symmetric) continue statement.

This missing "continue" statement allows IPv4 "private network"
matches to fall through and allow IPv6 matches to be made -- thereby
overriding the IPv4 match that was already made.

Fixes #585 (although several of the other issues identified on #585
still exist, the primary / initial bug that was reported there is now
fixed).
2015-06-06 10:17:12 -07:00
Rolf vandeVaart
8622b34664 Check for GPU Direct RDMA and leave pinned turned off 2015-06-04 14:25:24 -04:00
Jeff Squyres
347290f785 pml/Makefile.am: add missing file to $(headers) 2015-06-02 20:07:54 -07:00
Gilles Gouaillardet
bf38f82dc2 MPI_Win_{get,set}_info: add missing files
fixes commit open-mpi/ompi@558d34a5c3
2015-06-03 09:04:04 +09:00
Gilles Gouaillardet
7179d442c0 MPI_Win_{attach,detach}: add missing files
fixes commit open-mpi/ompi@9600e2bc63
2015-06-03 09:02:50 +09:00
Gilles Gouaillardet
1d8ce96305 MPI_Win_Create_dynamic: add missing files
fixes commit open-mpi/ompi@f45244d5a5
2015-06-03 09:00:04 +09:00
Jeff Squyres
a55eb5e2c6 Merge pull request #602 from jithinjosepkl/pr/pml-cm-opt
Optimizations to PML-CM
2015-06-02 13:47:10 -05:00
Todd Kordenbrock
a274d2795c coll-portals4: implement collective operations using Portals4 triggered operations
This commit implements the reduce, allreduce, barrier and bcast
collective operations using Portals4 triggered operations.
2015-06-02 11:41:19 -05:00
Howard Pritchard
8bb00824b6 Merge pull request #619 from hppritcha/topic/fix_busted_cray_build
odls/alps: fix busted build for cray.
2015-06-02 09:49:01 -06:00
Howard Pritchard
05325b113e odls/alps: fix busted build for cray.
This commit fixes things broken by commit
ea35e47.

Fixes #616

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Gilles Gouaillardet
558d34a5c3 MPI_Win_{get,set}_info : add Fortran bindings 2015-06-02 17:45:52 +09:00
Gilles Gouaillardet
9600e2bc63 MPI_Win_{attach,detach} : add Fortran bindings 2015-06-02 17:45:44 +09:00
Gilles Gouaillardet
f45244d5a5 MPI_Win_create_dynamic : add Fortran bindings 2015-06-02 17:45:32 +09:00
Nathan Hjelm
16abe2e4f3 Merge pull request #615 from hjelmn/opal_coverity
event/libevent2022: fix coverity issue
2015-06-01 19:33:36 -06:00
Nathan Hjelm
f72b6d45c7 crs/none: fix coverity issues
CID 1301389 Resource leak (RESOURCE_LEAK)

There is no conceivable reason to strdup cr_argv[0] in either
location. Removed the calls to strdup.

CID 741357 Resource leak (RESOURCE_LEAK)

cr_argv was created by opal_argv_split (tmp_argv[0], ' '). Why should
we call opal_argv_join (' ') on this array. Leak fixed by printing out
tmp_argv[0] instead of calling opal_argv_join.

CID 741358 Resource leak (RESOURCE_LEAK)

The code does not handle exec failure correctly. The error should be
communicated to the parent process but the function in question is
only called by the parent. This calls into question some of the
structure of the function in general (like what is the point of
returning the child process id). That said, I will go ahead and add
the opal_argv_free to quiet this error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-01 16:00:51 -06:00
Nathan Hjelm
7e34997746 event/libevent2022: fix coverity issue
CID 1269841 Out-of-bounds access (OVERRUN)

Correct issue. If the string being concatingated fills the remaining
buffer then a \0 is written past the end of the string. In practice
this should never happen but it should be fixed. I re-organized the
code a bit to clear this error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-01 15:38:54 -06:00
Ralph Castain
6b93db6a9a Grrr...not sure how this slipped thru 2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184 Remove stale header 2015-05-29 19:24:51 -07:00
rhc54
daa55fd582 Merge pull request #613 from rhc54/topic/listener
Centralize listener connection support
2015-05-29 15:55:19 -07:00
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Nathan Hjelm
b1c100c402 win_get_info: fix indentation
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 10:16:25 -06:00
Nathan Hjelm
c87ef46599 Merge pull request #612 from hjelmn/opal_coverity
opal coverity fixes
2015-05-29 10:02:17 -06:00
Nathan Hjelm
7b7993e406 pmix/base: fix coverity issue
CID 1269707 Logically dead code (DEADCODE)

Coverity is correct that tmp3 can never be NULL here. Deleted the dead
code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 09:02:56 -06:00
Nathan Hjelm
1d27b1f944 pmix/native: fix coverity issue
CID 1269730 Dereference after null check (FORWARD_NULL)

The code checked for cb == NULL before checking for a callback
function but did not have the same protection around the
OBJ_RELEASE(cb).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:48:15 -06:00
Nathan Hjelm
5e2bc2c662 btl/openib: fix coverity issue
CID 1269821 Dereference null return value (NULL_RETURNS)

This is another false positive that can be silenced by looping on
opal_list_remove_first instead of using both opal_list_is_empty and
opal_list_remove_first.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:44:03 -06:00
Nathan Hjelm
65472a383f mca/base: add yes/no as valid values for boolean variables
This commit expands the set of accepted values for boolean values to
include yes/no as synonyms for 1/0.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:41:51 -06:00
Nathan Hjelm
61fe2cc629 win: add support for returning non_locks info key
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:35:01 -06:00
Jeff Squyres
e8a8a6d223 README: remove a weird parenthetical 2015-05-28 18:26:43 -04:00