Commit graph

23064 commits

Author SHA1 Message Date
Ryan Grant
eec120678c Merge pull request #614 from tkordenbrock/topic/portals4.triggered.collectives
coll-portals4: implement collective operations using Portals4 triggered operations
2015-06-11 08:20:55 -06:00
Jithin Jose
5922645080 Set convertor pDesc and count in OPAL_CONVERTOR_PREPARE (including cases
where count = 0)

Signed-off-by: Jithin Jose <jithin.jose@intel.com>
2015-06-10 17:08:54 -07:00
Ralph Castain
12d3c9ca22 Revert "Fix a typo that incorrectly set the alignment threshold in the openib BTL."
This reverts commit ce915b5757.
2015-06-10 14:02:49 -07:00
Jeff Squyres
a6b308a722 Merge pull request #632 from jithinjosepkl/pr/pml-cm-opt
Initialize convertor in pml-cm-send/recv
2015-06-10 14:39:09 -04:00
Jithin Jose
7cfbfc4c89 Initialize convertor in pml-cm-send and recv.
Signed-off-by: Jithin Jose <jithin.jose@intel.com>
2015-06-10 09:39:31 -07:00
Gilles Gouaillardet
8885b34637 mca/base: fix a misc memory leak
as reported by Coverity with CID 1294415
2015-06-10 15:10:57 +09:00
Gilles Gouaillardet
9e278a21ce opal/crs: fix a string overflow
and revamp out of resource handling
fixes resource leak as reported by Coverity with CID 1304752
2015-06-10 14:23:25 +09:00
Ralph Castain
cebdf0b7c0 Add missing include
2015-06-09 22:08:05 -07:00
Jeff Squyres
fbaf6888f8 ompi/include/Makefile.am: rm mpi_portable_platform.h first
We've seen this a few times (e.g.,
http://www.open-mpi.org/community/lists/users/2015/06/27057.php
reported via @siegmargross).  I'm not entirely sure why it happens --
the best I can come up with is a poorly-synchronized network
filesystem and/or a bug in "make".  For example: this code hasn't
changed in forever, and it only happens to users *sometimes*.

Regardless, avoid the error altogether by removing the file before
making the sym link (it should be a sym link anyway -- if there's
something there, it should be safe to remove it before we re-create
the sym link that should be there in the first place).

(cherry picked from commit 0edd265ea045e649c9489e3cb8fdb657800d95c3)
2015-06-09 17:59:32 -07:00
Ryan Grant
8dd9183496 Merge pull request #630 from tkordenbrock/topic/portals4.verify.PtlPTAlloc.result
mtl-portals4: Verify the result of PtlPTAlloc()
2015-06-09 14:48:04 -06:00
Todd Kordenbrock
b725186768 mtl-portals4: Verify the result of PtlPTAlloc()
The Portals4 MTL allocates two Portals IDs requesting specific
well-known IDs and assumes that those IDs are allocated.  If those IDs
are in use, PtlPTAlloc() will allocate a different ID.  This commit
verifies that the requested IDs were allocated.
2015-06-09 14:43:50 -05:00
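The verification described in this commit can be sketched in C as follows. This is a minimal illustration, not the Portals4 API: `sketch_table`, `sketch_alloc`, and `sketch_alloc_well_known` are hypothetical names, and the allocator only mimics the behavior stated in the message (a request for an ID that is already in use yields a different ID).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TABLE_SIZE 8

/* Hypothetical stand-in for a table of IDs; 1 marks an ID already in use. */
struct sketch_table {
    int in_use[SKETCH_TABLE_SIZE];
};

/* Mimics the behavior described in the commit message: try the requested
 * ID, and if it is taken, hand back some other free ID instead.
 * Returns 0 on success, -1 if the table is full. */
int sketch_alloc(struct sketch_table *t, int requested, int *granted)
{
    assert(t != NULL && granted != NULL);
    assert(requested >= 0 && requested < SKETCH_TABLE_SIZE);
    if (!t->in_use[requested]) {
        t->in_use[requested] = 1;
        *granted = requested;
        return 0;
    }
    for (int i = 0; i < SKETCH_TABLE_SIZE; ++i) {
        if (!t->in_use[i]) {
            t->in_use[i] = 1;
            *granted = i;
            return 0;
        }
    }
    return -1;
}

/* The verification step: succeed only if the well-known ID was granted. */
int sketch_alloc_well_known(struct sketch_table *t, int requested)
{
    int granted = -1;
    if (sketch_alloc(t, requested, &granted) != 0) {
        return -1;
    }
    if (granted != requested) {
        t->in_use[granted] = 0; /* give back the ID we did not ask for */
        return -1;
    }
    return 0;
}
```

Without the comparison in `sketch_alloc_well_known`, the caller would keep assuming the well-known ID even when a different one was handed out.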
Nathan Hjelm
090922887b win_get_attr: fix coverity issues
CID 71734 Self assignment (NO_EFFECT)

This code has no effect. The original author of the offending code
does not remember why the self-assignment is there. Fortran
MPI_Win_get_attr tests are working with or without it so remove the
code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-09 09:34:26 -06:00
Nathan Hjelm
062de45899 Add support for MPI-3.1 MPI_Aint functions
This commit adds support for MPI_Aint_add and MPI_Aint_diff. These
functions are implemented as macros in C (explicitly allowed by
MPI-3.1). The Fortran implementations are a similar mess to the
MPI_Wtime implementations.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-09 09:31:33 -06:00
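A rough idea of what macro-based address arithmetic of this kind can look like in C is sketched below. The names `sketch_aint`, `SKETCH_AINT_ADD`, and `SKETCH_AINT_DIFF` are hypothetical and are not the definitions used in the commit; the sketch also assumes a platform where converting a pointer to `intptr_t` yields a flat byte address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MPI_Aint: a signed integer that can hold an
 * address or the difference between two addresses. */
typedef intptr_t sketch_aint;

/* Address plus displacement, and difference of two addresses, as macros. */
#define SKETCH_AINT_ADD(base, disp) ((sketch_aint) (base) + (sketch_aint) (disp))
#define SKETCH_AINT_DIFF(a, b)      ((sketch_aint) (a) - (sketch_aint) (b))

/* Byte displacement of element i from the start of an int array. */
sketch_aint sketch_element_disp(const int *array, size_t i)
{
    assert(array != NULL);
    sketch_aint base = (sketch_aint) array;
    sketch_aint elem = (sketch_aint) &array[i];
    return SKETCH_AINT_DIFF(elem, base);
}
```

For an `int` array, `sketch_element_disp(array, 2)` yields `2 * sizeof(int)` under the flat-address assumption stated above.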
Nathan Hjelm
6772d32b85 opal/crs: silence clang warnings introduced by coverity fixes
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-08 09:16:13 -06:00
Gilles Gouaillardet
bcdb2d1380 add missing #include
sscanf requires stdio.h
fixes commit open-mpi/ompi@6ca57724c4
2015-06-08 09:13:11 +09:00
Jeff Squyres
4b59be4e4c btl tcp: cosmetic changes and updates
No logic changes.

Update some stale/incorrect comments, fix some indenting and style.
2015-06-06 10:17:20 -07:00
Jeff Squyres
d164fe9bc5 opal_params.c: fix typo in comment
2015-06-06 10:17:20 -07:00
Jeff Squyres
0acec2b676 opal/util/net.c: remove stale comment
Also wrap a long "if" statement -- but make no code logic changes.
2015-06-06 10:17:20 -07:00
Jeff Squyres
6ca57724c4 opal/util/net.c: remove superfluous #include
2015-06-06 10:17:20 -07:00
Jeff Squyres
cddc8945e0 btl_tcp_proc.c: add missing "continue"
Also add another (superfluous but symmetric) continue statement.

This missing "continue" statement allows IPv4 "private network"
matches to fall through and allow IPv6 matches to be made -- thereby
overriding the IPv4 match that was already made.

Fixes #585 (although several of the other issues identified on #585
still exist, the primary / initial bug that was reported there is now
fixed).
2015-06-06 10:17:12 -07:00
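The fall-through described in this commit can be illustrated with a small C sketch. The types and the function `sketch_classify` are hypothetical and do not reproduce `btl_tcp_proc.c`; the `with_continue` flag exists only so that the buggy and fixed behavior can be compared side by side.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_match { SKETCH_NONE, SKETCH_IPV4_PRIVATE, SKETCH_IPV6 };

/* Hypothetical candidate: which kinds of match are possible for it. */
struct sketch_candidate {
    int ipv4_private_ok;
    int ipv6_ok;
};

/* Records one match kind per candidate. With with_continue == 0 the loop
 * body falls through after an IPv4 private-network match, so a later IPv6
 * match overrides it; with with_continue != 0 the IPv4 match is kept. */
void sketch_classify(const struct sketch_candidate *c, size_t n,
                     enum sketch_match *out, int with_continue)
{
    assert(c != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = SKETCH_NONE;
        if (c[i].ipv4_private_ok) {
            out[i] = SKETCH_IPV4_PRIVATE;
            if (with_continue) {
                continue; /* the statement that was missing */
            }
        }
        if (c[i].ipv6_ok) {
            out[i] = SKETCH_IPV6;
            continue; /* superfluous, kept for symmetry */
        }
    }
}
```

For a candidate where both kinds of match are possible, the result is `SKETCH_IPV4_PRIVATE` with the `continue` and `SKETCH_IPV6` without it.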
Ralph Castain
d9f23627fd Add in hwloc 1.11.0rc1 - will overwrite with final version
2015-06-04 15:35:56 -07:00
Rolf vandeVaart
8622b34664 Check for GPU Direct RDMA and leave pinned turned off
2015-06-04 14:25:24 -04:00
Jeff Squyres
347290f785 pml/Makefile.am: add missing file to $(headers)
2015-06-02 20:07:54 -07:00
Gilles Gouaillardet
bf38f82dc2 MPI_Win_{get,set}_info: add missing files
fixes commit open-mpi/ompi@558d34a5c3
2015-06-03 09:04:04 +09:00
Gilles Gouaillardet
7179d442c0 MPI_Win_{attach,detach}: add missing files
fixes commit open-mpi/ompi@9600e2bc63
2015-06-03 09:02:50 +09:00
Gilles Gouaillardet
1d8ce96305 MPI_Win_Create_dynamic: add missing files
fixes commit open-mpi/ompi@f45244d5a5
2015-06-03 09:00:04 +09:00
Jeff Squyres
a55eb5e2c6 Merge pull request #602 from jithinjosepkl/pr/pml-cm-opt
Optimizations to PML-CM
2015-06-02 13:47:10 -05:00
Todd Kordenbrock
a274d2795c coll-portals4: implement collective operations using Portals4 triggered operations
This commit implements the reduce, allreduce, barrier and bcast
collective operations using Portals4 triggered operations.
2015-06-02 11:41:19 -05:00
Howard Pritchard
8bb00824b6 Merge pull request #619 from hppritcha/topic/fix_busted_cray_build
odls/alps: fix busted build for cray.
2015-06-02 09:49:01 -06:00
Nathan Hjelm
69a0e1dd08 comm: fix coverity issues
CID 1269683 Unchecked return value (CHECKED_RETURN)
CID 1269684 Unchecked return value (CHECKED_RETURN)

Use ompi_comm_rank instead of MPI_Comm_rank here. There is no reason to
be using the MPI interface over the internal interface. This should clear
up these issues.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-02 09:40:26 -06:00
Nathan Hjelm
632f829eb7 mpit: fix coverity issues
CID 1047284 Uninitialized scalar variable (UNINIT)
CID 1047285 Uninitialized scalar variable (UNINIT)
CID 1047286 Uninitialized scalar variable (UNINIT)

If a performance variable session has no handles we should be returning MPI_SUCCESS
for MPI_T_pvar_start, MPI_T_pvar_stop, and MPI_T_pvar_reset. The code was returning
an uninitialized value. This commit also updates the error code to return the proper
error on failure.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-02 09:15:53 -06:00
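The class of bug fixed here (returning a value that is only assigned inside a loop that may run zero times) can be sketched in C as below. `sketch_handle`, `sketch_start_all`, and the return codes are hypothetical; whether the real code keeps going after a failing handle is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUCCESS     0
#define SKETCH_ERR_INVALID 1

/* Hypothetical handle: startable says whether starting it may succeed. */
struct sketch_handle {
    int startable;
    int started;
};

/* Starts every handle in a session. The return value is initialized to
 * success up front, so a session with no handles returns success rather
 * than whatever happened to be on the stack. */
int sketch_start_all(struct sketch_handle *handles, size_t count)
{
    int ret = SKETCH_SUCCESS;
    assert(handles != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (!handles[i].startable) {
            ret = SKETCH_ERR_INVALID; /* remember the failure, keep going */
            continue;
        }
        handles[i].started = 1;
    }
    return ret;
}
```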
Nathan Hjelm
472e5635c7 topo/base: fix coverity issue
CID 1295340 Unchecked return value (CHECKED_RETURN)

Check the return code of mca_base_framework_open. If the call fails for some reason
the component array will not be properly defined. This will cause issues in
mca_topo_base_find_available.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-02 08:59:15 -06:00
Howard Pritchard
05325b113e odls/alps: fix busted build for cray.
This commit fixes things broken by commit
ea35e47.

Fixes #616

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Gilles Gouaillardet
558d34a5c3 MPI_Win_{get,set}_info : add Fortran bindings
2015-06-02 17:45:52 +09:00
Gilles Gouaillardet
9600e2bc63 MPI_Win_{attach,detach} : add Fortran bindings
2015-06-02 17:45:44 +09:00
Gilles Gouaillardet
f45244d5a5 MPI_Win_create_dynamic : add Fortran bindings
2015-06-02 17:45:32 +09:00
Nathan Hjelm
16abe2e4f3 Merge pull request #615 from hjelmn/opal_coverity
event/libevent2022: fix coverity issue
2015-06-01 19:33:36 -06:00
Nathan Hjelm
f72b6d45c7 crs/none: fix coverity issues
CID 1301389 Resource leak (RESOURCE_LEAK)

There is no conceivable reason to strdup cr_argv[0] in either
location. Removed the calls to strdup.

CID 741357 Resource leak (RESOURCE_LEAK)

cr_argv was created by opal_argv_split (tmp_argv[0], ' '). Why should
we call opal_argv_join (' ') on this array? Leak fixed by printing out
tmp_argv[0] instead of calling opal_argv_join.

CID 741358 Resource leak (RESOURCE_LEAK)

The code does not handle exec failure correctly. The error should be
communicated to the parent process but the function in question is
only called by the parent. This calls into question some of the
structure of the function in general (like what is the point of
returning the child process id). That said, I will go ahead and add
the opal_argv_free to quiet this error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-01 16:00:51 -06:00
Nathan Hjelm
7e34997746 event/libevent2022: fix coverity issue
CID 1269841 Out-of-bounds access (OVERRUN)

Correct issue. If the string being concatenated fills the remaining
buffer then a \0 is written past the end of the string. In practice
this should never happen but it should be fixed. I re-organized the
code a bit to clear this error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-01 15:38:54 -06:00
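The overrun described in this commit (a terminating `\0` landing one byte past the buffer when the appended string exactly fills the remaining space) can be avoided with a bounds check like the one sketched below. `sketch_bounded_append` is a hypothetical helper written for illustration and is not the libevent code touched by the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Appends src to the NUL-terminated string in dst, where dst has room for
 * cap bytes in total, terminator included. Returns 0 on success and -1 if
 * src plus its terminator does not fit; dst is left unchanged on failure. */
int sketch_bounded_append(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    size_t used = strlen(dst);
    size_t need = strlen(src);
    /* Reserve one byte for the terminator before comparing. */
    if (used >= cap || need > cap - used - 1) {
        return -1;
    }
    memcpy(dst + used, src, need + 1); /* copies the terminator too */
    return 0;
}
```

With an 8-byte buffer holding "abc", appending "defg" succeeds and uses the last byte for the terminator, while appending one more character is rejected instead of writing past the end.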
Ralph Castain
6b93db6a9a Grrr...not sure how this slipped thru
2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184 Remove stale header
2015-05-29 19:24:51 -07:00
rhc54
daa55fd582 Merge pull request #613 from rhc54/topic/listener
Centralize listener connection support
2015-05-29 15:55:19 -07:00
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of CPUs) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Nathan Hjelm
b1c100c402 win_get_info: fix indentation
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 10:16:25 -06:00
Nathan Hjelm
c87ef46599 Merge pull request #612 from hjelmn/opal_coverity
opal coverity fixes
2015-05-29 10:02:17 -06:00
Nathan Hjelm
7b7993e406 pmix/base: fix coverity issue
CID 1269707 Logically dead code (DEADCODE)

Coverity is correct that tmp3 can never be NULL here. Deleted the dead
code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 09:02:56 -06:00
Nathan Hjelm
1d27b1f944 pmix/native: fix coverity issue
CID 1269730 Dereference after null check (FORWARD_NULL)

The code checked for cb == NULL before checking for a callback
function but did not have the same protection around the
OBJ_RELEASE(cb).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:48:15 -06:00
Nathan Hjelm
5e2bc2c662 btl/openib: fix coverity issue
CID 1269821 Dereference null return value (NULL_RETURNS)

This is another false positive that can be silenced by looping on
opal_list_remove_first instead of using both opal_list_is_empty and
opal_list_remove_first.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:44:03 -06:00
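The pattern mentioned in this commit (looping directly on the result of a remove-first call, rather than checking for emptiness and then removing) can be sketched in C as follows. The list type and functions here are hypothetical stand-ins and are not the opal_list API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal singly linked list. */
struct sketch_item {
    struct sketch_item *next;
    int value;
};

struct sketch_list {
    struct sketch_item *head;
};

/* Detaches and returns the first item, or NULL if the list is empty. */
struct sketch_item *sketch_remove_first(struct sketch_list *list)
{
    assert(list != NULL);
    struct sketch_item *item = list->head;
    if (item != NULL) {
        list->head = item->next;
        item->next = NULL;
    }
    return item;
}

/* Drains the list. The NULL test is on the very pointer that gets
 * dereferenced, so there is no separate emptiness check for an analyzer
 * to reason about. Returns the sum of the drained values. */
int sketch_drain_sum(struct sketch_list *list)
{
    int sum = 0;
    struct sketch_item *item;
    while ((item = sketch_remove_first(list)) != NULL) {
        sum += item->value;
    }
    return sum;
}
```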
Nathan Hjelm
65472a383f mca/base: add yes/no as valid values for boolean variables
This commit expands the set of accepted values for boolean values to
include yes/no as synonyms for 1/0.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:41:51 -06:00
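A boolean parser that treats yes/no as synonyms for 1/0 might look roughly like the C sketch below. `sketch_parse_bool` is a hypothetical function, not the mca/base code; exact, case-sensitive matching is an assumption of this sketch, and any other spellings the real parser may accept are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parses a boolean setting. Accepts "1" and "yes" as true, "0" and "no"
 * as false. Returns 0 and stores the result in *out on success, or
 * returns -1 and leaves *out untouched for any other input. */
int sketch_parse_bool(const char *text, int *out)
{
    assert(text != NULL && out != NULL);
    if (strcmp(text, "1") == 0 || strcmp(text, "yes") == 0) {
        *out = 1;
        return 0;
    }
    if (strcmp(text, "0") == 0 || strcmp(text, "no") == 0) {
        *out = 0;
        return 0;
    }
    return -1;
}
```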
Nathan Hjelm
61fe2cc629 win: add support for returning non_locks info key
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:35:01 -06:00