1
1

22851 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
cebdf0b7c0 Add missing include 2015-06-09 22:08:05 -07:00
Jeff Squyres
fbaf6888f8 ompi/include/Makefile.am: rm mpi_portable_platform.h first
We've seen this a few times (e.g.,
http://www.open-mpi.org/community/lists/users/2015/06/27057.php
reported via @siegmargross).  I'm not entirely sure why it happens --
the best I can come up with is a poorly-synchronized network
filesystem and/or a bug in "make".  For example: this code hasn't
changed in forever, and it only happens to users *sometimes*.

Regardless, avoid the error altogether by removing the file before
making the sym link (it should be a sym link anyway -- if there's
something there, it should be safe to remove it before we re-create
the sym link that should be there in the first place).

(cherry picked from commit 0edd265ea045e649c9489e3cb8fdb657800d95c3)
2015-06-09 17:59:32 -07:00
Ryan Grant
8dd9183496 Merge pull request #630 from tkordenbrock/topic/portals4.verify.PtlPTAlloc.result
mtl-portals4: Verify the result of PtlPTAlloc()
2015-06-09 14:48:04 -06:00
Todd Kordenbrock
b725186768 mtl-portals4: Verify the result of PtlPTAlloc()
The Portals4 MTL allocates two Portals IDs requesting specific
well-known IDs and assumes that those IDs are allocated.  If those IDs
are in use, PtlPTAlloc() will allocate a different ID.  This commit
verifies that the requested IDs were allocated.
2015-06-09 14:43:50 -05:00
Nathan Hjelm
090922887b win_get_attr: fix coverity issues
CID 71734 Self assignment (NO_EFFECT)

This code has no effect. The original author of the offending code
does not remember why the self-assignment is there. Fortran
MPI_Win_get_attr tests are working with or without it so remove the
code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-09 09:34:26 -06:00
Nathan Hjelm
6772d32b85 opal/crs: silence clang warnings introduced by coverity fixes
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-08 09:16:13 -06:00
Gilles Gouaillardet
bcdb2d1380 add missing #include
sscanf requires stdio.h
fixes commit open-mpi/ompi@6ca57724c4
2015-06-08 09:13:11 +09:00
Jeff Squyres
4b59be4e4c btl tcp: cosmetic changes and updates
No logic changes.

Update some stale/incorrect comments, fix some indenting and style.
2015-06-06 10:17:20 -07:00
Jeff Squyres
d164fe9bc5 opal_params.c: fix typo in comment 2015-06-06 10:17:20 -07:00
Jeff Squyres
0acec2b676 opal/util/net.c: remove stale comment
Also wrap a long "if" statement -- but make no code logic changes.
2015-06-06 10:17:20 -07:00
Jeff Squyres
6ca57724c4 opal/util/net.c: remove superflous #include 2015-06-06 10:17:20 -07:00
Jeff Squyres
cddc8945e0 btl_tcp_proc.c: add missing "continue"
Also add another (superflous but symmetric) continue statement.

This missing "continue" statement allows IPv4 "private network"
matches to fall through and allow IPv6 matches to be made -- thereby
overriding the IPv4 match that was already made.

Fixes #585 (although several of the other issues identified on #585
still exist, the primary / initial bug that was reported there is now
fixed).
2015-06-06 10:17:12 -07:00
Rolf vandeVaart
8622b34664 Check for GPU Direct RDMA and leave pinned turned off 2015-06-04 14:25:24 -04:00
Jeff Squyres
347290f785 pml/Makefile.am: add missing file to $(headers) 2015-06-02 20:07:54 -07:00
Gilles Gouaillardet
bf38f82dc2 MPI_Win_{get,set}_info: add missing files
fixes commit open-mpi/ompi@558d34a5c3
2015-06-03 09:04:04 +09:00
Gilles Gouaillardet
7179d442c0 MPI_Win_{attach,detach}: add missing files
fixes commit open-mpi/ompi@9600e2bc63
2015-06-03 09:02:50 +09:00
Gilles Gouaillardet
1d8ce96305 MPI_Win_Create_dynamic: add missing files
fixes commit open-mpi/ompi@f45244d5a5
2015-06-03 09:00:04 +09:00
Jeff Squyres
a55eb5e2c6 Merge pull request #602 from jithinjosepkl/pr/pml-cm-opt
Optimizations to PML-CM
2015-06-02 13:47:10 -05:00
Howard Pritchard
8bb00824b6 Merge pull request #619 from hppritcha/topic/fix_busted_cray_build
odls/alps: fix busted build for cray.
2015-06-02 09:49:01 -06:00
Howard Pritchard
05325b113e odls/alps: fix busted build for cray.
This commit fixes things broken by commit
ea35e47.

Fixes #616

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Gilles Gouaillardet
558d34a5c3 MPI_Win_{get,set}_info : add Fortran bindings 2015-06-02 17:45:52 +09:00
Gilles Gouaillardet
9600e2bc63 MPI_Win_{attach,detach} : add Fortran bindings 2015-06-02 17:45:44 +09:00
Gilles Gouaillardet
f45244d5a5 MPI_Win_create_dynamic : add Fortran bindings 2015-06-02 17:45:32 +09:00
Nathan Hjelm
16abe2e4f3 Merge pull request #615 from hjelmn/opal_coverity
event/libevent2022: fix coverity issue
2015-06-01 19:33:36 -06:00
Nathan Hjelm
f72b6d45c7 crs/none: fix coverity issues
CID 1301389 Resource leak (RESOURCE_LEAK)

There is no conceivable reason to strdup cr_argv[0] in either
location. Removed the calls to strdup.

CID 741357 Resource leak (RESOURCE_LEAK)

cr_argv was created by opal_argv_split (tmp_argv[0], ' '). Why should
we call opal_argv_join (' ') on this array. Leak fixed by printing out
tmp_argv[0] instead of calling opal_argv_join.

CID 741358 Resource leak (RESOURCE_LEAK)

The code does not handle exec failure correctly. The error should be
communicated to the parent process but the function in question is
only called by the parent. This calls into question some of the
structure of the function in general (like what is the point of
returning the child process id). That said, I will go ahead and add
the opal_argv_free to quiet this error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-01 16:00:51 -06:00
Nathan Hjelm
7e34997746 event/libevent2022: fix coverity issue
CID 1269841 Out-of-bounds access (OVERRUN)

Correct issue. If the string being concatingated fills the remaining
buffer then a \0 is written past the end of the string. In practice
this should never happen but it should be fixed. I re-organized the
code a bit to clear this error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-01 15:38:54 -06:00
Ralph Castain
6b93db6a9a Grrr...not sure how this slipped thru 2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184 Remove stale header 2015-05-29 19:24:51 -07:00
rhc54
daa55fd582 Merge pull request #613 from rhc54/topic/listener
Centralize listener connection support
2015-05-29 15:55:19 -07:00
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Nathan Hjelm
b1c100c402 win_get_info: fix indentation
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 10:16:25 -06:00
Nathan Hjelm
c87ef46599 Merge pull request #612 from hjelmn/opal_coverity
opal coverity fixes
2015-05-29 10:02:17 -06:00
Nathan Hjelm
7b7993e406 pmix/base: fix coverity issue
CID 1269707 Logically dead code (DEADCODE)

Coverity is correct that tmp3 can never be NULL here. Deleted the dead
code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 09:02:56 -06:00
Nathan Hjelm
1d27b1f944 pmix/native: fix coverity issue
CID 1269730 Dereference after null check (FORWARD_NULL)

The code checked for cb == NULL before checking for a callback
function but did not have the same protection around the
OBJ_RELEASE(cb).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:48:15 -06:00
Nathan Hjelm
5e2bc2c662 btl/openib: fix coverity issue
CID 1269821 Dereference null return value (NULL_RETURNS)

This is another false positive that can be silenced by looping on
opal_list_remove_first instead of using both opal_list_is_empty and
opal_list_remove_first.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:44:03 -06:00
Nathan Hjelm
65472a383f mca/base: add yes/no as valid values for boolean variables
This commit expands the set of accepted values for boolean values to
include yes/no as synonyms for 1/0.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:41:51 -06:00
Nathan Hjelm
61fe2cc629 win: add support for returning non_locks info key
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:35:01 -06:00
Jeff Squyres
e8a8a6d223 README: remove a weird parenthetical 2015-05-28 18:26:43 -04:00
Nathan Hjelm
0d763ea0bc Merge pull request #611 from hjelmn/opal_coverity
btl/openib: more coverity fixes
2015-05-28 13:01:37 -06:00
Nathan Hjelm
b038eb6434 btl/openib: more coverity fixes
CID 1301390 Dereference before null check (REVERSE_INULL)

endpoint can not be NULL here. Remove NULL check.

CID 1269836 Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)
CID 1301388 Bad bit shift operation (BAD_SHIFT)

Add ull to integer constants to ensure the math is done in 64-bits not
32.

CID 715749 Explicit null dereferenced (FORWARD_NULL)

As far as I can tell this parser function does not accept a line that
does match key = value. If that is the case then value should never be
NULL. If it is it is a parse error. Updated the code to reflect
this. Also modified the intify function to do something more sane
(strtol vs atoi with hex detection).

CID 1269820 Dereference null return value (NULL_RETURNS)

This is a false positive as strchr will never return NULL here. It
makes sense, though, to quiet the warning by changing the do {} while
() loop to a while () loop.

CID 1269780 Dereference after null check (FORWARD_NULL)

Just return an error if the endpoint's cpc data is NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 11:58:17 -06:00
Jeff Squyres
85f0fff189 README: update for the new version numbering scheme 2015-05-28 10:44:58 -07:00
rhc54
83eec67952 Merge pull request #603 from hjelmn/orte_coverity
orte_quit: Remove logically dead code
2015-05-28 08:35:02 -07:00
Nathan Hjelm
9c170a8c00 Merge pull request #608 from hjelmn/opal_coverity
Opal coverity fixes
2015-05-28 09:06:31 -06:00
Nathan Hjelm
ceb319170a btl/openib: fix more coverity issues
CID 1269674 Ignoring number of bytes read (CHECKED_RETURN)

Check that we read enough bytes to get a complete async command.

CID 1269793 Missing break in switch (MISSING_BREAK)

Added comment to indicate fall through was intentional.

CID 1269702: Constant variable guards dead code (DEADCODE)

Remove an unused argument to opal_show_help. This will quiet the
coverity issue.

CID 1269675 Ignoring number of bytes read (CHECKED_RETURN)

Check that at least sizeof(int) bytes are read. If this is not the
case then it is an error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
9353fcea95 crs/base: fix coverity issues
CID 1196720 Resource leak (RESOURCE_LEAK)
CID 1196721 Resource leak (RESOURCE_LEAK)

The code in question does leak loc_token and loc_value. Cleaned up the
code a bit and plugged the leak.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
0e3c32a98a opal/sys_limits: fix coverity issue
CID 996175 Dereference before null check (REVERSE_NULL)

If lims is NULL then we ran out of memory. Return an error and remove
the NULL check at cleanup.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
3edb421adc common/verbs: fix coverity issues
CID 1269864 Resource leak (RESOURCE_LEAK)
CID 1269865 Resource leak (RESOURCE_LEAK)

Slightly refactored the code to remove extra goto statements and
ensure the if_include_list and if_exclude_list are actually released
on success.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
43d678e7ca btl/openib: fix more coverity issues
CID 1269931 Uninitialized scalar variable (UNINIT)

Initialize complete async message. This was not a bug but the fix
contributes to valgrind cleanness (uninitialed write).

CID 1269915 Unintended sign extension (SIGN_EXTENSION)

Should never happen. Quieting this by explicitly casting to uint64_t.

CID 1269824 Dereference null return value (NULL_RETURNS)

It is impossible for opal_list_remove_first to return NULL if
opal_list_is_empty returns false. I refactored the code in question to
not use opal_list_is_empty but loop until NULL is returned by
opal_list_remove_first. That will quiet the issue.

CID 1269913 Dereference before null check (REVERSE_INULL)

The storage parameter should never be NULL. The check intended to
check if *storage was NULL not storage.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
32d4d7b6ea opal/dss: silence coverity issues
CID 1269988 Use after free (USE_AFTER_FREE)
CID 1269987 Use after free (USE_AFTER_FREE)

Both are false positives as convert is always overwritten by the call
to opal_dss_unpack_string(). Set convert to prevent this issue from
re-appearing.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:09 -06:00
Nathan Hjelm
6b86e74218 btl/openib: fix coverity issues
CID 1269933 Uninitialized scalar variable (UNINIT)

This CID isn't really an error but it is best for both valgrind and
coverity cleanness to not write uninitialized data. Added an
initializer for async_command in btl_openib_component_close.

CID 1269930 Uninitialized scalar variable (UNINIT)

Same as above. Best not to write uninitialized data. Added an
initializer for async_command.

CID 1269701 Logically dead code (DEADCODE)

Coverity is correct. The smallest_pp_qp will always be 0. Changed the
initial value so that the smallest_pp_qp is set as intended. If no
per-per queue pair exists then use the last shared queue pair. This
queue pair should have the smallest message size. This will reduce
buffer waste.

CID 1269713 Logically dead code (DEADCODE)

False positive but easy to silence. The two check are meaningless if
HAVE_XRC is 0 so protect them with #if HAVE_XRC.

CID 1269726 Division or modulo by zero (DIVIDE_BY_ZERO)

Indeed an issue. If we get an invalid value for rd_win then this will
cause a divide-by-zero exception. Added a check to ensure rd_win is >
0. Also updated the help message to reflect this requirement.

CID 1269672 Ignoring number of bytes read (CHECKED_RETURN)

This error was somewhat intentional. Linux parameter files are
probably not empty but it is safer to check the return code of read to
make sure we got something. If 0 bytes are read this code could SEGV
whe running strtoull.

CID 1269836 Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)

Add a range check to read_module_param to ensure we do not
overflow. In the future it might be worthwhile to report an error
because these parameters should never cause overflow in this
calculation.

CID 1269692 Calling risky function (DC.WEAK_CRYPTO)

??? This call was added in 2006 but I see no calls to the rest of the
rand48 family of functions. Anyway, we SHOULD NEVER be calling seed48,
srand, etc because it messes with user code. Removed the call to
seed48.

CID 1269823 Dereference null return value (NULL_RETURNS)

This is likely a false positive. The endpoint lock is being held so no
other thread should be able to remove fragments from the list. Also,
mca_btl_openib_endpoint_post_send should not be removing items from
the list. If a NULL fragment is ever returned it will likely be a
coding error on the part of an Open MPI developer. Added an assert()
to catch this and quiet the coverity error.

CID 1269671 Unchecked return value (CHECKED_RETURN)

Added a check for the return code of mca_btl_openib_endpoint_post_send
to quiet the coverity error. It is unlikely this error path will be
traversed.

CID 1270229 Missing break in switch (MISSING_BREAK)

Add a comment to indicate that the fall-through is intentional.

CID 1269735 Dereference after null check (FORWARD_NULL)

There should always be an endpoint when handling a work
completion. The endpoint is either stored on the fragment or can be
looked up using the immediate data. Move the immediate data code up
and add an assert for a NULL endpoint.

CID 1269740 Dereference after null check (FORWARD_NULL)
CID 1269741 Explicit null dereferenced (FORWARD_NULL)

Similar to CID 1269735 fix.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:09 -06:00