
22840 commits

Author SHA1 Message Date
Ralph Castain
d9f23627fd Add in hwloc 1.11.0rc1 - will overwrite with final version 2015-06-04 15:35:56 -07:00
Rolf vandeVaart
8622b34664 Check for GPU Direct RDMA and leave pinned turned off 2015-06-04 14:25:24 -04:00
Jeff Squyres
347290f785 pml/Makefile.am: add missing file to $(headers) 2015-06-02 20:07:54 -07:00
Gilles Gouaillardet
bf38f82dc2 MPI_Win_{get,set}_info: add missing files
fixes commit open-mpi/ompi@558d34a5c3
2015-06-03 09:04:04 +09:00
Gilles Gouaillardet
7179d442c0 MPI_Win_{attach,detach}: add missing files
fixes commit open-mpi/ompi@9600e2bc63
2015-06-03 09:02:50 +09:00
Gilles Gouaillardet
1d8ce96305 MPI_Win_Create_dynamic: add missing files
fixes commit open-mpi/ompi@f45244d5a5
2015-06-03 09:00:04 +09:00
Jeff Squyres
a55eb5e2c6 Merge pull request #602 from jithinjosepkl/pr/pml-cm-opt
Optimizations to PML-CM
2015-06-02 13:47:10 -05:00
Howard Pritchard
8bb00824b6 Merge pull request #619 from hppritcha/topic/fix_busted_cray_build
odls/alps: fix busted build for cray.
2015-06-02 09:49:01 -06:00
Howard Pritchard
05325b113e odls/alps: fix busted build for cray.
This commit fixes things broken by commit
ea35e47.

Fixes #616

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Gilles Gouaillardet
558d34a5c3 MPI_Win_{get,set}_info : add Fortran bindings 2015-06-02 17:45:52 +09:00
Gilles Gouaillardet
9600e2bc63 MPI_Win_{attach,detach} : add Fortran bindings 2015-06-02 17:45:44 +09:00
Gilles Gouaillardet
f45244d5a5 MPI_Win_create_dynamic : add Fortran bindings 2015-06-02 17:45:32 +09:00
Nathan Hjelm
16abe2e4f3 Merge pull request #615 from hjelmn/opal_coverity
event/libevent2022: fix coverity issue
2015-06-01 19:33:36 -06:00
Nathan Hjelm
f72b6d45c7 crs/none: fix coverity issues
CID 1301389 Resource leak (RESOURCE_LEAK)

There is no conceivable reason to strdup cr_argv[0] in either
location. Removed the calls to strdup.

CID 741357 Resource leak (RESOURCE_LEAK)

cr_argv was created by opal_argv_split (tmp_argv[0], ' '). Why should
we call opal_argv_join (' ') on this array? Leak fixed by printing out
tmp_argv[0] instead of calling opal_argv_join.

CID 741358 Resource leak (RESOURCE_LEAK)

The code does not handle exec failure correctly. The error should be
communicated to the parent process but the function in question is
only called by the parent. This calls into question some of the
structure of the function in general (like what is the point of
returning the child process id). That said, I will go ahead and add
the opal_argv_free to quiet this error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-01 16:00:51 -06:00
Nathan Hjelm
7e34997746 event/libevent2022: fix coverity issue
CID 1269841 Out-of-bounds access (OVERRUN)

Correct issue. If the string being concatenated fills the remaining
buffer then a \0 is written past the end of the string. In practice
this should never happen but it should be fixed. I re-organized the
code a bit to clear this error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-06-01 15:38:54 -06:00
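The overrun described in this commit happens when a bounded concatenation writes its '\0' one byte past the buffer. A minimal sketch of an append that always reserves room for the terminator, assuming a caller-supplied buffer and capacity (append_bounded is a hypothetical name, not the libevent function that was changed):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not libevent's code): append src to the
 * NUL-terminated string in buf, whose capacity is bufsize bytes.
 * The result is truncated if needed, and the terminating '\0' always
 * lands inside the buffer, even when src exactly fills the remaining
 * space. Returns the number of bytes copied from src. */
static size_t append_bounded(char *buf, size_t bufsize, const char *src)
{
    size_t used = 0;
    size_t n;

    if (bufsize == 0) {
        return 0;
    }
    /* Find the current end of buf without reading past its capacity. */
    while (used < bufsize && buf[used] != '\0') {
        ++used;
    }
    if (used == bufsize) {
        /* buf was not terminated: terminate it and copy nothing. */
        buf[bufsize - 1] = '\0';
        return 0;
    }
    /* Reserve one byte for the terminator before copying. */
    n = strlen(src);
    if (n > bufsize - used - 1) {
        n = bufsize - used - 1;
    }
    memcpy(buf + used, src, n);
    buf[used + n] = '\0'; /* used + n <= bufsize - 1 */
    return n;
}
```

The key design point is computing the room as `bufsize - used - 1` before copying, so the exact-fill case writes the terminator at the last valid index instead of one past it.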
Ralph Castain
6b93db6a9a Grrr...not sure how this slipped thru 2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184 Remove stale header 2015-05-29 19:24:51 -07:00
rhc54
daa55fd582 Merge pull request #613 from rhc54/topic/listener
Centralize listener connection support
2015-05-29 15:55:19 -07:00
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
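The message notes that leaving the client socket blocking is not enough on its own: the client also needs a sleep delay once the backlog is hit. A minimal client-side sketch of that retry pattern, assuming plain POSIX TCP sockets; connect_with_backoff, the set of retryable errno values, and the 1 ms / 100 ms delays are hypothetical choices, not Open MPI's code:

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: connect a *blocking* TCP socket, pausing with a
 * capped exponential backoff between attempts instead of retrying in a
 * tight loop while the server's accept backlog is full.
 * Returns the connected fd, or -1 after max_tries failures. */
static int connect_with_backoff(const struct sockaddr_in *addr, int max_tries)
{
    long delay_us = 1000;             /* first pause: 1 ms (assumed) */
    const long max_delay_us = 100000; /* cap each pause at 100 ms (assumed) */

    for (int attempt = 0; attempt < max_tries; ++attempt) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            return -1;
        }
        /* The socket stays blocking, so connect() waits for the handshake. */
        if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) == 0) {
            return fd;
        }
        int err = errno;
        close(fd);
        /* Only retry errors that plausibly mean "server is busy". */
        if (err != ECONNREFUSED && err != EAGAIN && err != ETIMEDOUT) {
            return -1;
        }
        if (attempt + 1 < max_tries) {
            struct timespec ts = { delay_us / 1000000L,
                                   (delay_us % 1000000L) * 1000L };
            nanosleep(&ts, NULL);
            delay_us = (delay_us * 2 > max_delay_us) ? max_delay_us
                                                     : delay_us * 2;
        }
    }
    return -1;
}
```

The commit's main change was on the server side (one listening thread per daemon that only reaps accepts); a client backoff like this just limits how hard clients hammer a full backlog while that thread catches up.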
Nathan Hjelm
b1c100c402 win_get_info: fix indentation
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 10:16:25 -06:00
Nathan Hjelm
c87ef46599 Merge pull request #612 from hjelmn/opal_coverity
opal coverity fixes
2015-05-29 10:02:17 -06:00
Nathan Hjelm
7b7993e406 pmix/base: fix coverity issue
CID 1269707 Logically dead code (DEADCODE)

Coverity is correct that tmp3 can never be NULL here. Deleted the dead
code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 09:02:56 -06:00
Nathan Hjelm
1d27b1f944 pmix/native: fix coverity issue
CID 1269730 Dereference after null check (FORWARD_NULL)

The code checked for cb == NULL before checking for a callback
function but did not have the same protection around the
OBJ_RELEASE(cb).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:48:15 -06:00
Nathan Hjelm
5e2bc2c662 btl/openib: fix coverity issue
CID 1269821 Dereference null return value (NULL_RETURNS)

This is another false positive that can be silenced by looping on
opal_list_remove_first instead of using both opal_list_is_empty and
opal_list_remove_first.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:44:03 -06:00
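This fix (and CID 1269824 in 43d678e7ca) replaces an opal_list_is_empty() check followed by opal_list_remove_first() with a loop that stops when remove_first returns NULL. A minimal sketch of that pattern, using a hypothetical stand-in list type rather than opal_list_t:

```c
#include <stddef.h>

/* Hypothetical minimal list standing in for opal_list_t; only the
 * remove-first operation matters for this pattern. */
struct item { struct item *next; int value; };
struct list { struct item *head; };

/* Unlink and return the first item, or NULL when the list is empty. */
static struct item *list_remove_first(struct list *l)
{
    struct item *it = l->head;
    if (it != NULL) {
        l->head = it->next;
        it->next = NULL;
    }
    return it;
}

/* Drain the list by looping until remove_first returns NULL, instead of
 * calling is_empty() and then dereferencing remove_first()'s result.
 * The NULL test now guards the very pointer being dereferenced, which a
 * static analyzer can verify. Returns the sum of the drained values. */
static int drain_and_sum(struct list *l)
{
    int sum = 0;
    struct item *it;
    while ((it = list_remove_first(l)) != NULL) {
        sum += it->value;
    }
    return sum;
}
```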
Nathan Hjelm
65472a383f mca/base: add yes/no as valid values for boolean variables
This commit expands the set of accepted values for boolean values to
include yes/no as synonyms for 1/0.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:41:51 -06:00
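A minimal sketch of a boolean parser that treats yes/no as synonyms for 1/0, as this commit describes. The function name, the exact (case-sensitive) comparison, and the restriction to these four strings are assumptions of the sketch; Open MPI's MCA variable code may accept other spellings as well.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical parser for a boolean parameter string. Returns true and
 * stores the value in *out when s is recognized; returns false otherwise
 * so the caller can report an invalid value. */
static bool parse_bool_value(const char *s, bool *out)
{
    static const char *const truthy[] = { "1", "yes" };
    static const char *const falsy[]  = { "0", "no" };

    for (size_t i = 0; i < sizeof truthy / sizeof truthy[0]; ++i) {
        if (strcmp(s, truthy[i]) == 0) {
            *out = true;
            return true;
        }
    }
    for (size_t i = 0; i < sizeof falsy / sizeof falsy[0]; ++i) {
        if (strcmp(s, falsy[i]) == 0) {
            *out = false;
            return true;
        }
    }
    return false; /* unrecognized: *out is left unchanged */
}
```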
Nathan Hjelm
61fe2cc629 win: add support for returning no_locks info key
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:35:01 -06:00
Jeff Squyres
e8a8a6d223 README: remove a weird parenthetical 2015-05-28 18:26:43 -04:00
Nathan Hjelm
0d763ea0bc Merge pull request #611 from hjelmn/opal_coverity
btl/openib: more coverity fixes
2015-05-28 13:01:37 -06:00
Nathan Hjelm
b038eb6434 btl/openib: more coverity fixes
CID 1301390 Dereference before null check (REVERSE_INULL)

endpoint cannot be NULL here. Remove NULL check.

CID 1269836 Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)
CID 1301388 Bad bit shift operation (BAD_SHIFT)

Add ull to integer constants to ensure the math is done in 64 bits, not
32.

CID 715749 Explicit null dereferenced (FORWARD_NULL)

As far as I can tell this parser function does not accept a line that
does not match key = value. If that is the case then value should never be
NULL. If it is, that is a parse error. Updated the code to reflect
this. Also modified the intify function to do something more sane
(strtol vs atoi with hex detection).

CID 1269820 Dereference null return value (NULL_RETURNS)

This is a false positive as strchr will never return NULL here. It
makes sense, though, to quiet the warning by changing the do {} while
() loop to a while () loop.

CID 1269780 Dereference after null check (FORWARD_NULL)

Just return an error if the endpoint's cpc data is NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 11:58:17 -06:00
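Two of the fixes above fit in a few lines: using strtol, whose base-0 mode detects a 0x prefix, in place of atoi, and adding a ull suffix so a shift is evaluated in 64 bits. The helpers intify and bit_mask below are hypothetical stand-ins, not the openib functions.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an "intify" helper. Base 0 lets strtol
 * detect a 0x/0X prefix (hex) or a leading 0 (octal), unlike atoi, and
 * endptr/errno let us reject trailing garbage and overflow.
 * Returns 0 on success, -1 on a parse error. */
static int intify(const char *str, long *out)
{
    char *end = NULL;
    errno = 0;
    long v = strtol(str, &end, 0);
    if (end == str || *end != '\0' || errno == ERANGE) {
        return -1;
    }
    *out = v;
    return 0;
}

/* With a plain int constant, 1 << 40 would be evaluated in 32-bit int
 * and is undefined; the ull suffix makes the shift happen in 64 bits. */
static uint64_t bit_mask(unsigned shift)
{
    return 1ull << shift; /* caller must keep shift < 64 */
}
```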
Jeff Squyres
85f0fff189 README: update for the new version numbering scheme 2015-05-28 10:44:58 -07:00
rhc54
83eec67952 Merge pull request #603 from hjelmn/orte_coverity
orte_quit: Remove logically dead code
2015-05-28 08:35:02 -07:00
Nathan Hjelm
9c170a8c00 Merge pull request #608 from hjelmn/opal_coverity
Opal coverity fixes
2015-05-28 09:06:31 -06:00
Nathan Hjelm
ceb319170a btl/openib: fix more coverity issues
CID 1269674 Ignoring number of bytes read (CHECKED_RETURN)

Check that we read enough bytes to get a complete async command.

CID 1269793 Missing break in switch (MISSING_BREAK)

Added comment to indicate fall through was intentional.

CID 1269702: Constant variable guards dead code (DEADCODE)

Remove an unused argument to opal_show_help. This will quiet the
coverity issue.

CID 1269675 Ignoring number of bytes read (CHECKED_RETURN)

Check that at least sizeof(int) bytes are read. If this is not the
case then it is an error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
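CID 1269674 above is about not trusting a read() that may return fewer bytes than a full async command. A hypothetical sketch of that check: struct async_cmd and read_async_cmd are illustrative names, and treating any short read as an error (rather than looping to collect the rest) is an assumption that fits small records written atomically to a pipe.

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical async command record; the real struct differs. */
struct async_cmd {
    int type;
    int qp_num;
};

/* Read one complete command from fd. A short read (including 0 bytes
 * at EOF) is reported as failure instead of silently using a partially
 * filled struct; reads interrupted by a signal are retried. */
static bool read_async_cmd(int fd, struct async_cmd *cmd)
{
    ssize_t n;
    do {
        n = read(fd, cmd, sizeof(*cmd));
    } while (n < 0 && errno == EINTR);
    return n == (ssize_t)sizeof(*cmd);
}
```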
Nathan Hjelm
9353fcea95 crs/base: fix coverity issues
CID 1196720 Resource leak (RESOURCE_LEAK)
CID 1196721 Resource leak (RESOURCE_LEAK)

The code in question does leak loc_token and loc_value. Cleaned up the
code a bit and plugged the leak.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
0e3c32a98a opal/sys_limits: fix coverity issue
CID 996175 Dereference before null check (REVERSE_INULL)

If lims is NULL then we ran out of memory. Return an error and remove
the NULL check at cleanup.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
3edb421adc common/verbs: fix coverity issues
CID 1269864 Resource leak (RESOURCE_LEAK)
CID 1269865 Resource leak (RESOURCE_LEAK)

Slightly refactored the code to remove extra goto statements and
ensure the if_include_list and if_exclude_list are actually released
on success.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
43d678e7ca btl/openib: fix more coverity issues
CID 1269931 Uninitialized scalar variable (UNINIT)

Initialize complete async message. This was not a bug but the fix
contributes to valgrind cleanness (uninitialized write).

CID 1269915 Unintended sign extension (SIGN_EXTENSION)

Should never happen. Quieting this by explicitly casting to uint64_t.

CID 1269824 Dereference null return value (NULL_RETURNS)

It is impossible for opal_list_remove_first to return NULL if
opal_list_is_empty returns false. I refactored the code in question to
not use opal_list_is_empty but loop until NULL is returned by
opal_list_remove_first. That will quiet the issue.

CID 1269913 Dereference before null check (REVERSE_INULL)

The storage parameter should never be NULL. The check intended to
check if *storage was NULL not storage.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:10 -06:00
Nathan Hjelm
32d4d7b6ea opal/dss: silence coverity issues
CID 1269988 Use after free (USE_AFTER_FREE)
CID 1269987 Use after free (USE_AFTER_FREE)

Both are false positives as convert is always overwritten by the call
to opal_dss_unpack_string(). Set convert to prevent this issue from
re-appearing.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:09 -06:00
Nathan Hjelm
6b86e74218 btl/openib: fix coverity issues
CID 1269933 Uninitialized scalar variable (UNINIT)

This CID isn't really an error but it is best for both valgrind and
coverity cleanness to not write uninitialized data. Added an
initializer for async_command in btl_openib_component_close.

CID 1269930 Uninitialized scalar variable (UNINIT)

Same as above. Best not to write uninitialized data. Added an
initializer for async_command.

CID 1269701 Logically dead code (DEADCODE)

Coverity is correct. The smallest_pp_qp will always be 0. Changed the
initial value so that the smallest_pp_qp is set as intended. If no
per-peer queue pair exists then use the last shared queue pair. This
queue pair should have the smallest message size. This will reduce
buffer waste.

CID 1269713 Logically dead code (DEADCODE)

False positive but easy to silence. The two checks are meaningless if
HAVE_XRC is 0 so protect them with #if HAVE_XRC.

CID 1269726 Division or modulo by zero (DIVIDE_BY_ZERO)

Indeed an issue. If we get an invalid value for rd_win then this will
cause a divide-by-zero exception. Added a check to ensure rd_win is >
0. Also updated the help message to reflect this requirement.

CID 1269672 Ignoring number of bytes read (CHECKED_RETURN)

This error was somewhat intentional. Linux parameter files are
probably not empty but it is safer to check the return code of read to
make sure we got something. If 0 bytes are read this code could SEGV
when running strtoull.

CID 1269836 Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)

Add a range check to read_module_param to ensure we do not
overflow. In the future it might be worthwhile to report an error
because these parameters should never cause overflow in this
calculation.

CID 1269692 Calling risky function (DC.WEAK_CRYPTO)

??? This call was added in 2006 but I see no calls to the rest of the
rand48 family of functions. Anyway, we SHOULD NEVER be calling seed48,
srand, etc because it messes with user code. Removed the call to
seed48.

CID 1269823 Dereference null return value (NULL_RETURNS)

This is likely a false positive. The endpoint lock is being held so no
other thread should be able to remove fragments from the list. Also,
mca_btl_openib_endpoint_post_send should not be removing items from
the list. If a NULL fragment is ever returned it will likely be a
coding error on the part of an Open MPI developer. Added an assert()
to catch this and quiet the coverity error.

CID 1269671 Unchecked return value (CHECKED_RETURN)

Added a check for the return code of mca_btl_openib_endpoint_post_send
to quiet the coverity error. It is unlikely this error path will be
traversed.

CID 1270229 Missing break in switch (MISSING_BREAK)

Add a comment to indicate that the fall-through is intentional.

CID 1269735 Dereference after null check (FORWARD_NULL)

There should always be an endpoint when handling a work
completion. The endpoint is either stored on the fragment or can be
looked up using the immediate data. Move the immediate data code up
and add an assert for a NULL endpoint.

CID 1269740 Dereference after null check (FORWARD_NULL)
CID 1269741 Explicit null dereferenced (FORWARD_NULL)

Similar to CID 1269735 fix.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:09 -06:00
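CID 1269672 and CID 1269836 above describe checking read()'s return value before handing the buffer to strtoull, and adding a range check in read_module_param. A hypothetical sketch combining the two: read_param, its parameters, and the fall-back-to-default policy are assumptions, not the actual openib code.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical sketch: read an unsigned integer from a small text file
 * such as a kernel module parameter. Returns default_value if the file
 * cannot be opened, if read() returns 0 or an error (so strtoull never
 * sees an empty or unterminated buffer), if no digits are found, or if
 * the value exceeds max_value (a range check that keeps later
 * arithmetic on the result from overflowing). */
static uint64_t read_param(const char *path, uint64_t default_value,
                           uint64_t max_value)
{
    char buf[64];
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return default_value;
    }
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n <= 0) {
        return default_value;
    }
    buf[n] = '\0'; /* always terminate before parsing */

    char *end = NULL;
    unsigned long long v = strtoull(buf, &end, 0);
    if (end == buf || v > max_value) {
        return default_value;
    }
    return (uint64_t)v;
}
```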
Nathan Hjelm
13e0a9da3a sec/base: fix coverity issues
CID 1292483 Uninitialized pointer read (UNINIT)

Initialize the method and credential members of the opal_sec_cred_t to
avoid possible invalid read when calling cleanup_cred.

CID 1292484 Double free (USE_AFTER_FREE)

Set method and credential members to NULL after freeing in
cleanup_cred.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:09 -06:00
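The two sec/base fixes combine into a common idiom: start pointer members at NULL, and reset them to NULL after freeing, so cleanup is safe to call on a fresh struct or twice in a row. A hypothetical sketch; struct cred only loosely mirrors the method and credential members named in the commit and is not the real opal_sec_cred_t.

```c
#include <stdlib.h>

/* Hypothetical credential record; the real opal_sec_cred_t differs. */
struct cred {
    char *method;
    char *credential;
};

/* Free both members and reset them to NULL. Because free(NULL) is a
 * no-op, calling this on a zero-initialized struct, or calling it a
 * second time, is harmless instead of a bad read or a double free. */
static void cleanup_cred(struct cred *c)
{
    if (c == NULL) {
        return;
    }
    free(c->method);
    c->method = NULL;
    free(c->credential);
    c->credential = NULL;
}
```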
Nathan Hjelm
f5389cbb03 opal/keyval: fix coverity issues
CID 1292738 Dereference after null check (FORWARD_NULL)

It is an error if NULL is passed for val in add_to_env_str. Removed
the NULL-check @ keyval_parse.c:253 and added a NULL check and an
error return.

CID 1292737 Logically dead code (DEADCODE)

Coverity is correct, the error code at the end of parse_line_new is
never reached. This means we fail to report parsing errors when
parsing -x and -mca lines in keyval files. I moved the error code into
the loop and removed the checks @ keyval_parse.c:314.

I also named the parse state enum type and updated parse_line_new to
use this type.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-28 08:38:09 -06:00
Gilles Gouaillardet
b896ce7c4a opal/ddt: fix opal_dt_swap_bytes
fix a bug from commit open-mpi/ompi@ef74566734
2015-05-28 16:00:03 +09:00
Edgar Gabriel
aa72e5b2ca fix the selection logic so that it does not overwrite, on the new aggregator side, the list of
aggregators determined by the algorithm.
2015-05-27 22:35:45 -05:00
Jithin Jose
5ba5a9ade2 Offset buffer by datatype true_lb to handle resized datatypes.
- Follow up patch for 56869bff38e264ee91ea68ae2fabfafe9456548e

Signed-off-by: Jithin Jose <jithin.jose@intel.com>
2015-05-27 13:51:05 -07:00
Jeff Squyres
7b9e349498 iopenmpi-nightly-tarball.sh: make v1.10 tarballs 2015-05-27 11:45:40 -07:00
Jithin Jose
c745854d9b Avoid opal_convertor_pack for contiguous data types in MXM mtl
Signed-off-by: Jithin Jose <jithin.jose@intel.com>
2015-05-27 11:09:25 -07:00
Mike Dubman
ae0d577513 Merge pull request #606 from miked-mellanox/topic/fix_confgure_logic
mxm: fix configure logic
2015-05-27 10:44:00 +03:00
Mike Dubman
be5601b5d0 build: fix mxm configure
configure with CPPFLAGS/LDFLAGS can add an empty -L statement from mxm
Thanks to David Shrader <dshrader@lanl.gov> for patch
2015-05-27 09:59:42 +03:00
George Bosilca
ef74566734 Optimize the heterogeneous case when there are multiple
identical types.
2015-05-27 01:07:29 -04:00
George Bosilca
79d5e2a92b Use C99 array initialization. Add a comment about
the limitations of the functions generated with
COPY_TYPE_HETEROGENEOUS.
2015-05-27 00:44:17 -04:00
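The last two commits mention handling a run of identical types in one pass in the heterogeneous path, and switching to C99 array initialization. A hypothetical sketch of both ideas: swap_bytes, enum dt_id, and dt_size are illustrative names, not opal_dt_swap_bytes or the COPY_TYPE_HETEROGENEOUS macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type IDs; illustrative only. */
enum dt_id { DT_INT16, DT_INT32, DT_INT64, DT_MAX };

/* Reverse the byte order of `count` consecutive elements, each `size`
 * bytes wide, so a run of identical types is converted in one call.
 * dst and src must not overlap. */
static void swap_bytes(void *dst, const void *src, size_t size, size_t count)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    for (size_t i = 0; i < count; ++i) {
        for (size_t b = 0; b < size; ++b) {
            d[i * size + b] = s[i * size + (size - 1 - b)];
        }
    }
}

/* C99 designated initializers tie each element size to its type ID,
 * independent of the order in which the enum constants are declared. */
static const size_t dt_size[DT_MAX] = {
    [DT_INT16] = sizeof(int16_t),
    [DT_INT32] = sizeof(int32_t),
    [DT_INT64] = sizeof(int64_t),
};
```

Reversing the bytes of each element's representation gives the same result on big- and little-endian hosts, which is why the example values below are endian-independent.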