We've seen this a few times (e.g.,
http://www.open-mpi.org/community/lists/users/2015/06/27057.php
reported via @siegmargross). I'm not entirely sure why it happens --
the best I can come up with is a poorly-synchronized network
filesystem and/or a bug in "make". For example: this code hasn't
changed in forever, and it only happens to users *sometimes*.
Regardless, avoid the error altogether by removing the file before
making the symlink. It should be a symlink anyway, so if something is
already there, it is safe to remove it before re-creating the symlink
that should be there in the first place.
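The remove-then-link idea can be sketched in C with POSIX calls. This is only an illustration: the actual fix lives in the build rules, and the helper name sketch_force_symlink is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: remove whatever is at linkpath, then create the
 * symlink. A missing file is not an error; any other unlink failure is. */
int sketch_force_symlink(const char *target, const char *linkpath)
{
    assert(NULL != target && NULL != linkpath);

    if (0 != unlink(linkpath) && ENOENT != errno) {
        return -1;
    }
    /* With the stale entry gone, symlink() cannot fail with EEXIST unless
     * something else re-creates the path in between. */
    return symlink(target, linkpath);
}
```

Calling the helper twice on the same linkpath succeeds both times, which is the property the commit relies on.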
(cherry picked from commit 0edd265ea045e649c9489e3cb8fdb657800d95c3)
The Portals4 MTL allocates two Portals IDs requesting specific
well-known IDs and assumes that those IDs are allocated. If those IDs
are in use, PtlPTAlloc() will allocate a different ID. This commit
verifies that the requested IDs were allocated.
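A minimal sketch of the check, using a stand-in allocator rather than the real Portals4 API. sketch_pt_alloc, sketch_pt_free and the slot table are hypothetical; the stand-in only mimics the behavior described above (a busy index yields a different one).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PT_SLOTS 8
static bool sketch_pt_in_use[SKETCH_PT_SLOTS];

/* Stand-in allocator (hypothetical): grants the requested index if it is
 * free, otherwise the first free index. Returns 0 on success, -1 if full. */
int sketch_pt_alloc(unsigned int requested, unsigned int *granted)
{
    assert(NULL != granted);
    if (requested < SKETCH_PT_SLOTS && !sketch_pt_in_use[requested]) {
        sketch_pt_in_use[requested] = true;
        *granted = requested;
        return 0;
    }
    for (unsigned int i = 0; i < SKETCH_PT_SLOTS; ++i) {
        if (!sketch_pt_in_use[i]) {
            sketch_pt_in_use[i] = true;
            *granted = i;
            return 0;
        }
    }
    return -1;
}

void sketch_pt_free(unsigned int idx)
{
    assert(idx < SKETCH_PT_SLOTS);
    sketch_pt_in_use[idx] = false;
}

/* Caller side: a successful allocation is not enough; the granted index
 * must equal the well-known index that was requested. */
int sketch_alloc_well_known(unsigned int requested)
{
    unsigned int granted;

    if (0 != sketch_pt_alloc(requested, &granted)) {
        return -1;
    }
    if (granted != requested) {
        sketch_pt_free(granted);   /* do not keep the wrong index */
        return -1;
    }
    return 0;
}
```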
CID 71734 Self assignment (NO_EFFECT)
This code has no effect. The original author of the offending code
does not remember why the self-assignment is there. Fortran
MPI_Win_get_attr tests are working with or without it so remove the
code.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for MPI_Aint_add and MPI_Aint_diff. These
functions are implemented as macros in C (explicitly allowed by
MPI-3.1). The Fortran implementations are a similar mess to the
MPI_Wtime implementations.
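For illustration, address arithmetic of this kind can be written as C macros along the following lines. The names sketch_aint_t, SKETCH_AINT_ADD and SKETCH_AINT_DIFF are hypothetical, and this is a sketch assuming byte-granular addresses, not necessarily the definitions added by this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for MPI_Aint: a signed integer that can hold an
 * address. */
typedef intptr_t sketch_aint_t;

static_assert(sizeof(sketch_aint_t) >= sizeof(void *),
              "address-sized integer must be able to hold a pointer");

/* Add a byte displacement to a base address. */
#define SKETCH_AINT_ADD(base, disp) \
    ((sketch_aint_t) ((char *) (base) + (disp)))

/* Byte distance from addr2 to addr1. */
#define SKETCH_AINT_DIFF(addr1, addr2) \
    ((sketch_aint_t) ((char *) (addr1) - (char *) (addr2)))
```

Doing the arithmetic on char * keeps displacements in bytes regardless of the type of the underlying buffer.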
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Also add another (superfluous but symmetric) continue statement.
This missing "continue" statement allows IPv4 "private network"
matches to fall through and allow IPv6 matches to be made -- thereby
overriding the IPv4 match that was already made.
Fixes #585 (although several of the other issues identified on #585
still exist, the primary / initial bug that was reported there is now
fixed).
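The effect of the missing "continue" can be sketched with a simplified matching loop. The types and names below (sketch_pair, sketch_classify) are hypothetical and much simpler than the real matching code; the point is only the control flow.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_match { SKETCH_NO_MATCH, SKETCH_IPV4_PRIVATE, SKETCH_IPV6 };

/* Hypothetical summary of one local/remote interface pair. */
struct sketch_pair {
    int ipv4_private_ok;   /* an IPv4 private-network match is possible */
    int ipv6_ok;           /* an IPv6 match is possible */
};

void sketch_classify(const struct sketch_pair *pairs, size_t count,
                     enum sketch_match *out)
{
    assert(NULL != pairs && NULL != out);

    for (size_t i = 0; i < count; ++i) {
        out[i] = SKETCH_NO_MATCH;

        if (pairs[i].ipv4_private_ok) {
            out[i] = SKETCH_IPV4_PRIVATE;
            continue;   /* without this, the IPv6 test below would run
                         * and overwrite the IPv4 match */
        }
        if (pairs[i].ipv6_ok) {
            out[i] = SKETCH_IPV6;
            continue;   /* superfluous, kept for symmetry */
        }
    }
}
```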
CID 1269683 Unchecked return value (CHECKED_RETURN)
CID 1269684 Unchecked return value (CHECKED_RETURN)
Use ompi_comm_rank instead of MPI_Comm_rank here. There is no reason to
be using the MPI interface over the internal interface. This should clear
up these issues.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
CID 1047284 Uninitialized scalar variable (UNINIT)
CID 1047285 Uninitialized scalar variable (UNINIT)
CID 1047286 Uninitialized scalar variable (UNINIT)
If a performance variable session has no handles we should be returning MPI_SUCCESS
for MPI_T_pvar_start, MPI_T_pvar_stop, and MPI_T_pvar_reset. The code was returning
an uninitialized value. This commit also updates the error code to return the proper
error on failure.
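The shape of the fix, sketched with hypothetical types: the return value is initialized to success, so a session with no handles never returns garbage. The can_start flag and the SKETCH_* codes are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NO_STARTSTOP = 1 };

/* Hypothetical handle and session types for the sketch. */
struct sketch_handle {
    struct sketch_handle *next;
    int can_start;   /* 0 if this variable cannot be started */
    int started;
};

struct sketch_session {
    struct sketch_handle *handles;   /* NULL when the session is empty */
};

int sketch_start_all(struct sketch_session *session)
{
    int ret = SKETCH_SUCCESS;   /* initialized: empty session => success */

    assert(NULL != session);

    for (struct sketch_handle *h = session->handles; NULL != h; h = h->next) {
        if (!h->can_start) {
            ret = SKETCH_ERR_NO_STARTSTOP;   /* report a proper error code */
            continue;
        }
        h->started = 1;
    }
    return ret;
}
```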
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
CID 1295340 Unchecked return value (CHECKED_RETURN)
Check the return code of mca_base_framework_open. If the call fails for some reason
the component array will not be properly defined. This will cause issues in
mca_topo_base_find_available.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
CID 1301389 Resource leak (RESOURCE_LEAK)
There is no conceivable reason to strdup cr_argv[0] in either
location. Removed the calls to strdup.
CID 741357 Resource leak (RESOURCE_LEAK)
cr_argv was created by opal_argv_split (tmp_argv[0], ' '). Why should
we call opal_argv_join (' ') on this array? Leak fixed by printing out
tmp_argv[0] instead of calling opal_argv_join.
CID 741358 Resource leak (RESOURCE_LEAK)
The code does not handle exec failure correctly. The error should be
communicated to the parent process but the function in question is
only called by the parent. This calls into question some of the
structure of the function in general (like what is the point of
returning the child process id). That said, I will go ahead and add
the opal_argv_free to quiet this error.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1269841 Out-of-bounds access (OVERRUN)
Correct issue. If the string being concatenated fills the remaining
buffer then a \0 is written past the end of the buffer. In practice
this should never happen but it should be fixed. I re-organized the
code a bit to clear this error.
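A minimal sketch of a bounded append that always leaves room for the terminator. sketch_append is a hypothetical helper, not the re-organized code from this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to the NUL-terminated string in dst without ever writing
 * past dst[size - 1]. Returns the number of bytes appended, not counting
 * the terminator. */
size_t sketch_append(char *dst, size_t size, const char *src)
{
    assert(NULL != dst && NULL != src && size > 0);

    size_t used = strlen(dst);
    if (used >= size - 1) {
        return 0;                    /* no room left for any payload */
    }

    size_t room = size - 1 - used;   /* one byte reserved for '\0' */
    size_t n = strlen(src);
    if (n > room) {
        n = room;                    /* truncate instead of overrunning */
    }

    memcpy(dst + used, src, n);
    dst[used + n] = '\0';            /* always within the buffer */
    return n;
}
```

The key point is that the terminator's byte is subtracted from the available room before copying, so a source string that exactly fills the remaining space cannot push the \0 out of bounds.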
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply hammering through retries. This is somewhat difficult to tune, as you don't want to unnecessarily prolong startup time.
We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.
This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
CID 1269707 Logically dead code (DEADCODE)
Coverity is correct that tmp3 can never be NULL here. Deleted the dead
code.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1269730 Dereference after null check (FORWARD_NULL)
The code checked for cb == NULL before checking for a callback
function but did not have the same protection around the
OBJ_RELEASE(cb).
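The pattern, sketched with a hypothetical reference-counted callback object: the release sits under the same NULL guard as the callback invocation. sketch_cb, sketch_release and sketch_complete are stand-ins, not the OPAL object system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reference-counted callback object. */
struct sketch_cb {
    void (*fn)(int status, void *data);
    void *data;
    int refcount;
};

/* Example callback: stores the status in the int that data points to. */
void sketch_record_status(int status, void *data)
{
    *(int *) data = status;
}

void sketch_release(struct sketch_cb *cb)
{
    assert(NULL != cb && cb->refcount > 0);
    if (0 == --cb->refcount) {
        free(cb);
    }
}

void sketch_complete(struct sketch_cb *cb, int status)
{
    if (NULL != cb) {
        if (NULL != cb->fn) {
            cb->fn(status, cb->data);
        }
        sketch_release(cb);   /* protected by the same NULL check */
    }
}
```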
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1269821 Dereference null return value (NULL_RETURNS)
This is another false positive that can be silenced by looping on
opal_list_remove_first instead of using both opal_list_is_empty and
opal_list_remove_first.
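The loop shape can be sketched with a small singly linked list. The sketch_* names are hypothetical stand-ins for the OPAL list API; the only assumption carried over is that remove-first returns NULL on an empty list.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_item {
    struct sketch_item *next;
    int value;
};

struct sketch_list {
    struct sketch_item *head;
};

/* Detach and return the first item, or NULL if the list is empty. */
struct sketch_item *sketch_list_remove_first(struct sketch_list *list)
{
    assert(NULL != list);

    struct sketch_item *item = list->head;
    if (NULL != item) {
        list->head = item->next;
        item->next = NULL;
    }
    return item;
}

/* Drain the list by looping on the return value itself, so the item is
 * visibly non-NULL wherever it is dereferenced. */
int sketch_drain_sum(struct sketch_list *list)
{
    int sum = 0;
    struct sketch_item *item;

    while (NULL != (item = sketch_list_remove_first(list))) {
        sum += item->value;
    }
    return sum;
}
```

Because the NULL test and the removal are a single expression, a static analyzer no longer has to correlate a separate emptiness check with the later dereference.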
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>