Instead of unconditionally setting the memory hook, only set it when
the memory hooks are both available and have been enabled (e.g.,
opal/mca/memory/linux has decided that it *can* be enabled, and when
the mpi_leave_pinned MCA param is set to 1, or is set to -1 and some
component requested the memory hooks be enabled).
If we set the memory hook when memory hooks are not enabled,
__malloc_hook will be NULL, which will cause problems when
btl_openib_malloc_hook() tries to invoke it.
Fixesopen-mpi/ompi#638.
Also add another (superflous but symmetric) continue statement.
This missing "continue" statement allows IPv4 "private network"
matches to fall through and allow IPv6 matches to be made -- thereby
overriding the IPv4 match that was already made.
Fixes#585 (although several of the other issues identified on #585
still exist, the primary / initial bug that was reported there is now
fixed).
CID 1269821 Dereference null return value (NULL_RETURNS)
This is another false positive that can be silenced by looping on
opal_list_remove_first instead of using both opal_list_is_empty and
opal_list_remove_first.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1301390 Dereference before null check (REVERSE_INULL)
endpoint can not be NULL here. Remove NULL check.
CID 1269836 Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)
CID 1301388 Bad bit shift operation (BAD_SHIFT)
Add ull to integer constants to ensure the math is done in 64-bits not
32.
CID 715749 Explicit null dereferenced (FORWARD_NULL)
As far as I can tell this parser function does not accept a line that
does match key = value. If that is the case then value should never be
NULL. If it is it is a parse error. Updated the code to reflect
this. Also modified the intify function to do something more sane
(strtol vs atoi with hex detection).
CID 1269820 Dereference null return value (NULL_RETURNS)
This is a false positive as strchr will never return NULL here. It
makes sense, though, to quiet the warning by changing the do {} while
() loop to a while () loop.
CID 1269780 Dereference after null check (FORWARD_NULL)
Just return an error if the endpoint's cpc data is NULL.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1269674 Ignoring number of bytes read (CHECKED_RETURN)
Check that we read enough bytes to get a complete async command.
CID 1269793 Missing break in switch (MISSING_BREAK)
Added comment to indicate fall through was intentional.
CID 1269702: Constant variable guards dead code (DEADCODE)
Remove an unused argument to opal_show_help. This will quiet the
coverity issue.
CID 1269675 Ignoring number of bytes read (CHECKED_RETURN)
Check that at least sizeof(int) bytes are read. If this is not the
case then it is an error.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1269931 Uninitialized scalar variable (UNINIT)
Initialize complete async message. This was not a bug but the fix
contributes to valgrind cleanness (uninitialed write).
CID 1269915 Unintended sign extension (SIGN_EXTENSION)
Should never happen. Quieting this by explicitly casting to uint64_t.
CID 1269824 Dereference null return value (NULL_RETURNS)
It is impossible for opal_list_remove_first to return NULL if
opal_list_is_empty returns false. I refactored the code in question to
not use opal_list_is_empty but loop until NULL is returned by
opal_list_remove_first. That will quiet the issue.
CID 1269913 Dereference before null check (REVERSE_INULL)
The storage parameter should never be NULL. The check intended to
check if *storage was NULL not storage.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1269933 Uninitialized scalar variable (UNINIT)
This CID isn't really an error but it is best for both valgrind and
coverity cleanness to not write uninitialized data. Added an
initializer for async_command in btl_openib_component_close.
CID 1269930 Uninitialized scalar variable (UNINIT)
Same as above. Best not to write uninitialized data. Added an
initializer for async_command.
CID 1269701 Logically dead code (DEADCODE)
Coverity is correct. The smallest_pp_qp will always be 0. Changed the
initial value so that the smallest_pp_qp is set as intended. If no
per-per queue pair exists then use the last shared queue pair. This
queue pair should have the smallest message size. This will reduce
buffer waste.
CID 1269713 Logically dead code (DEADCODE)
False positive but easy to silence. The two check are meaningless if
HAVE_XRC is 0 so protect them with #if HAVE_XRC.
CID 1269726 Division or modulo by zero (DIVIDE_BY_ZERO)
Indeed an issue. If we get an invalid value for rd_win then this will
cause a divide-by-zero exception. Added a check to ensure rd_win is >
0. Also updated the help message to reflect this requirement.
CID 1269672 Ignoring number of bytes read (CHECKED_RETURN)
This error was somewhat intentional. Linux parameter files are
probably not empty but it is safer to check the return code of read to
make sure we got something. If 0 bytes are read this code could SEGV
whe running strtoull.
CID 1269836 Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)
Add a range check to read_module_param to ensure we do not
overflow. In the future it might be worthwhile to report an error
because these parameters should never cause overflow in this
calculation.
CID 1269692 Calling risky function (DC.WEAK_CRYPTO)
??? This call was added in 2006 but I see no calls to the rest of the
rand48 family of functions. Anyway, we SHOULD NEVER be calling seed48,
srand, etc because it messes with user code. Removed the call to
seed48.
CID 1269823 Dereference null return value (NULL_RETURNS)
This is likely a false positive. The endpoint lock is being held so no
other thread should be able to remove fragments from the list. Also,
mca_btl_openib_endpoint_post_send should not be removing items from
the list. If a NULL fragment is ever returned it will likely be a
coding error on the part of an Open MPI developer. Added an assert()
to catch this and quiet the coverity error.
CID 1269671 Unchecked return value (CHECKED_RETURN)
Added a check for the return code of mca_btl_openib_endpoint_post_send
to quiet the coverity error. It is unlikely this error path will be
traversed.
CID 1270229 Missing break in switch (MISSING_BREAK)
Add a comment to indicate that the fall-through is intentional.
CID 1269735 Dereference after null check (FORWARD_NULL)
There should always be an endpoint when handling a work
completion. The endpoint is either stored on the fragment or can be
looked up using the immediate data. Move the immediate data code up
and add an assert for a NULL endpoint.
CID 1269740 Dereference after null check (FORWARD_NULL)
CID 1269741 Explicit null dereferenced (FORWARD_NULL)
Similar to CID 1269735 fix.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Followup to open-mpi/ompi@65b66ab: if we're not debugging, then #if
out an entire block so that the compiler doesn't warn about variables
that are assigned and not used.
When using an external libfabric (or really any libfabric newer than
libfabric commit 607e863), we must use fi_getname to determine the local
port of our endpoint. Without this fix, OMPI will hang endlessly
while retransmitting packets to port 0 on the remote host.
In days past, some implementations of Portals4 could not cover all
of memory with a single Memory Descriptor so multiple large
overlapping Memory Descriptors were created. Because none of the
current implementations have this limitation (and no future
implementations should either), this commit removes the overlapping
Memory Descriptors code.
A few uninitialized common symbols are remaining:
common symbols generated by flex :
* opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng
* opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext
* opal/util/show_help_lex.l: opal_show_help_yyleng
* opal/util/show_help_lex.l: opal_show_help_yytext
common symbol generated by "external" hwloc library:
* opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map
When a libibverbs driver returns NULL for its context, it's the Open
MPI libibverbs fake driver. Hence, this device is simply not
supported -- ignore it.
Defer initializing the CPCs until we know that we have devices/ports
to use. This both prevents some useless work at startup when there
are no devices/ports to use, and also prevents librdmacm complaining
that there are no verbs-capable RDMA devices available (e.g., if a
Cisco usNIC device is present, but does not present a verbs RDMA
interface).
Defer initializing the CPCs until we know that we have devices/ports
to use. This both prevents some useless work at startup when there
are no devices/ports to use, and also prevents librdmacm complaining
that there are no verbs-capable RDMA devices available (e.g., if a
Cisco usNIC device is present, but does not present a verbs RDMA
interface).
This commit fixes an assert when trying to cleanup a module we failed
to initialize. There is no protection around the OBJ_DESTRUCT calls so
they will always be called so similarly we should always call
OBJ_CONSTRUCT at init.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The Portals4 spec requires that PtlSetMap() be called before any
Portals4 call other than PtlNIInit(), PtlGetMap() and
PtlGetPhysId(). To satisfy this requirement, this commit delays
interface initialization until the end add_procs() at which time
PtlSetMap() has been called.
The Portals4 BTL is registered with the PML as an RDMA BTL, so
prepare_src() is only used in limited cases. This commit removes
the code path from prepare_src() for unbuffered contiguous buffers
with no PML reserve. This is now handled in register_mem().
In the large message case, the sender issues a PtlMEAppend() in
order to generate events when the receiver issues a PtlGet(). This
commit moves the PtlMEAppend() from mca_btl_portals4_prepare_src()
to mca_btl_portals4_register_mem() which is the way it's done in
BTL 3.0.
This commit fixes a typo in mca_btl_vader_progress_endpoints where
OPAL_THREAD_LOCK was used when OPAL_THREAD_UNLOCK was intended.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes several vagrind errors. Included:
- installdirs did not correctly reinitialize all pointers to NULL
at close. This causes valgrind errors on a subsequent call to
opal_init_tool.
- several opal strings were leaked by opal_deregister_params which
was setting them to NULL instead of letting them be freed by the
MCA variable system.
- move opal_net_init to AFTER the variable system is initialized and
opal's MCA variables have been registered. opal_net_init uses a
variable registered by opal_register_params!
- do not leak ompi_mpi_main_thread when it is allocated by
MPI_T_init_thread.
- do not overwrite ompi_mpi_main_thread if it is already set (by
MPI_T_init_thread).
- mca_base_var: read_files was overwritting mca_base_var_file_list
even if it was non-NULL.
- mca_base_var: set all file global variables to initial states on
finalize.
- btl/vader: decrement enumerator reference count to ensure that it
is freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
On 32-bit architectures loads/stores of fast box headers may take
multiple instructions. This can lead to a data race between the
sender/receiver when reading/writing the sequence number. This can
lead to a situation where the receiver could process incomplete
data. To fix the issue this commit re-orders the fast box header to
put the sequence number and the tag in the same 32-bits to ensure they
are always loaded/stored together.
Fixes#473
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
fi_av_insert() is invoked with a context containing each endpoint
USNIC_NUM_CHANNELS times. If the address on that endpoint fails to
resolve / is unreachable / has some error, we'll therefore get
USNIC_NUM_CHANNELS error completions with that same endpoint. We
therefore only want to warn about the unreachability of (and
OBJ_RELEASE) that endpoint the *first* time.
Fixes CSCut46822.