* User defined ops leave the op_type unset which can confuse logic
in a collective component that is trying to convert the op to the
approprate local function.
I am not sure if we should continue to maintain the request support
for FT_CR, but I tried here to simplify the code while maintaining
the same meaning.
Rewrite the ompi_request_complete function to take in account the
with_signal argument. Change the comment to explain the expected
behavior.
Alter all the ompi_request_complete uses to make sure the status of the
request is set before calling ompi_request_complete.
bot🏷️enhancement
Based on current implementation it is faster to use a blocking
send than the non-blocking version. Switch the exchange function
used in the barrier to use the blocking version combined with
the non-blocking version of the receive.
The request code was setting the request as pml_complete before
calling MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE. This was causing
MCA_PML_OB1_SEND_REQUEST_RETURN to be called twice in some cases. The
code now mirrors the recvreq code and only sets the request as pml
complete if the request has not already been freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes two bugs in MPI_Wait_any:
- If all requests are inactive then the sync wait would hang forever
because no requests are attached to the sync.
- The request pointer was pointing to the request before the completed
request which caused the wrong request to be freed or marked inactive.
MPI_Wait_some had a similar issue if all the requests were pending.
These issues were identified by MTT.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Regarding BFO it should be mentionned that this component is currently
unmaintained, and that despite my efforts I could not make it compile
(it would not compile before this patch either).
This fixes a hang caused by the request refactor work. The cm pml was
not updated and was hanging is most cases.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The request.h header is unfortunately included files in the C++
bindings. C++ does not allow assigning from void * to another
pointer without a cast. This commit adds the cast. We can clean this
up when the C++ bindings are deleted.
Fixesopen-mpi/ompi#1707
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).
* Fix the condition to release the request.
This commit adds support for the MPI-3.1 accumulate_ordering info
key. The default value is rar,war,raw,waw and is supported using an
MCA variable flag enumerator.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Add checks to bail out if our precomputed value is less
than needed (we are already at fault).
bot:milestone:v1.10.3
bot:milestone:v2.0
bot🏷️bug
bot:assign: @ggouaillardet
This commit changes the behavior of bml/r2 from conditionally
registering btl progress functions to always registering progress
functions. Any progress function beloning to a btl that is not yet in
use is registered as low-priority. As soon as a proc is added that
will make use of the btl is is re-registered normally.
This works around an issue with some btls. In order to progress a
first message from an unknown peer both ugni and openib need to have
their progress functions called. If either btl is not in use after the
first call to add_procs the callback was never happening. This commit
ensures the btl progress function is called at some point but the
number of progress callbacks is reduced from normal to ensure lower
overhead when a btl is not used. The current ratio is 1 low priority
progress callback for every 8 calls to opal_progress().
Fixesopen-mpi/ompi#1676
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
As more providers get added to libfabric, the default exclude list would need
to be updated.
Instead, we choose to include only the providers known to work by default.
New default:
- include: psm,psm2,gni
- exclude: none
lines up to the next ".fi" -- which for functions that do not implement the corresponding interface
as code will have all eliminated.
Change to delete the man page's content up to the next section header ".SH"
Also in case of make V=1, we'd like to see the command line, too.
Amend OMPI_Affinity_str according to the other man-pages definitions.
There were some old/stale function names in some debugging/verbose
opal_output calls. Use __func__ instead, so that they won't become
stale in the future.
Thanks to Durga Choudhury for pointing out the issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Update external as well
Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
Intel TrueScale and Intel OmniPath, and detect a link in ACTIVE state.
This fix addresses the scenario reported in the below OMPI users email,
including formerly named Qlogic IB, now Intel True scale. Given the
nature of the PSM/PSM2 mtls this fix applies to OmniPath:
https://www.open-mpi.org/community/lists/users/2016/04/29018.php
MPIR-1.0 specifies that the following symbols are only relevant in the
starter process:
- MPIR_Breakpoint
- MPIR_being_debugged
- MPIR_debug_state
- MPIR_debug_abort_string
I.e., the code filling in values in these various symbols was useless
/ never used.
MPIR-1.1 will define that MPIR_being_debugged *is* relevant in MPI
processes. That symbol is currently defined in libopen-rte (which is
currently causing a duplicate symbol error for static builds -- this
commit fixes that error), and is therefore still available for MPI
processes.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
MPI-3.1 says that even if no info keys are set on the file, we need to
return a new, empty info.
Thanks to Lisandro Dalcin for identifying the issue.
Fixesopen-mpi/ompi#1630
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
tear down
HCOLL barrier may not complete if HCOLL progress is not called periodically.
which is the case in HCOLL teardown progress in the finalize.
(cherry picked from commit 793244d75dd94d1d5e0243bcccf6d04318750f3f)
This commit fixes a bad synchronization detection bug that occurs when
mixing MPI_Win_fence() and MPI_Win_lock(). If no communication has
occurred in the fence epoch it is safe to just clear the all_sync
object (it was set up by fence).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
When building an empty datatype (aka. size = 0) because the count of
included datatypes is 0, be less strict on what the arguments are
(allow NULL pointers).
This commit fixes a bug that occurs when ranks are either not mapped
evenly or by something other than core.
Fixesopen-mpi/ompi#1599
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug when sparse groups are in use. Since sparse
group do not actually increment the reference counts of any procs
(they just retain the parent group) it is wrong to decrement the
reference counts of all procs in the group using
ompi_group_decrement_proc_count(). This commit makes the call to
ompi_group_decrement_proc_count() conditional on the group being
dense.
Fixesopen-mpi/ompi#1593
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If during the request completion callback we post another request that
completes right away (such a small send or a match for an unexpected
short message) we will try to complete the second request while holding
the lock for the completion of the first. For performance reasons
(mainly to avoid unlocking and locking the request mutex several times)
we have made the request lock recursive.
There is a potential race condition in MPI_Init() where an orte even
thread could be in a function that uses OPAL_THREAD_LOCK /
OPAL_THREAD_UNLOCK when ompi_mpi_init calls opal_set_using_threads().
Closesopen-mpi/ompi#1586
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This is a follow-on to open-mpi/ompi@7373111: add some comments
explaining why the code is the way it is. Also update a previous
comment.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
MPI_LONG_LONG_INT is a named predefined datatype, so its name is now MPI_LONG_LONG_INT
MPI_LONG_LONG is a synonym of MPI_LONG_LONG_INT, and its name is also MPI_LONG_LONG_INT
Fix CID 72362: Explicit null dereferenced (FORWARD_NULL)
From what I can tell the code @ fcoll_static_file_read_all.c:649
should be setting bytes_per_process[i] to 0 not bytes_per_process.
Fix CID 72361: Explicit null dereferenced (FORWARD_NULL)
Modified check to check for blocklen_per_process non-NULL before
trying to free blocklen_per_process[l]. This is sufficient because
free (NULL) is safe. Also cleaned up the initialization of this an a
couple other arrays. They were allocated with malloc() then
initialized to 0. Changed to used calloc().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 72296: Resource leak (RESOURCE_LEAK):
Changed code to goto exit instead of returning to ensure memory is
freed.
Fix CID 712589: Out-of-bounds read (OVERRUN):
In this loop i and j are identical and always less than
iov_count. The CID was triggered because i was incremented if i was <
iov_count. This meant that if the loop did go on the next iteration
would access an invalid index.
Fix CID 741363: Uninitialized scalar variable (UNINIT):
Allocate tmp_len with calloc to insure every index is initialized.
Fix CID 741364: Uninitialized pointer read (UNINIT):
Allocate recv_types with calloc to ensure all indices are always
initialized. Also added a check to not loop and destroy if recv_types
is NULL.
Also added a NULL check on the allocation of decoded iov. This is not
the cause of CID 126784 but should be fixed.
Fix CID 712588: Out-of-bounds read (OVERRUN):
Similar to CID 712589. Should silence the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit makes it possible to set relative priorities for
components. Before the addition of the patched component there was
only one component that would run on any system but that is no longer
the case. When determining which component to open each component's
query function is called and the one that returns the highest priority
is opened. The default priority of the patcher component is set
slightly higher than the old ptmalloc2/ummunotify component.
This commit fixes a long-standing break in the abstration of the
memory components. ompi_mpi_init.c was referencing the linux malloc
hook initilize function to ensure the hooks are initialized for
libmpi.so. The abstraction break has been fixed by adding a memory
base function that calls the open memory component's malloc hook init
function if it has one. The code is not yet complete but is intended
to support ptmalloc in 2.0.0. In that case the base function will
always call the ptmalloc hook init if exists.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* datatype: Fix a incorrect datatype name of `MPI_UNSIGNED`
Name of predefined datatype for C `unsigned int` gotten by
`MPI_TYPE_GET_NAME` should be `MPI_UNSIGNED`, not `MPI_UNSIGNED_INT`.
* datatype: Fix incorrect datatype names of `MPI_C_BOOL` and `MPI_CXX_*`
Names of predefined datatypes gotten by `MPI_TYPE_GET_NAME` are:
after this commit (correct) | before this commit (incorrect)
-----------------------------------------------------------
MPI_C_BOOL MPI_BOOL
MPI_CXX_BOOL MPI_BOOL
MPI_CXX_FLOAT_COMPLEX MPI_C_FLOAT_COMPLEX
MPI_CXX_DOUBLE_COMPLEX MPI_C_DOUBLE_COMPLEX
MPI_CXX_LONG_DOUBLE_COMPLEX MPI_C_LONG_DOUBLE_COMPLEX
* datatype: Fix a incorrect datatype name of `MPI_2DOUBLE_PRECISION`
Name of the predefined datatype for Fortran two `double precision`
gotten by `MPI_TYPE_GET_NAME` should be `MPI_2DOUBLE_PRECISION`,
not `MPI_2DBLPREC`.
This bug was caused by setting the name to `opal_datatype_t::name`
instead of `ompi_datatype_t::name`.
* datatype: Fix `MPI_UNSIGNED_CHAR` internal flag
`MPI_UNSIGNED_CHAR` is an integer type.
* ompi/cxx: Fix C++ `MPI::LONG_DOUBLE_INT` definition
Just a typo fix. Without this fix, `MPI::MAX_LOC` and `MPI::MIN_LOC`
cannot be used with `MPI::LONG_DOUBLE_INT` in C++ programs.
I know the C++ binding is obsolete, but fixing this is harmless.
* Add FUJITSU copyright
This commit adds the following symbols
MPI_Alloc_mem_cptr_f
MPI_Alloc_mem_cptr_f08
PMPI_Alloc_mem_cptr_f
PMPI_Alloc_mem_cptr_f08
These are implemented in the same way as other `_cptr` routines.
Before this commit, a same PML tag may be used for distinct
communications for long messages. For example, consider a condition
where rank A calls ```MPI_PUT``` targeting rank B and rank B calls
```MPI_GET``` targeting rank A simultaneously.
A PML tag for the ```MPI_PUT``` is acquired on rank A and is used
for the long-message communication from rank A to rank B.
A PML tag for the ```MPI_GET``` is acquired on rank B and is used
for the long-message communication from rank A to rank B.
These two tags may become a same value because they are managed
independently on each rank. This will cause a data corruption.
This commit separates the tag used in a single RMA communication
call, one for communication from an origin to a target, and one
for communication from a target to an origin. A "base" tag
is acquired using ```get_tag``` function and PML tag is caluculated
from the base tag by ```tag_to_target``` and ```tag_to_origin```
function.
`sendcounts`, `sdispls`, and `sendtype(s)` must be ignored
if `MPI_IN_PLACE` is specified for `sendbuf`.
This commit makes the param check code same as the blocking
`ALLTOALL{V|W}` function.
This commit ensures the bml is always enabled whether or not it will
be used. This ensures that any available btls communicate their modex
so that they can be used for one-sided communication.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Added mca parameter to turn progress thread on/off
Add a flag to check if we have btl progress thread.
Added macro for ob1 matching lock.
Update the AUTHORS file.
This commit adds code to detect when procs are unreachable when using
the dynamic add_procs functionality.
Fixes#1501
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Or at least that was the origin of the issue. It turns out
we were freeing the wrong buffer (but as it only happen in the
case of an error we never noticed).
This patch addresses most (if not all) @derbeyn concerns
expressed on #1015. I added checks for the requests allocation
in all functions, ompi_coll_base_free_reqs is called with the
right number of requests, I removed the unnecessary basic_module_comm_t
and use the base_module_comm_t instead, I remove all uses of the
COLL_BASE_BCAST_USE_BLOCKING define, and other minor fixes.
This commit adds a new type of enumerator meant to support flag
values. The enumerator parses comma-delimited strings and matches
each string or value to a list of valid flags. Additionally, the
enumerator does some basic checks to see if 1) a flag is valid in the
enumerator, and 2) if any conflicting flags are specified.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1315298: Resource leak (RESOURCE_LEAK) :
Fix CID 1315300: Resource leak (RESOURCE_LEAK):
Fix CID 1315299: Resource leak (RESOURCE_LEAK):
Fix CID 1315297 (#1 of 1): Resource leak (RESOURCE_LEAK):
Confirmed leaks in error paths. Added the leaked arrays to the
ERR_EXIT macro to ensure they are freed.
Fix CID 1315296 (#1 of 1): Resource leak (RESOURCE_LEAK):
Confirmed leak in error paths. Both the oversub and reqs arrays are
leaked. Free these arrays on error.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1269976 (#1 of 1): Unused value (UNUSED_VALUE):
Fix CID 1269979 (#1 of 1): Unused value (UNUSED_VALUE):
Removed unused variables k_temp1 and k_temp2.
Fix CID 1269981 (#1 of 1): Unused value (UNUSED_VALUE):
Fix CID 1269974 (#1 of 1): Unused value (UNUSED_VALUE):
Removed gotos and use the matched flags to decide whether to return.
Fix CID 715755 (#1 of 1): Dereference null return value (NULL_RETURNS):
This was also a leak. The items on cs->ctl_structures are allocated using OBJ_NEW so they mist be released using OBJ_RELEASE not OBJ_DESTRUCT. Replaced the loop with OPAL_LIST_DESTRUCT().
Fix CID 715776 (#1 of 1): Dereference before null check (REVERSE_INULL):
Rework error path to remove REVERSE_INULL. Also added a free to an error path where it was missing.
Fix CID 1196603 (#1 of 1): Bad bit shift operation (BAD_SHIFT):
Fix CID 1196601 (#1 of 1): Bad bit shift operation (BAD_SHIFT):
Both of these are false positives but it is still worthwhile to fix so they no longer appear. The loop conditional has been updated to use radix_mask_pow instead of radix_mask to quiet these issues.
Fix CID 1269804 (#1 of 1): Argument cannot be negative (NEGATIVE_RETURNS):
In general close (-1) is safe but coverity doesn’t like it. Reworked the error path for open to not try to close (-1).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 715744 (#1 of 1): Logically dead code (DEADCODE):
Fix CID 715745 (#1 of 1): Logically dead code (DEADCODE):
The free of scratch_num in either place is defensive programming. Instead of removing the free the conditional around the free has been removed to quiet the warning.
Fix CID 715753 (#1 of 1): Dereference after null check (FORWARD_NULL):
Fix CID 715778 (#1 of 1): Dereference before null check (REVERSE_INULL):
Fixed the conditional to check for collective_alg != NULL instead of collective_alg->functions != NULL.
Fix CID 715749 (#1 of 4): Explicit null dereferenced (FORWARD_NULL):
Updated code to ensure that none of the parse functions are reached with a non-NULL value.
Fix CID 715746 (#1 of 1): Logically dead code (DEADCODE):
Removed dead code.
Fix CID 715768 (#1 of 1): Resource leak (RESOURCE_LEAK):
Fix CID 715769 (#2 of 2): Resource leak (RESOURCE_LEAK):
Fix CID 715772 (#1 of 1): Resource leak (RESOURCE_LEAK):
Move free calls to before error checks to cleanup leak in error paths.
Fix CID 741334 (#1 of 1): Explicit null dereferenced (FORWARD_NULL):
Added a check to ensure temp is not dereferenced if it is NULL.
Fix CID 1196605 (#1 of 1): Bad bit shift operation (BAD_SHIFT):
Fixed overflow in calculation by replacing int mask with 1ul.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1325868 (#1 of 1): Dereference after null check (FORWARD_NULL):
Fix CID 1325869 (#1-2 of 2): Dereference after null check (FORWARD_NULL):
Here reqs can indeed be NULL. Added a check to
ompi_coll_base_free_reqs to prevent dereferencing NULL pointer.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1324726 (#1 of 1): Free of address-of expression (BAD_FREE):
Indeed, if a lock conflicts with the lock_all we will end up trying to
free an invalid pointer.
Fix CID 1328826 (#1 of 1): Dereference after null check (FORWARD_NULL):
This was intentional but it would be a good idea to check for
module->comm being non_NULL to be safe. Also cleaned out some checks
for NULL before free().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>