The coll_array functions are only used by the fcoll modules, so move
them to fcoll/base. There is currently one exception to that rule (the number of aggregators
logic), but that function will also be moved to fcoll/base in the long term.
This commit fixes a bug in the RDMA compare-and-swap implementation
that caused the origin value to always be written even if the compare
should have failed.
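The intended compare-and-swap semantics can be sketched in plain C. This is a single-process model under stated assumptions, not the RDMA code path; the name `cswap64` and its argument order are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of compare-and-swap: the origin value is written to
 * the target only when the target currently equals `compare`. The previous
 * target value is always returned through *result. */
void cswap64(int64_t *target, int64_t compare, int64_t origin,
             int64_t *result)
{
    assert(target != NULL && result != NULL);

    int64_t old = *target;
    if (old == compare) {
        *target = origin;   /* compare succeeded: perform the swap */
    }
    /* compare failed: the target must be left untouched */
    *result = old;
}
```

A failed compare leaves the target unchanged and still reports the old value, which is the behavior the fix restores.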
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
as reported by Coverity with CIDs 1363349-1363362
Offset temporary buffer when a non zero lower bound datatype is used.
Thanks Hristo Iliev for the report
(cherry picked from commit 0e393195d9)
- correctly handle non commutative operators
- correctly handle non zero lower bound ddt
- correctly handle ddt with size > extent
- revamp NBC_Sched_op so it takes two buffers and matches ompi_op_reduce semantics
- various fixes for inter communicators
Thanks Yuki Matsumoto for the report
It was introduced in PR https://github.com/open-mpi/ompi/pull/1228
in particular in commit 041a6a9f53.
The original solution used a "flexible array member" called "mxm_base"
to "fall through" to the "mxm" send/recv member located in the
outer structure.
After changing the number of elements in "mxm_base" from 0 to 1 we are actually
allocating 2 mxm_req_base_t elements, which increases the overall
size and harms cache performance.
It also breaks the "mca_pml_yalla_check_request_state" function.
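The size effect described above can be illustrated with a small sketch. The types below are hypothetical stand-ins, not the yalla structures, and the sketch uses a C99 flexible array member (`base[]`) rather than the zero-length array form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a base request descriptor. */
typedef struct {
    long state;
    long tag;
} req_base_t;

/* Flexible array member: contributes no element storage of its own. */
typedef struct {
    int        flags;
    req_base_t base[];
} hdr_flexible_t;

/* One-element array: always reserves room for one req_base_t. */
typedef struct {
    int        flags;
    req_base_t base[1];
} hdr_one_elem_t;

/* Extra bytes paid per object when the array holds one element. */
size_t extra_bytes(void)
{
    assert(sizeof(hdr_one_elem_t) > sizeof(hdr_flexible_t));
    return sizeof(hdr_one_elem_t) - sizeof(hdr_flexible_t);
}
```

If the outer structure already provides the storage that the array is meant to alias, the one-element form pays for that storage twice.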
* If hcoll is given a negative priority, but not enabled=0 then
the module is constructed, but then destructed before calling
its query(). So the previous pointers are not initialized.
If we try to OBJ_RELEASE them in a debug build an assert will fire.
This commit adds some protection against that and initializes
the _module pointers to NULL.
* Print a verbose message if the component was disqualified because of
a negative priority.
* If a disqualified component provided a module, release it.
* Display list of selected components in priority order
- During the process of volunteering collective functions for a
communicator, print the component name and priority. This will
cause the verbose messages to be displayed in reverse priority
order (lowest priority first, up to highest). This is helpful
when determining which collective components are active in which
order for a given communicator.
To see the messages you need the following MCA parameter set to 9
or higher: `-mca coll_base_verbose 9`
* Adjust verbose for commonly needed verbose output from 10 to 9 to
make it easier to access this information.
This commit makes bml/r2 more restrictive on which endpoints end up in the rdma
endpoint list. Before this commit an endpoint was added if it supported either
put or get. This was done to ensure that endpoints are available for RMA.
Though it is possible to support put-only or get-only endpoints, we currently only
support endpoints that have put, get, and amos. bml/r2 now reflects this
support.
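The change in the eligibility rule can be sketched as follows. The capability bits and the `endpoint_t` type are hypothetical; the real BTL flag names and values differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits for illustration only. */
enum {
    CAP_PUT     = 1 << 0,
    CAP_GET     = 1 << 1,
    CAP_ATOMICS = 1 << 2
};

typedef struct {
    uint32_t caps;
} endpoint_t;

/* Old rule: either put or get was enough. */
bool rdma_eligible_old(const endpoint_t *ep)
{
    assert(ep != NULL);
    return (ep->caps & (CAP_PUT | CAP_GET)) != 0;
}

/* New rule: put, get, and atomics must all be present. */
bool rdma_eligible_new(const endpoint_t *ep)
{
    const uint32_t required = CAP_PUT | CAP_GET | CAP_ATOMICS;

    assert(ep != NULL);
    return (ep->caps & required) == required;
}
```

A put-only endpoint passes the old test but is rejected by the new one.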
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Need to increment the total size after checking the local offset not
before. This typo causes large allocations with MPI_Win_allocate() to
fail.
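One reading of the ordering described above, sketched under assumptions: each rank's segment starts at the sum of the sizes of the lower ranks, so the offset has to be recorded before the running total grows. The helper `local_offset` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the per-rank segment sizes of one shared
 * allocation, return the byte offset of my_rank's segment and store the
 * total size in *total_out. */
size_t local_offset(const size_t *sizes, int nranks, int my_rank,
                    size_t *total_out)
{
    size_t total = 0, my_offset = 0;

    assert(sizes != NULL && total_out != NULL);
    assert(my_rank >= 0 && my_rank < nranks);

    for (int i = 0; i < nranks; ++i) {
        if (i == my_rank) {
            my_offset = total;   /* record the offset first ... */
        }
        total += sizes[i];       /* ... then grow the running total */
    }
    *total_out = total;
    return my_offset;
}
```

Swapping the two statements in the loop body would shift every offset by the rank's own size.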
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
This commit fixes two bugs in pml/ob1:
- Do not call MCA_PML_OB1_PROGRESS_PENDING from
mca_pml_ob1_send_request_start_copy as this may lead to a recursive
call to mca_pml_ob1_send_request_process_pending.
- In mca_pml_ob1_send_request_start_rdma return the rdma frag object
if a btl fragment can not be allocated. This fixes a leak
identified by @abouteiller and @bosilca.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* mpi/start: fix bugs in cm and ob1 start functions
There were several problems with the implementation of start in Open
MPI:
- There are no checks whatsoever on the state of the request(s)
provided to MPI_Start/MPI_Start_all. It is erroneous to provide an
active request to either of these calls. Since we are already
looping over the provided requests there is little overhead in
verifying that the request can be started.
- Both ob1 and cm were always throwing away the request on the
initial call to start and start_all with a particular
request. Subsequent calls would see that the request was
pml_complete and reuse it. This introduced a leak as the initial
request was never freed. Since the only pml request that can
be mpi complete but not pml complete is a buffered send the
code to reallocate the request has been moved. To detect that
a request is indeed mpi complete but not pml complete isend_init
in both cm and ob1 now marks the new request as pml complete.
- If a new request was needed the callbacks on the original request
were not copied over to the new request. This can cause osc/pt2pt
to hang as the incoming message callback is never called.
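The state check from the first item can be sketched as below. The request type, states, and error codes are hypothetical, and validating in a separate first pass is a choice of this sketch so that a rejected call leaves every request unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request states and return codes for illustration only. */
typedef enum { REQ_INACTIVE, REQ_ACTIVE } req_state_t;
enum { START_OK = 0, START_ERR_REQUEST = -1 };

typedef struct {
    req_state_t state;
    int         persistent;
} request_t;

/* Reject the whole call if any request is missing, not persistent, or
 * already active; otherwise mark every request active. */
int start_all(request_t **reqs, size_t count)
{
    assert(reqs != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        if (reqs[i] == NULL || !reqs[i]->persistent ||
            reqs[i]->state == REQ_ACTIVE) {
            return START_ERR_REQUEST;
        }
    }
    for (size_t i = 0; i < count; ++i) {
        reqs[i]->state = REQ_ACTIVE;
    }
    return START_OK;
}
```

Starting the same request twice without completing it in between fails on the second call.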
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* osc/pt2pt: add request for gc after starting a new request
Starting a new receive may cause a recursive call into the pt2pt
frag receive function. If this happens and the prior request is
on the garbage collection list it could cause problems. This commit
moves the gc insert until after the new request has been posted.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Rewrite the ompi_request_complete function to take into account the
with_signal argument. Change the comment to explain the expected
behavior.
Alter all the ompi_request_complete uses to make sure the status of the
request is set before calling ompi_request_complete.
Based on current implementation it is faster to use a blocking
send than the non-blocking version. Switch the exchange function
used in the barrier to use the blocking version combined with
the non-blocking version of the receive.
The request code was setting the request as pml_complete before
calling MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE. This was causing
MCA_PML_OB1_SEND_REQUEST_RETURN to be called twice in some cases. The
code now mirrors the recvreq code and only sets the request as pml
complete if the request has not already been freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Regarding BFO it should be mentioned that this component is currently
unmaintained, and that despite my efforts I could not make it compile
(it would not compile before this patch either).
This fixes a hang caused by the request refactor work. The cm pml was
not updated and was hanging in most cases.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).
* Fix the condition to release the request.
This commit changes the behavior of bml/r2 from conditionally
registering btl progress functions to always registering progress
functions. Any progress function belonging to a btl that is not yet in
use is registered as low-priority. As soon as a proc is added that
will make use of the btl it is re-registered normally.
This works around an issue with some btls. In order to progress a
first message from an unknown peer both ugni and openib need to have
their progress functions called. If either btl is not in use after the
first call to add_procs the callback was never happening. This commit
ensures the btl progress function is called at some point but the
number of progress callbacks is reduced from normal to ensure lower
overhead when a btl is not used. The current ratio is 1 low priority
progress callback for every 8 calls to opal_progress().
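The 1-in-8 ratio can be sketched with a simple call counter. The function and constant names are hypothetical and do not reflect how opal_progress() is actually structured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed interval: one low-priority pass per 8 progress calls. */
#define LOW_PRIORITY_INTERVAL 8u

/* Count one progress call and report whether the low-priority callbacks
 * are due on this call. */
bool low_priority_due(unsigned *call_count)
{
    assert(call_count != NULL);
    return (++*call_count % LOW_PRIORITY_INTERVAL) == 0;
}
```

A progress loop would run the normal callbacks on every call and the low-priority ones only when this returns true, so an unused btl still gets polled at a reduced rate.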
Fixes open-mpi/ompi#1676
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
As more providers get added to libfabric, the default exclude list would need
to be updated.
Instead, we choose to include only the providers known to work by default.
New default:
- include: psm,psm2,gni
- exclude: none
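An include list of this form can be checked with a whole-token match, sketched below. The helper `provider_in_list` is hypothetical and is not the component's actual parsing code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true when `name` matches one whole entry of the comma-separated
 * `list` (for example "psm,psm2,gni"). */
bool provider_in_list(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);

    size_t name_len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        size_t entry_len = strcspn(p, ",");
        if (entry_len == name_len && strncmp(p, name, name_len) == 0) {
            return true;
        }
        p += entry_len;
        if (*p == ',') {
            ++p;   /* skip the separator */
        }
    }
    return false;
}
```

Comparing entry lengths before the bytes keeps "psm" from matching as a prefix of "psm2". With an include list, a provider added to libfabric later stays disabled until it is listed explicitly.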
There were some old/stale function names in some debugging/verbose
opal_output calls. Use __func__ instead, so that they won't become
stale in the future.
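A minimal sketch of the idea, using snprintf in place of opal_output; the macro `FORMAT_DEBUG` and the function `connect_peer` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Format a debug message prefixed with the name of the enclosing function.
 * __func__ is evaluated where the macro is used, so the prefix cannot go
 * stale when a function is renamed. */
#define FORMAT_DEBUG(buf, len, msg) \
    snprintf((buf), (len), "%s: %s", __func__, (msg))

/* Hypothetical function that logs with its own name. */
int connect_peer(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return FORMAT_DEBUG(buf, len, "connecting");
}
```

Here the buffer receives "connect_peer: connecting" without the function name appearing in any string literal.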
Thanks to Durga Choudhury for pointing out the issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Update external as well
Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
Intel TrueScale and Intel OmniPath, and detect a link in ACTIVE state.
This fix addresses the scenario reported in the below OMPI users email,
including formerly named Qlogic IB, now Intel TrueScale. Given the
nature of the PSM/PSM2 mtls this fix applies to OmniPath:
https://www.open-mpi.org/community/lists/users/2016/04/29018.php
MPI-3.1 says that even if no info keys are set on the file, we need to
return a new, empty info.
Thanks to Lisandro Dalcin for identifying the issue.
Fixes open-mpi/ompi#1630
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
tear down
HCOLL barrier may not complete if HCOLL progress is not called periodically,
which is the case during HCOLL teardown in finalize.
(cherry picked from commit 793244d75dd94d1d5e0243bcccf6d04318750f3f)
This commit fixes a bad synchronization detection bug that occurs when
mixing MPI_Win_fence() and MPI_Win_lock(). If no communication has
occurred in the fence epoch it is safe to just clear the all_sync
object (it was set up by fence).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug that occurs when ranks are either not mapped
evenly or by something other than core.
Fixes open-mpi/ompi#1599
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If during the request completion callback we post another request that
completes right away (such as a small send or a match for an unexpected
short message) we will try to complete the second request while holding
the lock for the completion of the first. For performance reasons
(mainly to avoid unlocking and locking the request mutex several times)
we have made the request lock recursive.
Fix CID 72362: Explicit null dereferenced (FORWARD_NULL)
From what I can tell the code @ fcoll_static_file_read_all.c:649
should be setting bytes_per_process[i] to 0 not bytes_per_process.
Fix CID 72361: Explicit null dereferenced (FORWARD_NULL)
Modified check to check for blocklen_per_process non-NULL before
trying to free blocklen_per_process[l]. This is sufficient because
free (NULL) is safe. Also cleaned up the initialization of this and a
couple of other arrays. They were allocated with malloc() then
initialized to 0. Changed to use calloc().
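The pattern can be sketched as follows. The helpers `alloc_rows` and `free_rows` are hypothetical and are not the fcoll code; the sketch relies on calloc() producing NULL pointers, which holds on mainstream platforms where NULL is all-bits-zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n row pointers, all zero-initialized by calloc(). */
int **alloc_rows(size_t n)
{
    assert(n > 0);
    return calloc(n, sizeof(int *));
}

/* Free a partially filled table. Only the outer pointer needs a NULL
 * check: free(NULL) is a no-op, so unset rows are harmless. */
void free_rows(int **rows, size_t n)
{
    if (rows == NULL) {
        return;
    }
    for (size_t i = 0; i < n; ++i) {
        free(rows[i]);
    }
    free(rows);
}
```

An error path can call `free_rows` at any point after the outer allocation was attempted, regardless of how many rows were filled in.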
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 72296: Resource leak (RESOURCE_LEAK):
Changed code to goto exit instead of returning to ensure memory is
freed.
Fix CID 712589: Out-of-bounds read (OVERRUN):
In this loop i and j are identical and always less than
iov_count. The CID was triggered because i was incremented if i was <
iov_count. This meant that if the loop did go on the next iteration
would access an invalid index.
Fix CID 741363: Uninitialized scalar variable (UNINIT):
Allocate tmp_len with calloc to ensure every index is initialized.
Fix CID 741364: Uninitialized pointer read (UNINIT):
Allocate recv_types with calloc to ensure all indices are always
initialized. Also added a check to not loop and destroy if recv_types
is NULL.
Also added a NULL check on the allocation of decoded iov. This is not
the cause of CID 126784 but should be fixed.
Fix CID 712588: Out-of-bounds read (OVERRUN):
Similar to CID 712589. Should silence the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Before this commit, the same PML tag could be used for distinct
communications for long messages. For example, consider a condition
where rank A calls ```MPI_PUT``` targeting rank B and rank B calls
```MPI_GET``` targeting rank A simultaneously.
A PML tag for the ```MPI_PUT``` is acquired on rank A and is used
for the long-message communication from rank A to rank B.
A PML tag for the ```MPI_GET``` is acquired on rank B and is used
for the long-message communication from rank A to rank B.
These two tags may have the same value because they are managed
independently on each rank. This will cause data corruption.
This commit separates the tags used in a single RMA communication
call: one for communication from an origin to a target, and one
for communication from a target to an origin. A "base" tag
is acquired using the ```get_tag``` function and the PML tag is calculated
from the base tag by the ```tag_to_target``` and ```tag_to_origin```
functions.
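One way such a scheme could work is sketched below. The encoding is an assumption made for illustration: base tags are even, the lowest bit selects the direction, and the tag space is taken to be 16 bits. Only the three function names come from the description above; their signatures and the actual layout in osc/pt2pt may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 16-bit tag space; the counter must start at an even value. */
#define TAG_SPACE 0x10000u

/* Hand out the next even base tag, wrapping within the tag space. */
uint32_t get_tag(uint32_t *counter)
{
    assert(counter != NULL);

    uint32_t base = *counter;
    *counter = (base + 2u) % TAG_SPACE;
    return base;
}

/* Tag for data flowing from the origin to the target. */
uint32_t tag_to_target(uint32_t base)
{
    return base;
}

/* Tag for data flowing from the target back to the origin. */
uint32_t tag_to_origin(uint32_t base)
{
    return base | 1u;
}
```

In the put/get scenario above, rank A's put uses `tag_to_target` and rank B's get uses `tag_to_origin`, so the two messages travelling from A to B carry different tags even when both ranks drew the same base tag.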
This commit ensures the bml is always enabled whether or not it will
be used. This ensures that any available btls communicate their modex
so that they can be used for one-sided communication.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Added mca parameter to turn progress thread on/off
Add a flag to check if we have btl progress thread.
Added macro for ob1 matching lock.
Update the AUTHORS file.
This commit adds code to detect when procs are unreachable when using
the dynamic add_procs functionality.
Fixes #1501
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Or at least that was the origin of the issue. It turns out
we were freeing the wrong buffer (but as it only happens in the
case of an error we never noticed).
This patch addresses most (if not all) @derbeyn concerns
expressed on #1015. I added checks for the requests allocation
in all functions, ompi_coll_base_free_reqs is called with the
right number of requests, I removed the unnecessary basic_module_comm_t
and use the base_module_comm_t instead, I remove all uses of the
COLL_BASE_BCAST_USE_BLOCKING define, and other minor fixes.
Fix CID 1315298: Resource leak (RESOURCE_LEAK):
Fix CID 1315300: Resource leak (RESOURCE_LEAK):
Fix CID 1315299: Resource leak (RESOURCE_LEAK):
Fix CID 1315297 (#1 of 1): Resource leak (RESOURCE_LEAK):
Confirmed leaks in error paths. Added the leaked arrays to the
ERR_EXIT macro to ensure they are freed.
Fix CID 1315296 (#1 of 1): Resource leak (RESOURCE_LEAK):
Confirmed leak in error paths. Both the oversub and reqs arrays are
leaked. Free these arrays on error.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1269976 (#1 of 1): Unused value (UNUSED_VALUE):
Fix CID 1269979 (#1 of 1): Unused value (UNUSED_VALUE):
Removed unused variables k_temp1 and k_temp2.
Fix CID 1269981 (#1 of 1): Unused value (UNUSED_VALUE):
Fix CID 1269974 (#1 of 1): Unused value (UNUSED_VALUE):
Removed gotos and use the matched flags to decide whether to return.
Fix CID 715755 (#1 of 1): Dereference null return value (NULL_RETURNS):
This was also a leak. The items on cs->ctl_structures are allocated using OBJ_NEW so they must be released using OBJ_RELEASE not OBJ_DESTRUCT. Replaced the loop with OPAL_LIST_DESTRUCT().
Fix CID 715776 (#1 of 1): Dereference before null check (REVERSE_INULL):
Rework error path to remove REVERSE_INULL. Also added a free to an error path where it was missing.
Fix CID 1196603 (#1 of 1): Bad bit shift operation (BAD_SHIFT):
Fix CID 1196601 (#1 of 1): Bad bit shift operation (BAD_SHIFT):
Both of these are false positives but it is still worthwhile to fix so they no longer appear. The loop conditional has been updated to use radix_mask_pow instead of radix_mask to quiet these issues.
Fix CID 1269804 (#1 of 1): Argument cannot be negative (NEGATIVE_RETURNS):
In general close (-1) is safe but coverity doesn’t like it. Reworked the error path for open to not try to close (-1).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 715744 (#1 of 1): Logically dead code (DEADCODE):
Fix CID 715745 (#1 of 1): Logically dead code (DEADCODE):
The free of scratch_num in either place is defensive programming. Instead of removing the free the conditional around the free has been removed to quiet the warning.
Fix CID 715753 (#1 of 1): Dereference after null check (FORWARD_NULL):
Fix CID 715778 (#1 of 1): Dereference before null check (REVERSE_INULL):
Fixed the conditional to check for collective_alg != NULL instead of collective_alg->functions != NULL.
Fix CID 715749 (#1 of 4): Explicit null dereferenced (FORWARD_NULL):
Updated code to ensure that none of the parse functions are reached with a NULL value.
Fix CID 715746 (#1 of 1): Logically dead code (DEADCODE):
Removed dead code.
Fix CID 715768 (#1 of 1): Resource leak (RESOURCE_LEAK):
Fix CID 715769 (#2 of 2): Resource leak (RESOURCE_LEAK):
Fix CID 715772 (#1 of 1): Resource leak (RESOURCE_LEAK):
Move free calls to before error checks to cleanup leak in error paths.
Fix CID 741334 (#1 of 1): Explicit null dereferenced (FORWARD_NULL):
Added a check to ensure temp is not dereferenced if it is NULL.
Fix CID 1196605 (#1 of 1): Bad bit shift operation (BAD_SHIFT):
Fixed overflow in calculation by replacing int mask with 1ul.
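The difference between an int mask and `1ul` can be shown with a short sketch; the helper `bit_mask` is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Build a single-bit mask. Shifting the int constant 1 by 31 or more is
 * undefined behavior where int is 32 bits wide; starting from 1ul makes
 * the shifted operand an unsigned long instead. */
unsigned long bit_mask(unsigned shift)
{
    assert(shift < sizeof(unsigned long) * CHAR_BIT);
    return 1ul << shift;
}
```

With `1ul`, a shift of 31 yields 2147483648 rather than overflowing a signed int.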
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1325868 (#1 of 1): Dereference after null check (FORWARD_NULL):
Fix CID 1325869 (#1-2 of 2): Dereference after null check (FORWARD_NULL):
Here reqs can indeed be NULL. Added a check to
ompi_coll_base_free_reqs to prevent dereferencing NULL pointer.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1324726 (#1 of 1): Free of address-of expression (BAD_FREE):
Indeed, if a lock conflicts with the lock_all we will end up trying to
free an invalid pointer.
Fix CID 1328826 (#1 of 1): Dereference after null check (FORWARD_NULL):
This was intentional but it would be a good idea to check for
module->comm being non_NULL to be safe. Also cleaned out some checks
for NULL before free().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit rewrites both the mpool and rcache frameworks. Summary of
changes:
- Before this change a significant portion of the rcache
functionality lived in mpool components. This meant that it was
impossible to add a new memory pool to use with rdma networks
(ugni, openib, etc) without duplicating the functionality of an
existing mpool component. All the registration functionality has
been removed from the mpool and placed in the rcache framework.
- All registration cache mpools components (udreg, grdma, gpusm,
rgpusm) have been changed to rcache components. rcaches are
allocated and released in the same way mpool components were.
- It is now valid to pass NULL as the resources argument when
creating an rcache. At this time the gpusm and rgpusm components
support this. All other rcache components require non-NULL
resources.
- A new mpool component has been added: hugepage. This component
supports huge page allocations on linux.
- Memory pools are now allocated using "hints". Each mpool component
is queried with the hints and returns a priority. The current hints
supported are NULL (uses posix_memalign/malloc), page_size=x (huge
page mpool), and mpool=x.
- The sm mpool has been moved to common/sm. This reflects that the sm
mpool is specialized and not meant for any general
allocations. This mpool may be moved back into the mpool framework
if there is any objection.
- The opal_free_list_init arguments have been updated. The unused0
argument is now used to pass in the registration cache module. The
mpool registration flags are now rcache registration flags.
- All components have been updated to make use of the new framework
interfaces.
As this commit makes significant changes to both the mpool and rcache
frameworks both versions have been bumped to 3.0.0.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix segfault due to mca_pml_ob1_cuda_need_buffers not handling the case of the
endpoint not being there. Calling mca_bml_get_endpoint() seems to fix the problem.
Fixes open-mpi/ompi#1402
This commit removes the --with-mpi-thread-multiple option and forces
MPI_THREAD_MULTIPLE support. This cleans up an abstraction violation
in opal where OMPI_ENABLE_THREAD_MULTIPLE determines whether the
opal_using_threads is meaningful. To reduce the performance hit on
MPI_THREAD_SINGLE programs an OPAL_UNLIKELY is used for the
check on opal_using_threads in OPAL_THREAD_* macros.
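The branch-hint idea can be sketched as follows. `UNLIKELY`, `using_threads`, `fake_mutex_t`, and `thread_lock` are hypothetical stand-ins for the OPAL macros and types, and `__builtin_expect` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Branch hint; falls back to the plain expression on other compilers. */
#if defined(__GNUC__)
#define UNLIKELY(expr) __builtin_expect(!!(expr), 0)
#else
#define UNLIKELY(expr) (expr)
#endif

/* Hypothetical stand-ins for the runtime flag and a mutex. */
typedef struct {
    int lock_count;
} fake_mutex_t;

bool using_threads = false;

/* Take the lock only when threads are in use; the hint tells the compiler
 * to lay out the single-threaded path as the fall-through case. */
void thread_lock(fake_mutex_t *m)
{
    assert(m != NULL);
    if (UNLIKELY(using_threads)) {
        m->lock_count++;   /* a real implementation would lock here */
    }
}
```

The check is still evaluated at run time on every call; the hint only affects code layout and branch prediction defaults.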
This commit does not clean up the arguments to the various functions
that take whether multi-threading support is enabled. That should be
done at a later time.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
converting an opal_process_name_t means the loss of one bit,
it was decided to restrict the local job id to 15 bits, so the
useful information of an opal_process_name_t can fit in 63 bits.
This commit fixes several bugs identified by @ggouaillardet and MTT:
- Fix SEGV in long send completion caused by missing update to the
request callback data.
- Add an MPI_Barrier to the fence short-cut. This fixes potential
semantic issues where messages may be received before fence is
reached.
- Ensure fragments are flushed when using request-based RMA. This
allows MPI_Test/MPI_Wait/etc to work as expected.
- Restore the tag space back to 16-bits. It was intended that the
space be expanded to 32-bits but the required change to the
fragment headers was not committed. The tag space may be expanded
in a later commit.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes several bugs identified by a new multi-threaded RMA
benchmarking suite. The following bugs have been identified and fixed:
- The code that signaled the actual start of an access epoch changed
the eager_send_active flag on a synchronization object without
holding the object's lock. This could cause another thread waiting
on eager sends to block indefinitely because the entirety of
ompi_osc_pt2pt_sync_expected could execute between the check of
eager_send_active and the condition wait of
ompi_osc_pt2pt_sync_wait.
- The bookkeeping of fragments could get screwed up when performing
long put/accumulate operations from different threads. This was
caused by the fragment flush code at the end of both put and
accumulate. This code was put in place to avoid sending a large
number of unexpected messages to a peer. To fix the bookkeeping
issue we now 1) wait for eager sends to be active before starting
any large isends, and 2) keep track of the number of large isends
associated with a fragment. If the number of large isends reaches
32 the active fragment is flushed.
- Use atomics to update the large receive/send tag counters. This
prevents duplicate tags from being used. The tag space has also
been updated to use the entire 16-bits of the tag space.
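The atomic tag counter from the last item can be sketched with C11 atomics. The function `next_tag` is hypothetical and does not reflect the actual osc/pt2pt counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag allocator: atomic_fetch_add hands every caller a
 * distinct counter value, so two threads cannot receive the same tag
 * until the 16-bit space wraps around. */
uint16_t next_tag(atomic_uint *counter)
{
    assert(counter != NULL);
    return (uint16_t)(atomic_fetch_add(counter, 1u) & 0xffffu);
}
```

A plain `counter++` would let two threads read the same value before either writes it back, which is how duplicate tags arise.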
These changes should also fix open-mpi/ompi#1299.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds code to handle large unaligned gets. There are two
possible code paths for these transactions:
1) The remote region and local region have the same alignment. In
this case the get will be broken down into at most three get
transactions: 1 transaction to get the unaligned start of the region
(buffered), 1 transaction to get the aligned portion of the region,
and 1 transaction to get the end of the region.
2) The remote and local regions do not have the same alignment. This
should be an uncommon case and is not optimized. In this case a
buffer is allocated and registered locally to hold the aligned data
from the remote region. There may be cases where this fails (low
memory, can't register memory). Those conditions are unlikely and
will be handled later.
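The three-way split in case 1 can be sketched as address arithmetic. The type `get_split_t`, the function `split_get`, and the alignment value passed in are hypothetical; the btl's real alignment requirement is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t head;   /* unaligned bytes before the first aligned address */
    size_t body;   /* aligned middle portion */
    size_t tail;   /* leftover bytes after the last aligned boundary */
} get_split_t;

/* Split [addr, addr + len) around boundaries of `align`, which must be a
 * power of two. */
get_split_t split_get(uint64_t addr, size_t len, size_t align)
{
    get_split_t s = { 0, 0, 0 };

    assert(align != 0 && (align & (align - 1)) == 0);

    size_t misalign = (size_t)(addr & (align - 1));
    s.head = misalign ? align - misalign : 0;
    if (s.head > len) {
        s.head = len;   /* region ends before the next boundary */
    }

    size_t rest = len - s.head;
    s.tail = rest & (align - 1);
    s.body = rest - s.tail;
    return s;
}
```

For a 100-byte get starting at 0x1003 with 8-byte alignment this gives a 5-byte head, an 88-byte body, and a 7-byte tail, matching the "at most three get transactions" described above.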
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If the state of the request is not set to OMPI_REQUEST_ACTIVE
then MPI_Test would immediately signal such request completed
while hcoll may still be working on it.
Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>
If atomics are not globally visible (cpu and nic atomics do not mix)
then a btl endpoint must be used to access local ranks. To avoid
issues that are caused by having the same region registered with
multiple handles osc/rdma was updated to always use the handle for
rank 0. There was a bug in the update that caused osc/rdma to continue
using the local endpoint for accessing the state even though the
pointer/handle are not valid for that endpoint. This commit fixes the
bug.
Fixes open-mpi/ompi#1241.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Optimizing put aggregation in the presence of threads will require a
redesign of the code. For now just ensure that put aggregation is
turned off when MPI_THREAD_MULTIPLE is enabled.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>