This commit fixes some bugs uncovered during thread testing of
2.0.1rc1. With these fixes the component is running cleanly with
threads.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Also, remove the lock/unlock in the file_open ompi-interface routines of romio314.
The global lock in the romio component does probably not work, it is easy to construct a testcase where two threads perform collective I/O operations on different file handles. With a global lock it is easy to deadlock. THe lock has to be at least on the file handle basis.
move the mutex to file/file.c to avoid duplicate symbol problem in file_open.c pfile_open.c
This commit changes the sematics of ompi request callbacks. If a
request's callback has freed or re-posted (using start) a request
the callback must return 1 instead of OMPI_SUCCESS. This indicates
to ompi_request_complete that the request should not be modified
further. This fixes a race condition in osc/pt2pt that could lead
to the req_state being inconsistent if a request is freed between
the callback and setting the request as complete.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The original lock_all algorithm in osc/pt2pt sent a lock message to
each peer in the communicator even if the peer is never the target of
an operation. Since this scales very poorly the implementation has
been replaced by one that locks the remote peer on first communication
after a call to MPI_Win_lock_all.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes an issue that can occur if a target gets overwhelmed with
requests. This can cause osc/pt2pt to go into deep recursion with a stack
like req_complete_cb -> ompi_osc_pt2pt_callback -> start -> req_complete_cb
-> ... . At small scale this is fine as the recursion depth stays small but
at larger scale we can quickly exhaust the stack processing frag requests.
To fix the issue the request callback now simply puts the request on a
list and returns. The osc/pt2pt progress function then handles the
processing and reposting of the request.
As part of this change osc/pt2pt can now post multiple fragment receive
requests per window. This should help prevent a target from being overwhelmed.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit updates the btl selection logic for the RDMA and RDMA
pipeline protocols to use a btl iff: 1) the btl is also used for eager
messages (high exclusivity), or 2) no other RDMA btl is available on
an endpoint and the pml_ob1_use_all_rdma MCA variable is true. This
fixes a performance regression with shared memory when an RDMA capable
network is available.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Clang 5.1 on my mac was a sad panda compiling a couple
of files, complaining about uninitialized stack variables.
This commit makes clang a happier panda (or at least not so sad).
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Based on current implementation it is faster to use a blocking
send than the non-blocking version. Switch the exchange function
used in the barrier to use the blocking version combined with
the non-blocking version of the receive.
This is similar to open-mpi/ompi@223d75595d
It is possible for the start call to complete the requests. For this
reason the module rdma_frag field should be filled in before start is
called. If the request completes the completion callback will reset
the rdma_frag field to NULL. Fixes a bug discovered by @tkordenbrock.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
On start we were not correctly resetting all request fields. This was
leading to a double-completion on persistent receives. This commit
updates the base start code to reset the receive req_bytes_packed and
the send request convertor.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
- move the sort_iovec operations to fcoll/base
- move set_view_internal to common/ompio
- move set_file_default to common/ompio
- remove io_ompio_sort, not used anymore.
This commit expands the OPAL_THREAD macros to include 32- and 64-bit
atomic swap. Additionally, macro declararations have been updated to
include both OPAL_THREAD_* and OPAL_ATOMIC_*. Before this commit the
former was used with add and the later with cmpset.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
allow for toggling of both control/data progress models.
allow for using FI_AV_TABLE or FI_AV_MAP for av type.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
the coll_array functions are truly only used by the fcoll modules, so move
them to fcoll/base. There is currently one exception to that rule (number of aggreagtors
logic), but that function will be moved in a long term also to fcoll/base.
This commit fixes a bug in the RDMA compare-and-swap implementation
that caused the origin value to always be written even if the compare
should have failed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
as reported by Coverity with CIDs 1363349-1363362
Offset temporary buffer when a non zero lower bound datatype is used.
Thanks Hristo Iliev for the report
(cherry picked from commit 0e393195d9)
- correctly handle non commutative operators
- correctly handle non zero lower bound ddt
- correctly handle ddt with size > extent
- revamp NBC_Sched_op so it takes two buffers and matches ompi_op_reduce semantic
- various fix for inter communicators
Thanks Yuki Matsumoto for the report
It was introduced in PR https://github.com/open-mpi/ompi/pull/1228
in particular in commit 041a6a9f53.
Original solution was using "flexible array member" called "mxm_base"
to "fall-through" to the "mxm" send/recv member that located in the
outer structure.
After changing number of elements in "mxm_base" from 0 to 1 we actually
allocating 2 mxm_req_base_t elements which leads to increased overal
size and harms cache performance.
It also brakes "mca_pml_yalla_check_request_state" function.
* If hcoll is given a negative priority, but not enabled=0 then
the module is constructed, but then destructed before calling
it's query(). So the previous pointers are not initialized.
If we try to OBJ_RELEASE them in a debug build an assert will fire.
This commit adds some protection against that and initializes
the _module pointers to NULL.
* Print a verbose message if the component was disqualified because of
a negative priority.
* If a disqualified component provided a module, release it.
* Display list of selected components in priority order
- During the process of volunteering collective functions for a
communicator, print the component name and priority. This will
cause the verbose messages to be displayed in reverse priority
order (lowest priority first, up to highest). This is helpful
when determining which collective components are active in which
order for a given communicator.
To see the messages you need the following MCA parameter set to 9
or higher: `-mca coll_base_verbose 9`
* Adjust verbose for commonly needed verbose output from 10 to 9 to
make it easier to access this information.
This commit makes bml/r2 more restrictive on which endpoints end up in the rdma
endpoint list. Before this commit an endpoint was added if it supported either
put or get. This was done to ensure that endpoints are available for RMA.
Thought it is possible to support put or get endpoints we only currently
support endpoints that have put, get, and amos. bml/r2 now reflects this
support.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Need to increment the total size after checking the local offset not
before. This typo causes large allocations with MPI_Win_allocate() to
fail.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
This commit fixes two bugs in pml/ob1:
- Do not called MCA_PML_OB1_PROGRESS_PENDING from
mca_pml_ob1_send_request_start_copy as this may lead to a recursive
call to mca_pml_ob1_send_request_process_pending.
- In mca_pml_ob1_send_request_start_rdma return the rdma frag object
if a btl fragment can not be allocated. This fixes a leak
identified by @abouteiller and @bosilca.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* mpi/start: fix bugs in cm and ob1 start functions
There were several problems with the implementation of start in Open
MPI:
- There are no checks whatsoever on the state of the request(s)
provided to MPI_Start/MPI_Start_all. It is erroneous to provide an
active request to either of these calls. Since we are already
looping over the provided requests there is little overhead in
verifying that the request can be started.
- Both ob1 and cm were always throwing away the request on the
initial call to start and start_all with a particular
request. Subsequent calls would see that the request was
pml_complete and reuse it. This introduced a leak as the initial
request was never freed. Since the only pml request that can
be mpi complete but not pml complete is a buffered send the
code to reallocate the request has been moved. To detect that
a request is indeed mpi complete but not pml complete isend_init
in both cm and ob1 now marks the new request as pml complete.
- If a new request was needed the callbacks on the original request
were not copied over to the new request. This can cause osc/pt2pt
to hang as the incoming message callback is never called.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* osc/pt2pt: add request for gc after starting a new request
Starting a new receive may cause a recursive call into the pt2pt
frag receive function. If this happens and the prior request is
on the garbage collection list it could cause problems. This commit
moves the gc insert until after the new request has been posted.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Rewrite the ompi_request_complete function to take in account the
with_signal argument. Change the comment to explain the expected
behavior.
Alter all the ompi_request_complete uses to make sure the status of the
request is set before calling ompi_request_complete.
bot🏷️enhancement
Based on current implementation it is faster to use a blocking
send than the non-blocking version. Switch the exchange function
used in the barrier to use the blocking version combined with
the non-blocking version of the receive.
The request code was setting the request as pml_complete before
calling MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE. This was causing
MCA_PML_OB1_SEND_REQUEST_RETURN to be called twice in some cases. The
code now mirrors the recvreq code and only sets the request as pml
complete if the request has not already been freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Regarding BFO it should be mentionned that this component is currently
unmaintained, and that despite my efforts I could not make it compile
(it would not compile before this patch either).
This fixes a hang caused by the request refactor work. The cm pml was
not updated and was hanging is most cases.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).
* Fix the condition to release the request.
This commit changes the behavior of bml/r2 from conditionally
registering btl progress functions to always registering progress
functions. Any progress function beloning to a btl that is not yet in
use is registered as low-priority. As soon as a proc is added that
will make use of the btl is is re-registered normally.
This works around an issue with some btls. In order to progress a
first message from an unknown peer both ugni and openib need to have
their progress functions called. If either btl is not in use after the
first call to add_procs the callback was never happening. This commit
ensures the btl progress function is called at some point but the
number of progress callbacks is reduced from normal to ensure lower
overhead when a btl is not used. The current ratio is 1 low priority
progress callback for every 8 calls to opal_progress().
Fixesopen-mpi/ompi#1676
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
As more providers get added to libfabric, the default exclude list would need
to be updated.
Instead, we choose to include only the providers known to work by default.
New default:
- include: psm,psm2,gni
- exclude: none
There were some old/stale function names in some debugging/verbose
opal_output calls. Use __func__ instead, so that they won't become
stale in the future.
Thanks to Durga Choudhury for pointing out the issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Update external as well
Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
Intel TrueScale and Intel OmniPath, and detect a link in ACTIVE state.
This fix addresses the scenario reported in the below OMPI users email,
including formerly named Qlogic IB, now Intel True scale. Given the
nature of the PSM/PSM2 mtls this fix applies to OmniPath:
https://www.open-mpi.org/community/lists/users/2016/04/29018.php
MPI-3.1 says that even if no info keys are set on the file, we need to
return a new, empty info.
Thanks to Lisandro Dalcin for identifying the issue.
Fixesopen-mpi/ompi#1630
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
tear down
HCOLL barrier may not complete if HCOLL progress is not called periodically.
which is the case in HCOLL teardown progress in the finalize.
(cherry picked from commit 793244d75dd94d1d5e0243bcccf6d04318750f3f)
This commit fixes a bad synchronization detection bug that occurs when
mixing MPI_Win_fence() and MPI_Win_lock(). If no communication has
occurred in the fence epoch it is safe to just clear the all_sync
object (it was set up by fence).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug that occurs when ranks are either not mapped
evenly or by something other than core.
Fixesopen-mpi/ompi#1599
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If during the request completion callback we post another request that
completes right away (such a small send or a match for an unexpected
short message) we will try to complete the second request while holding
the lock for the completion of the first. For performance reasons
(mainly to avoid unlocking and locking the request mutex several times)
we have made the request lock recursive.
Fix CID 72362: Explicit null dereferenced (FORWARD_NULL)
From what I can tell the code @ fcoll_static_file_read_all.c:649
should be setting bytes_per_process[i] to 0 not bytes_per_process.
Fix CID 72361: Explicit null dereferenced (FORWARD_NULL)
Modified check to check for blocklen_per_process non-NULL before
trying to free blocklen_per_process[l]. This is sufficient because
free (NULL) is safe. Also cleaned up the initialization of this an a
couple other arrays. They were allocated with malloc() then
initialized to 0. Changed to used calloc().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 72296: Resource leak (RESOURCE_LEAK):
Changed code to goto exit instead of returning to ensure memory is
freed.
Fix CID 712589: Out-of-bounds read (OVERRUN):
In this loop i and j are identical and always less than
iov_count. The CID was triggered because i was incremented if i was <
iov_count. This meant that if the loop did go on the next iteration
would access an invalid index.
Fix CID 741363: Uninitialized scalar variable (UNINIT):
Allocate tmp_len with calloc to insure every index is initialized.
Fix CID 741364: Uninitialized pointer read (UNINIT):
Allocate recv_types with calloc to ensure all indices are always
initialized. Also added a check to not loop and destroy if recv_types
is NULL.
Also added a NULL check on the allocation of decoded iov. This is not
the cause of CID 126784 but should be fixed.
Fix CID 712588: Out-of-bounds read (OVERRUN):
Similar to CID 712589. Should silence the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Before this commit, a same PML tag may be used for distinct
communications for long messages. For example, consider a condition
where rank A calls ```MPI_PUT``` targeting rank B and rank B calls
```MPI_GET``` targeting rank A simultaneously.
A PML tag for the ```MPI_PUT``` is acquired on rank A and is used
for the long-message communication from rank A to rank B.
A PML tag for the ```MPI_GET``` is acquired on rank B and is used
for the long-message communication from rank A to rank B.
These two tags may become a same value because they are managed
independently on each rank. This will cause a data corruption.
This commit separates the tag used in a single RMA communication
call, one for communication from an origin to a target, and one
for communication from a target to an origin. A "base" tag
is acquired using ```get_tag``` function and PML tag is caluculated
from the base tag by ```tag_to_target``` and ```tag_to_origin```
function.
This commit ensures the bml is always enabled whether or not it will
be used. This ensures that any available btls communicate their modex
so that they can be used for one-sided communication.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>