Data transferred by `MPI_BSEND` may corrupt if all of the following
conditions are met.
- The message size is less than the eager limit.
- The `btl_alloc` function in the BTL interface returns `NULL`
for some reason.
- The MPI program overwrites the send buffer after `MPI_BSEND`
returns.
The problem is in the way of pending a send request in ob1 PML.
The `mca_pml_ob1_send_request_start_copy` function retruns
`OMPI_ERR_OUT_OF_RESOURCE` if `mca_bml_base_alloc` function returns
`des = NULL`. In this case, the send request is added to the
`send_pending` list and `MPI_BSEND` returns immediately. Next time
the `mca_pml_ob1_send_request_start_copy` function tries sending,
the user buffer may have been overwritten by the MPI program.
Call hierarchy of `MPI_BSEND`:
```
MPI_Bsend
mca_pml_ob1_send
if (MCA_PML_BASE_SEND_BUFFERED == sendmode)
mca_pml_ob1_isend
MCA_PML_OB1_SEND_REQUEST_START_W_SEQ
mca_pml_ob1_send_request_start_seq
mca_pml_ob1_send_request_start_btl
if (size <= eager_limit)
if (req_send_mode == MCA_PML_BASE_SEND_BUFFERED)
mca_pml_ob1_send_request_start_copy
mca_bml_base_alloc
btl_alloc
if (OMPI_ERR_OUT_OF_RESOURCE == rc)
add_request_to_send_pending
ompi_request_free
```
To solve this problem, we should save the data to the buffer
attached by `MPI_BUFFER_ATTACH` before leaving `MPI_BSEND`.
This problem was introduced by ob1 optimization (commits 2b57f422
and a06e491c) in v1.8 series.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This code is the implementation of Software-base Performance Counters as described in the paper 'Using Software-Base Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.
All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example.
Signed-off-by: David Eberius <deberius@vols.utk.edu>
This commit fixes#4795
- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of
sequence number from uint16_t.
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
This commit fixes a flaw in the eager limit check in pml/ob1. The
check was incorrectly checking if RDMA-only BTLs (BTLs without the
send flag) has a valid eager limit. This commit fixes the check by
adding an additional check for the send flag on the BTL module.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Rework the logic to handle the out-of-sequence fragments on the receiver
side. A large number of OOS messages are still arriving even in single
threaded scenarios.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
If not the pvars will remain valid after the OB1 PML is unloaded, and
any access will segfault (the callbacks associated with the pvar will
point to the memory of the dlclosed module).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
In multithreaded case, it is expensive to release the lock, call the slow match
and retake the lock again just to queue the frag. This patch will eliminate number of
lock taken by queueing the frag right away and return.
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
they are supposed to be unsigned, casting them to a signed
value for all atomic operations is as errorprone as handling
them as signed entities.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Resolves#3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la"
```
Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
This commit fixes a bug that occurs when the btl callback happens before
the rget returns. In this case the fragment has been returned and is no
longer valid. This commit saves the size before calling rget. This is
valid since the BTL is not allowed to change the read size.
Fixes#3821
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds code to allow support for the info assertions added
by mpi-forum/mpi-issues#11. The assertions added are:
mpi_assert_no_any_tag, mpi_assert_no_any_source,
mpi_assert_exact_length, and mpi_assert_allow_overtaking.
This commit also adds support for the mpi_assert_no_any_source and
mpi_assert_allow_overtaking info keys to the ob1 pml.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds the `req_start` member to the `ompi_request_t` struct.
The `MPI_START` and `MPI_STARTALL` routines call this callback function
instead of `MCA_PML_CALL(start(...))`. So components that return
persistent request must set this member to their request objects.
`mca_pml_base_module_t::pml_start` is not deleted because
`MCA_PML_CALL(start(...))` is still used elsewhere across OMPI.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This commit fixes a bug that disabled both the RDMA pipeline and RDMA
protocols in ob1. ob1 was internally caching the values of
opal_leave_pinned and opal_leave_pinned_pipeline at init time. This is
no longer valid as opal_leave_pinned may be set by any call to a btl's
add_procs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
According to the MPI-3.1 p.52 and p.53 (cited below), a request
created by `MPI_*_INIT` but not yet started by `MPI_START` or
`MPI_STARTALL` is inactive therefore `MPI_WAIT` or its friends
must return immediately if such a request is passed.
The current implementation hangs in `MPI_WAIT` and its friends
in such case because a persistent request is initialized as
`req_complete = REQUEST_PENDING`. This commit fixes the
initialization.
Also, this commit fixes internal requests used in `MPI_PROBE`
and `MPI_IPROBE` which was marked wrongly as persistent.
MPI-3.1 p.52:
We shall use the following terminology: A null handle is a handle
with value MPI_REQUEST_NULL. A persistent request and the handle
to it are inactive if the request is not associated with any ongoing
communication (see Section 3.9). A handle is active if it is neither
null nor inactive. An empty status is a status which is set to return
tag = MPI_ANY_TAG, source = MPI_ANY_SOURCE, error = MPI_SUCCESS, and
is also internally configured so that calls to MPI_GET_COUNT,
MPI_GET_ELEMENTS, and MPI_GET_ELEMENTS_X return count = 0 and
MPI_TEST_CANCELLED returns false. We set a status variable to empty
when the value returned by it is not significant. Status is set in
this way so as to prevent errors due to accesses of stale information.
MPI-3.1 p.53:
One is allowed to call MPI_WAIT with a null or inactive request
argument. In this case the operation returns immediately with empty
status.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
`sturct mca_pml_ob1_comm_proc_t`, which is allocated per
connected rank in a communicator, had two paddings after
`expected_sequence` and `send_sequence` by alignments.
By changing the order of the members, the size of
`mca_pml_ob1_comm_proc_t` is reduced by 8 bytes on 64-bit
architectures.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
this is generally done in mca_pml_ob1_recv_request_free(), but this is not invoked
in via mca_pml_ob1_recv(), so do it manually
Thanks Yvan Fournier for the report
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
recvreq->req_recv.req_base.req_type should always be set before invoking
MCA_PML_OB1_RECV_REQUEST_INIT(recvreq, ...) otherwise, the previous type
might be set, and you could end up with MPC_PML_REQUEST_IMPROBE when
MCA_PML_REQUEST_RECV is expected.
Thanks Chris Pattison for the report and test case.
Fixesopen-mpi/ompi#2275
This commit fixes an issue that can occur if a target gets overwhelmed with
requests. This can cause osc/pt2pt to go into deep recursion with a stack
like req_complete_cb -> ompi_osc_pt2pt_callback -> start -> req_complete_cb
-> ... . At small scale this is fine as the recursion depth stays small but
at larger scale we can quickly exhaust the stack processing frag requests.
To fix the issue the request callback now simply puts the request on a
list and returns. The osc/pt2pt progress function then handles the
processing and reposting of the request.
As part of this change osc/pt2pt can now post multiple fragment receive
requests per window. This should help prevent a target from being overwhelmed.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit updates the btl selection logic for the RDMA and RDMA
pipeline protocols to use a btl iff: 1) the btl is also used for eager
messages (high exclusivity), or 2) no other RDMA btl is available on
an endpoint and the pml_ob1_use_all_rdma MCA variable is true. This
fixes a performance regression with shared memory when an RDMA capable
network is available.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
On start we were not correctly resetting all request fields. This was
leading to a double-completion on persistent receives. This commit
updates the base start code to reset the receive req_bytes_packed and
the send request convertor.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes two bugs in pml/ob1:
- Do not called MCA_PML_OB1_PROGRESS_PENDING from
mca_pml_ob1_send_request_start_copy as this may lead to a recursive
call to mca_pml_ob1_send_request_process_pending.
- In mca_pml_ob1_send_request_start_rdma return the rdma frag object
if a btl fragment can not be allocated. This fixes a leak
identified by @abouteiller and @bosilca.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* mpi/start: fix bugs in cm and ob1 start functions
There were several problems with the implementation of start in Open
MPI:
- There are no checks whatsoever on the state of the request(s)
provided to MPI_Start/MPI_Start_all. It is erroneous to provide an
active request to either of these calls. Since we are already
looping over the provided requests there is little overhead in
verifying that the request can be started.
- Both ob1 and cm were always throwing away the request on the
initial call to start and start_all with a particular
request. Subsequent calls would see that the request was
pml_complete and reuse it. This introduced a leak as the initial
request was never freed. Since the only pml request that can
be mpi complete but not pml complete is a buffered send the
code to reallocate the request has been moved. To detect that
a request is indeed mpi complete but not pml complete isend_init
in both cm and ob1 now marks the new request as pml complete.
- If a new request was needed the callbacks on the original request
were not copied over to the new request. This can cause osc/pt2pt
to hang as the incoming message callback is never called.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* osc/pt2pt: add request for gc after starting a new request
Starting a new receive may cause a recursive call into the pt2pt
frag receive function. If this happens and the prior request is
on the garbage collection list it could cause problems. This commit
moves the gc insert until after the new request has been posted.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
The request code was setting the request as pml_complete before
calling MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE. This was causing
MCA_PML_OB1_SEND_REQUEST_RETURN to be called twice in some cases. The
code now mirrors the recvreq code and only sets the request as pml
complete if the request has not already been freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).
* Fix the condition to release the request.