This commit is a large update to the osc/rdma component. Included in
this commit:
- Add support for using hardware atomics for fetch-and-op and single
count accumulate when using the accumulate lock. This will improve
the performance of these operations even when not setting the
single intrinsic info key.
- Rework how large accumulates are done. They now block on the get
operation to fix some bugs discovered by an IBM one-sided test. I
may roll back some of the changes if the underlying bug in the
original design is discovered. There appears to be no real
difference (on the hardware this was tested with) in performance so
it's probably a non-issue. References #2530.
- Add support for an additional lock-all algorithm: on-demand. The
on-demand algorithm will attempt to acquire the peer lock when
starting an RMA operation. The lock algorithm default has not
changed. The algorithm can be selected by setting the
osc_rdma_locking_mode MCA variable. The valid values are two_level
and on_demand.
- Make use of the btl_flush function if available. This can improve
performance with some btls.
- When using btl_flush do not keep track of the number of put
operations. This reduces the number of atomic operations in the
critical path.
- Make the window buffers more friendly to multi-threaded
applications. This was done by dropping support for multiple
buffers per MPI window. I intend to re-add that support once the
underlying performance bug under the old buffering scheme is
fixed.
- Fix a bug in request completion in the accumulate, get, and put
paths. This also helps with #2530.
- General code cleanup and fixes.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Data sieving has to occur for any provided offset that is greater than
or equal to zero for this implementation to work correctly.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit fixes an issue observed with romio314 and the hdf5 1.10.x testsuite.
The ADIOI_Datatype_iscontig() routine in romio314/src/io_romio314_module.c
now reports a datatype of size 0 as contiguous, even if the extent
of the datatype is non-zero. This avoids a segmentation fault observed in the
ADIOI_Flatten routine and fixes this particular issue with the hdf5 1.10.x testsuite in
Open MPI with romio314.
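The size-0 rule described above can be sketched as follows. This is not the ROMIO code: toy_type_t and toy_is_contig() are hypothetical names, and "size equals extent" is a simplified stand-in for the real contiguity test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified datatype description. */
typedef struct {
    size_t    size;    /* number of bytes of actual data */
    ptrdiff_t extent;  /* span covered by the type, including holes */
} toy_type_t;

/* A zero-size type is reported as contiguous regardless of its extent,
 * so callers never try to flatten a type that carries no data. */
static bool toy_is_contig(const toy_type_t *t)
{
    if (0 == t->size) {
        return true;
    }
    /* Simplifying assumption: contiguous means no holes. */
    return (ptrdiff_t) t->size == t->extent;
}
```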
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit fixes #4795
- Fixed a typo that sometimes caused a deadlock on change of protocol.
- Redesigned out-of-sequence ordering and addressed the overflow case of
the uint16_t sequence number.
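For illustration only, one common way to order uint16_t sequence numbers across wraparound is modular comparison. This is a sketch, not the code from this commit: seq_distance() and seq_is_before() are hypothetical helpers, and the sketch assumes that fewer than 32768 fragments are outstanding at any time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: how far 'seq' is ahead of 'expected', modulo 2^16. */
static inline uint16_t seq_distance(uint16_t expected, uint16_t seq)
{
    return (uint16_t) (seq - expected);
}

/* Hypothetical helper: true if 'a' precedes 'b' in sequence order.
 * Works across the 65535 -> 0 wrap as long as the two values are
 * less than 32768 apart (assumption stated above). */
static inline bool seq_is_before(uint16_t a, uint16_t b)
{
    uint16_t d = (uint16_t) (b - a);
    return 0 != d && d < 0x8000u;
}
```

With these helpers, sequence number 0 is treated as coming after 65535 rather than before it.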
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
For some of our configurations this flag increases the per-process contribution
by ~20% even though it is currently unused.
The consumer of this flag was communicator ID calculation logic, but it was
changed in 0bf06de3f1444f469303e47752430ec9b423b33f.
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
- Updated ompi_mtl_ofi_progress to use an array to read CQ events up to a
threshold that can be set by the Open MPI user.
- Users can adjust the number of events handled per call to
ompi_mtl_ofi_progress by setting "--mca mtl_ofi_progress_event_cnt #".
- The default number of CQ events that can be read in a single call to
ofi progress is 100, an average based on workload use-case analysis
showing 70-128 as the range of multiple events returned during ofi progress.
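A minimal sketch of the batched read, under stated assumptions: batch_cq_t, batch_cq_read() and batch_progress_once() are hypothetical stand-ins, not the libfabric or MTL code. The mock read returns 0 on an empty queue, which differs from the error code a real provider would return.

```c
#include <stddef.h>

#define BATCH_EVENT_CNT 100  /* mirrors the default described above */

/* Hypothetical stand-ins for a completion queue and its entries. */
typedef struct { size_t pending; } batch_cq_t;
typedef struct { int tag; } batch_event_t;

/* Mock read: copies up to 'count' events into 'buf' and returns the
 * number read (0 when the queue is empty). */
static size_t batch_cq_read(batch_cq_t *cq, batch_event_t *buf, size_t count)
{
    size_t n = cq->pending < count ? cq->pending : count;
    for (size_t i = 0; i < n; ++i) {
        buf[i].tag = (int) i;
    }
    cq->pending -= n;
    return n;
}

/* One progress call: a single read into an array, then one pass over
 * the events that were returned. */
static size_t batch_progress_once(batch_cq_t *cq)
{
    batch_event_t events[BATCH_EVENT_CNT];
    size_t n = batch_cq_read(cq, events, BATCH_EVENT_CNT);
    size_t handled = 0;

    for (size_t i = 0; i < n; ++i) {
        (void) events[i];  /* a real implementation would run the completion callback */
        ++handled;
    }
    return handled;
}
```

With 250 pending events, three calls handle 100, 100, and 50 events respectively.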
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
This commit fixes a flaw in the progress function for libnbc. The
function was unconditionally taking a lock even if there are no
requests to process. This lock was showing up in vtune traces of
multi-threaded benchmarks.
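The idea can be sketched as a cheap check before the lock. This is an illustration, not the libnbc code: nbc_sketch_t and its functions are hypothetical, and the lock is replaced by a counter so the effect is visible.

```c
#include <stdatomic.h>

/* Hypothetical progress state. */
typedef struct {
    atomic_int active_requests;  /* requests that still need progress */
    int        lock_acquisitions; /* counts lock calls, for illustration only */
} nbc_sketch_t;

static void nbc_sketch_init(nbc_sketch_t *s)
{
    atomic_init(&s->active_requests, 0);
    s->lock_acquisitions = 0;
}

/* Stand-ins for a real mutex lock/unlock pair. */
static void nbc_sketch_lock(nbc_sketch_t *s)   { s->lock_acquisitions++; }
static void nbc_sketch_unlock(nbc_sketch_t *s) { (void) s; }

/* Returns the number of requests visited. The lock is only taken when
 * there is at least one request to process. */
static int nbc_sketch_progress(nbc_sketch_t *s)
{
    if (0 == atomic_load(&s->active_requests)) {
        return 0;  /* fast path: nothing to do, no lock taken */
    }

    nbc_sketch_lock(s);
    int visited = atomic_load(&s->active_requests);
    /* ... progress each request while holding the lock ... */
    nbc_sketch_unlock(s);
    return visited;
}
```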
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
After performing the final OBJ_RELEASE on the request,
reset the user-level variable to MPI_REQUEST_NULL.
Otherwise the c_2_f translation step in the fortran
interface fails.
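The pattern looks roughly like this. The sketch uses hypothetical toy_* names in place of the real request type, OBJ_RELEASE and MPI_REQUEST_NULL.

```c
#include <stdlib.h>

/* Hypothetical reference-counted request. */
typedef struct { int refcount; } toy_request_t;

/* Hypothetical sentinel playing the role of MPI_REQUEST_NULL. */
static toy_request_t toy_request_null = { 1 };
#define TOY_REQUEST_NULL (&toy_request_null)

/* Stand-in for OBJ_RELEASE: frees the object when the count hits zero. */
static void toy_release(toy_request_t *r)
{
    if (0 == --r->refcount) {
        free(r);
    }
}

/* Release the request, then reset the caller's handle so that later
 * handle translation never sees a dangling pointer. */
static void toy_request_free(toy_request_t **req)
{
    toy_release(*req);
    *req = TOY_REQUEST_NULL;
}
```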
Fixes issue #4807
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit fixes a flaw in the eager limit check in pml/ob1. The
check was incorrectly checking whether RDMA-only BTLs (BTLs without the
send flag) have a valid eager limit. This commit fixes the check by
adding an additional check for the send flag on the BTL module.
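A minimal sketch of the corrected check, assuming a simplified BTL description. SKETCH_BTL_FLAGS_SEND, sketch_btl_t and the min_eager threshold are hypothetical; the real flag names and the exact validity rule in pml/ob1 may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BTL_FLAGS_SEND 0x1u  /* hypothetical flag values */
#define SKETCH_BTL_FLAGS_PUT  0x2u

/* Hypothetical, simplified BTL module description. */
typedef struct {
    uint32_t flags;
    size_t   eager_limit;
} sketch_btl_t;

/* The eager limit only matters for BTLs that can send. RDMA-only BTLs
 * are skipped instead of being rejected for a small eager limit. */
static bool sketch_eager_limit_ok(const sketch_btl_t *btl, size_t min_eager)
{
    if (0 == (btl->flags & SKETCH_BTL_FLAGS_SEND)) {
        return true;  /* RDMA-only: eager limit is not applicable */
    }
    return btl->eager_limit >= min_eager;
}
```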
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The osc monitoring component needed to include other OSC components'
headers in order to be able to access the communicator through the
component-specific ompi_osc_*_module_t structures. This commit removes
the dependency and resolves issue #4523.
Extend the common monitoring API.
* Now it's possible to translate from local rank to world rank from
both the communicator and the group.
* Remove the useless hashtable, as we directly use the w_group contained
in the window structure.
Add automatic generation at config time.
The templates are expanded at configure time. It creates a new header
file that generates all the variables/functions needed. Adding this
during autogen automagically generates the proper functions for each
of the available modules.
Only keep a generated argv-style array.
Following Jeff's advice, the configure.m4 file generates a simple array
of module variables to be iterated over to find the proper module.
Signed-off-by: Clement Foyer <clement.foyer@inria.fr>
If component selection fails, then module->bases might be unallocated
when ompi_osc_sm_free() is invoked, so test it before trying to free()
module->bases[0].
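The guard amounts to a NULL test before dereferencing. A sketch with a hypothetical guard_module_t, not the actual osc/sm module structure:

```c
#include <stdlib.h>

/* Hypothetical module with an array of base pointers that may never
 * have been allocated if setup failed early. */
typedef struct { void **bases; } guard_module_t;

/* Returns 1 if anything was freed, 0 if there was nothing to free. */
static int guard_module_free(guard_module_t *m)
{
    int freed = 0;

    if (NULL != m->bases) {  /* bases is NULL when selection failed */
        free(m->bases[0]);
        free(m->bases);
        m->bases = NULL;
        freed = 1;
    }
    return freed;
}
```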
Thanks Martin Binder for the report.
Refs open-mpi/ompi#4770
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
An erroneous return statement crept into commit 1885d99,
which leads to some processes not resetting stripe_size
and stripe_count correctly. In 3.0.x this can lead to different
fcoll modules being selected. The impact is not that dramatic on
master and 3.1.x, but could lead to problems as well.
Fixes #4745
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Per MPI 3.1 chapter 13.3 :
"Derived etypes can be constructed by using any of the MPI
datatype constructor routines, provided all resulting typemap
displacements are non-negative and monotonically nondecreasing."
Same restriction applies to ftypes.
Add the OMPI_DATATYPE_CHECK_FOR_VIEW() macro, which
checks that the underlying opal_datatype_t is monotonic, on top
of all checks performed in OMPI_DATATYPE_CHECK_FOR_RECV().
Since checking monotonicity is expensive, the check is only performed
when needed, and the result is cached by ompi_datatype_is_monotonic().
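The check-once-then-cache idea can be sketched as below. mono_type_t and mono_is_monotonic() are hypothetical; the real datatype engine walks its own internal description rather than a flat displacement array, and this sketch only tests the nondecreasing property.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened datatype: a list of typemap displacements. */
typedef struct {
    const ptrdiff_t *disps;
    size_t           count;
    int              monotonic_cache;  /* -1 unknown, 0 no, 1 yes */
    int              scans;            /* number of full scans, for illustration */
} mono_type_t;

/* Scan the displacements only on the first call, then reuse the result. */
static bool mono_is_monotonic(mono_type_t *t)
{
    if (t->monotonic_cache < 0) {
        t->scans++;
        t->monotonic_cache = 1;
        for (size_t i = 1; i < t->count; ++i) {
            if (t->disps[i] < t->disps[i - 1]) {
                t->monotonic_cache = 0;
                break;
            }
        }
    }
    return 1 == t->monotonic_cache;
}
```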
Thanks Wei-keng Liao for the valuable feedback.
Thanks George for the guidance.
Refs. open-mpi/ompi#4682
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Set grp_local_rank to MPI_UNDEFINED before invoking
ompi_comm_nexcid() in order to benefit from the optimizations
introduced in open-mpi/ompi@68167ec879
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This change makes comparison of `mpi-f08-interfaces.F90` and
`pmpi-f08-interfaces.F90` easier.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
They were incorrectly changed to subroutines in only `pmpi`
in 258d1aa1607.
Strictly speaking, this change involves binary incompatibility.
But nobody used these subroutines and nobody will be affected because
these subroutines were useless (didn't return a calculated value).
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This fixes a regression in the sockets provider, which could return -EINTR
from fi_cq_read() due to a syscall being interrupted. The error value is
currently interpreted as a fatal condition. Relax the rule so that we can retry
the fi_cq_read() operation.
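A sketch of the relaxed handling, under stated assumptions: retry_cq_t and retry_cq_read() are mocks standing in for the provider call, and the bounded retry count is an illustration choice, not something this commit specifies.

```c
#include <errno.h>

/* Hypothetical mock queue: fails with -EINTR a given number of times,
 * then returns the number of events read. */
typedef struct {
    int interrupts_left;
    int events;
} retry_cq_t;

static long retry_cq_read(retry_cq_t *cq)
{
    if (cq->interrupts_left > 0) {
        cq->interrupts_left--;
        return -EINTR;
    }
    return cq->events;
}

/* Treat -EINTR as retryable instead of fatal. Any other value, success
 * or error, is returned to the caller unchanged. */
static long retry_read(retry_cq_t *cq, int max_retries)
{
    long ret;
    int attempts = 0;

    do {
        ret = retry_cq_read(cq);
    } while (-EINTR == ret && attempts++ < max_retries);

    return ret;
}
```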
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
In both cases we were comparing with the wrong size; it should be either
the number of local processes or the number of nodes, and not the size
of the communicator.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
The OMPI_FORTRAN_USEMPIF08_MOD macro was removed in open-mpi/ompi@791bcee6c0,
so it is now manually expanded to mpi/fortran/use-mpi-f08/mod.
Thanks to Nathan T. Weeks for the report.
Refs open-mpi/ompi#3605
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>