Adding a mutex to thje ompi_file_t structure allows to have a per-file handle
mutex lock for both ROMIO and OMPIO. I double checked that the size of the
ompi_file_t structure is still below the size of the predefined_file_t structure,
so we should be good from the backward compatibility perspective.
Also, remove the lock/unlock in the file_open ompi-interface routines of romio314.
The global lock in the romio component does probably not work, it is easy to construct a testcase where two threads perform collective I/O operations on different file handles. With a global lock it is easy to deadlock. THe lock has to be at least on the file handle basis.
move the mutex to file/file.c to avoid duplicate symbol problem in file_open.c pfile_open.c
This commit should restore the pre-non-blocking behavior of the CID
allocator when threads are used. There are two primary changes: 1)
do not hold the cid allocator lock past the end of a request callback,
and 2) if a lower id communicator is detected during CID allocation
back off and let the lower id communicator finish before continuing.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit changes the sematics of ompi request callbacks. If a
request's callback has freed or re-posted (using start) a request
the callback must return 1 instead of OMPI_SUCCESS. This indicates
to ompi_request_complete that the request should not be modified
further. This fixes a race condition in osc/pt2pt that could lead
to the req_state being inconsistent if a request is freed between
the callback and setting the request as complete.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
We need to list all major project libraries in the private libraries
line to enable static linking to work properly.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The original lock_all algorithm in osc/pt2pt sent a lock message to
each peer in the communicator even if the peer is never the target of
an operation. Since this scales very poorly the implementation has
been replaced by one that locks the remote peer on first communication
after a call to MPI_Win_lock_all.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes an issue that can occur if a target gets overwhelmed with
requests. This can cause osc/pt2pt to go into deep recursion with a stack
like req_complete_cb -> ompi_osc_pt2pt_callback -> start -> req_complete_cb
-> ... . At small scale this is fine as the recursion depth stays small but
at larger scale we can quickly exhaust the stack processing frag requests.
To fix the issue the request callback now simply puts the request on a
list and returns. The osc/pt2pt progress function then handles the
processing and reposting of the request.
As part of this change osc/pt2pt can now post multiple fragment receive
requests per window. This should help prevent a target from being overwhelmed.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit updates the btl selection logic for the RDMA and RDMA
pipeline protocols to use a btl iff: 1) the btl is also used for eager
messages (high exclusivity), or 2) no other RDMA btl is available on
an endpoint and the pml_ob1_use_all_rdma MCA variable is true. This
fixes a performance regression with shared memory when an RDMA capable
network is available.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Clang 5.1 on my mac was a sad panda compiling a couple
of files, complaining about uninitialized stack variables.
This commit makes clang a happier panda (or at least not so sad).
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Based on current implementation it is faster to use a blocking
send than the non-blocking version. Switch the exchange function
used in the barrier to use the blocking version combined with
the non-blocking version of the receive.
This is similar to open-mpi/ompi@223d75595d
It is possible for the start call to complete the requests. For this
reason the module rdma_frag field should be filled in before start is
called. If the request completes the completion callback will reset
the rdma_frag field to NULL. Fixes a bug discovered by @tkordenbrock.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
On start we were not correctly resetting all request fields. This was
leading to a double-completion on persistent receives. This commit
updates the base start code to reset the receive req_bytes_packed and
the send request convertor.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
- move the sort_iovec operations to fcoll/base
- move set_view_internal to common/ompio
- move set_file_default to common/ompio
- remove io_ompio_sort, not used anymore.
The name of `MPI_INTEGER16` obtained using `MPI_TYPE_GET_NAME`
from Fortran program was incorrect (`MPI_INTEGER8` was obtained)
when `INTEGER*16` is not supported by a compiler.
This bug affects only the Fortran binding because `MPI_INTEGER16`
is not defined in `mpi.h` if a compiler does not support it.
This commit add the following Fortran named constants which are
defined in the MPI standard but are missing in Open MPI.
- `MPI_LONG_LONG` (defined as a synonym of `MPI_LONG_LONG_INT`)
- `MPI_CXX_FLOAT_COMPLEX`
- `MPI_C_BOOL`
And this commit also changes the value of the following Fortran
named constant for consistency.
- `MPI_C_COMPLEX`
`(MPI_C_FLOAT_COMPLEX` is defined as a synonym of this)
Each needs a different solution described below.
For `MPI_LONG_LONG`:
The value of `MPI_LONG_LONG` is defined to have a same value
as `MPI_LONG_LONG_INT` because of the following reasons.
1. It is defined as a synonym of `MPI_LONG_LONG_INT` in
the MPI standard.
2. `MPI_LONG_LONG_INT` and `MPI_LONG_LONG` has a same value
for C in `mpi.h`.
3. `ompi_mpi_long_long` is not defined in
`ompi/datatype/ompi_datatype_module.c`.
For `MPI_CXX_FLOAT_COMPLEX`:
Existing `MPI_CXX_COMPLEX` is replaced with `MPI_CXX_FLOAT_COMPLEX`
bacause `MPI_CXX_FLOAT_COMPLEX` is the right name defined in MPI-3.1
and `MPI_CXX_COMPLEX` is not defined in MPI-3.1 (nor older).
But for compatibility, `MPI_CXX_COMPLEX` is treated as a synonym
of `MPI_CXX_FLOAT_COMPLEX` on Open MPI.
For `MPI_C_BOOL`:
`MPI_C_BOOL` is newly added. The value which `MPI_C_COMPLEX` had
used (68) is assinged for it because the value becomes no longer
in use (described later) and it is a suited position as a datatype
added on MPI-2.2.
For `MPI_C_COMPLEX`:
Existing `MPI_C_FLOAT_COMPLEX` is replaced with `MPI_C_COMPLEX`
and `MPI_C_FLOAT_COMPLEX` is changed to have the same value.
In other words, make `MPI_C_COMPLEX` the canonical name and
make `MPI_C_FLOAT_COMPLEX` an alias of it.
This is bacause the relation of these datatypes is same as
the relation of `MPI_LONG_LONG_INT` and `MPI_LONG_LONG`, and
`MPI_LONG_LONG_INT` and `MPI_LONG_LONG` are implemented like that.
But in the datatype engine, we use `ompi_mpi_c_float_complex`
instead of `ompi_mpi_c_complex` as a variable name to keep
the consistency with the other similar types such as
`ompi_mpi_c_double_complex` (see George's comment in open-mpi/ompi#1927).
We don't delete `ompi_mpi_c_complex` now because it is used in
some other places in Open MPI code. It may be cleand up in the future.
In addition, `MPI_CXX_COMPLEX`, which was defined only in the Open MPI
Fortran binding, is added to `mpi.h` for the C binding.
This commit breaks binary compatibility of Fortran `MPI_C_COMPLEX`.
When this commit is merged into v2.x branch, the change of
`MPI_C_COMPLEX` should be excluded.
configury command line is quoted and made available via the OPAL_CONFIGURE_CLI macro.
it can be retrieved via {orte-info,ompi_info,oshmem_info} -c, or
{orte-info,ompi_info,oshmem_info} --all --parseable | grep ^config:cli:
This commit fixes the undefined `OPAL_MAXHOSTNAMELEN` error
which arises only when `--enable-timing` is specified for
`configure`.
This bug exists only in master branch because the commit 3322347
is not merged into other branches.
This commit expands the OPAL_THREAD macros to include 32- and 64-bit
atomic swap. Additionally, macro declararations have been updated to
include both OPAL_THREAD_* and OPAL_ATOMIC_*. Before this commit the
former was used with add and the later with cmpset.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
allow for toggling of both control/data progress models.
allow for using FI_AV_TABLE or FI_AV_MAP for av type.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
This commit fixes typos on the C side of the request-based RMA binding. We
were not returning the request on success but on failure. Thanks to
@alazzaro for reporting and @ggouaillardet, and @vondele for tracking
this down.
Fixes part of open-mpi/ompi#1869
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit introduces a new algorithm for MPI_Comm_split_type. The
old algorithm performed an allgather on the communicator to decide
which processes were part of the new communicators. This does not
scale well in either time or memory.
The new algorithm performs a couple of all reductions to determine the
global parameters of the MPI_Comm_split_type call. If any rank gives
an inconsistent split_type (as defined by the standard) an error is
returned without proceeding further. The algorithm then creates a
communicator with all the ranks that match the split_type (no
communication required) in the same order as the original
communicator. It then does an allgather on the new communicator (which
should be much smaller) to determine 1) if the new communicator is in
the correct order, and 2) if any ranks in the new communicator
supplied MPI_UNDEFINED as the split_type. If either of these
conditions are detected the new communicator is split using
ompi_comm_split and the intermediate communicator is freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit simplifies the communicator context ID generation by
removing the blocking code. The high level calls: ompi_comm_nextcid
and ompi_comm_activate remain but now call the non-blocking variants
and wait on the resulting request. This was done to remove the
parallel paths for context ID generation in preperation for further
improvements of the CID generation code.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
the coll_array functions are truly only used by the fcoll modules, so move
them to fcoll/base. There is currently one exception to that rule (number of aggreagtors
logic), but that function will be moved in a long term also to fcoll/base.
This commit fixes a bug in the RDMA compare-and-swap implementation
that caused the origin value to always be written even if the compare
should have failed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
as reported by Coverity with CIDs 1363349-1363362
Offset temporary buffer when a non zero lower bound datatype is used.
Thanks Hristo Iliev for the report
(cherry picked from commit 0e393195d9)
- correctly handle non commutative operators
- correctly handle non zero lower bound ddt
- correctly handle ddt with size > extent
- revamp NBC_Sched_op so it takes two buffers and matches ompi_op_reduce semantic
- various fix for inter communicators
Thanks Yuki Matsumoto for the report
It was introduced in PR https://github.com/open-mpi/ompi/pull/1228
in particular in commit 041a6a9f53.
Original solution was using "flexible array member" called "mxm_base"
to "fall-through" to the "mxm" send/recv member that located in the
outer structure.
After changing number of elements in "mxm_base" from 0 to 1 we actually
allocating 2 mxm_req_base_t elements which leads to increased overal
size and harms cache performance.
It also brakes "mca_pml_yalla_check_request_state" function.
* Matches the blocking versions of these interfaces
- `iallreduce.c` to match `allreduce.c`
- `ireduce.c` to match `reduce.c`
- `ireduce_scatter.c` to match `reduce_scatter.c`
* Workaround for IMB-NBC benchmark, similar to the workaround
in place for the IMB-MPI1 benchmark for the blocking collectives.
* If hcoll is given a negative priority, but not enabled=0 then
the module is constructed, but then destructed before calling
it's query(). So the previous pointers are not initialized.
If we try to OBJ_RELEASE them in a debug build an assert will fire.
This commit adds some protection against that and initializes
the _module pointers to NULL.
* Print a verbose message if the component was disqualified because of
a negative priority.
* If a disqualified component provided a module, release it.
* Display list of selected components in priority order
- During the process of volunteering collective functions for a
communicator, print the component name and priority. This will
cause the verbose messages to be displayed in reverse priority
order (lowest priority first, up to highest). This is helpful
when determining which collective components are active in which
order for a given communicator.
To see the messages you need the following MCA parameter set to 9
or higher: `-mca coll_base_verbose 9`
* Adjust verbose for commonly needed verbose output from 10 to 9 to
make it easier to access this information.
This commit fixes a hang reported by @nysal which happens when a
request is completed after a sync object is created but before the
sync object can be assigned to the request. In this case we need to
set the sync signaling field to false to ensure WAIT_SYNC_RELEASE does
not hang.
Fixesopen-mpi/ompi#1828
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>