It is needed because the fs components might be queried due to a MPI_File_delete call.
And in this case, we don't have a communicator value.
Signed-off-by: Gaëtan Bossu <gbossu@ddn.com>
instead of invoking ompi_request_test_all(), that will end up
calling opal_progress() recursively, manually check the status
of the requests.
the same method is used in ompi_comm_request_progress()
Refs open-mpi/ompi#3901
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Move memory hooks init (for request based operation) in osc ucx to window
creation time, to avoid performance issue in MPI initialization.
Signed-off-by: Xin Zhao <xinz@mellanox.com>
This commit fixes a crash that occurs when using btl/vader as an RDMA
btl. This btl supports using CPU atomics and does not support using
the btl for self communication so we must use the local memory
optimizations in osc/rdma.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Data transferred by `MPI_BSEND` may corrupt if all of the following
conditions are met.
- The message size is less than the eager limit.
- The `btl_alloc` function in the BTL interface returns `NULL`
for some reason.
- The MPI program overwrites the send buffer after `MPI_BSEND`
returns.
The problem is in the way of pending a send request in ob1 PML.
The `mca_pml_ob1_send_request_start_copy` function retruns
`OMPI_ERR_OUT_OF_RESOURCE` if `mca_bml_base_alloc` function returns
`des = NULL`. In this case, the send request is added to the
`send_pending` list and `MPI_BSEND` returns immediately. Next time
the `mca_pml_ob1_send_request_start_copy` function tries sending,
the user buffer may have been overwritten by the MPI program.
Call hierarchy of `MPI_BSEND`:
```
MPI_Bsend
mca_pml_ob1_send
if (MCA_PML_BASE_SEND_BUFFERED == sendmode)
mca_pml_ob1_isend
MCA_PML_OB1_SEND_REQUEST_START_W_SEQ
mca_pml_ob1_send_request_start_seq
mca_pml_ob1_send_request_start_btl
if (size <= eager_limit)
if (req_send_mode == MCA_PML_BASE_SEND_BUFFERED)
mca_pml_ob1_send_request_start_copy
mca_bml_base_alloc
btl_alloc
if (OMPI_ERR_OUT_OF_RESOURCE == rc)
add_request_to_send_pending
ompi_request_free
```
To solve this problem, we should save the data to the buffer
attached by `MPI_BUFFER_ATTACH` before leaving `MPI_BSEND`.
This problem was introduced by ob1 optimization (commits 2b57f422
and a06e491c) in v1.8 series.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This commit fixes an issue identified by MTT where we can have two
different sets of processes on the same node creating a shared memory
window with communicators sharing the same CID. To avoid this issue
the temporary filename now includes the creating processes vpid.
References #5363
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Use internal pack/unpack subroutines that operate on MPI_Aint instead of int
and hence solve some integer overflows.
Thanks Clyde Stanfield for reporting this issue.
Refs open-mpi/ompi#5383
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Current implementation of `coll/base/MPI_Scatter` is based on in-order binomial tree. This tree is right skewed and it provides good performance for a MPI_Gather operation. But for a MPI_Scatter operation left skewed binomial tree is effective.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
-Updated the design for sync send MPI calls to use 2 protocol bits for
denoting "sync_send" or "sync_send_ack".
-"Sync_send" is added to the send tag only and is masked out in receives
such that it can be read by the original Recv posted in the send/recv
operation.
-"Sync_send_ack" is sent from the recv callback to the send side. This 0
byte send does not generate a completion entry and instead sends the
message and immediately completes the opal completion in the recv.
-Tag formats ofi_tag_1 and ofi_tag_2 have been updated to include 2
more tag bits per format type due to the reduced protocal bits required
by OMPI.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
The call of MPI_Gather with sendbuf and sendtype parameters equal to MPI_IN_PLACE and NULL correspondingly, produces the segmentation fault in the root process.
The problem is that sendtype is used even when sendbuf value is MPI_IN_PLACE. But according to the standard (page 150, line 37), sendtype and sendcount parameters should be ignored in this case.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
- added common logging infrastructure for all
UCX modules
- all UCX modules are switched to new infra
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
- some common functionality of del_procs calls is moved into
mca_common module
- blocking ucp_put call is replaced by non-blocking routine
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
moving some code from fs/ufs into fs/base. The benefit of this approach is
that fs components that are fundamentally based on posix I/O (and only differ
in some non-posix functionality such as setting stripe size, or which hints
are being supported) can avoid having to replicate the same code over and
over again. First beneficiary is the lustre fs component, but more
are to follow soon.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit fixes a typo where a bcast is used instead of the intended
collective (barrier).
References #5262
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The osc/rdma module did not wait for all pending atomics to complete
before tearing down. This could lead to weird issues as the target
location may no longer be registered or allocated.
This commit also fixes an offset calculation issue in
ompi_osc_get_data_blocking ().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
a bug sneaked into constructing the list of aggregators
processes when using the fileview based grouping options
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
handle the situation where the user requests a non-zero amount
of data but has a zero-size fileview. My instrinct would have been
to return an error code, but according to the test that I used
it should be MPI_SUCCESS and zero bytes. It is definitely better
than segfaulting :-)
THis makes another test from the IBM testsuite pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the seek offset calculation did not treat the offset as a multiple
of the etype provided. Fixing this makes some more ibm tests pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
use an allocator to manage temporary buffers when copying
unmanaged data from GPU buffer to host. This is necessary,
since the buffers have to be pinned for better performance,
which is an expensive operation.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the two_phase compoment does not work with some collective I/O
operations on CUDA buffers due to the data sieving (i.e.
both read and write operations) executed on some buffers, which are
not anticipated in the GPU buffer management of the code.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this commit adds the initial support for cuda buffers in ompio, for blocking
and non-blocking individual read and write operations.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
two minor updates:
- in all components: use the fh->f_bytes_per_agg value
(which might have been set by an info object) instead
of re-reading the mca parameter
- vulcan and dynamic_gen2: replace one allgather operation
by an allreduce, since it is used to determine the sum
of an array.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit attempts to update the romio io component to not use
functions removed in MPI-3.0 (2012). This is a first cut and will
probably need to be reviewed for correctness.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(back-ported from commit open-mpi/ompi@84765001aa)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
romio assumes that all predefined datatypes are contiguous. Because of
the (terribly named) composed datatypes MPI_SHORT_INT, MPI_DOUBLE_INT,
MPI_LONG_INT, etc this is an incorrect assumption. The simplest way to
fix this is to override the MPI_Type_get_envelope and
MPI_Type_get_contents calls with calls that will work on these
datatypes. Note that not all calls to these MPI functions are
replaced, only the ones used when flattening a non-contiguous
datatype.
References #5009
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(back-ported from commit open-mpi/ompi@4d876ec6fe)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
add support for the info objects cb_buffer_size and collective_buffering.
Also, introduce a new mca parameter that allows to give feedback
on whether an info object is recognized (and honored).
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this component can only be used in very specific scenarios. However, since some file systems do not support file locking and processes might be distributed over multiple nodes (hence the sm sharedfp component is also inelligible), the component might be selected in some scenarios, even if an application does not intend to use shared file pointers.
Since the fseek_shared function is involved as part of the File_set_view operation, only complain about the inability to perform the seek_shared operation if actual shared file pointer operations are being used. This avoid spurious error values being returned.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this commit revamps the internal operations of the sharedfp components.
Specifically, it is focused around removing the second file_open
operation for shared file pointers. This makes the code more efficient.
Because of that, there is no necessity anymore for the sharedfp_lazy_open
mca parameter.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Extend number of supported ranks with providers that support
FI_REMOTE_CQ_DATA. Add README file to OFI MTL
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
This code is the implementation of Software-base Performance Counters as described in the paper 'Using Software-Base Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.
All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example.
Signed-off-by: David Eberius <deberius@vols.utk.edu>
The `nbc_i*` functions don't start communication, but create a request.
`nbc_*_init` are appropriate names for them.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Persistent operation for `NBC_A2A_DISS` is not supported currently.
Though the algorithm is not selected at all currently, I put an
assertion not to select it by mistake.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
`NBC_Copy` shoud not be called in `MPI_*_INIT`.
`NBC_Sched_copy` should be called instead.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Because a persistent reuqest does not free its `schedule` object
when the communication completes, the `NBC_Progress` function cannot
determine the completion using `schedule`.
Without this change, a hang occurs when the `NBC_Progress` function
is called recursively through the `NBC_Start_round` function.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Now libnbc COLL supports persistent collectives and all `*_init`
functions of the COLL interface are available. So let's enable the
check of availability of those functions on a communicator creation.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
prepare the upcoming persistent collectives by pre-factoring some code
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
fixup 808c3c62cd9475edd91ecde9d2d53b12e28b2c04
now that we have a shiny new fcoll component, no need
to keep the static component around. No use for it anymore.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
check for pending I/O operations and invalid modes
and return proper error codes before executing MPI_File_sync
makes the e_sync_1 test from the ibm testsuite pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
in case the user opened a file using the DELETE_ON_CLOSE flag,
return the error code generated in the delete operation.
Note, that this is however just a partial fix to the e_close_1 test
from the ibm testsuite, since the object destructor that triggers
the file_close function does not have a mechanism right now to recognize
and return an error code.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
in file_get_byte_offset, return an error code if the offset
leads to an invalid position in file.
Makes the e_get_byte_offset_1 test from the ibm testsuite pass.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
and some internal structure elements/components. Along the way,
add support for the cb_nodes Info object.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
since the request code is now being accessed also from the vulcan fcoll
component, the request code was relocated into the common/ompio
directory to avoid ld load problems.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
We introduced a new mca_vulcan parameter that specify the I/O synchronization
type (Async/sync I/O) applied within the collective write operation.
The user can explicitly choose to use async or sync write operation or make
the choice automatically made.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
For very large offsets, the data chunk size to be written by each aggregator
exceeds the capacity of an integer variable. Besides, some variables were
not large enough to hold intermediate values.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Identify the index of each aggregator process in order to restrict the call to write_init function by the specific aggregator.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Instead of using a temporary buffer and copy data into the temp buffer before sending, use a derived datatype to describe the data that needs to be sent during a cycle in the collective I/O operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
import of the new vulcan component. It is an enhanced version
of the two_phase component, which uses however the ompio internal
codes/loops to assemble the data arrays. It is therefore more inline
with the dynamic and dynamic_gen2 component, and will be easier to
maintain.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Implements butterfly algorithm for MPI_Reduce_scatter.
The algorithm can be used both by commutative and non-commutative operations, for power-of-two and non-power-of-two number of processes.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using:
* autogen --no-orte
* with external libevent, external hwloc, and external PMIx master
* configuring PMIx master with the same libevent and hwloc
* execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun
Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)
Per MPI-3.1:8.7.1 p361:11-13, it's valid for MPI_FINALIZED to be
invoked during an attribute destruction callback (e.g., during the
destruction of keyvals on MPI_COMM_SELF during the very beginning of
MPI_FINALIZE). In such cases, MPI_FINALIZED must return "false".
Prior to this commit, we hung in FINALIZED if it were invoked during
a COMM_SELF attribute destruction callback in FINALIZE. See
https://github.com/open-mpi/ompi/issues/5084.
This commit converts the MPI_INITIALIZED / MPI_FINALIZED
infrastructure to use a single enum (ompi_mpi_state, set atomically)
to represent the state of MPI:
- not initialized
- init started
- init completed
- finalize started
- finalize past COMM_SELF destruction
- finalize completed
The "finalize past COMM_SELF destruction" state is what allows us to
return "false" from MPI_FINALIZED before COMM_SELF has been fully
destroyed / all attribute callbacks have been invoked.
Since this state is checked at nearly every MPI API call (to see if
we're outside of the INIT/FINALIZE epoch), care was taken to use
atomics to *set* the ompi_mpi_state value in ompi_mpi_init() and
ompi_mpi_finalize(), but performance-critical code paths can simply
read the variable without needing to use a slow call to an
opal_atomic_*() function.
Thanks to @AndrewGaspar for reporting the issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
for very large offsets, ome ariables used in the fcoll/dynamic_gen2
code base were under certain circumstances not large enough to hold
intermediate values. This issue was more detected in the vulcan component
but could happen in the dynamic_gen2 component as well.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit adds a new MCA variable to set the location of the backing
store: osc_sm_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new MCA variable to set the location of the backing
store: osc_rdma_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per discussion at
https://github.com/open-mpi/ompi/issues/2614#issuecomment-392815654,
do not allow for selection of the OSC PT2PT when creating an MPI RMA
window when THREAD_MULTIPLE is active. Print a helpful message and
return a not-supported error.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit d0ffd660841623c02d1dfa3151e7f7afd3327698)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Implements butterfly algorithm for MPI_Reduce_scatter_block.
The algorithm can be used both by commutative and non-commutative
operations, for power-of-two and non-power-of-two number of processes.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
get_dynamic_win_info() at osc_ucx_comm.c
In file included from /usr/include/string.h:494:0,
from ../../../../ompi/info/info.h:29,
from ../../../../ompi/mca/osc/base/base.h:24,
from osc_ucx_comm.c:13:
In function 'memcpy',
inlined from 'get_dynamic_win_info' at osc_ucx_comm.c:359:5,
inlined from 'ompi_osc_ucx_put' at osc_ucx_comm.c:401:18:
/usr/include/bits/string_fortified.h:34:10: warning: '__builtin___memcpy_chk' writing 8 bytes into a region of size 4 overflows the destination [-Wstringop-overflow=]
return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is caused by a type size mismatch in a call to memcpy
This fix corrects the type definition of the win_count variable.
Signed-off-by: John Jolly <jjolly@suse.com>
Remove the MXM MTL, which has been deprecated in preference for
the Yalla PML. This was discussed at the last developers meeting
and somehow I ended up with the action item to do the removal.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
- rename ompi_coll_base_reduce_scatter_block_basic to
more self descriptive ompi_coll_base_reduce_scatter_block_basic_linear
- fix the description of the coll_tuned_reduce_scatter_block_algorithm
MCA param
this fixes and documents previous open-mpi/ompi@0e8b35b615
MPI_Reduce_scatter_block used to be implemented by the coll/basic module only.
A new algo (recursive doubling) was recently introduced and can be used via the coll/tuned module,
but we never intended to make it the default algo.
In order to "restore" the previous default, the initial algo was moved from coll/basic to coll/base,
and is now used by default by coll/tuned.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The files array was also storing $phase.prof. This was leading to
$phase.prof's output getting dumped into itself again and again. Updated
code to initialise files array with files other than $phase.prof.
Signed-off-by: Ninad Prabhukhanolkar <ninadchess96@gmail.com>
keep track of the sizeof the blocklen_per_process and displs_per_process
on the aggregator datastructure to minimze the number of realloc function
calls required in the shuffle_init operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
This commit attempts to update the romio io component to not use
functions removed in MPI-3.0 (2012). This is a first cut and will
probably need to be reviewed for correctness.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
romio assumes that all predefined datatypes are contiguous. Because of
the (terribly named) composed datatypes MPI_SHORT_INT, MPI_DOUBLE_INT,
MPI_LONG_INT, etc this is an incorrect assumption. The simplest way to
fix this is to override the MPI_Type_get_envelope and
MPI_Type_get_contents calls with calls that will work on these
datatypes. Note that not all calls to these MPI functions are
replaced, only the ones used when flattening a non-contiguous
datatype.
References #5009
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a segfault in mtl-portals4 finalize(). The segfault
occurs if finalize() is called without any calls to add_procs(). This
commit resolves the segfault by skipping the progress() loop in
finalize() if the Portals was not initialized.
Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
the ompio module resets the amode from WRONLY to RDWR in order
to accoomodate data sieving in the two-phase fcoll componet. This
leads however to an error if MPI_MODE_SEQUENTIAL has been requested
by the user, since MODE_SEQUENTIAL is incompatible with MODE_RDWR.
SInce the change to the amode was done after opening the file for
individual file pointers but before opening the file for shared filepointers,
this lead to an error message in the sharedfp component.
Note, that data sieving is never necessary if MODE_SEQUENTIAL is set,
so this should not be a problem for any scenario.
Fixes#4991
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
instead of using a temporary buffer and copy data into the temp buffer before sending, use a derived datatype to describe the data that needs to be sent during a cycle in the collective I/O operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Implements recursive doubling algorithm for MPI_Scan and MPI_Exscan.
The algorithm preserves order of operations so it can be used both
by commutative and non-commutative operations.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
This commit fixes a bug is osc/rdma that can occur if the total size
of the shared memory segment gets larger than 4 GiB. The bug was
caused by a typo. The type of my_base_offset should have been size_t
not int.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This will almost certainly never happen, but be defensive and
guarantee that we never return an uninitialized variable.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Ensure to initialized ret_code. This problem will likely never occur
in practice, but we might as well be defensive about it.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>