The osc/rdma module did not wait for all pending atomics to complete
before tearing down. This could lead to weird issues as the target
location may no longer be registered or allocated.
This commit also fixes an offset calculation issue in
ompi_osc_get_data_blocking ().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
a bug sneaked into constructing the list of aggregators
processes when using the fileview based grouping options
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
handle the situation where the user requests a non-zero amount
of data but has a zero-size fileview. My instrinct would have been
to return an error code, but according to the test that I used
it should be MPI_SUCCESS and zero bytes. It is definitely better
than segfaulting :-)
THis makes another test from the IBM testsuite pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the seek offset calculation did not treat the offset as a multiple
of the etype provided. Fixing this makes some more ibm tests pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
use an allocator to manage temporary buffers when copying
unmanaged data from GPU buffer to host. This is necessary,
since the buffers have to be pinned for better performance,
which is an expensive operation.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the two_phase compoment does not work with some collective I/O
operations on CUDA buffers due to the data sieving (i.e.
both read and write operations) executed on some buffers, which are
not anticipated in the GPU buffer management of the code.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this commit adds the initial support for cuda buffers in ompio, for blocking
and non-blocking individual read and write operations.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
two minor updates:
- in all components: use the fh->f_bytes_per_agg value
(which might have been set by an info object) instead
of re-reading the mca parameter
- vulcan and dynamic_gen2: replace one allgather operation
by an allreduce, since it is used to determine the sum
of an array.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit attempts to update the romio io component to not use
functions removed in MPI-3.0 (2012). This is a first cut and will
probably need to be reviewed for correctness.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(back-ported from commit open-mpi/ompi@84765001aa)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
romio assumes that all predefined datatypes are contiguous. Because of
the (terribly named) composed datatypes MPI_SHORT_INT, MPI_DOUBLE_INT,
MPI_LONG_INT, etc this is an incorrect assumption. The simplest way to
fix this is to override the MPI_Type_get_envelope and
MPI_Type_get_contents calls with calls that will work on these
datatypes. Note that not all calls to these MPI functions are
replaced, only the ones used when flattening a non-contiguous
datatype.
References #5009
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(back-ported from commit open-mpi/ompi@4d876ec6fe)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
add support for the info objects cb_buffer_size and collective_buffering.
Also, introduce a new mca parameter that allows to give feedback
on whether an info object is recognized (and honored).
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this component can only be used in very specific scenarios. However, since some file systems do not support file locking and processes might be distributed over multiple nodes (hence the sm sharedfp component is also inelligible), the component might be selected in some scenarios, even if an application does not intend to use shared file pointers.
Since the fseek_shared function is involved as part of the File_set_view operation, only complain about the inability to perform the seek_shared operation if actual shared file pointer operations are being used. This avoid spurious error values being returned.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this commit revamps the internal operations of the sharedfp components.
Specifically, it is focused around removing the second file_open
operation for shared file pointers. This makes the code more efficient.
Because of that, there is no necessity anymore for the sharedfp_lazy_open
mca parameter.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Extend number of supported ranks with providers that support
FI_REMOTE_CQ_DATA. Add README file to OFI MTL
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
This code is the implementation of Software-base Performance Counters as described in the paper 'Using Software-Base Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.
All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example.
Signed-off-by: David Eberius <deberius@vols.utk.edu>
The `nbc_i*` functions don't start communication, but create a request.
`nbc_*_init` are appropriate names for them.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Persistent operation for `NBC_A2A_DISS` is not supported currently.
Though the algorithm is not selected at all currently, I put an
assertion not to select it by mistake.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
`NBC_Copy` shoud not be called in `MPI_*_INIT`.
`NBC_Sched_copy` should be called instead.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Because a persistent reuqest does not free its `schedule` object
when the communication completes, the `NBC_Progress` function cannot
determine the completion using `schedule`.
Without this change, a hang occurs when the `NBC_Progress` function
is called recursively through the `NBC_Start_round` function.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Now libnbc COLL supports persistent collectives and all `*_init`
functions of the COLL interface are available. So let's enable the
check of availability of those functions on a communicator creation.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
prepare the upcoming persistent collectives by pre-factoring some code
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
fixup 808c3c62cd9475edd91ecde9d2d53b12e28b2c04
now that we have a shiny new fcoll component, no need
to keep the static component around. No use for it anymore.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
check for pending I/O operations and invalid modes
and return proper error codes before executing MPI_File_sync
makes the e_sync_1 test from the ibm testsuite pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
in case the user opened a file using the DELETE_ON_CLOSE flag,
return the error code generated in the delete operation.
Note, that this is however just a partial fix to the e_close_1 test
from the ibm testsuite, since the object destructor that triggers
the file_close function does not have a mechanism right now to recognize
and return an error code.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
in file_get_byte_offset, return an error code if the offset
leads to an invalid position in file.
Makes the e_get_byte_offset_1 test from the ibm testsuite pass.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
and some internal structure elements/components. Along the way,
add support for the cb_nodes Info object.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
since the request code is now being accessed also from the vulcan fcoll
component, the request code was relocated into the common/ompio
directory to avoid ld load problems.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
We introduced a new mca_vulcan parameter that specify the I/O synchronization
type (Async/sync I/O) applied within the collective write operation.
The user can explicitly choose to use async or sync write operation or make
the choice automatically made.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
For very large offsets, the data chunk size to be written by each aggregator
exceeds the capacity of an integer variable. Besides, some variables were
not large enough to hold intermediate values.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Identify the index of each aggregator process in order to restrict the call to write_init function by the specific aggregator.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Instead of using a temporary buffer and copy data into the temp buffer before sending, use a derived datatype to describe the data that needs to be sent during a cycle in the collective I/O operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
import of the new vulcan component. It is an enhanced version
of the two_phase component, which uses however the ompio internal
codes/loops to assemble the data arrays. It is therefore more inline
with the dynamic and dynamic_gen2 component, and will be easier to
maintain.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Implements butterfly algorithm for MPI_Reduce_scatter.
The algorithm can be used both by commutative and non-commutative operations, for power-of-two and non-power-of-two number of processes.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using:
* autogen --no-orte
* with external libevent, external hwloc, and external PMIx master
* configuring PMIx master with the same libevent and hwloc
* execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun
Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)
Per MPI-3.1:8.7.1 p361:11-13, it's valid for MPI_FINALIZED to be
invoked during an attribute destruction callback (e.g., during the
destruction of keyvals on MPI_COMM_SELF during the very beginning of
MPI_FINALIZE). In such cases, MPI_FINALIZED must return "false".
Prior to this commit, we hung in FINALIZED if it were invoked during
a COMM_SELF attribute destruction callback in FINALIZE. See
https://github.com/open-mpi/ompi/issues/5084.
This commit converts the MPI_INITIALIZED / MPI_FINALIZED
infrastructure to use a single enum (ompi_mpi_state, set atomically)
to represent the state of MPI:
- not initialized
- init started
- init completed
- finalize started
- finalize past COMM_SELF destruction
- finalize completed
The "finalize past COMM_SELF destruction" state is what allows us to
return "false" from MPI_FINALIZED before COMM_SELF has been fully
destroyed / all attribute callbacks have been invoked.
Since this state is checked at nearly every MPI API call (to see if
we're outside of the INIT/FINALIZE epoch), care was taken to use
atomics to *set* the ompi_mpi_state value in ompi_mpi_init() and
ompi_mpi_finalize(), but performance-critical code paths can simply
read the variable without needing to use a slow call to an
opal_atomic_*() function.
Thanks to @AndrewGaspar for reporting the issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
for very large offsets, ome ariables used in the fcoll/dynamic_gen2
code base were under certain circumstances not large enough to hold
intermediate values. This issue was more detected in the vulcan component
but could happen in the dynamic_gen2 component as well.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit adds a new MCA variable to set the location of the backing
store: osc_sm_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new MCA variable to set the location of the backing
store: osc_rdma_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per discussion at
https://github.com/open-mpi/ompi/issues/2614#issuecomment-392815654,
do not allow for selection of the OSC PT2PT when creating an MPI RMA
window when THREAD_MULTIPLE is active. Print a helpful message and
return a not-supported error.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit d0ffd660841623c02d1dfa3151e7f7afd3327698)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Implements butterfly algorithm for MPI_Reduce_scatter_block.
The algorithm can be used both by commutative and non-commutative
operations, for power-of-two and non-power-of-two number of processes.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
get_dynamic_win_info() at osc_ucx_comm.c
In file included from /usr/include/string.h:494:0,
from ../../../../ompi/info/info.h:29,
from ../../../../ompi/mca/osc/base/base.h:24,
from osc_ucx_comm.c:13:
In function 'memcpy',
inlined from 'get_dynamic_win_info' at osc_ucx_comm.c:359:5,
inlined from 'ompi_osc_ucx_put' at osc_ucx_comm.c:401:18:
/usr/include/bits/string_fortified.h:34:10: warning: '__builtin___memcpy_chk' writing 8 bytes into a region of size 4 overflows the destination [-Wstringop-overflow=]
return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is caused by a type size mismatch in a call to memcpy
This fix corrects the type definition of the win_count variable.
Signed-off-by: John Jolly <jjolly@suse.com>
Remove the MXM MTL, which has been deprecated in preference for
the Yalla PML. This was discussed at the last developers meeting
and somehow I ended up with the action item to do the removal.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
- rename ompi_coll_base_reduce_scatter_block_basic to
more self descriptive ompi_coll_base_reduce_scatter_block_basic_linear
- fix the description of the coll_tuned_reduce_scatter_block_algorithm
MCA param
this fixes and documents previous open-mpi/ompi@0e8b35b615
MPI_Reduce_scatter_block used to be implemented by the coll/basic module only.
A new algo (recursive doubling) was recently introduced and can be used via the coll/tuned module,
but we never intended to make it the default algo.
In order to "restore" the previous default, the initial algo was moved from coll/basic to coll/base,
and is now used by default by coll/tuned.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The files array was also storing $phase.prof. This was leading to
$phase.prof's output getting dumped into itself again and again. Updated
code to initialise files array with files other than $phase.prof.
Signed-off-by: Ninad Prabhukhanolkar <ninadchess96@gmail.com>
keep track of the sizeof the blocklen_per_process and displs_per_process
on the aggregator datastructure to minimze the number of realloc function
calls required in the shuffle_init operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
This commit attempts to update the romio io component to not use
functions removed in MPI-3.0 (2012). This is a first cut and will
probably need to be reviewed for correctness.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
romio assumes that all predefined datatypes are contiguous. Because of
the (terribly named) composed datatypes MPI_SHORT_INT, MPI_DOUBLE_INT,
MPI_LONG_INT, etc this is an incorrect assumption. The simplest way to
fix this is to override the MPI_Type_get_envelope and
MPI_Type_get_contents calls with calls that will work on these
datatypes. Note that not all calls to these MPI functions are
replaced, only the ones used when flattening a non-contiguous
datatype.
References #5009
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a segfault in mtl-portals4 finalize(). The segfault
occurs if finalize() is called without any calls to add_procs(). This
commit resolves the segfault by skipping the progress() loop in
finalize() if the Portals was not initialized.
Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
the ompio module resets the amode from WRONLY to RDWR in order
to accoomodate data sieving in the two-phase fcoll componet. This
leads however to an error if MPI_MODE_SEQUENTIAL has been requested
by the user, since MODE_SEQUENTIAL is incompatible with MODE_RDWR.
SInce the change to the amode was done after opening the file for
individual file pointers but before opening the file for shared filepointers,
this lead to an error message in the sharedfp component.
Note, that data sieving is never necessary if MODE_SEQUENTIAL is set,
so this should not be a problem for any scenario.
Fixes#4991
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
instead of using a temporary buffer and copy data into the temp buffer before sending, use a derived datatype to describe the data that needs to be sent during a cycle in the collective I/O operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Implements recursive doubling algorithm for MPI_Scan and MPI_Exscan.
The algorithm preserves order of operations so it can be used both
by commutative and non-commutative operations.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
This commit fixes a bug is osc/rdma that can occur if the total size
of the shared memory segment gets larger than 4 GiB. The bug was
caused by a typo. The type of my_base_offset should have been size_t
not int.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This will almost certainly never happen, but be defensive and
guarantee that we never return an uninitialized variable.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Ensure to initialized ret_code. This problem will likely never occur
in practice, but we might as well be defensive about it.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
somehow the flag indicating to gather performance data
on collective io operations has changed to 1 accidentally.
Should be 0 ( false) by default.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
never got to move this sharedfp component into anything
usable. Can easily be restored if necessary.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
plfs components are at this point not utilized by anybody as far as I know.
Easy to bring back if we want to.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit is a large update to the osc/rdma component. Included in
this commit:
- Add support for using hardware atomics for fetch-and-op and single
count accumulate when using the accumulate lock. This will improve
the performance of these operations even when not setting the
single intrinsic info key.
- Rework how large accumulates are done. They now block on the get
operation to fix some bugs discovered by an IBM one-sided test. I
may roll back some of the changes if the underlying bug in the
original design is discovered. There appear to be no real
difference (on the hardware this was tested with) in performance so
its probably a non-issue. References #2530.
- Add support for an additional lock-all algorithm: on-demand. The
on-demand algorithm will attempt to acquire the peer lock when
starting an RMA operation. The lock algorithm default has not
changed. The algorithm can be selected by setting the
osc_rdma_locking_mode MCA variable. The valid values are two_level
and on_demand.
- Make use of the btl_flush function if available. This can improve
performance with some btls.
- When using btl_flush do not keep track of the number of put
operations. This reduces the number of atomic operations in the
critical path.
- Make the window buffers more friendly to multi-threaded
applications. This was done by dropping support for multiple
buffers per MPI window. I intend to re-add that support once the
underlying performance bug under the old buffering scheme is
fixed.
- Fix a bug in request completion in the accumulate, get, and put
paths. This also helps with #2530.
- General code cleanup and fixes.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
data sieving has to occur for any offset provided that is larger
or equal zero for this implementation to work correctly.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this commit fixes an issue observed with romio314 and the hdf5 1.10.x testsuite.
The ADIOI_Datatype_iscontig() routine in romio314/src/io_romio314_module.c
will now return for a datatype of size 0 that it is contiguous, even if the extent
of the datatype is non-zero. This avoids a segmentation fault observed in the
ADIOI_Flatten routine, and fixes this particular with the hdf5 1.10.x testsuite in
OpenMPI with romio314.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit fixes#4795
- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of
sequence number from uint16_t.
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
-Updated ompi_mtl_ofi_progress to use an array to read CQ events up to a
threshold that can be set by the Open MPI User.
-Users can adjust the number of events that can be handled in the
ompi_mtl_ofi_progress by setting "--mca mtl_ofi_progress_event_cnt #".
-The default value for the the number of CQ events that can be read in a
single call to ofi progress is 100 which is an average
based off workload usecase anaylsis showing 70-128 as the range of
multiple events returned during ofi progress.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
This commit fixes a flaw in the progress function for libnbc. The
function was unconditionally taking a lock even if there are no
requests to process. This lock was showing up in vtune traces of
multi-threaded benchmarks.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
after performing the final OBJ_RELEASE on the request,
reset the user level variable to MPI_REQUEST_NULL.
Otherwise the c_2_f translation step in the fortran
interface fails.
Fixes issue #4807
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit fixes a flaw in the eager limit check in pml/ob1. The
check was incorrectly checking if RDMA-only BTLs (BTLs without the
send flag) has a valid eager limit. This commit fixes the check by
adding an additional check for the send flag on the BTL module.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The osc monitoring component needed to include other OSC components
header in order to be able tu access communicator through the
component specific ompi_osc_*_module_t structures. This commit remove
the dependency, and resolve the issue #4523.
Extend the common monitoring API.
* Now it's possible to translate from local rank to world rank from
both the communicator and the group.
* Remove useless hashtable as we directly use the w_group contained
in window structure.
Add automatic generation at config time.
The templates are expanded at configure time. It creates a new header
file that generates all the variables/functions needed. Adding this
during the autogen automagicaly generates for each of the available
modules the proper functions.
Only keep a generated argv-style array.
Following Jeff's advice, the configure.m4 file generate a simple array
of module variables to be iterated over to find the proper module.
Signed-off-by: Clement Foyer <clement.foyer@inria.fr>
If component selection fails, then module->bases might be unallocated
when ompi_osc_sm_free() in invoked, so test it before trying to free()
module->bases[0].
Thanks Martin Binder for the report.
Refs open-mpi/ompi#4770
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
an erroneous return statement has creeped in commit 1885d99
which leads to some processes not resetting stripe_size
and stripe_count correctly. This can lead in 3.0.x to different
fcoll modules being selected. The impact is not that dramatic on
master and 3.1.x, but could lead to problems as well.
Fixes#4745
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This fixes a regression in sockets provider which could return -EINTR value
from fi_cq_read() due to a syscall being interrupted. The error value is
currently interpreted as fatal condition. Relax the rule so that we can retry
fi_cq_read() operation.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
In both cases we were comparing with the wrong size, it should be either
the number of local processes or the number of nodes, and not the size
of the communicator.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
ompio has the unique problem, that mca parameters set in the io/ompio component
have to be accessible from other frameworks as well. This is mostly done to avoid
a replication in the parameter names and to reduce the number of mca parameters that
and end-user has to worry about.
This commit introduces a generic function to retrieve ompio mca parameters, the function pointer
is stored on the file handle. It replaces two functions that used the same concept already for
one parameter each.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit adds support for fetch-and-op atomics. This is needed
because and and or are irreversible operations so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is not atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes the following bugs:
- Allow a btl to be used for communication if it can communicate with
all non-self peers and it supports global atomic visibility. In
this case CPU atomics can be used for self and the btl for any
other peer.
- It was possible to get into a state where different threads of an
MPI process could issue conflicting accumulate operations to a
remote peer. To eliminate this race we now update the peer flags
atomically.
- Queue up and re-issue put operations that failed during a BTL
callback. This can occur during an accumulate operation. This was
an unhandled error case.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Cleanup several places where abstraction violations crept into OMPI layer (direct reference of ORTE). Add some missing includes that were exposed by this change.
Note that this compiles, but I haven't tested it for execution yet. Handing it over to Noah Evans for completion
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Turns out there are edge cases where an MTL's isend
method may end up marking a send request complete prior
to returning to the CM code. The would end up causing
problems in the bsend path since the ompi_request_complete
would end up getting invoked a second time on this request.
This ended up causing segfaults, etc. in ompi_request_complete .
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
At least with some providers (sockets and GNI), the mprobe/mrecv
ofi mtl methods were incorrect. For these two providers at least
one must supply the original tag and mask bits used with the
prior FI_PEEK | FI_CLAIM request that had been used to probe for
the message.
These providers take a strict interpretation of the following sentence
from the libfabric fi_tagged man page:
```
Claimed messages can only be retrieved using a subsequent, paired receive operation with the FI_CLAIM flag set.
```
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
ompio has historically changed the WRONLY flag provided by the applicaiton
to RDWR to allow for the data sieving optimization within the two-phase I/O
fcoll component. This change did not have a performance impact
on regular UNIX file systems, but seems to hurt performance on NFS (and maybe Lustre?)
So provide an option that allows to keep the WRONLY option, and raise an error
if tha fcoll/two-phase would actually like to use the data sieving.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
gcc 5.2 complains:
```
mtl_ofi_component.c: In function ‘ompi_mtl_ofi_finalize’:
mtl_ofi_component.c:613:5: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
if (ret = fi_close((fid_t)ompi_mtl_ofi.fabric)) {
^
```
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Before this commit, the presence of usNIC devices -- which will
(currently) return no data when fi_getinfo() is queried for tagged
matching providers -- would cause an error message to be displayed.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The value of ret is negative (e.g., -61), but it is displayed in the
help message as `%zd`, which renders as unsigned (i.e., a giant
positive value). So make sure to negate the negative value before
rendering it (e.g., so we display "61", not "4294967235").
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
higher priority than rdma and default to psm2.
Context: the Intel Omni-path driver (hfi1) has verbs support, so the openib
btl is available to use. However, at a bad performance. Without this
change osc rdma using btl openib is the default choice when running on Intel
Omni-path, with a lower performance than osc pt2pt over mtl psm2.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
Rework the logic to handle the out-of-sequence fragments on the receiver
side. A large number of OOS messages are still arriving even in single
threaded scenarios.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
set proper error codes in mca_fs_ufs_file_open by mapping the errno value to
the MPI error code.
Refs. open-mpi/ompi#4443
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This is a bug fix based on a problem reported on the mailing list.
For very large read/write operations, ompio breaks the operation
down into multiple cycles. The problem was that
one of the variables required to maintain its values
across the different cycles did not do that, and because
of that the calculations of the memory offsets was wrong.
Fixes issue #4453
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the fs/lustre component has missed out on a number of updates to the fs/ufs component.
This commit tries to import all the changes performed on the fs/ufs component
w.r.t to the file_open operation, including updates on how the amode is set,
error is propegated and setting the fs_block_size value (which is required for
locking purposes).
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
set proper error codes in mca_fs_ufs_file_open by mapping the errno value to
the MPI error code. Fixes an issue reported on the mailing by Wei-keng Liao
Fixes Issue #4443
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
If not the pvars will remain valid after the OB1 PML is unloaded, and
any access will segfault (the callbacks associated with the pvar will
point to the memory of the dlclosed module).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit renames the atomic compare-and-swap functions to indicate
the return value. This is in preperation for adding support for a
compare-and-swap that returns the old value. At the same time the
return type has been changed to bool.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Currently, the progress function is incorrectly interpreting any error
value other than a positive value or -FI_EAVAIL to mean CQ is empty.
CQ is empty only if fi_cq_read() call returned -EAGAIN error
code. Fix that here.
While at it, fix help text output for calls made to OFI API.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
In multithreaded case, it is expensive to release the lock, call the slow match
and retake the lock again just to queue the frag. This patch will eliminate number of
lock taken by queueing the frag right away and return.
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
The messages should be printed only in the event of CUDA builds and in the
presence of supporting hardware and when PSM2 MTL has actually been selected
for use. To this end, move help text output to component init phase.
Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment
variable and sanitize the help text message.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
This commit looks large, but its really mostly a cleanup step.
1. introduce proper error handling for the return values of fcntl and the fbtl_posix_lock function
2. rename a parameter to more accurately reflect what it does
3. introduce an mca parameter in the fs/ufs component that allows to control
what the level of locking the user would like to enforce
4. move the initialization of the fs_block_size parameter from fs/ufs into the
common/ompio component. An fs component might be allowed to overwrite this
value, but none of the actual fs components do that.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>