Changes of nonblocking collectives in e98d794e8b and f750c6932c
are applied to persistent collectives.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit 357531847e)
Following the commit f750c6932c, I compared
`ompi/mpi/fortran/use-mpi-f08/*.F90` and
`ompi/mpi/fortran/use-mpi-f08/profile/p*.F90`, and
`ompi/mpi/fortran/use-mpi-f08/mod/mpi-f08-interfaces.F90` and
`ompi/mpi/fortran/use-mpi-f08/mod/pmpi-f08-interfaces.F90`.
There are many differences. Some are bugs in `MPI_*`, some are
bugs in `PMPI_*`. I'm not sure how these bugs affect applications.
To make it easier to compare these files in the future, I also removed
editorial differences.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit cf6d28cb66)
When --enable-mpi1-compatibility was specified, the ompi_mpi_ub/lb
symbols were #if'ed out of mpi.h. But the #defines for MPI_UB/LB
still remained. This commit also #if's out the MPI_UB/LB macros when
--enable-mpi1-compatibility is specified.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7223334d4d)
Change default provider selection logic for the OFI MTL. The
old logic was whitelist-only, so any new HPC NIC provider would
have to ask users to do extra work or wait for an OMPI release
to be whitelisted. The reason for the logic was to avoid
selecting a "generic" provider like sockets or shm that would
frequently have worse performance than the optimized BTL options
Open MPI supports.
With the change, we blacklist the (small, relatively static) list
of providers that duplicate internal capabilities. Users can use
one of these blacklisted providers in two ways: first, they can
explicitly request the provider in the include list (which will
override the default exclude list) and second, they can set a new
empty exclude list.
Since most HPC networks require special libraries and therefore
an explicit build of libfabric, it is highly unlikely that this
change will cause users to use libfabric when they didn't want to
do so. It does, however, solve the whitelisting problem.
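A minimal sketch of the include/exclude rule described above, in plain C. The function names and the example exclude list `"shm,sockets"` are assumptions for illustration; they are not the MCA parameter names or the actual default list used by the OFI MTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `name` appears in the comma-separated `list`. */
static bool list_contains(const char *list, const char *name)
{
    size_t n = strlen(name);
    const char *p = list;

    while (p != NULL && *p != '\0') {
        const char *end = strchr(p, ',');
        size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);
        if (len == n && strncmp(p, name, n) == 0) {
            return true;
        }
        p = (end != NULL) ? end + 1 : NULL;
    }
    return false;
}

/*
 * Hypothetical selection rule: a non-empty include list wins outright;
 * otherwise anything that is not on the exclude list is allowed.
 * include == NULL means "no include list was given".
 */
static bool provider_allowed(const char *name, const char *include,
                             const char *exclude)
{
    assert(name != NULL && exclude != NULL);
    if (include != NULL && include[0] != '\0') {
        return list_contains(include, name);
    }
    return !list_contains(exclude, name);
}
```

With an exclude list of `"shm,sockets"`, a new provider name passes without any whitelist update, while `sockets` is only accepted when it is named in the include list or when the exclude list is set to the empty string.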
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit c5eaa38491)
Corrected the signatures of the collectives used by the Fortran 2008
interface to state correct intent for inout arguments and use the
ASYNCHRONOUS attribute in non-blocking collective calls.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit f750c6932c)
Corrected the signatures of the collectives used by the Fortran 2008
interface to state correct intent for inout arguments and use the
ASYNCHRONOUS attribute in non-blocking collective calls. Also corrected
the C-bindings in Fortran accordingly
Signed-off-by: Philipp Otte <philipp.j.otte@googlemail.com>
(cherry picked from commit e98d794e8b)
Having the "make_manpage.pl" script in the ompi/ tree broke
"./autogen.pl --no-ompi" (specifically: "make distcheck" of --no-ompi
builds would break).
(cherry picked from commit 89773c41)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
gcc complains about ret possibly being used uninitialized. That will
never happen but we should still quiet the warning. This commit sets
ret to a valid value.
Fixes #5513
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The aggregation code in osc/rdma is currently broken and will likely
not be reused. This commit cleans it out.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
- do not generate bindings for pompi_FOO_f symbols
(they are simply not used anywhere)
- move ompi_FOO_f bindings out of mpi_f08.mod into
ompi_mpifh_bindings.mod that is only used at build time
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@c6070fd2e0)
We previously needed to have empty targets because AM couldn't handle
having an AM_CONDITIONAL with targets in the "if" statement but not in the
"else". :-(
That now appears to be an old automake bug that has been fixed,
so clean up some Makefile.am files.
Thanks Jeff for the pointer.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@6e04b2a66a)
We were assuming that the array_of_indices has the same size as the
number of requests (incount), instead of the number of actually
active requests. While the patch is trivial, the question of the
size of the array_of_indices should be clarified in the MPI Forum.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a5fbfa476a)
The parameter passed to NBC_Return_handle() was incorrectly cast
and not dereferenced.
Thanks Yossi for the bug report.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit open-mpi/ompi@8b51862fb2)
We need to return a persistent request.
`ompi_request_empty` is not a persistent request.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit 69901a5156)
As agreed on #4574, were removed in past release branches
to avoid performance impacts in the default values for
some parameters.
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
- added a synonym to the common ucx variables so that
they can be printed by opal_info -a
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit e00f7a68ba)
Since progress entries array is globally allocated, it is susceptible
to race conditions when using multi-threaded applications. Allocating it
on the stack resolves any potential races as it is thread local by default.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit ed2343034d)
file_delete triggers underneath the hood the full component selection
logic, since we do not have a file handle, just a file name.
As part of the selection logic, we have to however initiate the
framework-open of the fs component in case of ompio, since ompio
will call the delete function of the selected fs component, which
is based on the file system where the file is located.
This was not handled correctly so far. The problem however only
shows up if the first I/O operation to be executed is a file_delete,
otherwise the file_open will lead to the correct opening and initialization
of the fs framework. This commit ensures that we do the right thing
even if file_delete is the first file I/O operation in the application.
Fixes issue #5611
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
If an application opens a file for reading from multiple processes
using MPI_COMM_SELF (or another communicator that has distinct
process groups but the same comm-id, as can happen as the result
of comm_split), the naming chosen for the lockedfile or the mmapped
file used by the sharedfp/sm component would collide. This patch
ensures that the filename is different by integrating the process id
of rank 0 for each sub-communicator.
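A minimal sketch of the naming idea, assuming a hypothetical helper and a made-up name layout (the real sharedfp/sm file names differ). It only shows that two sub-communicators with the same comm-id get different names once the pid of their rank 0 is part of the name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a name for the backing file. All parameter names and the name
 * layout are hypothetical; the point is only that the pid of rank 0 is
 * part of the name.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int sm_backing_name(char *buf, size_t len, const char *base,
                           unsigned comm_id, long rank0_pid)
{
    assert(buf != NULL && base != NULL && strlen(base) > 0);
    int n = snprintf(buf, len, "%s_cid-%u-%ld.sm", base, comm_id, rank0_pid);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Each process obtains the pid from rank 0 of its own communicator, so every member of one sub-communicator computes the same name, and different sub-communicators compute different ones.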
This fixes one aspect of the problem reported in github issue 5593
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the sharedfp component has to be selected and opened before
we set the default file view during file_open. Otherwise
there is a spurious error message from the sharefp_file_seek
operation that is called during the file_set_view.
Fixes Issue #5560
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
In the cleanup phase, it is possible for PtlMEUnlink() to return
PTL_IN_USE if the NIC is not done with the ME. This should not
be considered an error. This commit adds a retry loop around
PtlMEUnlink().
In some cases, the return value of PtlMEUnlink() and PtlCTFree()
was not checked at all. Check them with the same retry loop as
above.
Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
(cherry picked from commit f3f2a826b4)
-Updated blocking send to directly call functionality and
set completion events expected to 0 initially. This allows for optimization for
providers that support fi_tinject up to larger sizes. This also reduces
latency on running the OFI mtl with smaller sizes without requiring
calls to progress given fi_tinject is required to complete the messaging
before returning and will not create any events in the Completion Queue.
-Updated non-blocking send to directly call fi_tsend and avoid calling
fi_tinject as the functionality should not wait on completions. This
resolves a bug where applications calling MPI_Isend can overrun the
TX buffer with small (inject) messages causing a deadlock. In addition
this improves performance in message rates by preventing
waiting on any size message to complete in non-blocking send messages.
-Created common ompi_mtl_ofi_ssend_recv function to post the ssend recv
which is common between isend and send code paths.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 7dc8c8ba3f)
When romio314 was first pulled in an extra patch was applied to it, see commit
92f6c7c1e2. Most of that patch is already present
in vanilla romio321, but the fix for MPIO_DATATYPE_ISCOMMITTED() isn't.
If that macro doesn't set err_ then some paths end up with a variable being used
uninitialized. In particular you can trace through romio321/romio/mpi-io/read.c
to see what happens with error_code. It's an uninitialized stack variable that goes
through three MPIO_CHECK_* macros none of which set it. The macros consistently set
error_code to a failure if they see something wrong, but they don't consistently
set it to success when things are fine.
And then in the last macro MPIO_CHECK_DATATYPE it tries to look at the value
of error_code that was never set.
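The difference between the two macro styles can be sketched as follows. The macro and constant names are hypothetical stand-ins, not the romio macros themselves.

```c
#include <assert.h>

#define FAKE_SUCCESS  0
#define FAKE_ERR_TYPE 1

/* Inconsistent form: only the failure path writes to err_. */
#define CHECK_COMMITTED_FAILURE_ONLY(committed_, err_) \
    do {                                               \
        if (!(committed_)) {                           \
            (err_) = FAKE_ERR_TYPE;                    \
        }                                              \
    } while (0)

/* Consistent form: err_ is written on every path. */
#define CHECK_COMMITTED(committed_, err_)                     \
    do {                                                      \
        (err_) = (committed_) ? FAKE_SUCCESS : FAKE_ERR_TYPE; \
    } while (0)

static int check_datatype(int committed)
{
    int error_code;   /* deliberately not initialized, as in the report */

    CHECK_COMMITTED(committed, error_code);
    /* Safe to read: the macro above assigned it on both paths. */
    assert(FAKE_SUCCESS == error_code || FAKE_ERR_TYPE == error_code);
    return error_code;
}
```

Swapping in `CHECK_COMMITTED_FAILURE_ONLY` would leave `error_code` indeterminate on the success path, which is the situation described above.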
Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit f413ef6b14)
- If a message for a recv that is being cancelled gets completed after
the call to fi_cancel, then the OFI mtl will enter a deadlock state
waiting for ofi_req->super.ompi_req->req_status._cancelled which will
never happen since the recv was successfully finished.
- To resolve this issue, the OFI mtl now checks ofi_req->req_started
to see if the request has been started within the loop waiting for the
event to be cancelled. If the request is being completed, then the loop
is broken and fi_cancel exits setting
ofi_req->super.ompi_req->req_status._cancelled = false;
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 767135c580)
- used C99-style object initialization to avoid mixing up values
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 2806504290)
Calling MPI_Allgather with the sendbuf and sendtype parameters equal to MPI_IN_PLACE and NULL, respectively, causes a segmentation fault.
The problem is that sendtype is used even when sendbuf is MPI_IN_PLACE. According to the standard, however, the sendtype and sendcount parameters should be ignored in this case.
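A minimal sketch of the guard, without MPI itself: `FAKE_IN_PLACE` and `struct fake_type` are stand-ins invented for this example, and the function is not the actual collective code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for MPI_IN_PLACE: a unique sentinel address. */
static char in_place_marker;
#define FAKE_IN_PLACE ((const void *)&in_place_marker)

/* Stand-in for a datatype object; only its size matters here. */
struct fake_type { size_t size; };

/*
 * Bytes each rank contributes. With the in-place sentinel the send
 * arguments are ignored, so sendtype may be NULL and is not touched.
 */
static size_t contribution_bytes(const void *sendbuf, int sendcount,
                                 const struct fake_type *sendtype,
                                 int recvcount,
                                 const struct fake_type *recvtype)
{
    assert(recvtype != NULL);
    if (FAKE_IN_PLACE == sendbuf) {
        return (size_t)recvcount * recvtype->size;
    }
    assert(sendtype != NULL);
    return (size_t)sendcount * sendtype->size;
}
```

The key point is the order of the checks: the sentinel comparison happens before anything dereferences `sendtype`.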
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
(cherry picked from commit 540c2d1)
- Added support in MTL_OFI_RETRY_UNTIL_DONE to handle -FI_EAGAIN
from the provider and correctly attempt to progress the OFI Completion
queue by calling ompi_mtl_ofi_progress.
- If events were pending that blocked OFI operations from being enqueued
they will be completed and the OFI operation will be retried once
ompi_mtl_ofi_progress has successfully completed.
- Updated MTL_OFI_RETRY_UNTIL_DONE to take a RETURN variable instead of
requiring the existence of a "ret" variable to pass back the return
value from completing the OFI operation.
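The retry pattern from the three points above can be sketched like this. `fake_send`, `fake_progress`, `FAKE_EAGAIN` and the macro name are assumptions standing in for the libfabric call, `ompi_mtl_ofi_progress` and `-FI_EAGAIN`; this is not the real `MTL_OFI_RETRY_UNTIL_DONE` definition.

```c
#include <assert.h>

#define FAKE_EAGAIN (-11)   /* stand-in for -FI_EAGAIN */

static int queue_slots;      /* free slots in a mock transmit queue */
static int progress_calls;   /* how often the mock progress ran */

/* Mock progress: each call drains one pending completion, freeing a slot. */
static void fake_progress(void)
{
    progress_calls++;
    queue_slots++;
}

/* Mock send: fails with FAKE_EAGAIN while the queue has no free slot. */
static int fake_send(void)
{
    assert(queue_slots >= 0);
    if (0 == queue_slots) {
        return FAKE_EAGAIN;
    }
    queue_slots--;
    return 0;
}

/*
 * Retry FUNC until it stops returning FAKE_EAGAIN, driving progress in
 * between. The result goes into the caller-supplied RETURN variable
 * rather than an implicit "ret".
 */
#define RETRY_UNTIL_DONE(FUNC, RETURN)       \
    do {                                     \
        do {                                 \
            (RETURN) = (FUNC);               \
            if (FAKE_EAGAIN == (RETURN)) {   \
                fake_progress();             \
            }                                \
        } while (FAKE_EAGAIN == (RETURN));   \
    } while (0)
```

Passing the result variable explicitly means the macro can be used in functions that name their status variable differently, or that need to keep two results apart.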
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit d4f408a7f8)
- in some cases a persistent request was deleted during the completion
callback, which caused a double free of the linked UCX request (assert
in debug build or hang in release build)
- the UCX request is now freed prior to the completion callback
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit 6fe0a73861)
- worker progress was missed on startup, which caused a hang
on one of the ranks
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit a081fba046)
OMPIO now uses the correct delete function depending on the fs
mca_common_ompio_file_delete now works this way instead
of calling POSIX unlink:
- create a minimal file handle with the given file name
- select the best fs component using this file handle
- call the component-specific file delete function
Signed-off-by: Gaëtan Bossu <gbossu@ddn.com>
It is needed because the fs components might be queried due to an MPI_File_delete call.
In this case, we don't have a communicator value.
Signed-off-by: Gaëtan Bossu <gbossu@ddn.com>
instead of invoking ompi_request_test_all(), that will end up
calling opal_progress() recursively, manually check the status
of the requests.
the same method is used in ompi_comm_request_progress()
Refs open-mpi/ompi#3901
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Blocking `MPI_SCATTER` and `MPI_SCATTERV` were fixed in 506d0e96f4
but nonblocking `MPI_ISCATTER` and `MPI_ISCATTERV` were not fixed yet.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Move memory hooks init (for request based operation) in osc ucx to window
creation time, to avoid performance issue in MPI initialization.
Signed-off-by: Xin Zhao <xinz@mellanox.com>
This commit fixes a crash that occurs when using btl/vader as an RDMA
btl. This btl supports using CPU atomics and does not support using
the btl for self communication so we must use the local memory
optimizations in osc/rdma.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Data transferred by `MPI_BSEND` may corrupt if all of the following
conditions are met.
- The message size is less than the eager limit.
- The `btl_alloc` function in the BTL interface returns `NULL`
for some reason.
- The MPI program overwrites the send buffer after `MPI_BSEND`
returns.
The problem lies in how the ob1 PML handles a pending send request.
The `mca_pml_ob1_send_request_start_copy` function returns
`OMPI_ERR_OUT_OF_RESOURCE` if `mca_bml_base_alloc` function returns
`des = NULL`. In this case, the send request is added to the
`send_pending` list and `MPI_BSEND` returns immediately. Next time
the `mca_pml_ob1_send_request_start_copy` function tries sending,
the user buffer may have been overwritten by the MPI program.
Call hierarchy of `MPI_BSEND`:
```
MPI_Bsend
  mca_pml_ob1_send
    if (MCA_PML_BASE_SEND_BUFFERED == sendmode)
      mca_pml_ob1_isend
        MCA_PML_OB1_SEND_REQUEST_START_W_SEQ
          mca_pml_ob1_send_request_start_seq
            mca_pml_ob1_send_request_start_btl
              if (size <= eager_limit)
                if (req_send_mode == MCA_PML_BASE_SEND_BUFFERED)
                  mca_pml_ob1_send_request_start_copy
                    mca_bml_base_alloc
                      btl_alloc
            if (OMPI_ERR_OUT_OF_RESOURCE == rc)
              add_request_to_send_pending
      ompi_request_free
```
To solve this problem, we should save the data to the buffer
attached by `MPI_BUFFER_ATTACH` before leaving `MPI_BSEND`.
This problem was introduced by ob1 optimization (commits 2b57f422
and a06e491c) in v1.8 series.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This commit fixes an issue identified by MTT where we can have two
different sets of processes on the same node creating a shared memory
window with communicators sharing the same CID. To avoid this issue
the temporary filename now includes the creating process's vpid.
References #5363
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Use internal pack/unpack subroutines that operate on MPI_Aint instead of int
and hence solve some integer overflows.
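A minimal sketch of the kind of overflow involved, assuming `int64_t` as a stand-in for `MPI_Aint` and hypothetical helper names. It is not the pack/unpack code itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Size in bytes of `count` elements of `extent` bytes each, computed in
 * a 64-bit type (stand-in for MPI_Aint) instead of int.
 */
static int64_t packed_size(int count, int extent)
{
    assert(count >= 0 && extent >= 0);
    return (int64_t)count * (int64_t)extent;
}

/* True when the same product would not fit into a plain int. */
static int overflows_int(int count, int extent)
{
    return packed_size(count, extent) > (int64_t)INT_MAX;
}
```

For example, 600 million elements of 8 bytes each need 4.8 billion bytes, which is representable in the wide type but not in a 32-bit `int`.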
Thanks Clyde Stanfield for reporting this issue.
Refs open-mpi/ompi#5383
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The current implementation of `coll/base/MPI_Scatter` is based on an in-order binomial tree. This tree is right-skewed and provides good performance for an MPI_Gather operation. For an MPI_Scatter operation, however, a left-skewed binomial tree is more effective.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
-Updated the design for sync send MPI calls to use 2 protocol bits for
denoting "sync_send" or "sync_send_ack".
-"Sync_send" is added to the send tag only and is masked out in receives
such that it can be read by the original Recv posted in the send/recv
operation.
-"Sync_send_ack" is sent from the recv callback to the send side. This 0
byte send does not generate a completion entry and instead sends the
message and immediately completes the opal completion in the recv.
-Tag formats ofi_tag_1 and ofi_tag_2 have been updated to include 2
more tag bits per format type due to the reduced protocol bits required
by OMPI.
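A minimal sketch of the two protocol bits. The bit positions (the two highest bits of a 64-bit tag) and all names are assumptions chosen for the example; the actual `ofi_tag_1` and `ofi_tag_2` layouts are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the two highest bits of a 64-bit tag. */
#define TAG_SYNC_SEND      (UINT64_C(1) << 62)
#define TAG_SYNC_SEND_ACK  (UINT64_C(1) << 63)
#define TAG_PROTO_MASK     (TAG_SYNC_SEND | TAG_SYNC_SEND_ACK)

/* Sender side: mark a tag as belonging to a synchronous send. */
static uint64_t tag_for_sync_send(uint64_t match_bits)
{
    assert((match_bits & TAG_PROTO_MASK) == 0);
    return match_bits | TAG_SYNC_SEND;
}

/* Receiver side: compare tags while ignoring the sync_send bit. */
static int tags_match(uint64_t posted, uint64_t incoming)
{
    return ((posted ^ incoming) & ~TAG_SYNC_SEND) == 0;
}

/* Lets the receive callback decide whether an ack has to be sent. */
static int is_sync_send(uint64_t tag)
{
    return (tag & TAG_SYNC_SEND) != 0;
}
```

Because the receive ignores the sync bit when matching, an ordinary posted receive matches both plain and synchronous sends, and the callback inspects the bit afterwards.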
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
Calling MPI_Gather with the sendbuf and sendtype parameters equal to MPI_IN_PLACE and NULL, respectively, causes a segmentation fault in the root process.
The problem is that sendtype is used even when sendbuf is MPI_IN_PLACE. According to the standard (page 150, line 37), however, the sendtype and sendcount parameters should be ignored in this case.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
- added common logging infrastructure for all
UCX modules
- all UCX modules are switched to new infra
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
- some common functionality of del_procs calls is moved into
mca_common module
- blocking ucp_put call is replaced by non-blocking routine
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
moving some code from fs/ufs into fs/base. The benefit of this approach is
that fs components that are fundamentally based on posix I/O (and only differ
in some non-posix functionality such as setting stripe size, or which hints
are being supported) can avoid having to replicate the same code over and
over again. First beneficiary is the lustre fs component, but more
are to follow soon.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Intentionally do not mark some MPI-1 function pointer typedefs as
`__mpi_interface_removed__` because we have to use them in prototyping
some MPI-1 functions when `--enable-mpi1-compatibility` is used.
Fixes #5357.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes a typo where a bcast is used instead of the intended
collective (barrier).
References #5262
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
so latest ROM-IO can be used with Open MPI.
Note this first and naive implementation does not use the wait_fn callback.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The osc/rdma module did not wait for all pending atomics to complete
before tearing down. This could lead to weird issues as the target
location may no longer be registered or allocated.
This commit also fixes an offset calculation issue in
ompi_osc_get_data_blocking ().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
a bug sneaked into the construction of the list of aggregator
processes when using the fileview based grouping options
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
handle the situation where the user requests a non-zero amount
of data but has a zero-size fileview. My instinct would have been
to return an error code, but according to the test that I used
it should be MPI_SUCCESS and zero bytes. It is definitely better
than segfaulting :-)
This makes another test from the IBM testsuite pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the seek offset calculation did not treat the offset as a multiple
of the etype provided. Fixing this makes some more ibm tests pass.
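The corrected calculation amounts to a single multiplication. A minimal sketch, with `int64_t` standing in for `MPI_Offset` and a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a seek offset given in etype units to bytes. Names are
 * illustrative; int64_t stands in for MPI_Offset.
 */
static int64_t seek_offset_bytes(int64_t offset_in_etypes, int64_t etype_size)
{
    assert(etype_size > 0);
    return offset_in_etypes * etype_size;
}
```

With an 8-byte etype, seeking to offset 10 therefore moves 80 bytes into the view, not 10.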
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
use an allocator to manage temporary buffers when copying
unmanaged data from GPU buffer to host. This is necessary,
since the buffers have to be pinned for better performance,
which is an expensive operation.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the two_phase component does not work with some collective I/O
operations on CUDA buffers due to the data sieving (i.e.
both read and write operations) executed on some buffers, which are
not anticipated in the GPU buffer management of the code.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this commit adds the initial support for cuda buffers in ompio, for blocking
and non-blocking individual read and write operations.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
two minor updates:
- in all components: use the fh->f_bytes_per_agg value
(which might have been set by an info object) instead
of re-reading the mca parameter
- vulcan and dynamic_gen2: replace one allgather operation
by an allreduce, since it is used to determine the sum
of an array.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit attempts to update the romio io component to not use
functions removed in MPI-3.0 (2012). This is a first cut and will
probably need to be reviewed for correctness.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(back-ported from commit open-mpi/ompi@84765001aa)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
romio assumes that all predefined datatypes are contiguous. Because of
the (terribly named) composed datatypes MPI_SHORT_INT, MPI_DOUBLE_INT,
MPI_LONG_INT, etc this is an incorrect assumption. The simplest way to
fix this is to override the MPI_Type_get_envelope and
MPI_Type_get_contents calls with calls that will work on these
datatypes. Note that not all calls to these MPI functions are
replaced, only the ones used when flattening a non-contiguous
datatype.
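Why a pair type such as MPI_SHORT_INT is not contiguous can be seen from the C layout it describes. A minimal sketch with a hypothetical struct name; the amount of padding depends on the platform ABI, so the code only computes it and does not assume a value.

```c
#include <assert.h>
#include <stddef.h>

/* C layout that a pair type such as MPI_SHORT_INT describes. */
struct short_int {
    short value;
    int   index;
};

/* Bytes of actual data in one element (padding excluded). */
static size_t short_int_payload(void)
{
    return sizeof(short) + sizeof(int);
}

/* Bytes of padding in one element; non-zero means non-contiguous. */
static size_t short_int_padding(void)
{
    assert(sizeof(struct short_int) >= short_int_payload());
    return sizeof(struct short_int) - short_int_payload();
}
```

On common ABIs the `int` member is aligned to 4 bytes, leaving a gap after the `short`, so code that treats the type as one contiguous run of bytes reads or writes the padding as if it were data.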
References #5009
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(back-ported from commit open-mpi/ompi@4d876ec6fe)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
add support for the info objects cb_buffer_size and collective_buffering.
Also, introduce a new mca parameter that allows to give feedback
on whether an info object is recognized (and honored).
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this component can only be used in very specific scenarios. However, since some file systems do not support file locking and processes might be distributed over multiple nodes (hence the sm sharedfp component is also ineligible), the component might be selected in some scenarios, even if an application does not intend to use shared file pointers.
Since the fseek_shared function is involved as part of the File_set_view operation, only complain about the inability to perform the seek_shared operation if actual shared file pointer operations are being used. This avoids spurious error values being returned.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
this commit revamps the internal operations of the sharedfp components.
Specifically, it is focused around removing the second file_open
operation for shared file pointers. This makes the code more efficient.
Because of that, there is no necessity anymore for the sharedfp_lazy_open
mca parameter.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This function is only used in ompi_spc.c and is hence declared as static.
Remove its prototype from the header file in order to silence compilers, which will typically warn that ompi_spc_get_count() is declared but not defined.
Fixes open-mpi/ompi#5279
Fixes open-mpi/ompi#5273
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Extend number of supported ranks with providers that support
FI_REMOTE_CQ_DATA. Add README file to OFI MTL
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
This code is the implementation of Software-based Performance Counters as described in the paper 'Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.
All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example.
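A minimal sketch of the no-op macro pattern mentioned above. `FAKE_SPC_RECORD`, `fake_spc_record` and `fake_counter` are hypothetical names, and `SOFTWARE_EVENTS_ENABLE` is defined by hand here only so that the enabled branch is exercised.

```c
#include <assert.h>

/* Defined here only so the sketch exercises the "enabled" path;
 * in a real build a configure option would control this. */
#define SOFTWARE_EVENTS_ENABLE 1

static long long fake_counter;   /* hypothetical counter storage */

#ifdef SOFTWARE_EVENTS_ENABLE
static void fake_spc_record(long long value)
{
    assert(value >= 0);
    fake_counter += value;
}
#define FAKE_SPC_RECORD(value) fake_spc_record(value)
#else
/* Compiles to nothing, so call sites cost nothing when disabled. */
#define FAKE_SPC_RECORD(value) ((void)0)
#endif

static void send_bytes(long long nbytes)
{
    /* ... the actual send would happen here ... */
    FAKE_SPC_RECORD(nbytes);
}
```

Call sites stay identical in both configurations; only the macro definition changes, which keeps the instrumented code paths free of `#ifdef` clutter.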
Signed-off-by: David Eberius <deberius@vols.utk.edu>
The `nbc_i*` functions don't start communication, but create a request.
`nbc_*_init` are appropriate names for them.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Persistent operation for `NBC_A2A_DISS` is not supported currently.
Though the algorithm is not selected at all currently, I put an
assertion not to select it by mistake.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
`NBC_Copy` should not be called in `MPI_*_INIT`.
`NBC_Sched_copy` should be called instead.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Because a persistent request does not free its `schedule` object
when the communication completes, the `NBC_Progress` function cannot
determine the completion using `schedule`.
Without this change, a hang occurs when the `NBC_Progress` function
is called recursively through the `NBC_Start_round` function.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Until the MPI Forum decides to add the persistent collective
communication request feature to the MPI Standard, these functions
are supported through MPI extensions with the `MPIX_` prefix.
Only C bindings are supported currently.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Now libnbc COLL supports persistent collectives and all `*_init`
functions of the COLL interface are available. So let's enable the
check of availability of those functions on a communicator creation.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
prepare the upcoming persistent collectives by pre-factoring some code
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
fixup 808c3c62cd9475edd91ecde9d2d53b12e28b2c04
now that we have a shiny new fcoll component, no need
to keep the static component around. No use for it anymore.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
check for pending I/O operations and invalid modes
and return proper error codes before executing MPI_File_sync
makes the e_sync_1 test from the ibm testsuite pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the interface of file_get_type_extent did not check
whether the input datatype is valid or not.
Makes the e_get_type_extend_2 test from the ibm testsuite pass.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
in case the user opened a file using the DELETE_ON_CLOSE flag,
return the error code generated in the delete operation.
Note, that this is however just a partial fix to the e_close_1 test
from the ibm testsuite, since the object destructor that triggers
the file_close function does not have a mechanism right now to recognize
and return an error code.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
in file_get_byte_offset, return an error code if the offset
leads to an invalid position in file.
Makes the e_get_byte_offset_1 test from the ibm testsuite pass.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
and some internal structure elements/components. Along the way,
add support for the cb_nodes Info object.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
since the request code is now being accessed also from the vulcan fcoll
component, the request code was relocated into the common/ompio
directory to avoid ld load problems.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
We introduced a new mca_vulcan parameter that specifies the I/O synchronization
type (async/sync I/O) applied within the collective write operation.
The user can explicitly choose an async or sync write operation, or let
the choice be made automatically.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
For very large offsets, the data chunk size to be written by each aggregator
exceeds the capacity of an integer variable. Besides, some variables were
not large enough to hold intermediate values.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Identify the index of each aggregator process in order to restrict the call to write_init function by the specific aggregator.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Instead of using a temporary buffer and copy data into the temp buffer before sending, use a derived datatype to describe the data that needs to be sent during a cycle in the collective I/O operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
import of the new vulcan component. It is an enhanced version
of the two_phase component, which uses however the ompio internal
codes/loops to assemble the data arrays. It is therefore more in line
with the dynamic and dynamic_gen2 component, and will be easier to
maintain.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
If a subroutine of the Fortran `use-mpi-f08` binding in an MPI extension
has a `LOGICAL` parameter and no `TYPE(MPI_Status)` parameter,
it needs to use the `mpi_ext` module and call its corresponding subroutine
in the `mpif-h` directory, as explained in
`ompi/mpi/fortran/use-mpi-f08/mpi-f-interfaces-bind.h`.
However, as shown in the figure below, the required directories are dependent
on each other, and "Can't open module file" error occurs at build time.
    ompi/mpiext/{extension name}/use-mpi-f08
         A   |
         |   |
         |   V
    ompi/mpi/fortran/use-mpi-f08 <--- ompi/mpi/fortran/mpiext (mpi_ext.mod)
In order to solve this problem, change the configuration and the build order.
- divide Fortran extension directory (`ompi/mpi/fortran/mpiext`)
into the directories for `use-mpi` and for `use-mpi-f08`
- `ompi/mpi/fortran/mpiext-use-mpi` : for `use-mpi` (mpi_ext.mod)
- `ompi/mpi/fortran/mpiext-use-mpi-f08` : for `use-mpi-f08` (mpi_f08_ext.mod)
- change to the following build order about Fortran `use-mpi` and
`use-mpi-f08` bindings in `ompi`
1. mpi_ext bindings of MPI extensions (`mpiext/{extension name}/use-mpi` directory)
2. Fortran use-mpi (`mpi/fortran/use-mpi-[ignore-]tkr` directory)
3. Fortran extension for use-mpi (`mpi/fortran/mpiext-use-mpi` directory)
4. Fortran use-mpi-f08 modules only (`mpi/fortran/use-mpi-f08/mod` directory)
5. mpi_f08_ext bindings of MPI extensions (`mpiext/{extension name}/use-mpi-f08` directory)
6. Fortran use-mpi-f08 (`mpi/fortran/use-mpi-f08` directory)
7. Fortran extension for use-mpi-f08 (`mpi/fortran/mpiext-use-mpi-f08` directory)
Signed-off-by: Kurita, Takehiro <fj6370fp@aa.jp.fujitsu.com>
There was a race condition in 35438ae9b5: if multiple threads invoked
ompi_mpi_init() simultaneously (which could happen from both MPI and
OSHMEM), the code did not catch this condition -- Bad Things would
happen.
Now use an atomic cmp/set to ensure that only one thread is able to
advance ompi_mpi_init from NOT_INITIALIZED to INIT_STARTED.
Additionally, change the prototype of ompi_mpi_init() so that
oshmem_init() can safely invoke ompi_mpi_init() multiple times (as
long as MPI_FINALIZE has not started) without displaying an error. If
multiple threads invoke oshmem_init() simultaneously, one of them will
actually do the initialization, and the rest will loop waiting for it
to complete.
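A minimal sketch of the compare-and-set guard using C11 atomics. The state names and helper functions are assumptions for illustration and are not the symbols used in `ompi_mpi_init()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { ST_NOT_INITIALIZED = 0, ST_INIT_STARTED = 1, ST_INIT_COMPLETED = 2 };

static atomic_int init_state = ST_NOT_INITIALIZED;

/*
 * Returns true for exactly one caller: the one that moves the state
 * from NOT_INITIALIZED to INIT_STARTED. Everyone else gets false and
 * would then wait for (or detect) the ongoing initialization.
 */
static bool try_begin_init(void)
{
    int expected = ST_NOT_INITIALIZED;

    return atomic_compare_exchange_strong(&init_state, &expected,
                                          ST_INIT_STARTED);
}

/* Called by the winning thread once initialization is done. */
static void finish_init(void)
{
    assert(atomic_load(&init_state) == ST_INIT_STARTED);
    atomic_store(&init_state, ST_INIT_COMPLETED);
}
```

Threads that lose the exchange can poll the state until it reaches the completed value, which corresponds to the "loop waiting for it to complete" behavior described above.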
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Implements butterfly algorithm for MPI_Reduce_scatter.
The algorithm can be used both by commutative and non-commutative operations, for power-of-two and non-power-of-two number of processes.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using:
* autogen --no-orte
* with external libevent, external hwloc, and external PMIx master
* configuring PMIx master with the same libevent and hwloc
* execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun
Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)
This checkin mainly concerns our internal info keys that are registering
for callbacks via opal_infosubscribe_subscribe(). Those keys need to have
an extra __IN_<key>/val stored to preserve their pre-callback value. So
that means our internal keys are limited to 5 chars shorter than the usual
key length limit.
Previously, the code would have been silently inactive if an overly long key
happened to come in; now it warns and also uses snprintf() to avoid compiler warnings.
I'm also making the top-level MPI_Info_set warn if the user uses our reserved
"__IN_" prefix. I had wanted the feature to be more invisible than that, but
it would require a more sophisticated approach to change that.
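
A minimal sketch of the length check and prefix handling, assuming a hypothetical `MAX_KEY_LEN` of 36 and hypothetical helper names (`make_shadow_key()`, `uses_reserved_prefix()`); only the `__IN_` prefix and the use of snprintf() come from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KEY_LEN 36        /* hypothetical limit, for illustration only */
#define IN_PREFIX   "__IN_"   /* reserved prefix named in the commit text */

/* Builds "__IN_<key>" into out, which must hold MAX_KEY_LEN + 1 bytes.
 * Returns 0 on success, or -1 (after printing a warning) if the
 * prefixed key would exceed MAX_KEY_LEN. */
static int make_shadow_key(const char *key, char out[MAX_KEY_LEN + 1])
{
    assert(key != NULL && out != NULL);
    size_t prefix_len = strlen(IN_PREFIX);   /* 5 characters */
    if (strlen(key) + prefix_len > MAX_KEY_LEN) {
        fprintf(stderr, "warning: key \"%s\" is too long to shadow\n", key);
        return -1;
    }
    snprintf(out, MAX_KEY_LEN + 1, "%s%s", IN_PREFIX, key);
    return 0;
}

/* Returns 1 if a user-supplied key starts with the reserved prefix. */
static int uses_reserved_prefix(const char *key)
{
    return strncmp(key, IN_PREFIX, strlen(IN_PREFIX)) == 0;
}
```

With these assumptions, a key of up to 31 characters can be shadowed, which matches the "5 chars shorter" limit stated above.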
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Per MPI-3.1:8.7.1 p361:11-13, it's valid for MPI_FINALIZED to be
invoked during an attribute destruction callback (e.g., during the
destruction of keyvals on MPI_COMM_SELF during the very beginning of
MPI_FINALIZE). In such cases, MPI_FINALIZED must return "false".
Prior to this commit, we hung in FINALIZED if it was invoked during
a COMM_SELF attribute destruction callback in FINALIZE. See
https://github.com/open-mpi/ompi/issues/5084.
This commit converts the MPI_INITIALIZED / MPI_FINALIZED
infrastructure to use a single enum (ompi_mpi_state, set atomically)
to represent the state of MPI:
- not initialized
- init started
- init completed
- finalize started
- finalize past COMM_SELF destruction
- finalize completed
The "finalize past COMM_SELF destruction" state is what allows us to
return "false" from MPI_FINALIZED before COMM_SELF has been fully
destroyed / all attribute callbacks have been invoked.
Since this state is checked at nearly every MPI API call (to see if
we're outside of the INIT/FINALIZE epoch), care was taken to use
atomics to *set* the ompi_mpi_state value in ompi_mpi_init() and
ompi_mpi_finalize(), but performance-critical code paths can simply
read the variable without needing to use a slow call to an
opal_atomic_*() function.
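
The single-enum scheme can be sketched as follows. All identifiers here are hypothetical stand-ins, and the comparisons used for the two queries are an assumption based on the description above rather than a copy of the real code. A relaxed C11 load stands in for the "simply read the variable" fast path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; the ordering mirrors the list in the commit text. */
typedef enum {
    STATE_NOT_INITIALIZED = 0,
    STATE_INIT_STARTED,
    STATE_INIT_COMPLETED,
    STATE_FINALIZE_STARTED,
    STATE_FINALIZE_PAST_COMM_SELF_DESTRUCT,
    STATE_FINALIZE_COMPLETED
} mpi_state_t;

static _Atomic int mpi_state = STATE_NOT_INITIALIZED;

/* Writers (init/finalize) set the state atomically and only move forward. */
static void advance_state(mpi_state_t next)
{
    assert((int)next > atomic_load(&mpi_state));
    atomic_store(&mpi_state, (int)next);
}

/* Readers on hot paths use a cheap relaxed load. */
static bool is_initialized(void)
{
    return atomic_load_explicit(&mpi_state, memory_order_relaxed)
           >= STATE_INIT_COMPLETED;
}

/* Reports "false" until COMM_SELF destruction is over, so attribute
 * callbacks that run during that phase see "not finalized". */
static bool is_finalized(void)
{
    return atomic_load_explicit(&mpi_state, memory_order_relaxed)
           >= STATE_FINALIZE_PAST_COMM_SELF_DESTRUCT;
}
```

Because the states are ordered, each query reduces to one integer comparison against a threshold state.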
Thanks to @AndrewGaspar for reporting the issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
For very large offsets, some variables used in the fcoll/dynamic_gen2
code base were, under certain circumstances, not large enough to hold
intermediate values. This issue was detected in the vulcan component
but could happen in the dynamic_gen2 component as well.
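
The commit text does not say which variables were widened, so the following is only a generic illustration of the failure mode: two values that each fit in 32 bits can have a product that does not. `stripe_offset()` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of a given stripe.
 * Both inputs fit in 32 bits, but their product may not, so each
 * operand is widened to 64 bits before the multiplication. */
static int64_t stripe_offset(int32_t stripe_index, int32_t stripe_size)
{
    assert(stripe_index >= 0 && stripe_size >= 0);
    return (int64_t)stripe_index * (int64_t)stripe_size;
}
```

Casting only the result, as in `(int64_t)(stripe_index * stripe_size)`, would not help, because the multiplication would already have been carried out in 32 bits.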
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Append MPI1-compatible source files instead of redefining all the source files.
Fix a typo from open-mpi/ompi@89da9651bb.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit adds a new configure option: --enable-mpi1-compat. Without
this option we will no longer provide APIs, typedefs, and defines that
were removed from the standard in MPI-3.0. This option will exist for
one major release (Open MPI v4.x.x) and then the option and associated
code will be removed in Open MPI v5.x.x. Open MPI has already
internally prepared for this change. Please prepare your codes
accordingly.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new MCA variable to set the location of the backing
store: osc_sm_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new MCA variable to set the location of the backing
store: osc_rdma_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per discussion at
https://github.com/open-mpi/ompi/issues/2614#issuecomment-392815654,
do not allow for selection of the OSC PT2PT when creating an MPI RMA
window when THREAD_MULTIPLE is active. Print a helpful message and
return a not-supported error.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit d0ffd660841623c02d1dfa3151e7f7afd3327698)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Implements butterfly algorithm for MPI_Reduce_scatter_block.
The algorithm can be used both by commutative and non-commutative
operations, for power-of-two and non-power-of-two number of processes.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
get_dynamic_win_info() at osc_ucx_comm.c
In file included from /usr/include/string.h:494:0,
from ../../../../ompi/info/info.h:29,
from ../../../../ompi/mca/osc/base/base.h:24,
from osc_ucx_comm.c:13:
In function 'memcpy',
inlined from 'get_dynamic_win_info' at osc_ucx_comm.c:359:5,
inlined from 'ompi_osc_ucx_put' at osc_ucx_comm.c:401:18:
/usr/include/bits/string_fortified.h:34:10: warning: '__builtin___memcpy_chk' writing 8 bytes into a region of size 4 overflows the destination [-Wstringop-overflow=]
return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is caused by a type size mismatch in a call to memcpy.
This fix corrects the type definition of the win_count variable.
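
A minimal sketch of this class of fix, assuming the copied field is 8 bytes wide, as the warning suggests. `read_win_count()` and the `remote_state` buffer are hypothetical stand-ins, not the code in osc_ucx_comm.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in: the first 8 bytes of remote_state hold a
 * 64-bit window count. */
static uint64_t read_win_count(const unsigned char *remote_state)
{
    assert(remote_state != NULL);
    /* The destination type matches the 8-byte field; a 4-byte int here
     * would be overrun by the copy, as in the warning above. */
    uint64_t win_count;
    memcpy(&win_count, remote_state, sizeof(win_count));
    return win_count;
}
```

Tying the copy length to `sizeof` of the destination keeps the two from drifting apart if the type changes again.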
Signed-off-by: John Jolly <jjolly@suse.com>
Remove the MXM MTL, which has been deprecated in favor of
the Yalla PML. This was discussed at the last developers meeting
and somehow I ended up with the action item to do the removal.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The Java configury is split into two parts:
1. Determine if we want MPI Java bindings.
2. Find the Java compiler (and related).
This commit does a few things:
- Move the "Find the Java compiler" step from OPAL to OMPI (because
there is no Java in OPAL, and there doesn't appear to be any
imminent danger that there will be).
- As a direct consequence, remove the --enable-java CLI option
(--enable-mpi-java still remains). Enabling the MPI Java bindings
and enabling Java are now considered the same thing (since there
is no Java elsewhere in the code base, the difference was
meaningless).
- Only invoke the "Find the Java compiler" step if we actually want
the MPI Java bindings.
- A few miscellaneous Java-related cleanups in configury (e.g., change
testing "$foo" == "1" to $foo -eq 1, etc.).
This commit is mostly s/opal/ompi/gi in many places in configury and
shifting code around. But it looks bigger than it actually is for
two reasons:
1. Some files were renamed:
* ompi_setup_java.m4 -> ompi_setup_mpi_java.m4 (setup MPI Java bindings)
* opal_setup_java.m4 -> ompi_setup_java.m4 (setup Java compiler)
2. Indenting level changed in (the new) ompi_setup_java.m4.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>