Per MPI-3.1:8.7.1 p361:11-13, it's valid for MPI_FINALIZED to be
invoked during an attribute destruction callback (e.g., during the
destruction of keyvals on MPI_COMM_SELF during the very beginning of
MPI_FINALIZE). In such cases, MPI_FINALIZED must return "false".
Prior to this commit, we hung in FINALIZED if it were invoked during
a COMM_SELF attribute destruction callback in FINALIZE. See
https://github.com/open-mpi/ompi/issues/5084.
This commit converts the MPI_INITIALIZED / MPI_FINALIZED
infrastructure to use a single enum (ompi_mpi_state, set atomically)
to represent the state of MPI:
- not initialized
- init started
- init completed
- finalize started
- finalize past COMM_SELF destruction
- finalize completed
The "finalize past COMM_SELF destruction" state is what allows us to
return "false" from MPI_FINALIZED before COMM_SELF has been fully
destroyed / all attribute callbacks have been invoked.
Since this state is checked at nearly every MPI API call (to see if
we're outside of the INIT/FINALIZE epoch), care was taken to use
atomics to *set* the ompi_mpi_state value in ompi_mpi_init() and
ompi_mpi_finalize(), but performance-critical code paths can simply
read the variable without needing to use a slow call to an
opal_atomic_*() function.
Thanks to @AndrewGaspar for reporting the issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
for very large offsets, ome ariables used in the fcoll/dynamic_gen2
code base were under certain circumstances not large enough to hold
intermediate values. This issue was more detected in the vulcan component
but could happen in the dynamic_gen2 component as well.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
appends MPI1 compatible source files instead of redefining all the source files
fix a typo from open-mpi/ompi@89da9651bb
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit adds a new configure option: --enable-mpi1-compat. Without
this option we will no longer provide APIs, typedefs, and defines that
were removed from the standard in MPI-3.0. This option will exist for
one major release (Open MPI v4.x.x) and then the option and associated
code will be removed in Open MPI v5.x.x. Open MPI has already
internally prepared for this change. Please prepare your codes
accordingly.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new MCA variable to set the location of the backing
store: osc_sm_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new MCA variable to set the location of the backing
store: osc_rdma_backing_directory. The default on Linux has been
changed to use /dev/shm to improve performance in cases where /tmp is
not a tmpfs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per discussion at
https://github.com/open-mpi/ompi/issues/2614#issuecomment-392815654,
do not allow for selection of the OSC PT2PT when creating an MPI RMA
window when THREAD_MULTIPLE is active. Print a helpful message and
return a not-supported error.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit d0ffd660841623c02d1dfa3151e7f7afd3327698)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Implements butterfly algorithm for MPI_Reduce_scatter_block.
The algorithm can be used both by commutative and non-commutative
operations, for power-of-two and non-power-of-two number of processes.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
get_dynamic_win_info() at osc_ucx_comm.c
In file included from /usr/include/string.h:494:0,
from ../../../../ompi/info/info.h:29,
from ../../../../ompi/mca/osc/base/base.h:24,
from osc_ucx_comm.c:13:
In function 'memcpy',
inlined from 'get_dynamic_win_info' at osc_ucx_comm.c:359:5,
inlined from 'ompi_osc_ucx_put' at osc_ucx_comm.c:401:18:
/usr/include/bits/string_fortified.h:34:10: warning: '__builtin___memcpy_chk' writing 8 bytes into a region of size 4 overflows the destination [-Wstringop-overflow=]
return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is caused by a type size mismatch in a call to memcpy
This fix corrects the type definition of the win_count variable.
Signed-off-by: John Jolly <jjolly@suse.com>
Remove the MXM MTL, which has been deprecated in preference for
the Yalla PML. This was discussed at the last developers meeting
and somehow I ended up with the action item to do the removal.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The Java configury is split into two parts:
1. Determine if we want MPI Java bindings.
2. Find the Java compiler (and related).
This commit does a few things:
- Move the "Find the Java compiler" step from OPAL to OMPI (because
there is no Java in OPAL, and there doesn't appear to be any
immanent danger that there will be).
- As a direct consequence, remove the --enable-java CLI option
(--enable-mpi-java still remains). Enabling the MPI Java bindings
and enabling Java are now considered the same thing (since there
is no Java elsewhere in the code base, the different was
meaningless).
- Only invoke the "Find the Java compiler" step if we actually want
the MPI Java bindings.
- A few miscellaneous Java-related cleanups in configury (E.g., change
testing "$foo" == "1" to $foo -eq 1, etc.
This commit is mostly s/opal/ompi/gi in many places in configury and
shifting code around. But it looks bigger than it actually is because
of two reasons:
1. Some files were renamed:
* ompi_setup_java.m4 -> ompi_setup_mpi_java.m4 (setup MPI Java bindings)
* opal_setup_java.m4 -> ompi_setup_java.m4 (setup Java compiler)
2. Indenting level changed in (the new) ompi_setup_java.m4.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
- rename ompi_coll_base_reduce_scatter_block_basic to
more self descriptive ompi_coll_base_reduce_scatter_block_basic_linear
- fix the description of the coll_tuned_reduce_scatter_block_algorithm
MCA param
this fixes and documents previous open-mpi/ompi@0e8b35b615
MPI_Reduce_scatter_block used to be implemented by the coll/basic module only.
A new algo (recursive doubling) was recently introduced and can be used via the coll/tuned module,
but we never intended to make it the default algo.
In order to "restore" the previous default, the initial algo was moved from coll/basic to coll/base,
and is now used by default by coll/tuned.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The MPI spec defines that the "mpi" and "mpi_f08" module Fortran
bindings support passing by parameters by name. Hence, we need to use
the MPI-spec-defined parameter names ("dummy variables", in Fortran
parlance) for the "mpi" and "mpi_f08" modules.
Specifically, Fortran allows calls to procedures to be written with
keyword arguments, e.g., "call mpi_sizeof(x=x,size=rsize,ierror=ier)"
An "explicit interface" for the procedure must be in scope for this to
be allowed in a Fortran program unit. Therefore, the explicit
interface blocks we provide in the "mpi" and "mpi_f08" modules must
match the MPI published standard, including the names of the dummy
variables (i.e., parameter names), as that is how Fortran programs may
call them.
Note that we didn't find this issue previously because even though the
MPI spec *allows* for name-based parameter passing, not many people
actually use it. I suspect that we might have some more incorrect
parameter names -- we should probably do a full "mpi" / "mpi_f08"
module parameter name audit someday.
Thanks to Themos Tsikas for reporting the issue and supplying the
initial fix.
Signed-off-by: themos.tsikas@nag.co.uk
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The files array was also storing $phase.prof. This was leading to
$phase.prof's output getting dumped into itself again and again. Updated
code to initialise files array with files other than $phase.prof.
Signed-off-by: Ninad Prabhukhanolkar <ninadchess96@gmail.com>
Set mpi.lo dependencies outside of AM conditionals. Per
https://github.com/open-mpi/ompi/issues/5085, make mpi.lo depend on
whatever we decide the sizeof source files are (which may be empty, if
this compiler does not support the Right Stuff for MPI_SIZEOF, or it
may be mpi-tkr-sizeof.[h|f90]).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
whenever possible.
Add the missing OMPI_FORTRAN_BUILD_SIZEOF macro to Fortran
and add a missing dependency.
Thanks Themos Tsikas for reporting this issue.
Refs open-mpi/ompi#5085
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
keep track of the sizeof the blocklen_per_process and displs_per_process
on the aggregator datastructure to minimze the number of realloc function
calls required in the shuffle_init operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
The F08 bindings for MPI_STATUS_SET_CANCELLED incorrectly had the
"flag" dummy parameter set as INTENT(OUT) when it really should be
INTENT(IN).
On the one hand, this is technically an ABI change. On the other
hand, this is an incorrect MPI binding. On the other hand (that's 3
hands for you fans counting at home), this is such a rarely-used API
-- even in the C bindings -- that I'm guessing no one is using this
API, and therefore no one has noticed this error and it isn't worth
porting back to the release branches.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit attempts to update the romio io component to not use
functions removed in MPI-3.0 (2012). This is a first cut and will
probably need to be reviewed for correctness.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
romio assumes that all predefined datatypes are contiguous. Because of
the (terribly named) composed datatypes MPI_SHORT_INT, MPI_DOUBLE_INT,
MPI_LONG_INT, etc this is an incorrect assumption. The simplest way to
fix this is to override the MPI_Type_get_envelope and
MPI_Type_get_contents calls with calls that will work on these
datatypes. Note that not all calls to these MPI functions are
replaced, only the ones used when flattening a non-contiguous
datatype.
References #5009
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fixes issue #5069, which relates a BigMPI bug with the use of
MPI_Type_vectpor to construct very large datatypes (>2GB).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit fixes a segfault in mtl-portals4 finalize(). The segfault
occurs if finalize() is called without any calls to add_procs(). This
commit resolves the segfault by skipping the progress() loop in
finalize() if the Portals was not initialized.
Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
Per 0ab6b201fe, note in the MPI_Comm_spawn_multiple.3in man page that
the array_of_commands does not need to be terminated -- it just need
to have exactly "count" entries. In the Fortran binding, at least,
this is different than in prior released versions of Open MPI (it's
not a backwards incompatibility, since prior versions of Open MPI
required array_of_commands to be blank-string-terminated in Fortran --
this change makes Open MPI be *less* restrictive, and therefore still
backwards compatible).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
MPI defines the "argv" param to Fortran MPI_COMM_SPAWN as being
terminated by a blank string. While not precisely defined (except
through a non-binding example, Example 10.2, MPI-3.1 p382:6-29), one
can infer that the "array_of_argv" param to Fortran
MPI_COMM_SPAWN_MULTIPLE is also a set of argv, each of which are
terminated by a blank line.
The "array_of_commands" argument to Fortran MPI_COMM_SPAWN_MULTIPLE is
a little less well-defined. It is *assumed* to be of length "count"
(another parameter to MPI_COMM_SPAWN_MULTIPLE) -- and *not* be
terminated by a blank string. This is also given credence by the same
example 10.2 in MPI-3.1.
The previous code assumed that "array_of_commands" should also be
terminated by a blank line -- but per the above, this is incorrect.
Instead, we should just parse our "count" number of strings from
"array_of_commands" and *not* look for a blank line termination.
This commit separates these two cases:
* ompi_fortran_argv_blank_f2c(): parse a Fortran array of strings out
and stop when reaching a blank string.
* ompi_fortran_argv_count_f2c(): parse a Fortran array of strings out
and stop when "count" number of strings have been parsed.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
javah is no more available from Java 10, so try
javac -h first (available since Java 8) and fallback on javah
Refs. open-mpi/ompi#5000
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
the ompio module resets the amode from WRONLY to RDWR in order
to accoomodate data sieving in the two-phase fcoll componet. This
leads however to an error if MPI_MODE_SEQUENTIAL has been requested
by the user, since MODE_SEQUENTIAL is incompatible with MODE_RDWR.
SInce the change to the amode was done after opening the file for
individual file pointers but before opening the file for shared filepointers,
this lead to an error message in the sharedfp component.
Note, that data sieving is never necessary if MODE_SEQUENTIAL is set,
so this should not be a problem for any scenario.
Fixes#4991
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
instead of using a temporary buffer and copy data into the temp buffer before sending, use a derived datatype to describe the data that needs to be sent during a cycle in the collective I/O operation.
Signed-off-by: raafatfeki <fekiraafat@gmail.com>
Implements recursive doubling algorithm for MPI_Scan and MPI_Exscan.
The algorithm preserves order of operations so it can be used both
by commutative and non-commutative operations.
Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
This commit fixes a bug is osc/rdma that can occur if the total size
of the shared memory segment gets larger than 4 GiB. The bug was
caused by a typo. The type of my_base_offset should have been size_t
not int.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This will almost certainly never happen, but be defensive and
guarantee that we never return an uninitialized variable.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Ensure to initialized ret_code. This problem will likely never occur
in practice, but we might as well be defensive about it.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The various RMA functions need to have the asynchronous property on
all buffers. This property was missing and some buffers were
incorrectly marked as intent(in). This commit fixes the function
signatures.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
somehow the flag indicating to gather performance data
on collective io operations has changed to 1 accidentally.
Should be 0 ( false) by default.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
never got to move this sharedfp component into anything
usable. Can easily be restored if necessary.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
plfs components are at this point not utilized by anybody as far as I know.
Easy to bring back if we want to.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit is a large update to the osc/rdma component. Included in
this commit:
- Add support for using hardware atomics for fetch-and-op and single
count accumulate when using the accumulate lock. This will improve
the performance of these operations even when not setting the
single intrinsic info key.
- Rework how large accumulates are done. They now block on the get
operation to fix some bugs discovered by an IBM one-sided test. I
may roll back some of the changes if the underlying bug in the
original design is discovered. There appear to be no real
difference (on the hardware this was tested with) in performance so
its probably a non-issue. References #2530.
- Add support for an additional lock-all algorithm: on-demand. The
on-demand algorithm will attempt to acquire the peer lock when
starting an RMA operation. The lock algorithm default has not
changed. The algorithm can be selected by setting the
osc_rdma_locking_mode MCA variable. The valid values are two_level
and on_demand.
- Make use of the btl_flush function if available. This can improve
performance with some btls.
- When using btl_flush do not keep track of the number of put
operations. This reduces the number of atomic operations in the
critical path.
- Make the window buffers more friendly to multi-threaded
applications. This was done by dropping support for multiple
buffers per MPI window. I intend to re-add that support once the
underlying performance bug under the old buffering scheme is
fixed.
- Fix a bug in request completion in the accumulate, get, and put
paths. This also helps with #2530.
- General code cleanup and fixes.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
data sieving has to occur for any offset provided that is larger
or equal zero for this implementation to work correctly.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Allowing MPI_PROC_NULL as a neighbor in any topology allows us to add
gaps on the send and recv buffers. This does make the traditional
neighbor collective have a similar behavior as the V version, but in
same time it allows the users to skip the step where they prepare the
counts and the displacement array.
For more info please take a look at issue #4675.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
this commit fixes an issue observed with romio314 and the hdf5 1.10.x testsuite.
The ADIOI_Datatype_iscontig() routine in romio314/src/io_romio314_module.c
will now return for a datatype of size 0 that it is contiguous, even if the extent
of the datatype is non-zero. This avoids a segmentation fault observed in the
ADIOI_Flatten routine, and fixes this particular with the hdf5 1.10.x testsuite in
OpenMPI with romio314.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
As discussed on https://github.com/mpi-forum/mpi-issues/issues/77#issuecomment-369663119
the conversion to double in the MPI_Wtime decrease the range
and accuracy of the resulting timer. By setting the timer to
0 at the first usage we basically maintain the accuracy for
194 days even for gettimeofday.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit fixes#4795
- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of
sequence number from uint16_t.
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
For some of our configuration this flag increases per-process contribution
by ~20% while it is not being used currently.
The consumer of this flag was communicator ID calculation logic, but it was
changed in 0bf06de3f1.
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
-Updated ompi_mtl_ofi_progress to use an array to read CQ events up to a
threshold that can be set by the Open MPI User.
-Users can adjust the number of events that can be handled in the
ompi_mtl_ofi_progress by setting "--mca mtl_ofi_progress_event_cnt #".
-The default value for the the number of CQ events that can be read in a
single call to ofi progress is 100 which is an average
based off workload usecase anaylsis showing 70-128 as the range of
multiple events returned during ofi progress.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
This commit fixes a flaw in the progress function for libnbc. The
function was unconditionally taking a lock even if there are no
requests to process. This lock was showing up in vtune traces of
multi-threaded benchmarks.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
after performing the final OBJ_RELEASE on the request,
reset the user level variable to MPI_REQUEST_NULL.
Otherwise the c_2_f translation step in the fortran
interface fails.
Fixes issue #4807
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit fixes a flaw in the eager limit check in pml/ob1. The
check was incorrectly checking if RDMA-only BTLs (BTLs without the
send flag) has a valid eager limit. This commit fixes the check by
adding an additional check for the send flag on the BTL module.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The osc monitoring component needed to include other OSC components
header in order to be able tu access communicator through the
component specific ompi_osc_*_module_t structures. This commit remove
the dependency, and resolve the issue #4523.
Extend the common monitoring API.
* Now it's possible to translate from local rank to world rank from
both the communicator and the group.
* Remove useless hashtable as we directly use the w_group contained
in window structure.
Add automatic generation at config time.
The templates are expanded at configure time. It creates a new header
file that generates all the variables/functions needed. Adding this
during the autogen automagicaly generates for each of the available
modules the proper functions.
Only keep a generated argv-style array.
Following Jeff's advice, the configure.m4 file generate a simple array
of module variables to be iterated over to find the proper module.
Signed-off-by: Clement Foyer <clement.foyer@inria.fr>
If component selection fails, then module->bases might be unallocated
when ompi_osc_sm_free() in invoked, so test it before trying to free()
module->bases[0].
Thanks Martin Binder for the report.
Refs open-mpi/ompi#4770
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
an erroneous return statement has creeped in commit 1885d99
which leads to some processes not resetting stripe_size
and stripe_count correctly. This can lead in 3.0.x to different
fcoll modules being selected. The impact is not that dramatic on
master and 3.1.x, but could lead to problems as well.
Fixes#4745
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Per MPI 3.1 chapter 13.3 :
"Derived etypes can be constructed by using any of the MPI
datatype constructor routines, provided all resulting typemap
displacements are non-negative and monotonically nondecreasing."
Same restriction applies to ftypes.
add the OMPI_DATATYPE_CHECK_FOR_VIEW() macro that is
check the underlying opal_datatype_t is monotonic, on top
of all checks performed in OMPI_DATATYPE_CHECK_FOR_RECV().
Since checking monotoniciy is expensive, check is only performed
when needed, but the result is cached by ompi_datatype_is_monotonic().
Thanks Wei-keng Liao for the valuable feedback.
Thanks George for the guidance.
Refs. open-mpi/ompi#4682
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
set grp_local_rank as MPI_UNDEFINED before invoking
ompi_comm_nexcid() in order to benefit from the optimizations
introduced in open-mpi/ompi@68167ec879
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This change makes comparison of `mpi-f08-interfaces.F90` and
`pmpi-f08-interfaces.F90` easier.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
They were incorrectly changed to subroutines in only `pmpi`
in 258d1aa160.
Strictly speaking, this change involves binary incompatibility.
But nobody used these subroutines and nobody will be affected because
these subroutines were useless (didn't return a calculated value).
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This fixes a regression in sockets provider which could return -EINTR value
from fi_cq_read() due to a syscall being interrupted. The error value is
currently interpreted as fatal condition. Relax the rule so that we can retry
fi_cq_read() operation.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
In both cases we were comparing with the wrong size, it should be either
the number of local processes or the number of nodes, and not the size
of the communicator.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
OMPI_FORTRAN_USEMPIF08_MOD macro was removed in open-mpi/ompi@791bcee6c0
so this macro is now manually expanded to mpi/fortran/use-mpi-f08/mod
Thanks to Nathan T. Weeks for reporting
Refs open-mpi/ompi#3605
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
ompio has the unique problem, that mca parameters set in the io/ompio component
have to be accessible from other frameworks as well. This is mostly done to avoid
a replication in the parameter names and to reduce the number of mca parameters that
and end-user has to worry about.
This commit introduces a generic function to retrieve ompio mca parameters, the function pointer
is stored on the file handle. It replaces two functions that used the same concept already for
one parameter each.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This commit adds support for fetch-and-op atomics. This is needed
because and and or are irreversible operations so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is not atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
It should have always been #define'd in order to correctly handle the
multi-threaded case.
Also fix indentation in ompi/mpi/c/comm_get_errhandler.c
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit fixes the following bugs:
- Allow a btl to be used for communication if it can communicate with
all non-self peers and it supports global atomic visibility. In
this case CPU atomics can be used for self and the btl for any
other peer.
- It was possible to get into a state where different threads of an
MPI process could issue conflicting accumulate operations to a
remote peer. To eliminate this race we now update the peer flags
atomically.
- Queue up and re-issue put operations that failed during a BTL
callback. This can occur during an accumulate operation. This was
an unhandled error case.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Cleanup several places where abstraction violations crept into OMPI layer (direct reference of ORTE). Add some missing includes that were exposed by this change.
Note that this compiles, but I haven't tested it for execution yet. Handing it over to Noah Evans for completion
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The current versions of these functions have a fatal flaw. If a
errhandler set and free call is made by another thread while the
thread calling get is between the cmpset and retain then we will
retain an invalid object. Fixing this by just using locking. This is
not a critical path so this should be ok.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Turns out there are edge cases where an MTL's isend
method may end up marking a send request complete prior
to returning to the CM code. The would end up causing
problems in the bsend path since the ompi_request_complete
would end up getting invoked a second time on this request.
This ended up causing segfaults, etc. in ompi_request_complete .
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
At least with some providers (sockets and GNI), the mprobe/mrecv
ofi mtl methods were incorrect. For these two providers at least
one must supply the original tag and mask bits used with the
prior FI_PEEK | FI_CLAIM request that had been used to probe for
the message.
These providers take a strict interpretation of the following sentence
from the libfabric fi_tagged man page:
```
Claimed messages can only be retrieved using a subsequent, paired receive operation with the FI_CLAIM flag set.
```
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
ompio has historically changed the WRONLY flag provided by the applicaiton
to RDWR to allow for the data sieving optimization within the two-phase I/O
fcoll component. This change did not have a performance impact
on regular UNIX file systems, but seems to hurt performance on NFS (and maybe Lustre?)
So provide an option that allows to keep the WRONLY option, and raise an error
if tha fcoll/two-phase would actually like to use the data sieving.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
gcc 5.2 complains:
```
mtl_ofi_component.c: In function ‘ompi_mtl_ofi_finalize’:
mtl_ofi_component.c:613:5: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
if (ret = fi_close((fid_t)ompi_mtl_ofi.fabric)) {
^
```
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Before this commit, the presence of usNIC devices -- which will
(currently) return no data when fi_getinfo() is queried for tagged
matching providers -- would cause an error message to be displayed.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The value of ret is negative (e.g., -61), but it is displayed in the
help message as `%zd`, which renders as unsigned (i.e., a giant
positive value). So make sure to negate the negative value before
rendering it (e.g., so we display "61", not "4294967235").
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
higher priority than rdma and default to psm2.
Context: the Intel Omni-path driver (hfi1) has verbs support, so the openib
btl is available to use. However, at a bad performance. Without this
change osc rdma using btl openib is the default choice when running on Intel
Omni-path, with a lower performance than osc pt2pt over mtl psm2.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
Rework the logic to handle the out-of-sequence fragments on the receiver
side. A large number of OOS messages are still arriving even in single
threaded scenarios.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
set proper error codes in mca_fs_ufs_file_open by mapping the errno value to
the MPI error code.
Refs. open-mpi/ompi#4443
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This is a bug fix based on a problem reported on the mailing list.
For very large read/write operations, ompio breaks the operation
down into multiple cycles. The problem was that
one of the variables required to maintain its values
across the different cycles did not do that, and because
of that the calculations of the memory offsets was wrong.
Fixes issue #4453
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the fs/lustre component has missed out on a number of updates to the fs/ufs component.
This commit tries to import all the changes performed on the fs/ufs component
w.r.t to the file_open operation, including updates on how the amode is set,
error is propegated and setting the fs_block_size value (which is required for
locking purposes).
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
set proper error codes in mca_fs_ufs_file_open by mapping the errno value to
the MPI error code. Fixes an issue reported on the mailing by Wei-keng Liao
Fixes Issue #4443
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
If not the pvars will remain valid after the OB1 PML is unloaded, and
any access will segfault (the callbacks associated with the pvar will
point to the memory of the dlclosed module).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit renames the atomic compare-and-swap functions to indicate
the return value. This is in preperation for adding support for a
compare-and-swap that returns the old value. At the same time the
return type has been changed to bool.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The problem is that the waiting thread is cycling using OMPI_LAZY_WAIT_FOR_COMPLETION so it can exercise opal_progress. This probably isn't as critical for the modex step, but definitely necessary for the barrier at the end of mpi_init. The problem this creates is that the lazy macro exits as soon as "active" becomes false, and then we destruct the lock.
However, wakeup_thread sets "active" to false - and then calls the condition broadcast to wakeup any waiting threads. So there is a race condition between that broadcast and the lock destruct.
Add OPAL_ACQUIRE_OBJECT and OPAL_POST_OBJECT memory barriers to help protect against thread race conditions on some platforms
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Currently, the progress function is incorrectly interpreting any error
value other than a positive value or -FI_EAVAIL to mean CQ is empty.
CQ is empty only if fi_cq_read() call returned -EAGAIN error
code. Fix that here.
While at it, fix help text output for calls made to OFI API.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
In multithreaded case, it is expensive to release the lock, call the slow match
and retake the lock again just to queue the frag. This patch will eliminate number of
lock taken by queueing the frag right away and return.
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
The messages should be printed only in the event of CUDA builds and in the
presence of supporting hardware and when PSM2 MTL has actually been selected
for use. To this end, move help text output to component init phase.
Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment
variable and sanitize the help text message.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
This commit looks large, but its really mostly a cleanup step.
1. introduce proper error handling for the return values of fcntl and the fbtl_posix_lock function
2. rename a parameter to more accurately reflect what it does
3. introduce an mca parameter in the fs/ufs component that allows to control
what the level of locking the user would like to enforce
4. move the initialization of the fs_block_size parameter from fs/ufs into the
common/ompio component. An fs component might be allowed to overwrite this
value, but none of the actual fs components do that.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
in case a named semaphore is used, it is necessary to close the semaphore to remove
all sm segments. sem_unlink just removes the name references once all proceeses have closed
the sem.
Fixes issue: #4336
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
sharedfp/sm: unlink only needs to be called by one process
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
If hcoll fails to create mpi derived type let's set zero_dte on this dtype.
This will save cycles on subsequent collective calls with the same derived
type since we will not try to create hcoll type again.
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
The ompi_datatype_get_single_predefined_type_from_args() recurses down
into a constructed type to identify what base datatype it's built from
if it's built from a single type. But if the type has MPI_LB/MPI_UB,
for example
lens[0] = 1;
lens[1] = 1;
disps[0] = 0;
disps[1] = 0;
types[0] = MPI_LB;
types[1] = MPI_INT;
MPI_Type_create_struct(2, lens, disps, types, &mydt);
then this function will see the base type MPI_LB as differing from MPI_INT
and will identify mydt as not being constructed from a single base type, so
the type will be rejected for calls like MPI_Accumulate.
I think those "meta data" types shouldn't result in rejection like that, and
the above mydt should still be identified as having a single base type
of MPI_INT.
Addition: boslica wanted another change discussed here
https://github.com/open-mpi/ompi/pull/3609
relating to the calculation for "count" after identifying the
predefined_type that was being used.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
romio314 is a just a component that does not require Fortran bindings,
so simply disable Fortran support to prevent warnings about deprecated flags
Fixesopen-mpi/ompi#4281
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
If Open MPI is configured with CUDA, then user also should be using a CUDA build of
PSM2 and therefore be setting PSM2_CUDA environment variable to 1 while using
CUDA buffers for transfers. If we detect this setting to be missing, force set
it. If user wants to use this build for regular (Host buffer) transfers, we
allow the option of setting PSM2_CUDA=0, but print a warning
message to user that it is not a recommended usage scenario.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
Replaced matching array with k and bcast with scatter.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Guillaume Mercier <mercier@labri.fr>
As the reordering is an optional step, if any operation during the
reorder fails we can return the duplicata of the original communicator
associated with the topology information.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This allows mtl_ofi_provider_include to work with layered providers as well.
e.g. --mca mtl_ofi_provider_include "providerX;ofi_rxm"
Signed-off-by: yohann <yohann.burette@intel.com>
the new grouping option simple+ performs all calculations used
for the aggregator selection as if the default file view would be used,
thus avoiding communication in file_set_view all together. This mode
is useful for applications that do not set a file view, but use
explicit offset operations on the default file view.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
Still in the "needs to be done" category:
* mapping/ranking/binding options aren't correctly supported
* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Remove two of the three instances of components requiring
64 bit atomics, even on 32 bit systems. The SM OSC component
also uses 64 bit atomics, but is a more complicated fix that
will follow this one. Currently, no one is testing on
platforms that don't provide 64 bit atomics (even in 32 bit
mode), but with the removal of the non-inline assembly for
IA32, the older compilers on Absoft's test systems now
result in no practical way to call cmpxchg8 in 32 bit mode.
At that point, these failures started popping up.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit fixes a compile issue on 32-bit systems that do not
support 64-bit atomic math. The active target path was using 64-bit
atomics exclusively to support PSCW. This commit updates the code to
use either 32 or 64-bit atomic math depending on what is available.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The monitoring code causes MPI_T based tools to segfault when
monitoring is disabled. This happens because the performance
variables remain registered after the common/monitoring
component is dlclosed due to a missing variable registration
flag. This commit adds the necessary flag to all the registered
performance variables.
The issue on github is #4162. Close when applied to master.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
MPI_IN_PLACE is not a valid send buffer for neighborhood collectives, so do not
invoke memchecker in this case.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
PSM2 enables support for GPU buffers and CUDA managed memory and it can
directly recognize GPU buffers, handle copies between HFIs and GPUs.
Therefore, it is not required for OMPI to handle GPU buffers for pt2pt cases.
In this patch, we allow the PSM2 MTL to specify when
it does not require CUDA convertor support. This allows us to skip CUDA
convertor init phases and lets PSM2 handle the memory transfers.
This translates to improvements in latency.
The patch enables blocking collectives and workloads with GPU contiguous,
GPU non-contiguous memory.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
they are supposed to be unsigned, casting them to a signed
value for all atomic operations is as errorprone as handling
them as signed entities.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Resolves#3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la"
```
Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Per MPI-3.1, ensure to raise an MPI exception with value
MPI_ERR_INFO_NOKEY if we try to MPI_INFO_DELETE a key that does not
exist. Thanks to @dalcinl (Lisando Dalcin) for raising the issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
- change the increment used to test various no. of aggregators
to avoid using only power of two numbers
- convert some paratemers in the cost function from integers to
to floats for providing smoother and more consistent results
- set the FVIEW_IS_SET flag on the file *only* if the user
has set anything else than the default file view.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
adjust how the aggregator nodes are selected depending on whether processes
have been mapped by node or anything else.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
Check for the binding policy used. We are only interested in
whether mapby-node has been set right now (could be extended later)
and only on MPI_COMM_WORLD, since for all other sub-communicators
it is virtually impossible to identify their layout across nodes
in the most generic sense. This is used by OMPIO for deciding which
ranks to use for aggregators
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
add a new aggregator selection algorithm based on the performance
model described in:
Shweta Jha, Edgar Gabriel,
'Performance Models for Communication in Collective I/O Operations'
Proceedings of the 17th IEEE/ACM Symposium
on Cluster, Cloud and Grid Computing, Workshop on Theoretical
Approaches to Performance Evaluation, Modeling and Simulation, 2017.
Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
This module was always intended to be a proof of concept, and was far
from complete. If/when someone implemented F08 descriptor support for
the mpi_f08 module, this commit can either be restored or used as
reference material.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
in order to solve an egg and the chicken problem, in which mpiext need mpi-f08-types.mod
and/but use-mpi-f08[-desc] needs mpiext, add an extra step
- build fortran 2008 modules only
- build fortran 2008 mpi extensions
- and then build fortran 2008 bindings
Fixesopen-mpi/ompi#3605
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This fix is related to issue #1877, and prevents the OMPI library from
messing the user level random values.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
simply based on some local state. This is the second
part of the patch proposed for open-mpi/ompi#1183.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
The `ompi_comm_set` function never sets `NULL` to its first argument
`ncomm`. So `NULL` check is unnecessary in its callers. Furthermore,
`NULL` check may obscure a real return code when an error occurs
if the variable is initialized to a `NULL` value.
Also, `NULL` check is added in the `ompi_comm_set` function to
avoid segmentation fault in an out-of-memory condition.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>