ompi_communicator_t, ompi_win_t, and ompi_file_t now all have a super class of type opal_infosubscriber_t instead of a base/super type of opal_object_t (in the previous code, comm used c_base but file used super). It may be a bit bold to say that being a subscriber of MPI_Info is the foundational piece that ties these three things together, but if you object, I would prefer to turn infosubscriber into a more general name that encompasses other common features rather than create a different super class. The key is that we want to be able to pass comm, win, and file objects as if they were opal_infosubscriber_t, so that one routine can handle all three types of objects.
MPI_INFO_NULL is still an ompi_predefined_info_t type, since an MPI_Info is part of ompi while the internal details of the underlying information concept are part of opal.
An ompi_info_t type still exists for exposure to the user, but it is simply a wrapper for the opal object.
Routines such as ompi_info_dup have all been moved to the opal directory as opal_info_dup and related functions.
Fortran to C translation tables are only used for MPI_Info that is exposed to the application and are therefore part of ompi_info_t, not opal_info_t.
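As a rough sketch of this layering (the field names here are illustrative, not the exact OMPI definitions):
```
/* opal layer: the generic key/value store, usable without MPI */
typedef struct {
    void *entries;       /* placeholder for the key/value list */
} opal_info_t;

/* ompi layer: thin wrapper adding the MPI-visible pieces */
typedef struct {
    opal_info_t super;   /* the wrapped opal object */
    int i_f_to_c_index;  /* Fortran-to-C translation table index: ompi only */
} ompi_info_t;
```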
The data structure changes are primarily in the following files:
communicator/communicator.h
ompi/info/info.h
ompi/win/win.h
ompi/file/file.h
The following new files were created:
opal/util/info.h
opal/util/info.c
opal/util/info_subscriber.h
opal/util/info_subscriber.c
The infosubscriber concept is that communicators, files, and windows can have subscribers that subscribe to any changes in the info associated with the comm/file/window. When xxx_set_info is called, the new info is presented to each subscriber, which can modify the info in any way it wants. The new value is presented to the next subscriber, and so on, until all subscribers have had a chance to modify the value. The order of subscribers can therefore make a difference, but we expect that generally only one subscriber cares about or modifies any given key/value pair. The final info is then stored and returned by a call to xxx_get_info.
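A minimal sketch of the callback chain, using invented names rather than the actual opal/util/info_subscriber.h API:
```
#include <stdio.h>

/* Assumed callback shape: a subscriber sees the proposed value for a
 * key and returns the (possibly modified) value to apply. */
typedef const char *(*key_interest_cb_t)(const char *key, const char *value);

/* Example subscriber: an OSC component that cannot honor "no_locks". */
const char *osc_no_locks_cb(const char *key, const char *value)
{
    (void)key;
    (void)value;
    /* pretend this component requires locks, so it overrides the request */
    return "false";
}

int main(void)
{
    key_interest_cb_t subscribers[] = { osc_no_locks_cb };
    const char *value = "true";  /* value proposed via MPI_Win_set_info */

    /* each subscriber in turn may rewrite the value; the final result
     * is what a later MPI_Win_get_info would return */
    for (size_t i = 0; i < sizeof(subscribers) / sizeof(*subscribers); i++) {
        value = subscribers[i]("no_locks", value);
    }
    printf("stored value: %s\n", value);
    return 0;
}
```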
The new model can be seen in the following files:
ompi/mpi/c/comm_get_info.c
ompi/mpi/c/comm_set_info.c
ompi/mpi/c/file_get_info.c
ompi/mpi/c/file_set_info.c
ompi/mpi/c/win_get_info.c
ompi/mpi/c/win_set_info.c
The current subscribers were changed as follows:
mca/io/ompio/io_ompio_file_open.c
mca/io/ompio/io_ompio_module.c
mca/osc/rdma/osc_rdma_component.c (This one actually subscribes to "no_locks")
mca/osc/sm/osc_sm_component.c (This one actually subscribes to "blocking_fence" and "alloc_shared_contig")
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Conflicts:
AUTHORS
ompi/communicator/comm.c
ompi/debuggers/ompi_mpihandles_dll.c
ompi/file/file.c
ompi/file/file.h
ompi/info/info.c
ompi/mca/io/ompio/io_ompio.h
ompi/mca/io/ompio/io_ompio_file_open.c
ompi/mca/io/ompio/io_ompio_file_set_view.c
ompi/mca/osc/pt2pt/osc_pt2pt.h
ompi/mca/sharedfp/addproc/sharedfp_addproc.h
ompi/mca/sharedfp/addproc/sharedfp_addproc_file_open.c
ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
ompi/mpi/c/lookup_name.c
ompi/mpi/c/publish_name.c
ompi/mpi/c/unpublish_name.c
opal/mca/mpool/base/mpool_base_alloc.c
opal/util/Makefile.am
OMPI send and receive messages use size_t for the length, while the PSM and PSM2
psm(2)_mq_send/receive routines use uint32_t. Type size_t is 64 bits on 64-bit
architectures. Therefore, this patch adds a sanity check on the length of the
message and fails gracefully.
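The guard amounts to something like the following sketch (the helper name and error code are illustrative):
```
#include <stdint.h>
#include <stddef.h>

#define OMPI_ERR_VALUE_OUT_OF_BOUNDS (-1)  /* illustrative error code */

/* Reject lengths that cannot be represented in the uint32_t count
 * expected by psm(2)_mq_send/receive, instead of silently truncating. */
int check_psm_msg_len(size_t length)
{
    if (length > UINT32_MAX) {
        return OMPI_ERR_VALUE_OUT_OF_BOUNDS;  /* fail gracefully */
    }
    return 0;
}
```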
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
* Don't overflow the internal datatype count.
Change the type of the count to be a size_t (it does not alter the total
size of the internal structures, so has no impact on the ABI).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Optimize the datatype creation.
The internal array of counts of predefined types is now only created
when needed, which is either in a heterogeneous environment or when
one calls get_elements. This saves space and makes convertor creation
a little faster in some cases.
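A minimal sketch of the lazy-creation idea (the names and the predefined-type count are illustrative, not the actual opal datatype code):
```
#include <stdlib.h>

#define N_PREDEFINED 25  /* illustrative number of predefined types */

typedef struct {
    size_t *ptypes;  /* counts of predefined types; NULL until needed */
} dt_desc_t;

/* Create the array only on first use (heterogeneous environment or a
 * get_elements call), so the homogeneous fast path never pays for it. */
size_t *dt_ptypes(dt_desc_t *dt)
{
    if (NULL == dt->ptypes) {
        dt->ptypes = calloc(N_PREDEFINED, sizeof(size_t));
    }
    return dt->ptypes;
}
```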
Rearrange the fields in the datatype description structs.
The macro OPAL_DATATYPE_INIT_PTYPES_ARRAY had a bug, and the
static array was only partially created. All predefined types should
have the ptypes array created and initialized.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Fix the boundary computation.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* test/datatype: add test for short unpack on heterogeneous cluster
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Trying to reduce the cost of creating a convertor.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Respect the unpack boundaries.
As Gilles suggested on #2535, the opal_unpack_general_function was
unpacking based on the requested count and not on the amount of packed
data provided.
Fixes #2535.
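A simplified sketch of the bounded unpack (not the real opal_unpack_general_function):
```
#include <stddef.h>
#include <string.h>

/* Unpack at most as many elements as the packed buffer actually
 * provides, even if the caller requested more. */
size_t unpack_bounded(void *dst, const void *packed, size_t packed_bytes,
                      size_t elem_size, size_t requested_count)
{
    size_t avail = packed_bytes / elem_size;  /* complete elements supplied */
    size_t count = requested_count < avail ? requested_count : avail;
    memcpy(dst, packed, count * elem_size);
    return count;                             /* elements actually unpacked */
}
```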
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
When we updated UFS and the others, we left NFS alone. The HDF Group would
like a fix, so here we go.
Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>
(back-ported from upstream commit pmodels/mpich@684df9f4c9)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
`ompi_group_t::grp_proc_pointers[i]` may have sentinel values even
for processes which reside in the local node, because the array for
`MPI_COMM_WORLD` is set up before `ompi_proc_complete_init` (which
allocates `ompi_proc_t` objects for processes that reside in the
local node) is called in `MPI_INIT`. So using `ompi_proc_is_sentinel`
against `ompi_group_t::grp_proc_pointers[i]` to determine whether a
process resides in a remote node is not appropriate.
This bug sometimes causes an `MPI_ERR_RMA_SHARED` error when
`MPI_WIN_ALLOCATE_SHARED` is called, because the sm OSC component
uses `ompi_group_have_remote_peers`.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
We check for liblustreapi.h in OMPI_CHECK_LUSTRE, so this code was
commented out here. Might as well fully delete it, since it's
redundant and dead.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fortran constants `MPI_ARGV_NULL` and `MPI_ARGVS_NULL` are defined
in MPI-3.1 p.680 as below.
> `MPI_ARGVS_NULL`
> 2-dim. array of `CHARACTER*(*)`
> `MPI_ARGV_NULL`
> array of `CHARACTER*(*)`
`MPI_ARGV_NULL` and `MPI_ARGVS_NULL` are used as arguments of
`MPI_COMM_SPAWN` and `MPI_COMM_SPAWN_MULTIPLE` respectively, and the
corresponding arguments `argv` and `array_of_argv` are defined as below
for the `USE mpi_f08` binding in MPI-3.1.
```
CHARACTER(LEN=*), INTENT(IN) :: argv(*)
CHARACTER(LEN=*), INTENT(IN) :: array_of_argv(count, *)
```
Defining them as `INTEGER` in the `mpi_f08` module causes a compilation
error in user programs, such as
"There is no specific subroutine for the generic 'mpi_comm_spawn'".
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
MPI_AINT_ADD and MPI_AINT_DIFF are functions and must be declared as
externals with the proper return type. This is already done properly
in the mpi and mpi_f08 modules; the declarations were only missing
from mpif.h (i.e., mpif-externals.h).
Thanks to Aboorva Devarajan (@AboorvaDevarajan) for the bug report.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The direct modex operation is slow, especially at scale for even modestly-connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kick off the modex operation in the background, and then trap any modex_recv's until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options, as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called.
Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always outperform the time spent doing a modex. However, this provides a decent compromise in the interim.
This PR changes the default settings of a few relevant params to make "background modex" the default behavior:
* pmix_base_async_modex -> defaults to true
* pmix_base_collect_data -> continues to default to true (no change)
* async_mpi_init -> defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything.
The logic in MPI_Init is as follows (see the sketch after this list):
* if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation
* if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed
* if async_modex is not set, then we block until the fence completes (regardless of collecting data or not)
* if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation.
* if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case.
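Condensed into code, that decision tree looks roughly like this sketch (the helper names are invented; the real logic lives in MPI_Init):
```
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the real PMIx fence / modex operations. */
void background_modex(void) { puts("background modex; recvs trap until done"); }
void blocking_fence(bool collect) { printf("blocking fence, collect=%d\n", collect); }

void select_modex(bool async_modex, bool collect_data, bool have_nb_fence)
{
    if (!have_nb_fence) {
        blocking_fence(true);          /* no PMIx: always the full blocking modex */
    } else if (!async_modex) {
        blocking_fence(collect_data);  /* block until the fence completes */
    } else if (collect_data) {
        background_modex();            /* kick off the modex in the background */
    }
    /* else: async modex without data collection -> skip the modex entirely */
}

int main(void)
{
    select_modex(true, true, true);    /* the new default settings */
    return 0;
}
```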
HTH
Ralph
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Force only procs that are participating in the new Comm to decide what
CID is appropriate. This has two advantages:
* Speed up Comm creation for small communicators: non-participating procs
will not interfere.
* Reduce CID fragmentation: non-overlapping groups will be allowed to use
the same CID.
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
This PR renames the common library for OFI libfabric from
libfabric to ofi. There are a number of reasons this
is good to do:
1) it is shorter, replacing nine characters with three in function
names, for what may eventually be a fairly extensive interface
2) OFI is the term used for MTL and RML components that use
the OFI libfabric interface
3) A planned OSC component will also use the OFI term.
4) Other HPC libraries that can use OFI libfabric tend to use
the term "ofi" internally and also in their configure options
relevant to OFI libfabric (e.g., MPICH/CH4, Intel MPI, Sandia SHMEM)
There seem to be comments in places in the Open MPI source
code that indicate that this common library will be going away.
Far from it as we will want to be able to share things like
AV objects between OMPI and possibly OSHMEM components that
use the OFI libfabric interface.
This PR also adds synonyms for the --with-libfabric(-libdir)
configury options: --with-ofi and --with-ofi-libdir.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Since Open MPI now requires a C99 compiler, and the ptrdiff_t type is
part of C99, there is no longer a need for the abstract OPAL_PTRDIFF_TYPE type.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
* Complete rewrite of opal_pointer_array
Instead of a cache-oblivious linear search, use a bit array
to speed up the management of the free space. As a result we
slightly increase the memory used by the structure, but we get a
significant boost in performance.
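The gist of the bit-array approach, as a self-contained sketch (this is not the actual opal_pointer_array code, and it assumes the GCC/Clang __builtin_ctzll builtin):
```
#include <stdint.h>
#include <stddef.h>

#define NUM_WORDS 4  /* 4 * 64 = 256 slots, illustrative */

/* One bit per slot; a set bit means the slot is free. */
uint64_t free_map[NUM_WORDS] = {
    UINT64_MAX, UINT64_MAX, UINT64_MAX, UINT64_MAX
};

/* Find and claim a free slot by scanning 64 slots per comparison,
 * instead of walking the pointer array element by element. */
int claim_free_slot(void)
{
    for (size_t w = 0; w < NUM_WORDS; w++) {
        if (free_map[w] != 0) {
            int bit = __builtin_ctzll(free_map[w]);  /* lowest set bit */
            free_map[w] &= ~(UINT64_C(1) << bit);    /* mark it used */
            return (int)(w * 64 + bit);
        }
    }
    return -1;  /* array is full */
}
```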
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Do not register datatypes in the f2c translation table.
The registration is now pushed up into the Fortran layer, by
forcing a call to MPI_Type_c2f.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Array sizes of `array_of_gsizes`, `array_of_distribs`, `array_of_dargs`,
and `array_of_psizes` parameters of the `ompi_datatype_create_darray`
function (and `MPI_TYPE_CREATE_DARRAY`) are all `ndims`.
`ndims` is `i[2]`, not `i[0]`. See MPI-3.1 p.122.
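For reference, decoding the darray combiner's integer array per that layout looks like this (the variable names are ours):
```
/* Integer-argument layout for MPI_TYPE_CREATE_DARRAY (MPI-3.1 p.122):
 *   i[0] = size, i[1] = rank, i[2] = ndims,
 *   followed by four ndims-long arrays, then order. */
void decode_darray_args(const int *i)
{
    int ndims = i[2];                        /* not i[0]! */
    const int *gsizes   = &i[3];
    const int *distribs = &i[3 + ndims];
    const int *dargs    = &i[3 + 2 * ndims];
    const int *psizes   = &i[3 + 3 * ndims];
    int order = i[3 + 4 * ndims];
    (void)gsizes; (void)distribs; (void)dargs; (void)psizes; (void)order;
}
```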
Because this function `__ompi_datatype_create_from_args` is used by
the pt2pt OSC component, using a datatype created by
`MPI_TYPE_CREATE_DARRAY` for `MPI_(R)(GET_)ACCUMULATE` caused a
segmentation fault or similar failure on the target process.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>