Set the default send and receive socket buffer size to 0,
which means Open MPI will not try to set a buffer size during
startup.
The default behavior since near day one of the TCP BTL has been
to set the send and receive socket buffer sizes to 128 KiB, a
value that works well on 1 GbE but not so well on 10 GbE fabrics
of any real size. Modern TCP stacks, particularly on Linux, have
gotten much smarter about buffer sizing and become much less
efficient if an explicit buffer size is set (even a large one),
since setting one disables the kernel's autotuning.
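For illustration, a minimal sketch of the behavior described above
(not the actual btl_tcp source; the variable names are placeholders,
and it assumes <sys/socket.h> and an open socket descriptor sd):

    /* Only override the socket buffer sizes when a non-zero value was
     * requested; a value of 0 leaves the kernel's autotuning alone. */
    if (tcp_sndbuf > 0) {
        int sz = tcp_sndbuf;
        setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));
    }
    if (tcp_rcvbuf > 0) {
        int sz = tcp_rcvbuf;
        setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));
    }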
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
`ompi_group_t::grp_proc_pointers[i]` may hold sentinel values even
for processes that reside on the local node, because the array for
`MPI_COMM_WORLD` is set up before `ompi_proc_complete_init` (which
allocates `ompi_proc_t` objects for processes on the local node) is
called during `MPI_INIT`. So using `ompi_proc_is_sentinel` on
`ompi_group_t::grp_proc_pointers[i]` to determine whether a process
resides on a remote node is not appropriate.
This bug sometimes causes an `MPI_ERR_RMA_SHARED` error when
`MPI_WIN_ALLOCATE_SHARED` is called, because the sm OSC component
uses `ompi_group_have_remote_peers`.
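To make the failure mode concrete, here is a simplified sketch (the
function name is hypothetical, not the actual Open MPI code) of the
check that goes wrong; before `ompi_proc_complete_init` has run, the
loop below can report a remote peer for a group that is entirely
local:

    /* BROKEN: a sentinel entry only means the ompi_proc_t has not been
     * allocated yet, not that the peer lives on a remote node. */
    static bool group_seems_to_have_remote_peers(ompi_group_t *group)
    {
        for (int i = 0; i < group->grp_proc_count; ++i) {
            if (ompi_proc_is_sentinel(group->grp_proc_pointers[i])) {
                return true;   /* may just be an uninitialized local peer */
            }
        }
        return false;
    }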
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
MPI_AINT_ADD and MPI_AINT_DIFF are functions and must be declared as
externals with the proper return type. This is already done properly
in the mpi and mpi_f08 modules; the declarations were missing only
from mpif.h (i.e., mpif-externals.h).
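For reference, the corresponding C prototypes, which make the
required return type explicit (the mpif.h declarations express the
same thing in Fortran form):

    MPI_Aint MPI_Aint_add(MPI_Aint base, MPI_Aint disp);
    MPI_Aint MPI_Aint_diff(MPI_Aint addr1, MPI_Aint addr2);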
Thanks to Aboorva Devarajan (@AboorvaDevarajan) for the bug report.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The direct modex operation is slow, especially at scale for even modestly connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kick off the modex operation in the background, and then trap any modex_recv's until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options, as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called.
Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always outperform the time spent doing a modex. However, this provides a decent compromise in the interim.
This PR changes the default settings of a few relevant params to make "background modex" the default behavior:
* pmix_base_async_modex -> defaults to true
* pmix_base_collect_data -> continues to default to true (no change)
* async_mpi_init -> defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything.
The logic in MPI_Init is (see the sketch after this list):
* if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation
* if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed
* if async_modex is not set, then we block until the fence completes (regardless of collecting data or not)
* if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation.
* if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case.
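A rough sketch of that decision tree in code; every name below is an illustrative placeholder, not an actual ompi_mpi_init() symbol:

    if (!have_nonblocking_fence) {
        /* e.g., not using PMIx: always do the full blocking modex */
        full_blocking_modex();
    } else if (!async_modex) {
        /* block until the fence completes, with or without data collection */
        blocking_fence(collect_data);
    } else if (collect_data) {
        /* kick off the background modex; modex_recv traps until it completes */
        start_background_modex();
    } else {
        /* async modex without data collection: skip the fence entirely */
    }

    /* Later, if the user asked for a barrier at the end of MPI_Init: */
    if (background_modex_started && user_requested_barrier) {
        if (modex_is_complete()) {
            barrier();            /* modex already done: run the barrier */
        } else {
            wait_for_modex();     /* block on the modex and skip the barrier */
        }
    }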
HTH
Ralph
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Force only the procs that are participating in the new comm to decide
what CID is appropriate (a conceptual sketch follows the list). This
has two advantages:
* Speed up comm creation for small communicators: non-participating
procs will not interfere
* Reduce CID fragmentation: non-overlapping groups will be allowed to
use the same CID.
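A conceptual sketch of the agreement loop, with hypothetical helper
names (not the actual ompi_comm_nextcid() code); the point is that
every collective step runs only over the new communicator's group, so
disjoint groups can settle on the same CID independently:

    static int agree_on_cid(void)
    {
        int proposed = lowest_free_cid(0);       /* lowest CID free locally */
        for (;;) {
            int agreed, ok, all_ok;
            allreduce_max(&proposed, &agreed);   /* highest proposal among participants */
            ok = cid_is_free_locally(agreed);
            allreduce_min(&ok, &all_ok);         /* free at every participant? */
            if (all_ok) {
                return agreed;
            }
            proposed = lowest_free_cid(agreed + 1);
        }
    }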
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
The usNIC BTL does not use more than 1 iov, so be sure to set it to 1;
that way we don't allocate cq/rq/sq entries based on a default (i.e.,
>1) number of iovs per entry.
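As a hedged illustration of how such a cap is commonly expressed
through libfabric hints (not necessarily the exact usNIC BTL code;
assumes <rdma/fabric.h>):

    struct fi_info *hints = fi_allocinfo();
    /* One iov per send/receive entry, so the provider can size its
     * cq/rq/sq entries accordingly. */
    hints->tx_attr->iov_limit = 1;
    hints->rx_attr->iov_limit = 1;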
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This PR renames the common library for OFI libfabric from
libfabric to ofi. There are a number of reasons this
is good to do:
1) It's shorter, replacing nine characters with three in
function names for what may eventually be a fairly extensive interface.
2) OFI is the term used for the MTL and RML components that use
the OFI libfabric interface.
3) A planned OSC component will also use the OFI term.
4) Other HPC libraries that can use OFI libfabric tend to use
the term "ofi" internally and also in their configure options
relevant to OFI libfabric (e.g., MPICH/CH4, Intel MPI, Sandia SHMEM).
There seem to be comments in places in the Open MPI source
code that indicate that this common library will be going away.
Far from it: we will want to be able to share things like
AV objects between OMPI and possibly OSHMEM components that
use the OFI libfabric interface.
This PR also adds synonyms for the --with-libfabric(-libdir)
configury options: --with-ofi and --with-ofi-libdir.
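For example (the install prefix is illustrative only):

    ./configure --with-ofi=/opt/libfabric --with-ofi-libdir=/opt/libfabric/lib ...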
Signed-off-by: Howard Pritchard <howardp@lanl.gov>