the lack of performing data sieving has been identified as a main reason for the poor performance in some instances on the Lustre file system. This commit introduces the fundamental ability to perform data sieving for read operations (which should not be controversial). The code itself is correct, what is still lacking is a) the logic when and how to activate data sieving and b) the logic to limit the size of the temporary buffer when doing data sieving.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
the dynamic_gen_file_write_all component distinguishes between the amount of data communicated
to aggregators, and the amount of data written in a cycle by the aggregator (in contrary e.g. to the vulcan component).
There was a bug in calculating which chunks have to be written in a cycle by an aggregator: we added as many elements into the
io_array until we filled one stripe. Unfortuantely, the metric used was the amount of data instead of ensuring that all offsets
fall within a single stripe. This commit fixes this issue. Note, the bug did not create a correctness problem, just a performance
problem in case there were gaps in the file view.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
This has shown to be more effective in achieving overlap
of inter- and intra-node communication and reduces the inital
delay before hitting the network.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Also make coll/tuned the default for shared memory communication
as coll/sm has shown performance issues that need investigation.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
The selectable list is sorted with lowest to highest priority so the
user-defined preferences should be appended to the list.
The preference treatment should also maintain the order provided by the user
(first item has highest priority) so switch the loop order.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
- Add some missing AC_CHECK_SIZEOF's in configure.ac
- Remove some unused variables
- Initialize some variables
- Fix some parameter types
- Cast where appropriate/safe to fix warnings
- Move ompi/mca/common/monitoring Fortran bindings to a separate .c
file so that they can use different #define's than the C bindings,
and therefore compile properly / without warnings.
- Fix signedness discrepancies
- Who knew? Separated these into multiple #if's, instead:
```
// This is undefined behavior
#define HAVE_FOO defined(FOO)
#define YOW (HAVE_FOO && defined(BAR))
```
- Fix some typos in OMPI_BUILD_HOST logic
- Don't "2>/dev/null" in OMPI_BUILD_HOST logic; it just hides errors
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The total size depends on number of ranks so the usual ranges don't work.
Thus, use the average across all ranks to make a decision.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
MPI_Ialltoallw() and friends take a const MPI_Datatype types[] argument.
In order to be able to call OBJ_RELEASE(types[0]), we used to simply
drop the const modifier. This change make it right by introducing the
OBJ_RELEASE_NO_NULLIFY(object) macro that no more set object = NULL
if the object is freed.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
These selections seem harmful in my measurements and don't seem to be
motivated by previous measurement data.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
A mindless task for a lazy weekend: convert all the README and
README.txt files to Markdown. Paired with the slow conversion of all
of our man pages to Markdown, this gives a uniform language to the
Open MPI docs.
This commit moved a bunch of copyright headers out of the top-level
README.txt file, so I updated the relevant copyright header years in
the top-level LICENSE file to match what was removed from README.txt.
Additionally, this commit did (very) little to update the actual
content of the README files. A very small number of updates were made
for topics that I found blatently obvious while Markdown-izing the
content, but in general, I did not update content during this commit.
For example, there's still quite a bit of text about ORTE that was not
meaningfully updated.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Co-authored-by: Josh Hursey <jhursey@us.ibm.com>
This commit removes the unnecessary call to `fi_getinfo()` when
initializing the MTL. `cq_data_size` is a domain attribute that will be
available to the MTL from the initial query itself. FI_DIRECTED_RECV is
a primary capability that has to be requested for a provider to enable
it, so adding that to the initial requirement. The redundant query was
also overwriting the contents of the prov object, which already had the
include/exclude filtering and multi-NIC logic applied to it.
Signed-off-by: Raghu Raja <craghun@amazon.com>
Bcast: scatter_allgather and scatter_allgather_ring expect N_elem >= N_procs
Allreduce: rabenseifner expects N_elem >= pow2 nearest to N_procs
In all cases, the implementations will fall back to a linear implementation,
which will most likely yield the worst performance (noted for 4B bcast on 128 ranks)
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
The mca parameters coll_tuned_*_algorithm are ignored unless coll_tuned_use_dynamic_rules is true so mention that in the description.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
PGI (20.4) compiler do not define this intrinsic, so only build
AVX512 support if _mm512_mullo_epi64() intrisic is defined.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
8017f12 introduced a new function to get the package rank of a process,
which had a pass-by-value signature (opal_process_info_t); and coverity
was not happy about it. This commit changes the signature to take a
reference to opal_process_info_t instead.
Signed-off-by: Raghu Raja <craghun@amazon.com>
If PMIX_PACKAGE_RANK is available, uses this value to select between multiple
NIC of equal distance between the current process. If this value is not
available, try to calculate it by getting the locality string from each local
process and assign a package_rank. If everything fails, fall back to using
process_id.rank to select the NIC. This last case is not ideal, but has a small
chance of occuring, and causes an output to be displayed to notify that this is
occuring.
Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
Among many other things:
- Fix an imbalance bug in MPI_allgather
- Accept more human readable configuration files. We can now specify
the collective by name instead of a magic number, and the component
we want to use also by name.
- Add the capability to have optional arguments in the collective
communication configuration file. Right now the capability exists
for segment lengths, but is yet to be connected with the algorithms.
- Redo the initialization of all HAN collectives.
Cleanup the fallback collective support.
- In case the module is unable to deliver the expected result, it will fallback
executing the collective operation on another collective component. This change
make the support for this fallback simpler to use.
- Implement a fallback allowing a HAN module to remove itself as
potential active collective module, and instead fallback to the
next module in line.
- Completely disable the HAN modules on error. From the moment an error is
encountered they remove themselves from the communicator, and in case some
other modules calls them simply behave as a pass-through.
Communicator: provide ompi_comm_split_with_info to split and provide info at the same time
Add ompi_comm_coll_preference info key to control collective component selection
COLL HAN: use info keys instead of component-level variable to communicate topology level between abstraction layers
- The info value is a comma-separated list of entries, which are chosen with
decreasing priorities. This overrides the priority of the component,
unless the component has disqualified itself.
An entry prefixed with ^ starts the ignore-list. Any entry following this
character will be ingnored during the collective component selection for the
communicator.
Example: "sm,libnbc,^han,adapt" gives sm the highest preference, followed
by libnbc. The components han and adapt are ignored in the selection process.
- Allocate a temporary buffer for all lower-level leaders (length 2 segments)
- Fix the handling of MPI_IN_PLACE for gather and scatter.
COLL HAN: Fix topology handling
- HAN should not rely on node names to determine the ordering of ranks.
Instead, use the node leaders as identifiers and short-cut if the
node-leaders agree that ranks are consecutive. Also, error out if
the rank distribution is imbalanced for now.
Signed-off-by: Xi Luo <xluo12@vols.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
The current iprobe/improbe implementations merely checks the return
code on the posted receive operation to tell if there is a match or
not. This commit moves the check to the probe's error callback
instead. Per the semantics defined in libfabric, the peek operation is
asynchronous and the results are to be fetched from the completion
queue. If no message is found matching the tags specified in the peek
request, then a completion queue error entry with err field set to
FI_ENOMSG will be available.
Signed-off-by: Raghu Raja <craghun@amazon.com>
In multi-threaded scenarios, any thread that attempts to read a CQ
when there's a pending error CQ entry gets an -FI_EAVAIL. Without
any serialization here (which is okay, since libfabric will protect
access to critical CQ objects), all threads proceed to read from the
error CQ, but only one thread fetches the entry while others get
-FI_EAGAIN indicating an empty queue, which is not erroneous.
Signed-off-by: Raghu Raja <craghun@amazon.com>