This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times.
Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:
* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally
* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test
* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).
Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it
Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.
This commit was SVN r29040.
This commit reintroduces key compression into the pmi db. This feature
compresses the keys stored into the component into a small number of
PMI keys by serializing the data and base64 encoding the result. This
will avoid issues with Cray PMI which restricts us to ~ 3 PMI keys per
rank.
This commit was SVN r28993.
This commit adds an API for registering and querying performance
variables (mca_base_pvar) in the MCA base. The existing MCA variable
system API has been updated to reflect the new API: MCA variable
groups have performance variables, and new types have been added (double,
unsigned long long) to reflect what is required by the MPI_T
interface. Additionally, the MCA variable group code has been split
into its own set of files: mca_base_var_group.[ch].
Details of the new API can be found in doxygen comments in the header:
mca_base_pvar.h.
Other changes to the variable system:
- Use an opal_hash_table to speed up variable/group lookup.
- Clean up code associated with MCA variable types.
- Registered performance variables are printed by ompi_info -a. In the
future an option should be added to control this behavior.
Changes to OMPI:
- Added full support for the MPI_T performance variable interface.
This commit was SVN r28800.
Add an option to ompi_info (-l, --level) that takes a number in the
interval (1,9). Only MCA variables up to this level will be printed.
The default level is 1.
Print the level as part of both the parsable and readable output.
This commit was SVN r28750.
sandbox team has informed me that they are getting rid of SANDBOX_PID
in the future and that using SANDBOX_ON would be preferred.
This commit was SVN r28708.
To resolve this situation, add the ability to specify a backend topology file that mpirun shall use for its mapping operations. Create a new "set_topology" function in opal hwloc to support it.
This commit was SVN r28682.
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
due to PCI locality information as reported by the BIOS - from the
closest to device to furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
to the first node and 2 to the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA closest
to the specified device, and if successful, it will place 4 processes
on that NUMA but leaving the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specified binding
level that was "lower" than NUMA (i.e hwthread, core, socket) it would
bind to whatever level you specify.
This commit was SVN r28552.
in generated executables on systems that support it. Use
--disable-wrapper-rpath to disable this behavior. See text in
README about --disable-wrapper-rpath for more details.
This commit was SVN r28479.
The following Trac tickets were found above:
Ticket 376 --> https://svn.open-mpi.org/trac/ompi/ticket/376
added by hwloc's embedding so that it doesn't appear in
libhwloc_embedded.la (and therefore propogate all the way up to
libmpi.la).
Committed upstream in hwloc SVN r5588.
This commit was SVN r28457.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r5588