So (finally) consolidate these two fields into one "slots" field. Add a field in orte_job_t to indicate when all the procs for a job will be launched together, so that staged operations can know when MPI operations are allowed.
This commit was SVN r27239.
First revision of the Locality Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspired by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity in several different ways:
1. None: Specifying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint of heart, but offer a high degree of
(regular pattern) flexibility (each of these is described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every --bind-to/--map-by option in the "Simple" level has a
corresponding precise specification in the "Expert" level).
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as a
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
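
To make the nesting idea concrete, here is a tiny sketch with only two
of the nine levels. The topology (two sockets with two cores each) and
the variable names are assumptions made up for illustration; this is
not LAMA code.

```python
# Hypothetical toy topology: 2 sockets, each with 2 cores.
SOCKETS = 2
CORES_PER_SOCKET = 2

# Cores as the innermost loop: visit every core of socket 0 before
# moving on to socket 1.
cores_innermost = [(s, c)
                   for s in range(SOCKETS)
                   for c in range(CORES_PER_SOCKET)]

# Sockets as the innermost loop: alternate between the sockets.
sockets_innermost = [(s, c)
                     for c in range(CORES_PER_SOCKET)
                     for s in range(SOCKETS)]

print(cores_innermost)    # (socket, core) pairs in visiting order
print(sockets_innermost)
```

Both loops visit the same four (socket, core) pairs; only the order
differs, and the order is what decides which process lands where.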
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letters can be combined in many, many different ways to
produce many different regular mapping patterns.
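
The two sequences above can be compared with a small simulation. This
is only a sketch of the traversal idea, not the LAMA implementation:
the toy topology (2 server nodes x 2 sockets x 2 cores x 2 hardware
threads), the function names, and the choice to treat the cache, NUMA,
and board levels as having one instance per parent are all assumptions
made for illustration.

```python
import itertools
import re

# Hypothetical toy topology. Levels that are not listed here (L1, L2,
# L3, N, b) are assumed to have exactly one instance per parent, so
# they add no extra loop iterations.
SIZES = {"n": 2, "s": 2, "c": 2, "h": 2}

def tokenize(spec):
    """Split a mapping string such as 'csL1L2L3Nbnh' into tokens."""
    return re.findall(r"L[123]|[hcsNbn]", spec)

def traversal(spec):
    """Yield (node, socket, core, hwthread) tuples in mapping order."""
    # itertools.product varies its LAST argument fastest, so reverse
    # the tokens to make the leftmost token the fastest-moving loop.
    order = list(reversed(tokenize(spec)))
    ranges = [range(SIZES.get(tok, 1)) for tok in order]
    for idx in itertools.product(*ranges):
        coord = dict(zip(order, idx))
        yield tuple(coord.get(tok, 0) for tok in ("n", "s", "c", "h"))

def map_processes(spec, nprocs):
    """Return the placements of the first nprocs processes."""
    return list(itertools.islice(traversal(spec), nprocs))

by_core = map_processes("csL1L2L3Nbnh", 9)
by_socket = map_processes("sL1L2L3Nbnch", 4)
```

In this toy model, "csL1L2L3Nbnh" places the first eight processes on
the first hardware thread of each of the eight cores and only then
wraps around to the second hardware thread of the first core, while
"sL1L2L3Nbnch" hands consecutive processes to different sockets.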
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional definition is expressed in an MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple cores per process? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in an MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
followed by a colon, followed by any one of the mapping tokens. For
example, "1:c" is pronounced "one process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values are then taken
into account when mapping. Here are some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
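
As a rough illustration of the MPPR syntax described above, the
following sketch turns a string such as "1:s,2:n" into a table of
per-resource limits. The function name parse_mppr and the error
handling are hypothetical; the component's actual parser may differ.

```python
# The nine mapping tokens listed in the Mapping section.
VALID_TOKENS = {"h", "c", "s", "L1", "L2", "L3", "N", "b", "n"}

def parse_mppr(spec):
    """Parse a comma-delimited MPPR string into {token: max_procs}."""
    limits = {}
    for item in spec.split(","):
        count, sep, token = item.strip().partition(":")
        if not sep or not count.isdigit() or token not in VALID_TOKENS:
            raise ValueError(f"bad MPPR item: {item!r}")
        limits[token] = int(count)
    return limits

print(parse_mppr("1:s,2:n"))
```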
If mapping all processes to resources would exceed an MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string until all processes have been
mapped.
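
The oversubscription rule can be sketched as a simple counting check.
The code below is an assumption-laden illustration, not the LAMA
source: placements are hypothetical (node, socket, core, hwthread)
tuples, only the n/s/c/h levels are modeled, and the function names
are made up.

```python
from collections import Counter

# How many leading coordinates of a (node, socket, core, hwthread)
# placement identify one instance of each level in this toy model.
PREFIX_LEN = {"n": 1, "s": 2, "c": 3, "h": 4}

def is_oversubscribed(placements, limits):
    """Return True if any resource holds more processes than allowed."""
    for token, limit in limits.items():
        counts = Counter(p[:PREFIX_LEN[token]] for p in placements)
        if any(n > limit for n in counts.values()):
            return True
    return False

def check_job(placements, limits, allow_oversubscribe=False):
    """Mimic the described policy: continue or abort the job."""
    if is_oversubscribed(placements, limits) and not allow_oversubscribe:
        raise RuntimeError("job is oversubscribed; aborting")
    return placements

# "1:s,2:n": one process per socket, at most two per server node.
limits = {"s": 1, "n": 2}
ok = [(0, 0, 0, 0), (0, 1, 0, 0)]
too_many = ok + [(0, 0, 1, 0)]   # third process on node 0, socket 0
```

Here check_job(too_many, limits) raises, while passing
allow_oversubscribe=True (standing in for --oversubscribe) lets the
job continue.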
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads.")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
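
For comparison with the MPPR syntax, a binding string such as "1s"
could be parsed along these lines. This is a minimal sketch; the name
parse_bind and the regular expression are assumptions, not the
component's actual parser.

```python
import re

# An integer immediately followed by one mapping token (no colon).
BIND_RE = re.compile(r"(\d+)(L[123]|[hcsNbn])")

def parse_bind(spec):
    """Parse a binding string into (count, token), e.g. '1s' -> (1, 's')."""
    match = BIND_RE.fullmatch(spec)
    if match is None:
        raise ValueError(f"bad binding specification: {spec!r}")
    return int(match.group(1)), match.group(2)

print(parse_bind("1s"))   # one processor socket per process
```

Using fullmatch means a string that names more than one resource, or
that uses the MPPR-style colon, is rejected in this sketch.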
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
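
The difference between the two modes can be reproduced with a short
sketch of the example above (one server node, two sockets, four cores
per socket, eight processes mapped by socket). The helper names are
hypothetical and the code only models the rank numbering, not LAMA
itself.

```python
SOCKETS, CORES = 2, 4

# Mapping order for "--map-by socket": the socket index varies
# fastest, so consecutive processes alternate between the sockets.
mapping = [(s, c) for c in range(CORES) for s in range(SOCKETS)]

def natural(mapping):
    """Ranks follow the order in which processes were mapped."""
    return {place: rank for rank, place in enumerate(mapping)}

def sequential(mapping):
    """Ranks follow hardware order, left to right."""
    return {place: rank for rank, place in enumerate(sorted(mapping))}

def layout(ranks):
    """Show the ranks per socket, one inner list per socket."""
    return [[ranks[(s, c)] for c in range(CORES)] for s in range(SOCKETS)]

print(layout(natural(mapping)))      # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(layout(sequential(mapping)))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```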
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, lama is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
Remove some stale configure.m4's we no longer need.
Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.
This commit was SVN r27165.
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context
Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.
This commit was SVN r27005.
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.