= Overview =
First revision of the Locality Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspired by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity in several different ways:
1. None: Specifying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint of heart, but offer a high degree of
(regular pattern) flexibility (each is described more fully
below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
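As a rough illustration of how the Expert parameters might be passed, the sketch below assembles an mpirun argument list. The helper name build_lama_argv is hypothetical, the value strings follow the forms described in the Expert sections, and rmaps_lama_ordering is left out because this message does not spell out its value syntax.

```python
def build_lama_argv(np, program, layout=None, bind=None, mppr=None):
    """Build an mpirun argument list that selects the lama mapper.

    Hypothetical helper: it only joins the parameter names given in
    this message with caller-supplied values.
    """
    argv = ["mpirun", "-np", str(np), "--mca", "rmaps", "lama"]
    # Only pass the Expert parameters that the caller actually set.
    for name, value in (
        ("rmaps_lama_map", layout),
        ("rmaps_lama_bind", bind),
        ("rmaps_lama_mppr", mppr),
    ):
        if value is not None:
            argv += ["--mca", name, value]
    argv.append(program)
    return argv


print(" ".join(build_lama_argv(8, "a.out", layout="csL1L2L3Nbnh",
                               bind="1c", mppr="1:c")))
```

Building the command as a list rather than a single string avoids shell quoting issues if it is later handed to a process launcher.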
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every --bind-to/--map-by option in the "Simple" level has a
corresponding precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as a
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
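Because some tokens are two characters long (L1, L2, L3) and "N" and "n" differ only by case, splitting such a sequence takes a little care. The following is a minimal sketch of a tokenizer for the nine tokens listed above. The function name tokenize_layout is made up for illustration and this is not the component's actual parser.

```python
import re

# The nine tokens listed above. The two-character cache tokens are
# tried first so that "L1" is never split into separate characters.
_TOKEN_RE = re.compile(r"L[123]|[hcsNbn]")


def tokenize_layout(layout):
    """Split a layout string such as "csL1L2L3Nbnh" into its tokens."""
    tokens = []
    pos = 0
    while pos < len(layout):
        match = _TOKEN_RE.match(layout, pos)
        if match is None:
            raise ValueError(
                f"unknown token at offset {pos}: {layout[pos:]!r}")
        tokens.append(match.group(0))
        pos = match.end()
    return tokens


print(tokenize_layout("csL1L2L3Nbnh"))
```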
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map round robin by core"
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
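The two example strings above can be replayed with a small nested-loop model. This is a simplified sketch, not LAMA's implementation: it assumes a perfectly regular machine, models only the n, s, c and h levels (the cache, NUMA and board tokens are skipped), and treats the leftmost token as the fastest-varying loop. The function name mapping_order and the sizes dictionary are illustrative.

```python
from itertools import product


def mapping_order(tokens, sizes):
    """Yield resource coordinates in the order a layout visits them.

    tokens: layout tokens, leftmost token iterating fastest.
    sizes:  token -> how many of that resource exist per parent.
            Tokens missing from sizes are ignored in this sketch.
    """
    levels = [t for t in tokens if t in sizes]
    # itertools.product varies its LAST argument fastest, so the
    # levels are reversed to make the leftmost token the inner loop.
    ranges = [range(sizes[t]) for t in reversed(levels)]
    for combo in product(*ranges):
        yield dict(zip(reversed(levels), combo))


# One node, two sockets, two cores per socket, two threads per core.
sizes = {"n": 1, "s": 2, "c": 2, "h": 2}
by_core = ["c", "s", "L1", "L2", "L3", "N", "b", "n", "h"]
by_socket = ["s", "L1", "L2", "L3", "N", "b", "n", "c", "h"]

for name, tokens in (("by core", by_core), ("by socket", by_socket)):
    order = [(p["s"], p["c"], p["h"]) for p in mapping_order(tokens, sizes)]
    print(name, order[:5])  # tuples are (socket, core, hwthread)
```

In this model the "by core" string fills every first hardware thread on every core before touching any second hardware thread, while the "by socket" string alternates between sockets first.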
The sequence of letters can be combined in many, many different ways to
produce many different regular mapping patterns.
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional definition is expressed in an MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple cores per process? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in an MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
followed by a colon, followed by any one of the tokens from mapping.
For example "1:c" is pronounced "one process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values are then taken
into account when mapping. Here are some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed an MPPR limit, the
job is considered oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string until all processes have been
mapped.
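The MPPR rules above can be sketched as a small parser plus a limit check. This is an illustrative model only: parse_mppr, exceeds_mppr and check_job are hypothetical names, and each placement is assumed to carry a globally unique id for every resource it sits on (for example, a socket id that includes its node).

```python
from collections import Counter

VALID_TOKENS = {"h", "c", "s", "L1", "L2", "L3", "N", "b", "n"}


def parse_mppr(spec):
    """Parse an MPPR string such as "1:s,2:n" into {"s": 1, "n": 2}."""
    limits = {}
    for item in spec.split(","):
        count, sep, token = item.strip().partition(":")
        if sep != ":" or not count.isdecimal() or token not in VALID_TOKENS:
            raise ValueError(f"bad MPPR item: {item!r}")
        limits[token] = int(count)
    return limits


def exceeds_mppr(placements, limits):
    """Return True if any single resource holds more processes than allowed.

    placements: one dict per process, mapping token -> resource id.
    """
    counts = Counter()
    for place in placements:
        for token in limits:
            counts[(token, place[token])] += 1
    return any(n > limits[token] for (token, _), n in counts.items())


def check_job(placements, limits, oversubscribe=False):
    """Mirror the rule above: abort unless oversubscription is allowed."""
    if exceeds_mppr(placements, limits) and not oversubscribe:
        raise RuntimeError("job is oversubscribed; aborting")


limits = parse_mppr("1:s,2:n")
two_procs = [{"n": 0, "s": (0, 0)}, {"n": 0, "s": (0, 1)}]
print(exceeds_mppr(two_procs, limits))
# A third process on the same node breaks the "2:n" limit, even on
# a (hypothetical) third socket of its own.
print(exceeds_mppr(two_procs + [{"n": 0, "s": (0, 2)}], limits))
```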
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads".)
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
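A binding string of this form can be parsed with one regular expression. The sketch below also turns it into a binding width counted in hardware threads, given an assumed table of hardware threads per resource. Both function names and the example topology (two hardware threads per core, four cores per socket) are illustrative assumptions.

```python
import re

# An integer immediately followed by one mapping token, with no colon.
_BIND_RE = re.compile(r"(\d+)(L[123]|[hcsNbn])")


def parse_binding(spec):
    """Parse a binding string such as "1s" into (1, "s")."""
    match = _BIND_RE.fullmatch(spec)
    if match is None:
        raise ValueError(f"bad binding string: {spec!r}")
    return int(match.group(1)), match.group(2)


def binding_width(spec, hwthreads_per):
    """Number of hardware threads covered by a binding string.

    hwthreads_per: token -> hardware threads inside one such resource
    (supplied by the caller; assumed uniform across the machine).
    """
    count, token = parse_binding(spec)
    return count * hwthreads_per[token]


# Assumed topology: 2 hardware threads per core, 4 cores per socket.
hwthreads_per = {"h": 1, "c": 2, "s": 8}
print(parse_binding("1s"), binding_width("1s", hwthreads_per))
print(parse_binding("2c"), binding_width("2c", hwthreads_per))
```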
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
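The difference between the two ordering modes can be reproduced for the two-socket, four-core example above. This is a minimal sketch: assign_ranks and ranks_by_socket are hypothetical helpers, and each process is identified only by the (socket, core) pair it was mapped to.

```python
def assign_ranks(mapped, ordering):
    """Return {(socket, core): rank} for processes given in mapping order."""
    if ordering == "natural":
        ordered = list(mapped)    # ranks follow the mapping order
    elif ordering == "sequential":
        ordered = sorted(mapped)  # ranks follow the hardware layout
    else:
        raise ValueError(f"unknown ordering: {ordering!r}")
    return {place: rank for rank, place in enumerate(ordered)}


def ranks_by_socket(ranks, sockets=2, cores=4):
    """Group ranks per socket, cores listed left to right."""
    return [[ranks[(s, c)] for c in range(cores)] for s in range(sockets)]


# Eight processes mapped round robin by socket: s0c0, s1c0, s0c1, ...
mapped = [(s, c) for c in range(4) for s in range(2)]

print(ranks_by_socket(assign_ranks(mapped, "natural")))
print(ranks_by_socket(assign_ranks(mapped, "sequential")))
```

With natural ordering this model reproduces the [0 2 4 6] [1 3 5 7] layout quoted above. With sequential ordering the same placement yields [0 1 2 3] [4 5 6 7].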
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, lama is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
/*
 * Copyright (c) 2011 Oak Ridge National Labs. All rights reserved.
2013-05-10 15:06:25 +00:00
 * Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/**
 * Processing for command line interface options
 *
 */
#include "rmaps_lama.h"

2013-04-29 17:02:37 +00:00
#include "opal/util/argv.h"
= Overview =
First revision of the Locatation Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular of process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspiried by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity multiple different ways:
1. None: Speciying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-to
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint at heart, but offer a high degree of
(regular pattern) flexibility (each of these are described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every
--bind-to/--map-by option in the "Simple" level has a
corresponding
precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as an
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letter can be combined in many, many different ways to
produce many different regular mapping patterns.
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional defintion is expressed in a MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple processes per core? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in a MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
follow by a colon, followed by any of the tokens from mapping can be
used in the MPPR specification. For example "1:c" is pronounced "one
process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values and then taken
into account when mapping. Here's some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string untill all processes have been
mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads".)
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
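The binding form (an integer immediately followed by a single token, with no colon and no list) can be sketched in the same way. The function name bind_parse is hypothetical and only illustrates the syntax described above; it is not LAMA's actual parser.

```c
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Parse a binding string such as "1s" or "2L2".
 * On success, store the integer in *width, copy the token into level
 * (which must hold at least 3 bytes), and return 0.
 * Return -1 for anything else, including MPPR-style "1:s" and
 * comma-delimited lists, since only one specification is allowed. */
int bind_parse(const char *spec, int *width, char level[3])
{
    static const char *known[] = { "h", "c", "s", "L1", "L2", "L3", "N", "b", "n" };
    char *end = NULL;
    long count;
    size_t i;

    /* The string must start with a digit (no sign, no whitespace) */
    if (!isdigit((unsigned char) spec[0])) {
        return -1;
    }
    count = strtol(spec, &end, 10);
    if (count < 1 || count > INT_MAX) {
        return -1;
    }
    /* Everything after the integer must be exactly one known token */
    for (i = 0; i < sizeof(known) / sizeof(known[0]); ++i) {
        if (0 == strcmp(end, known[i])) {
            *width = (int) count;
            strcpy(level, known[i]);
            return 0;
        }
    }
    return -1;
}
```

In this sketch "1s" parses to width 1 and token "s" ("bind each process to one processor socket"), while "1:s" and "1s,2n" are both rejected.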
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
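The difference between the two modes can be made concrete for the two-socket, four-core example above. The sketch below is an illustration only: order_ranks and its constants are hypothetical names, it assumes one process per core with round-robin-by-socket mapping, and the sequential result reflects one reading of the "single line of hardware resources" description rather than output taken from LAMA.

```c
#include <stddef.h>

enum { NUM_SOCKETS = 2, CORES_PER_SOCKET = 4 };

/* Fill ranks[socket][core] with MPI_COMM_WORLD ranks for 8 processes
 * mapped round robin by socket, one process per core.
 *   natural != 0: rank follows the mapping order
 *   natural == 0: rank follows the left-to-right hardware position */
void order_ranks(int natural, int ranks[NUM_SOCKETS][CORES_PER_SOCKET])
{
    int nprocs = NUM_SOCKETS * CORES_PER_SOCKET;
    int i;

    for (i = 0; i < nprocs; ++i) {
        /* The i-th mapped process alternates between sockets */
        int socket = i % NUM_SOCKETS;
        int core   = i / NUM_SOCKETS;

        if (natural) {
            ranks[socket][core] = i;
        } else {
            ranks[socket][core] = socket * CORES_PER_SOCKET + core;
        }
    }
}
```

With natural ordering this reproduces [0 2 4 6] [1 3 5 7]; with sequential ordering the same placement is numbered [0 1 2 3] [4 5 6 7].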
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, LAMA is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA
considers each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
#include "orte/mca/rmaps/base/rmaps_private.h"
#include "orte/mca/rmaps/base/base.h"
2013-05-10 15:06:25 +00:00
#include "orte/util/show_help.h"
#include <ctype.h>

/*********************************
 * Local Functions
 *********************************/
/*
 * QSort: Integer comparison
 */
static int lama_parse_int_sort(const void *a, const void *b);

/*
 * Convert the '-ppr' syntax from the 'ppr' component to the 'lama' '-mppr' syntax.
 */
static char * rmaps_lama_covert_ppr(char * given_ppr);

/*********************************
 * Parsing Functions
 *********************************/
int rmaps_lama_process_alias_params(orte_job_t *jdata)
{
    int exit_status = ORTE_SUCCESS;

    /*
     * Mapping options
     * Note: L1, L2, L3 are not exposed in orterun to the user, so
     *       there is no need to specify them here.
     */
    if( NULL == rmaps_lama_cmd_map ) {
        /* orte_rmaps_base.mapping */
        switch( ORTE_GET_MAPPING_POLICY(jdata->map->mapping) ) {
        case ORTE_MAPPING_BYNODE:
            /* rmaps_lama_cmd_map = strdup("nbNsL3L2L1ch"); */
            rmaps_lama_cmd_map = strdup("nbsch");
            break;
        case ORTE_MAPPING_BYBOARD:
            /* rmaps_lama_cmd_map = strdup("bnNsL3L2L1ch"); */
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid mapping option",
                           true,
                           "by board", "mapping by board not supported by LAMA");
            exit_status = ORTE_ERR_NOT_SUPPORTED;
            goto cleanup;
            break;
        case ORTE_MAPPING_BYNUMA:
            /* rmaps_lama_cmd_map = strdup("NbnsL3L2L1ch"); */
            rmaps_lama_cmd_map = strdup("Nbnsch");
            break;
        case ORTE_MAPPING_BYSOCKET:
            /* rmaps_lama_cmd_map = strdup("sNbnL3L2L1ch"); */
            rmaps_lama_cmd_map = strdup("sbnch");
            break;
        case ORTE_MAPPING_BYL3CACHE:
            rmaps_lama_cmd_map = strdup("L3sNbnL2L1ch");
            break;
        case ORTE_MAPPING_BYL2CACHE:
            rmaps_lama_cmd_map = strdup("L2sNbnL1ch");
            break;
        case ORTE_MAPPING_BYL1CACHE:
            rmaps_lama_cmd_map = strdup("L1sNbnch");
            break;
        case ORTE_MAPPING_BYCORE:
        case ORTE_MAPPING_BYSLOT:
            /* rmaps_lama_cmd_map = strdup("cL1L2L3sNbnh"); */
            rmaps_lama_cmd_map = strdup("csbnh");
            break;
        case ORTE_MAPPING_BYHWTHREAD:
            /* rmaps_lama_cmd_map = strdup("hcL1L2L3sNbn"); */
            rmaps_lama_cmd_map = strdup("hcsbn");
            break;
        case ORTE_MAPPING_RR:
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid mapping option",
                           true,
                           "round robin", "mapping by round robin not supported by LAMA");
            exit_status = ORTE_ERR_NOT_SUPPORTED;
            goto cleanup;
= Overview =
First revision of the Locatation Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular of process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspiried by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity multiple different ways:
1. None: Speciying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-to
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint at heart, but offer a high degree of
(regular pattern) flexibility (each of these are described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every
--bind-to/--map-by option in the "Simple" level has a
corresponding
precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as a
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map round robin by core"
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letters can be combined in many, many different ways to
produce many different regular mapping patterns.
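The traversal rule described above can be sketched as a mixed-radix counter in which the leftmost token of the mapping string varies fastest. This is only a simplified model, not LAMA's implementation: it assumes a single homogeneous node, only the three tokens "s", "c", and "h", and hypothetical names (`sketch_level_of`, `sketch_map_slot`, `extent`) that do not appear in the LAMA source.

```c
#include <assert.h>
#include <string.h>

enum { LVL_SOCKET = 0, LVL_CORE = 1, LVL_HWTHREAD = 2, NUM_LVLS = 3 };

/* Hypothetical helper: translate a one-character token into a level index. */
static int sketch_level_of(char token)
{
    switch (token) {
    case 's': return LVL_SOCKET;
    case 'c': return LVL_CORE;
    case 'h': return LVL_HWTHREAD;
    default:  return -1;
    }
}

/*
 * Compute where the index-th mapped process lands.
 *   order  - three distinct tokens, e.g. "csh"; the leftmost varies fastest
 *   extent - number of resources at each level within its parent
 *   coord  - output: chosen socket, core, and hardware thread
 * Returns 0 on success, -1 for a bad order string or when the index
 * exceeds the available resources.
 */
static int sketch_map_slot(const char *order, const int extent[NUM_LVLS],
                           int index, int coord[NUM_LVLS])
{
    int seen[NUM_LVLS] = { 0, 0, 0 };
    int i, lvl;

    if (index < 0 || strlen(order) != (size_t) NUM_LVLS) {
        return -1;
    }
    for (i = 0; i < NUM_LVLS; ++i) {
        lvl = sketch_level_of(order[i]);
        if (lvl < 0 || seen[lvl]) {
            return -1;
        }
        seen[lvl] = 1;
        assert(extent[lvl] > 0);
        coord[lvl] = index % extent[lvl];  /* position within this level */
        index /= extent[lvl];              /* carry into the next token */
    }
    return (0 == index) ? 0 : -1;
}
```

With two sockets, two cores per socket, and two hardware threads per core, the order "csh" fills every core before touching a second hardware thread, while "sch" alternates sockets first, which mirrors the "--map-by core" and "--map-by socket" descriptions above.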
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional definition is expressed in an MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple cores per process? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in an MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
followed by a colon, followed by a resource token; any of the tokens
from mapping can be used in the MPPR specification. For example, "1:c"
is pronounced "one process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values are then taken
into account when mapping. Here are some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed an MPPR limit, the
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will repeatedly
cycle through the mapping token string until all processes have been
mapped.
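As an illustration of the MPPR syntax described above (an integer, a colon, a resource token, with entries separated by commas), here is a minimal parsing sketch. It is not the parser used by LAMA; the names `sketch_mppr`, `sketch_is_token`, and `sketch_parse_mppr` are hypothetical, and rejecting counts below 1 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One parsed limit, e.g. "2:n" becomes { 2, "n" }. */
struct sketch_mppr {
    long max_procs;
    char token[3];
};

/* Check a candidate against the nine mapping tokens listed earlier. */
static int sketch_is_token(const char *s, size_t len)
{
    static const char *known[] = { "h", "c", "s", "L1", "L2", "L3",
                                   "N", "b", "n" };
    size_t i;

    for (i = 0; i < sizeof(known) / sizeof(known[0]); ++i) {
        if (strlen(known[i]) == len && 0 == strncmp(known[i], s, len)) {
            return 1;
        }
    }
    return 0;
}

/* Returns the number of limits parsed, or -1 on a malformed string. */
static int sketch_parse_mppr(const char *spec, struct sketch_mppr *out,
                             int max_out)
{
    const char *p = spec;
    int n = 0;

    assert(NULL != spec && NULL != out);
    while ('\0' != *p) {
        char *end;
        size_t len;
        long count = strtol(p, &end, 10);

        if (end == p || count < 1 || ':' != *end || n >= max_out) {
            return -1;
        }
        p = end + 1;                 /* skip the ':' */
        len = strcspn(p, ",");       /* token runs up to the next comma */
        if (!sketch_is_token(p, len)) {
            return -1;
        }
        out[n].max_procs = count;
        memcpy(out[n].token, p, len);
        out[n].token[len] = '\0';
        ++n;
        p += len;
        if (',' == *p) {
            ++p;
            if ('\0' == *p) {
                return -1;           /* trailing comma */
            }
        }
    }
    return n;
}
```

Parsing "1:s,2:n" with this sketch yields two limits: at most one process per socket and at most two per server node.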
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads".)
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
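The binding syntax described above (an integer immediately followed by one resource token, with no colon and no list) could be checked with a sketch like the following. The function name `sketch_parse_bind` is hypothetical, and this is not LAMA's own parser; requiring a width of at least 1 is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a binding string such as "1s" into a width and a token.
 * Returns 0 on success, -1 if the string is not a single
 * <integer><token> specification.
 */
static int sketch_parse_bind(const char *spec, long *width, char token[3])
{
    static const char *known[] = { "h", "c", "s", "L1", "L2", "L3",
                                   "N", "b", "n" };
    char *end;
    long n;
    size_t i;

    assert(NULL != spec && NULL != width && NULL != token);
    n = strtol(spec, &end, 10);
    if (end == spec || n < 1) {
        return -1;                    /* no leading positive integer */
    }
    for (i = 0; i < sizeof(known) / sizeof(known[0]); ++i) {
        if (0 == strcmp(end, known[i])) {
            *width = n;
            strcpy(token, known[i]);  /* at most two characters plus NUL */
            return 0;
        }
    }
    return -1;                        /* unknown token or trailing text */
}
```

Because the whole remainder after the integer must match one token, an MPPR-style string such as "1:s" or a list such as "1s,1c" is rejected.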
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential and natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
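The two ordering modes can be compared on the example above (two sockets with four cores each, processes mapped round robin by socket). The sketch below is an illustrative model only, with hypothetical names (`SKETCH_SOCKETS`, `SKETCH_CORES`, `sketch_order_natural`, `sketch_order_sequential`); it assumes exactly one process per core.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SOCKETS 2
#define SKETCH_CORES   4

/*
 * Natural ordering: ranks follow the mapping order. With round-robin
 * mapping by socket, the i-th mapped process lands on socket
 * (i % sockets), core (i / sockets), and receives rank i.
 */
static void sketch_order_natural(int rank_at[SKETCH_SOCKETS][SKETCH_CORES])
{
    int i;

    assert(NULL != rank_at);
    for (i = 0; i < SKETCH_SOCKETS * SKETCH_CORES; ++i) {
        rank_at[i % SKETCH_SOCKETS][i / SKETCH_SOCKETS] = i;
    }
}

/*
 * Sequential ordering: lay the resources out in one line and number
 * the processes from left to right, regardless of mapping order.
 */
static void sketch_order_sequential(int rank_at[SKETCH_SOCKETS][SKETCH_CORES])
{
    int s, c, next = 0;

    assert(NULL != rank_at);
    for (s = 0; s < SKETCH_SOCKETS; ++s) {
        for (c = 0; c < SKETCH_CORES; ++c) {
            rank_at[s][c] = next++;
        }
    }
}
```

In this model the natural ordering reproduces the [0 2 4 6] [1 3 5 7] layout from the example, while the sequential ordering gives [0 1 2 3] [4 5 6 7].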
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, LAMA is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
        case ORTE_MAPPING_SEQ:
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid mapping option",
                           true,
                           "sequential", "mapping by sequential not supported by LAMA");
            exit_status = ORTE_ERR_NOT_SUPPORTED;
            goto cleanup;
        case ORTE_MAPPING_BYUSER:
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid mapping option",
                           true,
                           "by user", "mapping by user not supported by LAMA");
            exit_status = ORTE_ERR_NOT_SUPPORTED;
            goto cleanup;
        default:
            /*
             * Default is map-by core
             */
            rmaps_lama_cmd_map = strdup("cL1L2L3sNbnh");
            break;
        }
    }

    /*
     * Binding Options
     */
    if( NULL == rmaps_lama_cmd_bind ) {
        /*
         * No binding specified, use default
         */
        if( !OPAL_BINDING_POLICY_IS_SET(jdata->map->binding) ||
            !OPAL_BINDING_REQUIRED(opal_hwloc_binding_policy) ||
            OPAL_BIND_TO_NONE == OPAL_GET_BINDING_POLICY(jdata->map->binding) ) {
            rmaps_lama_cmd_bind = NULL;
        }

        switch( OPAL_GET_BINDING_POLICY(jdata->map->binding) ) {
        case OPAL_BIND_TO_BOARD:
            /* rmaps_lama_cmd_bind = strdup("1b"); */
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid binding option",
                           true,
                           "by board", "binding to board not supported by LAMA");
            exit_status = ORTE_ERR_NOT_SUPPORTED;
            goto cleanup;
            break;
        case OPAL_BIND_TO_NUMA:
            rmaps_lama_cmd_bind = strdup("1N");
            break;
        case OPAL_BIND_TO_SOCKET:
            rmaps_lama_cmd_bind = strdup("1s");
            break;
        case OPAL_BIND_TO_L3CACHE:
            rmaps_lama_cmd_bind = strdup("1L3");
            break;
        case OPAL_BIND_TO_L2CACHE:
            rmaps_lama_cmd_bind = strdup("1L2");
            break;
        case OPAL_BIND_TO_L1CACHE:
            rmaps_lama_cmd_bind = strdup("1L1");
            break;
        case OPAL_BIND_TO_CORE:
            rmaps_lama_cmd_bind = strdup("1c");
            break;
        case OPAL_BIND_TO_HWTHREAD:
            rmaps_lama_cmd_bind = strdup("1h");
            break;
        case OPAL_BIND_TO_CPUSET:
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid binding option",
                           true,
                           "by CPU set", "binding to CPU set not supported by LAMA");
            exit_status = ORTE_ERR_NOT_SUPPORTED;
            goto cleanup;
            break;
        default:
            rmaps_lama_cmd_bind = NULL;
            break;
        }
    }

    /*
     * Ordering (a.k.a. Ranking) Options
     */
    if( NULL == rmaps_lama_cmd_ordering ) {
        /* orte_rmaps_base.ranking */
        switch( ORTE_GET_RANKING_POLICY(jdata->map->ranking) ) {
        case ORTE_RANK_BY_SLOT:
            rmaps_lama_cmd_ordering = strdup("s");
            break;
        case ORTE_RANK_BY_NODE:
        case ORTE_RANK_BY_NUMA:
        case ORTE_RANK_BY_SOCKET:
        case ORTE_RANK_BY_L3CACHE:
        case ORTE_RANK_BY_L2CACHE:
        case ORTE_RANK_BY_L1CACHE:
        case ORTE_RANK_BY_CORE:
        case ORTE_RANK_BY_HWTHREAD:
            rmaps_lama_cmd_ordering = strdup("n");
            break;
        case ORTE_RANK_BY_BOARD:
            /* rmaps_lama_cmd_ordering = strdup("n"); */
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid ordering option",
                           true,
                           "by board", "ordering by board not supported by LAMA");
= Overview =
First revision of the Locatation Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular of process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspiried by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity multiple different ways:
1. None: Speciying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-to
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint at heart, but offer a high degree of
(regular pattern) flexibility (each of these are described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every
--bind-to/--map-by option in the "Simple" level has a
corresponding
precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as an
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letter can be combined in many, many different ways to
produce many different regular mapping patterns.
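As a rough illustration of how such a string breaks down into tokens, here is a minimal tokenizer sketch. It assumes only the nine tokens listed earlier (h, c, s, L1, L2, L3, N, b, n); the enum and function names are hypothetical and are not taken from the LAMA component.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum for illustration; not the component's own type. */
enum toy_level {
    TOY_HWTHREAD, TOY_CORE, TOY_SOCKET, TOY_L1, TOY_L2, TOY_L3,
    TOY_NUMA, TOY_BOARD, TOY_NODE
};

/* Split a map string into levels. Returns the number of tokens written
 * to `out`, or -1 on an unknown token or if `out` is too small. */
static int toy_parse_map(const char *spec, enum toy_level *out, int max_out)
{
    int count = 0;

    assert(spec != NULL && out != NULL);

    while (*spec != '\0') {
        enum toy_level level;

        switch (*spec) {
        case 'h': level = TOY_HWTHREAD; break;
        case 'c': level = TOY_CORE;     break;
        case 's': level = TOY_SOCKET;   break;
        case 'N': level = TOY_NUMA;     break;
        case 'b': level = TOY_BOARD;    break;
        case 'n': level = TOY_NODE;     break;
        case 'L':
            /* Cache tokens are two characters: L1, L2, or L3. */
            spec++;
            if (*spec == '1')      level = TOY_L1;
            else if (*spec == '2') level = TOY_L2;
            else if (*spec == '3') level = TOY_L3;
            else return -1;
            break;
        default:
            return -1;
        }
        if (count >= max_out) return -1;
        out[count++] = level;
        spec++;
    }
    return count;
}
```

Note that the tokens are case-sensitive: "N" (NUMA node) and "n" (server node) are different resources.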
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional definition is expressed in an MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and each process
needs multiple cores? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in an MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
followed by a colon, followed by a resource token; any of the tokens
from the mapping string can be used. For example, "1:c" is pronounced
"one process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values are then taken
into account when mapping. Here are some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
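The "integer, colon, token" form with comma-delimited entries can be parsed along these lines. This is a sketch under assumptions: the struct and function names are hypothetical, and the token is only checked for length (one or two characters), not against the list of valid resource tokens.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one "<count>:<token>" entry. */
struct toy_mppr {
    long max_procs;  /* the integer before the colon */
    char token[3];   /* resource token, e.g. "c", "s", "n", "L2" */
};

/* Parse a comma-delimited MPPR string such as "1:s,2:n".
 * Returns the number of entries, or -1 on malformed input. */
static int toy_parse_mppr(const char *spec, struct toy_mppr *out, int max_out)
{
    int count = 0;

    assert(spec != NULL && out != NULL);

    while (*spec != '\0') {
        char *end;
        long value = strtol(spec, &end, 10);
        size_t len;

        if (end == spec || value < 1 || *end != ':') return -1;
        spec = end + 1;

        len = strcspn(spec, ",");          /* token runs to the next comma */
        if (len < 1 || len > 2 || count >= max_out) return -1;
        memcpy(out[count].token, spec, len);
        out[count].token[len] = '\0';
        out[count].max_procs = value;
        count++;

        spec += len;
        if (*spec == ',') {
            spec++;
            if (*spec == '\0') return -1;  /* trailing comma */
        }
    }
    return count;
}
```

For "1:s,2:n" this yields two entries: at most 1 process per "s" and at most 2 processes per "n".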
If mapping all processes to resources would exceed an MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will keep
cycling through the mapping token string until all processes have been
mapped.
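The oversubscription rule can be illustrated with a simple capacity check. This sketch assumes a homogeneous allocation in which each limit applies to a known number of identical resources; the types and function names are hypothetical, and the real component decides this while it maps rather than up front.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One limit in a hypothetical, already-parsed form: at most
 * `max_procs` processes on each of `num_resources` resources. */
struct toy_limit {
    int max_procs;
    int num_resources;
};

/* True if `num_procs` cannot be placed without exceeding some limit. */
static bool toy_is_oversubscribed(int num_procs,
                                  const struct toy_limit *limits,
                                  int num_limits)
{
    int i;

    assert(limits != NULL || num_limits == 0);

    for (i = 0; i < num_limits; i++) {
        if (num_procs > limits[i].max_procs * limits[i].num_resources) {
            return true;
        }
    }
    return false;
}

/* Mirrors the rule in the text: an oversubscribed job only continues
 * when oversubscription was explicitly allowed. */
static bool toy_job_may_run(int num_procs, const struct toy_limit *limits,
                            int num_limits, bool allow_oversubscribe)
{
    return allow_oversubscribe ||
           !toy_is_oversubscribed(num_procs, limits, num_limits);
}
```

For example, with "1:s,2:n" on two server nodes that have four sockets each, the limits are {1, 8} and {2, 2}: four processes fit, but a fifth exceeds the per-node limit.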
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads.")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
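A binding string such as "1s" can be split into its integer and token as sketched below. The struct and function names are hypothetical, and the token is only checked for length and for the absence of MPPR-style separators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of a binding string such as "1s". */
struct toy_bind {
    long width;     /* how many of the resource to bind to */
    char token[3];  /* resource token, e.g. "c", "s", "L2" */
};

/* Parse "<integer><token>". Returns 0 on success, -1 on malformed input. */
static int toy_parse_bind(const char *spec, struct toy_bind *out)
{
    char *end;
    long value;
    size_t len;

    assert(spec != NULL && out != NULL);

    value = strtol(spec, &end, 10);
    if (end == spec || value < 1) return -1;

    len = strlen(end);
    if (len < 1 || len > 2) return -1;
    /* A ':' or ',' would make this an MPPR-style string, not a binding. */
    if (strchr(end, ':') != NULL || strchr(end, ',') != NULL) return -1;

    out->width = value;
    memcpy(out->token, end, len + 1);  /* copies the terminating NUL too */
    return 0;
}
```

Here "1s" parses to width 1 and token "s", while the MPPR form "1:s" is rejected.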
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
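The two ordering modes can be compared on the two-socket, four-core example above. This is a toy sketch with hypothetical names: it hard-codes that layout, assumes 8 processes mapped round robin by socket, and reads "sequential" as numbering the cores left to right.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SOCKETS 2
#define TOY_CORES_PER_SOCKET 4

/* Fill ranks[socket][core] for 8 processes mapped round robin by socket.
 * natural != 0: rank follows the mapping order.
 * natural == 0: rank follows the left-to-right hardware layout. */
static void toy_order_ranks(int natural,
                            int ranks[TOY_SOCKETS][TOY_CORES_PER_SOCKET])
{
    int k;

    assert(ranks != NULL);

    for (k = 0; k < TOY_SOCKETS * TOY_CORES_PER_SOCKET; k++) {
        int socket = k % TOY_SOCKETS;  /* mapping alternates between sockets */
        int core = k / TOY_SOCKETS;    /* next free core on that socket */

        if (natural) {
            ranks[socket][core] = k;
        } else {
            ranks[socket][core] = socket * TOY_CORES_PER_SOCKET + core;
        }
    }
}
```

In natural mode the table comes out as [0 2 4 6] [1 3 5 7], matching the example; in sequential mode the same processes are numbered [0 1 2 3] [4 5 6 7].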
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, lama is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
            exit_status = ORTE_ERR_NOT_SUPPORTED;
            goto cleanup;
            break;
        default:
            rmaps_lama_cmd_ordering = strdup("n");
            break;
        }
    }

    /*
     * MPPR
     */
    if( NULL == rmaps_lama_cmd_mppr ) {
        /*
After a lot of pain, I've managed to resolve the problem of conflicting mapping directives caused by mismatched MCA params - i.e., where someone has one variant of an MCA param (e.g., rmaps_base_mapping_policy) in their default MCA param file, and then specifies another variant (e.g., --npernode) on the command line. I can't fully resolve the problem as there is no way to know precisely what the user meant - we can only guess which param was really intended since the MCA param system can't apply its normal precedence rules.
So...print a big "deprecated" warning for the old params and error out if a conflict is detected. I know that isn't what people really wanted, but it's the best we can do. If only the old style param is given, then process it after the warning.
Extend the current map-by param to add support for ppr and cpus-per-proc, adding the latter to the list of allowed modifiers using "pe=n" for processing elements/proc. Thus, you can map-by socket:pe=2,oversubscribe to map by socket, binding 2 processing elements/process, with oversubscription allowed. Or you can map-by ppr:2:socket:pe=4 to map two processes to every socket in the allocation, binding each process to 4 processing elements.
For those wondering, a processing element is defined as a hwthread if --use-hwthreads-as-cpus is given, or else as a core.
Refs trac:4117
This commit was SVN r30620.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-07 21:25:40 +00:00
         * The ppr is given in the map
         */
        if( NULL != jdata->map->ppr) {
            rmaps_lama_cmd_mppr = rmaps_lama_covert_ppr(jdata->map->ppr);
        }
    }

    /*
     * Oversubscription
     */
    if( ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping) ) {
        rmaps_lama_can_oversubscribe = false;
    }
    else {
        rmaps_lama_can_oversubscribe = true;
    }

    /*
     * Display revised values
     */
2013-03-27 21:14:43 +00:00
    opal_output_verbose(5, orte_rmaps_base_framework.framework_output,
                        "mca:rmaps:lama: Revised Parameters -----");
2013-03-27 21:14:43 +00:00
    opal_output_verbose(5, orte_rmaps_base_framework.framework_output,
= Overview =
First revision of the Location Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspired by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity in multiple different ways:
1. None: Specifying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint of heart, but offer a high degree of
(regular pattern) flexibility (each of these is described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every --bind-to/--map-by option in the "Simple" level has a
corresponding precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as a
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
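As a minimal sketch of how such a token sequence could be split apart (this is an illustration in Python, not the LAMA implementation; the names `LAMA_TOKENS` and `tokenize_map` are hypothetical), note that the cache tokens are two characters long and that "N" and "n" differ only by case:

```python
import re

# Resource tokens from the list above. The two-character cache tokens
# (L1, L2, L3) are tried first so that "L1" is never split apart.
# Hypothetical names: this is a sketch, not code from the LAMA component.
LAMA_TOKENS = ("L1", "L2", "L3", "h", "c", "s", "N", "b", "n")
_TOKEN_RE = re.compile("|".join(LAMA_TOKENS))


def tokenize_map(spec):
    """Split a mapping string such as 'csL1L2L3Nbnh' into its tokens."""
    tokens = []
    pos = 0
    while pos < len(spec):
        match = _TOKEN_RE.match(spec, pos)
        if match is None:
            raise ValueError(f"unknown token at offset {pos} in {spec!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


print(tokenize_map("csL1L2L3Nbnh"))
# ['c', 's', 'L1', 'L2', 'L3', 'N', 'b', 'n', 'h']
```

Matching is case-sensitive on purpose, because "N" (NUMA node) and "n" (server node) are different resources.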
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map round robin by core"
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letters can be combined in many, many different ways to
produce many different regular mapping patterns.
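The effect of token order can be sketched with a deliberately simplified model: a single hypothetical node with only three levels (2 sockets, 2 cores per socket, 2 hardware threads per core), no MPPR limits, and the leftmost token iterated fastest. The function names are illustrative only and are not taken from LAMA.

```python
from itertools import islice, product


def traversal(order, counts):
    """Yield resource coordinates in mapping order.

    order  -- tokens, leftmost iterated fastest (e.g. ["c", "s", "h"])
    counts -- how many of each resource exist inside its parent
    """
    # itertools.product varies its LAST argument fastest, so reverse the order.
    slow_to_fast = list(reversed(order))
    for combo in product(*(range(counts[tok]) for tok in slow_to_fast)):
        yield dict(zip(slow_to_fast, combo))


# Hypothetical node: 2 sockets, 2 cores per socket, 2 threads per core.
COUNTS = {"s": 2, "c": 2, "h": 2}


def first_slots(order, n):
    """Return the first n slots as (socket, core, hwthread) tuples."""
    return [(p["s"], p["c"], p["h"]) for p in islice(traversal(order, COUNTS), n)]


# "csh": fill every core first, then move on to the second hardware thread.
print(first_slots(["c", "s", "h"], 5))
# [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1)]

# "sch": alternate between sockets, as with "--map-by socket".
print(first_slots(["s", "c", "h"], 5))
# [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
```

In both cases the fifth process lands on the second hardware thread of the first core, which matches the description above of what happens once the cores are exhausted.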
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional definition is expressed in an MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple cores per process? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in an MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
followed by a colon, followed by a resource token; any of the tokens
from mapping can be used in the MPPR specification. For example, "1:c"
is pronounced "one process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values are then taken
into account when mapping. Here are some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed an MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will repeatedly
cycle through the mapping token string until all processes have been
mapped.
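To make the MPPR rules concrete, here is a small sketch that parses an MPPR string and reports which limits a given placement would exceed. It is an illustration only: `parse_mppr`, `exceeded`, and the `usage` bookkeeping are hypothetical names, not part of LAMA.

```python
def parse_mppr(spec):
    """Parse 'N:token[,N:token...]' into a {token: limit} dict."""
    limits = {}
    for item in spec.split(","):
        count, sep, token = item.strip().partition(":")
        if not sep or not count.isdigit() or not token:
            raise ValueError(f"malformed MPPR item: {item!r}")
        limits[token] = int(count)
    return limits


def exceeded(limits, procs_per_resource):
    """Return the tokens whose busiest resource holds too many processes.

    procs_per_resource maps a token to a list of process counts, one entry
    per resource of that kind (hypothetical bookkeeping for this sketch).
    """
    return [tok for tok, limit in limits.items()
            if max(procs_per_resource.get(tok, [0])) > limit]


limits = parse_mppr("1:s,2:n")
print(limits)  # {'s': 1, 'n': 2}

# One server node with two sockets and three mapped processes:
# the sockets hold 2 and 1 processes, the node holds 3.
usage = {"s": [2, 1], "n": [3]}
print(exceeded(limits, usage))  # ['s', 'n']
```

A non-empty result corresponds to the "oversubscribed" case described above; whether the job then continues or aborts depends on --oversubscribe.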
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads".)
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
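A binding string can be checked with a short sketch like the following, assuming exactly the form described above (an integer immediately followed by one mapping token, with no ":" and no list). The name `parse_bind` is hypothetical.

```python
import re

# An integer directly followed by a single resource token.
# Illustrative only; not taken from the LAMA source.
_BIND_RE = re.compile(r"(\d+)(L1|L2|L3|h|c|s|N|b|n)")


def parse_bind(spec):
    """Parse a binding string such as '1s' into (width, token)."""
    match = _BIND_RE.fullmatch(spec)
    if match is None:
        raise ValueError(f"malformed binding specification: {spec!r}")
    return int(match.group(1)), match.group(2)


print(parse_bind("1s"))   # (1, 's')
print(parse_bind("2L2"))  # (2, 'L2')
```

Using `fullmatch` rejects both the MPPR-style "1:s" and comma-separated lists, reflecting the single-specification rule.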
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes, sequential and natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
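The two orderings can be compared with a small sketch of the example above (two sockets, four cores each, mapped round robin by socket). The natural result reproduces the ranks quoted in the text. The sequential result is our reading of the "single line, left-to-right" description and should be treated as an assumption.

```python
SOCKETS, CORES = 2, 4

# Mapping order for "--map-by socket": the socket index varies fastest.
mapping_order = [(s, c) for c in range(CORES) for s in range(SOCKETS)]

# Natural: a process's rank is its position in the mapping order.
natural = {slot: rank for rank, slot in enumerate(mapping_order)}

# Sequential (assumed reading): ranks follow the hardware laid out in a
# single line, socket 0 cores first, then socket 1 cores.
sequential = {slot: rank for rank, slot in enumerate(sorted(mapping_order))}


def by_socket(ranks):
    """List the MCW ranks on each socket, in core order."""
    return [[ranks[(s, c)] for c in range(CORES)] for s in range(SOCKETS)]


print(by_socket(natural))     # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(by_socket(sequential))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```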
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, lama is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
"mca:rmaps:lama: Map : %s",
rmaps_lama_cmd_map);
2013-03-27 21:14:43 +00:00
opal_output_verbose(5, orte_rmaps_base_framework.framework_output,
2012-08-31 19:57:53 +00:00
"mca:rmaps:lama: Bind : %s",
rmaps_lama_cmd_bind);
2013-03-27 21:14:43 +00:00
opal_output_verbose(5, orte_rmaps_base_framework.framework_output,
= Overview =
First revision of the Locatation Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular of process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspiried by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity multiple different ways:
1. None: Speciying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-to
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint at heart, but offer a high degree of
(regular pattern) flexibility (each of these are described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every
--bind-to/--map-by option in the "Simple" level has a
corresponding
precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as an
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letter can be combined in many, many different ways to
produce many different regular mapping patterns.
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional defintion is expressed in a MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple processes per core? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in a MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
follow by a colon, followed by any of the tokens from mapping can be
used in the MPPR specification. For example "1:c" is pronounced "one
process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values and then taken
into account when mapping. Here's some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string untill all processes have been
mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky; so we only refer to "sets of hardware threads")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
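The two ordering modes can be sketched for the single-node example
above. This is a simplified model, not the LAMA implementation: it
assumes one server node, a round-robin-by-socket mapping, and exactly
one process per core. The names lama_sketch_order and
lama_sketch_rank_layout are hypothetical.

```c
#include <stddef.h>

/* Hypothetical names; they do not appear in the LAMA component. */
enum lama_sketch_order { LAMA_SKETCH_NATURAL, LAMA_SKETCH_SEQUENTIAL };

/*
 * Fill ranks[socket * cores + core] with the MPI_COMM_WORLD rank of the
 * process on that core. Assumes one node, one process per core, and a
 * mapping that goes round robin by socket.
 */
static void lama_sketch_rank_layout(int sockets, int cores,
                                    enum lama_sketch_order order,
                                    int *ranks)
{
    int total = sockets * cores;
    int step, slot;

    /* Natural: ranks follow the order in which mapping visited the cores. */
    for (step = 0; step < total; ++step) {
        int socket = step % sockets;   /* round robin across sockets */
        int core   = step / sockets;   /* next free core on that socket */
        ranks[socket * cores + core] = step;
    }

    /* Sequential: ranks follow the left-to-right hardware layout instead. */
    if (LAMA_SKETCH_SEQUENTIAL == order) {
        for (slot = 0; slot < total; ++slot) {
            ranks[slot] = slot;
        }
    }
}
```

With two sockets of four cores each, the natural mode reproduces the
[0 2 4 6] [1 3 5 7] layout from the example, and under the same
assumptions the sequential mode gives [0 1 2 3] [4 5 6 7].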
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, LAMA is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA
considers each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
                        "mca:rmaps:lama: MPPR : %s",
                        rmaps_lama_cmd_mppr);
    opal_output_verbose(5, orte_rmaps_base_framework.framework_output,
                        "mca:rmaps:lama: Order : %s",
                        rmaps_lama_cmd_ordering);

 cleanup:
    return exit_status;
}

static char * rmaps_lama_covert_ppr(char * given_ppr)
{
    return strdup(given_ppr);
}

int rmaps_lama_parse_mapping(char *layout,
                             rmaps_lama_level_type_t **layout_types,
                             rmaps_lama_level_type_t **layout_types_sorted,
                             int *num_types)
{
    int exit_status = ORTE_SUCCESS;
    char param[3];
    int i, j, len;
    bool found_req_param_n = false;
    bool found_req_param_h = false;
    bool found_req_param_bind = false;

    /*
     * Sanity Check:
     * There is no default layout, so if we get here and nothing is specified
     * then this is an error.
     */
    if( NULL == layout ) {
        orte_show_help("help-orte-rmaps-lama.txt",
                       "internal error",
                       true,
                       "rmaps_lama_parse_mapping",
                       "internal error 1");
        return ORTE_ERROR;
    }

    *num_types = 0;

    /*
     * Extract and convert all the keys
     */
    len = strlen(layout);
    for(i = 0; i < len; ++i) {
        /*
         * L1 : L1 Cache
         * L2 : L2 Cache
         * L3 : L3 Cache
         */
        if( layout[i] == 'L' ) {
            param[0] = layout[i];
            ++i;
            /*
             * Check for 2 characters
             */
            if( i >= len ) {
                orte_show_help("help-orte-rmaps-lama.txt",
                               "invalid mapping option",
                               true,
                               layout, "cache level missing number");
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string untill all processes have been
mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky; so we only refer to "sets of hardware threads")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, lama is not the default mapper. It must be
activiated with "--mca rmaps lama". We'll continue to do further
testing and comparitive analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
                exit_status = ORTE_ERROR;
                goto cleanup;
            }
            param[1] = layout[i];
            param[2] = '\0';
        }
        /*
         * n : Machine
         * b : Board
         * s : Socket
         * c : Core
         * h : Hardware Thread
         * N : NUMA Node
         */
        else {
            param[0] = layout[i];
            param[1] = '\0';
        }

        /*
         * Append level
         */
        *num_types += 1;
        *layout_types = (rmaps_lama_level_type_t*)realloc(*layout_types, sizeof(rmaps_lama_level_type_t) * (*num_types));
        (*layout_types)[(*num_types)-1] = lama_type_str_to_enum(param);
    }

    /*
     * Check for duplicates and unknowns
     * Copy to sorted list
     */
    *layout_types_sorted = (rmaps_lama_level_type_t*)malloc(sizeof(rmaps_lama_level_type_t) * (*num_types));
    for( i = 0; i < *num_types; ++i ) {
        /*
         * Copy for later sorting
         */
        (*layout_types_sorted)[i] = (*layout_types)[i];

        /*
         * Look for unknown and unsupported options
         */
        if( LAMA_LEVEL_UNKNOWN <= (*layout_types)[i] ) {
            char *msg;
            asprintf(&msg, "unknown mapping level at position %d", i + 1);
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid mapping option",
                           true,
                           layout, msg);
            free(msg);
            exit_status = ORTE_ERROR;
            goto cleanup;
        }

        if( LAMA_LEVEL_MACHINE == (*layout_types)[i] ) {
            found_req_param_n = true;
        }

        if( LAMA_LEVEL_PU == (*layout_types)[i] ) {
            found_req_param_h = true;
        }

        if( lama_binding_level == (*layout_types)[i] ) {
            found_req_param_bind = true;
        }

        /*
         * Look for duplicates
         */
        for( j = i+1; j < *num_types; ++j ) {
            if( (*layout_types)[i] == (*layout_types)[j] ) {
                char *msg;
                asprintf(&msg, "duplicate mapping levels at position %d and %d",
                         i + 1, j + 1);
                orte_show_help("help-orte-rmaps-lama.txt",
                               "invalid mapping option",
                               true,
                               layout, msg);
                free(msg);
                exit_status = ORTE_ERROR;
                goto cleanup;
            }
        }
    }

    /*
     * The user is required to specify at least the:
     * - machine
     * - hardware thread (needed for lower bound binding) JJH: We should be able to lift this...
     * - binding layer (need it to stride the mapping)
     * Only print the error message once, for brevity.
     */
    if( !found_req_param_n ) {
        char *msg;
        asprintf(&msg, "missing required 'n' mapping token");
        orte_show_help("help-orte-rmaps-lama.txt",
                       "invalid mapping option",
                       true,
                       layout, msg);
        free(msg);
        exit_status = ORTE_ERROR;
        goto cleanup;
    }
    else if(!found_req_param_h) {
        char *msg;
        asprintf(&msg, "missing required 'h' mapping token");
        orte_show_help("help-orte-rmaps-lama.txt",
                       "invalid mapping option",
                       true,
                       layout, msg);
        free(msg);
= Overview =
First revision of the Locatation Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular of process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspiried by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity multiple different ways:
1. None: Speciying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-to
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint at heart, but offer a high degree of
(regular pattern) flexibility (each of these are described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every
--bind-to/--map-by option in the "Simple" level has a
corresponding
precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as an
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letter can be combined in many, many different ways to
produce many different regular mapping patterns.
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional defintion is expressed in a MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple processes per core? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in a MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
follow by a colon, followed by any of the tokens from mapping can be
used in the MPPR specification. For example "1:c" is pronounced "one
process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values and then taken
into account when mapping. Here's some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string untill all processes have been
mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky; so we only refer to "sets of hardware threads")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
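To make the binding syntax concrete, here is a simplified, standalone sketch that splits a binding string into its width and level token. It mirrors only the rules stated above (digits first, then a single token, no ":"); the function name `parse_binding` and the `BIND_MAX_DIGITS` bound are assumptions made for illustration, not the values used by LAMA.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define BIND_MAX_DIGITS 4  /* assumed bound, for illustration only */
#define BIND_MAX_TOKEN  2  /* longest token, e.g. "L2" */

/*
 * Split a binding string such as "1s" or "2L2" into a width and a
 * level token. Returns 0 on success and -1 on malformed input.
 */
int parse_binding(const char *spec, int *width, char level[BIND_MAX_TOKEN + 1])
{
    size_t i = 0;
    size_t tok_len;
    int value = 0;

    assert(spec != NULL && width != NULL && level != NULL);

    /* Digits must come first, and there must be at least one. */
    while (isdigit((unsigned char)spec[i])) {
        if (i >= BIND_MAX_DIGITS) {
            return -1;  /* too many digits */
        }
        value = value * 10 + (spec[i] - '0');
        ++i;
    }
    if (i == 0 || value == 0) {
        return -1;
    }

    /* The rest is the level token; it must start with a letter,
     * which also rejects an MPPR-style ':' separator. */
    tok_len = strlen(spec + i);
    if (tok_len == 0 || tok_len > BIND_MAX_TOKEN) {
        return -1;
    }
    if (!isalpha((unsigned char)spec[i])) {
        return -1;
    }

    memcpy(level, spec + i, tok_len + 1);  /* includes the terminator */
    *width = value;
    return 0;
}
```

Here "1s" parses to width 1 and level "s", whereas "1:s" is rejected because the colon belongs only to the MPPR syntax.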
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
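The two ordering modes can be sketched for the example above (one server node, two sockets, four cores per socket, eight processes mapped round robin by socket). This is an illustrative model with hypothetical function names, not the LAMA implementation; it assumes every core holds exactly one process.

```c
#include <assert.h>

#define NUM_SOCKETS      2
#define CORES_PER_SOCKET 4
#define NUM_PROCS        (NUM_SOCKETS * CORES_PER_SOCKET)

/*
 * Natural ordering: ranks follow the mapping order. Mapping by socket
 * visits socket 0, socket 1, then moves on to the next core of each.
 */
void order_natural(int ranks[NUM_SOCKETS][CORES_PER_SOCKET])
{
    int rank = 0;
    int core, socket;

    for (core = 0; core < CORES_PER_SOCKET; ++core) {
        for (socket = 0; socket < NUM_SOCKETS; ++socket) {
            ranks[socket][core] = rank++;
        }
    }
    assert(rank == NUM_PROCS);
}

/*
 * Sequential ordering: lay the cores out in a single line and number
 * the processes on them from left to right.
 */
void order_sequential(int ranks[NUM_SOCKETS][CORES_PER_SOCKET])
{
    int rank = 0;
    int core, socket;

    for (socket = 0; socket < NUM_SOCKETS; ++socket) {
        for (core = 0; core < CORES_PER_SOCKET; ++core) {
            ranks[socket][core] = rank++;
        }
    }
    assert(rank == NUM_PROCS);
}
```

Under these assumptions, `order_natural` fills the sockets as [0 2 4 6] [1 3 5 7], matching the example above, and `order_sequential` fills them as [0 1 2 3] [4 5 6 7].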
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, LAMA is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
        exit_status = ORTE_ERROR;
        goto cleanup;
2013-05-10 15:06:25 +00:00
    } else if (!found_req_param_bind) {
        char *msg;
        asprintf(&msg, "missing required mapping token for the current binding level");
        orte_show_help("help-orte-rmaps-lama.txt",
                       "invalid mapping option",
                       true,
                       layout, msg);
        free(msg);
        exit_status = ORTE_ERROR;
        goto cleanup;
    }

    /*
     * Sort the items
     */
    qsort((*layout_types_sorted ), (*num_types), sizeof(int), lama_parse_int_sort);

 cleanup:
    return exit_status;
}

int rmaps_lama_parse_binding(char *layout, rmaps_lama_level_type_t *binding_level, int *num_types)
{
    int exit_status = ORTE_SUCCESS;
    char param[3];
    char num[MAX_BIND_DIGIT_LEN];
    int i, n, p, len;

    /*
     * Default: If nothing specified
     * - Bind to machine
     */
    if( NULL == layout ) {
        *binding_level = LAMA_LEVEL_MACHINE;
        *num_types = 1;
        return ORTE_SUCCESS;
    }

    *num_types = 0;

    /*
     * Extract and convert all the keys
     */
    len = strlen(layout);
    n = 0;
    p = 0;
    for(i = 0; i < len; ++i) {
        /*
         * Must start with a digit
         */
        if( isdigit(layout[i]) ) {
            /*
             * Check: Digits must come first
             */
            if( p != 0 ) {
2013-05-10 15:06:25 +00:00
                orte_show_help("help-orte-rmaps-lama.txt",
                               "invalid binding option",
                               true,
                               layout, "missing digit(s) before binding level token");
                exit_status = ORTE_ERROR;
                goto cleanup;
            }

            num[n] = layout[i];
            ++n;
            /*
             * Check: Exceed bound of number of digits
             */
            if( n >= MAX_BIND_DIGIT_LEN ) {
2013-05-10 15:06:25 +00:00
                orte_show_help("help-orte-rmaps-lama.txt",
                               "invalid binding option",
                               true,
                               layout, "too many digits");
= Overview =
First revision of the Location Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspired by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity in multiple different ways:
1. None: Specifying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint of heart, but offer a high degree of
(regular pattern) flexibility (each of these is described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every --bind-to/--map-by option in the "Simple" level has a
corresponding precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as a
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letters can be combined in many, many different ways to
produce many different regular mapping patterns.
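Because the cache tokens (L1, L2, L3) are two characters long while the others are one, a mapping string has to be split into tokens before it can be traversed. The following is a rough sketch of such a splitter, assuming the nine tokens listed earlier; the function name `split_layout` and its return convention are hypothetical and are not taken from the LAMA source.

```c
#include <stddef.h>
#include <string.h>

/* Split a layout string such as "csL1L2L3Nbnh" into NUL-terminated tokens.
 * Returns the number of tokens written to tokens[], or -1 if the string
 * contains an unknown token, a truncated "L", or more than max_tokens tokens. */
int split_layout(const char *layout, char tokens[][3], int max_tokens)
{
    int count = 0;
    size_t len = strlen(layout);

    for (size_t i = 0; i < len; ++i) {
        if (count >= max_tokens) {
            return -1;
        }
        if (layout[i] == 'L') {
            /* Cache levels need a second character: a digit from 1 to 3. */
            if (i + 1 >= len || layout[i + 1] < '1' || layout[i + 1] > '3') {
                return -1;
            }
            tokens[count][0] = 'L';
            tokens[count][1] = layout[i + 1];
            tokens[count][2] = '\0';
            ++i; /* consume the digit as well */
        } else if (strchr("hcsNbn", layout[i]) != NULL) {
            /* Single-character tokens; note that 'N' and 'n' are distinct. */
            tokens[count][0] = layout[i];
            tokens[count][1] = '\0';
        } else {
            return -1;
        }
        ++count;
    }
    return count;
}
```

For example, splitting "csL1L2L3Nbnh" with this sketch yields nine tokens, starting with "c" and ending with "h".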
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional definition is expressed in an MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple cores per process? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in an MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
followed by a colon, followed by any one of the mapping tokens listed
above. For example, "1:c" is pronounced "one process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values are then taken
into account when mapping. Here are some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
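The "integer, colon, token" form with comma-separated entries can be parsed along the following lines. This is a sketch under stated assumptions: the struct `mppr_limit` and the function `parse_mppr` are hypothetical names, and the token itself is only checked for length, not against the list of valid levels.

```c
#include <stdlib.h>
#include <string.h>

/* One "<count>:<token>" entry of an MPPR string (hypothetical type). */
struct mppr_limit {
    long max_procs; /* maximum number of processes per resource */
    char level[3];  /* resource token, e.g. "c", "s", "L2", "n" */
};

/* Parse a string such as "1:s,2:n" into out[].
 * Returns the number of entries, or -1 on a malformed string. */
int parse_mppr(const char *spec, struct mppr_limit *out, int max_out)
{
    int count = 0;
    const char *p = spec;

    while (*p != '\0') {
        char *end;
        long n;
        size_t tok_len;

        if (count >= max_out || *p < '0' || *p > '9') {
            return -1; /* too many entries, or no leading digit */
        }
        n = strtol(p, &end, 10);
        if (n <= 0 || *end != ':') {
            return -1; /* the count must be positive and followed by ':' */
        }
        p = end + 1;
        tok_len = strcspn(p, ",");
        if (tok_len == 0 || tok_len > 2) {
            return -1; /* tokens are one or two characters long */
        }
        memcpy(out[count].level, p, tok_len);
        out[count].level[tok_len] = '\0';
        out[count].max_procs = n;
        ++count;
        p += tok_len;
        if (*p == ',') {
            ++p;
            if (*p == '\0') {
                return -1; /* trailing comma */
            }
        }
    }
    return count;
}
```

With this sketch, "1:s,2:n" parses into two entries: at most 1 process per "s" and at most 2 processes per "n".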
If mapping all processes to resources would exceed an MPPR limit, the
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will cycle
through the mapping token string repeatedly until all processes have
been mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads.")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
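The two rules stated above (an integer comes first, and exactly one resource token follows it) can be checked with a small parser. The sketch below is illustrative only: `parse_binding` is a hypothetical name and does not reproduce the component's actual parsing code.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse a binding string such as "1s" or "2L2".
 * On success, stores the integer in *width and the token in level[]
 * and returns 0. Returns -1 if the digits are missing or if anything
 * other than exactly one known token follows them. */
int parse_binding(const char *spec, long *width, char level[3])
{
    char *end;
    size_t tok_len;

    if (!isdigit((unsigned char)spec[0])) {
        return -1; /* digits must come first */
    }
    *width = strtol(spec, &end, 10);
    if (*width <= 0) {
        return -1;
    }
    tok_len = strlen(end);
    if (tok_len == 1 && strchr("hcsNbn", end[0]) != NULL) {
        /* single-character level */
    } else if (tok_len == 2 && end[0] == 'L' && end[1] >= '1' && end[1] <= '3') {
        /* cache level: L1, L2 or L3 */
    } else {
        return -1; /* unknown token, or more than one level given */
    }
    memcpy(level, end, tok_len + 1); /* copies the terminating NUL too */
    return 0;
}
```

Under these assumptions "1s" is accepted (width 1, level "s"), while "s" (no digits) and "1sc" (two levels) are rejected.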
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
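The difference between the two modes can be reproduced for the example above (two sockets of four cores, eight processes mapped round robin by socket and bound to cores). This is a sketch of one reading of the two definitions; the names and the fixed sizes are assumptions.

```c
/* Example machine from the text: 2 sockets, 4 cores per socket, 8 processes. */
enum { ORD_SOCKETS = 2, ORD_CORES = 4, ORD_PROCS = 8 };

/* Fill natural[s][c] and sequential[s][c] with the MPI_COMM_WORLD rank of
 * the process placed on core c of socket s. */
void order_ranks(int natural[ORD_SOCKETS][ORD_CORES],
                 int sequential[ORD_SOCKETS][ORD_CORES])
{
    for (int i = 0; i < ORD_PROCS; ++i) {
        /* Mapping by socket: process i alternates between the sockets. */
        int socket = i % ORD_SOCKETS;
        int core = i / ORD_SOCKETS;

        /* Natural: the rank is the position in the mapping order. */
        natural[socket][core] = i;

        /* Sequential: the rank is the position when all cores are laid
         * out in one line, socket 0 first. */
        sequential[socket][core] = socket * ORD_CORES + core;
    }
}
```

The natural ordering gives [0 2 4 6] [1 3 5 7], matching the text; under this reading, the sequential ordering gives [0 1 2 3] [4 5 6 7].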
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, LAMA is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
            exit_status = ORTE_ERROR;
            goto cleanup;
        }
    }
    /*
     * Extract the level
     */
    else {
        /*
         * Check: Digits must come first
         */
        if( n == 0 ) {
2013-05-10 15:06:25 +00:00
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid binding option",
                           true,
                           layout, "missing digit(s) before binding level token");
            exit_status = ORTE_ERROR;
            goto cleanup;
        }
        /*
         * Check: Only one level allowed
         */
        if( p != 0 ) {
2013-05-10 15:06:25 +00:00
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid binding option",
                           true,
                           layout, "only one binding level may be specified");
            exit_status = ORTE_ERROR;
            goto cleanup;
        }

        /*
         * L1 : L1 Cache
         * L2 : L2 Cache
         * L3 : L3 Cache
         */
        if( layout[i] == 'L' ) {
            param[0] = layout[i];
            ++i;
            /*
             * Check for 2 characters
             */
            if( i >= len ) {
2013-05-10 15:06:25 +00:00
                orte_show_help("help-orte-rmaps-lama.txt",
                               "invalid binding option",
                               true,
                               layout, "only one binding level may be specified");
= Overview =
First revision of the Location Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspired by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity in multiple ways:
1. None: Specifying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint of heart, but offer a high degree of
(regular pattern) flexibility (each of these is described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every
--bind-to/--map-by option in the "Simple" level has a
corresponding
precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as an
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letter can be combined in many, many different ways to
produce many different regular mapping patterns.
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional defintion is expressed in a MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple processes per core? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in a MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
follow by a colon, followed by any of the tokens from mapping can be
used in the MPPR specification. For example "1:c" is pronounced "one
process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values and then taken
into account when mapping. Here's some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string untill all processes have been
mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky; so we only refer to "sets of hardware threads")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, lama is not the default mapper. It must be
activiated with "--mca rmaps lama". We'll continue to do further
testing and comparitive analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
|
|
|
                    exit_status = ORTE_ERROR;
                    goto cleanup;
                }
                param[1] = layout[i];
                p = 2;
            }
            /*
             * n : Machine
             * b : Board
             * s : Socket
             * c : Core
             * h : Hardware Thread
             * N : NUMA Node
             */
            else {
                param[0] = layout[i];
                p = 1;
            }
            param[p] = '\0';
        }
    }
    /*
     * Check that the level was specified
     */
    if( p == 0 ) {
        orte_show_help("help-orte-rmaps-lama.txt",
                       "invalid binding option",
                       true,
                       layout, "binding specification is empty");
        exit_status = ORTE_ERROR;
        goto cleanup;
    }
    num[n] = '\0';

    *binding_level = lama_type_str_to_enum(param);
    *num_types = atoi(num);

    /*
     * Check for unknown level
     */
    if( LAMA_LEVEL_UNKNOWN <= *binding_level ) {
        orte_show_help("help-orte-rmaps-lama.txt",
                       "invalid binding option",
                       true,
                       layout, "unknown binding level");
= Overview =
First revision of the Locatation Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular of process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspiried by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity multiple different ways:
1. None: Speciying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-to
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint at heart, but offer a high degree of
(regular pattern) flexibility (each of these are described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every
--bind-to/--map-by option in the "Simple" level has a
corresponding
precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as an
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letter can be combined in many, many different ways to
produce many different regular mapping patterns.
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional defintion is expressed in a MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple processes per core? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in a MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
follow by a colon, followed by any of the tokens from mapping can be
used in the MPPR specification. For example "1:c" is pronounced "one
process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values and then taken
into account when mapping. Here's some examples:
        exit_status = ORTE_ERROR;
        goto cleanup;
    }

 cleanup:
    return exit_status;
}

int rmaps_lama_parse_mppr(char *layout, rmaps_lama_level_info_t **mppr_levels, int *num_types)
{
    int exit_status = ORTE_SUCCESS;
    char param[3];
    char num[MAX_BIND_DIGIT_LEN];
    char **argv = NULL;
    int argc = 0;
    int i, j, len;
    int p, n;

    /*
     * Default: Unrestricted allocation
     * 'oversubscribe' flag accounted for elsewhere
     */
    if( NULL == layout ) {
        *mppr_levels = NULL;
        *num_types = 0;
        return ORTE_SUCCESS;
    }

    *num_types = 0;

    /*
     * Split by ','
     * <#:level>,<#:level>,...
     */
    argv = opal_argv_split(layout, ',');
    argc = opal_argv_count(argv);
    for(j = 0; j < argc; ++j) {
        /*
         * Parse <#:level>
         */
        len = strlen(argv[j]);
        n = 0;
        p = 0;
        for(i = 0; i < len; ++i) {
            /*
             * Skip the ':' separator and whitespace
             */
            if( argv[j][i] == ':' || isblank(argv[j][i])) {
                continue;
            }
            /*
             * Must start with a digit
             */
            else if( isdigit(argv[j][i]) ) {
                /*
                 * Check: Digits must come first
                 */
                if( p != 0 ) {
                    orte_show_help("help-orte-rmaps-lama.txt",
                                   "invalid mppr option",
                                   true,
                                   layout, "missing digit(s) before resource specification");
                    exit_status = ORTE_ERROR;
                    goto cleanup;
                }

                num[n] = argv[j][i];
                ++n;
                /*
                 * Check: Exceed bound of number of digits
                 */
                if( n >= MAX_BIND_DIGIT_LEN ) {
                    orte_show_help("help-orte-rmaps-lama.txt",
                                   "invalid mppr option",
                                   true,
                                   layout, "too many digits");
                    exit_status = ORTE_ERROR;
                    goto cleanup;
                }
            }
            /*
             * Extract the level
             */
            else {
                /*
                 * Check: Digits must come first
                 */
                if( n == 0 ) {
                    orte_show_help("help-orte-rmaps-lama.txt",
                                   "invalid mppr option",
                                   true,
                                   layout, "missing digit(s) before resource specification");
multiple processes per core? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in a MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
follow by a colon, followed by any of the tokens from mapping can be
used in the MPPR specification. For example "1:c" is pronounced "one
process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values and then taken
into account when mapping. Here's some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string untill all processes have been
mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky; so we only refer to "sets of hardware threads")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, lama is not the default mapper. It must be
activiated with "--mca rmaps lama". We'll continue to do further
testing and comparitive analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
            exit_status = ORTE_ERROR;
            goto cleanup;
        }
        /*
         * Check: Only one level allowed
         */
        if( p != 0 ) {
2013-05-10 15:06:25 +00:00
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid mppr option",
                           true,
                           layout, "only one resource type may be listed per specification");
2012-08-31 19:57:53 +00:00
            exit_status = ORTE_ERROR;
            goto cleanup;
        }

        /*
         * L1 : L1 Cache
         * L2 : L2 Cache
         * L3 : L3 Cache
         */
        if( argv[j][i] == 'L' ) {
            param[0] = argv[j][i];
            ++i;
            /*
             * Check for 2 characters
             */
            if( i >= len ) {
2013-05-10 15:06:25 +00:00
                orte_show_help("help-orte-rmaps-lama.txt",
                               "invalid mppr option",
                               true,
                               layout, "cache level missing number");
2012-08-31 19:57:53 +00:00
                exit_status = ORTE_ERROR;
                goto cleanup;
            }
            param[1] = argv[j][i];
            p = 2;
        }
        /*
         * n : Machine
         * b : Board
         * s : Socket
         * c : Core
         * h : Hardware Thread
         * N : NUMA Node
         */
        else {
            param[0] = argv[j][i];
            p = 1;
        }
        param[p] = '\0';
    }
}

/*
 * Whitespace, just skip
 */
if( n == 0 && p == 0 ) {
    continue;
}

/*
 * Check that the level was specified
 */
if( p == 0 ) {
2013-05-10 15:06:25 +00:00
    orte_show_help("help-orte-rmaps-lama.txt",
                   "invalid mppr option",
                   true,
                   layout, "resource type not specified");
            exit_status = ORTE_ERROR;
            goto cleanup;
        }
        num[n] = '\0';

        /*
         * Append level
         */
        *num_types += 1;
        *mppr_levels = (rmaps_lama_level_info_t*)realloc(*mppr_levels, sizeof(rmaps_lama_level_info_t) * (*num_types));
        (*mppr_levels)[(*num_types)-1].type = lama_type_str_to_enum(param);
        (*mppr_levels)[(*num_types)-1].max_resources = atoi(num);

    }

    /*
     * Check for duplicates and unknowns
     */
    for( i = 0; i < *num_types; ++i ) {
        /*
         * Look for unknown and unsupported options
         */
        if( LAMA_LEVEL_UNKNOWN <= (*mppr_levels)[i].type ) {
            char *msg;
            asprintf(&msg, "unknown resource type at position %d", i + 1);
            orte_show_help("help-orte-rmaps-lama.txt",
                           "invalid mppr option",
                           true,
                           layout, msg);
            free(msg);
            exit_status = ORTE_ERROR;
            goto cleanup;
        }

        /*
         * Look for duplicates
         */
        for( j = i+1; j < *num_types; ++j ) {
            if( (*mppr_levels)[i].type == (*mppr_levels)[j].type ) {
                char *msg;
                asprintf(&msg, "duplicate resource types at position %d and %d",
                         i + 1, j + 1);
                orte_show_help("help-orte-rmaps-lama.txt",
                               "invalid mppr option",
                               true,
                               layout, msg);
                free(msg);
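The checks performed in this excerpt (a resource type must follow each count, the type must be known, and no type may appear twice) can be illustrated in isolation. The following is a minimal, self-contained sketch and not the ORTE implementation: every `sk_` name, the error codes, and the fixed-size output array are assumptions made for illustration. Only the token set (h, c, s, L1, L2, L3, N, b, n) and the comma-separated "<count>:<level>" form come from the commit message. Unlike the original, the sketch does not skip whitespace and does not call `orte_show_help`.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical level enum for this sketch (not the ORTE enum). */
typedef enum {
    SK_HWTHREAD, SK_CORE, SK_SOCKET, SK_L1, SK_L2, SK_L3,
    SK_NUMA, SK_BOARD, SK_NODE, SK_UNKNOWN
} sk_level_t;

/* One parsed MPPR entry: at most max_procs processes per level. */
typedef struct {
    sk_level_t level;
    int max_procs;
} sk_mppr_t;

/* Hypothetical error codes; the parser returns one of these on failure. */
enum {
    SK_ERR_SYNTAX    = -1,  /* missing count, colon, or resource type */
    SK_ERR_UNKNOWN   = -2,  /* resource type token not recognized */
    SK_ERR_DUPLICATE = -3,  /* same resource type listed twice */
    SK_ERR_TOO_MANY  = -4   /* more entries than the output array holds */
};

/* Map a token of the given length to a level; tokens are case sensitive. */
static sk_level_t sk_level_from_token(const char *tok, size_t len)
{
    static const struct { const char *name; sk_level_t level; } table[] = {
        {"h", SK_HWTHREAD}, {"c", SK_CORE}, {"s", SK_SOCKET},
        {"L1", SK_L1}, {"L2", SK_L2}, {"L3", SK_L3},
        {"N", SK_NUMA}, {"b", SK_BOARD}, {"n", SK_NODE}
    };
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); ++i) {
        if (strlen(table[i].name) == len &&
            0 == strncmp(table[i].name, tok, len)) {
            return table[i].level;
        }
    }
    return SK_UNKNOWN;
}

/*
 * Parse a string such as "1:s,2:n" into out[0 .. max_out-1].
 * Returns the number of entries parsed, or a negative SK_ERR_* code.
 */
int sk_parse_mppr(const char *spec, sk_mppr_t *out, int max_out)
{
    int count = 0;
    const char *p = spec;

    while ('\0' != *p) {
        char *end;
        long n;
        size_t len;
        int i;

        /* Each entry starts with an integer count followed by ':' */
        if (!isdigit((unsigned char)*p)) {
            return SK_ERR_SYNTAX;
        }
        n = strtol(p, &end, 10);
        if (':' != *end) {
            return SK_ERR_SYNTAX;
        }
        p = end + 1;

        /* The resource type runs up to the next comma or end of string */
        len = strcspn(p, ",");
        if (0 == len) {
            return SK_ERR_SYNTAX;          /* resource type not specified */
        }
        if (count >= max_out) {
            return SK_ERR_TOO_MANY;
        }
        out[count].level = sk_level_from_token(p, len);
        if (SK_UNKNOWN == out[count].level) {
            return SK_ERR_UNKNOWN;
        }
        out[count].max_procs = (int)n;

        /* Reject a resource type that was already listed */
        for (i = 0; i < count; ++i) {
            if (out[i].level == out[count].level) {
                return SK_ERR_DUPLICATE;
            }
        }
        ++count;

        p += len;
        if (',' == *p) {
            ++p;
            if ('\0' == *p) {
                return SK_ERR_SYNTAX;      /* trailing comma */
            }
        }
    }
    return count;
}
```

A caller would pass the MPPR string and a small array, for example `sk_mppr_t limits[4]; int n = sk_parse_mppr("1:s,2:n", limits, 4);`, and treat a negative return value as a rejected specification.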
= Overview =
First revision of the Locatation Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular of process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspiried by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity multiple different ways:
1. None: Speciying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-to
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint at heart, but offer a high degree of
(regular pattern) flexibility (each of these are described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every
--bind-to/--map-by option in the "Simple" level has a
corresponding
precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as an
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letter can be combined in many, many different ways to
produce many different regular mapping patterns.
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional defintion is expressed in a MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple processes per core? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in a MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
follow by a colon, followed by any of the tokens from mapping can be
used in the MPPR specification. For example "1:c" is pronounced "one
process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values and then taken
into account when mapping. Here's some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string untill all processes have been
mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads.")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
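The binding string format (an integer immediately followed by one mapping token, with no ":") can be sketched as follows. This is only an illustration under stated assumptions: `parse_binding` is a hypothetical helper, not a function in the LAMA component, and rejecting counts below 1 is a choice made for this sketch.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a binding string such as "1s" or "2L2" into a count and a level
 * token. Returns 0 on success, -1 on a malformed string.
 */
int parse_binding(const char *spec, int *count, char level[3])
{
    /* The nine mapping tokens listed in the Mapping section. */
    static const char *tokens[] = { "h", "c", "s", "L1", "L2", "L3", "N", "b", "n" };
    char *end;
    long value = strtol(spec, &end, 10);
    size_t i;

    if (end == spec || value < 1 || value > INT_MAX) {
        return -1;
    }
    /* Whatever follows the integer must be exactly one known token. */
    for (i = 0; i < sizeof(tokens) / sizeof(tokens[0]); ++i) {
        if (0 == strcmp(end, tokens[i])) {
            *count = (int)value;
            strcpy(level, tokens[i]);
            return 0;
        }
    }
    return -1;
}
```

Here "1s" parses as one processor socket, while the MPPR-style "1:s" is rejected because the text after the integer is not a bare token.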
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
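The difference between the two ordering modes can be sketched for the two-socket, four-core example above. This is a toy model rather than LAMA's implementation: the fixed topology constants and the helpers `map_by_socket` and `reorder_sequential` are hypothetical names used only here.

```c
/* Toy topology from the example: two sockets, four cores each. */
enum { NUM_SOCKETS = 2, CORES_PER_SOCKET = 4 };

/*
 * Map processes round robin by socket and record, for each core, the
 * order in which a process landed there. This mapping order is the
 * "natural" rank. Unused cores hold -1.
 */
void map_by_socket(int num_procs, int rank_at[NUM_SOCKETS][CORES_PER_SOCKET])
{
    int s, c, p;

    for (s = 0; s < NUM_SOCKETS; ++s) {
        for (c = 0; c < CORES_PER_SOCKET; ++c) {
            rank_at[s][c] = -1;
        }
    }
    for (p = 0; p < num_procs && p < NUM_SOCKETS * CORES_PER_SOCKET; ++p) {
        rank_at[p % NUM_SOCKETS][p / NUM_SOCKETS] = p;
    }
}

/*
 * Renumber the occupied cores from left to right across the hardware,
 * which is the "sequential" ordering described above.
 */
void reorder_sequential(int rank_at[NUM_SOCKETS][CORES_PER_SOCKET])
{
    int s, c, next = 0;

    for (s = 0; s < NUM_SOCKETS; ++s) {
        for (c = 0; c < CORES_PER_SOCKET; ++c) {
            if (rank_at[s][c] >= 0) {
                rank_at[s][c] = next++;
            }
        }
    }
}
```

For eight processes, `map_by_socket` reproduces the natural layout [0 2 4 6] [1 3 5 7]; applying `reorder_sequential` afterwards gives [0 1 2 3] [4 5 6 7] in this model.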
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, LAMA is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA
considers each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
                exit_status = ORTE_ERROR;
                goto cleanup;
            }
        }
    }

 cleanup:
    if( NULL != argv ) {
        opal_argv_free(argv);
        argv = NULL;
    }

    return exit_status;
}

int rmaps_lama_parse_ordering(char *layout,
                              rmaps_lama_order_type_t *order)
{
    /*
     * Default: Natural ordering
     */
    if( NULL == layout ) {
        *order = LAMA_ORDER_NATURAL;
        return ORTE_SUCCESS;
    }

    /*
     * Sequential Ordering
     */
    if( 0 == strncmp(layout, "s", strlen("s")) ||
        0 == strncmp(layout, "S", strlen("S")) ) {
        *order = LAMA_ORDER_SEQ;
    }
    /*
     * Natural Ordering
     */
    else if( 0 == strncmp(layout, "n", strlen("n")) ||
             0 == strncmp(layout, "N", strlen("N")) ) {
        *order = LAMA_ORDER_NATURAL;
    }
    /*
     * Check for unknown options
     */
    else {
2013-05-10 15:06:25 +00:00
        orte_show_help("help-orte-rmaps-lama.txt",
                       "invalid ordering option",
                       true,
                       "unsupported ordering option", layout);
        return ORTE_ERROR;
    }

    return ORTE_SUCCESS;
}

2013-05-16 00:47:37 +00:00
bool rmaps_lama_ok_to_prune_level(rmaps_lama_level_type_t level)
{
    int i;

    for( i = 0; i < lama_mapping_num_layouts; ++i ) {
        if( level == lama_mapping_layout[i] ) {
            return false;
        }
    }

    return true;
}

/*********************************
 * Support Functions
 *********************************/
static int lama_parse_int_sort(const void *a, const void *b) {
    int left = *((int*)a);
    int right = *((int*)b);

    if( left < right ) {
        return -1;
    }
    else if( left > right ) {
        return 1;
    }
    else {
        return 0;
    }
}