openmpi/orte/mca/rmaps/lama/Makefile.am

= Overview =

First revision of the Locality Aware Mapping Algorithm (LAMA) RMAPS component. This component is used to effect many different types of regular process/processor affinity patterns. Although quite flexible in the patterns that it provides, it is ''not'' a fully-arbitrary, rankfile-like solution for process/processor affinity.

Inspired by !BlueGene-like network specifications, LAMA has a core algorithm that is quite good at specifying regular patterns in multiple "dimensions" (where "dimensions" are expressed in terms of different hardware elements: processor hardware threads, cores, sockets, etc.). The LAMA core algorithm is described here: http://www.open-mpi.org/papers/cluster-2011-lama/

= LAMA Usage Levels =

LAMA allows specifying affinity in multiple different ways:

 1. None: Specifying no affinity options to mpirun results in exactly the same behavior as today: no affinity is used.
 1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by <LEVEL>" to indicate how "wide" each process should be bound (i.e., bind to a processor core, or to a processor socket, etc.) and how to lay out the processes (i.e., round robin by cores, sockets, etc.).
 1. Expert: Using four new MCA parameters to effect process mapping and binding to processors. These options are a bit complex, and are not for the faint of heart, but offer a high degree of (regular pattern) flexibility (each of these is described more fully below):
    * rmaps_lama_map: a sequence of characters describing how to lay out processes
    * rmaps_lama_bind: a sequence of characters describing the resources to bind to each process
    * rmaps_lama_mppr: a sequence of characters describing the maximum number of processes to allow per resource (i.e., a specific definition of "oversubscription")
    * rmaps_lama_ordering: once all processes are in place, how to order the ranks in MPI_COMM_WORLD

We anticipate that most users will utilize the "None" and "Simple" levels of affinity, and these continue to work just as they do with the v1.6 series and SVN trunk. The Expert level was designed for two purposes:

 1. To provide a precise definition for the "Simple" level (i.e., every --bind-to/--map-by option in the "Simple" level has a corresponding precise specification in the "Expert" level).
 1. As modern computing platforms become more complex, we simply cannot predict what application developers will need in terms of processor affinity. LAMA is an attempt to provide a highly flexible mechanism that allows applications to utilize a variety of complex, unique affinity patterns beyond the common "bind to core" and "bind to socket" patterns.

= LAMA Simple Level =

The "Simple" level is pretty much the same as what Open MPI has offered for years. It supports the same --bind-to and --map-by options that Open MPI has supported for a while, but expands their scope a bit. Specifically, the following options are available for both --bind-to and --map-by:

 * slot
 * hwthread
 * core
 * l1cache
 * l2cache
 * l3cache
 * socket
 * numa
 * board
 * node

= LAMA Expert Level =

The "Expert" level requires some explanation. I'll repeat my disclaimer here: the LAMA Expert level is not for the meek. It is flexible, but complex. '''Most users won't need the Expert level.'''

LAMA works in three phases: mapping, binding, and ordering. Each is described below.

== Expert: Mapping ==

Processes are paired with sets of resources. For example, each process may be paired with a single processor core, or each process may be paired with an entire processor socket.

LAMA performs this mapping while obeying the Max Processes Per Resource ("MPPR", pronounced "mipper") limits; more on MPPR below. Mapping can be performed across multiple hardware levels:

 * h: Hardware thread
 * c: Processor core
 * s: Processor socket
 * L1: L1 cache
 * L2: L2 cache
 * L3: L3 cache
 * N: NUMA node
 * b: Processor board
 * n: Server node

If the act of mapping is that of pairing MPI processes to the resources that have been allocated to a job, one can easily imagine looping through all the resources and assigning processes to them. But to effect different process layout patterns across those resources, one may want to loop over those resources ''in a different order''. That is, if the above-mentioned nine hardware resources (hardware thread, processor core, etc.) can be thought of as a nine-dimensional space, you can imagine nine nested loops to traverse all of them. And you can imagine that changing the order of nesting would change the traversal pattern.

LAMA accepts a sequence of tokens representing the above-mentioned nine hardware resources to specify the order of looping when mapping resources to processes. For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading that sequence of letters from left to right, it specifies mapping by processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA node, processor board, server node, and finally hardware thread.

Wait... what? That string specifies resources from "smallest" to "largest" -- with the exception of hardware threads. Why are they tacked on to the end?

In short, this string of letters means "map round robin by core" (indeed, it exactly corresponds to the Simple level "--map-by core"). Specifically, LAMA traverses the string from left to right and maps processes to all the resources indicated by that token (e.g., "c" for processor core). When there are no more resources indicated by that token, it goes on to the next token.

Hence, in this case, LAMA will map the first process to the first core, then it will map the second process to the second core, and so on. Once all the cores are exhausted, LAMA effectively ignores all the other letters until "h" (because all the other resources are made up of cores; when cores are exhausted, those resources are exhausted, too). If there are still more processes to be mapped, LAMA will then traverse all the hyperthreads -- meaning that the next process will be mapped to the second hyperthread on the first core, the process after that to the second hyperthread on the second core, and so on. Keep in mind that the cores involved may span many server nodes; we're not just talking about the cores (etc.) in a single machine.

As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent to "--map-by socket" (i.e., LAMA maps the first process to the first socket, the second process to the second socket, and so on).

The sequence of letters can be combined in many, many different ways to produce many different regular mapping patterns.
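To make the looping idea concrete, here is a minimal Python sketch of a token-ordered round-robin traversal. This is a toy model, ''not'' the actual LAMA implementation: it assumes a homogeneous machine described by a flat count per hardware level, uses only four single-letter levels (the cache levels are omitted), and the function name lama_map is purely illustrative.

{{{
import itertools

def lama_map(order, dims, nprocs):
    """Round-robin mapping sketch.

    order:  token string; the leftmost token varies fastest
            (e.g. "csnh" behaves like "--map-by core" here).
    dims:   element count per level, e.g. {"n": 2, "s": 2, "c": 4, "h": 2}.
    nprocs: number of processes to place.
    """
    # itertools.product varies its *last* argument fastest, so feed it
    # the ranges in reverse token order to make order[0] the inner loop.
    levels = list(reversed(order))
    slots = itertools.product(*(range(dims[lv]) for lv in levels))
    # If nprocs exceeds the slot count, this sketch just stops short;
    # real LAMA would abort, or keep cycling under --oversubscribe.
    return [dict(zip(levels, coord))
            for _, coord in zip(range(nprocs), slots)]

# 2 nodes x 2 sockets x 4 cores x 2 hwthreads, 8 processes, core-first
# traversal: every core is used before any second hardware thread.
for rank, where in enumerate(
        lama_map("csnh", {"n": 2, "s": 2, "c": 4, "h": 2}, 8)):
    print(rank, where)
}}}

Changing the token order (e.g., "scnh" instead of "csnh") changes nothing about which resources exist, only the order in which they are handed out -- which is exactly the point of the mapping string.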
=== Max Processes Per Resource (MPPR) ===

The MPPR is an expression that precisely defines the maximum number of processes that can be mapped to any single resource. In effect, it defines the concept of "oversubscription." Specifically, traditional HPC wisdom is that "oversubscription" is when there is more than one MPI process per processor core. This conventional definition is expressed in an MPPR string of "1:c" (one process per core).

But what if your MPI processes are multi-threaded, and they need multiple cores per process? You'd need a different definition of "oversubscription" in this case. Perhaps you want to have one MPI process per socket. This would be expressed in an MPPR string of "1:s".

The general form of an individual MPPR specification is an integer, followed by a colon, followed by any of the tokens from the mapping specification. For example, "1:c" is pronounced "one process per core." Multiple MPPR specifications can be strung together into a comma-delimited list, too. All of these MPPR values are then taken into account when mapping. Here are some examples:

 * 1:c -- allow, at most, one process per processor core (i.e., don't schedule by hyperthread)
 * 1:s -- allow, at most, one process per processor socket (e.g., that process may be multithreaded, or may want exclusive use of the socket's caches)
 * 1:s,2:n -- only allow one process per processor socket, but, at most, two processes per server node (e.g., if the two MPI processes will consume all the RAM on the server node, even if more processor cores are available)

If mapping all processes to resources would exceed an MPPR limit, the job is ruled to be oversubscribed. If --oversubscribe was specified on the mpirun command line, the job continues; otherwise, LAMA will abort the job. Additionally, if --oversubscribe is specified, LAMA will endlessly cycle through the mapping token string until all processes have been mapped.
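As a concrete illustration, here is a small Python sketch of parsing an MPPR string and flagging oversubscription on the same toy four-level model used in the mapping sketch above. The helper names (parse_mppr, oversubscribed, CONTAINMENT) are hypothetical; this is not the real LAMA parser.

{{{
from collections import Counter

def parse_mppr(spec):
    """Parse '1:s,2:n' into {'s': 1, 'n': 2} (max processes per resource)."""
    return {tok: int(cnt)
            for cnt, tok in (part.split(":") for part in spec.split(","))}

# Assumed containment order for the toy model:
# node > socket > core > hardware thread.
CONTAINMENT = "nsch"

def oversubscribed(placement, limits):
    """placement: one {level: index} dict per process, as produced by
    the mapping sketch.  A resource instance is identified by its
    coordinate down to its own level, so socket 0 of node 0 is a
    different socket than socket 0 of node 1."""
    for level, maximum in limits.items():
        prefix = CONTAINMENT[:CONTAINMENT.index(level) + 1]
        usage = Counter(tuple(p[lv] for lv in prefix) for p in placement)
        if any(count > maximum for count in usage.values()):
            return True
    return False

# Two processes mapped onto the two sockets of a single node:
placement = [{"n": 0, "s": 0, "c": 0, "h": 0},
             {"n": 0, "s": 1, "c": 0, "h": 0}]
print(oversubscribed(placement, parse_mppr("1:s,2:n")))  # False
print(oversubscribed(placement, parse_mppr("1:n")))      # True
}}}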
== Expert: Binding ==

Once processes have been paired with resources during the Mapping stage, they are optionally bound to a (potentially different) set of resources. For example, processes may be mapped round robin by processor socket, but bound to an individual processor core.

To be clear: if binding is not used, then mapping is effectively reduced to "counting how many processes end up on each server node." Without binding, there's no enforcement that a process will stay where LAMA thinks it was placed.

With binding, however, processes are bound to a set of hardware threads. The number of threads to which a process is bound is sometimes referred to as the "binding width". For example, if a process is bound to all the hardware threads in a processor socket, its "width" is the processor socket. (Note that we specifically do not say that the hardware threads are sequential, even if they are all within a single resource such as a processor core or socket. BIOS ordering of hardware threads can be wonky, so we only refer to "sets of hardware threads".)

Bindings are expressed as an integer and a token from the mapping string. For example, "1s" means "bind each process to one processor socket" (there is no ":" in the binding string because the ":" is pronounced as "per" when reading the MPPR string). Note that it only makes sense to bind processes to a single resource specification (unlike the MPPR specification, where multiple limits can be specified).

== Expert: Ordering ==

Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA currently offers two ordering modes: sequential or natural.

 * Sequential: if you laid out all the hardware resources in a single line, and then overlaid all the MPI processes on top of them, they are ordered from 0 to (N-1) from left to right.
 * Natural: the ordering of ranks follows the mapping ordering.

For example, consider a server node with two processor sockets, each containing four cores. The command line "mpirun -np 8 --bind-to core --map-by socket --order n a.out" would result in MCW ranks that look like this: [0 2 4 6] [1 3 5 7].
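The difference between the two ordering modes can be reproduced in a few lines of Python on that same two-socket, four-core example (a toy sketch under the same assumptions as above, not LAMA's actual ordering code):

{{{
# "--map-by socket" on one node with 2 sockets x 4 cores each:
# the socket index varies fastest during mapping.
placement = [{"s": i % 2, "c": i // 2} for i in range(8)]

# Natural ordering: MCW rank == the order in which slots were mapped.
natural = list(enumerate(placement))

# Sequential ordering: number the processes by physical position
# (socket-major, then core) instead of by mapping order.
order = sorted(range(8), key=lambda i: (placement[i]["s"], placement[i]["c"]))
sequential = list(enumerate(placement[i] for i in order))

for s in (0, 1):
    print("natural,    socket", s, [r for r, p in natural if p["s"] == s])
    print("sequential, socket", s, [r for r, p in sequential if p["s"] == s])
# Natural yields [0, 2, 4, 6] / [1, 3, 5, 7], matching the example
# above; sequential yields [0, 1, 2, 3] / [4, 5, 6, 7].
}}}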
= Execution =

At this point, the job is fully mapped, optionally bound, and its ranks in MPI_COMM_WORLD are ordered. It now starts its execution.

= Final Notes =

Note that at this point, lama is not the default mapper. It must be activated with "--mca rmaps lama". We'll continue to do further testing and comparative analysis with the current set of ORTE mappers.

Also, note that the LAMA algorithm can handle heterogeneity between hardware resources (e.g., an MPI job spanning server nodes with differing numbers of processor sockets). For lack of a longer explanation (this commit message is already long enough!), LAMA considers each server node individually during mapping and binding.

See the LAMA paper for more details: http://www.open-mpi.org/papers/cluster-2011-lama/

This commit was SVN r27206.

#
# Copyright (c) 2011 Oak Ridge National Labs. All rights reserved.
#
# Copyright (c) 2012 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
dist_ortedata_DATA = help-orte-rmaps-lama.txt
sources = \
        rmaps_lama_module.c \
        rmaps_lama_max_tree.c \
        rmaps_lama_params.c \
        rmaps_lama.h \
        rmaps_lama_component.c
# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).
if MCA_BUILD_orte_rmaps_lama_DSO
component_noinst =
component_install = mca_rmaps_lama.la
else
component_noinst = libmca_rmaps_lama.la
component_install =
endif
mcacomponentdir = $(ortelibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_rmaps_lama_la_SOURCES = $(sources)
mca_rmaps_lama_la_LDFLAGS = -module -avoid-version
noinst_LTLIBRARIES = $(component_noinst)
libmca_rmaps_lama_la_SOURCES = $(sources)
libmca_rmaps_lama_la_LDFLAGS = -module -avoid-version