From 16d88941efac2f51e7b44e0062ec9c2c5a37fac7 Mon Sep 17 00:00:00 2001 From: Jeff Squyres Date: Sat, 17 Oct 2020 14:57:24 -0400 Subject: [PATCH 1/3] orterun.1in: fix minor mistake in :PE=2 example Fix mistake in orterun(1) (i.e., mpirun(1)) with an example using the :PE=x modifier. Additionally, add some extra text with some further explanation. This is not a cherry-pick from master because PRRTE has replaced ORTE on master, and orterun.1in no longer exists in master. Signed-off-by: Jeff Squyres (cherry picked from commit 7384972e288e0037c1b5d25a08b6e54b0cfff1e1) --- orte/tools/orterun/orterun.1in | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/orte/tools/orterun/orterun.1in b/orte/tools/orterun/orterun.1in index 4d9d5665d4..9ea58712d4 100644 --- a/orte/tools/orterun/orterun.1in +++ b/orte/tools/orterun/orterun.1in @@ -1181,8 +1181,9 @@ exert detailed control over relative MCW rank location and binding. Finally, \fI--report-bindings\fP can be used to report bindings. . .PP -As an example, consider a node with two processor sockets, each comprising -four cores. We run \fImpirun\fP with \fI-np 4 --report-bindings\fP and +As an example, consider a node with two processor sockets, each +comprised of four cores, and each of those cores contains one hardware +thread. We run \fImpirun\fP with \fI-np 4 --report-bindings\fP and the following additional options: . @@ -1198,7 +1199,7 @@ the following additional options: [...] ... binding child [...,2] to socket 0 cpus 000f [...] ... binding child [...,3] to socket 1 cpus 00f0 - % mpirun ... --map-by core:PE=2 --bind-to core + % mpirun ... --map-by slot:PE=2 --bind-to core [...] ... binding child [...,0] to cpus 0003 [...] ... binding child [...,1] to cpus 000c [...] ... binding child [...,2] to cpus 0030 @@ -1212,9 +1213,20 @@ In the first case, the processes bind to successive cores as indicated by the masks 0001, 0002, 0004, and 0008. 
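The mask arithmetic behind these `--report-bindings` lines can be sketched in a few lines of Python (an illustrative model only, not Open MPI code; it assumes linearly numbered cores, so rank r bound to n processing elements covers bits r*n through r*n + n - 1):

```python
def binding_mask(rank, pe=1, width=4):
    """Compute the cpu bitmask reported for a rank bound to `pe`
    consecutive processing elements, assuming linear core numbering.
    Returns the mask as a zero-padded hex string, as in the man page."""
    mask = ((1 << pe) - 1) << (rank * pe)
    return format(mask, f"0{width}x")

# --bind-to core (one core per process): masks 0001, 0002, 0004, 0008
print([binding_mask(r) for r in range(4)])
# --map-by slot:PE=2 --bind-to core: masks 0003, 000c, 0030, 00c0
print([binding_mask(r, pe=2) for r in range(4)])
```

Each successive rank's bits are shifted left past the previous rank's, which is why the PE=2 masks (0003, 000c, 0030, 00c0) cover two adjacent cores apiece.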
In the second case, processes bind to all cores on successive sockets as indicated by the masks 000f and 00f0. The processes cycle through the processor sockets in a round-robin fashion -as many times as are needed. In the third case, the masks show us that -2 cores have been bound per process. In the fourth case, binding is -turned off and no bindings are reported. +as many times as are needed. +. +.P +In the third case, the masks show us that 2 cores have been bound per +process. Specifically, the mapping by slot with the \fIPE=2\fP +qualifier indicated that each slot (i.e., process) should consume two +processor elements. Since \fI--use-hwthread-cpus\fP was not +specified, Open MPI defined "processor element" as "core", and +therefore the \fI--bind-to core\fP caused each process to be bound to +both of the cores to which it was mapped. +. +.P +In the fourth case, binding is turned off and no bindings are +reported. . .PP Open MPI's support for process binding depends on the underlying From 405dc6e7f2d67388e432e3751f582b9e97f45bda Mon Sep 17 00:00:00 2001 From: Jeff Squyres Date: Sat, 7 Nov 2020 14:08:58 -0500 Subject: [PATCH 2/3] orterun.1in: define "slot" and "processor element" Add descriptive definitions of "slot" and "processor element" at the top of the man page (and effectively delete / move some text from lower in the man page up into those definitions). Also add a little blurb in the --use-hwthread-cpus description about how it changes the definition of "processor element". This is not a cherry-pick from master because PRRTE has replaced ORTE on master, and orterun.1in no longer exists in master. 
Signed-off-by: Jeff Squyres (cherry picked from commit 07b8937d4ae8d64b6d2394f4272e705a2ec89656) --- orte/tools/orterun/orterun.1in | 127 +++++++++++++++++++++++++++++---- 1 file changed, 112 insertions(+), 15 deletions(-) diff --git a/orte/tools/orterun/orterun.1in b/orte/tools/orterun/orterun.1in index 9ea58712d4..35413dea8e 100644 --- a/orte/tools/orterun/orterun.1in +++ b/orte/tools/orterun/orterun.1in @@ -107,6 +107,106 @@ using an appropriate binding level or specific number of processing elements per application process. . .\" ************************** +.\" Definition of "slot" +.\" ************************** +.SH DEFINITION OF 'SLOT' +. +.P +The term "slot" is used extensively in the rest of this manual page. +A slot is an allocation unit for a process. The number of slots on a +node indicates how many processes can potentially execute on that node. +By default, Open MPI will allow one process per slot. +. +.P +If Open MPI is not explicitly told how many slots are available on a +node (e.g., if a hostfile is used and the number of slots is not +specified for a given node), it will determine a maximum number of +slots for that node in one of two ways: +. +.TP 3 +1. Default behavior +By default, Open MPI will attempt to discover the number of +processor cores on the node, and use that as the number of slots +available. +. +.TP 3 +2. When \fI--use-hwthread-cpus\fP is used +If \fI--use-hwthread-cpus\fP is specified on the \fImpirun\fP command +line, then Open MPI will attempt to discover the number of hardware +threads on the node, and use that as the number of slots available. +. +.P +This default behavior also occurs when specifying the \fI-host\fP +option with a single host. Thus, the command: +. +.TP 4 +mpirun --host node1 ./a.out +launches a number of processes equal to the number of cores on node node1, +whereas: +.TP 4 +mpirun --host node1 --use-hwthread-cpus ./a.out +launches a number of processes equal to the number of hardware threads +on node1. +. 
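The two-way slot-count fallback described above can be modeled in a short sketch (a simplified model, not Open MPI's actual implementation; the `cores` and `hwthreads_per_core` inputs are assumed to come from topology discovery):

```python
def default_slots(cores, hwthreads_per_core, use_hwthread_cpus=False):
    """Default number of slots Open MPI assumes for a node when no
    slot count is supplied: the core count by default, or the
    hardware-thread count when --use-hwthread-cpus is given."""
    if use_hwthread_cpus:
        return cores * hwthreads_per_core
    return cores

# A node with 4 cores and 2 hardware threads per core:
print(default_slots(4, 2))                          # mpirun --host node1 ...
print(default_slots(4, 2, use_hwthread_cpus=True))  # ... --use-hwthread-cpus
```

On such a node, `mpirun --host node1 ./a.out` would launch 4 processes, and adding `--use-hwthread-cpus` would launch 8.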
+.P +When Open MPI applications are invoked in an environment managed by a +resource manager (e.g., inside of a SLURM job), and Open MPI was built +with appropriate support for that resource manager, then Open MPI will +be informed of the number of slots for each node by the resource +manager. For example: +. +.TP 4 +mpirun ./a.out +launches one process for every slot (on every node) as dictated by +the resource manager job specification. +. +.P +Also note that the one-process-per-slot restriction can be overridden +in unmanaged environments (e.g., when using hostfiles without a +resource manager) if oversubscription is enabled (by default, it is +disabled). Most MPI applications and HPC environments do not +oversubscribe; for simplicity, the majority of this documentation +assumes that oversubscription is not enabled. +. +. +.SS Slots are not hardware resources +. +Slots are frequently incorrectly conflated with hardware resources. +It is important to realize that slots are an entirely different metric +than the number (and type) of hardware resources available. +. +.P +Here are some examples that may help illustrate the difference: +. +.TP 3 +1. More processor cores than slots + +Consider a resource manager job environment that tells Open MPI that +there is a single node with 20 processor cores and 2 slots available. +By default, Open MPI will only let you run up to 2 processes. + +Meaning: you run out of slots long before you run out of processor +cores. +. +.TP 3 +2. More slots than processor cores + +Consider a hostfile with a single node listed with a "slots=50" +qualification. The node has 20 processor cores. By default, Open MPI +will let you run up to 50 processes. + +Meaning: you can run many more processes than you have processor +cores. +. +. +.SH DEFINITION OF 'PROCESSOR ELEMENT' +By default, Open MPI defines that a "processing element" is a +processor core. 
However, if \fI--use-hwthread-cpus\fP is specified on +the \fImpirun\fP command line, then a "processing element" is a +hardware thread. +. +. +.\" ************************** .\" Options Section .\" ************************** .SH OPTIONS @@ -757,7 +857,16 @@ Terminate the DVM. . .TP .B -use-hwthread-cpus\fR,\fP --use-hwthread-cpus -Use hardware threads as independent cpus. +Use hardware threads as independent CPUs. + +Note that if a number of slots is not provided to Open MPI (e.g., via +the "slots" keyword in a hostfile or from a resource manager such as +SLURM), the use of this option changes the default calculation of the +number of slots on a node. See "DEFINITION OF 'SLOT'", above. + +Also note that the use of this option changes Open MPI's +definition of a "processor element" from a processor core to a +hardware thread. See "DEFINITION OF 'PROCESSOR ELEMENT'", above. . . .TP @@ -889,20 +998,8 @@ Or, consider the hostfile . .PP -Here, we list both the host names (aa, bb, and cc) but also how many "slots" -there are for each. Slots indicate how many processes can potentially execute -on a node. For best performance, the number of slots may be chosen to be the -number of cores on the node or the number of processor sockets. If the hostfile -does not provide slots information, Open MPI will attempt to discover the number -of cores (or hwthreads, if the use-hwthreads-as-cpus option is set) and set the -number of slots to that value. This default behavior also occurs when specifying -the \fI-host\fP option with a single hostname. Thus, the command -. -.TP 4 -mpirun -H aa ./a.out -launches a number of processes equal to the number of cores on node aa. -. -.PP +Here, we list both the host names (aa, bb, and cc) and how many slots +there are for each. . 
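The point of the "Slots are not hardware resources" examples — that slots, not cores, gate how many processes may launch — can be captured in a tiny model (illustrative only; real oversubscription handling involves more policy than a single boolean):

```python
def can_launch(np, slots, oversubscribe=False):
    """Whether `np` processes fit on a node with `slots` slots.
    Note that the hardware core count never enters the check:
    slots alone decide, unless oversubscription is enabled
    (it is disabled by default)."""
    return np <= slots or oversubscribe

# 20-core node, but the resource manager granted only 2 slots:
print(can_launch(2, slots=2))    # fits
print(can_launch(20, slots=2))   # rejected despite 20 cores
# 20-core node listed with "slots=50" in a hostfile:
print(can_launch(50, slots=50))  # allowed despite only 20 cores
```

This mirrors both examples above: you can run out of slots long before you run out of cores, or be allowed far more processes than you have cores.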
.TP 4 mpirun -hostfile myhostfile ./a.out From df73e4a3e618ba185ee58695298edee089422e77 Mon Sep 17 00:00:00 2001 From: Jeff Squyres Date: Sat, 7 Nov 2020 14:10:20 -0500 Subject: [PATCH 3/3] orterun.1in: add some markup Add some nroff markup into the paragraph, just to clearly delineate the option names from the paragraph text. No other content changes. This is not a cherry-pick from master because PRRTE has replaced ORTE on master, and orterun.1in no longer exists in master. Signed-off-by: Jeff Squyres (cherry picked from commit 25f84bee647f01926ff938e98cb7ee92c511c962) --- orte/tools/orterun/orterun.1in | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/orte/tools/orterun/orterun.1in b/orte/tools/orterun/orterun.1in index 35413dea8e..1faaadb23f 100644 --- a/orte/tools/orterun/orterun.1in +++ b/orte/tools/orterun/orterun.1in @@ -397,15 +397,17 @@ To map processes: . .TP .B --map-by \fR\fP -Map to the specified object, defaults to \fIsocket\fP. Supported options -include slot, hwthread, core, L1cache, L2cache, L3cache, socket, numa, -board, node, sequential, distance, and ppr. Any object can include -modifiers by adding a \fR:\fP and any combination of PE=n (bind n -processing elements to each proc), SPAN (load -balance the processes across the allocation), OVERSUBSCRIBE (allow -more processes on a node than processing elements), and NOOVERSUBSCRIBE. -This includes PPR, where the pattern would be terminated by another colon -to separate it from the modifiers. +Map to the specified object, defaults to \fIsocket\fP. Supported +options include \fIslot\fP, \fIhwthread\fP, \fIcore\fP, \fIL1cache\fP, +\fIL2cache\fP, \fIL3cache\fP, \fIsocket\fP, \fInuma\fP, \fIboard\fP, +\fInode\fP, \fIsequential\fP, \fIdistance\fP, and \fIppr\fP. 
Any +object can include modifiers by adding a \fI:\fP and any combination +of \fIPE=n\fP (bind n processing elements to each proc), \fISPAN\fP +(load balance the processes across the allocation), +\fIOVERSUBSCRIBE\fP (allow more processes on a node than processing +elements), and \fINOOVERSUBSCRIBE\fP. This includes \fIPPR\fP, where the +pattern would be terminated by another colon to separate it from the +modifiers. . .TP .B -bycore\fR,\fP --bycore
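The `<object>:<modifiers>` shape of a \fI--map-by\fP argument can be illustrated with a small parser (a hypothetical helper, not Open MPI's parser; the comma-separated-qualifier convention and the dict representation here are assumptions for illustration, and the ppr pattern's extra colons are not handled):

```python
def parse_map_by(spec):
    """Split a --map-by argument into (object, modifiers).
    Modifiers follow a ':'; value-carrying modifiers such as PE=n
    keep their value, bare flags such as SPAN map to True."""
    obj, _, rest = spec.partition(":")
    mods = {}
    for m in filter(None, rest.split(",")):
        key, _, val = m.partition("=")
        mods[key.upper()] = val if val else True
    return obj, mods

print(parse_map_by("slot:PE=2"))   # the corrected example from patch 1
print(parse_map_by("socket:SPAN,PE=4"))
print(parse_map_by("core"))        # no modifiers at all
```

Under this sketch, `slot:PE=2` yields the object `slot` with a `PE` qualifier of `2`, matching the binding example fixed in the first patch.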