Merge pull request from jsquyres/pr/v4.1.x/fix-minor-mistake-in-mpirun.1in

v4.1.x: orterun.1in: fix minor mistake in :PE=2 example and add more descriptions/explanations
This commit is contained in:
Jeff Squyres 2020-11-09 15:02:38 -05:00 committed by GitHub
parents bed064f198 df73e4a3e6
commit 74a743fc21
No key found corresponding to this signature
GPG key ID: 4AEE18F83AFDEB23

@@ -107,6 +107,106 @@ using an appropriate binding level or specific number of processing elements per
application process.
.
.\" **************************
.\" Definition of "slot"
.\" **************************
.SH DEFINITION OF 'SLOT'
.
.P
The term "slot" is used extensively in the rest of this manual page.
A slot is an allocation unit for a process. The number of slots on a
node indicates how many processes can potentially execute on that node.
By default, Open MPI will allow one process per slot.
.
.P
If Open MPI is not explicitly told how many slots are available on a
node (e.g., if a hostfile is used and the number of slots is not
specified for a given node), it will determine a maximum number of
slots for that node in one of two ways:
.
.TP 3
1. Default behavior
By default, Open MPI will attempt to discover the number of
processor cores on the node, and use that as the number of slots
available.
.
.TP 3
2. When \fI--use-hwthread-cpus\fP is used
If \fI--use-hwthread-cpus\fP is specified on the \fImpirun\fP command
line, then Open MPI will attempt to discover the number of hardware
threads on the node, and use that as the number of slots available.
.
.P
This default behavior also occurs when specifying the \fI-host\fP
option with a single host. Thus, the command:
.
.TP 4
mpirun --host node1 ./a.out
launches a number of processes equal to the number of cores on node node1,
whereas:
.TP 4
mpirun --host node1 --use-hwthread-cpus ./a.out
launches a number of processes equal to the number of hardware threads
on node1.
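.
.P
In contrast, a hostfile can state the number of slots explicitly via
the "slots" keyword. As an illustrative sketch (the hostname, slot
count, and file name are hypothetical), suppose the file
\fImyhostfile\fP contains the single line:
.
.nf
    node1 slots=4
.fi
.
.P
Then the command:
.
.TP 4
mpirun -hostfile myhostfile ./a.out
launches exactly 4 processes on node1, regardless of how many
processor cores or hardware threads the node actually has.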
.
.P
When Open MPI applications are invoked in an environment managed by a
resource manager (e.g., inside of a SLURM job), and Open MPI was built
with appropriate support for that resource manager, then Open MPI will
be informed of the number of slots for each node by the resource
manager. For example:
.
.TP 4
mpirun ./a.out
launches one process for every slot (on every node) as dictated by
the resource manager job specification.
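.
.P
As a sketch of how this plays out (the node and task counts are
hypothetical), consider a SLURM allocation obtained with:
.
.TP 4
salloc -N 2 --ntasks-per-node=8
allocates 2 nodes with 8 tasks per node; Open MPI is then told of 8
slots on each node, so a subsequent "mpirun ./a.out" launches 16
processes in total, 8 per node.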
.
.P
Also note that the one-process-per-slot restriction can be overridden
in unmanaged environments (e.g., when using hostfiles without a
resource manager) if oversubscription is enabled (by default, it is
disabled). Most MPI applications and HPC environments do not
oversubscribe; for simplicity, the majority of this documentation
assumes that oversubscription is not enabled.
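.
.P
As an illustrative sketch (the hostname and file name are
hypothetical), if \fImyhostfile\fP contains "node1 slots=2", then:
.
.TP 4
mpirun -hostfile myhostfile -np 4 --oversubscribe ./a.out
launches 4 processes on node1 even though only 2 slots were declared,
whereas the same invocation without \fI--oversubscribe\fP is refused
because it requests more processes than there are slots.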
.
.
.SS Slots are not hardware resources
.
Slots are frequently incorrectly conflated with hardware resources.
It is important to realize that slots are an entirely different metric
than the number (and type) of hardware resources available.
.
.P
Here are some examples that may help illustrate the difference:
.
.TP 3
1. More processor cores than slots
Consider a resource manager job environment that tells Open MPI that
there is a single node with 20 processor cores and 2 slots available.
By default, Open MPI will only let you run up to 2 processes.
Meaning: you run out of slots long before you run out of processor
cores.
.
.TP 3
2. More slots than processor cores
Consider a hostfile with a single node listed with a "slots=50"
qualification. The node has 20 processor cores. By default, Open MPI
will let you run up to 50 processes.
Meaning: you can run many more processes than you have processor
cores.
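.
.P
To make the second example concrete (the hostname and counts are
hypothetical), suppose \fImyhostfile\fP contains the single line
"node1 slots=50" and node1 has 20 processor cores. Then:
.
.TP 4
mpirun -hostfile myhostfile -np 50 ./a.out
launches 50 processes on node1; the slot count, not the number of
cores, is what limits how many processes may be started.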
.
.
.SH DEFINITION OF 'PROCESSOR ELEMENT'
By default, Open MPI defines that a "processing element" is a
processor core. However, if \fI--use-hwthread-cpus\fP is specified on
the \fImpirun\fP command line, then a "processing element" is a
hardware thread.
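.
.P
As an illustrative sketch (assuming each core on the node has 2
hardware threads):
.
.TP 4
mpirun --map-by slot:PE=2 --bind-to core ./a.out
binds each process to 2 processor cores,
whereas:
.TP 4
mpirun --use-hwthread-cpus --map-by slot:PE=2 --bind-to hwthread ./a.out
binds each process to 2 hardware threads (i.e., half as much
hardware), because \fI--use-hwthread-cpus\fP makes the "processing
element" a hardware thread rather than a core.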
.
.
.\" **************************
.\" Options Section
.\" **************************
.SH OPTIONS
@@ -297,15 +397,17 @@ To map processes:
.
.TP
.B --map-by \fR<foo>\fP
Map to the specified object; defaults to \fIsocket\fP. Supported
options include \fIslot\fP, \fIhwthread\fP, \fIcore\fP, \fIL1cache\fP,
\fIL2cache\fP, \fIL3cache\fP, \fIsocket\fP, \fInuma\fP, \fIboard\fP,
\fInode\fP, \fIsequential\fP, \fIdistance\fP, and \fIppr\fP. Any
object can include modifiers by adding a \fI:\fP and any combination
of \fIPE=n\fP (bind n processing elements to each proc), \fISPAN\fP
(load balance the processes across the allocation),
\fIOVERSUBSCRIBE\fP (allow more processes on a node than processing
elements), and \fINOOVERSUBSCRIBE\fP. This includes \fIPPR\fP, where the
pattern would be terminated by another colon to separate it from the
modifiers.
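For example (the counts shown are hypothetical), \fI--map-by
slot:PE=2\fP maps by slot and assigns 2 processing elements to each
process, while \fI--map-by ppr:2:socket:PE=2\fP requests 2 processes
per socket, with the PPR pattern "ppr:2:socket" terminated by a final
colon that separates it from the \fIPE=2\fP modifier.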
.
.TP
.B -bycore\fR,\fP --bycore
@@ -757,7 +859,16 @@ Terminate the DVM.
.
.TP
.B -use-hwthread-cpus\fR,\fP --use-hwthread-cpus
Use hardware threads as independent CPUs.
Note that if the number of slots is not provided to Open MPI (e.g., via
the "slots" keyword in a hostfile or from a resource manager such as
SLURM), the use of this option changes the default calculation of the
number of slots on a node. See "DEFINITION OF 'SLOT'", above.
Also note that the use of this option changes Open MPI's
definition of a "processor element" from a processor core to a
hardware thread. See "DEFINITION OF 'PROCESSOR ELEMENT'", above.
.
.
.TP
@@ -889,20 +1000,8 @@ Or, consider the hostfile
.
.PP
Here, we list both the host names (aa, bb, and cc) but also how many "slots"
there are for each. Slots indicate how many processes can potentially execute
on a node. For best performance, the number of slots may be chosen to be the
number of cores on the node or the number of processor sockets. If the hostfile
does not provide slots information, Open MPI will attempt to discover the number
of cores (or hwthreads, if the use-hwthreads-as-cpus option is set) and set the
number of slots to that value. This default behavior also occurs when specifying
the \fI-host\fP option with a single hostname. Thus, the command
.
.TP 4
mpirun -H aa ./a.out
launches a number of processes equal to the number of cores on node aa.
.
.PP
Here, we list both the host names (aa, bb, and cc) and how many slots
there are for each.
.
.TP 4
mpirun -hostfile myhostfile ./a.out
@@ -1181,8 +1280,9 @@ exert detailed control over relative MCW rank location and binding.
Finally, \fI--report-bindings\fP can be used to report bindings.
.
.PP
As an example, consider a node with two processor sockets, each
comprising four cores, with each core containing a single hardware
thread. We run \fImpirun\fP with \fI-np 4 --report-bindings\fP and
the following additional options:
.
@@ -1198,7 +1298,7 @@ the following additional options:
[...] ... binding child [...,2] to socket 0 cpus 000f
[...] ... binding child [...,3] to socket 1 cpus 00f0
% mpirun ... --map-by slot:PE=2 --bind-to core
[...] ... binding child [...,0] to cpus 0003
[...] ... binding child [...,1] to cpus 000c
[...] ... binding child [...,2] to cpus 0030
@@ -1212,9 +1312,20 @@ In the first case, the processes bind to successive cores as indicated by
the masks 0001, 0002, 0004, and 0008. In the second case, processes bind
to all cores on successive sockets as indicated by the masks 000f and 00f0.
The processes cycle through the processor sockets in a round-robin fashion
as many times as are needed.
.
.P
In the third case, the masks show us that 2 cores have been bound per
process. Specifically, the mapping by slot with the \fIPE=2\fP
qualifier indicated that each slot (i.e., process) should consume two
processor elements. Since \fI--use-hwthread-cpus\fP was not
specified, Open MPI defined "processor element" as "core", and
therefore the \fI--bind-to core\fP caused each process to be bound to
both of the cores to which it was mapped.
.
.P
In the fourth case, binding is turned off and no bindings are
reported.
.
.PP
Open MPI's support for process binding depends on the underlying