Update the orterun man page
This commit is contained in:
parent 43aff4d8b3
Commit f9d620e3a7
@ -77,6 +77,24 @@ process starter, as opposed to, for example, \fIrsh\fR or \fIssh\fR,
which require the use of a hostfile, or will default to running all X
copies on the localhost), scheduling (by default) in a round-robin fashion by
CPU slot. See the rest of this page for more details.
.P
Please note that mpirun automatically binds processes as of the start of the
v1.8 series. Two binding patterns are used in the absence of any further directives:
.TP 18
.B Bind to core:
when the number of processes is <= 2
.
.
.TP
.B Bind to socket:
when the number of processes is > 2
.
.
.P
If your application uses threads, then you probably want to ensure that you are
either not bound at all (by specifying --bind-to none), or bound to multiple cores
using an appropriate binding level or specific number of processing elements per
application process.
.
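.PP
For example (a hypothetical invocation, with ./a.out standing in for your
application), binding can be disabled entirely for a threaded code:
.

mpirun -np 4 --bind-to none ./a.out
.
.PP
or each process can be given several cores via the \fI-cpus-per-proc\fP option
described below:
.

mpirun -np 4 -cpus-per-proc 2 ./a.out
.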
.\" **************************
.\" Options Section
@ -128,7 +146,14 @@ cause orterun to exit.
.
.
.P
To specify which hosts (nodes) of the cluster to run on:
Use one of the following options to specify which hosts (nodes) of the cluster to run on. Note
that as of the start of the v1.8 release, mpirun will launch a daemon onto each host in the
allocation (as modified by the following options) at the very beginning of execution, regardless
of whether or not application processes will eventually be mapped to execute there. This is
done to allow collection of hardware topology information from the remote nodes, thus allowing
us to map processes against known topology. However, this is a change from the behavior in prior releases,
where daemons were only launched \fIafter\fP mapping was complete, and thus only occurred on
nodes where application processes would actually be executing.
.
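.PP
For example (using the hypothetical hosts aa and bb that appear in the
examples later on this page):
.

mpirun -H aa,bb -np 4 ./a.out
.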
.
.TP
@ -151,7 +176,9 @@ Synonym for \fI-hostfile\fP.
.
.
.P
To specify the number of processes to launch:
The following options specify the number of processes to launch. Note that none
of the options imply a particular binding policy - e.g., requesting N processes
for each socket does not imply that the processes will be bound to the socket.
.
.
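.PP
For example (a hypothetical command), a binding policy must be requested
separately from the process count - see the \fI--map-by\fP and
\fI--bind-to\fP entries below:
.

mpirun --map-by ppr:2:socket --bind-to socket ./a.out
.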
.TP
@ -167,6 +194,11 @@ error (without beginning execution of the application) otherwise.
.
.
.TP
.B --map-by ppr:N:<object>
Launch N times the number of objects of the specified type on each node.
.
.
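.PP
For example, on nodes that each have two processor sockets, a hypothetical
.

mpirun --map-by ppr:2:socket ./a.out
.
.PP
would launch four processes per node, two for each socket.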
.TP
.B -npersocket\fR,\fP --npersocket <#persocket>
On each node, launch this many processes times the number of processor
sockets on the node.
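.PP
For example (hypothetically), on nodes with two processor sockets:
.

mpirun -npersocket 2 ./a.out
.
.PP
would launch four processes on each node.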
@ -253,7 +285,7 @@ For process binding:
.TP
.B --bind-to <foo>
Bind processes to the specified object, defaults to \fIcore\fP. Supported options
include slot, hwthread, core, socket, numa, board, and none.
include slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, and none.
.
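.PP
For example (a hypothetical command binding each of eight processes to a
socket):
.

mpirun -np 8 --bind-to socket ./a.out
.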
.TP
.B -cpus-per-proc\fR,\fP --cpus-per-proc <#perproc>
@ -749,13 +781,13 @@ Consider the same hostfile as above, again with \fI-np\fP 6:

mpirun 0 1 2 3 4 5

mpirun -bynode 0 3 1 4 2 5
mpirun --map-by node 0 3 1 4 2 5

mpirun -nolocal 0 1 2 3 4 5
.
.PP
The \fI-bynode\fP option does likewise but numbers the processes in "by node"
in a round-robin fashion.
The \fI--map-by node\fP option will load balance the processes across
the available nodes, numbering each process in a round-robin fashion.
.
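.PP
The \fI--map-by node\fP layout above corresponds to a command such as
(assuming the same hypothetical hostfile):
.

mpirun -np 6 --map-by node -hostfile myhostfile ./a.out
.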
.PP
The \fI-nolocal\fP option prevents any processes from being mapped onto the
@ -821,19 +853,32 @@ mpirun -H aa -np 1 hostname : -H bb,cc -np 2 uptime
will launch process 0 running \fIhostname\fP on node aa and processes 1 and 2
each running \fIuptime\fP on nodes bb and cc, respectively.
.
.SS Mapping Processes to Nodes: Using Arbitrary Mappings
.SS Mapping, Ranking, and Binding: Oh My!
.
The mapping of processes to nodes can be prescribed not just
OpenMPI employs a three-phase procedure for assigning process locations and
ranks. The \fImapping\fP step is used to assign a default location to each process
based on the mapper being employed. Mapping by slot, node, or sequentially results
in the assignment of the processes at the node level. In contrast, mapping by object allows
the mapper to assign each process to an actual object on each node.
.
.PP
\fBNote:\fP the location assigned to the process is independent of where it will be bound - the
assignment is used solely as input to the binding algorithm.
.
.PP
The mapping of processes to nodes can be defined not just
with general policies but also, if necessary, using arbitrary mappings
that cannot be described by a simple policy. One can use the "sequential
mapper," which reads the hostfile line by line, assigning processes
to nodes in whatever order the hostfile specifies. Use the
\fI-mca rmaps seq\fP option. For example, using the same hostfile
as before
as before:
.
.TP 4
mpirun -hostfile myhostfile ./a.out
will launch three processes on nodes aa, bb, and cc, respectively.
.PP
mpirun -hostfile myhostfile -mca rmaps seq ./a.out
.
.PP
will launch three processes, one on each of nodes aa, bb, and cc, respectively.
The slot counts don't matter; one process is launched per line on
whatever node is listed on the line.
.
@ -842,9 +887,31 @@ Another way to specify arbitrary mappings is with a rankfile, which
gives you detailed control over process binding as well. Rankfiles
are discussed below.
.
.SS Process Binding
.PP
The second phase focuses on the \fIranking\fP of the processes within the job. OpenMPI
separates this from the mapping procedure to allow more flexibility in the
relative placement of MPI ranks. This is best illustrated by considering the
following cases, where we used the --map-by ppr:2:socket option:
.
Processes may be bound to specific resources on a node. This can
.PP
                     node aa      node bb

rank-by core         0 1 ! 2 3    4 5 ! 6 7

rank-by socket       0 2 ! 1 3    4 6 ! 5 7

rank-by socket:span  0 4 ! 1 5    2 6 ! 3 7
.
.PP
Ranking by core and by slot provide the identical result - a simple progression of ranks across
each node. Ranking by socket does a round-robin ranking within each node until all processes
have been assigned a rank, and then progresses to the next node. Adding the \fIspan\fP
modifier to the ranking directive causes the ranking algorithm to treat the entire allocation
as a single entity - thus, the ranks are assigned across all sockets before circling back
around to the beginning.
.
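.PP
For example, the socket ranking shown above could be requested with a
hypothetical command like:
.

mpirun -np 8 --map-by ppr:2:socket --rank-by socket ./a.out
.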
.PP
The \fIbinding\fP phase actually binds each process to a given set of processors. This can
improve performance if the operating system is placing processes
suboptimally. For example, it might oversubscribe some multi-core
processor sockets, leaving other sockets idle; this can lead
@ -856,20 +923,23 @@ processes excessively, regardless of how optimally those processes
were placed to begin with.
.
.PP
To bind processes, one must first associate them with the resources
on which they should run. For example, the \fI--map-by core\fP option
associates the processes on a node with successive cores. Or,
\fI--map-by socket\fP associates the processes with successive processor sockets,
cycling through the sockets in a round-robin fashion if necessary.
And \fI-cpus-per-proc\fP indicates how many cores to bind per process.
The processors to be used for binding
can be identified in terms of topological groupings - e.g., binding to an l3cache will bind
each process to all processors in the l3cache within its assigned location. Thus, if a process
is assigned by the mapper to a certain socket, then a \fI--bind-to l3cache\fP directive will cause
the process to be bound to the l3cache within that socket.
.
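.PP
For example (hypothetically):
.

mpirun -np 8 --map-by socket --bind-to l3cache ./a.out
.
.PP
maps each process to a socket and then binds it to the l3cache within that
socket.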
.PP
But, such association is meaningless unless the processes are actually
bound to those resources. The binding option specifies the granularity
of binding -- say, with \fI-bind-to core\fP or \fI-bind-to socket\fP.
One can also turn binding off with \fI-bind-to none\fP, which is
typically the default.
.\" JMS ^^ THE ABOVE STATEMENT IS NO LONGER TRUE.
To help balance loads, the binding directive uses a round-robin method when binding to
levels lower than that used in the mapper. For example, consider the case where a job is
mapped to the socket level, and then bound to core. Each socket will have multiple cores,
so if multiple processes are mapped to a given socket, the binding algorithm will assign
each process located on a socket to a unique core in a round-robin manner.
.
.PP
Alternatively, processes mapped by l2cache and then bound to socket will simply be bound
to all the processors in the socket where they are located. In this manner, users can
exert detailed control over relative rank location and binding.
.
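.PP
For example (hypothetically):
.

mpirun --map-by l2cache --bind-to socket ./a.out
.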
.PP
Finally, \fI--report-bindings\fP can be used to report bindings.
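.PP
For example (a hypothetical check of any of the preceding commands):
.

mpirun -np 4 --report-bindings --bind-to core ./a.out
.
.PP
reports where each process was bound.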
@ -921,30 +991,17 @@ Their usage is less convenient than that of \fImpirun\fP options.
On the other hand, MCA parameters can be set not only on the \fImpirun\fP
command line, but alternatively in a system or user mca-params.conf file
or as environment variables, as described in the MCA section below.
The correspondences are:
.

mpirun option          MCA parameter key               value

--map-by core          rmaps_base_schedule_policy      core
--map-by socket        rmaps_base_schedule_policy      socket
--bind-to core         orte_process_binding            core
--bind-to socket       orte_process_binding            socket
--bind-to none         orte_process_binding            none
.\" JMS I DON'T KNOW IF THESE ARE STILL THE RIGHT MCA PARAM NAMES
Some examples include:
.
.PP
The \fIorte_process_binding\fP value can also take on the
\fI:if-avail\fP attribute. This attribute means that processes
will be bound only if this is supported on the underlying
operating system. Without the attribute, if there is no
such support, the binding request results in an error.
For example, you could have
.
mpirun option          MCA parameter key               value

% cat $HOME/.openmpi/mca-params.conf
rmaps_base_schedule_policy = socket
orte_process_binding = socket:if-avail
--map-by core          rmaps_base_mapping_policy       core
--map-by socket        rmaps_base_mapping_policy       socket
--rank-by core         rmaps_base_ranking_policy       core
--bind-to core         hwloc_base_binding_policy       core
--bind-to socket       hwloc_base_binding_policy       socket
--bind-to none         hwloc_base_binding_policy       none
.
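.PP
For example (hypothetical equivalents per the table above), the following two
commands request the same binding policy:
.

mpirun --bind-to core ./a.out

mpirun -mca hwloc_base_binding_policy core ./a.out
.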
.
.SS Rankfiles
@ -1218,13 +1275,15 @@ is equivalent to
.SS Exported Environment Variables
.
All environment variables that are named in the form OMPI_* will automatically
be exported to new processes on the local and remote nodes.
The \fI\-x\fP option to \fImpirun\fP can be used to export specific environment
variables to the new processes. While the syntax of the \fI\-x\fP
option allows the definition of new variables, note that the parser
for this option is currently not very sophisticated - it does not even
be exported to new processes on the local and remote nodes. Environmental
parameters can also be set/forwarded to the new processes using the new MCA
parameter \fImca_base_env_list\fP. The \fI\-x\fP option to \fImpirun\fP has
been deprecated, but the syntax of the new MCA param follows that prior
example. While the syntax of the \fI\-x\fP option and MCA param
allows the definition of new variables, note that the parser
for these options is currently not very sophisticated - it does not even
understand quoted values. Users are advised to set variables in the
environment and use \fI\-x\fP to export them, not to define them.
environment and use the option to export them, not to define them.
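.PP
For example (with a hypothetical variable FOO set in the shell and then
exported to the launched processes):
.

% export FOO=bar
% mpirun -x FOO -np 2 ./a.out
.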
.
.
.