b4ae5d005f
Per suggestion by @bangerth, allow mpirun to execute as root if two
envars are set to specific values
Per conversation with @jsquyres, name the envars OMPI_ALLOW_RUN_AS_ROOT
and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
Fixes #4451
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7f1444d5f9)
.\" -*- nroff -*-
|
|
.\" Copyright (c) 2009-2018 Cisco Systems, Inc. All rights reserved.
|
|
.\" Copyright (c) 2008-2009 Sun Microsystems, Inc. All rights reserved.
|
|
.\" Copyright (c) 2017-2018 Intel, Inc. All rights reserved.
|
|
.\" Copyright (c) 2017 Los Alamos National Security, LLC. All rights
|
|
.\" reserved.
|
|
.\" $COPYRIGHT$
|
|
.\"
|
|
.\" Man page for ORTE's orterun command
|
|
.\"
|
|
.\" .TH name section center-footer left-footer center-header
|
|
.TH MPIRUN 1 "#OMPI_DATE#" "#PACKAGE_VERSION#" "#PACKAGE_NAME#"
|
|
.\" **************************
|
|
.\" Name Section
|
|
.\" **************************
|
|
.SH NAME
|
|
.
|
|
orterun, mpirun, mpiexec \- Execute serial and parallel jobs in Open MPI.
|
|
oshrun, shmemrun \- Execute serial and parallel jobs in Open SHMEM.
|
|
|
|
.B Note:
|
|
\fImpirun\fP, \fImpiexec\fP, and \fIorterun\fP are all synonyms for each
|
|
other as well as \fIoshrun\fP, \fIshmemrun\fP in case Open SHMEM is installed.
|
|
Using any of the names will produce the same behavior.
|
|
.
|
|
.\" **************************
|
|
.\" Synopsis Section
|
|
.\" **************************
|
|
.SH SYNOPSIS
|
|
.
|
|
.PP
|
|
Single Program Multiple Data (SPMD) Model:
|
|
|
|
.B mpirun
|
|
[ options ]
|
|
.B <program>
|
|
[ <args> ]
|
|
.P
|
|
|
|
Multiple Instruction Multiple Data (MIMD) Model:
|
|
|
|
.B mpirun
|
|
[ global_options ]
|
|
[ local_options1 ]
|
|
.B <program1>
|
|
[ <args1> ] :
|
|
[ local_options2 ]
|
|
.B <program2>
|
|
[ <args2> ] :
|
|
... :
|
|
[ local_optionsN ]
|
|
.B <programN>
|
|
[ <argsN> ]
|
|
.P
|
|
|
|
Note that in both models, invoking \fImpirun\fP via an absolute path
|
|
name is equivalent to specifying the \fI--prefix\fP option with a
|
|
\fI<dir>\fR value equivalent to the directory where \fImpirun\fR
|
|
resides, minus its last subdirectory. For example:
|
|
|
|
\fB%\fP /usr/local/bin/mpirun ...
|
|
|
|
is equivalent to
|
|
|
|
\fB%\fP mpirun --prefix /usr/local
|
|
|
|
.
|
|
.\" **************************
|
|
.\" Quick Summary Section
|
|
.\" **************************
|
|
.SH QUICK SUMMARY
|
|
.
|
|
If you are simply looking for how to run an MPI application, you
|
|
probably want to use a command line of the following form:
|
|
|
|
\fB%\fP mpirun [ -np X ] [ --hostfile <filename> ] <program>
|
|
|
|
This will run X copies of \fI<program>\fR in your current run-time
|
|
environment (if running under a supported resource manager, Open MPI's
|
|
\fImpirun\fR will usually automatically use the corresponding resource manager
|
|
process starter, as opposed to, for example, \fIrsh\fR or \fIssh\fR,
|
|
which require the use of a hostfile, or will default to running all X
|
|
copies on the localhost), scheduling (by default) in a round-robin fashion by
|
|
CPU slot. See the rest of this page for more details.
|
|
.P
|
|
Please note that mpirun automatically binds processes as of the start of the
|
|
v1.8 series. Three binding patterns are used in the absence of any further directives:
|
|
.TP 18
|
|
.B Bind to core:
|
|
when the number of processes is <= 2
|
|
.
|
|
.
|
|
.TP
|
|
.B Bind to socket:
|
|
when the number of processes is > 2
|
|
.
|
|
.
|
|
.TP
|
|
.B Bind to none:
|
|
when oversubscribed
|
|
.
|
|
.
|
|
.P
|
|
If your application uses threads, then you probably want to ensure that you are
|
|
either not bound at all (by specifying --bind-to none), or bound to multiple cores
|
|
using an appropriate binding level or specific number of processing elements per
|
|
application process.
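.PP
For example, either of the following illustrative command lines (the application
name and processing-element count are placeholders) avoids restricting a
multi-threaded application to a single core:

\fB%\fP mpirun --bind-to none -np 4 ./my_threaded_app

\fB%\fP mpirun --map-by socket:PE=2 -np 4 ./my_threaded_app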
|
|
.
|
|
.\" **************************
|
|
.\" Options Section
|
|
.\" **************************
|
|
.SH OPTIONS
|
|
.
|
|
.I mpirun
|
|
will send the name of the directory where it was invoked on the local
|
|
node to each of the remote nodes, and attempt to change to that
|
|
directory. See the "Current Working Directory" section below for further
|
|
details.
|
|
.\"
|
|
.\" Start options listing
|
|
.\" Indent 10 characters from start of first column to start of second column
|
|
.TP 10
|
|
.B <program>
|
|
The program executable. This is identified as the first non-recognized argument
|
|
to mpirun.
|
|
.
|
|
.
|
|
.TP
|
|
.B <args>
|
|
Pass these run-time arguments to every new process. These must always
|
|
be the last arguments to \fImpirun\fP. If an app context file is used,
|
|
\fI<args>\fP will be ignored.
|
|
.
|
|
.
|
|
.TP
|
|
.B -h\fR,\fP --help
|
|
Display help for this command
|
|
.
|
|
.
|
|
.TP
|
|
.B -q\fR,\fP --quiet
|
|
Suppress informative messages from orterun during application execution.
|
|
.
|
|
.
|
|
.TP
|
|
.B -v\fR,\fP --verbose
|
|
Be verbose
|
|
.
|
|
.
|
|
.TP
|
|
.B -V\fR,\fP --version
|
|
Print version number. If no other arguments are given, this will also
|
|
cause orterun to exit.
|
|
.
|
|
.
|
|
.TP
|
|
.B -N \fR<num>\fP
|
|
.br
|
|
Launch num processes per node on all allocated nodes (synonym for npernode).
|
|
.
|
|
.
|
|
.
|
|
.TP
|
|
.B -display-map\fR,\fP --display-map
|
|
Display a table showing the mapped location of each process prior to launch.
|
|
.
|
|
.
|
|
.
|
|
.TP
|
|
.B -display-allocation\fR,\fP --display-allocation
|
|
Display the detected resource allocation.
|
|
.
|
|
.
|
|
.
|
|
.TP
|
|
.B -output-proctable\fR,\fP --output-proctable
|
|
Output the debugger proctable after launch.
|
|
.
|
|
.
|
|
.
|
|
.TP
|
|
.B -dvm\fR,\fP --dvm
|
|
Create a persistent distributed virtual machine (DVM).
|
|
.
|
|
.
|
|
.
|
|
.TP
|
|
.B -max-vm-size\fR,\fP --max-vm-size \fR<size>\fP
|
|
Number of processes to run.
|
|
.
|
|
.
|
|
.
|
|
.TP
|
|
.B -novm\fR,\fP --novm
|
|
Execute without creating an allocation-spanning virtual machine (only start
|
|
daemons on nodes hosting application procs).
|
|
.
|
|
.
|
|
.
|
|
.TP
|
|
.B -hnp\fR,\fP --hnp \fR<arg0>\fP
|
|
Specify the URI of the Head Node Process (HNP), or the name of the file (specified as
|
|
file:filename) that contains that info.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
Use one of the following options to specify which hosts (nodes) of the cluster to run on. Note
|
|
that as of the start of the v1.8 release, mpirun will launch a daemon onto each host in the
|
|
allocation (as modified by the following options) at the very beginning of execution, regardless
|
|
of whether or not application processes will eventually be mapped to execute there. This is
|
|
done to allow collection of hardware topology information from the remote nodes, thus allowing
|
|
us to map processes against known topology. However, it is a change from the behavior in prior releases
|
|
where daemons were only launched \fRafter\fP mapping was complete, and thus only occurred on
|
|
nodes where application processes would actually be executing.
|
|
.
|
|
.
|
|
.TP
|
|
.B -H\fR,\fP -host\fR,\fP --host \fR<host1,host2,...,hostN>\fP
|
|
List of hosts on which to invoke processes.
|
|
.
|
|
.
|
|
.TP
|
|
.B -hostfile\fR,\fP --hostfile \fR<hostfile>\fP
|
|
Provide a hostfile to use.
|
|
.\" JJH - Should have man page for how to format a hostfile properly.
|
|
.
|
|
.
|
|
.TP
|
|
.B -default-hostfile\fR,\fP --default-hostfile \fR<hostfile>\fP
|
|
Provide a default hostfile.
|
|
.
|
|
.
|
|
.TP
|
|
.B -machinefile\fR,\fP --machinefile \fR<machinefile>\fP
|
|
Synonym for \fI-hostfile\fP.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.TP
|
|
.B -cpu-set\fR,\fP --cpu-set \fR<list>\fP
|
|
Restrict launched processes to the specified logical cpus on each node (comma-separated
|
|
list). Note that the binding options will still apply within the specified envelope - e.g.,
|
|
you can elect to bind each process to only one cpu within the specified cpu set.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
The following options specify the number of processes to launch. Note that none
|
|
of the options imply a particular binding policy - e.g., requesting N processes
|
|
for each socket does not imply that the processes will be bound to the socket.
|
|
.
|
|
.
|
|
.TP
|
|
.B -c\fR,\fP -n\fR,\fP --n\fR,\fP -np \fR<#>\fP
|
|
Run this many copies of the program on the given nodes. This option
|
|
indicates that the specified file is an executable program and not an
|
|
application context. If no value is provided for the number of copies to
|
|
execute (i.e., neither the "-np" nor its synonyms are provided on the command
|
|
line), Open MPI will automatically execute a copy of the program on
|
|
each process slot (see below for description of a "process slot"). This
|
|
feature, however, can only be used in the SPMD model and will return an
|
|
error (without beginning execution of the application) otherwise.
|
|
.
|
|
.
|
|
.TP
|
|
.B --map-by ppr:N:<object>
|
|
Launch N times the number of objects of the specified type on each node.
|
|
.
|
|
.
|
|
.TP
|
|
.B -npersocket\fR,\fP --npersocket \fR<#persocket>\fP
|
|
On each node, launch this many processes times the number of processor
|
|
sockets on the node.
|
|
The \fI-npersocket\fP option also turns on the \fI-bind-to-socket\fP option.
|
|
(deprecated in favor of --map-by ppr:n:socket)
|
|
.
|
|
.
|
|
.TP
|
|
.B -npernode\fR,\fP --npernode \fR<#pernode>\fP
|
|
On each node, launch this many processes.
|
|
(deprecated in favor of --map-by ppr:n:node)
|
|
.
|
|
.
|
|
.TP
|
|
.B -pernode\fR,\fP --pernode
|
|
On each node, launch one process -- equivalent to \fI-npernode\fP 1.
|
|
(deprecated in favor of --map-by ppr:1:node)
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To map processes:
|
|
.
|
|
.
|
|
.TP
|
|
.B --map-by \fR<foo>\fP
|
|
Map to the specified object, defaults to \fIsocket\fP. Supported options
|
|
include slot, hwthread, core, L1cache, L2cache, L3cache, socket, numa,
|
|
board, node, sequential, distance, and ppr. Any object can include
|
|
modifiers by adding a \fR:\fP and any combination of PE=n (bind n
|
|
processing elements to each proc), SPAN (load
|
|
balance the processes across the allocation), OVERSUBSCRIBE (allow
|
|
more processes on a node than processing elements), and NOOVERSUBSCRIBE.
|
|
This includes PPR, where the pattern would be terminated by another colon
|
|
to separate it from the modifiers.
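.PP
As an illustrative example (the program name is a placeholder), the following
maps one process to each socket and reserves two processing elements per process:

\fB%\fP mpirun --map-by ppr:1:socket:PE=2 ./a.out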
|
|
.
|
|
.TP
|
|
.B -bycore\fR,\fP --bycore
|
|
Map processes by core (deprecated in favor of --map-by core)
|
|
.
|
|
.TP
|
|
.B -byslot\fR,\fP --byslot
|
|
Map and rank processes round-robin by slot.
|
|
.
|
|
.TP
|
|
.B -nolocal\fR,\fP --nolocal
|
|
Do not run any copies of the launched application on the same node as
|
|
orterun is running. This option will override listing the localhost
|
|
with \fB--host\fR or any other host-specifying mechanism.
|
|
.
|
|
.TP
|
|
.B -nooversubscribe\fR,\fP --nooversubscribe
|
|
Do not oversubscribe any nodes; error (without starting any processes)
|
|
if the requested number of processes would cause oversubscription.
|
|
This option implicitly sets "max_slots" equal to the "slots" value for
|
|
each node. (Enabled by default).
|
|
.
|
|
.TP
|
|
.B -oversubscribe\fR,\fP --oversubscribe
|
|
Nodes are allowed to be oversubscribed, even on a managed system, and
processing elements may be overloaded.
|
|
.
|
|
.TP
|
|
.B -bynode\fR,\fP --bynode
|
|
Launch processes one per node, cycling by node in a round-robin
|
|
fashion. This spreads processes evenly among nodes and assigns
|
|
MPI_COMM_WORLD ranks in a round-robin, "by node" manner.
|
|
.
|
|
.TP
|
|
.B -cpu-list\fR,\fP --cpu-list \fR<cpus>\fP
|
|
Comma-delimited list of processor IDs to which to bind processes
|
|
[default=NULL]. Processor IDs are interpreted as hwloc logical core
|
|
IDs. Run the hwloc \fIlstopo(1)\fR command to see a list of available
|
|
cores and their logical IDs.
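.PP
An illustrative example (see also the "Mapping, Ranking, and Binding" section
below) that binds the first process on a node to logical core 0, the second to
core 2, and the third to core 5:

\fB%\fP mpirun --cpu-list "0,2,5" --bind-to cpu-list:ordered -np 3 ./a.out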
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To order processes' ranks in MPI_COMM_WORLD:
|
|
.
|
|
.
|
|
.TP
|
|
.B --rank-by \fR<foo>\fP
|
|
Rank in round-robin fashion according to the specified object,
|
|
defaults to \fIslot\fP. Supported options
|
|
include slot, hwthread, core, L1cache, L2cache, L3cache,
|
|
socket, numa, board, and node.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
For process binding:
|
|
.
|
|
.TP
|
|
.B --bind-to \fR<foo>\fP
|
|
Bind processes to the specified object, defaults to \fIcore\fP. Supported options
|
|
include slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, cpu-list, and none.
|
|
.
|
|
.TP
|
|
.B -cpus-per-proc\fR,\fP --cpus-per-proc \fR<#perproc>\fP
|
|
Bind each process to the specified number of cpus.
|
|
(deprecated in favor of --map-by <obj>:PE=n)
|
|
.
|
|
.TP
|
|
.B -cpus-per-rank\fR,\fP --cpus-per-rank \fR<#perrank>\fP
|
|
Alias for \fI-cpus-per-proc\fP.
|
|
(deprecated in favor of --map-by <obj>:PE=n)
|
|
.
|
|
.TP
|
|
.B -bind-to-core\fR,\fP --bind-to-core
|
|
Bind processes to cores (deprecated in favor of --bind-to core)
|
|
.
|
|
.TP
|
|
.B -bind-to-socket\fR,\fP --bind-to-socket
|
|
Bind processes to processor sockets (deprecated in favor of --bind-to socket)
|
|
.
|
|
.TP
|
|
.B -report-bindings\fR,\fP --report-bindings
|
|
Report any bindings for launched processes.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
For rankfiles:
|
|
.
|
|
.
|
|
.TP
|
|
.B -rf\fR,\fP --rankfile \fR<rankfile>\fP
|
|
Provide a rankfile file.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To manage standard I/O:
|
|
.
|
|
.
|
|
.TP
|
|
.B -output-filename\fR,\fP --output-filename \fR<filename>\fP
|
|
Redirect the stdout, stderr, and stddiag of all processes to a process-unique version of
|
|
the specified filename. Any directories in the filename will automatically be created.
|
|
Each output file will consist of filename.id, where the id will be the
|
|
processes' rank in MPI_COMM_WORLD, left-filled with
|
|
zeros for correct ordering in listings. A relative path value will be converted to an
|
|
absolute path based on the cwd where mpirun is executed. Note that this \fIwill not\fP work
|
|
on environments where the file system on compute nodes differs from that where mpirun
|
|
is executed.
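.PP
An illustrative invocation (the path is a placeholder); each rank writes its
output to its own rank-suffixed file under the given path, as described above:

\fB%\fP mpirun -np 4 --output-filename /tmp/myjob/out ./a.out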
|
|
.
|
|
.
|
|
.TP
|
|
.B -stdin\fR,\fP --stdin\fR <rank> \fP
|
|
The MPI_COMM_WORLD rank of the process that is to receive stdin. The
|
|
default is to forward stdin to MPI_COMM_WORLD rank 0, but this option
|
|
can be used to forward stdin to any process. It is also acceptable to
|
|
specify \fInone\fP, indicating that no processes are to receive stdin.
|
|
.
|
|
.
|
|
.TP
|
|
.B -merge-stderr-to-stdout\fR,\fP --merge-stderr-to-stdout
|
|
Merge stderr to stdout for each process.
|
|
.
|
|
.
|
|
.TP
|
|
.B -tag-output\fR,\fP --tag-output
|
|
Tag each line of output to stdout, stderr, and stddiag with \fB[jobid, MCW_rank]<stdxxx>\fP
|
|
indicating the process jobid and MPI_COMM_WORLD rank of the process that generated the output,
|
|
and the channel which generated it.
|
|
.
|
|
.
|
|
.TP
|
|
.B -timestamp-output\fR,\fP --timestamp-output
|
|
Timestamp each line of output to stdout, stderr, and stddiag.
|
|
.
|
|
.
|
|
.TP
|
|
.B -xml\fR,\fP --xml
|
|
Provide all output to stdout, stderr, and stddiag in XML format.
|
|
.
|
|
.
|
|
.TP
|
|
.B -xml-file\fR,\fP --xml-file \fR<filename>\fP
|
|
Provide all output in XML format to the specified file.
|
|
.
|
|
.
|
|
.TP
|
|
.B -xterm\fR,\fP --xterm \fR<ranks>\fP
|
|
Display the output from the processes identified by their
|
|
MPI_COMM_WORLD ranks in separate xterm windows. The ranks are specified
|
|
as a comma-separated list of ranges, with a -1 indicating all. A separate
|
|
window will be created for each specified process.
|
|
.B Note:
|
|
xterm will normally terminate the window upon termination of the process running
|
|
within it. However, by adding a "!" to the end of the list of specified ranks,
|
|
the proper options will be provided to ensure that xterm keeps the window open
|
|
\fIafter\fP the process terminates, thus allowing you to see the process' output.
|
|
Each xterm window will subsequently need to be manually closed.
|
|
.B Note:
|
|
In some environments, xterm may require that the executable be in the user's
|
|
path, or be specified in absolute or relative terms. Thus, it may be necessary
|
|
to specify a local executable as "./foo" instead of just "foo". If xterm fails to
|
|
find the executable, mpirun will hang, but still respond correctly to a ctrl-c.
|
|
If this happens, please check that the executable is being specified correctly
|
|
and try again.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To manage files and runtime environment:
|
|
.
|
|
.
|
|
.TP
|
|
.B -path\fR,\fP --path \fR<path>\fP
|
|
<path> that will be used when attempting to locate the requested
|
|
executables. This is used prior to using the local PATH setting.
|
|
.
|
|
.
|
|
.TP
|
|
.B --prefix \fR<dir>\fP
|
|
Prefix directory that will be used to set the \fIPATH\fR and
|
|
\fILD_LIBRARY_PATH\fR on the remote node before invoking Open MPI or
|
|
the target process. See the "Remote Execution" section, below.
|
|
.
|
|
.
|
|
.TP
|
|
.B --noprefix
|
|
Disable the automatic --prefix behavior
|
|
.
|
|
.
|
|
.TP
|
|
.B -s\fR,\fP --preload-binary
|
|
Copy the specified executable(s) to remote machines prior to starting remote processes. The
|
|
executables will be copied to the Open MPI session directory and will be deleted upon
|
|
completion of the job.
|
|
.
|
|
.
|
|
.TP
|
|
.B --preload-files \fR<files>\fP
|
|
Preload the comma-separated list of files to the current working directory of the remote
|
|
machines where processes will be launched prior to starting those processes.
|
|
.
|
|
.
|
|
.TP
|
|
.B -set-cwd-to-session-dir\fR,\fP --set-cwd-to-session-dir
|
|
Set the working directory of the started processes to their session directory.
|
|
.
|
|
.
|
|
.TP
|
|
.B -wd \fR<dir>\fP
|
|
Synonym for \fI-wdir\fP.
|
|
.
|
|
.
|
|
.TP
|
|
.B -wdir \fR<dir>\fP
|
|
Change to the directory <dir> before the user's program executes.
|
|
See the "Current Working Directory" section for notes on relative paths.
|
|
.B Note:
|
|
If the \fI-wdir\fP option appears both on the command line and in an
|
|
application context, the context will take precedence over the command
|
|
line. Thus, if the path to the desired wdir is different
|
|
on the backend nodes, then it must be specified as an absolute path that
|
|
is correct for the backend node.
|
|
.
|
|
.
|
|
.TP
|
|
.B -x \fR<env>\fP
|
|
Export the specified environment variables to the remote nodes before
|
|
executing the program. Only one environment variable can be specified
|
|
per \fI-x\fP option. Existing environment variables can be specified
|
|
or new variable names specified with corresponding values. For
|
|
example:
|
|
\fB%\fP mpirun -x DISPLAY -x OFILE=/tmp/out ...
|
|
|
|
The parser for the \fI-x\fP option is not very sophisticated; it does
|
|
not even understand quoted values. Users are advised to set variables
|
|
in the environment, and then use \fI-x\fP to export (not define) them.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
Setting MCA parameters:
|
|
.
|
|
.
|
|
.TP
|
|
.B -gmca\fR,\fP --gmca \fR<key> <value>\fP
|
|
Pass global MCA parameters that are applicable to all contexts. \fI<key>\fP is
|
|
the parameter name; \fI<value>\fP is the parameter value.
|
|
.
|
|
.
|
|
.TP
|
|
.B -mca\fR,\fP --mca \fR<key> <value>\fP
|
|
Send arguments to various MCA modules. See the "MCA" section, below.
|
|
.
|
|
.
|
|
.TP
|
|
.B -am \fR<arg0>\fP
|
|
Aggregate MCA parameter set file list.
|
|
.
|
|
.
|
|
.TP
|
|
.B -tune\fR,\fP --tune \fR<tune_file>\fP
|
|
Specify a tune file to set arguments for various MCA modules and environment variables.
|
|
See the "Setting MCA parameters and environment variables from file" section, below.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
For debugging:
|
|
.
|
|
.
|
|
.TP
|
|
.B -debug\fR,\fP --debug
|
|
Invoke the user-level debugger indicated by the \fIorte_base_user_debugger\fP
|
|
MCA parameter.
|
|
.
|
|
.
|
|
.TP
|
|
.B --get-stack-traces
|
|
When paired with the
|
|
.B --timeout
|
|
option,
|
|
.I mpirun
|
|
will obtain and print out stack traces from all launched processes
|
|
that are still alive when the timeout expires. Note that obtaining
|
|
stack traces can take a little time and produce a lot of output,
|
|
especially for large process-count jobs.
|
|
.
|
|
.
|
|
.TP
|
|
.B -debugger\fR,\fP --debugger \fR<args>\fP
|
|
Sequence of debuggers to search for when \fI--debug\fP is used (i.e.
|
|
a synonym for \fIorte_base_user_debugger\fP MCA parameter).
|
|
.
|
|
.
|
|
.TP
|
|
.B --timeout \fR<seconds>
|
|
The maximum number of seconds that
|
|
.I mpirun
|
|
(also known as
|
|
.I mpiexec\fR,\fI oshrun\fR,\fI orterun\fR,\fI
|
|
etc.)
|
|
will run. After this many seconds,
|
|
.I mpirun
|
|
will abort the launched job and exit with a non-zero exit status.
|
|
Using
|
|
.B --timeout
|
|
can be also useful when combined with the
|
|
.B --get-stack-traces
|
|
option.
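.PP
For example, the following illustrative command aborts the job after five
minutes and prints stack traces from any processes still alive at that point:

\fB%\fP mpirun --timeout 300 --get-stack-traces -np 16 ./a.out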
|
|
.
|
|
.
|
|
.TP
|
|
.B -tv\fR,\fP --tv
|
|
Launch processes under the TotalView debugger.
|
|
Deprecated backwards compatibility flag. Synonym for \fI--debug\fP.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
There are also other options:
|
|
.
|
|
.
|
|
.TP
|
|
.B --allow-run-as-root
|
|
Allow
|
|
.I mpirun
|
|
to run when executed by the root user
|
|
.RI ( mpirun
|
|
defaults to aborting when launched as the root user). Be sure to see
|
|
the
|
|
.I Running as root
|
|
section, below, for more detail.
|
|
.
|
|
.
|
|
.TP
|
|
.B --app \fR<appfile>\fP
|
|
Provide an appfile, ignoring all other command line options.
|
|
.
|
|
.
|
|
.TP
|
|
.B -cf\fR,\fP --cartofile \fR<cartofile>\fP
|
|
Provide a cartography file.
|
|
.
|
|
.
|
|
.TP
|
|
.B -continuous\fR,\fP --continuous
|
|
Job is to run until explicitly terminated.
|
|
.
|
|
.
|
|
.TP
|
|
.B -disable-recovery\fR,\fP --disable-recovery
|
|
Disable recovery (resets all recovery options to off).
|
|
.
|
|
.
|
|
.TP
|
|
.B -do-not-launch\fR,\fP --do-not-launch
|
|
Perform all necessary operations to prepare to launch the application, but do not actually launch it.
|
|
.
|
|
.
|
|
.TP
|
|
.B -do-not-resolve\fR,\fP --do-not-resolve
|
|
Do not attempt to resolve interfaces.
|
|
.
|
|
.
|
|
.TP
|
|
.B -enable-recovery\fR,\fP --enable-recovery
|
|
Enable recovery from process failure [Default = disabled].
|
|
.
|
|
.
|
|
.TP
|
|
.B -index-argv-by-rank\fR,\fP --index-argv-by-rank
|
|
Uniquely index argv[0] for each process using its rank.
|
|
.
|
|
.
|
|
.TP
|
|
.B -leave-session-attached\fR,\fP --leave-session-attached
|
|
Do not detach OmpiRTE daemons used by this application. This allows error messages from the daemons
|
|
as well as the underlying environment (e.g., when failing to launch a daemon) to be output.
|
|
.
|
|
.
|
|
.TP
|
|
.B -max-restarts\fR,\fP --max-restarts \fR<num>\fP
|
|
Max number of times to restart a failed process.
|
|
.
|
|
.
|
|
.TP
|
|
.B -ompi-server\fR,\fP --ompi-server \fR<uri or file>\fP
|
|
Specify the URI of the Open MPI server (or the mpirun to be used as the server),
|
|
the name of the file (specified as file:filename) that contains that info, or
|
|
the PID (specified as pid:#) of the mpirun to be used as the server.
|
|
The Open MPI server is used to support multi-application data exchange via
|
|
the MPI-2 MPI_Publish_name and MPI_Lookup_name functions.
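.PP
As an illustrative sketch (file and program names are placeholders), one mpirun
can write its URI to a file and a second job can then use it as the server:

\fB%\fP mpirun --report-uri /tmp/ompi-server.uri -np 1 ./server &

\fB%\fP mpirun --ompi-server file:/tmp/ompi-server.uri -np 1 ./client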
|
|
.
|
|
.
|
|
.TP
|
|
.B -personality\fR,\fP --personality \fR<list>\fP
|
|
Comma-separated list of programming model, languages, and containers being used (default="ompi").
|
|
.
|
|
.
|
|
.TP
|
|
.B --ppr \fR<list>\fP
|
|
Comma-separated list of number of processes on a given resource type [default: none].
|
|
.
|
|
.
|
|
.TP
|
|
.B -report-child-jobs-separately\fR,\fP --report-child-jobs-separately
|
|
Return the exit status of the primary job only.
|
|
.
|
|
.
|
|
.TP
|
|
.B -report-events\fR,\fP --report-events \fR<URI>\fP
|
|
Report events to a tool listening at the specified URI.
|
|
.
|
|
.
|
|
.TP
|
|
.B -report-pid\fR,\fP --report-pid \fR<channel>\fP
|
|
Print out mpirun's PID during startup. The channel must be either a '-' to indicate
|
|
that the pid is to be output to stdout, a '+' to indicate that the pid is to be
|
|
output to stderr, or a filename to which the pid is to be written.
|
|
.
|
|
.
|
|
.TP
|
|
.B -report-uri\fR,\fP --report-uri \fR<channel>\fP
|
|
Print out mpirun's URI during startup. The channel must be either a '-' to indicate
|
|
that the URI is to be output to stdout, a '+' to indicate that the URI is to be
|
|
output to stderr, or a filename to which the URI is to be written.
|
|
.
|
|
.
|
|
.TP
|
|
.B -show-progress\fR,\fP --show-progress
|
|
Output a brief periodic report on launch progress.
|
|
.
|
|
.
|
|
.TP
|
|
.B -terminate\fR,\fP --terminate
|
|
Terminate the DVM.
|
|
.
|
|
.
|
|
.TP
|
|
.B -use-hwthread-cpus\fR,\fP --use-hwthread-cpus
|
|
Use hardware threads as independent cpus.
|
|
.
|
|
.
|
|
.TP
|
|
.B -use-regexp\fR,\fP --use-regexp
|
|
Use regular expressions for launch.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
The following options are useful for developers; they are not generally
|
|
useful to most ORTE and/or MPI users:
|
|
.
|
|
.TP
|
|
.B -d\fR,\fP --debug-devel
|
|
Enable debugging of the OmpiRTE (the run-time layer in Open MPI).
|
|
This is not generally useful for most users.
|
|
.
|
|
.
|
|
.TP
|
|
.B --debug-daemons
|
|
Enable debugging of any OmpiRTE daemons used by this application.
|
|
.
|
|
.
|
|
.TP
|
|
.B --debug-daemons-file
|
|
Enable debugging of any OmpiRTE daemons used by this application, storing
|
|
output in files.
|
|
.
|
|
.
|
|
.TP
|
|
.B -display-devel-allocation\fR,\fP --display-devel-allocation
|
|
Display a detailed list of the allocation being used by this job.
|
|
.
|
|
.
|
|
.TP
|
|
.B -display-devel-map\fR,\fP --display-devel-map
|
|
Display a more detailed table showing the mapped location of each process prior to launch.
|
|
.
|
|
.
|
|
.TP
|
|
.B -display-diffable-map\fR,\fP --display-diffable-map
|
|
Display a diffable process map just before launch.
|
|
.
|
|
.
|
|
.TP
|
|
.B -display-topo\fR,\fP --display-topo
|
|
Display the topology as part of the process map just before launch.
|
|
.
|
|
.
|
|
.TP
|
|
.B -launch-agent\fR,\fP --launch-agent
|
|
Name of the executable that is to be used to start processes on the remote nodes. The default
|
|
is "orted". This option can be used to test new daemon concepts, or to pass options back to the
|
|
daemons without having mpirun itself see them. For example, specifying a launch agent of
|
|
\fRorted -mca odls_base_verbose 5\fR allows the developer to ask the orted for debugging output
|
|
without clutter from mpirun itself.
|
|
.
|
|
.
|
|
.TP
|
|
.B --report-state-on-timeout
|
|
When paired with the
|
|
.B --timeout
|
|
command line option, report the run-time subsystem state of each
|
|
process when the timeout expires.
|
|
.
|
|
.
|
|
.P
|
|
There may be other options listed with \fImpirun --help\fP.
|
|
.
|
|
.
|
|
.SS Environment Variables
|
|
.
|
|
.TP
|
|
.B MPIEXEC_TIMEOUT
|
|
Synonym for the
|
|
.B --timeout
|
|
command line option.
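.PP
For example, the following illustrative invocation aborts the job after 60 seconds:

\fB%\fP MPIEXEC_TIMEOUT=60 mpirun -np 4 ./a.out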
|
|
.
|
|
.
|
|
.\" **************************
|
|
.\" Description Section
|
|
.\" **************************
|
|
.SH DESCRIPTION
|
|
.
|
|
One invocation of \fImpirun\fP starts an MPI application running under Open
|
|
MPI. If the application is single program multiple data (SPMD), the application
|
|
can be specified on the \fImpirun\fP command line.
|
|
|
|
If the application is multiple instruction multiple data (MIMD), comprising
|
|
multiple programs, the set of programs and argument can be specified in one of
|
|
two ways: Extended Command Line Arguments, and Application Context.
|
|
.PP
|
|
An application context describes the MIMD program set including all arguments
|
|
in a separate file.
|
|
.\" See appcontext(5) for a description of the application context syntax.
|
|
This file essentially contains multiple \fImpirun\fP command lines, less the
|
|
command name itself. The ability to specify different options for different
|
|
instantiations of a program is another reason to use an application context.
|
|
.PP
|
|
Extended command line arguments allow for the description of the application
|
|
layout on the command line using colons (\fI:\fP) to separate the specification
|
|
of programs and arguments. Some options are globally set across all specified
|
|
programs (e.g. --hostfile), while others are specific to a single program
|
|
(e.g. -np).
|
|
.
|
|
.
|
|
.
|
|
.SS Specifying Host Nodes
|
|
.
|
|
Host nodes can be identified on the \fImpirun\fP command line with the \fI-host\fP
|
|
option or in a hostfile.
|
|
.
|
|
.PP
|
|
For example,
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,aa,bb ./a.out
|
|
launches two processes on node aa and one on bb.
|
|
.
|
|
.PP
|
|
Or, consider the hostfile
|
|
.
|
|
|
|
\fB%\fP cat myhostfile
|
|
aa slots=2
|
|
bb slots=2
|
|
cc slots=2
|
|
|
|
.
|
|
.PP
|
|
Here, we list both the host names (aa, bb, and cc) and how many "slots"
|
|
there are for each. Slots indicate how many processes can potentially execute
|
|
on a node. For best performance, the number of slots may be chosen to be the
|
|
number of cores on the node or the number of processor sockets. If the hostfile
|
|
does not provide slots information, Open MPI will attempt to discover the number
|
|
of cores (or hwthreads, if the --use-hwthread-cpus option is set) and set the
|
|
number of slots to that value. This default behavior also occurs when specifying
|
|
the \fI-host\fP option with a single hostname. Thus, the command
|
|
.
|
|
.TP 4
|
|
mpirun -H aa ./a.out
|
|
launches a number of processes equal to the number of cores on node aa.
|
|
.
|
|
.PP
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile ./a.out
|
|
will launch two processes on each of the three nodes.
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -host aa ./a.out
|
|
will launch two processes, both on node aa.
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -host dd ./a.out
|
|
will find no hosts to run on and abort with an error.
|
|
That is, the specified host dd is not in the specified hostfile.
|
|
.
|
|
.PP
|
|
When running under resource managers (e.g., SLURM, Torque, etc.),
|
|
Open MPI will obtain both the hostnames and the number of slots directly
|
|
from the resource manager.
|
|
.
|
|
.SS Specifying Number of Processes
|
|
.
|
|
As we have just seen, the number of processes to run can be set using the
|
|
hostfile. Other mechanisms exist.
|
|
.
|
|
.PP
|
|
The number of processes launched can be specified as a multiple of the
|
|
number of nodes or processor sockets available. For example,
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -npersocket 2 ./a.out
|
|
launches processes 0-3 on node aa and processes 4-7 on node bb,
|
|
where aa and bb are both dual-socket nodes.
|
|
The \fI-npersocket\fP option also turns on the \fI-bind-to-socket\fP option,
|
|
which is discussed in a later section.
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -npernode 2 ./a.out
|
|
launches processes 0-1 on node aa and processes 2-3 on node bb.
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -npernode 1 ./a.out
|
|
launches one process per host node.
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -pernode ./a.out
|
|
is the same as \fI-npernode\fP 1.
|
|
.
|
|
.
|
|
.PP
|
|
Another alternative is to specify the number of processes with the
|
|
\fI-np\fP option. Consider now the hostfile
|
|
.
|
|
|
|
\fB%\fP cat myhostfile
|
|
aa slots=4
|
|
bb slots=4
|
|
cc slots=4
|
|
|
|
.
|
|
.PP
|
|
Now,
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -np 6 ./a.out
|
|
will launch processes 0-3 on node aa and processes 4-5 on node bb. The remaining
|
|
slots in the hostfile will not be used since the \fI-np\fP option indicated
|
|
that only 6 processes should be launched.
|
|
.
|
|
.SS Mapping Processes to Nodes: Using Policies
|
|
.
|
|
The examples above illustrate the default mapping of processes
|
|
to nodes. This mapping can also be controlled with various
|
|
\fImpirun\fP options that describe mapping policies.
|
|
.
|
|
.
|
|
.PP
|
|
Consider the same hostfile as above, again with \fI-np\fP 6:
|
|
.
|
|
|
|
                             node aa      node bb      node cc

      mpirun                 0 1          2 3          4 5

      mpirun --map-by node   0 3          1 4          2 5

      mpirun -nolocal        0 1          2 3          4 5
|
|
.
|
|
.PP
|
|
The \fI--map-by node\fP option will load balance the processes across
|
|
the available nodes, numbering each process in a round-robin fashion.
|
|
.
|
|
.PP
|
|
The \fI-nolocal\fP option prevents any processes from being mapped onto the
|
|
local host (in this case node aa). While \fImpirun\fP typically consumes
|
|
few system resources, \fI-nolocal\fP can be helpful for launching very
|
|
large jobs where \fImpirun\fP may actually need to use noticeable amounts
|
|
of memory and/or processing time.
|
|
.
|
|
.PP
|
|
Just as \fI-np\fP can specify fewer processes than there are slots, it can
|
|
also oversubscribe the slots. For example, with the same hostfile:
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -np 14 ./a.out
|
|
will launch processes 0-3 on node aa, 4-7 on bb, and 8-11 on cc. It will
|
|
then add the remaining two processes to whichever nodes it chooses.
|
|
.
|
|
.PP
|
|
One can also specify limits to oversubscription. For example, with the same
|
|
hostfile:
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -np 14 -nooversubscribe ./a.out
|
|
will produce an error since \fI-nooversubscribe\fP prevents oversubscription.
|
|
.
|
|
.PP
|
|
Limits to oversubscription can also be specified in the hostfile itself:
|
|
.
|
|
% cat myhostfile
|
|
aa slots=4 max_slots=4
|
|
bb max_slots=4
|
|
cc slots=4
|
|
.
|
|
.PP
|
|
The \fImax_slots\fP field specifies such a limit. When it does, the
|
|
\fIslots\fP value defaults to the limit. Now:
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -np 14 ./a.out
|
|
causes the first 12 processes to be launched as before, but the remaining
|
|
two processes will be forced onto node cc. The other two nodes are
|
|
protected by the hostfile against oversubscription by this job.
|
|
.
|
|
.PP
|
|
Using the \fI--nooversubscribe\fR option can be helpful since Open MPI
|
|
currently does not get "max_slots" values from the resource manager.
|
|
.
|
|
.PP
|
|
Of course, \fI-np\fP can also be used with the \fI-H\fP or \fI-host\fP
|
|
option. For example,
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -np 8 ./a.out
|
|
launches 8 processes. Since only two hosts are specified, after the first
|
|
two processes are mapped, one to aa and one to bb, the remaining processes
|
|
oversubscribe the specified hosts.
|
|
.
|
|
.PP
|
|
And here is a MIMD example:
|
|
.
|
|
.TP 4
|
|
mpirun -H aa -np 1 hostname : -H bb,cc -np 2 uptime
|
|
will launch process 0 running \fIhostname\fP on node aa and processes 1 and 2
|
|
each running \fIuptime\fP on nodes bb and cc, respectively.
|
|
.
|
|
.SS Mapping, Ranking, and Binding: Oh My!
|
|
.
|
|
Open MPI employs a three-phase procedure for assigning process locations and
|
|
ranks:
|
|
.
|
|
.TP 10
|
|
\fBmapping\fP
|
|
Assigns a default location to each process
|
|
.
|
|
.TP 10
|
|
\fBranking\fP
|
|
Assigns an MPI_COMM_WORLD rank value to each process
|
|
.
|
|
.TP 10
|
|
\fBbinding\fP
|
|
Constrains each process to run on specific processors
|
|
.
|
|
.PP
|
|
The \fImapping\fP step is used to assign a default location to each process
|
|
based on the mapper being employed. Mapping by slot, node, and sequentially results
|
|
in the assignment of the processes to the node level. In contrast, mapping by object allows
|
|
the mapper to assign the process to an actual object on each node.
|
|
.
|
|
.PP
|
|
\fBNote:\fP the location assigned to the process is independent of where it will be bound - the
|
|
assignment is used solely as input to the binding algorithm.
|
|
.
|
|
.PP
|
|
The mapping of processes to nodes can be defined not just
|
|
with general policies but also, if necessary, using arbitrary mappings
|
|
that cannot be described by a simple policy. One can use the "sequential
|
|
mapper," which reads the hostfile line by line, assigning processes
|
|
to nodes in whatever order the hostfile specifies. Use the
|
|
\fI-mca rmaps seq\fP option. For example, using the same hostfile
|
|
as before:
|
|
.
|
|
.PP
|
|
mpirun -hostfile myhostfile -mca rmaps seq ./a.out
|
|
.
|
|
.PP
|
|
will launch three processes, one on each of nodes aa, bb, and cc, respectively.
|
|
The slot counts don't matter; one process is launched per line on
|
|
whatever node is listed on the line.
|
|
.
|
|
.PP
|
|
Another way to specify arbitrary mappings is with a rankfile, which
|
|
gives you detailed control over process binding as well. Rankfiles
|
|
are discussed below.
|
|
.
|
|
.PP
|
|
The second phase focuses on the \fIranking\fP of the process within
|
|
the job's MPI_COMM_WORLD. Open MPI
|
|
separates this from the mapping procedure to allow more flexibility in the
|
|
relative placement of MPI processes. This is best illustrated by considering the
|
|
following two cases where we used the --map-by ppr:2:socket option:
|
|
.
|
|
.PP
|
|
                             node aa       node bb

      rank-by core           0 1 ! 2 3     4 5 ! 6 7

      rank-by socket         0 2 ! 1 3     4 6 ! 5 7

      rank-by socket:span    0 4 ! 1 5     2 6 ! 3 7
|
|
.
|
|
.PP
|
|
Ranking by core and by slot provide the identical result - a simple
|
|
progression of MPI_COMM_WORLD ranks across each node. Ranking by
|
|
socket does a round-robin ranking within each node until all processes
|
|
have been assigned an MCW rank, and then progresses to the next
|
|
node. Adding the \fIspan\fP modifier to the ranking directive causes
|
|
the ranking algorithm to treat the entire allocation as a single
|
|
entity - thus, the MCW ranks are assigned across all sockets before
|
|
circling back around to the beginning.
|
|
.
|
|
.PP
|
|
The \fIbinding\fP phase actually binds each process to a given set of processors. This can
|
|
improve performance if the operating system is placing processes
|
|
suboptimally. For example, it might oversubscribe some multi-core
|
|
processor sockets, leaving other sockets idle; this can lead
|
|
processes to contend unnecessarily for common resources. Or, it
|
|
might spread processes out too widely; this can be suboptimal if
|
|
application performance is sensitive to interprocess communication
|
|
costs. Binding can also keep the operating system from migrating
|
|
processes excessively, regardless of how optimally those processes
|
|
were placed to begin with.
|
|
.
|
|
.PP
|
|
The processors to be used for binding can be identified in terms of
|
|
topological groupings - e.g., binding to an l3cache will bind each
|
|
process to all processors within the scope of a single L3 cache within
|
|
their assigned location. Thus, if a process is assigned by the mapper
|
|
to a certain socket, then a \fI--bind-to l3cache\fP directive will
|
|
cause the process to be bound to the processors that share a single L3
|
|
cache within that socket.
|
|
.
|
|
.PP
|
|
Alternatively, processes can be assigned to processors based on their
|
|
local rank on a node using the \fI--bind-to cpu-list:ordered\fP option
|
|
with an associated \fI--cpu-list "0,2,5"\fP. In this example, the
|
|
first process on a node will be bound to cpu 0, the second process on
|
|
the node will be bound to cpu 2, and the third process on the node
|
|
will be bound to cpu 5. \fI--bind-to\fP will also accept
|
|
\fIcpulist:ordered\fP as a synonym for \fIcpu-list:ordered\fP. Note
|
|
that an error will result if more processes are assigned to a node
|
|
than cpus are provided.
|
|
.
|
|
.PP
|
|
To help balance loads, the binding directive uses a round-robin method when binding to
|
|
levels lower than used in the mapper. For example, consider the case where a job is
|
|
mapped to the socket level, and then bound to core. Each socket will have multiple cores,
|
|
so if multiple processes are mapped to a given socket, the binding algorithm will assign
|
|
each process located to a socket to a unique core in a round-robin manner.
|
|
.
|
|
.PP
|
|
Alternatively, processes mapped by l2cache and then bound to socket will simply be bound
|
|
to all the processors in the socket where they are located. In this manner, users can
|
|
exert detailed control over relative MCW rank location and binding.
|
|
.
|
|
.PP
|
|
Finally, \fI--report-bindings\fP can be used to report bindings.
|
|
.
|
|
.PP
|
|
As an example, consider a node with two processor sockets, each comprising
|
|
four cores. We run \fImpirun\fP with \fI-np 4 --report-bindings\fP and
|
|
the following additional options:
|
|
.
|
|
|
|
% mpirun ... --map-by core --bind-to core
|
|
[...] ... binding child [...,0] to cpus 0001
|
|
[...] ... binding child [...,1] to cpus 0002
|
|
[...] ... binding child [...,2] to cpus 0004
|
|
[...] ... binding child [...,3] to cpus 0008
|
|
|
|
% mpirun ... --map-by socket --bind-to socket
|
|
[...] ... binding child [...,0] to socket 0 cpus 000f
|
|
[...] ... binding child [...,1] to socket 1 cpus 00f0
|
|
[...] ... binding child [...,2] to socket 0 cpus 000f
|
|
[...] ... binding child [...,3] to socket 1 cpus 00f0
|
|
|
|
% mpirun ... --map-by core:PE=2 --bind-to core
|
|
[...] ... binding child [...,0] to cpus 0003
|
|
[...] ... binding child [...,1] to cpus 000c
|
|
[...] ... binding child [...,2] to cpus 0030
|
|
[...] ... binding child [...,3] to cpus 00c0
|
|
|
|
% mpirun ... --bind-to none
|
|
.
|
|
.PP
|
|
Here, \fI--report-bindings\fP shows the binding of each process as a mask.
|
|
In the first case, the processes bind to successive cores as indicated by
|
|
the masks 0001, 0002, 0004, and 0008. In the second case, processes bind
|
|
to all cores on successive sockets as indicated by the masks 000f and 00f0.
|
|
The processes cycle through the processor sockets in a round-robin fashion
|
|
as many times as are needed. In the third case, the masks show us that
|
|
2 cores have been bound per process. In the fourth case, binding is
|
|
turned off and no bindings are reported.
|
|
.
|
|
.PP
|
|
Open MPI's support for process binding depends on the underlying
|
|
operating system. Therefore, certain process binding options may not be available
|
|
on every system.
|
|
.
|
|
.PP
|
|
Process binding can also be set with MCA parameters.
|
|
Their usage is less convenient than that of \fImpirun\fP options.
|
|
On the other hand, MCA parameters can be set not only on the \fImpirun\fP
|
|
command line, but alternatively in a system or user mca-params.conf file
|
|
or as environment variables, as described in the MCA section below.
|
|
Some examples include:
|
|
.
|
|
.PP
|
|
      mpirun option         MCA parameter key            value

      --map-by core         rmaps_base_mapping_policy    core
      --map-by socket       rmaps_base_mapping_policy    socket
      --rank-by core        rmaps_base_ranking_policy    core
      --bind-to core        hwloc_base_binding_policy    core
      --bind-to socket      hwloc_base_binding_policy    socket
      --bind-to none        hwloc_base_binding_policy    none
|
|
.
|
|
.
|
|
.SS Rankfiles
|
|
.
|
|
Rankfiles are text files that specify detailed information about how
|
|
individual processes should be mapped to nodes, and to which
|
|
processor(s) they should be bound. Each line of a rankfile specifies
|
|
the location of one process (for MPI jobs, the process' "rank" refers
|
|
to its rank in MPI_COMM_WORLD). The general form of each line in the
|
|
rankfile is:
|
|
.
|
|
|
|
rank <N>=<hostname> slot=<slot list>
|
|
.
|
|
.PP
|
|
For example:
|
|
.
|
|
|
|
$ cat myrankfile
|
|
rank 0=aa slot=1:0-2
|
|
rank 1=bb slot=0:0,1
|
|
rank 2=cc slot=1-2
|
|
$ mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
|
|
.
|
|
.PP
|
|
Means that
|
|
.
|
|
|
|
Rank 0 runs on node aa, bound to logical socket 1, cores 0-2.
|
|
Rank 1 runs on node bb, bound to logical socket 0, cores 0 and 1.
|
|
Rank 2 runs on node cc, bound to logical cores 1 and 2.
|
|
.
|
|
.PP
|
|
Rankfiles can alternatively be used to specify \fIphysical\fP processor
|
|
locations. In this case, the syntax is somewhat different. Sockets are
|
|
no longer recognized, and the slot number given must be the number of
|
|
the physical PU as most OS's do not assign a unique physical identifier
|
|
to each core in the node. Thus, a proper physical rankfile looks something
|
|
like the following:
|
|
.
|
|
|
|
$ cat myphysicalrankfile
|
|
rank 0=aa slot=1
|
|
rank 1=bb slot=8
|
|
rank 2=cc slot=6
|
|
.
|
|
.PP
|
|
This means that
|
|
.
|
|
|
|
Rank 0 will run on node aa, bound to the core that contains physical PU 1
|
|
Rank 1 will run on node bb, bound to the core that contains physical PU 8
|
|
Rank 2 will run on node cc, bound to the core that contains physical PU 6
|
|
.
|
|
.PP
|
|
Rankfiles are treated as \fIlogical\fP by default, and the MCA parameter
|
|
rmaps_rank_file_physical must be set to 1 to indicate that the rankfile
|
|
is to be considered as \fIphysical\fP.
|
|
.
|
|
.PP
|
|
The hostnames listed above are "absolute," meaning that actual
|
|
resolvable hostnames are specified. However, hostnames can also be
|
|
specified as "relative," meaning that they are specified in relation
|
|
to an externally-specified list of hostnames (e.g., by mpirun's --host
|
|
argument, a hostfile, or a job scheduler).
|
|
.
|
|
.PP
|
|
The "relative" specification is of the form "+n<X>", where X is an
|
|
integer specifying the Xth hostname in the set of all available
|
|
hostnames, indexed from 0. For example:
|
|
.
|
|
|
|
$ cat myrankfile
|
|
rank 0=+n0 slot=1:0-2
|
|
rank 1=+n1 slot=0:0,1
|
|
rank 2=+n2 slot=1-2
|
|
$ mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
|
|
.
|
|
.PP
|
|
Starting with Open MPI v1.7, all socket/core slot locations are
|
|
specified as
|
|
.I logical
|
|
indexes (the Open MPI v1.6 series used
|
|
.I physical
|
|
indexes). You can use tools such as HWLOC's "lstopo" to find the
|
|
logical indexes of sockets and cores.
|
|
.
|
|
.
|
|
.SS Application Context or Executable Program?
|
|
.
|
|
To distinguish the two different forms, \fImpirun\fP
|
|
looks on the command line for \fI--app\fP option. If
|
|
it is specified, then the file named on the command line is
|
|
assumed to be an application context. If it is not
|
|
specified, then the file is assumed to be an executable program.
|
|
.
|
|
.
|
|
.
|
|
.SS Locating Files
|
|
.
|
|
If no relative or absolute path is specified for a file, Open
|
|
MPI will first look for files by searching the directories specified
|
|
by the \fI--path\fP option. If there is no \fI--path\fP option set or
|
|
if the file is not found at the \fI--path\fP location, then Open MPI
|
|
will search the user's PATH environment variable as defined on the
|
|
source node(s).
|
|
.PP
|
|
If a relative directory is specified, it must be relative to the initial
|
|
working directory determined by the specific starter used. For example when
|
|
using the rsh or ssh starters, the initial directory is $HOME by default. Other
|
|
starters may set the initial directory to the current working directory from
|
|
the invocation of \fImpirun\fP.
|
|
.
|
|
.
|
|
.
|
|
.SS Current Working Directory
|
|
.
|
|
The \fI\-wdir\fP mpirun option (and its synonym, \fI\-wd\fP) allows
|
|
the user to change to an arbitrary directory before the program is
|
|
invoked. It can also be used in application context files to specify
|
|
working directories on specific nodes and/or for specific
|
|
applications.
|
|
.PP
|
|
If the \fI\-wdir\fP option appears both in a context file and on the
|
|
command line, the context file directory will override the command
|
|
line value.
|
|
.PP
|
|
If the \fI-wdir\fP option is specified, Open MPI will attempt to
|
|
change to the specified directory on all of the remote nodes. If this
|
|
fails, \fImpirun\fP will abort.
|
|
.PP
|
|
If the \fI-wdir\fP option is \fBnot\fP specified, Open MPI will send
|
|
the directory name where \fImpirun\fP was invoked to each of the
|
|
remote nodes. The remote nodes will try to change to that
|
|
directory. If they are unable (e.g., if the directory does not exist on
|
|
that node), then Open MPI will use the default directory determined by
|
|
the starter.
|
|
.PP
|
|
All directory changing occurs before the user's program is invoked; it
|
|
does not wait until \fIMPI_INIT\fP is called.
|
|
.
|
|
.
|
|
.
|
|
.SS Standard I/O
|
|
.
|
|
Open MPI directs UNIX standard input to /dev/null on all processes
|
|
except the MPI_COMM_WORLD rank 0 process. The MPI_COMM_WORLD rank 0 process
|
|
inherits standard input from \fImpirun\fP.
|
|
.B Note:
|
|
The node that invoked \fImpirun\fP need not be the same as the node where the
|
|
MPI_COMM_WORLD rank 0 process resides. Open MPI handles the redirection of
|
|
\fImpirun\fP's standard input to the rank 0 process.
|
|
.PP
|
|
Open MPI directs UNIX standard output and error from remote nodes to the node
|
|
that invoked \fImpirun\fP and prints it on the standard output/error of
|
|
\fImpirun\fP.
|
|
Local processes inherit the standard output/error of \fImpirun\fP and transfer
|
|
to it directly.
|
|
.PP
|
|
Thus it is possible to redirect standard I/O for Open MPI applications by
|
|
using the typical shell redirection procedure on \fImpirun\fP.
|
|
|
|
\fB%\fP mpirun -np 2 my_app < my_input > my_output
|
|
|
|
Note that in this example \fIonly\fP the MPI_COMM_WORLD rank 0 process will
|
|
receive the stream from \fImy_input\fP on stdin. The stdin on all the other
|
|
nodes will be tied to /dev/null. However, the stdout from all nodes will
|
|
be collected into the \fImy_output\fP file.
|
|
.
|
|
.
|
|
.
|
|
.SS Signal Propagation
|
|
.
|
|
When orterun receives a SIGTERM or SIGINT, it will attempt to kill
|
|
the entire job by sending all processes in the job a SIGTERM, waiting
|
|
a small number of seconds, then sending all processes in the job a
|
|
SIGKILL.
|
|
.
|
|
.PP
|
|
SIGUSR1 and SIGUSR2 signals received by orterun are propagated to
|
|
all processes in the job.
|
|
.
|
|
.PP
|
|
A SIGTSTP signal to mpirun will cause a SIGSTOP signal to be sent
to all of the programs started by mpirun, and likewise a SIGCONT signal
to mpirun will cause a SIGCONT to be sent.
|
|
.
|
|
.PP
|
|
Other signals are not currently propagated
|
|
by orterun.
|
|
.
|
|
.
|
|
.SS Process Termination / Signal Handling
|
|
.
|
|
During the run of an MPI application, if any process dies abnormally
|
|
(either exiting before invoking \fIMPI_FINALIZE\fP, or dying as the result of a
|
|
signal), \fImpirun\fP will print out an error message and kill the rest of the
|
|
MPI application.
|
|
.PP
|
|
User signal handlers should probably avoid trying to cleanup MPI state
|
|
(Open MPI is currently not async-signal-safe; see MPI_Init_thread(3)
|
|
for details about
|
|
.I MPI_THREAD_MULTIPLE
|
|
and thread safety). For example, if a segmentation fault occurs in
|
|
\fIMPI_SEND\fP (perhaps because a bad buffer was passed in) and a user
|
|
signal handler is invoked, if this user handler attempts to invoke
|
|
\fIMPI_FINALIZE\fP, Bad Things could happen since Open MPI was already
|
|
"in" MPI when the error occurred. Since \fImpirun\fP will notice that
|
|
the process died due to a signal, it is probably not necessary (and
safest) for the user to clean up only non-MPI state.
|
|
.
|
|
.
|
|
.
|
|
.SS Process Environment
|
|
.
|
|
Processes in the MPI application inherit their environment from the
|
|
Open RTE daemon upon the node on which they are running. The
|
|
environment is typically inherited from the user's shell. On remote
|
|
nodes, the exact environment is determined by the boot MCA module
|
|
used. The \fIrsh\fR launch module, for example, uses either
|
|
\fIrsh\fR/\fIssh\fR to launch the Open RTE daemon on remote nodes, and
|
|
typically executes one or more of the user's shell-setup files before
|
|
launching the Open RTE daemon. When running dynamically linked
|
|
applications which require the \fILD_LIBRARY_PATH\fR environment
|
|
variable to be set, care must be taken to ensure that it is correctly
|
|
set when booting Open MPI.
|
|
.PP
|
|
See the "Remote Execution" section for more details.
|
|
.
|
|
.
|
|
.SS Remote Execution
|
|
.
|
|
Open MPI requires that the \fIPATH\fR environment variable be set to
|
|
find executables on remote nodes (this is typically only necessary in
|
|
\fIrsh\fR- or \fIssh\fR-based environments -- batch/scheduled
|
|
environments typically copy the current environment to the execution
|
|
of remote jobs, so if the current environment has \fIPATH\fR and/or
|
|
\fILD_LIBRARY_PATH\fR set properly, the remote nodes will also have it
|
|
set properly). If Open MPI was compiled with shared library support,
|
|
it may also be necessary to have the \fILD_LIBRARY_PATH\fR environment
|
|
variable set on remote nodes as well (especially to find the shared
|
|
libraries required to run user MPI applications).
|
|
.PP
|
|
However, it is not always desirable or possible to edit shell
|
|
startup files to set \fIPATH\fR and/or \fILD_LIBRARY_PATH\fR. The
|
|
\fI--prefix\fR option is provided for some simple configurations where
|
|
this is not possible.
|
|
.PP
|
|
The \fI--prefix\fR option takes a single argument: the base directory
|
|
on the remote node where Open MPI is installed. Open MPI will use
|
|
this directory to set the remote \fIPATH\fR and \fILD_LIBRARY_PATH\fR
|
|
before executing any Open MPI or user applications. This allows
|
|
running Open MPI jobs without having pre-configured the \fIPATH\fR and
|
|
\fILD_LIBRARY_PATH\fR on the remote nodes.
|
|
.PP
|
|
Open MPI adds the basename of the current
|
|
node's "bindir" (the directory where Open MPI's executables are
|
|
installed) to the prefix and uses that to set the \fIPATH\fR on the
|
|
remote node. Similarly, Open MPI adds the basename of the current
|
|
node's "libdir" (the directory where Open MPI's libraries are
|
|
installed) to the prefix and uses that to set the
|
|
\fILD_LIBRARY_PATH\fR on the remote node. For example:
|
|
.TP 15
|
|
Local bindir:
|
|
/local/node/directory/bin
|
|
.TP
|
|
Local libdir:
|
|
/local/node/directory/lib64
|
|
.PP
|
|
If the following command line is used:
|
|
|
|
\fB%\fP mpirun --prefix /remote/node/directory
|
|
|
|
Open MPI will add "/remote/node/directory/bin" to the \fIPATH\fR
|
|
and "/remote/node/directory/lib64" to the \fILD_LIBRARY_PATH\fR on the
|
|
remote node before attempting to execute anything.
|
|
.PP
|
|
The \fI--prefix\fR option is not sufficient if the installation paths
|
|
on the remote node are different than the local node (e.g., if "/lib"
|
|
is used on the local node, but "/lib64" is used on the remote node),
|
|
or if the installation paths are something other than a subdirectory
|
|
under a common prefix.
|
|
.PP
|
|
Note that executing \fImpirun\fR via an absolute pathname is
|
|
equivalent to specifying \fI--prefix\fR without the last subdirectory
|
|
in the absolute pathname to \fImpirun\fR. For example:
|
|
|
|
\fB%\fP /usr/local/bin/mpirun ...
|
|
|
|
is equivalent to
|
|
|
|
\fB%\fP mpirun --prefix /usr/local
|
|
.
|
|
.
|
|
.
|
|
.SS Exported Environment Variables
|
|
.
|
|
All environment variables that are named in the form OMPI_* will automatically
|
|
be exported to new processes on the local and remote nodes. Environmental
|
|
parameters can also be set/forwarded to the new processes using the MCA
|
|
parameter \fImca_base_env_list\fP. The \fI\-x\fP option to \fImpirun\fP has
|
|
been deprecated, but the syntax of the MCA param follows that prior
|
|
example. While the syntax of the \fI\-x\fP option and MCA param
|
|
allows the definition of new variables, note that the parser
|
|
for these options is currently not very sophisticated - it does not even
|
|
understand quoted values. Users are advised to set variables in the
|
|
environment and use the option to export them; not to define them.
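.PP
For example, the following illustrative command exports an existing variable and
defines a new one for all launched processes (a semicolon-delimited list is
assumed here; see \fIompi_info(1)\fP for the authoritative syntax of
\fImca_base_env_list\fP):

\fB%\fP mpirun -mca mca_base_env_list "DISPLAY;FOO=bar" -np 4 ./a.out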
|
|
.
|
|
.
|
|
.
|
|
.SS Setting MCA Parameters
|
|
.
|
|
The \fI-mca\fP switch allows the passing of parameters to various MCA
|
|
(Modular Component Architecture) modules.
|
|
.\" Open MPI's MCA modules are described in detail in ompimca(7).
|
|
MCA modules have direct impact on MPI programs because they allow tunable
|
|
parameters to be set at run time (such as which BTL communication device driver
|
|
to use, what parameters to pass to that BTL, etc.).
|
|
.PP
|
|
The \fI-mca\fP switch takes two arguments: \fI<key>\fP and \fI<value>\fP.
|
|
The \fI<key>\fP argument generally specifies which MCA module will receive the value.
|
|
For example, the \fI<key>\fP "btl" is used to select which BTL to be used for
|
|
transporting MPI messages. The \fI<value>\fP argument is the value that is
|
|
passed.
|
|
For example:
|
|
.
|
|
.TP 4
|
|
mpirun -mca btl tcp,self -np 1 foo
|
|
Tells Open MPI to use the "tcp" and "self" BTLs, and to run a single copy of
"foo" on an allocated node.
|
|
.
|
|
.TP
|
|
mpirun -mca btl self -np 1 foo
|
|
Tells Open MPI to use the "self" BTL, and to run a single copy of "foo" on an
allocated node.
.\" And so on. Open MPI's BTL MCA modules are described in ompimca_btl(7).
.PP
The \fI-mca\fP switch can be used multiple times to specify different
\fI<key>\fP and/or \fI<value>\fP arguments. If the same \fI<key>\fP is
specified more than once, the \fI<value>\fPs are concatenated with a comma
(",") separating them.
.PP
Note that the \fI-mca\fP switch is simply a shortcut for setting environment variables.
The same effect may be accomplished by setting corresponding environment
variables before running \fImpirun\fP.
The form of the environment variables that Open MPI sets is:

OMPI_MCA_<key>=<value>
.PP
Thus, the \fI-mca\fP switch overrides any previously set environment
variables. The \fI-mca\fP settings similarly override MCA parameters set
in the
$OPAL_PREFIX/etc/openmpi-mca-params.conf or $HOME/.openmpi/mca-params.conf
file.
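.PP
As an illustration, assuming a Bourne-style shell and a hypothetical
application "a.out", the following two invocations should be
equivalent:

\fB%\fP mpirun -mca btl tcp,self -np 4 a.out
.br
\fB%\fP OMPI_MCA_btl=tcp,self mpirun -np 4 a.out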
.
.PP
Unknown \fI<key>\fP arguments are still set as
environment variables -- they are not checked (by \fImpirun\fP) for correctness.
Illegal or incorrect \fI<value>\fP arguments may or may not be reported -- it
depends on the specific MCA module.
.PP
To find the available component types under the MCA architecture, or to find the
available parameters for a specific component, use the \fIompi_info\fP command.
See the \fIompi_info(1)\fP man page for detailed information on the command.
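.PP
For example, one way to list the parameters available for the "btl"
framework is (the exact output and the set of reported parameters
depend on the installation):

\fB%\fP ompi_info --param btl all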
.
.
.
.SS Setting MCA parameters and environment variables from file
The \fI-tune\fP command line option and its synonym \fI-mca mca_base_envar_file_prefix\fP allow a user
to set MCA parameters and environment variables with the syntax described below.
This option requires a single file, or a list of files separated by ",", to follow.
.PP
A valid line in the file may contain zero or more "-x", "-mca", or "--mca" arguments.
The following patterns are supported: -mca var val -mca var "val" -x var=val -x var.
If any argument is duplicated in the file, the last value read will be used.
.PP
MCA parameters and environment variables specified on the command line have higher precedence than those specified in the file.
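.PP
A minimal sketch of such a file and its use (the file name, variable,
and values are hypothetical):

\fB%\fP cat my-tune.conf
.br
-mca btl tcp,self -x FOO=bar
.br
\fB%\fP mpirun -tune my-tune.conf -np 4 a.out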
.
.
.
.SS Running as root
.
The Open MPI team strongly advises against executing
.I mpirun
as the root user. MPI applications should be run as regular
(non-root) users.
.
.PP
Reflecting this advice, mpirun will refuse to run as root by default.
To override this default, you can add the
.I --allow-run-as-root
option to the
.I mpirun
command line, or you can set the environment variables
.I OMPI_ALLOW_RUN_AS_ROOT=1
and
.IR OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 .
Note that it takes setting
.I two
environment variables to effect the same behavior as
.I --allow-run-as-root
in order to stress the Open MPI team's strong advice against running
as the root user. After extended discussions with communities who use
containers (where running as the root user is the default), there was
a persistent desire to be able to enable root execution of
.I mpirun
via an environmental control (vs. the existing
.I --allow-run-as-root
command line parameter). The compromise of using
.I two
environment variables was reached: it allows root execution via an
environmental control, but it conveys the Open MPI team's strong
recommendation against this behavior.
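.PP
For example, in a container environment where running as the root
user cannot be avoided, both variables could be set before invoking
\fImpirun\fP (Bourne-style shell assumed; the application name is
illustrative):

\fB%\fP export OMPI_ALLOW_RUN_AS_ROOT=1
.br
\fB%\fP export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
.br
\fB%\fP mpirun -np 2 a.out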
.
.SS Exit status
.
There is no standard definition for what \fImpirun\fP should return as an exit
status. After considerable discussion, we settled on the following method for
assigning the \fImpirun\fP exit status (note: in the following description,
the "primary" job is the initial application started by mpirun - all jobs that
are spawned by that job are designated "secondary" jobs):
.
.IP \[bu] 2
if all processes in the primary job normally terminate with exit status 0, we return 0
.IP \[bu]
if one or more processes in the primary job normally terminate with non-zero exit status,
we return the exit status of the process with the lowest MPI_COMM_WORLD rank to have a non-zero status
.IP \[bu]
if all processes in the primary job normally terminate with exit status 0, and one or more
processes in a secondary job normally terminate with non-zero exit status, we (a) return
the exit status of the process with the lowest MPI_COMM_WORLD rank in the lowest jobid to have a non-zero
status, and (b) output a message summarizing the exit status of the primary and all secondary jobs.
.IP \[bu]
if the command line option --report-child-jobs-separately is set, we will return -only- the
exit status of the primary job. Any non-zero exit status in secondary jobs will be
reported solely in a summary print statement.
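.PP
The resulting status can then be inspected in the shell like that of
any other command (Bourne-style shell, illustrative application name):

\fB%\fP mpirun -np 4 a.out
.br
\fB%\fP echo $?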
.
.PP
By default, OMPI records and notes when MPI processes exit with a non-zero termination status.
This is generally not considered an "abnormal termination" - i.e., OMPI will not abort an MPI
job if one or more processes return a non-zero status. Instead, the default behavior simply
reports the number of processes terminating with non-zero status upon completion of the job.
.PP
However, in some cases it can be desirable to have the job abort when any process terminates
with non-zero status. For example, a non-MPI job might detect a bad result from a calculation
and want to abort, but doesn't want to generate a core file. Or an MPI job might continue past
a call to MPI_Finalize, but indicate that all processes should abort due to some post-MPI result.
.PP
It is not anticipated that this situation will occur frequently. However, in the interest of
serving the broader community, OMPI now has a means for allowing users to direct that jobs be
aborted upon any process exiting with non-zero status. Setting the MCA parameter
"orte_abort_on_non_zero_status" to 1 will cause OMPI to abort all processes once any process
exits with non-zero status.
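.PP
For example (the application name is illustrative):

\fB%\fP mpirun -mca orte_abort_on_non_zero_status 1 -np 4 a.out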
.PP
Terminations caused in this manner will be reported on the console as an "abnormal termination",
with the first process to so exit identified along with its exit status.
.PP
.
.\" **************************
.\" Examples Section
.\" **************************
.SH EXAMPLES
Be sure also to see the examples throughout the sections above.
.
.TP 4
mpirun -np 4 -mca btl ib,tcp,self prog1
Run 4 copies of prog1 using the "ib", "tcp", and "self" BTLs for the
transport of MPI messages.
.
.
.TP 4
mpirun -np 4 -mca btl tcp,sm,self
.br
--mca btl_tcp_if_include eth0 prog1
.br
Run 4 copies of prog1 using the "tcp", "sm" and "self" BTLs for the
transport of MPI messages, with TCP using only the eth0 interface to
communicate. Note that other BTLs have similar if_include MCA
parameters.
.
.\" **************************
.\" Diagnostics Section
.\" **************************
.
.\" .SH DIAGNOSTICS
.\" .TP 4
.\" Error Msg:
.\" Description
.
.\" **************************
.\" Return Value Section
.\" **************************
.
.SH RETURN VALUE
.
\fImpirun\fP returns 0 if all processes started by \fImpirun\fP exit after calling
MPI_FINALIZE. A non-zero value is returned if an internal error occurred in
mpirun, or one or more processes exited before calling MPI_FINALIZE. If an
internal error occurred in mpirun, the corresponding error code is returned.
In the event that one or more processes exit before calling MPI_FINALIZE, the
exit status of the process (identified by its MPI_COMM_WORLD rank) that \fImpirun\fP first notices died
before calling MPI_FINALIZE will be returned. Note that, in general, this will
be the first process that died, but it is not guaranteed to be so.
.
.PP
If the
.B --timeout
command line option is used and the timeout expires before the job
completes (thereby forcing
.I mpirun
to kill the job),
.I mpirun
will return an exit status equivalent to the value of
.B ETIMEDOUT
(which is typically 110 on Linux and OS X systems).
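.PP
For example, assuming the timeout value is given in seconds and an
illustrative application name, the following invocation kills the job
and returns ETIMEDOUT if the job has not completed within 60 seconds:

\fB%\fP mpirun --timeout 60 -np 4 a.out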
.
.\" **************************
.\" See Also Section
.\" **************************
.
.SH SEE ALSO
MPI_Init_thread(3)