.\" -*- nroff -*-
|
|
.\" Copyright (c) 2009-2010 Cisco Systems, Inc. All rights reserved.
|
|
.\" Copyright (c) 2008-2009 Sun Microsystems, Inc. All rights reserved.
|
|
.\"
|
|
.\" Man page for ORTE's orterun command
|
|
.\"
|
|
.\" .TH name section center-footer left-footer center-header
|
|
.TH MPIRUN 1 "#OMPI_DATE#" "#PACKAGE_VERSION#" "#PACKAGE_NAME#"
|
|
.\" **************************
|
|
.\" Name Section
|
|
.\" **************************
|
|
.SH NAME
|
|
.
|
|
orterun, mpirun, mpiexec \- Execute serial and parallel jobs in Open MPI.
|
|
|
|
.B Note:
|
|
\fImpirun\fP, \fImpiexec\fP, and \fIorterun\fP are all synonyms for each
|
|
other. Using any of the names will produce the same behavior.
|
|
.
|
|
.\" **************************
|
|
.\" Synopsis Section
|
|
.\" **************************
|
|
.SH SYNOPSIS
|
|
.
|
|
.PP
|
|
Single Program Multiple Data (SPMD) Model:
|
|
|
|
.B mpirun
|
|
[ options ]
|
|
.B <program>
|
|
[ <args> ]
|
|
.P
|
|
|
|
Multiple Instruction Multiple Data (MIMD) Model:
|
|
|
|
.B mpirun
|
|
[ global_options ]
|
|
[ local_options1 ]
|
|
.B <program1>
|
|
[ <args1> ] :
|
|
[ local_options2 ]
|
|
.B <program2>
|
|
[ <args2> ] :
|
|
... :
|
|
[ local_optionsN ]
|
|
.B <programN>
|
|
[ <argsN> ]
|
|
.P
|
|
|
|
Note that in both models, invoking \fImpirun\fP via an absolute path
|
|
name is equivalent to specifying the \fI--prefix\fP option with a
|
|
\fI<dir>\fR value equivalent to the directory where \fImpirun\fR
|
|
resides, minus its last subdirectory. For example:
|
|
|
|
\fB%\fP /usr/local/bin/mpirun ...
|
|
|
|
is equivalent to
|
|
|
|
\fB%\fP mpirun --prefix /usr/local
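.PP
The rule above can be sketched in a few lines of Python. This is only an
illustration of the documented behavior, not Open MPI's own code; the
function name implied_prefix is hypothetical.
.nf
```python
import posixpath


def implied_prefix(mpirun_path):
    """Return the --prefix value implied by an absolute mpirun path, else None."""
    if not posixpath.isabs(mpirun_path):
        return None  # a bare "mpirun" found via PATH implies no prefix
    bindir = posixpath.dirname(mpirun_path)  # e.g. /usr/local/bin
    return posixpath.dirname(bindir)         # drop the last subdirectory


print(implied_prefix("/usr/local/bin/mpirun"))  # /usr/local
```
.fi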
|
|
|
|
.
|
|
.\" **************************
|
|
.\" Quick Summary Section
|
|
.\" **************************
|
|
.SH QUICK SUMMARY
|
|
.
|
|
If you are simply looking for how to run an MPI application, you
|
|
probably want to use a command line of the following form:
|
|
|
|
\fB%\fP mpirun [ -np X ] [ --hostfile <filename> ] <program>
|
|
|
|
This will run X copies of \fI<program>\fR in your current run-time
|
|
environment (if running under a supported resource manager, Open MPI's
|
|
\fImpirun\fR will usually automatically use the corresponding resource manager
|
|
process starter, as opposed to, for example, \fIrsh\fR or \fIssh\fR,
|
|
which require the use of a hostfile, or will default to running all X
|
|
copies on the localhost), scheduling (by default) in a round-robin fashion by
|
|
CPU slot. See the rest of this page for more details.
|
|
.
|
|
.\" **************************
|
|
.\" Options Section
|
|
.\" **************************
|
|
.SH OPTIONS
|
|
.
|
|
.I mpirun
|
|
will send the name of the directory where it was invoked on the local
|
|
node to each of the remote nodes, and attempt to change to that
|
|
directory. See the "Current Working Directory" section below for further
|
|
details.
|
|
.\"
|
|
.\" Start options listing
|
|
.\" Indent 10 characters from start of first column to start of second column
|
|
.TP 10
|
|
.B <program>
|
|
The program executable. This is identified as the first non-recognized argument
|
|
to mpirun.
|
|
.
|
|
.
|
|
.TP
|
|
.B <args>
|
|
Pass these run-time arguments to every new process. These must always
|
|
be the last arguments to \fImpirun\fP. If an app context file is used,
|
|
\fI<args>\fP will be ignored.
|
|
.
|
|
.
|
|
.TP
|
|
.B -h\fR,\fP --help
|
|
Display help for this command
|
|
.
|
|
.
|
|
.TP
|
|
.B -q\fR,\fP --quiet
|
|
Suppress informative messages from orterun during application execution.
|
|
.
|
|
.
|
|
.TP
|
|
.B -v\fR,\fP --verbose
|
|
Be verbose
|
|
.
|
|
.
|
|
.TP
|
|
.B -V\fR,\fP --version
|
|
Print version number. If no other arguments are given, this will also
|
|
cause orterun to exit.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To specify which hosts (nodes) of the cluster to run on:
|
|
.
|
|
.
|
|
.TP
|
|
.B -H\fR,\fP -host\fR,\fP --host \fR<host1,host2,...,hostN>\fP
|
|
List of hosts on which to invoke processes.
|
|
.
|
|
.
|
|
.TP
|
|
.B
|
|
-hostfile\fR,\fP --hostfile \fR<hostfile>\fP
|
|
Provide a hostfile to use.
|
|
.\" JJH - Should have man page for how to format a hostfile properly.
|
|
.
|
|
.
|
|
.TP
|
|
.B -machinefile\fR,\fP --machinefile \fR<machinefile>\fP
|
|
Synonym for \fI-hostfile\fP.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To specify the number of processes to launch:
|
|
.
|
|
.
|
|
.TP
|
|
.B -c\fR,\fP -n\fR,\fP --n\fR,\fP -np \fR<#>\fP
|
|
Run this many copies of the program on the given nodes. This option
|
|
indicates that the specified file is an executable program and not an
|
|
application context. If no value is provided for the number of copies to
|
|
execute (i.e., neither the "-np" nor its synonyms are provided on the command
|
|
line), Open MPI will automatically execute a copy of the program on
|
|
each process slot (see below for description of a "process slot"). This
|
|
feature, however, can only be used in the SPMD model and will return an
|
|
error (without beginning execution of the application) otherwise.
|
|
.
|
|
.
|
|
.TP
|
|
.B -npersocket\fR,\fP --npersocket <#persocket>
|
|
On each node, launch this many processes times the number of processor
|
|
sockets on the node.
|
|
The \fI-npersocket\fP option also turns on the \fI-bind-to-socket\fP option.
|
|
.
|
|
.
|
|
.TP
|
|
.B -npernode\fR,\fP --npernode <#pernode>
|
|
On each node, launch this many processes.
|
|
.
|
|
.
|
|
.TP
|
|
.B -pernode\fR,\fP --pernode
|
|
On each node, launch one process -- equivalent to \fI-npernode\fP 1.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To map processes:
|
|
.
|
|
.
|
|
.TP
|
|
.B --map-by <foo>
|
|
Map to the specified object, defaults to \fIsocket\fP. Supported options
|
|
include slot, hwthread, core, socket, numa, board, and node.
|
|
.
|
|
.TP
|
|
.B -bycore\fR,\fP --bycore
|
|
Map processes by core (deprecated in favor of --map-by core)
|
|
.
|
|
.TP
|
|
.B -bysocket\fR,\fP --bysocket
|
|
Map processes by socket (deprecated in favor of --map-by socket)
|
|
.
|
|
.TP
|
|
.B -nolocal\fR,\fP --nolocal
|
|
Do not run any copies of the launched application on the same node as
|
|
orterun is running. This option will override listing the localhost
|
|
with \fB--host\fR or any other host-specifying mechanism.
|
|
.
|
|
.TP
|
|
.B -nooversubscribe\fR,\fP --nooversubscribe
|
|
Do not oversubscribe any nodes; error (without starting any processes)
|
|
if the requested number of processes would cause oversubscription.
|
|
This option implicitly sets "max_slots" equal to the "slots" value for
|
|
each node.
|
|
.
|
|
.TP
|
|
.B -bynode\fR,\fP --bynode
|
|
Launch processes one per node, cycling by node in a round-robin
|
|
fashion. This spreads processes evenly among nodes and assigns
|
|
MPI_COMM_WORLD ranks in a round-robin, "by node" manner.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To order processes' ranks in MPI_COMM_WORLD:
|
|
.
|
|
.
|
|
.TP
|
|
.B --rank-by <foo>
|
|
Rank in round-robin fashion according to the specified object,
|
|
defaults to \fIslot\fP. Supported options
|
|
include slot, hwthread, core, socket, numa, board, and node.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
For process binding:
|
|
.
|
|
.TP
|
|
.B --bind-to <foo>
|
|
Bind processes to the specified object, defaults to \fIcore\fP. Supported options
|
|
include slot, hwthread, core, socket, numa, board, and none.
|
|
.
|
|
.TP
|
|
.B -cpus-per-proc\fR,\fP --cpus-per-proc <#perproc>
|
|
Bind each process to the specified number of cpus.
|
|
.
|
|
.TP
|
|
.B -cpus-per-rank\fR,\fP --cpus-per-rank <#perrank>
|
|
Alias for \fI-cpus-per-proc\fP.
|
|
.
|
|
.TP
|
|
.B -bind-to-core\fR,\fP --bind-to-core
|
|
Bind processes to cores (deprecated in favor of --bind-to core)
|
|
.
|
|
.TP
|
|
.B -bind-to-socket\fR,\fP --bind-to-socket
|
|
Bind processes to processor sockets (deprecated in favor of --bind-to socket)
|
|
.
|
|
.TP
|
|
.B -bind-to-none\fR,\fP --bind-to-none
|
|
Do not bind processes (deprecated in favor of --bind-to none)
|
|
.
|
|
.TP
|
|
.B -report-bindings\fR,\fP --report-bindings
|
|
Report any bindings for launched processes.
|
|
.
|
|
.TP
|
|
.B -slot-list\fR,\fP --slot-list <slots>
|
|
List of processor IDs to be used for binding MPI processes. The specified bindings will
|
|
be applied to all MPI processes. See explanation below for syntax.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
For rankfiles:
|
|
.
|
|
.
|
|
.TP
|
|
.B -rf\fR,\fP --rankfile <rankfile>
|
|
Provide a rankfile.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To manage standard I/O:
|
|
.
|
|
.
|
|
.TP
|
|
.B -output-filename\fR,\fP --output-filename \fR<filename>\fP
|
|
Redirect the stdout, stderr, and stddiag of all processes to a process-unique version of
|
|
the specified filename. Any directories in the filename will automatically be created.
|
|
Each output file will consist of filename.id, where the id will be the
process's rank in MPI_COMM_WORLD, left-filled with
zeros for correct ordering in listings.
|
|
.
|
|
.
|
|
.TP
|
|
.B -stdin\fR,\fP --stdin <rank>
|
|
The MPI_COMM_WORLD rank of the process that is to receive stdin. The
|
|
default is to forward stdin to MPI_COMM_WORLD rank 0, but this option
|
|
can be used to forward stdin to any process. It is also acceptable to
|
|
specify \fInone\fP, indicating that no processes are to receive stdin.
|
|
.
|
|
.
|
|
.TP
|
|
.B -tag-output\fR,\fP --tag-output
|
|
Tag each line of output to stdout, stderr, and stddiag with \fB[jobid, MCW_rank]<stdxxx>\fP indicating the process jobid
|
|
and MPI_COMM_WORLD rank of the process that generated the output, and the channel which generated it.
|
|
.
|
|
.
|
|
.TP
|
|
.B -timestamp-output\fR,\fP --timestamp-output
|
|
Timestamp each line of output to stdout, stderr, and stddiag.
|
|
.
|
|
.
|
|
.TP
|
|
.B -xml\fR,\fP --xml
|
|
Provide all output to stdout, stderr, and stddiag in an xml format.
|
|
.
|
|
.
|
|
.TP
|
|
.B -xterm\fR,\fP --xterm \fR<ranks>\fP
|
|
Display the output from the processes identified by their
|
|
MPI_COMM_WORLD ranks in separate xterm windows. The ranks are specified
|
|
as a comma-separated list of ranges, with a -1 indicating all. A separate
|
|
window will be created for each specified process.
|
|
.B Note:
|
|
xterm will normally terminate the window upon termination of the process running
|
|
within it. However, by adding a "!" to the end of the list of specified ranks,
|
|
the proper options will be provided to ensure that xterm keeps the window open
|
|
\fIafter\fP the process terminates, thus allowing you to see the process' output.
|
|
Each xterm window will subsequently need to be manually closed.
|
|
.B Note:
|
|
In some environments, xterm may require that the executable be in the user's
|
|
path, or be specified in absolute or relative terms. Thus, it may be necessary
|
|
to specify a local executable as "./foo" instead of just "foo". If xterm fails to
|
|
find the executable, mpirun will hang, but still respond correctly to a ctrl-c.
|
|
If this happens, please check that the executable is being specified correctly
|
|
and try again.
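.PP
As a rough illustration of the rank list syntax described above, the
following Python sketch expands an \fI-xterm\fP argument into a list of ranks.
It is a model of the documented syntax only; the function name
parse_xterm_ranks and its np parameter (the job size) are assumptions made
for this example, not part of Open MPI.
.nf
```python
def parse_xterm_ranks(spec, np):
    """Return (ranks, hold) for an -xterm argument such as 0,2-3!"""
    hold = spec.endswith("!")  # trailing "!" keeps windows open after exit
    if hold:
        spec = spec[:-1]
    ranks = set()
    for part in spec.split(","):
        if part == "-1":       # -1 selects every rank in the job
            return list(range(np)), hold
        lo, _, hi = part.partition("-")
        ranks.update(range(int(lo), int(hi or lo) + 1))
    return sorted(ranks), hold


print(parse_xterm_ranks("0,2-3!", 8))  # ([0, 2, 3], True)
print(parse_xterm_ranks("-1", 4))      # ([0, 1, 2, 3], False)
```
.fi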
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
To manage files and runtime environment:
|
|
.
|
|
.
|
|
.TP
|
|
.B -path\fR,\fP --path \fR<path>\fP
|
|
<path> that will be used when attempting to locate the requested
|
|
executables. This is used prior to using the local PATH setting.
|
|
.
|
|
.
|
|
.TP
|
|
.B --prefix \fR<dir>\fP
|
|
Prefix directory that will be used to set the \fIPATH\fR and
|
|
\fILD_LIBRARY_PATH\fR on the remote node before invoking Open MPI or
|
|
the target process. See the "Remote Execution" section, below.
|
|
.
|
|
.
|
|
.TP
|
|
.B --preload-binary
|
|
Copy the specified executable(s) to remote machines prior to starting remote processes. The
|
|
executables will be copied to the Open MPI session directory and will be deleted upon
|
|
completion of the job.
|
|
.
|
|
.
|
|
.TP
|
|
.B --preload-files <files>
|
|
Preload the comma-separated list of files to the current working directory of the remote
|
|
machines where processes will be launched prior to starting those processes.
|
|
.
|
|
.
|
|
.TP
|
|
.B --preload-files-dest-dir <path>
|
|
The destination directory to be used for preload-files, if other than the current working
|
|
directory. By default, the absolute and relative paths provided by --preload-files are used.
|
|
.
|
|
.
|
|
.TP
|
|
.B --tmpdir \fR<dir>\fP
|
|
Set the root for the session directory tree for mpirun only.
|
|
.
|
|
.
|
|
.TP
|
|
.B -wd \fR<dir>\fP
|
|
Synonym for \fI-wdir\fP.
|
|
.
|
|
.
|
|
.TP
|
|
.B -wdir \fR<dir>\fP
|
|
Change to the directory <dir> before the user's program executes.
|
|
See the "Current Working Directory" section for notes on relative paths.
|
|
.B Note:
|
|
If the \fI-wdir\fP option appears both on the command line and in an
|
|
application context, the context will take precedence over the command
|
|
line. Thus, if the path to the desired wdir is different
|
|
on the backend nodes, then it must be specified as an absolute path that
|
|
is correct for the backend node.
|
|
.
|
|
.
|
|
.TP
|
|
.B -x \fR<env>\fP
|
|
Export the specified environment variables to the remote nodes before
|
|
executing the program. Only one environment variable can be specified
|
|
per \fI-x\fP option. Existing environment variables can be specified
|
|
or new variable names specified with corresponding values. For
|
|
example:
|
|
\fB%\fP mpirun -x DISPLAY -x OFILE=/tmp/out ...
|
|
|
|
The parser for the \fI-x\fP option is not very sophisticated; it does
|
|
not even understand quoted values. Users are advised to set variables
|
|
in the environment, and then use \fI-x\fP to export (not define) them.
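.PP
The two forms accepted by \fI-x\fP can be modeled as below. This Python
sketch only mirrors the behavior described above (one variable per option,
no quote handling); it is not the actual parser, and the name
parse_x_option is hypothetical.
.nf
```python
def parse_x_option(arg, environ):
    """Return (name, value) for a single -x argument.

    NAME exports an existing variable; NAME=VALUE defines a new one.
    """
    name, sep, value = arg.partition("=")
    if sep:
        return name, value          # taken verbatim, quotes are not interpreted
    return name, environ.get(name)  # None if the variable is not set


env = {"DISPLAY": ":0"}
print(parse_x_option("DISPLAY", env))         # ('DISPLAY', ':0')
print(parse_x_option("OFILE=/tmp/out", env))  # ('OFILE', '/tmp/out')
```
.fi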
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
Setting MCA parameters:
|
|
.
|
|
.
|
|
.TP
|
|
.B -gmca\fR,\fP --gmca \fR<key> <value>\fP
|
|
Pass global MCA parameters that are applicable to all contexts. \fI<key>\fP is
|
|
the parameter name; \fI<value>\fP is the parameter value.
|
|
.
|
|
.
|
|
.TP
|
|
.B -mca\fR,\fP --mca <key> <value>
|
|
Send arguments to various MCA modules. See the "MCA" section, below.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
For debugging:
|
|
.
|
|
.
|
|
.TP
|
|
.B -debug\fR,\fP --debug
|
|
Invoke the user-level debugger indicated by the \fIorte_base_user_debugger\fP
|
|
MCA parameter.
|
|
.
|
|
.
|
|
.TP
|
|
.B -debugger\fR,\fP --debugger
|
|
Sequence of debuggers to search for when \fI--debug\fP is used (i.e.,
a synonym for the \fIorte_base_user_debugger\fP MCA parameter).
|
|
.
|
|
.
|
|
.TP
|
|
.B -tv\fR,\fP --tv
|
|
Launch processes under the TotalView debugger.
|
|
Deprecated backwards compatibility flag. Synonym for \fI--debug\fP.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
There are also other options:
|
|
.
|
|
.
|
|
.TP
|
|
.B -aborted\fR,\fP --aborted \fR<#>\fP
|
|
Set the maximum number of aborted processes to display.
|
|
.
|
|
.
|
|
.TP
|
|
.B --app \fR<appfile>\fP
|
|
Provide an appfile, ignoring all other command line options.
|
|
.
|
|
.
|
|
.TP
|
|
.B -cf\fR,\fP --cartofile \fR<cartofile>\fP
|
|
Provide a cartography file.
|
|
.
|
|
.
|
|
.TP
|
|
.B --hetero
|
|
Indicates that multiple app_contexts are being provided that are a mix of 32/64-bit binaries.
|
|
.
|
|
.
|
|
.TP
|
|
.B -leave-session-attached\fR,\fP --leave-session-attached
|
|
Do not detach OmpiRTE daemons used by this application. This allows error messages from the daemons
|
|
as well as the underlying environment (e.g., when failing to launch a daemon) to be output.
|
|
.
|
|
.
|
|
.TP
|
|
.B -ompi-server\fR,\fP --ompi-server <uri or file>
|
|
Specify the URI of the Open MPI server (or the mpirun to be used as the server),
the name of the file (specified as file:filename) that
contains that info, or the PID (specified as pid:#) of the mpirun to be used as
the server.
|
|
The Open MPI server is used to support multi-application data exchange via
|
|
the MPI-2 MPI_Publish_name and MPI_Lookup_name functions.
|
|
.
|
|
.
|
|
.TP
|
|
.B -report-pid\fR,\fP --report-pid <channel>
|
|
Print out mpirun's PID during startup. The channel must be either a '-' to indicate that
the pid is to be output to stdout, a '+' to indicate that the pid is to be output to stderr,
or a filename to which the pid is to be written.
|
|
.
|
|
.
|
|
.TP
|
|
.B -report-uri\fR,\fP --report-uri <channel>
|
|
Print out mpirun's URI during startup. The channel must be either a '-' to indicate that
the URI is to be output to stdout, a '+' to indicate that the URI is to be output to stderr,
or a filename to which the URI is to be written.
|
|
.
|
|
.
|
|
.TP
|
|
.B -wait-for-server\fR,\fP --wait-for-server
|
|
Pause mpirun before launching the job until ompi-server is detected. This
|
|
is useful in scripts where ompi-server may be started in the background, followed immediately by
|
|
an \fImpirun\fP command that wishes to connect to it. \fImpirun\fP will pause until either the specified
|
|
ompi-server is contacted or the server-wait-time is exceeded.
|
|
.
|
|
.
|
|
.TP
|
|
.B -server-wait-time\fR,\fP --server-wait-time <secs>
|
|
The maximum amount of time (in seconds) mpirun should wait for the ompi-server to start. The default
|
|
is 10 seconds.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.P
|
|
The following options are useful for developers; they are not generally
|
|
useful to most ORTE and/or MPI users:
|
|
.
|
|
.TP
|
|
.B -d\fR,\fP --debug-devel
|
|
Enable debugging of the OmpiRTE (the run-time layer in Open MPI).
|
|
This is not generally useful for most users.
|
|
.
|
|
.
|
|
.TP
|
|
.B --debug-daemons
|
|
Enable debugging of any OmpiRTE daemons used by this application.
|
|
.
|
|
.
|
|
.TP
|
|
.B --debug-daemons-file
|
|
Enable debugging of any OmpiRTE daemons used by this application, storing
|
|
output in files.
|
|
.
|
|
.
|
|
.TP
|
|
.B -launch-agent\fR,\fP --launch-agent
|
|
Name of the executable that is to be used to start processes on the remote nodes. The default
|
|
is "orted". This option can be used to test new daemon concepts, or to pass options back to the
|
|
daemons without having mpirun itself see them. For example, specifying a launch agent of
|
|
\fRorted -mca odls_base_verbose 5\fR allows the developer to ask the orted for debugging output
|
|
without clutter from mpirun itself.
|
|
.
|
|
.
|
|
.TP
|
|
.B --noprefix
|
|
Disable the automatic --prefix behavior
|
|
.
|
|
.
|
|
.P
|
|
There may be other options listed with \fImpirun --help\fP.
|
|
.
|
|
.
|
|
.SS Environment Variables
|
|
.
|
|
.TP
|
|
.B MPIEXEC_TIMEOUT
|
|
The maximum number of seconds that
|
|
.I mpirun
|
|
.RI ( mpiexec )
|
|
will run. After this many seconds,
|
|
.I mpirun
|
|
will abort the launched job and exit.
|
|
.
|
|
.
|
|
.\" **************************
|
|
.\" Description Section
|
|
.\" **************************
|
|
.SH DESCRIPTION
|
|
.
|
|
One invocation of \fImpirun\fP starts an MPI application running under Open
|
|
MPI. If the application is single program multiple data (SPMD), the application
|
|
can be specified on the \fImpirun\fP command line.
|
|
|
|
If the application is multiple instruction multiple data (MIMD), comprising
multiple programs, the set of programs and arguments can be specified in one of
|
|
two ways: Extended Command Line Arguments, and Application Context.
|
|
.PP
|
|
An application context describes the MIMD program set including all arguments
|
|
in a separate file.
|
|
.\" See appcontext(5) for a description of the application context syntax.
|
|
This file essentially contains multiple \fImpirun\fP command lines, less the
|
|
command name itself. The ability to specify different options for different
|
|
instantiations of a program is another reason to use an application context.
|
|
.PP
|
|
Extended command line arguments allow for the description of the application
|
|
layout on the command line using colons (\fI:\fP) to separate the specification
|
|
of programs and arguments. Some options are globally set across all specified
|
|
programs (e.g. --hostfile), while others are specific to a single program
|
|
(e.g. -np).
|
|
.
|
|
.
|
|
.
|
|
.SS Specifying Host Nodes
|
|
.
|
|
Host nodes can be identified on the \fImpirun\fP command line with the \fI-host\fP
|
|
option or in a hostfile.
|
|
.
|
|
.PP
|
|
For example,
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,aa,bb ./a.out
|
|
launches two processes on node aa and one on bb.
|
|
.
|
|
.PP
|
|
Or, consider the hostfile
|
|
.
|
|
|
|
\fB%\fP cat myhostfile
|
|
aa slots=2
|
|
bb slots=2
|
|
cc slots=2
|
|
|
|
.
|
|
.PP
|
|
Here, we list not only the host names (aa, bb, and cc) but also how many "slots"
|
|
there are for each. Slots indicate how many processes can potentially execute
|
|
on a node. For best performance, the number of slots may be chosen to be the
|
|
number of cores on the node or the number of processor sockets. If the hostfile
|
|
does not provide slots information, a default of 1 is assumed.
|
|
When running under resource managers (e.g., SLURM, Torque, etc.),
|
|
Open MPI will obtain both the hostnames and the number of slots directly
|
|
from the resource manager.
|
|
.
|
|
.PP
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile ./a.out
|
|
will launch two processes on each of the three nodes.
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -host aa ./a.out
|
|
will launch two processes, both on node aa.
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -host dd ./a.out
|
|
will find no hosts to run on and abort with an error.
|
|
That is, the specified host dd is not in the specified hostfile.
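.PP
The hostfile rules above (a default of one slot, and \fI-host\fP acting as a
filter on the hostfile) can be sketched in Python as follows. This is a
simplified model for illustration, not Open MPI's parser: the function names
are hypothetical, and only the "slots" field is interpreted.
.nf
```python
def parse_hostfile(text):
    """Map host name -> slot count; hosts without slots= get 1."""
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        count = 1
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                count = int(value)
        slots[fields[0]] = count
    return slots


def restrict_to_hosts(slots, hosts):
    """Keep only the hostfile entries named by -host; fail if none remain."""
    chosen = {h: n for h, n in slots.items() if h in hosts}
    if not chosen:
        raise ValueError("no hosts to run on")
    return chosen


hostfile = """aa slots=2
bb slots=2
cc slots=2"""
slots = parse_hostfile(hostfile)
print(sum(slots.values()))               # 6 processes with no -np
print(restrict_to_hosts(slots, ["aa"]))  # {'aa': 2}
```
.fi
.PP
Calling restrict_to_hosts(slots, ["dd"]) raises an error, matching the last
example above.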
|
|
.
|
|
.SS Specifying Number of Processes
|
|
.
|
|
As we have just seen, the number of processes to run can be set using the
|
|
hostfile. Other mechanisms exist.
|
|
.
|
|
.PP
|
|
The number of processes launched can be specified as a multiple of the
|
|
number of nodes or processor sockets available. For example,
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -npersocket 2 ./a.out
|
|
launches processes 0-3 on node aa and processes 4-7 on node bb,
|
|
where aa and bb are both dual-socket nodes.
|
|
The \fI-npersocket\fP option also turns on the \fI-bind-to-socket\fP option,
|
|
which is discussed in a later section.
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -npernode 2 ./a.out
|
|
launches processes 0-1 on node aa and processes 2-3 on node bb.
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -npernode 1 ./a.out
|
|
launches one process per host node.
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -pernode ./a.out
|
|
is the same as \fI-npernode\fP 1.
|
|
.
|
|
.
|
|
.PP
|
|
Another alternative is to specify the number of processes with the
|
|
\fI-np\fP option. Consider now the hostfile
|
|
.
|
|
|
|
\fB%\fP cat myhostfile
|
|
aa slots=4
|
|
bb slots=4
|
|
cc slots=4
|
|
|
|
.
|
|
.PP
|
|
Now,
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -np 6 ./a.out
|
|
will launch processes 0-3 on node aa and processes 4-5 on node bb. The remaining
|
|
slots in the hostfile will not be used since the \fI-np\fP option indicated
|
|
that only 6 processes should be launched.
|
|
.
|
|
.SS Mapping Processes to Nodes: Using Policies
|
|
.
|
|
The examples above illustrate the default mapping of processes
|
|
to nodes. This mapping can also be controlled with various
|
|
\fImpirun\fP options that describe mapping policies.
|
|
.
|
|
.
|
|
.PP
|
|
Consider the same hostfile as above, again with \fI-np\fP 6:
|
|
.
|
|
|
|
                         node aa      node bb      node cc

 mpirun                  0 1 2 3      4 5

 mpirun -bynode          0 3          1 4          2 5

 mpirun -nolocal                      0 1 2 3      4 5
|
|
.
|
|
.PP
|
|
The \fI-bynode\fP option spreads the processes across the nodes, numbering them
"by node" in a round-robin fashion.
|
|
.
|
|
.PP
|
|
The \fI-nolocal\fP option prevents any processes from being mapped onto the
|
|
local host (in this case node aa). While \fImpirun\fP typically consumes
|
|
few system resources, \fI-nolocal\fP can be helpful for launching very
|
|
large jobs where \fImpirun\fP may actually need to use noticeable amounts
|
|
of memory and/or processing time.
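.PP
The three placements discussed above can be reproduced with a short Python
model. It is a sketch of the documented policies only, assuming the hostfile
with four slots per node and no oversubscription; the function names are
hypothetical and this is not how Open MPI's mappers are implemented.
.nf
```python
def map_by_slot(nodes, np):
    """Fill each node's slots before moving on to the next node."""
    placement = {name: [] for name, _ in nodes}
    rank = 0
    for name, slots in nodes:
        for _ in range(slots):
            if rank == np:
                return placement
            placement[name].append(rank)
            rank += 1
    return placement


def map_by_node(nodes, np):
    """Place one process per node per pass, cycling round-robin."""
    placement = {name: [] for name, _ in nodes}
    free = {name: slots for name, slots in nodes}
    rank = 0
    while rank < np and any(free.values()):
        for name, _ in nodes:
            if rank < np and free[name] > 0:
                placement[name].append(rank)
                free[name] -= 1
                rank += 1
    return placement


nodes = [("aa", 4), ("bb", 4), ("cc", 4)]
print(map_by_slot(nodes, 6))      # default mapping
print(map_by_node(nodes, 6))      # -bynode
print(map_by_slot(nodes[1:], 6))  # -nolocal, with mpirun running on aa
```
.fi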
|
|
.
|
|
.PP
|
|
Just as \fI-np\fP can specify fewer processes than there are slots, it can
|
|
also oversubscribe the slots. For example, with the same hostfile:
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -np 14 ./a.out
|
|
will launch processes 0-3 on node aa, 4-7 on bb, and 8-11 on cc. It will
|
|
then add the remaining two processes to whichever nodes it chooses.
|
|
.
|
|
.PP
|
|
One can also specify limits to oversubscription. For example, with the same
|
|
hostfile:
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -np 14 -nooversubscribe ./a.out
|
|
will produce an error since \fI-nooversubscribe\fP prevents oversubscription.
|
|
.
|
|
.PP
|
|
Limits to oversubscription can also be specified in the hostfile itself:
|
|
.
|
|
% cat myhostfile
|
|
aa slots=4 max_slots=4
|
|
bb max_slots=4
|
|
cc slots=4
|
|
.
|
|
.PP
|
|
The \fImax_slots\fP field specifies such a limit. When it does, the
|
|
\fIslots\fP value defaults to the limit. Now:
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -np 14 ./a.out
|
|
causes the first 12 processes to be launched as before, but the remaining
|
|
two processes will be forced onto node cc. The other two nodes are
|
|
protected by the hostfile against oversubscription by this job.
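.PP
A possible way to picture the interaction of "slots" and "max_slots" is the
Python sketch below. It fills the slots first and then places the overflow
only on nodes that are still under their limit. The overflow order
(round-robin over eligible nodes) is an assumption made for this example,
since the choice of nodes is left to Open MPI; the function name
map_with_limits is hypothetical.
.nf
```python
def map_with_limits(nodes, np, no_oversubscribe=False):
    """nodes: list of (name, slots, max_slots); max_slots None means no limit.

    Returns a dict of host -> number of processes placed there.
    """
    counts = {name: 0 for name, _, _ in nodes}
    remaining = np
    for name, slots, _ in nodes:      # first pass: fill the slots
        take = min(slots, remaining)
        counts[name] = take
        remaining -= take
    if remaining and no_oversubscribe:
        raise RuntimeError("not enough slots")
    while remaining:                  # second pass: oversubscribe
        placed = False
        for name, _, limit in nodes:
            if remaining and (limit is None or counts[name] < limit):
                counts[name] += 1
                remaining -= 1
                placed = True
        if not placed:
            raise RuntimeError("all nodes are at max_slots")
    return counts


# aa slots=4 max_slots=4, bb max_slots=4 (slots defaults to 4), cc slots=4
nodes = [("aa", 4, 4), ("bb", 4, 4), ("cc", 4, None)]
print(map_with_limits(nodes, 14))  # {'aa': 4, 'bb': 4, 'cc': 6}
```
.fi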
|
|
.
|
|
.PP
|
|
Using the \fI--nooversubscribe\fR option can be helpful since Open MPI
|
|
currently does not get "max_slots" values from the resource manager.
|
|
.
|
|
.PP
|
|
Of course, \fI-np\fP can also be used with the \fI-H\fP or \fI-host\fP
|
|
option. For example,
|
|
.
|
|
.TP 4
|
|
mpirun -H aa,bb -np 8 ./a.out
|
|
launches 8 processes. Since only two hosts are specified, after the first
|
|
two processes are mapped, one to aa and one to bb, the remaining processes
|
|
oversubscribe the specified hosts.
|
|
.
|
|
.PP
|
|
And here is a MIMD example:
|
|
.
|
|
.TP 4
|
|
mpirun -H aa -np 1 hostname : -H bb,cc -np 2 uptime
|
|
will launch process 0 running \fIhostname\fP on node aa and processes 1 and 2
|
|
each running \fIuptime\fP on nodes bb and cc, respectively.
|
|
.
|
|
.SS Mapping Processes to Nodes: Using Arbitrary Mappings
|
|
.
|
|
The mapping of processes to nodes can be prescribed not just
|
|
with general policies but also, if necessary, using arbitrary mappings
|
|
that cannot be described by a simple policy. One can use the "sequential
|
|
mapper," which reads the hostfile line by line, assigning processes
|
|
to nodes in whatever order the hostfile specifies. Use the
|
|
\fI-mca rmaps seq\fP option. For example, using the same hostfile
|
|
as before
|
|
.
|
|
.TP 4
|
|
mpirun -hostfile myhostfile -mca rmaps seq ./a.out
|
|
will launch three processes, on nodes aa, bb, and cc, respectively.
|
|
The slot counts don't matter; one process is launched per line on
|
|
whatever node is listed on the line.
|
|
.
|
|
.PP
|
|
Another way to specify arbitrary mappings is with a rankfile, which
|
|
gives you detailed control over process binding as well. Rankfiles
|
|
are discussed below.
|
|
.
|
|
.SS Process Binding
|
|
.
|
|
Processes may be bound to specific resources on a node. This can
|
|
improve performance if the operating system is placing processes
|
|
suboptimally. For example, it might oversubscribe some multi-core
|
|
processor sockets, leaving other sockets idle; this can lead
|
|
processes to contend unnecessarily for common resources. Or, it
|
|
might spread processes out too widely; this can be suboptimal if
|
|
application performance is sensitive to interprocess communication
|
|
costs. Binding can also keep the operating system from migrating
|
|
processes excessively, regardless of how optimally those processes
|
|
were placed to begin with.
|
|
.
|
|
.PP
|
|
To bind processes, one must first associate them with the resources
|
|
on which they should run. For example, the \fI-bycore\fP option
|
|
associates the processes on a node with successive cores. Or,
|
|
\fI-bysocket\fP associates the processes with successive processor sockets,
|
|
cycling through the sockets in a round-robin fashion if necessary.
|
|
And \fI-cpus-per-proc\fP indicates how many cores to bind per process.
|
|
.
|
|
.PP
|
|
But, such association is meaningless unless the processes are actually
|
|
bound to those resources. The binding option specifies the granularity
|
|
of binding -- say, with \fI-bind-to-core\fP or \fI-bind-to-socket\fP.
|
|
One can also turn binding off with \fI-bind-to-none\fP, which is
|
|
typically the default.
|
|
.
|
|
.PP
|
|
Finally, \fI-report-bindings\fP can be used to report bindings.
|
|
.
|
|
.PP
|
|
As an example, consider a node with two processor sockets, each comprising
|
|
four cores. We run \fImpirun\fP with \fI-np 4 -report-bindings\fP and
|
|
the following additional options:
|
|
.
|
|
|
|
% mpirun ... -bycore -bind-to-core
|
|
[...] ... binding child [...,0] to cpus 0001
|
|
[...] ... binding child [...,1] to cpus 0002
|
|
[...] ... binding child [...,2] to cpus 0004
|
|
[...] ... binding child [...,3] to cpus 0008
|
|
|
|
% mpirun ... -bysocket -bind-to-socket
|
|
[...] ... binding child [...,0] to socket 0 cpus 000f
|
|
[...] ... binding child [...,1] to socket 1 cpus 00f0
|
|
[...] ... binding child [...,2] to socket 0 cpus 000f
|
|
[...] ... binding child [...,3] to socket 1 cpus 00f0
|
|
|
|
% mpirun ... -cpus-per-proc 2 -bind-to-core
|
|
[...] ... binding child [...,0] to cpus 0003
|
|
[...] ... binding child [...,1] to cpus 000c
|
|
[...] ... binding child [...,2] to cpus 0030
|
|
[...] ... binding child [...,3] to cpus 00c0
|
|
|
|
% mpirun ... -bind-to-none
|
|
.
|
|
.PP
|
|
Here, \fI-report-bindings\fP shows the binding of each process as a mask.
|
|
In the first case, the processes bind to successive cores as indicated by
|
|
the masks 0001, 0002, 0004, and 0008. In the second case, processes bind
|
|
to all cores on successive sockets as indicated by the masks 000f and 00f0.
|
|
The processes cycle through the processor sockets in a round-robin fashion
|
|
as many times as are needed. In the third case, the masks show us that
|
|
2 cores have been bound per process. In the fourth case, binding is
|
|
turned off and no bindings are reported.
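.PP
The masks in this example can be derived with a little bit arithmetic. The
Python sketch below assumes that bit i of the mask stands for core i and that
cores are numbered consecutively, socket by socket; the function names and
the wrap-around behavior are assumptions for illustration, not Open MPI
internals.
.nf
```python
def core_masks(np, cores_per_socket=4, sockets=2, cpus_per_proc=1):
    """Masks for binding to cores, optionally several cores per process."""
    total = cores_per_socket * sockets
    masks = []
    for rank in range(np):
        first = (rank * cpus_per_proc) % total
        bits = sum(1 << ((first + i) % total) for i in range(cpus_per_proc))
        masks.append(format(bits, "04x"))
    return masks


def socket_masks(np, cores_per_socket=4, sockets=2):
    """Masks for binding to all cores of successive sockets, round-robin."""
    full = (1 << cores_per_socket) - 1
    return [format(full << ((rank % sockets) * cores_per_socket), "04x")
            for rank in range(np)]


print(core_masks(4))                   # -bycore -bind-to-core
print(socket_masks(4))                 # -bysocket -bind-to-socket
print(core_masks(4, cpus_per_proc=2))  # -cpus-per-proc 2 -bind-to-core
```
.fi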
|
|
.
|
|
.PP
|
|
Open MPI's support for process binding depends on the underlying
|
|
operating system. Therefore, certain process binding options may not be available
|
|
on every system.
|
|
.
|
|
.PP
|
|
Process binding can also be set with MCA parameters.
|
|
Their usage is less convenient than that of \fImpirun\fP options.
|
|
On the other hand, MCA parameters can be set not only on the \fImpirun\fP
|
|
command line, but alternatively in a system or user mca-params.conf file
|
|
or as environment variables, as described in the MCA section below.
|
|
The correspondences are:
|
|
.
|
|
|
|
  mpirun option          MCA parameter key           value

  -bycore                rmaps_base_schedule_policy  core
  -bysocket              rmaps_base_schedule_policy  socket
  -bind-to-core          orte_process_binding        core
  -bind-to-socket        orte_process_binding        socket
  -bind-to-none          orte_process_binding        none
|
|
.
|
|
.PP
|
|
The \fIorte_process_binding\fP value can also take on the
|
|
\fI:if-avail\fP attribute. This attribute means that processes
|
|
will be bound only if this is supported on the underlying
|
|
operating system. Without the attribute, if there is no
|
|
such support, the binding request results in an error.
|
|
For example, you could have
|
|
.
|
|
|
|
% cat $HOME/.openmpi/mca-params.conf
|
|
rmaps_base_schedule_policy = socket
|
|
orte_process_binding = socket:if-avail
|
|
.
|
|
.
|
|
.SS Rankfiles
|
|
.
|
|
Rankfiles are text files that specify detailed information about how
|
|
individual processes should be mapped to nodes, and to which
|
|
processor(s) they should be bound. Each line of a rankfile specifies
|
|
the location of one process (for MPI jobs, the process' "rank" refers
|
|
to its rank in MPI_COMM_WORLD). The general form of each line in the
|
|
rankfile is:
|
|
.
|
|
|
|
rank <N>=<hostname> slot=<slot list>
|
|
.
|
|
.PP
|
|
For example:
|
|
.
|
|
|
|
$ cat myrankfile
|
|
rank 0=aa slot=1:0-2
|
|
rank 1=bb slot=0:0,1
|
|
rank 2=cc slot=1-2
|
|
$ mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
|
|
.
|
|
.PP
|
|
Means that
|
|
.
|
|
|
|
Rank 0 runs on node aa, bound to socket 1, cores 0-2.
|
|
Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
|
|
Rank 2 runs on node cc, bound to cores 1 and 2.
|
|
.
|
|
.PP
|
|
The hostnames listed above are "absolute," meaning that actual
|
|
resolvable hostnames are specified. However, hostnames can also be
|
|
specified as "relative," meaning that they are specified in relation
|
|
to an externally-specified list of hostnames (e.g., by mpirun's --host
|
|
argument, a hostfile, or a job scheduler).
|
|
.
|
|
.PP
|
|
The "relative" specification is of the form "+n<X>", where X is an
|
|
integer specifying the Xth hostname in the set of all available
|
|
hostnames, indexed from 0. For example:
|
|
.
|
|
|
|
$ cat myrankfile
|
|
rank 0=+n0 slot=1:0-2
|
|
rank 1=+n1 slot=0:0,1
|
|
rank 2=+n2 slot=1-2
|
|
$ mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
|
|
.
|
|
.PP
|
|
Starting with Open MPI v1.7, all socket/core slot locations are
|
|
specified as
|
|
.I logical
|
|
indexes (the Open MPI v1.6 series used
|
|
.I physical
|
|
indexes). You can use tools such as HWLOC's "lstopo" to find the
|
|
logical indexes of sockets and cores.
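.PP
For larger jobs, a rankfile in the relative "+n<X>" form can be
generated rather than written by hand. The following is a minimal
sketch, assuming a POSIX shell; the function name make_rankfile and the
fixed slot list are illustrative assumptions, not part of Open MPI.
.PP
.nf
```shell
# make_rankfile N: print N rankfile lines, placing rank X on the Xth
# allocated host ("+nX", indexed from 0).
# ASSUMPTION: every rank is bound to socket 0, core 0 (slot=0:0); adjust
# the slot list to the logical indexes reported for your hardware.
make_rankfile() {
    count=$1
    rank=0
    while [ "$rank" -lt "$count" ]; do
        echo "rank $rank=+n$rank slot=0:0"
        rank=$((rank + 1))
    done
}

make_rankfile 3 > myrankfile
cat myrankfile

# Hypothetical launch using the generated file:
# mpirun -H aa,bb,cc -rf myrankfile ./a.out
```
.fi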
|
|
.
|
|
.
|
|
.SS Application Context or Executable Program?
|
|
.
|
|
To distinguish the two different forms, \fImpirun\fP
|
|
looks on the command line for the \fI--app\fP option. If
|
|
it is specified, then the file named on the command line is
|
|
assumed to be an application context. If it is not
|
|
specified, then the file is assumed to be an executable program.
|
|
.
|
|
.
|
|
.
|
|
.SS Locating Files
|
|
.
|
|
If no relative or absolute path is specified for a file, Open
|
|
MPI will first look for files by searching the directories specified
|
|
by the \fI--path\fP option. If there is no \fI--path\fP option set or
|
|
if the file is not found at the \fI--path\fP location, then Open MPI
|
|
will search the user's PATH environment variable as defined on the
|
|
source node(s).
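.PP
The search order described above can be pictured with a small shell
function. This is only an illustration of the order (the \fI--path\fP
directories first, then PATH), not Open MPI's own lookup code; the
function name locate_file and its arguments are hypothetical.
.PP
.nf
```shell
# locate_file NAME PATH_DIRS: print the first executable called NAME,
# searching the colon-separated PATH_DIRS first and then $PATH.
locate_file() {
    name=$1
    search="$2:$PATH"
    saved_ifs=$IFS
    IFS=:
    for dir in $search; do
        if [ -n "$dir" ] && [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            IFS=$saved_ifs
            echo "$dir/$name"
            return 0
        fi
    done
    IFS=$saved_ifs
    return 1
}

# With no --path directories given, only PATH is searched.
locate_file sh "" || echo "sh not found"
```
.fi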
|
|
.PP
|
|
If a relative directory is specified, it must be relative to the initial
|
|
working directory determined by the specific starter used. For example, when
|
|
using the rsh or ssh starters, the initial directory is $HOME by default. Other
|
|
starters may set the initial directory to the current working directory from
|
|
the invocation of \fImpirun\fP.
|
|
.
|
|
.
|
|
.
|
|
.SS Current Working Directory
|
|
.
|
|
The \fI\-wdir\fP mpirun option (and its synonym, \fI\-wd\fP) allows
|
|
the user to change to an arbitrary directory before the program is
|
|
invoked. It can also be used in application context files to specify
|
|
working directories on specific nodes and/or for specific
|
|
applications.
|
|
.PP
|
|
If the \fI\-wdir\fP option appears both in a context file and on the
|
|
command line, the context file directory will override the command
|
|
line value.
|
|
.PP
|
|
If the \fI-wdir\fP option is specified, Open MPI will attempt to
|
|
change to the specified directory on all of the remote nodes. If this
|
|
fails, \fImpirun\fP will abort.
|
|
.PP
|
|
If the \fI-wdir\fP option is \fBnot\fP specified, Open MPI will send
|
|
the directory name where \fImpirun\fP was invoked to each of the
|
|
remote nodes. The remote nodes will try to change to that
|
|
directory. If they are unable (e.g., if the directory does not exist on
|
|
that node), then Open MPI will use the default directory determined by
|
|
the starter.
|
|
.PP
|
|
All directory changing occurs before the user's program is invoked; it
|
|
does not wait until \fIMPI_INIT\fP is called.
|
|
.
|
|
.
|
|
.
|
|
.SS Standard I/O
|
|
.
|
|
Open MPI directs UNIX standard input to /dev/null on all processes
|
|
except the MPI_COMM_WORLD rank 0 process. The MPI_COMM_WORLD rank 0 process
|
|
inherits standard input from \fImpirun\fP.
|
|
.B Note:
|
|
The node that invoked \fImpirun\fP need not be the same as the node where the
|
|
MPI_COMM_WORLD rank 0 process resides. Open MPI handles the redirection of
|
|
\fImpirun\fP's standard input to the rank 0 process.
|
|
.PP
|
|
Open MPI directs UNIX standard output and error from remote nodes to the node
|
|
that invoked \fImpirun\fP and prints it on the standard output/error of
|
|
\fImpirun\fP.
|
|
Local processes inherit the standard output/error of \fImpirun\fP and transfer
|
|
to it directly.
|
|
.PP
|
|
Thus it is possible to redirect standard I/O for Open MPI applications by
|
|
using the typical shell redirection procedure on \fImpirun\fP.
|
|
|
|
\fB%\fP mpirun -np 2 my_app < my_input > my_output
|
|
|
|
Note that in this example \fIonly\fP the MPI_COMM_WORLD rank 0 process will
|
|
receive the stream from \fImy_input\fP on stdin. The stdin on all the other
|
|
nodes will be tied to /dev/null. However, the stdout from all nodes will
|
|
be collected into the \fImy_output\fP file.
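.PP
To see this behavior from inside a process, a small stand-in for
\fImy_app\fP can report how much input it received. The following is a
minimal sketch, assuming a POSIX shell; the function name report_stdin is
hypothetical, and the sketch assumes that the launcher sets
OMPI_COMM_WORLD_RANK in each process environment.
.PP
.nf
```shell
# report_stdin: count the lines this process receives on standard input.
# ASSUMPTION: OMPI_COMM_WORLD_RANK is set by the launcher; "unknown" is
# printed when it is not.
report_stdin() {
    rank=${OMPI_COMM_WORLD_RANK:-unknown}
    lines=$(wc -l | tr -d ' ')
    echo "rank $rank read $lines line(s) from stdin"
}

# Standard input tied to /dev/null, as for every process except rank 0:
report_stdin < /dev/null
```
.fi
.PP
Saved as a script that calls report_stdin and launched in place of
\fImy_app\fP, only the rank 0 copy would be expected to report a non-zero
line count.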
|
|
.
|
|
.
|
|
.
|
|
.SS Signal Propagation
|
|
.
|
|
When orterun receives a SIGTERM or SIGINT, it will attempt to kill
|
|
the entire job by sending all processes in the job a SIGTERM, waiting
|
|
a small number of seconds, then sending all processes in the job a
|
|
SIGKILL.
|
|
.
|
|
.PP
|
|
SIGUSR1 and SIGUSR2 signals received by orterun are propagated to
|
|
all processes in the job.
|
|
.
|
|
.PP
|
|
One can turn on forwarding of SIGSTOP and SIGCONT to the program executed
|
|
by mpirun by setting the MCA parameter orte_forward_job_control to 1.
|
|
A SIGTSTP signal to mpirun will then cause a SIGSTOP signal to be sent
|
|
to all of the programs started by mpirun and likewise a SIGCONT signal
|
|
to mpirun will cause a SIGCONT to be sent.
|
|
.
|
|
.PP
|
|
Other signals are not currently propagated
|
|
by orterun.
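.PP
A launched program can install handlers for the signals that are
forwarded to it. The following is a minimal sketch of a shell script
doing so, assuming a POSIX shell; the handler names and the exit code are
illustrative choices, and the script signals itself to stand in for a
signal forwarded by \fImpirun\fP.
.PP
.nf
```shell
# Handlers for signals that reach this process.
usr1_count=0
on_usr1() {
    usr1_count=$((usr1_count + 1))
    echo "SIGUSR1 received ($usr1_count so far)"
}
on_term() {
    echo "SIGTERM received, cleaning up before SIGKILL follows"
    exit 143
}
trap on_usr1 USR1
trap on_term TERM

# Simulate a forwarded SIGUSR1 by signalling this shell itself.
kill -USR1 $$
echo "handled $usr1_count SIGUSR1 signal(s)"
```
.fi
.PP
Because SIGKILL follows SIGTERM after only a few seconds, any cleanup in
the SIGTERM handler should be short.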
|
|
.
|
|
.
|
|
.SS Process Termination / Signal Handling
|
|
.
|
|
During the run of an MPI application, if any process dies abnormally
|
|
(either exiting before invoking \fIMPI_FINALIZE\fP, or dying as the result of a
|
|
signal), \fImpirun\fP will print out an error message and kill the rest of the
|
|
MPI application.
|
|
.PP
|
|
User signal handlers should probably avoid trying to clean up MPI state
|
|
(Open MPI is currently not async-signal-safe; see MPI_Init_thread(3)
|
|
for details about
|
|
.I MPI_THREAD_MULTIPLE
|
|
and thread safety). For example, if a segmentation fault occurs in
|
|
\fIMPI_SEND\fP (perhaps because a bad buffer was passed in) and a user
|
|
signal handler is invoked, if this user handler attempts to invoke
|
|
\fIMPI_FINALIZE\fP, Bad Things could happen since Open MPI was already
"in" MPI when the error occurred. Since \fImpirun\fP will notice that
the process died due to a signal, cleaning up MPI state is probably not
necessary; it is safest for the user to clean up only non-MPI state.
|
|
.
|
|
.
|
|
.
|
|
.SS Process Environment
|
|
.
|
|
Processes in the MPI application inherit their environment from the
|
|
Open RTE daemon on the node on which they are running. The
|
|
environment is typically inherited from the user's shell. On remote
|
|
nodes, the exact environment is determined by the boot MCA module
|
|
used. The \fIrsh\fR launch module, for example, uses either
|
|
\fIrsh\fR or \fIssh\fR to launch the Open RTE daemon on remote nodes, and
|
|
typically executes one or more of the user's shell-setup files before
|
|
launching the Open RTE daemon. When running dynamically linked
|
|
applications which require the \fILD_LIBRARY_PATH\fR environment
|
|
variable to be set, care must be taken to ensure that it is correctly
|
|
set when booting Open MPI.
|
|
.PP
|
|
See the "Remote Execution" section for more details.
|
|
.
|
|
.
|
|
.SS Remote Execution
|
|
.
|
|
Open MPI requires that the \fIPATH\fR environment variable be set to
|
|
find executables on remote nodes (this is typically only necessary in
|
|
\fIrsh\fR- or \fIssh\fR-based environments -- batch/scheduled
|
|
environments typically copy the current environment to the execution
|
|
of remote jobs, so if the current environment has \fIPATH\fR and/or
|
|
\fILD_LIBRARY_PATH\fR set properly, the remote nodes will also have it
|
|
set properly). If Open MPI was compiled with shared library support,
|
|
it may also be necessary to have the \fILD_LIBRARY_PATH\fR environment
|
|
variable set on remote nodes as well (especially to find the shared
|
|
libraries required to run user MPI applications).
|
|
.PP
|
|
However, it is not always desirable or possible to edit shell
|
|
startup files to set \fIPATH\fR and/or \fILD_LIBRARY_PATH\fR. The
|
|
\fI--prefix\fR option is provided for some simple configurations where
|
|
this is not possible.
|
|
.PP
|
|
The \fI--prefix\fR option takes a single argument: the base directory
|
|
on the remote node where Open MPI is installed. Open MPI will use
|
|
this directory to set the remote \fIPATH\fR and \fILD_LIBRARY_PATH\fR
|
|
before executing any Open MPI or user applications. This allows
|
|
running Open MPI jobs without having pre-configured the \fIPATH\fR and
|
|
\fILD_LIBRARY_PATH\fR on the remote nodes.
|
|
.PP
|
|
Open MPI adds the basename of the current
|
|
node's "bindir" (the directory where Open MPI's executables are
|
|
installed) to the prefix and uses that to set the \fIPATH\fR on the
|
|
remote node. Similarly, Open MPI adds the basename of the current
|
|
node's "libdir" (the directory where Open MPI's libraries are
|
|
installed) to the prefix and uses that to set the
|
|
\fILD_LIBRARY_PATH\fR on the remote node. For example:
|
|
.TP 15
|
|
Local bindir:
|
|
/local/node/directory/bin
|
|
.TP
|
|
Local libdir:
|
|
/local/node/directory/lib64
|
|
.PP
|
|
If the following command line is used:
|
|
|
|
\fB%\fP mpirun --prefix /remote/node/directory
|
|
|
|
Open MPI will add "/remote/node/directory/bin" to the \fIPATH\fR
|
|
and "/remote/node/directory/lib64" to the \fILD_LIBRARY_PATH\fR on the
|
|
remote node before attempting to execute anything.
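.PP
The derivation above amounts to appending the basename of each local
directory to the prefix. The following is a minimal sketch of that
arithmetic, assuming a POSIX shell; the function name prefix_dirs is
hypothetical and the directories are the ones from the example.
.PP
.nf
```shell
# prefix_dirs PREFIX LOCAL_BINDIR LOCAL_LIBDIR: print the directories that
# would be added to PATH and LD_LIBRARY_PATH on the remote node.
prefix_dirs() {
    echo "PATH entry: $1/$(basename "$2")"
    echo "LD_LIBRARY_PATH entry: $1/$(basename "$3")"
}

prefix_dirs /remote/node/directory /local/node/directory/bin /local/node/directory/lib64
```
.fi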
|
|
.PP
|
|
The \fI--prefix\fR option is not sufficient if the installation paths
|
|
on the remote node are different from those on the local node (e.g., if "/lib"
|
|
is used on the local node, but "/lib64" is used on the remote node),
|
|
or if the installation paths are something other than a subdirectory
|
|
under a common prefix.
|
|
.PP
|
|
Note that executing \fImpirun\fR via an absolute pathname is
|
|
equivalent to specifying \fI--prefix\fR without the last subdirectory
|
|
in the absolute pathname to \fImpirun\fR. For example:
|
|
|
|
\fB%\fP /usr/local/bin/mpirun ...
|
|
|
|
is equivalent to
|
|
|
|
\fB%\fP mpirun --prefix /usr/local
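.PP
The implied prefix is simply the absolute pathname with its last two
components removed. A minimal sketch, assuming a POSIX shell; the
function name implied_prefix is hypothetical.
.PP
.nf
```shell
# implied_prefix ABSOLUTE_MPIRUN_PATH: strip the executable name and the
# last subdirectory, leaving the value --prefix would receive.
implied_prefix() {
    exe_dir=$(dirname "$1")
    dirname "$exe_dir"
}

implied_prefix /usr/local/bin/mpirun   # prints /usr/local
```
.fi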
|
|
.
|
|
.
|
|
.
|
|
.SS Exported Environment Variables
|
|
.
|
|
All environment variables that are named in the form OMPI_* will automatically
|
|
be exported to new processes on the local and remote nodes.
|
|
The \fI\-x\fP option to \fImpirun\fP can be used to export specific environment
|
|
variables to the new processes. While the syntax of the \fI\-x\fP
|
|
option allows the definition of new variables, note that the parser
|
|
for this option is currently not very sophisticated - it does not even
|
|
understand quoted values. Users are advised to set variables in the
|
|
environment and use \fI\-x\fP to export them, not to define them.
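.PP
The recommended pattern is to define the variable in the shell first and
then name it with \fI\-x\fP. The following is a minimal sketch, assuming a
POSIX shell; MY_SETTING and the program name are hypothetical, and a
plain child shell stands in for a launched process.
.PP
.nf
```shell
# Define and export the variable in the calling environment, so that the
# shell (not the -x parser) handles the quoting.
MY_SETTING="a value with spaces"
export MY_SETTING

# A child process sees the value intact.
sh -c 'echo "child sees: $MY_SETTING"'

# Hypothetical launch that forwards the already-defined variable:
# mpirun -x MY_SETTING -np 2 ./a.out
```
.fi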
|
|
.
|
|
.
|
|
.
|
|
.SS Setting MCA Parameters
|
|
.
|
|
The \fI-mca\fP switch allows the passing of parameters to various MCA
|
|
(Modular Component Architecture) modules.
|
|
.\" Open MPI's MCA modules are described in detail in ompimca(7).
|
|
MCA modules have direct impact on MPI programs because they allow tunable
|
|
parameters to be set at run time (such as which BTL communication device driver
|
|
to use, what parameters to pass to that BTL, etc.).
|
|
.PP
|
|
The \fI-mca\fP switch takes two arguments: \fI<key>\fP and \fI<value>\fP.
|
|
The \fI<key>\fP argument generally specifies which MCA module will receive the value.
|
|
For example, the \fI<key>\fP "btl" is used to select which BTL to be used for
|
|
transporting MPI messages. The \fI<value>\fP argument is the value that is
|
|
passed.
|
|
For example:
|
|
.
|
|
.TP 4
|
|
mpirun -mca btl tcp,self -np 1 foo
|
|
Tells Open MPI to use the "tcp" and "self" BTLs, and to run a single copy of
"foo" on an allocated node.
|
|
.
|
|
.TP
|
|
mpirun -mca btl self -np 1 foo
|
|
Tells Open MPI to use the "self" BTL, and to run a single copy of "foo" on an
|
|
allocated node.
|
|
.\" And so on. Open MPI's BTL MCA modules are described in ompimca_btl(7).
|
|
.PP
|
|
The \fI-mca\fP switch can be used multiple times to specify different
|
|
\fI<key>\fP and/or \fI<value>\fP arguments. If the same \fI<key>\fP is
|
|
specified more than once, the \fI<value>\fPs are concatenated with a comma
|
|
(",") separating them.
|
|
.PP
|
|
Note that the \fI-mca\fP switch is simply a shortcut for setting environment variables.
|
|
The same effect may be accomplished by setting corresponding environment
|
|
variables before running \fImpirun\fP.
|
|
The form of the environment variables that Open MPI sets is:
|
|
|
|
OMPI_MCA_<key>=<value>
|
|
.PP
|
|
Thus, the \fI-mca\fP switch overrides any previously set environment
|
|
variables. The \fI-mca\fP settings similarly override MCA parameters set
|
|
in the
|
|
$OPAL_PREFIX/etc/openmpi-mca-params.conf or $HOME/.openmpi/mca-params.conf
|
|
file.
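.PP
The mapping from a \fI-mca\fP key and value to an environment variable is
mechanical. The following is a minimal sketch, assuming a POSIX shell;
the helper name mca_env is hypothetical.
.PP
.nf
```shell
# mca_env KEY VALUE: print the environment assignment that corresponds to
# "-mca KEY VALUE".
mca_env() {
    echo "OMPI_MCA_$1=$2"
}

# Export the equivalent of "-mca btl tcp,self" for later mpirun invocations.
export "$(mca_env btl tcp,self)"
echo "OMPI_MCA_btl is now: $OMPI_MCA_btl"
```
.fi
.PP
A \fI-mca btl ...\fP argument given on a later \fImpirun\fP command line
would still override the exported value.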
|
|
.
|
|
.PP
|
|
Unknown \fI<key>\fP arguments are still set as
|
|
environment variables -- they are not checked (by \fImpirun\fP) for correctness.
|
|
Illegal or incorrect \fI<value>\fP arguments may or may not be reported -- it
|
|
depends on the specific MCA module.
|
|
.PP
|
|
To find the available component types under the MCA architecture, or to find the
|
|
available parameters for a specific component, use the \fIompi_info\fP command.
|
|
See the \fIompi_info(1)\fP man page for detailed information on the command.
|
|
.
|
|
.SS Exit status
|
|
.
|
|
There is no standard definition for what \fImpirun\fP should return as an exit
|
|
status. After considerable discussion, we settled on the following method for
|
|
assigning the \fImpirun\fP exit status (note: in the following description,
|
|
the "primary" job is the initial application started by mpirun - all jobs that
|
|
are spawned by that job are designated "secondary" jobs):
|
|
.
|
|
.IP \[bu] 2
|
|
if all processes in the primary job normally terminate with exit status 0, we return 0
|
|
.IP \[bu]
|
|
if one or more processes in the primary job normally terminate with non-zero exit status,
|
|
we return the exit status of the process with the lowest MPI_COMM_WORLD rank to have a non-zero status
|
|
.IP \[bu]
|
|
if all processes in the primary job normally terminate with exit status 0, and one or more
|
|
processes in a secondary job normally terminate with non-zero exit status, we (a) return
|
|
the exit status of the process with the lowest MPI_COMM_WORLD rank in the lowest jobid to have a non-zero status, and (b)
|
|
output a message summarizing the exit status of the primary and all secondary jobs.
|
|
.IP \[bu]
|
|
if the command line option --report-child-jobs-separately is set, we will return \fIonly\fP the
|
|
exit status of the primary job. Any non-zero exit status in secondary jobs will be
|
|
reported solely in a summary print statement.
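.PP
The first two rules can be modeled as picking the first non-zero status
when the statuses are listed in MPI_COMM_WORLD rank order. The following
is a minimal sketch for the primary job only, assuming a POSIX shell;
the function name job_exit_status is hypothetical and secondary jobs are
not modeled.
.PP
.nf
```shell
# job_exit_status STATUS...: given per-process exit statuses in rank order,
# print the status of the lowest rank that exited non-zero, or 0 if none did.
job_exit_status() {
    for status in "$@"; do
        if [ "$status" -ne 0 ]; then
            echo "$status"
            return 0
        fi
    done
    echo 0
}

job_exit_status 0 0 3 1   # prints 3: rank 2 is the lowest non-zero rank
job_exit_status 0 0 0 0   # prints 0
```
.fi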
|
|
.
|
|
.PP
|
|
By default, OMPI records and notes that MPI processes exited with non-zero termination status.
|
|
This is generally not considered an "abnormal termination" - i.e., OMPI will not abort an MPI
|
|
job if one or more processes return a non-zero status. Instead, the default behavior simply
|
|
reports the number of processes terminating with non-zero status upon completion of the job.
|
|
.PP
|
|
However, in some cases it can be desirable to have the job abort when any process terminates
|
|
with non-zero status. For example, a non-MPI job might detect a bad result from a calculation
|
|
and want to abort without generating a core file. Or an MPI job might continue past
|
|
a call to MPI_Finalize, but indicate that all processes should abort due to some post-MPI result.
|
|
.PP
|
|
It is not anticipated that this situation will occur frequently. However, in the interest of
|
|
serving the broader community, OMPI now has a means for allowing users to direct that jobs be
|
|
aborted upon any process exiting with non-zero status. Setting the MCA parameter
|
|
"orte_abort_on_non_zero_status" to 1 will cause OMPI to abort all processes once any process
|
|
exits with non-zero status.
|
|
.PP
|
|
Terminations caused in this manner will be reported on the console as an "abnormal termination",
|
|
with the first process to so exit identified along with its exit status.
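.PP
A wrapper script can enable this behavior and act on the resulting exit
status. The following is a minimal sketch, assuming a POSIX shell and
the OMPI_MCA_<key> environment form described under "Setting MCA
Parameters"; the function name run_and_report is hypothetical, and the
\fItrue\fP command stands in for a real \fImpirun\fP invocation.
.PP
.nf
```shell
# Ask Open MPI to abort the whole job once any process exits non-zero.
OMPI_MCA_orte_abort_on_non_zero_status=1
export OMPI_MCA_orte_abort_on_non_zero_status

# run_and_report CMD [ARGS...]: run a command, report a non-zero exit
# status on stderr, and pass that status back to the caller.
run_and_report() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "job ended with exit status $rc" >&2
    fi
    return "$rc"
}

# Stand-in command; a real use would be: run_and_report mpirun -np 4 ./a.out
run_and_report true
```
.fi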
|
|
.PP
|
|
.
|
|
.\" **************************
|
|
.\" Examples Section
|
|
.\" **************************
|
|
.SH EXAMPLES
|
|
Be sure also to see the examples throughout the sections above.
|
|
.
|
|
.TP 4
|
|
mpirun -np 4 -mca btl ib,tcp,self prog1
|
|
Run 4 copies of prog1 using the "ib", "tcp", and "self" BTLs for the
|
|
transport of MPI messages.
|
|
.
|
|
.
|
|
.TP 4
|
|
mpirun -np 4 -mca btl tcp,sm,self
|
|
.br
|
|
--mca btl_tcp_if_include eth0 prog1
|
|
.br
|
|
Run 4 copies of prog1 using the "tcp", "sm" and "self" BTLs for the
|
|
transport of MPI messages, with TCP using only the eth0 interface to
|
|
communicate. Note that other BTLs have similar if_include MCA
|
|
parameters.
|
|
.
|
|
.\" **************************
|
|
.\" Diagnostics Section
|
|
.\" **************************
|
|
.
|
|
.\" .SH DIAGNOSTICS
|
|
.\" .TP 4
|
|
.\" Error Msg:
|
|
.\" Description
|
|
.
|
|
.\" **************************
|
|
.\" Return Value Section
|
|
.\" **************************
|
|
.
|
|
.SH RETURN VALUE
|
|
.
|
|
\fImpirun\fP returns 0 if all processes started by \fImpirun\fP exit after calling
|
|
MPI_FINALIZE. A non-zero value is returned if an internal error occurred in
|
|
mpirun, or one or more processes exited before calling MPI_FINALIZE. If an
|
|
internal error occurred in mpirun, the corresponding error code is returned.
|
|
In the event that one or more processes exit before calling MPI_FINALIZE, the
|
|
return value of the process that \fImpirun\fP first notices died
|
|
before calling MPI_FINALIZE will be returned. Note that, in general, this will
|
|
be the first process that died but is not guaranteed to be so.
|
|
.
|
|
.\" **************************
|
|
.\" See Also Section
|
|
.\" **************************
|
|
.
|
|
.SH SEE ALSO
|
|
MPI_Init_thread(3)
|