.\"
.\" Man page for ORTE's orterun process
.\"
.\" .TH name section center-footer left-footer center-header
.TH ORTERUN 1 "February 2006" "Open MPI" "OPEN MPI COMMANDS"
.\" **************************
.\" Name Section
.\" **************************
.SH NAME
.
orterun, mpirun, mpiexec \- Execute serial and parallel jobs in Open MPI.

.B Note:
.IR mpirun ,
.IR mpiexec ,
and
.I orterun
are all exact synonyms for each other. Using any of the names will
result in exactly identical behavior.
.
.\" **************************
.\" Synopsis Section
.\" **************************
.SH SYNOPSIS
.
.B mpirun
.R [ options ]
.B <program>
.R [ <args> ]
.
.\" **************************
.\" Quick Summary Section
.\" **************************
.SH QUICK SUMMARY
If you are simply looking for how to run an MPI application, you
probably want to use the following command line:

\fBshell$\fP mpirun -np 4 my_mpi_application

This will run 4 copies of \fImy_mpi_application\fR in your current run-time
environment (if running under a supported resource manager, Open MPI's
\fIorterun\fR will usually automatically use the corresponding resource manager
process starter, as opposed to, for example, \fIrsh\fR or \fIssh\fR),
scheduling (by default) in a round-robin fashion by CPU slot. See the
rest of this page for more details.
.
.\" **************************
.\" Options Section
.\" **************************
.SH OPTIONS
.
.I mpirun
will send the name of the directory where it was invoked on the local
node to each of the remote nodes, and attempt to change to that
directory. See the "Current Working Directory" section, below.
.\"
.\" Start options listing
.\" Indent 10 characters from start of first column to start of second column
.TP 10
.B -aborted \fR<#>\fP
Set the maximum number of aborted processes to display.
.
.
.TP
.B --app \fR<appfile>\fP
Provide an appfile, ignoring all other command line options.
.
.
.TP
.B -bynode
Allocate (map) the processes by node in a round-robin scheme.
.
.
.TP
.B -byslot
Allocate (map) the processes by slot in a round-robin scheme. This is the
default.
.
.
.TP
.B -c \fR<#>\fP
Synonym for \fI-np\fP (see below).
.
.
.TP
.B -d, --debug-devel
Enable debugging of OpenRTE.
.
.
.TP
.B --debug
Invoke the user-level debugger indicated by the \fIorte_base_user_debugger\fP
MCA parameter.
.
.
.TP
.B --debug-daemons
Enable debugging of any OpenRTE daemons used by this application.
.
.
.TP
.B --debug-daemons-file
Enable debugging of any OpenRTE daemons used by this application, storing
output in files.
.
.
.TP
.B --debugger
Sequence of debuggers to search for when \fI--debug\fP is used.
.
.
.TP
.B -h, --help
Display help for this command.
.
.
.TP
.B -H \fR<host1,host2,...,hostN>\fP
Synonym for \fI-host\fP (see below).
.
.
.TP
.B -host \fR<host1,host2,...,hostN>\fP
List of hosts on which to invoke processes.
.
.
.TP
.B -hostfile \fR<hostfile>\fP
Provide a hostfile to use.
.
.
.TP
.B -machinefile \fR<machinefile>\fP
Synonym for \fI-hostfile\fP (see above).
.
.
.TP
.B -mca <key> <value>
Send arguments to various MCA modules. See the "MCA" section, below.
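For example (a sketch; the component names shown may vary by
installation), the following restricts MPI point-to-point
communication to the TCP and self transports:
.sp
.RS
\fBshell$\fP mpirun -mca btl tcp,self -np 4 a.out
.RE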
.
.
.TP
.B -n \fR<#>\fP
Synonym for \fI-np\fP (see below).
.
.
.TP
.B --no-daemonize
Do not detach OpenRTE daemons used by this application.
.
.
.TP
.B -np \fR<#>\fP
Run this many copies of the program on the given nodes. This option
indicates that the specified file is an executable program and not an
application schema.
.
.
.TP
.B -nw
Launch the processes and do not wait for their completion. orterun will
complete as soon as a successful launch occurs.
.
.
.TP
.B -path \fR<path>\fP
PATH to be used to look for executables to start processes.
.
.
.TP
.B --tmpdir \fR<dir>\fP
Set the root for the session directory tree for orterun only.
.
.
.TP
.B -tv
Launch processes under the TotalView Debugger.
Deprecated backwards compatibility flag. Synonym for \fI--debug\fP.
.
.
.TP
.B --universe \fR<username@hostname:universe_name>\fP
For this application, set the universe name as:
username@hostname:universe_name
.
.
.TP
.B -v, --verbose
Be verbose.
.
.
.TP
.B -wd \fR<dir>\fP
Change to the directory <dir> before the user's program executes.
Note that if the \fI-wd\fP option appears both on the command line and in an
application schema, the schema will take precedence over the command line.
.
.
.TP
.B -x \fR<env>\fP
Export the specified environment variables to the remote nodes before
executing the program. Existing environment variables can be
specified (see the Examples section, below), or new variable names
specified with corresponding values. The parser for the \fI-x\fP
option is not very sophisticated; it does not even understand quoted
values. Users are advised to set variables in the environment, and
then use \fI-x\fP to export (not define) them.
.
.
.TP
.B <args>
Pass these runtime arguments to every new process. These must always
be the last arguments to \fImpirun\fP. This option is not valid on the command
line if an application schema is specified.
.
.\" **************************
.\" Description Section
.\" **************************
.SH DESCRIPTION
One invocation of
.I mpirun
starts an MPI application running under LAM.
If the application is simply SPMD, the application can be specified on the
.I mpirun
command line.
If the application is MIMD, comprising multiple programs, an application
schema is required in a separate file.
See appschema(5) for a description of the application schema syntax,
but it essentially contains multiple
.I mpirun
command lines, less the command name itself. The ability to specify
different options for different instantiations of a program is another
reason to use an application schema.
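For example, a minimal application schema (program names here are
hypothetical; see appschema(5) for the authoritative syntax) might
look like:
.sp
.RS
# one manager on node 0, one worker on each available CPU
.br
n0 manager
.br
C worker
.RE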
.
.
.
.SS Location Nomenclature
As described above,
.I mpirun
can specify arbitrary locations in the current LAM universe.
Locations can be specified either by CPU or by node (noted by the
"<where>" in the SYNTAX section, above). Note that LAM does not bind
processes to CPUs -- specifying a location "by CPU" is really a
convenience mechanism for SMPs that ultimately maps down to a specific
node.
.PP
Note that LAM effectively numbers MPI_COMM_WORLD ranks from
left-to-right in the <where>, regardless of which nomenclature is
used. This can be important because typical MPI programs tend to
communicate more with their immediate neighbors (i.e., myrank +/- X)
than distant neighbors. When neighbors end up on the same node, the
shmem RPIs can be used for communication rather than the network RPIs,
which can result in faster MPI performance.
.PP
Specifying locations by node will launch one copy of an executable per
specified node. Using a capital "N" tells LAM to use all available
nodes that were lambooted (see lamboot(1)). Ranges of specific nodes
can also be specified in the form "nR[,R]*", where R specifies either
a single node number or a valid range of node numbers in the range of
[0, num_nodes). For example:
.TP 4
mpirun N a.out
Runs one copy of the executable
.I a.out
on all available nodes in the LAM universe. MPI_COMM_WORLD rank 0
will be on n0, rank 1 will be on n1, etc.
.TP
mpirun n0-3 a.out
Runs one copy of the executable
.I a.out
on nodes 0 through 3. MPI_COMM_WORLD rank 0 will be on n0, rank 1
will be on n1, etc.
.TP
mpirun n0-3,8-11,15 a.out
Runs one copy of the executable
.I a.out
on nodes 0 through 3, 8 through 11, and 15. MPI_COMM_WORLD ranks will
be ordered as follows: (0, n0), (1, n1), (2, n2), (3, n3), (4, n8),
(5, n9), (6, n10), (7, n11), (8, n15).
.PP
Specifying by CPU is the preferred method of launching MPI jobs. The
intent is that the boot schema used with lamboot(1) will indicate how
many CPUs are available on each node, and then a single, simple
.I mpirun
command can be used to launch across all of them. As noted above,
specifying CPUs does not actually bind processes to CPUs -- it is only
a convenience mechanism for launching on SMPs. Otherwise, the by-CPU
notation is the same as the by-node notation, except that "C" and "c"
are used instead of "N" and "n".
.PP
Assume in the following example that the LAM universe consists of four
4-way SMPs. So c0-3 are on n0, c4-7 are on n1, c8-11 are on n2, and
c12-15 are on n3.
.TP 4
mpirun C a.out
Runs one copy of the executable
.I a.out
on all available CPUs in the LAM universe. This is typically the
simplest (and preferred) method of launching all MPI jobs (even if it
resolves to one process per node). MPI_COMM_WORLD ranks 0-3 will be
on n0, ranks 4-7 will be on n1, ranks 8-11 will be on n2, and ranks
12-15 will be on n3.
.TP
mpirun c0-3 a.out
Runs one copy of the executable
.I a.out
on CPUs 0 through 3. All four ranks of MPI_COMM_WORLD will be on n0.
.TP
mpirun c0-3,8-11,15 a.out
Runs one copy of the executable
.I a.out
on CPUs 0 through 3, 8 through 11, and 15. MPI_COMM_WORLD ranks 0-3
will be on n0, 4-7 will be on n2, and 8 will be on n3.
.PP
The reason that the by-CPU nomenclature is preferred over the by-node
nomenclature is best shown through example. Consider trying to run
the first CPU example (with the same MPI_COMM_WORLD mapping) with the
by-node nomenclature -- run one copy of
.I a.out
for every available CPU, and maximize the number of local neighbors to
potentially maximize MPI performance. One solution would be to use
the following command:
.TP 4
mpirun n0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 a.out
.PP
This
.IR works ,
but is definitely clunky to type. It is typically easier to use the
by-CPU notation. One might think that the following is equivalent:
.TP 4
mpirun N -np 16 a.out
.PP
This is
.I not
equivalent because the MPI_COMM_WORLD rank mappings will be assigned
by node rather than by CPU. Hence rank 0 will be on n0, rank 1 will
be on n1, etc. Note that the following, however,
.I is
equivalent, because LAM interprets lack of a <where> as "C":
.TP 4
mpirun -np 16 a.out
.PP
However, a "C" can tend to be more convenient, especially for
batch-queuing scripts because the exact number of processes may vary
between queue submissions. Since the batch system will determine the
final number of CPUs available, having a generic script that
effectively says "run on everything you gave me" may lead to more
portable / re-usable scripts.
.PP
Finally, it should be noted that specifying multiple <where> clauses
is perfectly acceptable. As such, mixing of the by-node and by-CPU
syntax is also valid, albeit typically not useful. For example:
.TP 4
mpirun C N a.out
.PP
However, in some cases, specifying multiple <where> clauses can be
useful. Consider a parallel application where MPI_COMM_WORLD rank 0
will be a "manager" and therefore consume very few CPU cycles because
it is usually waiting for "worker" processes to return results.
Hence, it is probably desirable to run one "worker" process on all
available CPUs, and run one extra process that will be the "manager":
.TP 4
mpirun c0 C manager-worker-program
.
.
.
.SS Application Schema or Executable Program?
To distinguish the two different forms,
.I mpirun
looks on the command line for <where> or the \fI-c\fR option. If
neither is specified, then the file named on the command line is
assumed to be an application schema. If either one or both are
specified, then the file is assumed to be an executable program. If
<where> and \fI-c\fR both are specified, then copies of the program
are started on the specified nodes/CPUs according to an internal LAM
scheduling policy. Specifying just one node effectively forces LAM to
run all copies of the program in one place. If \fI-c\fR is given, but
not <where>, then all available CPUs on all LAM nodes are used. If
<where> is given, but not \fI-c\fR, then one copy of the program is
run on each node.
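For example (file names hypothetical), "mpirun my_file" treats
\fImy_file\fP as an application schema, while both "mpirun C my_file"
and "mpirun -c 2 my_file" treat it as an executable program.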
.PP
.
.
.
.SS Program Transfer
By default, LAM searches for executable programs on the target node
where a particular instantiation will run. If the file system is not
shared, the target nodes are homogeneous, and the program is
frequently recompiled, it can be convenient to have LAM transfer the
program from a source node (usually the local node) to each target
node. The \fI-s\fR option specifies this behavior and identifies the
single source node.
.
.
.
.SS Locating Files
LAM looks for an executable program by searching the directories in
the user's PATH environment variable as defined on the source node(s).
This behavior is consistent with logging into the source node and
executing the program from the shell. On remote nodes, the "." path
is the home directory.
.PP
LAM looks for an application schema in three directories: the local
directory, the value of the LAMAPPLDIR environment variable, and
laminstalldir/boot, where "laminstalldir" is the directory where
LAM/MPI was installed.
.
.
.
.SS Standard I/O
LAM directs UNIX standard input to /dev/null on all remote nodes. On
the local node that invoked
.IR mpirun ,
standard input is inherited from
.IR mpirun .
This default (formerly enabled with the \fI-w\fR option) prevents
conflicting access to the terminal.
.PP
LAM directs UNIX standard output and error to the LAM daemon on all
remote nodes. LAM ships all captured output/error to the node that
invoked
.I mpirun
and prints it on the standard output/error of
.IR mpirun .
Local processes inherit the standard output/error of
.I mpirun
and transfer to it directly.
.PP
Thus it is possible to redirect standard I/O for LAM applications by
using the typical shell redirection procedure on
.IR mpirun .
.sp
.RS
% mpirun C my_app < my_input > my_output
.RE
.PP
Note that in this example
.I only
the local node (i.e., the node where mpirun was invoked from) will
receive the stream from my_input on stdin. The stdin on all the other
nodes will be tied to /dev/null. However, the stdout from all nodes
will be collected into the my_output file.
.PP
The
.I \-f
option avoids all the setup required to support standard I/O described
above. Remote processes are completely directed to /dev/null and
local processes inherit file descriptors from lamboot(1).
.
.
.
.SS Pseudo-tty support
The
.I \-pty
option enables pseudo-tty support for process output (it is also
enabled by default). This allows, among other things, for line
buffered output from remote nodes (which is probably what you want).
This option can be disabled with the
.I \-npty
switch.
.PP
.
.
.
.SS Process Termination / Signal Handling
During the run of an MPI application, if any rank dies abnormally
(either exiting before invoking
.IR MPI_FINALIZE ,
or dying as the result of a signal),
.I mpirun
will print out an error message and kill the rest of the MPI
application.
.PP
By default, LAM/MPI only installs a signal handler for one signal in
user programs (SIGUSR2 by default, but this can be overridden when LAM
is configured and built). Therefore, it is safe for users to install
their own signal handlers in LAM/MPI programs (LAM notices
death-by-signal cases by examining the process' return status provided
by the operating system).
.PP
User signal handlers should probably avoid trying to clean up MPI state
-- LAM is neither thread-safe nor async-signal-safe. For example, if
a seg fault occurs in
.I MPI_SEND
(perhaps because a bad buffer was passed in) and a user signal handler
is invoked, if this user handler attempts to invoke
.IR MPI_FINALIZE ,
Bad Things could happen since LAM/MPI was already "in" MPI when the
error occurred. Since
.I mpirun
will notice that the process died due to a signal, it is probably
safest for the user to clean up only non-MPI state.
.PP
If the
.I -sigs
option is used with
.IR mpirun ,
LAM/MPI will install several signal handlers locally on each rank
to catch signals, print out error messages, and kill the rest of the
MPI application. This is somewhat redundant behavior since this is
now all handled by
.IR mpirun ,
but it has been left for backwards compatibility.
.
.
.
.SS Process Exit Statuses
The
.IR -sa ,
.IR -sf ,
and
.I -p
parameters can be used to display the exit statuses of the individual
MPI processes as they terminate.
.I -sa
forces the exit statuses to be displayed for all processes;
.I -sf
only displays the exit statuses if at least one process terminates
either by a signal or a non-zero exit status (note that exiting before
invoking
.I MPI_FINALIZE
will cause a non-zero exit status).
.PP
The status of each process is printed out, one per line, in the
following format:
.sp
.RS
prefix_string node pid killed status
.RE
.PP
If
.I killed
is 1, then
.I status
is the signal number. If
.I killed
is 0, then
.I status
is the exit status of the process.
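For example, the following (hypothetical) line reports that the
process with PID 1234 on node n3 was killed by signal 11:
.sp
.RS
mpirun: n3 1234 1 11
.RE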
.PP
The default
.I prefix_string
is "mpirun:", but the
.I -p
option can be used to override this string.
.
.
.
.SS Current Working Directory
The default behavior of mpirun has changed with respect to the
directory that processes will be started in.
.PP
The
.I \-wd
option to mpirun allows the user to change to an arbitrary directory
before their program is invoked. It can also be used in application
schema files to specify working directories on specific nodes and/or
for specific applications.
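For example (directory and program names hypothetical):
.sp
.RS
% mpirun -wd /scratch/run1 C my_app
.RE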
.PP
If the
.I \-wd
option appears both in a schema file and on the command line, the
schema file directory will override the command line value.
.PP
The
.I \-D
option will change the current working directory to the directory
where the executable resides. It cannot be used in application schema
files.
.I \-wd
is mutually exclusive with
.IR \-D .
.PP
If neither
.I \-wd
nor
.I \-D
is specified, the local node will send the directory name where
mpirun was invoked from to each of the remote nodes. The remote nodes
will then try to change to that directory. If they fail (e.g., if the
directory does not exist on that node), they will start from the
user's home directory.
.PP
All directory changing occurs before the user's program is invoked; it
does not wait until
.I MPI_INIT
is called.
.
.
.
.SS Process Environment
Processes in the MPI application inherit their environment from the
LAM daemon upon the node on which they are running. The environment
of a LAM daemon is fixed upon booting of the LAM with lamboot(1) and
is typically inherited from the user's shell. On the origin node,
this will be the shell from which lamboot(1) was invoked; on remote
nodes, the exact environment is determined by the boot SSI module used
by lamboot(1). The rsh boot module, for example, uses rsh or ssh
to launch the LAM daemon on remote nodes, and typically executes one
or more of the user's shell-setup files before launching the LAM
daemon. When running dynamically linked applications which require
the LD_LIBRARY_PATH environment variable to be set, care must be taken
to ensure that it is correctly set when booting the LAM.
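For example, with a Bourne-style shell (the library path and boot
schema name here are hypothetical), one way to do this is:
.sp
.RS
% LD_LIBRARY_PATH=/opt/lam/lib:$LD_LIBRARY_PATH
.br
% export LD_LIBRARY_PATH
.br
% lamboot my_boot_schema
.RE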
.
.
.
.SS Exported Environment Variables
All environment variables that are named in the form LAM_MPI_*,
LAM_IMPI_*, or IMPI_* will automatically be exported to new processes
on the local and remote nodes. This exporting may be inhibited with
the
.I \-nx
option.
.PP
Additionally, the
.I \-x
option to
.IR mpirun
can be used to export specific environment variables to the new
processes. While the syntax of the
.I \-x
option allows the definition of new variables, note that the parser
for this option is currently not very sophisticated -- it does not even
understand quoted values. Users are advised to set variables in the
environment and use
.I \-x
to export them; not to define them.
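For example (variable and program names hypothetical):
.sp
.RS
% FOO=bar ; export FOO
.br
% mpirun -x FOO C my_app
.RE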
.
.
.
.SS Trace Generation
Two switches control trace generation from processes running under LAM
and both must be in the on position for traces to actually be
generated. The first switch is controlled by
.I mpirun
and the second switch is initially set by
.I mpirun
but can be toggled at runtime with MPIL_Trace_on(2) and
MPIL_Trace_off(2). The \fI-t\fR (\fI-ton\fR is equivalent) and
\fI-toff\fR options both turn on the first switch. Otherwise the first
switch is off and calls to MPIL_Trace_on(2) in the application program
are ineffective. The \fI-t\fR option also turns on the second switch.
The \fI-toff\fR option turns off the second switch. See
MPIL_Trace_on(2) and lamtrace(1) for more details.
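For example (program name hypothetical), the following launches with
both switches on, so traces are generated from the start of the run:
.sp
.RS
% mpirun -t C my_app
.RE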
.
.
.
.SS MPI Data Conversion
LAM's MPI library converts MPI messages from local representation to
LAM representation upon sending them and then back to local
representation upon receiving them. In the case of a LAM consisting
of a homogeneous network of machines where the local representation
differs from the LAM representation, this can result in unnecessary
conversions.
.P
The \fI-O\fR switch used to be necessary to indicate to LAM whether
the multicomputer was homogeneous or not. LAM now automatically
determines whether a given MPI job is homogeneous or not. The
.I -O
flag will silently be accepted for backwards compatibility, but it is
ignored.
2006-02-17 02:38:03 +03:00
|
|
|
.
|
|
|
|
.
|
|
|
|
.
|
2006-02-16 16:29:37 +03:00
|
|
|
.SS SSI (System Services Interface)
|
|
|
|
The
|
|
|
|
.I -ssi
|
|
|
|
switch allows the passing of parameters to various SSI modules. LAM's
|
|
|
|
SSI modules are described in detail in lamssi(7). SSI modules have
|
|
|
|
direct impact on MPI programs because they allow tunable parameters to
|
|
|
|
be set at run time (such as which RPI communication device driver to
|
|
|
|
use, what parameters to pass to that RPI, etc.).
|
|
|
|
.PP
|
|
|
|
The
|
|
|
|
.I -ssi
|
|
|
|
switch takes two arguments:
|
|
|
|
.I <key>
|
|
|
|
and
|
|
|
|
.IR <value> .
|
|
|
|
The
|
|
|
|
.I <key>
|
|
|
|
argument generally specifies which SSI module will receive the value.
|
|
|
|
For example, the
|
|
|
|
.I <key>
|
|
|
|
"rpi" is used to select which RPI to be used for transporting MPI
|
|
|
|
messages. The
|
|
|
|
.I <value>
|
|
|
|
argument is the value that is passed. For example:
|
|
|
|
.TP 4
|
|
|
|
mpirun -ssi rpi lamd N foo
|
|
|
|
Tells LAM to use the "lamd" RPI and to run a single copy of "foo" on
|
|
|
|
every node.
|
|
|
|
.TP
|
|
|
|
mpirun -ssi rpi tcp N foo
|
|
|
|
Tells LAM to use the "tcp" RPI.
|
|
|
|
.TP
|
|
|
|
mpirun -ssi rpi sysv N foo
|
|
|
|
Tells LAM to use the "sysv" RPI.
|
|
|
|
.PP
|
|
|
|
And so on. LAM's RPI SSI modules are described in lamssi_rpi(7).
|
|
|
|
.PP
|
|
|
|
The
|
|
|
|
.I -ssi
|
|
|
|
switch can be used multiple times to specify different
|
|
|
|
.I <key>
|
|
|
|
and/or
|
|
|
|
.I <value>
|
|
|
|
arguments. If the same
|
|
|
|
.I <key>
|
|
|
|
is specified more than once, the
|
|
|
|
.IR <value> s
|
|
|
|
are concatenated with a comma (",") separating them.
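For example (the key "foo" here is hypothetical), "-ssi foo a -ssi foo b"
results in the value "a,b" being passed for the key "foo".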
.PP
Note that the
.I -ssi
switch is simply a shortcut for setting environment variables. The
same effect may be accomplished by setting corresponding environment
variables before running
.IR mpirun .
The form of the environment variables that LAM sets is:
.IR LAM_MPI_SSI_<key>=<value> .
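For example, with a Bourne-style shell, the following is equivalent to
"mpirun -ssi rpi tcp N foo" from the examples above:
.sp
.RS
% LAM_MPI_SSI_rpi=tcp ; export LAM_MPI_SSI_rpi
.br
% mpirun N foo
.RE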
.PP
Note that the
.I -ssi
switch overrides any previously set environment variables. Also note
that unknown
.I <key>
arguments are still set as environment variables -- they are not
checked (by
.IR mpirun )
for correctness. Illegal or incorrect
.I <value>
arguments may or may not be reported -- it depends on the specific SSI
module.
.PP
The
.I -ssi
switch obsoletes the old
.I -c2c
and
.I -lamd
switches. These switches used to be relevant because LAM could only
have two RPI's available at a time: the lamd RPI and one of the C2C
RPIs. This is no longer true -- all RPI's are now available and
choosable at run-time. Selecting the lamd RPI is shown in the
examples above.
The
.I -c2c
switch has no direct translation since "C2C" used to refer to all
other RPI's that were not the lamd RPI. As such,
.I -ssi rpi <value>
must be used to select the specific desired RPI (whether it is "lamd"
or one of the other RPI's).
.
.
.
.SS Guaranteed Envelope Resources
By default, LAM will guarantee a minimum amount of message envelope
buffering to each MPI process pair and will impede or report an error
to a process that attempts to overflow this system resource. This
robustness and debugging feature is implemented in a machine specific
manner when direct communication is used. For normal LAM
communication via the LAM daemon, a protocol is used. The \fI-nger\fR
option disables GER and the measures taken to support it. The minimum
GER is configured by the system administrator when LAM is installed.
See MPI(7) for more details.
.
.\" **************************
.\" Examples Section
.\" **************************
.SH EXAMPLES
Be sure to also see the examples in the "Location Nomenclature"
section, above.
.TP 4
mpirun N prog1
Load and execute prog1 on all nodes. Search the user's $PATH for the
executable file on each node.
.TP
mpirun -c 8 prog1
Run 8 copies of prog1 wherever LAM wants to run them.
.TP
mpirun n8-10 -v -nw -s n3 prog1 -q
Load and execute prog1 on nodes 8, 9, and 10. Search for prog1 on
node 3 and transfer it to the three target nodes. Report as each
process is created. Give "-q" as a command line to each new process.
Do not wait for the processes to complete before exiting
.IR mpirun .
.TP
mpirun -v myapp
Parse the application schema, myapp, and start all processes specified
in it. Report as each process is created.
.TP
mpirun -npty -wd /work/output -x DISPLAY C my_application

Start one copy of "my_application" on each available CPU. The number
of available CPUs on each node was previously specified when LAM was
booted with lamboot(1). As noted above,
.I mpirun
will schedule adjoining ranks in
.I MPI_COMM_WORLD
on the same node where possible. For example, if n0 has a CPU count
of 8, and n1 has a CPU count of 4,
.I mpirun
will place
.I MPI_COMM_WORLD
ranks 0 through 7 on n0, and 8 through 11 on n1. This tends to
maximize on-node communication for many parallel applications; when
used in conjunction with the multi-protocol network/shared memory RPIs
in LAM (see the RELEASE_NOTES and INSTALL files with the LAM
distribution), overall communication performance can be quite good.
Also disable pseudo-tty support, change directory to /work/output, and
export the DISPLAY variable to the new processes (perhaps
my_application will invoke an X application such as xv to display
output).
.
.\" **************************
.\" Diagnostics Section
.\" **************************
.
.SH DIAGNOSTICS
.TP 4
mpirun: Exec format error
This usually means that either a number of processes or an appropriate
<where> clause was not specified, indicating that LAM does not know
how many processes to run. See the EXAMPLES and "Location
Nomenclature" sections, above, for examples on how to specify how many
processes to run, and/or where to run them. However, it can also mean
that a non-ASCII character was detected in the application schema.
This is usually a command line usage error where
.I mpirun
is expecting an application schema and an executable file was given.
.TP
mpirun: syntax error in application schema, line XXX
The application schema cannot be parsed because of a usage or syntax error
on the given line in the file.
.TP
<filename>: No such file or directory
This error can occur in two cases. Either the named file cannot be
located or it has been found but the user does not have sufficient
permissions to execute the program or read the application schema.
.
.\" **************************
.\" Return Value Section
.\" **************************
.
.SH RETURN VALUE
.I mpirun
returns 0 if all ranks started by
.I mpirun
exit after calling MPI_FINALIZE. A non-zero value is returned if an
internal error occurred in mpirun, or one or more ranks exited before
calling MPI_FINALIZE. If an internal error occurred in mpirun, the
corresponding error code is returned. In the event that one or more ranks
exit before calling MPI_FINALIZE, the return value of the rank of the
process that
.I mpirun
first notices died before calling MPI_FINALIZE will be returned. Note
that, in general, this will be the first rank that died but is not
guaranteed to be so.
.PP
However, note that if the
.I \-nw
switch is used, the return value from mpirun does not indicate the exit status
of the ranks.
.
.\" **************************
.\" See Also Section
.\" **************************
.
.SH SEE ALSO
bhost(5),
lamexec(1),
lamssi(7),
lamssi_rpi(7),
lamtrace(1),
loadgo(1),
MPIL_Trace_on(2),
mpimsg(1),
mpitask(1)