% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University.
% All rights reserved.
% Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
% All rights reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
% University of Stuttgart. All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
% All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%

\chapter{Available ORTE Components}
\label{sec:mca-orte}

There is currently only one type of ORTE component that is visible to
users: \kind{boot}, which is used to start the Open MPI run-time
environment, most often through the \icmd{lamboot} command. The
\cmd{lamboot} command itself is discussed in
Section~\ref{sec:commands-lamboot}
(page~\pageref{sec:commands-lamboot}); the discussion below focuses on
the boot modules that make up the ``back end'' implementation of
\cmd{lamboot}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Using the Open Run-Time Environment}
\label{sec:mca-orte-pls}
\index{ORTE MCA components|(}
\index{MCA boot components|see {boot MCA components}}

{\Huge JMS needs massive overhaul}

Open MPI provides a number of modules for starting the \cmd{lamd}
control daemons. In most cases, the \cmd{lamd}s are started using the
\icmd{lamboot} command. In previous versions of Open MPI,
\icmd{lamboot} could only use \icmd{rsh} or \icmd{ssh} for starting
the Open MPI run-time environment on remote nodes. In Open MPI
\ompiversion, it is possible to use a variety of mechanisms for this
process startup. The following mechanisms are available in Open MPI
\ompiversion:

\begin{itemize}
\item BProc
\item Globus (beta-level support)
\item \cmd{rsh} / \cmd{ssh}
\item OpenPBS / PBS Pro / Torque (using the Task Management interface)
\changebegin{7.1}
\item SLURM (using its native interface)
\changeend{7.1}
\end{itemize}

These mechanisms are discussed in detail below. Note that the
sections below each assume that support for these modules has been
compiled into Open MPI. The \icmd{laminfo} command can be used to
determine exactly which modules are supported in your installation
(see Section~\ref{sec:commands-laminfo},
page~\pageref{sec:commands-laminfo}).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Boot Schema Files (a.k.a., ``Hostfiles'' or
  ``Machinefiles'')}
\label{sec:mca-orte-pls-schema}
\index{boot schema}
\index{hostfile|see {boot schema}}
\index{machinefile|see {boot schema}}
\cmdindex{lamboot}{boot schema file}

Before discussing any of the specific boot MCA modules, this section
discusses the boot schema file, commonly referred to as a ``hostfile''
or a ``machinefile''. Most (but not all) boot MCA modules require a
boot schema, and the text below makes frequent mention of them.
Hence, it is worth discussing them before getting into the details of
each boot MCA module.

A boot schema is a text file that, in its simplest form, simply lists
every host that the Open MPI run-time environment will be invoked on. For
example:

\lstset{style=lam-shell}
\begin{lstlisting}
# This is my boot schema
inky.cluster.example.com
pinky.cluster.example.com
blinky.cluster.example.com
clyde.cluster.example.com
\end{lstlisting}

Lines beginning with ``{\tt \#}'' are treated as comments and are
ignored. Each non-blank, non-comment line must, at a minimum, list a
host. Specifically, the first token on each line must specify a host
(although the definition of how that host is specified may differ
between boot modules).

However, each line can also specify arbitrary ``key=value'' pairs. A
common global key is ``{\tt cpu}''. This key takes an integer value
and indicates to Open MPI how many CPUs are available for Open MPI to use. If
the key is not present, a value of 1 is assumed. This number does
{\em not} need to reflect the physical number of CPUs -- it can be
smaller than, equal to, or greater than the number of physical CPUs in
the machine. It is solely used as a shorthand notation for
\icmd{mpirun}'s ``C'' notation, meaning ``launch one process per CPU
as specified in the boot schema file.'' For example, in the following
boot schema:

\lstset{style=lam-shell}
\begin{lstlisting}
inky.cluster.example.com cpu=2
pinky.cluster.example.com cpu=4
blinky.cluster.example.com cpu=4
# clyde doesn't mention a cpu count, and is therefore implicitly 1
clyde.cluster.example.com
\end{lstlisting}

\noindent issuing the command ``{\tt mpirun C foo}'' would actually
launch 11 copies of \cmd{foo}: 2 on \host{inky}, 4 on \host{pinky}, 4
on \host{blinky}, and 1 on \host{clyde}.

Note that listing a host more than once has the same effect as
incrementing the CPU count. The following boot schema has the same
effect as the previous example (i.e., CPU counts of 2, 4, 4, and 1,
respectively):

\lstset{style=lam-shell}
\begin{lstlisting}
# inky has a CPU count of 2
inky.cluster.example.com
inky.cluster.example.com
# pinky has a CPU count of 4
pinky.cluster.example.com
pinky.cluster.example.com
pinky.cluster.example.com
pinky.cluster.example.com
# blinky has a CPU count of 4
blinky.cluster.example.com
blinky.cluster.example.com
blinky.cluster.example.com
blinky.cluster.example.com
# clyde only has 1 CPU
clyde.cluster.example.com
\end{lstlisting}

Other keys are defined on a per-boot-MCA-module basis, and are
described below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Minimum Requirements}
\label{sec:mca-orte-pls-min-reqs}

In order to successfully launch a process on a remote node, several
requirements must be met. Although each of the boot modules has
different specific requirements, all of them share the following
conditions for successful operation:

\begin{enumerate}
\item Each target host must be reachable and operational.

\item The user must be able to execute arbitrary processes on the
  target.

\item The Open MPI executables must be locatable on that machine. This
  typically involves using the shell's search path, the
  \ienvvar{Open MPIHOME} environment variable, or a boot-module-specific
  mechanism.

\item The user must be able to write to the Open MPI session directory
  (typically somewhere under \file{/tmp}; see
  Section~\ref{sec:misc-session-directory},
  page~\pageref{sec:misc-session-directory}).

\item All hosts must be able to resolve the fully-qualified domain
  name (FQDN) of all the machines being booted (including itself).

\item Unless there is only one host being booted, any host
  resolving to the IP address 127.0.0.1 cannot be included in the list
  of hosts.
\end{enumerate}

If any of these conditions is not met, \cmd{lamboot} will fail.
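
A quick way to check the remote-execution and name-resolution
requirements is to run a trivial command on each target host from the
node where \cmd{lamboot} will be invoked. The following is a hedged
example that assumes an \cmd{ssh}-based setup and the hypothetical
host \host{inky.cluster.example.com}; it should complete without a
password prompt and without any extra output:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ ssh inky.cluster.example.com hostname
inky.cluster.example.com
\end{lstlisting}
% stupid emacs mode: $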

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Selecting a \kind{boot} Module}

Only one \kind{boot} module will be selected; it will be used for the
life of the Open MPI universe. As such, module priority values are the
only factor used to determine which available module should be
selected.
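
Since priorities drive the selection, a particular module can be
favored for a single boot by raising its priority parameter (the
per-module priority parameters are listed in the tables later in this
chapter). The following is a hedged sketch, reusing the \cmd{-ssi}
command-line convention shown elsewhere in this chapter, that raises
the \boot{rsh} module's priority:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot_rsh_priority 75 hostfile
\end{lstlisting}
% stupid emacs mode: $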

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{\kind{boot} MCA Parameters}

On many kinds of networks, Open MPI can know exactly which nodes should be
making connections while booting the Open MPI run-time environment, and
promiscuous connections (i.e., allowing any node to connect) are
discouraged. However, this is not possible in some complex network
configurations and promiscuous connections {\em must} be enabled.

By default, Open MPI's base \kind{boot} MCA startup protocols disable
promiscuous connections. However, this behavior can be overridden
when Open MPI is configured and at run-time. If the MCA parameter
\issiparam{boot\_\-base\_\-promisc} is set to an empty value, or set to
the integer value 1, promiscuous connections will be accepted when
the Open MPI RTE is booted.
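
For example, the following is a hedged sketch of enabling promiscuous
connections at run-time for a single boot, using the same \cmd{-ssi}
command-line convention shown elsewhere in this chapter:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot_base_promisc 1 hostfile
\end{lstlisting}
% stupid emacs mode: $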

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \boot{bproc} Module}
\index{bproc boot MCA module@\boot{bproc} boot MCA module}
\index{boot MCA modules!bproc@\boot{bproc}}

The Beowulf Distributed Process Space (BProc)
project\footnote{\url{http://bproc.sourceforge.net/}} is a set of kernel
modifications, utilities, and libraries which allow a user to start
processes on other machines in a Beowulf-style cluster. Remote
processes started with this mechanism appear in the process table of
the front-end machine in a cluster.

Open MPI functionality has been tested with BProc version 3.2.5. Prior
versions had a bug that affected at least some Open MPI functionality.
It is strongly recommended to upgrade to at least version 3.2.5 before
attempting to use the Open MPI native BProc capabilities.

%%%%%

\subsubsection{Minimum Requirements}

Several of the minimum requirements listed in
Section~\ref{sec:mca-orte-pls-min-reqs} will already be met in a BProc
environment because BProc will copy \cmd{lamboot}'s entire environment
(including the \envvar{PATH}) to the remote node. Hence, if
\cmd{lamboot} is in the user's path on the local node, it will also
[automatically] be in the user's path on the remote node.

However, one of the minimum requirements conditions (``The user must
be able to execute arbitrary processes on the target'') deserves a
BProc-specific clarification. BProc has its own internal permission
system for determining if users are allowed to execute on specific
nodes. The system is similar to the user/group/other mechanism
typically used in many Unix filesystems. Hence, in order for a user
to successfully \cmd{lamboot} on a BProc cluster, he/she must have
BProc execute permissions on each of the target nodes. Consult the
BProc documentation for more details.

%%%%%

\subsubsection{Usage}

In most situations, the \cmd{lamboot} command (and related commands)
should automatically ``know'' to use the \boot{bproc} boot MCA module
when running on the BProc head node; no additional command line
parameters or environment variables should be required.
%
Specifically, when running in a BProc environment, the \boot{bproc}
module will report that it is available, and artificially inflate its
priority to a relatively high value in order to influence the boot
module selection process.
%
However, the BProc boot module can be forced by specifying the
\issiparam{boot} MCA parameter with the value of
\issivalue{boot}{bproc}.
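
For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot bproc hostfile
\end{lstlisting}
% stupid emacs mode: $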

Running \cmd{lamboot} on a BProc cluster is just like running
\cmd{lamboot} in a ``normal'' cluster. Specifically, you provide a
boot schema file (i.e., a list of nodes to boot on) and run
\cmd{lamboot} with it. For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot hostfile
\end{lstlisting}
% stupid emacs mode: $

Note that when using the \boot{bproc} module, \cmd{lamboot} will only
function properly from the head node. If you launch \cmd{lamboot}
from a client node, it will likely either fail outright, or fall back
to a different boot module (e.g., \cmd{rsh}/\cmd{ssh}).

It is suggested that the \file{hostfile} file contain hostnames in the
style that BProc prefers -- integer numbers. For example,
\file{hostfile} may contain the following:

\lstset{style=lam-shell}
\begin{lstlisting}
-1
0
1
2
3
\end{lstlisting}

\noindent which boots on the BProc front end node (-1) and four slave
nodes (0, 1, 2, 3). Note that using IP hostnames will also work, but
using integer numbers is recommended.

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-orte-pls-bproc-mca-params} lists the MCA parameters
that are available to the \boot{bproc} module.

\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-bproc\_\-priority}{50}{Default priority level.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{bproc} boot module.}
\label{tbl:mca-orte-pls-bproc-mca-params}
\end{table}

%%%%%

\subsubsection{Special Notes}

After booting, Open MPI will, by default, not schedule MPI jobs to run on
the BProc front end. Specifically, Open MPI implicitly sets the
``no-schedule'' attribute on the -1 node in a BProc cluster. See
Section~\ref{sec:commands-lamboot}
(page~\pageref{sec:commands-lamboot}) for more detail about this
attribute and boot schemas in general, and
Section~\ref{sec:commands-lamboot-no-schedule}
(page~\pageref{sec:commands-lamboot-no-schedule}).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \boot{globus} Module}
\index{globus boot MCA module@\boot{globus} boot MCA module}
\index{boot MCA modules!globus@\boot{globus}}

Open MPI \ompiversion\ includes beta support for Globus.
Specifically, only limited types of execution are possible. The Open
MPI Team would appreciate feedback from the Globus community on
expanding Globus support in Open MPI.

%%%%%

\subsubsection{Minimum Requirements}

Open MPI jobs in a Globus environment can only be started on nodes using
the ``fork'' job manager for the Globus gatekeeper. Other job
managers are not yet supported.

%%%%%

\subsubsection{Usage}

Starting the Open MPI run-time environment in a Globus environment makes
use of the Globus Resource Allocation Manager (GRAM) client
\icmd{globus-job-run}.
%
The Globus boot MCA module will never run automatically; it must
always be specifically requested by setting the \issiparam{boot} MCA
parameter to \issivalue{boot}{globus}. Specifically, although the
\boot{globus} module will report itself available if
\icmd{globus-job-run} can be found in the \envvar{PATH}, the default
priority will be quite low, effectively ensuring that it will not be
selected unless it is the only module available (which will only occur
if the \ssiparam{boot} parameter is set to \issivalue{boot}{globus}).

Open MPI needs to be able to find the Globus executables. This can be
accomplished either by adding the appropriate directory to your path,
or by setting the \ienvvar{GLOBUS\_\-LOCATION} environment variable.
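
For example, with a Bourne-style shell and a hypothetical Globus
installation path:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ GLOBUS_LOCATION=/opt/globus
shell$ export GLOBUS_LOCATION
\end{lstlisting}
% stupid emacs mode: $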

Additionally, the \ienvvar{Open MPI\_\-MPI\_\-SESSION\_\-SUFFIX}
environment variable should be set to a unique value. This ensures
that this instance of the Open MPI universe does not conflict with any
other, concurrent Open MPI universes that are running under the same
username on nodes in the Globus environment. Although any value can
be used for this variable, it is probably best to have some kind of
organized format, such as {\tt
  <your\_\-username>-<some\_\-long\_\-random\_\-number>}.

Next, create a boot schema to use with \cmd{lamboot}.
%
Hosts are listed by their Globus contact strings (see the Globus
manual for more information about contact strings). In cases where
the Globus gatekeeper is running as an \cmd{inetd} service on the node,
the contact string will simply be the hostname. If the contact string
contains whitespace, the {\em entire} contact string must be enclosed
in quotes (i.e., not just the values with whitespace).
%
For example, if your contact string is:

\centerline{\tt host1:port1:/O=xxx/OU=yyy/CN=aaa bbb ccc}

Then you will need to have it listed as:

\centerline{\tt "host1:port1:/O=xxx/OU=yyy/CN=aaa bbb ccc"}

The following will not work:

\centerline{\tt host1:port1:/O=xxx/OU=yyy/CN="aaa bbb ccc"}

Each host in the boot schema must also have a ``{\tt
  lam\_\-install\_\-path}'' key indicating the absolute directory
where Open MPI is installed. This value is mandatory because you
cannot rely on the \ienvvar{PATH} environment variable in a Globus
environment because users' ``dot'' files are not executed in Globus
jobs (and therefore the \envvar{PATH} environment variable is not
provided). Other keys can be used as well; {\tt
  lam\_\-install\_\-path} is the only mandatory key.

Here is a sample Globus boot schema:

\changebegin{7.0.5}
\lstset{style=lam-shell}
\begin{lstlisting}
# Globus boot schema
"inky.mycluster:12853:/O=MegaCorp/OU=Mine/CN=HPC Group" prefix=/opt/lam cpu=2
"pinky.yourcluster:3245:/O=MegaCorp/OU=Yours/CN=HPC Group" prefix=/opt/lam cpu=4
"blinky.hiscluster:23452:/O=MegaCorp/OU=His/CN=HPC Group" prefix=/opt/lam cpu=4
"clyde.hercluster:82342:/O=MegaCorp/OU=Hers/CN=HPC Group" prefix=/software/lam
\end{lstlisting}
\changeend{7.0.5}

Once you have this boot schema, the \cmd{lamboot} command can be used
to launch it. Note, however, that unlike the other boot MCA modules,
the Globus boot module will never be automatically selected by Open MPI --
it must be selected manually with the \issiparam{boot} MCA parameter
with the value \issivalue{boot}{globus}.

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot globus hostfile
\end{lstlisting}
% stupid emacs mode: $

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-orte-pls-globus-mca-params} lists the MCA
parameters that are available to the \boot{globus} module.

\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-globus\_\-priority}{3}{Default priority level.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{globus} boot module.}
\label{tbl:mca-orte-pls-globus-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \boot{rsh} Module (including \cmd{ssh})}
\index{rsh (ssh) boot MCA module@\boot{rsh} (\cmd{ssh}) boot MCA module}
\index{boot MCA modules!rsh (rsh/ssh)@\boot{rsh} (\cmd{rsh}/\cmd{ssh})}

The \cmd{rsh}/\cmd{ssh} boot MCA module is typically the ``least
common denominator'' boot module. When not in an otherwise
``special'' environment (such as a batch scheduler), the
\cmd{rsh}/\cmd{ssh} boot module is typically used to start the Open MPI
run-time environment.

%%%%%

\subsubsection{Minimum Requirements}

In addition to the minimum requirements listed in
Section~\ref{sec:mca-orte-pls-min-reqs}, the following additional
conditions must also be met for a successful \cmd{lamboot} using the
\cmd{rsh} / \cmd{ssh} boot module:

\begin{enumerate}
\item The user must be able to execute arbitrary commands on each
  target host without being prompted for a password.

\item The shell's start-up script must not print anything on standard
  error. The user can take advantage of the fact that \cmd{rsh} /
  \cmd{ssh} will start the shell non-interactively. The start-up
  script can exit early in this case, before executing many commands
  relevant only to interactive sessions and likely to generate output.

  \changebegin{7.1}
  This has now been changed in version 7.1; if the MCA parameter
  \issiparam{boot\_\-rsh\_\-ignore\_\-stderr} is nonzero, any output
  on standard error will {\em not} be treated as an error.
  \changeend{7.1}
\end{enumerate}

Section~\ref{sec:getting-started} (page~\pageref{sec:getting-started})
provides a short tutorial on using the \cmd{rsh} / \cmd{ssh} boot
module, including tips on setting up ``dot'' files, setting up
password-less remote execution, etc.

%%%%%

\subsubsection{Usage}

Using \cmd{rsh}, \cmd{ssh}, or another remote-execution agent is
probably the most common method for starting the Open MPI run-time
execution environment. The boot schema typically lists the hostnames,
CPU counts, and an optional username (if the user's name is different
on the remote machine).

\changebegin{7.1}

The boot schema can also list an optional ``prefix'', which specifies
the Open MPI installation to be used on the particular host listed in
the boot schema. This is typically used if the user has multiple
Open MPI installations on a host and wants to switch between them
without changing the dot files or \envvar{PATH} environment variables,
or if the user has Open MPI installed under different paths on
different hosts. If the prefix is not specified for a host in the
boot schema file, then the Open MPI installation which is available in
the \envvar{PATH} will be used on that host, or if the \cmdarg{-prefix
  $<$/lam/install/path$>$} option is specified for \cmd{lamboot}, the
$<$/lam/install/path$>$ installation will be used. The prefix option
in the boot schema file, however, overrides any prefix option specified
on the \cmd{lamboot} command line for that host.

For example:

\lstset{style=lam-shell}
\begin{lstlisting}
# rsh boot schema
inky.cluster.example.com cpu=2
pinky.cluster.example.com cpu=4 prefix=/home/joe/lam7.1/install/
blinky.cluster.example.com cpu=4
clyde.cluster.example.com user=jsmith
\end{lstlisting}

\changeend{7.1}

The \cmd{rsh} / \cmd{ssh} boot module will usually run when no other
boot module has been selected. It can, however, be manually selected,
even when another module would typically [automatically] be selected,
by specifying the \issiparam{boot} MCA parameter with the value of
\issivalue{boot}{rsh}. For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot rsh hostfile
\end{lstlisting}
% stupid emacs mode: $

%%%%%

\subsubsection{Tunable Parameters}

\changebegin{7.1}

Table~\ref{tbl:mca-orte-pls-rsh-mca-params} lists the MCA parameters that
are available to the \boot{rsh} module.

\changeend{7.1}

\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-rsh\_\-agent}{From configure}{Remote shell
  agent to use.}
%
\ssiparamentry{boot\_\-rsh\_\-ignore\_\-stderr}{0}{If nonzero,
  ignore output from \file{stderr} when booting; don't treat it as
  an error.}
%
\ssiparamentry{boot\_\-rsh\_\-priority}{10}{Default priority level.}
%
\ssiparamentry{boot\_\-rsh\_\-no\_\-n}{0}{If nonzero, don't use
  ``\cmd{-n}'' as an argument to the boot agent.}
%
\ssiparamentry{boot\_\-rsh\_\-no\_\-profile}{0}{If nonzero, don't
  attempt to run ``\file{.profile}'' for Bourne-type shells.}
%
\ssiparamentry{boot\_\-rsh\_\-username}{None}{Username to use if
  different than login name.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{rsh} boot module.}
\label{tbl:mca-orte-pls-rsh-mca-params}
\end{table}
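
As a usage sketch (assuming, as with other examples in this chapter,
that multiple \cmd{-ssi} options may be given on the command line),
the \boot{rsh} module can be forced and told to use \cmd{ssh} as the
remote agent like this:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot rsh -ssi boot_rsh_agent ssh hostfile
\end{lstlisting}
% stupid emacs mode: $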

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \boot{slurm} Module}
\index{batch queue systems!SLURM boot MCA module}
\index{slurm boot MCA module@\boot{slurm} boot MCA module}
\index{boot MCA modules!slurm@\boot{slurm}}

\changebegin{7.1}

As its name implies, the Simple Linux Utility for Resource Management
(SLURM)\footnote{http://www.llnl.gov/linux/slurm/} package is commonly
used for managing Linux clusters, typically in high-performance
computing environments. SLURM contains a native system for launching
applications across the nodes that it manages. When using SLURM,
\cmd{rsh}/\cmd{ssh} is not necessary to launch jobs on remote nodes.
Instead, the \boot{slurm} boot module will automatically use SLURM's
native job-launching interface to start Open MPI daemons.

The advantages of using SLURM's native interface are:

\begin{itemize}
\item SLURM can generate proper accounting information for all nodes in
  a parallel job.

\item SLURM can kill entire jobs properly when the job ends.

\item \icmd{lamboot} executes significantly faster when using SLURM as
  compared to when it uses \cmd{rsh} / \cmd{ssh}.
\end{itemize}

%%%%%

\subsubsection{Usage}

SLURM allows running jobs in multiple ways. The \boot{slurm} boot
module is only supported in some of them:

\begin{itemize}
\item ``Batch'' mode: where a script is submitted via the \icmd{srun}
  command and is executed on the first node from the set that SLURM
  allocated for the job. The script runs \icmd{lamboot},
  \icmd{mpirun}, etc., as is normal for an Open MPI job (a sketch of
  such a script is shown after this list).

  This method is supported, and is perhaps the most common way to run
  Open MPI automated jobs in SLURM environments.

\item ``Allocate'' mode: where the ``\cmdarg{-A}'' option is given to
  \icmd{srun}, meaning that the shell where \icmd{lamboot} runs is
  likely to {\em not} be one of the nodes that SLURM has allocated for
  the job. In this case, Open MPI daemons will be launched on all nodes
  that were allocated by SLURM as well as the origin (i.e., the node
  where \cmd{lamboot} was run). The origin will be marked as
  ``no-schedule,'' meaning that applications launched by \cmd{mpirun}
  and \cmd{lamexec} will not be run there unless specifically
  requested (see Section~\ref{sec:commands-lamboot},
  page~\pageref{sec:commands-lamboot}, for more detail about this
  attribute and boot schemas in general).

  This method is supported, and is perhaps the most common way to run
  Open MPI interactive jobs in SLURM environments.

\item ``\icmd{srun}'' mode: where a script is submitted via the
  \icmd{srun} command and is executed on {\em all} nodes that SLURM
  allocated for the job. In this case, the commands in the script
  (e.g., \icmd{lamboot}, \icmd{mpirun}, etc.) will be run on {\em all}
  nodes simultaneously, which is most likely not what you want.

  This mode is not supported.
\end{itemize}
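
The following is a minimal sketch of the kind of batch script referred
to above; the script name and application name are hypothetical, and
it assumes that \icmd{lamhalt} is used to shut down the run-time
environment when the job is finished:

\lstset{style=lam-shell}
\begin{lstlisting}
#!/bin/sh
# myjob.sh -- submitted to SLURM as a batch job

# Start the Open MPI run-time environment on the allocated nodes
lamboot

# Run one process per scheduled CPU
mpirun C my_mpi_program

# Shut down the run-time environment
lamhalt
\end{lstlisting}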

When running in any of the supported SLURM modes, Open MPI will
automatically detect that it should use the \boot{slurm} boot module
-- no extra command line parameters or environment variables should be
necessary.
%
Specifically, when running in a SLURM job, the \boot{slurm} module
will report that it is available, and artificially inflate its
priority to a relatively high value in order to influence the boot
module selection process.
%
However, the \boot{slurm} boot module can be forced by specifying the
\issiparam{boot} MCA parameter with the value of
\issivalue{boot}{slurm}.
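
For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot slurm
\end{lstlisting}
% stupid emacs mode: $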

Unlike the \cmd{rsh}/\cmd{ssh} boot module, you do not need to specify
a hostfile for the \boot{slurm} boot module. Instead, SLURM itself
provides a list of nodes (and associated CPU counts) to Open MPI. Using
\icmd{lamboot} is therefore as simple as:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot
\end{lstlisting}
% stupid emacs mode: $

\changebegin{7.1}

Note that in environments with multiple TCP networks, SLURM may be
configured to use a network that is specifically designated for
commodity traffic -- another network may exist that is specifically
allocated for high-speed MPI traffic. By default, Open MPI will use the
same hostnames that SLURM provides for all of its traffic. This means
that Open MPI will send all of its MPI traffic across the same network that
SLURM uses.

However, Open MPI has the ability to boot using one set of hostnames /
addresses and then use a second set of hostnames / addresses for MPI
traffic. As such, Open MPI can redirect its TCP MPI traffic across a
secondary network. It is possible that your system administrator has
already configured Open MPI to operate in this manner.

If a secondary TCP network is intended to be used for MPI traffic, see
the section entitled ``Separating Open MPI and MPI TCP Traffic'' in the
Open MPI Installation Guide. Note that this functionality has no
effect on non-TCP \kind{rpi} modules (such as Myrinet, Infiniband,
etc.).

\changeend{7.1}

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-orte-pls-slurm-mca-params} lists the MCA parameters
that are available to the \boot{slurm} module.

\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-slurm\_\-priority}{50}{Default priority level.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{slurm} boot module.}
\label{tbl:mca-orte-pls-slurm-mca-params}
\end{table}

%%%%%

\subsubsection{Special Notes}

Since the \boot{slurm} boot module is designed to work in SLURM jobs,
it will fail if the \boot{slurm} boot module is manually specified and
Open MPI is not currently running in a SLURM job.

The \boot{slurm} module does not start a shell on the remote node.
Instead, the entire environment of \cmd{lamboot} is pushed to the
remote nodes before starting the Open MPI run-time environment.

\changeend{7.1}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \boot{tm} Module (OpenPBS / PBS Pro / Torque)}
\index{batch queue systems!OpenPBS / PBS Pro / Torque (TM) boot MCA module}
\index{tm boot MCA module@\boot{tm} boot MCA module}
\index{boot MCA modules!tm (PBS / Torque)@\boot{tm} (PBS / Torque)}

Both OpenPBS and PBS Pro (both products of Altair Grid Technologies,
LLC) contain support for the Task Management (TM) interface. Torque,
the open source fork of the OpenPBS product, also contains the TM
interface. When using TM, \cmd{rsh}/\cmd{ssh} is not necessary to
launch jobs on remote nodes.

The advantages of using the TM interface are:

\begin{itemize}
\item PBS/Torque can generate proper accounting information for all
  nodes in a parallel job.

\item PBS/Torque can kill entire jobs properly when the job ends.

\item \icmd{lamboot} executes significantly faster when using TM as
  compared to when it uses \cmd{rsh} / \cmd{ssh}.
\end{itemize}

%%%%%

\subsubsection{Usage}

When running in a PBS/Torque batch job, Open MPI will automatically detect
that it should use the \boot{tm} boot module -- no extra command line
parameters or environment variables should be necessary.
%
Specifically, when running in a PBS/Torque job, the \boot{tm} module
will report that it is available, and artificially inflate its
priority to a relatively high value in order to influence the boot
module selection process.
%
However, the \boot{tm} boot module can be forced by specifying the
\issiparam{boot} MCA parameter with the value of \issivalue{boot}{tm}.
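
For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot tm
\end{lstlisting}
% stupid emacs mode: $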

Unlike the \cmd{rsh}/\cmd{ssh} boot module, you do not need to specify
a hostfile for the \boot{tm} boot module. Instead, PBS/Torque itself
provides a list of nodes (and associated CPU counts) to Open MPI. Using
\icmd{lamboot} is therefore as simple as:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot
\end{lstlisting}
% stupid emacs mode: $

The \boot{tm} boot module works in both interactive and
non-interactive batch jobs.

\changebegin{7.1}

Note that in environments with multiple TCP networks, PBS / Torque may
be configured to use a network that is specifically designated for
commodity traffic -- another network may exist that is specifically
allocated for high-speed MPI traffic. By default, Open MPI will use the
same hostnames that the TM interface provides for all of its traffic.
This means that Open MPI will send all of its MPI traffic across the same
network that PBS / Torque uses.

However, Open MPI has the ability to boot using one set of hostnames /
addresses and then use a second set of hostnames / addresses for MPI
traffic. As such, Open MPI can redirect its TCP MPI traffic across a
secondary network. It is possible that your system administrator has
already configured Open MPI to operate in this manner.

If a secondary TCP network is intended to be used for MPI traffic, see
the section entitled ``Separating Open MPI and MPI TCP Traffic'' in the
Open MPI Installation Guide. Note that this has no effect on non-TCP
\kind{rpi} modules (such as Myrinet, Infiniband, etc.).

\changeend{7.1}

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-orte-pls-tm-mca-params} lists the MCA parameters
that are available to the \boot{tm} module.

\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-tm\_\-priority}{50}{Default priority level.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{tm} boot module.}
\label{tbl:mca-orte-pls-tm-mca-params}
\end{table}

%%%%%

\subsubsection{Special Notes}

Since the \boot{tm} boot module is designed to work in PBS/Torque
jobs, it will fail if the \boot{tm} boot module is manually specified
and Open MPI is not currently running in a PBS/Torque job.

The \boot{tm} module does not start a shell on the remote node.
Instead, the entire environment of \cmd{lamboot} is pushed to the
remote nodes before starting the Open MPI run-time environment.

Also note that the Altair-provided client RPMs for PBS Pro do not
include the \icmd{pbs\_\-demux} command, which is necessary for proper
execution of TM jobs. The solution is to copy the executable from the
server RPMs to the client nodes.

Finally, TM does not provide a mechanism for path searching on the
remote nodes, so the \cmd{lamd} executable is required to reside in
the same location on each node to be booted.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Close of index

\index{ORTE MCA components|)}