
% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University.
% All rights reserved.
% Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
% All rights reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
% University of Stuttgart. All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
% All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%
\chapter{Available ORTE Components}
\label{sec:mca-orte}
There is currently only one type of ORTE component that is visible to
users: \kind{boot}, which is used to start the Open MPI run-time
environment, most often through the \icmd{lamboot} command. The
\cmd{lamboot} command itself is discussed in
Section~\ref{sec:commands-lamboot}
(page~\pageref{sec:commands-lamboot}); the discussion below focuses on
the boot modules that make up the ``back end'' implementation of
\cmd{lamboot}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Using the Open Run-Time Environment}
\label{sec:mca-orte-pls}
\index{ORTE MCA components|(}
\index{MCA boot components|see {boot MCA components}}
{\Huge JMS needs massive overhaul}
Open MPI provides a number of modules for starting the \cmd{lamd}
control daemons. In most cases, the \cmd{lamd}s are started using the
\icmd{lamboot} command. In previous versions of Open MPI,
\icmd{lamboot} could only use \icmd{rsh} or \icmd{ssh} for starting
the Open MPI run-time environment on remote nodes. In Open MPI
\ompiversion, it is possible to use a variety of mechanisms for this
process startup. The following mechanisms are available in Open MPI
\ompiversion:
\begin{itemize}
\item BProc
\item Globus (beta-level support)
\item \cmd{rsh} / \cmd{ssh}
\item OpenPBS / PBS Pro / Torque (using the Task Management interface)
\changebegin{7.1}
\item SLURM (using its native interface)
\changeend{7.1}
\end{itemize}
These mechanisms are discussed in detail below. Note that the
sections below each assume that support for these modules has been
compiled into Open MPI. The \icmd{laminfo} command can be used to
determine exactly which modules are supported in your installation
(see Section~\ref{sec:commands-laminfo},
page~\pageref{sec:commands-laminfo}).
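
For example, a quick way to see which \kind{boot} modules are
available is to run \icmd{laminfo} and look for the \kind{boot} lines
in its output (a sketch; the exact output format may differ between
versions):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ laminfo | grep boot
\end{lstlisting}
% stupid emacs mode: $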
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Boot Schema Files (a.k.a., ``Hostfiles'' or
``Machinefiles'')}
\label{sec:mca-orte-pls-schema}
\index{boot schema}
\index{hostfile|see {boot schema}}
\index{machinefile|see {boot schema}}
\cmdindex{lamboot}{boot schema file}
Before discussing any of the specific boot MCA modules, this section
discusses the boot schema file, commonly referred to as a ``hostfile''
or a ``machinefile''. Most (but not all) boot MCA modules require a
boot schema, and the text below makes frequent mention of them.
Hence, it is worth discussing them before getting into the details of
each boot MCA.
A boot schema is a text file that, in its simplest form, lists every
host on which the Open MPI run-time environment will be invoked. For
example:
\lstset{style=lam-shell}
\begin{lstlisting}
# This is my boot schema
inky.cluster.example.com
pinky.cluster.example.com
blinky.cluster.example.com
clyde.cluster.example.com
\end{lstlisting}
Lines beginning with ``{\tt \#}'' are treated as comments and are
ignored. Each non-blank, non-comment line must, at a minimum, list a
host. Specifically, the first token on each line must specify a host
(although exactly how that host is specified may differ between boot
modules).
However, each line can also specify arbitrary ``key=value'' pairs. A
common global key is ``{\tt cpu}''. This key takes an integer value
and indicates to Open MPI how many CPUs are available for Open MPI to use. If
the key is not present, the value of 1 is assumed. This number does
{\em not} need to reflect the physical number of CPUs -- it can be
smaller than, equal to, or greater than the number of physical CPUs in
the machine. It is solely used as a shorthand notation for
\icmd{mpirun}'s ``C'' notation, meaning ``launch one process per CPU
as specified in the boot schema file.'' For example, in the following
boot schema:
\lstset{style=lam-shell}
\begin{lstlisting}
inky.cluster.example.com cpu=2
pinky.cluster.example.com cpu=4
blinky.cluster.example.com cpu=4
# clyde doesn't mention a cpu count, and is therefore implicitly 1
clyde.cluster.example.com
\end{lstlisting}
\noindent issuing the command ``{\tt mpirun C foo}'' would actually
launch 11 copies of \cmd{foo}: 2 on \host{inky}, 4 on \host{pinky}, 4
on \host{blinky}, and 1 on \host{clyde}.
Note that listing a host more than once has the same effect as
incrementing the CPU count. The following boot schema has the same
effect as the previous example (i.e., CPU counts of 2, 4, 4, and 1,
respectively):
\lstset{style=lam-shell}
\begin{lstlisting}
# inky has a CPU count of 2
inky.cluster.example.com
inky.cluster.example.com
# pinky has a CPU count of 4
pinky.cluster.example.com
pinky.cluster.example.com
pinky.cluster.example.com
pinky.cluster.example.com
# blinky has a CPU count of 4
blinky.cluster.example.com
blinky.cluster.example.com
blinky.cluster.example.com
blinky.cluster.example.com
# clyde only has 1 CPU
clyde.cluster.example.com
\end{lstlisting}
Other keys are defined on a per-boot-MCA-module basis and are
described below.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Minimum Requirements}
\label{sec:mca-orte-pls-min-reqs}
In order to successfully launch a process on a remote node, several
requirements must be met. Although each of the boot modules has
different specific requirements, all of them share the following
conditions for successful operation:
\begin{enumerate}
\item Each target host must be reachable and operational.
\item The user must be able to execute arbitrary processes on the
target.
\item The Open MPI executables must be locatable on that machine. This
typically involves using the shell's search path, the
\ienvvar{Open MPIHOME} environment variable, or a boot-module-specific
mechanism.
\item The user must be able to write to the Open MPI session directory
(typically somewhere under \file{/tmp}; see
Section~\ref{sec:misc-session-directory},
page~\pageref{sec:misc-session-directory}).
\item All hosts must be able to resolve the fully-qualified domain
name (FQDN) of all the machines being booted (including itself).
\item Unless there is only one host being booted, any host
  resolving to the IP address 127.0.0.1 cannot be included in the list
  of hosts (a common pitfall is illustrated below).
\end{enumerate}
If any of these conditions is not met, \cmd{lamboot} will fail.
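
A common way to violate the last two conditions is an
\file{/etc/hosts} entry that maps a node's own hostname to the
loopback address. A sketch of such a problematic entry (the hostname
is taken from the earlier examples):

\lstset{style=lam-shell}
\begin{lstlisting}
# Problematic /etc/hosts entry: other hosts cannot reach this address,
# and a host resolving to 127.0.0.1 cannot be part of a multi-host boot
127.0.0.1   localhost   inky.cluster.example.com   inky
\end{lstlisting}

On such a node, the FQDN should instead resolve to an address that is
reachable from the other hosts being booted.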
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Selecting a \kind{boot} Module}
Only one \kind{boot} module will be selected; it will be used for the
life of the Open MPI universe. As such, module priority values are the
only factor used to determine which available module should be
selected.
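
For example, a user who wants the \boot{rsh} module to be selected
even when a higher-priority module is available could raise its
priority at run time. A sketch, using the
\issiparam{boot\_\-rsh\_\-priority} parameter listed in
Table~\ref{tbl:mca-orte-pls-rsh-mca-params} (the value shown is
arbitrary -- it merely needs to exceed the priorities of the other
available modules):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot_rsh_priority 75 hostfile
\end{lstlisting}
% stupid emacs mode: $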
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\kind{boot} MCA Parameters}
On many kinds of networks, Open MPI can know exactly which nodes should be
making connections while booting the Open MPI run-time environment, and
promiscuous connections (i.e., allowing any node to connect) are
discouraged. However, this is not possible in some complex network
configurations and promiscuous connections {\em must} be enabled.
By default, Open MPI's base \kind{boot} MCA startup protocols disable
promiscuous connections. However, this behavior can be overridden
when Open MPI is configured and at run-time. If the MCA parameter
\issiparam{boot\_\-base\_\-promisc} is set to an empty value, or set to
the integer value 1, promiscuous connections will be accepted when
the Open MPI RTE is booted.
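
For example, a sketch of enabling promiscuous connections for a
single boot from the command line, assuming the same \cmdarg{-ssi}
parameter-passing syntax used for the other \kind{boot} parameters in
this chapter:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot_base_promisc 1 hostfile
\end{lstlisting}
% stupid emacs mode: $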
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \boot{bproc} Module}
\index{bproc boot MCA module@\boot{bproc} boot MCA module}
\index{boot MCA modules!bproc@\boot{bproc}}
The Beowulf Distributed Process Space (BProc)
project\footnote{\url{http://bproc.sourceforge.net/}} is a set of kernel
modifications, utilities, and libraries which allow a user to start
processes on other machines in a Beowulf-style cluster. Remote
processes started with this mechanism appear in the process table of
the front-end machine in a cluster.
Open MPI functionality has been tested with BProc version 3.2.5. Prior
versions had a bug that affected at least some Open MPI functionality.
It is strongly recommended to upgrade to at least version 3.2.5 before
attempting to use the Open MPI native BProc capabilities.
%%%%%
\subsubsection{Minimum Requirements}
Several of the minimum requirements listed in
Section~\ref{sec:mca-orte-pls-min-reqs} will already be met in a BProc
environment because BProc will copy \cmd{lamboot}'s entire environment
(including the \envvar{PATH}) to the remote node. Hence, if
\cmd{lamboot} is in the user's path on the local node, it will also
[automatically] be in the user's path on the remote node.
However, one of the minimum requirements conditions (``The user must
be able to execute arbitrary processes on the target'') deserves a
BProc-specific clarification. BProc has its own internal permission
system for determining if users are allowed to execute on specific
nodes. The system is similar to the user/group/other mechanism
typically used in many Unix filesystems. Hence, in order for a user
to successfully \cmd{lamboot} on a BProc cluster, he/she must have
BProc execute permissions on each of the target nodes. Consult the
BProc documentation for more details.
%%%%%
\subsubsection{Usage}
In most situations, the \cmd{lamboot} command (and related commands)
should automatically ``know'' to use the \boot{bproc} boot MCA module
when running on the BProc head node; no additional command line
parameters or environment variables should be required.
%
Specifically, when running in a BProc environment, the \boot{bproc}
module will report that it is available, and artificially inflate its
priority relatively high in order to influence the boot module
selection process.
%
However, the BProc boot module can be forced by specifying the
\issiparam{boot} MCA parameter with the value of
\issivalue{boot}{bproc}.
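
For example (normally unnecessary, since the \boot{bproc} module is
selected automatically on the head node):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot bproc hostfile
\end{lstlisting}
% stupid emacs mode: $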
Running \cmd{lamboot} on a BProc cluster is just like running
\cmd{lamboot} in a ``normal'' cluster. Specifically, you provide a
boot schema file (i.e., a list of nodes to boot on) and run
\cmd{lamboot} with it. For example:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot hostfile
\end{lstlisting}
% stupid emacs mode: $
Note that when using the \boot{bproc} module, \cmd{lamboot} will only
function properly from the head node. If you launch \cmd{lamboot}
from a client node, it will likely either fail outright, or fall back
to a different boot module (e.g., \cmd{rsh}/\cmd{ssh}).
It is suggested that the \file{hostfile} file contain hostnames in the
style that BProc prefers -- integer numbers. For example,
\file{hostfile} may contain the following:
\lstset{style=lam-shell}
\begin{lstlisting}
-1
0
1
2
3
\end{lstlisting}
\noindent which boots on the BProc front end node (-1) and four slave
nodes (0, 1, 2, 3). Note that using IP hostnames will also work, but
using integer numbers is recommended.
%%%%%
\subsubsection{Tunable Parameters}
Table~\ref{tbl:mca-orte-pls-bproc-mca-params} lists the MCA parameters
that are available to the \boot{bproc} module.
\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-bproc\_\-priority}{50}{Default priority level.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{bproc} boot module.}
\label{tbl:mca-orte-pls-bproc-mca-params}
\end{table}
%%%%%
\subsubsection{Special Notes}
After booting, Open MPI will, by default, not schedule MPI jobs to run on
the BProc front end. Specifically, Open MPI implicitly sets the
``no-schedule'' attribute on the -1 node in a BProc cluster. See
Section~\ref{sec:commands-lamboot}
(page~\pageref{sec:commands-lamboot}) and
Section~\ref{sec:commands-lamboot-no-schedule}
(page~\pageref{sec:commands-lamboot-no-schedule}) for more detail
about this attribute and boot schemas in general.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \boot{globus} Module}
\index{globus boot MCA module@\boot{globus} boot MCA module}
\index{boot MCA modules!globus@\boot{globus}}
Open MPI \ompiversion\ includes beta support for Globus.
Specifically, only limited types of execution are possible. The Open
MPI Team would appreciate feedback from the Globus community on
expanding Globus support in Open MPI.
%%%%%
\subsubsection{Minimum Requirements}
Open MPI jobs in a Globus environment can only be started on nodes using
the ``fork'' job manager for the Globus gatekeeper. Other job
managers are not yet supported.
%%%%%
\subsubsection{Usage}
Starting the Open MPI run-time environment in a Globus environment makes use
of the Globus Resource Allocation Manager (GRAM) client
\icmd{globus-job-run}.
%
The Globus boot MCA module will never run automatically; it must
always be specifically requested by setting the \issiparam{boot} MCA
parameter to \issivalue{boot}{globus}. Specifically, although the
\boot{globus} module will report itself available if
\icmd{globus-job-run} can be found in the \envvar{PATH}, the default
priority will be quite low, effectively ensuring that it will not be
selected unless it is the only module available (which will only occur
if the \ssiparam{boot} parameter is set to \issivalue{boot}{globus}).
Open MPI needs to be able to find the Globus executables. This can be
accomplished either by adding the appropriate directory to your path,
or by setting the \ienvvar{GLOBUS\_\-LOCATION} environment variable.
Additionally, the \ienvvar{Open MPI\_\-MPI\_\-SESSION\_\-SUFFIX}
environment variable should be set to a unique value. This ensures
that this instance of the Open MPI universe does not conflict with any
other, concurrent Open MPI universes that are running under the same
username on nodes in the Globus environment. Although any value can
be used for this variable, it is probably best to have some kind of
organized format, such as {\tt
<your\_\-username>-<some\_\-long\_\-random\_\-number>}.
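
For example, a sketch of setting up the environment before booting
(the installation path shown is hypothetical; the session suffix
variable described above should be exported analogously, e.g., to a
value such as {\tt jsmith-1234567890}):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ export GLOBUS_LOCATION=/usr/local/globus   # hypothetical path
\end{lstlisting}
% stupid emacs mode: $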
Next, create a boot schema to use with \cmd{lamboot}.
%
Hosts are listed by their Globus contact strings (see the Globus
manual for more information about contact strings). In cases where
the Globus gatekeeper is running as an \cmd{inetd} service on the node,
the contact string will simply be the hostname. If the contact string
contains whitespace, the {\em entire} contact string must be enclosed
in quotes (i.e., not just the values containing whitespace).
%
For example, if your contact string is:
\centerline{\tt host1:port1:/O=xxx/OU=yyy/CN=aaa bbb ccc}
Then you will need to have it listed as:
\centerline{\tt "host1:port1:/O=xxx/OU=yyy/CN=aaa bbb ccc"}
The following will not work:
\centerline{\tt host1:port1:/O=xxx/OU=yyy/CN="aaa bbb ccc"}
Each host in the boot schema must also have a ``{\tt
lam\_\-install\_\-path}'' key indicating the absolute directory
where Open MPI is installed. This value is mandatory because you
cannot rely on the \ienvvar{PATH} environment variable in a Globus
environment: users' ``dot'' files are not executed in Globus jobs,
and therefore the \envvar{PATH} environment variable is not provided.
Other keys can be used as well; {\tt lam\_\-install\_\-path} is the
only mandatory key.
Here is a sample Globus boot schema:
\changebegin{7.0.5}
\lstset{style=lam-shell}
\begin{lstlisting}
# Globus boot schema
"inky.mycluster:12853:/O=MegaCorp/OU=Mine/CN=HPC Group" prefix=/opt/lam cpu=2
"pinky.yourcluster:3245:/O=MegaCorp/OU=Yours/CN=HPC Group" prefix=/opt/lam cpu=4
"blinky.hiscluster:23452:/O=MegaCorp/OU=His/CN=HPC Group" prefix=/opt/lam cpu=4
"clyde.hercluster:82342:/O=MegaCorp/OU=Hers/CN=HPC Group" prefix=/software/lam
\end{lstlisting}
\changeend{7.0.5}
Once you have this boot schema, the \cmd{lamboot} command can be used
to launch it. Note, however, that unlike the other boot MCA modules,
the Globus boot module will never be automatically selected by Open MPI --
it must be selected manually by setting the \issiparam{boot} MCA
parameter to the value \issivalue{boot}{globus}.
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot globus hostfile
\end{lstlisting}
% stupid emacs mode: $
%%%%%
\subsubsection{Tunable Parameters}
Table~\ref{tbl:mca-orte-pls-globus-mca-params} lists the MCA
parameters that are available to the \boot{globus} module.
\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-globus\_\-priority}{3}{Default priority level.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{globus} boot module.}
\label{tbl:mca-orte-pls-globus-mca-params}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \boot{rsh} Module (including \cmd{ssh})}
\index{rsh (ssh) boot MCA module@\boot{rsh} (\cmd{ssh}) boot MCA module}
\index{boot MCA modules!rsh (rsh/ssh)@\boot{rsh} (\cmd{rsh}/\cmd{ssh})}
The \cmd{rsh}/\cmd{ssh} boot MCA module is typically the ``least
common denominator'' boot module. When not in an otherwise
``special'' environment (such as a batch scheduler), the
\cmd{rsh}/\cmd{ssh} boot module is typically used to start the Open MPI
run-time environment.
%%%%%
\subsubsection{Minimum Requirements}
In addition to the minimum requirements listed in
Section~\ref{sec:mca-orte-pls-min-reqs}, the following additional
conditions must also be met for a successful \cmd{lamboot} using the
\cmd{rsh} / \cmd{ssh} boot module:
\begin{enumerate}
\item The user must be able to execute arbitrary commands on each
target host without being prompted for a password.
\item The shell's start-up script must not print anything on standard
  error. The user can take advantage of the fact that \cmd{rsh} /
  \cmd{ssh} will start the shell non-interactively. The start-up
  script can exit early in this case, before executing many commands
  relevant only to interactive sessions and likely to generate output
  (a short sketch follows this list).
\changebegin{7.1}
  Starting with version 7.1, this requirement is relaxed: if the MCA
  parameter \issiparam{boot\_\-rsh\_\-ignore\_\-stderr} is nonzero,
  any output on standard error will {\em not} be treated as an error.
\changeend{7.1}
\end{enumerate}
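
As a sketch of the ``exit early'' guard mentioned in the second
condition above, the following fragment could appear near the top of a
Bourne-type shell start-up file (the exact file and test depend on
your shell and site configuration):

\lstset{style=lam-shell}
\begin{lstlisting}
# Stop here for non-interactive shells (e.g., shells started by
# rsh/ssh), so that nothing is printed on standard output/error.
case "$-" in
  *i*) ;;        # interactive shell: keep going
  *)   return ;; # non-interactive: skip the rest of this file
esac
\end{lstlisting}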
Section~\ref{sec:getting-started} (page~\pageref{sec:getting-started})
provides a short tutorial on using the \cmd{rsh} / \cmd{ssh} boot
module, including tips on setting up ``dot'' files, setting up
password-less remote execution, etc.
%%%%%
\subsubsection{Usage}
Using \cmd{rsh}, \cmd{ssh}, or another remote-execution agent is
probably the most common method for starting the Open MPI run-time
environment. The boot schema typically lists the hostnames,
CPU counts, and an optional username (if the user's name is different
on the remote machine).
\changebegin{7.1}
The boot schema can also list an optional ``prefix'', which specifies
the Open MPI installation to be used on the particular host listed in
the boot schema. This is typically used if the user has multiple
Open MPI installations on a host and wants to switch between them
without changing dot files or the \envvar{PATH} environment variable,
or if the user has Open MPI installed under different paths on
different hosts. If no prefix is specified for a host in the
boot schema file, then the Open MPI installation that is available in
the \envvar{PATH} will be used on that host, or, if the \cmdarg{-prefix
  $<$/lam/install/path$>$} option is specified for \cmd{lamboot}, the
$<$/lam/install/path$>$ installation will be used. The prefix option
in the boot schema file, however, overrides any prefix option specified
on the \cmd{lamboot} command line for that host.
For example:
\lstset{style=lam-shell}
\begin{lstlisting}
# rsh boot schema
inky.cluster.example.com cpu=2
pinky.cluster.example.com cpu=4 prefix=/home/joe/lam7.1/install/
blinky.cluster.example.com cpu=4
clyde.cluster.example.com user=jsmith
\end{lstlisting}
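
A default installation can likewise be given on the \cmd{lamboot}
command line with the \cmdarg{-prefix} option mentioned above (a
sketch; the path is hypothetical, and per-host prefix values in the
boot schema still take precedence):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -prefix /opt/lam-7.1/install/ hostfile
\end{lstlisting}
% stupid emacs mode: $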
\changeend{7.1}
The \cmd{rsh} / \cmd{ssh} boot module will usually run when no other
boot module has been selected. It can, however, be manually selected,
even when another module would typically [automatically] be selected,
by specifying the \issiparam{boot} MCA parameter with the value of
\issivalue{boot}{rsh}. For example:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot -ssi boot rsh hostfile
\end{lstlisting}
% stupid emacs mode: $
%%%%%
\subsubsection{Tunable Parameters}
\changebegin{7.1}
Table~\ref{tbl:mca-orte-pls-rsh-mca-params} lists the MCA parameters that
are available to the \boot{rsh} module.
\changeend{7.1}
\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-rsh\_\-agent}{From configure}{Remote shell
agent to use.}
%
\ssiparamentry{boot\_\-rsh\_\-ignore\_\-stderr}{0}{If nonzero,
  ignore output from \file{stderr} when booting; don't treat it as
  an error.}
%
\ssiparamentry{boot\_\-rsh\_\-priority}{10}{Default priority level.}
%
\ssiparamentry{boot\_\-rsh\_\-no\_\-n}{0}{If nonzero, don't use
``\cmd{-n}'' as an argument to the boot agent}
%
\ssiparamentry{boot\_\-rsh\_\-no\_\-profile}{0}{If nonzero, don't
attempt to run ``\file{.profile}'' for Bourne-type shells.}
%
\ssiparamentry{boot\_\-rsh\_\-username}{None}{Username to use if
different than login name.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{rsh} boot module.}
\label{tbl:mca-orte-pls-rsh-mca-params}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \boot{slurm} Module}
\index{batch queue systems!SLURM boot MCA module}
\index{slurm boot MCA module@\boot{slurm} boot MCA module}
\index{boot MCA modules!slurm@\boot{slurm}}
\changebegin{7.1}
As its name implies, the Simple Linux Utility for Resource Management
(SLURM)\footnote{http://www.llnl.gov/linux/slurm/} package is commonly
used for managing Linux clusters, typically in high-performance
computing environments. SLURM contains a native system for launching
applications across the nodes that it manages. When using SLURM,
\cmd{rsh}/\cmd{ssh} is not necessary to launch jobs on remote nodes.
Instead, the \boot{slurm} boot module will automatically use SLURM's
native job-launching interface to start Open MPI daemons.
The advantages of using SLURM's native interface are:
\begin{itemize}
\item SLURM can generate proper accounting information for all nodes in
a parallel job.
\item SLURM can kill entire jobs properly when the job ends.
\item \icmd{lamboot} executes significantly faster when using SLURM as
compared to when it uses \cmd{rsh} / \cmd{ssh}.
\end{itemize}
%%%%%
\subsubsection{Usage}
SLURM allows running jobs in multiple ways. The \boot{slurm} boot
module is only supported in some of them:
\begin{itemize}
\item ``Batch'' mode: where a script is submitted via the \icmd{srun}
  command and is executed on the first node from the set that SLURM
  allocated for the job. The script runs \icmd{lamboot},
  \icmd{mpirun}, etc., as is normal for an Open MPI job (a sketch of
  such a script appears after this list).
  This method is supported, and is perhaps the most common way to run
  Open MPI automated jobs in SLURM environments.
\item ``Allocate'' mode: where the ``\cmdarg{-A}'' option is given to
  \icmd{srun}, meaning that the shell where \icmd{lamboot} runs is
  likely to {\em not} be one of the nodes that SLURM has allocated for
  the job. In this case, Open MPI daemons will be launched on all nodes
  that were allocated by SLURM as well as the origin (i.e., the node
  where \cmd{lamboot} was run). The origin will be marked as
  ``no-schedule,'' meaning that applications launched by \cmd{mpirun}
  and \cmd{lamexec} will not be run there unless specifically
  requested (see Section~\ref{sec:commands-lamboot},
  page~\pageref{sec:commands-lamboot}, for more detail about this
  attribute and boot schemas in general).
  This method is supported, and is perhaps the most common way to run
  Open MPI interactive jobs in SLURM environments.
\item ``\icmd{srun}'' mode: where a script is submitted via the
\icmd{srun} command and is executed on {\em all} nodes that SLURM
allocated for the job. In this case, the commands in the script
(e.g., \icmd{lamboot}, \icmd{mpirun}, etc.) will be run on {\em all}
nodes simultaneously, which is most likely not what you want.
This mode is not supported.
\end{itemize}
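
As an illustration of the ``batch'' mode described above, a minimal
sketch of a job script (the program name is hypothetical; consult the
SLURM documentation for the exact \icmd{srun} options used to submit
a batch script):

\lstset{style=lam-shell}
\begin{lstlisting}
#! /bin/sh
# Runs on the first allocated node; SLURM supplies the node list,
# so no boot schema / hostfile is needed.
lamboot
mpirun C my_mpi_program
lamhalt
\end{lstlisting}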
When running in any of the supported SLURM modes, Open MPI will
automatically detect that it should use the \boot{slurm} boot module
-- no extra command line parameters or environment variables should be
necessary.
%
Specifically, when running in a SLURM job, the \boot{slurm} module
will report that it is available, and artificially inflate its
priority relatively high in order to influence the boot module
selection process.
%
However, the \boot{slurm} boot module can be forced by specifying the
\issiparam{boot} MCA parameter with the value of
\issivalue{boot}{slurm}.
Unlike the \cmd{rsh}/\cmd{ssh} boot module, you do not need to specify
a hostfile for the \boot{slurm} boot module. Instead, SLURM itself
provides a list of nodes (and associated CPU counts) to Open MPI. Using
\icmd{lamboot} is therefore as simple as:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot
\end{lstlisting}
% stupid emacs mode: $
\changebegin{7.1}
Note that in environments with multiple TCP networks, SLURM may be
configured to use a network that is specifically designated for
commodity traffic -- another network may exist that is specifically
allocated for high-speed MPI traffic. By default, Open MPI will use the
same hostnames that SLURM provides for all of its traffic. This means
that Open MPI will send all of its MPI traffic across the same network that
SLURM uses.
However, Open MPI has the ability to boot using one set of hostnames /
addresses and then use a second set of hostnames / addresses for MPI
traffic. As such, Open MPI can redirect its TCP MPI traffic across a
secondary network. It is possible that your system administrator has
already configured Open MPI to operate in this manner.
If a secondary TCP network is intended to be used for MPI traffic, see
the section entitled ``Separating Open MPI and MPI TCP Traffic'' in the
Open MPI Installation Guide. Note that this functionality has no
effect on non-TCP \kind{rpi} modules (such as Myrinet, Infiniband,
etc.).
\changeend{7.1}
%%%%%
\subsubsection{Tunable Parameters}
Table~\ref{tbl:mca-orte-pls-slurm-mca-params} lists the MCA parameters
that are available to the \boot{slurm} module.
\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-slurm\_\-priority}{50}{Default priority level.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{slurm} boot module.}
\label{tbl:mca-orte-pls-slurm-mca-params}
\end{table}
%%%%%
\subsubsection{Special Notes}
Since the \boot{slurm} boot module is designed to work in SLURM jobs,
it will fail if the \boot{slurm} boot module is manually specified and
Open MPI is not currently running in a SLURM job.
The \boot{slurm} module does not start a shell on the remote node.
Instead, the entire environment of \cmd{lamboot} is pushed to the
remote nodes before starting the Open MPI run-time environment.
\changeend{7.1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \boot{tm} Module (OpenPBS / PBS Pro / Torque)}
\index{batch queue systems!OpenPBS / PBS Pro / Torque (TM) boot MCA module}
\index{tm boot MCA module@\boot{tm} boot MCA module}
\index{boot MCA modules!tm (PBS / Torque)@\boot{tm} (PBS / Torque)}
Both OpenPBS and PBS Pro (both products of Altair Grid Technologies,
LLC) contain support for the Task Management (TM) interface. Torque,
the open source fork of the OpenPBS product, also contains the TM
interface. When using TM, \cmd{rsh}/\cmd{ssh} is not necessary to
launch jobs on remote nodes.
The advantages of using the TM interface are:
\begin{itemize}
\item PBS/Torque can generate proper accounting information for all
nodes in a parallel job.
\item PBS/Torque can kill entire jobs properly when the job ends.
\item \icmd{lamboot} executes significantly faster when using TM as
compared to when it uses \cmd{rsh} / \cmd{ssh}.
\end{itemize}
%%%%%
\subsubsection{Usage}
When running in a PBS/Torque batch job, Open MPI will automatically detect
that it should use the \boot{tm} boot module -- no extra command line
parameters or environment variables should be necessary.
%
Specifically, when running in a PBS/Torque job, the \boot{tm} module
will report that it is available, and artificially inflate its
priority relatively high in order to influence the boot module
selection process.
%
However, the \boot{tm} boot module can be forced by specifying the
\issiparam{boot} MCA parameter with the value of \issivalue{boot}{tm}.
Unlike the \cmd{rsh}/\cmd{ssh} boot module, you do not need to specify
a hostfile for the \boot{tm} boot module. Instead, PBS/Torque itself
provides a list of nodes (and associated CPU counts) to Open MPI. Using
\icmd{lamboot} is therefore as simple as:
\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ lamboot
\end{lstlisting}
% stupid emacs mode: $
The \boot{tm} boot module works in both interactive and
non-interactive batch jobs.
\changebegin{7.1}
Note that in environments with multiple TCP networks, PBS / Torque may
be configured to use a network that is specifically designated for
commodity traffic -- another network may exist that is specifically
allocated for high-speed MPI traffic. By default, Open MPI will use the
same hostnames that the TM interface provides for all of its traffic.
This means that Open MPI will send all of its MPI traffic across the same
network that PBS / Torque uses.
However, Open MPI has the ability to boot using one set of hostnames /
addresses and then use a second set of hostnames / addresses for MPI
traffic. As such, Open MPI can redirect its TCP MPI traffic across a
secondary network. It is possible that your system administrator has
already configured Open MPI to operate in this manner.
If a secondary TCP network is intended to be used for MPI traffic, see
the section entitled ``Separating Open MPI and MPI TCP Traffic'' in the
Open MPI Installation Guide. Note that this has no effect on non-TCP
\kind{rpi} modules (such as Myrinet, Infiniband, etc.).
\changeend{7.1}
%%%%%
\subsubsection{Tunable Parameters}
Table~\ref{tbl:mca-orte-pls-tm-mca-params} lists the MCA parameters
that are available to the \boot{tm} module.
\begin{table}[htbp]
\begin{ssiparamtb}
%
\ssiparamentry{boot\_\-tm\_\-priority}{50}{Default priority level.}
\end{ssiparamtb}
\caption{MCA parameters for the \boot{tm} boot module.}
\label{tbl:mca-orte-pls-tm-mca-params}
\end{table}
%%%%%
\subsubsection{Special Notes}
Since the \boot{tm} boot module is designed to work in PBS/Torque
jobs, it will fail if the \boot{tm} boot module is manually specified
and Open MPI is not currently running in a PBS/Torque job.
The \boot{tm} module does not start a shell on the remote node.
Instead, the entire environment of \cmd{lamboot} is pushed to the
remote nodes before starting the Open MPI run-time environment.
Also note that the Altair-provided client RPMs for PBS Pro do not
include the \icmd{pbs\_\-demux} command, which is necessary for proper
execution of TM jobs. The solution is to copy the executable from the
server RPMs to the client nodes.
Finally, TM does not provide a mechanism for path searching on the
remote nodes, so the \cmd{lamd} executable is required to reside in
the same location on each node to be booted.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Close of index
\index{ORTE MCA components|)}