% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
%                         University Research and Technology
%                         Corporation.  All rights reserved.
% Copyright (c) 2004-2005 The University of Tennessee and The University
%                         of Tennessee Research Foundation.  All rights
%                         reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
%                         University of Stuttgart.  All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
%                         All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%
\chapter{Open MPI Components}
\label{sec:mca-ompi}

{\Huge JMS Needs massive overhaul}

There are multiple types of MPI components:

\begin{enumerate}
\item \mcafw{rpi}: MPI point-to-point communication, also known as the
  Open MPI Request Progression Interface (RPI).

\item \mcafw{coll}: MPI collective communication.

\item \mcafw{cr}: Checkpoint/restart support for MPI programs.
\end{enumerate}

Each of these types, and the modules that are available in the default
Open MPI distribution, is discussed in detail below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{General MPI MCA Parameters}
\label{sec:mca-ompi:mpi-params}

\changebegin{7.1}

The default hostmap file is \ifile{\$sysconf/lam-hostmap} (typically
\file{\$prefix/etc/lam-hostmap.txt}).  This file is only useful in
environments with multiple TCP networks, and is typically populated by
the system administrator (see the Open MPI Installation Guide for more
details on this file).

The MCA parameter \issiparam{mpi\_hostmap} can be used to specify an
alternate hostmap file.  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi mpi_hostmap my_hostmap.txt my_mpi_application
\end{lstlisting}
% stupid emacs mode: $

This tells Open MPI to use the hostmap \file{my\_hostmap.txt} instead of
\file{\$sysconf/lam-hostmap.txt}.  The special filename
``\cmdarg{none}'' can also be used to indicate that no address
remapping should be performed.

\changeend{7.1}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{MPI Module Selection Process}
\label{sec:mca-ompi-component-selection}

The modules used in an MPI process may be related to, or dependent
upon, external factors.  For example, the \rpi{gm} RPI cannot be used
for MPI point-to-point communication unless there is Myrinet hardware
present on the node.  The \crssi{blcr} checkpoint/restart module
cannot be used unless thread support was included.  And so on.  As
such, it is important for users to understand the module selection
algorithm.

\begin{enumerate}
\item Set the thread level to be what was requested, either via
  \mpifunc{MPI\_\-INIT\_\-THREAD} or the environment variable
  \ienvvar{Open MPI\_\-MPI\_\-THREAD\_\-LEVEL}.

\item Query relevant modules and make lists of the resulting available
  modules.  ``Relevant'' means either a specific module (or set of
  modules) if the user specified them through MCA parameters (see the
  example after this list), or all modules if not specified.

\item Eliminate all modules that do not support the current MPI thread
  level.

\item If no \kind{rpi} modules remain, try a lower thread support
  level until all levels have been tried.  If no thread support level
  can provide an \kind{rpi} module, abort.

\item Select the highest priority \kind{rpi} module.  Reset the thread
  level (if necessary) to be at least the lower bound of thread levels
  that the selected \kind{rpi} module supports.

\item Eliminate all \kind{coll} and \kind{cr} modules that cannot
  operate at the current thread level.

\item If no \kind{coll} modules remain, abort.  Final selection of
  \kind{coll} modules is discussed in
  Section~\ref{sec:mca-ompi-coll-select}
  (page~\pageref{sec:mca-ompi-coll-select}).

\item If no \kind{cr} modules remain and checkpoint/restart support
  was specifically requested, abort.  Otherwise, select the highest
  priority \kind{cr} module.
\end{enumerate}
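
For example, the query in the second step can be restricted to a
single \kind{rpi} module by naming it with the \issiparam{rpi} MCA
parameter.  A minimal sketch (the application name is a placeholder):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi tcp my_mpi_application
\end{lstlisting}
% stupid emacs mode: $

\noindent Only the \rpi{tcp} module is then queried and considered
during selection; if it cannot be used at any thread level, the
selection algorithm above will abort rather than fall back to a
different RPI.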

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{MPI Point-to-point Communication (Request Progression
  Interface / RPI)}
\label{sec:mca-ompi-rpi}

Open MPI provides multiple MCA modules for MPI point-to-point
communication.  Also known as the Request Progression Interface (RPI),
these modules are used for all aspects of MPI point-to-point
communication in an MPI application.  Some of the modules require
external hardware and/or software (e.g., the native Myrinet RPI module
requires both Myrinet hardware and the GM message passing library).
The \icmd{laminfo} command can be used to determine which RPI modules
are available in an Open MPI installation.

Although one RPI module will likely be the default, the selection of
which RPI module is used can be changed through the MCA parameter
\issiparam{rpi}.  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun -ssi rpi tcp C my_mpi_program
\end{lstlisting}
% Stupid emacs mode: $

\noindent runs the \file{my\_\-mpi\_\-program} executable on all
available CPUs using the \rpi{tcp} RPI module, while:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun -ssi rpi gm C my_mpi_program
\end{lstlisting}
% Stupid emacs mode: $

\noindent runs the \file{my\_\-mpi\_\-program} executable on all
available CPUs using the \rpi{gm} RPI module.

It should be noted that the choice of RPI usually does not affect the
\kind{boot} MCA module -- hence, the \cmd{lamboot} command's
requirements on hostnames specified in the boot schema are not
dependent upon the RPI.  For example, if the \rpi{gm} RPI is selected,
\cmd{lamboot} may still require TCP/IP hostnames in the boot schema,
not Myrinet hostnames.  Also note that selecting a particular module
does not guarantee that it will be able to be used.  For example,
selecting the \rpi{gm} RPI module will still cause a run-time failure
if there is no Myrinet hardware present.

The available modules are described in the sections below.  Note that
much of this information (particularly the tunable MCA parameters) is
also available in the \file{lamssi\_\-rpi(7)} manual page.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\changebegin{7.0.3}

\subsection{Two Different Shared Memory RPI Modules}
\label{sec:mca-ompi-shmem}

The \rpi{sysv} (Section~\ref{sec:mca-ompi-sysv},
page~\pageref{sec:mca-ompi-sysv}) and the \rpi{usysv}
(Section~\ref{sec:mca-ompi-usysv}, page~\pageref{sec:mca-ompi-usysv})
modules differ only in the mechanism used to synchronize the transfer
of messages via shared memory.
%
The \rpi{sysv} module uses System V semaphores while the \rpi{usysv}
module uses spin locks with back-off.
%
Both modules use a small number of System V semaphores for
synchronizing both the deallocation of shared structures and access to
the shared pool.

The blocking nature of the \rpi{sysv} module should generally provide
better performance than \rpi{usysv} on oversubscribed nodes (i.e.,
when the number of processes is greater than the number of available
processors).  System V semaphores will effectively force processes to
yield to other processes, allowing at least some degree of
fair/regular scheduling.  In non-oversubscribed environments (i.e.,
where the number of processes is less than or equal to the number of
available processors), the \rpi{usysv} RPI should generally provide
better performance than the \rpi{sysv} RPI because spin locks keep
processors busy-waiting.  This hopefully keeps the operating system
from suspending or swapping out the processes, allowing them to react
immediately when the lock becomes available.

\changeend{7.0.3}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{crtcp} Module (Checkpoint-able TCP
  Communication)}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt crtcp} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 25 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{crtcp} RPI module is almost identical to the \rpi{tcp}
module, described in Section~\ref{sec:mca-ompi-tcp}.  TCP sockets are
used for communication between MPI processes.

%%%%%

\subsubsection{Overview}

The following are the main differences between the \rpi{tcp} and
\rpi{crtcp} RPI modules:

\begin{itemize}
\item The \rpi{crtcp} module can be checkpointed and restarted.  It is
  currently the {\em only} RPI module in Open MPI that supports
  checkpoint/restart functionality.

\item The \rpi{crtcp} module does not have the ``fast'' message
  passing optimization that is in the \rpi{tcp} module.  As a result,
  there is a small performance loss in certain types of MPI
  applications.
\end{itemize}

All other aspects of the \rpi{crtcp} module are the same as the
\rpi{tcp} module.

%%%%%

\subsubsection{Checkpoint/Restart Functionality}

The \rpi{crtcp} module is designed to work in conjunction with a
\kind{cr} module to provide checkpoint/restart functionality.  See
Section~\ref{sec:mca-ompi-cr} for a description of how Open MPI's overall
checkpoint/restart functionality is used.

The \rpi{crtcp} module's checkpoint/restart functionality is invoked
when the \kind{cr} module indicates that it is time to perform a
checkpoint.  The \rpi{crtcp} module then quiesces all ``in-flight''
MPI messages and then allows the checkpoint to be performed.  Upon
restart, TCP connections are re-formed, and message passing processing
continues.  No additional buffers or ``rollback'' mechanisms are
required, nor is any special coding required in the user's MPI
application.

%%%%%

\subsubsection{Tunable Parameters}

\changebegin{7.1}

The \rpi{crtcp} module has the same tunable parameters as the
\rpi{tcp} module (maximum size of a short message and amount of OS
socket buffering), although they have different names:
\issiparam{rpi\_\-crtcp\_\-short},
\issiparam{rpi\_\-crtcp\_\-sockbuf}.

\changeend{7.1}
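
As a hedged illustration (the value shown is arbitrary, not a
recommendation), both parameters can be set on the \cmd{mpirun}
command line in the same way as their \rpi{tcp} counterparts:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi rpi_crtcp_short 131072 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent This selects the \rpi{crtcp} module and raises the maximum
``short'' message size to 128~KB.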

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-crtcp\_\-priority}{25}{Default priority
      level.}
    %
    \ssiparamentry{rpi\_\-crtcp\_\-short}{65535}{Maximum length
      (in bytes) of a ``short'' message.}
    %
    \ssiparamentry{rpi\_\-crtcp\_\-sockbuf}{-1}{Socket buffering in
      the OS kernel (-1 means use the short message size).}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{crtcp} RPI module.}
  \label{tbl:mca-ompi-crtcp-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{gm} Module (Myrinet)}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt gm} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & yes (*) \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{gm} RPI module is for native message passing over Myrinet
networking hardware.  The \rpi{gm} RPI provides low latency, high
bandwidth message passing performance.

\changebegin{7.1}

Be sure to also read the release notes entitled ``Operating System
Bypass Communication: Myrinet and Infiniband'' in the Open MPI
Installation Guide for notes about memory management with Myrinet.
Specifically, it deals with Open MPI's automatic overrides of the
\func{malloc()}, \func{calloc()}, and \func{free()} functions.

\changeend{7.1}

%%%%%

\subsubsection{Overview}

In general, using the \rpi{gm} RPI module is just like using any other
RPI module -- MPI functions will simply use native GM message passing
for their back-end message transport.

Although it is not required, users are strongly encouraged to use the
\mpifunc{MPI\_\-ALLOC\_\-MEM} and \mpifunc{MPI\_\-FREE\_\-MEM}
functions to allocate and free memory (instead of, for example,
\func{malloc()} and \func{free()}).

\changebegin{7.1}

The \rpi{gm} RPI module is marked as ``yes'' for checkpoint / restart
support, but this is only true when the module was configured and
compiled with the \confflag{with-rpi-gm-get} configure flag.  This
enables Open MPI to use the GM 2.x function \func{gm\_\-get()}.  Note
that enabling this feature (with the \ssiparam{rpi\_\-gm\_\-cr} MCA
parameter) slightly decreases the performance of the \rpi{gm} module
(which is why it is disabled by default) because of additional
bookkeeping that is necessary.  The performance difference is actually
barely measurable -- it is well below one microsecond.  It is not the
default behavior simply on principle.

At the time of this writing, there still appeared to be problems with
\func{gm\_\-get()}, so this behavior is disabled by default.  It is
not clear whether the problems with \func{gm\_\-get()} are due to a
problem with Myricom's GM library or a problem in Open MPI itself; the
\confflag{with-rpi-gm-get} option is provided as a ``hedging our
bets'' solution; if the problem does turn out to be with the GM
library, Open MPI users can enable checkpoint support (and slightly
lower long message latency) by using this switch.

\changeend{7.1}

%%%%%
\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-ompi-gm-mca-params} shows the MCA parameters that
may be changed at run-time; the text below explains each one in
detail.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-gm\_\-cr}{0}{Whether to enable checkpoint /
      restart support or not.}
    %
    \ssiparamentry{rpi\_\-gm\_\-fast}{0}{Whether to enable the
      ``fast'' algorithm for sending short messages.  This is an
      unreliable transport and is not recommended for MPI applications
      that do not continually invoke the MPI progression engine.}
    %
    \ssiparamentry{rpi\_\-gm\_\-maxport}{32}{Maximum GM port
      number to check during \mpifunc{MPI\_\-INIT} when looking for an
      available port.}
    %
    \ssiparamentry{rpi\_\-gm\_\-nopin}{0}{Whether to let Open MPI
      register (``pin'') arbitrary buffers or not.}
    %
    \ssiparamentry{rpi\_\-gm\_\-port}{-1}{Specific GM port to use
      (-1 indicates none).}
    %
    \ssiparamentry{rpi\_\-gm\_\-priority}{50}{Default priority level.}
    %
    \ssiparamentry{rpi\_\-gm\_\-tinymsglen}{1024}{Maximum length
      (in bytes) of a ``tiny'' message.}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{gm} RPI module.}
  \label{tbl:mca-ompi-gm-mca-params}
\end{table}

%%%%%
\subsubsection{Port Allocation}

It is usually unnecessary to specify which Myrinet/GM port to use.
Open MPI will automatically attempt to acquire ports greater than 1.

By default, Open MPI will check for any available port between 1 and 8.  If
your Myrinet hardware has more than 8 possible ports, you can change
the upper port number that Open MPI will check with the
\issiparam{rpi\_\-gm\_\-maxport} MCA parameter.

However, if you wish Open MPI to use a specific GM port number (and not
check all the ports from $[1, {\rm maxport}]$), you can tell Open MPI which
port to use with the \issiparam{rpi\_\-gm\_\-port} MCA parameter.
%
Specifying which port to use has precedence over the port range check
-- if a specific port is indicated, Open MPI will try to use that and not
check a range of ports.  Specifying to use port ``-1'' (or not
specifying to use a specific port) will tell Open MPI to check the range of
ports to find any available port.
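
As a hedged illustration (the port numbers are arbitrary and only
valid if the corresponding ports exist and are free on every node),
both parameters are set like any other MCA parameter on the
\cmd{mpirun} command line:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi gm -ssi rpi_gm_port 2 my_mpi_program
shell$ mpirun C -ssi rpi gm -ssi rpi_gm_maxport 16 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent The first command forces GM port 2 on every node; the second
leaves port selection automatic but widens the search range.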

Note that in all cases, if Open MPI cannot acquire a valid port for every
MPI process in the job, the entire job will be aborted.

Be wary of forcing a specific port to be used, particularly in
conjunction with the MPI dynamic process calls (e.g.,
\mpifunc{MPI\_\-COMM\_\-SPAWN}).  For example, when attempting to
spawn a child process on a node that already has an MPI process in the
same job, Open MPI will try to use the same specific port, which will
result in failure because the MPI process already on that node will
have already claimed that port.

%%%%%
\subsubsection{Adjusting Message Lengths}

The \rpi{gm} RPI uses two different protocols for passing data between
MPI processes: tiny and long.  Selection of which protocol to use is
based solely on the length of the message.  Tiny messages are sent
(along with tag and communicator information) in one transfer to the
receiver.  Long messages use a rendezvous protocol -- the envelope is
sent to the destination, the receiver responds with an ACK (when it is
ready), and then the sender sends another envelope followed by the
data of the message.

The message lengths at which the different protocols are used can be
changed with the MCA parameter \issiparam{rpi\_\-gm\_\-tinymsglen},
which represents the maximum length of tiny messages.  Open MPI defaults to
1,024 bytes for the maximum length of tiny messages.
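
For example, the following sketch (the value is arbitrary, not a
recommendation) raises the tiny/long crossover to 4,096 bytes so that
slightly larger messages are still sent in a single transfer:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi gm -ssi rpi_gm_tinymsglen 4096 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $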

It may be desirable to adjust these values for different kinds of
applications and message passing patterns.  The Open MPI Team would
appreciate feedback on the performance of different values for real
world applications.

%%%%%

\subsubsection{Pinning Memory}
\label{sec:mca-ompi-gm-ptmalloc}

The Myrinet native communication library (gm) can only communicate
through ``registered'' (sometimes called ``pinned'') memory.  In most
operating systems, Open MPI handles this automatically by pinning
user-provided buffers when required.  This allows for good message
passing performance, especially when re-using buffers to send/receive
multiple messages.

However, the gm library does not have the ability to pin arbitrary
memory on Solaris systems -- auxiliary buffers must be used.  Although
Open MPI controls all pinned memory, this has a detrimental effect on
the performance of large messages: Open MPI must copy each message from
the application-provided buffer to an auxiliary buffer before it can
be sent (and vice versa for receiving messages).  As such, users are
strongly encouraged to use the \mpifunc{MPI\_\-ALLOC\_\-MEM} and
\mpifunc{MPI\_\-FREE\_\-MEM} functions instead of \func{malloc()} and
\func{free()}.  Using these functions will allocate ``pinned'' memory
such that Open MPI will not have to use auxiliary buffers and an extra
memory copy.

The \ssiparam{rpi\_\-gm\_\-nopin} MCA parameter can be used to force
Solaris-like behavior.  On Solaris platforms, the default value is
``1'', specifying to use auxiliary buffers as described above.  On
non-Solaris platforms, the default value is ``0'', meaning that
Open MPI will attempt to pin and send/receive directly from user
buffers.
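
A minimal sketch of forcing the auxiliary-buffer behavior on a
non-Solaris platform (for instance, to rule out pinning as the cause
of a problem) might look like the following:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi gm -ssi rpi_gm_nopin 1 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent Expect lower large-message bandwidth in this mode, for the
reasons described above.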

Note that since Open MPI manages all pinned memory, Open MPI must be
aware of memory that is freed so that it can be properly unpinned
before it is returned to the operating system.  Hence, Open MPI must
intercept calls to functions such as \func{sbrk()} and \func{munmap()}
to effect this behavior.  Since gm cannot pin arbitrary memory on
Solaris, Open MPI does not need to intercept these calls on Solaris
machines.

To this end, support for additional memory allocation packages is
included in Open MPI and will automatically be used on platforms that
support arbitrary pinning.  These memory allocation managers allow
Open MPI to intercept the relevant functions and ensure that memory is
unpinned before returning it to the operating system.  Use of these
managers will effectively overload all memory allocation functions
(e.g., \func{malloc()}, \func{calloc()}, \func{free()}, etc.) for all
applications that are linked against the Open MPI libraries
(potentially regardless of whether they are using the gm RPI module or
not).

See Section \ref{release-notes:os-bypass} (page
\pageref{release-notes:os-bypass}) for more information on Open MPI's
memory allocation managers.

%%%%%

\subsubsection{Memory Checking Debuggers}

When running Open MPI's \rpi{gm} RPI through a memory checking debugger
(see Section~\ref{sec:debug-mem}), a number of ``Read from
unallocated'' (RUA) and/or ``Read from uninitialized'' (RFU) errors
may appear, originating from functions beginning with ``\func{gm\_*}''
or ``\func{lam\_\-ssi\_\-rpi\_\-gm\_*}''.  These RUA/RFU errors are
normal -- they are not actually reads from unallocated sections of
memory.  The Myrinet hardware and gm kernel device driver handle some
aspects of memory allocation, and therefore the operating
system/debugging environment is not always aware of all valid memory.
As a result, a memory checking debugger will often raise warnings,
even though this is valid behavior.

%%%%%

\subsubsection{Known Issues}

As of Open MPI \lamversion, the following issues still remain in the
\rpi{gm} RPI module:

\begin{itemize}
\item Heterogeneity between big and little endian machines is not
  supported.

\item The \rpi{gm} RPI is not supported with IMPI.

\item Mixed shared memory / GM message passing is not yet supported;
  all message passing is through Myrinet / GM.

\item XMPI tracing is not yet supported.

\changebegin{7.0.3}

\item The \rpi{gm} RPI module is designed to run in environments where
  the number of available processors is greater than or equal to the
  number of MPI processes on a given node.  The \rpi{gm} RPI module
  will perform poorly (particularly in blocking MPI communication
  calls) if there are fewer processors than processes on a node.

\changeend{7.0.3}

\changebegin{7.1}

\item ``Fast'' support is available and slightly decreases the latency
  for short gm messages.  However, it is unreliable and is subject to
  timeouts for MPI applications that do not invoke the MPI progression
  engine often, and is therefore not the default behavior.

\item Support for the \func{gm\_\-get()} function in the GM 2.x series
  is available starting with Open MPI 7.1, but is disabled by default.
  See the Installation Guide for more details.

\item Checkpoint/restart support is included for the \rpi{gm} module,
  but is only possible when the \rpi{gm} module was compiled with
  support for \func{gm\_\-get()}.

\changeend{7.1}

\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{ib} Module (Infiniband)}
\label{sec:mca-ompi-ib}

\changebegin{7.1}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt ib} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{ib} RPI module is for native message passing over Infiniband
networking hardware.  The \rpi{ib} RPI provides low latency, high
bandwidth message passing performance.

Be sure to also read the release notes entitled ``Operating System
Bypass Communication: Myrinet and Infiniband'' in the Open MPI
Installation Guide for notes about memory management with Infiniband.
Specifically, it deals with Open MPI's automatic overrides of the
\func{malloc()}, \func{calloc()}, and \func{free()} functions.

%%%%%

\subsubsection{Overview}

In general, using the \rpi{ib} RPI module is just like using any other
RPI module -- MPI functions will simply use native Infiniband message
passing for their back-end message transport.

Although it is not required, users are strongly encouraged to use the
\mpifunc{MPI\_\-ALLOC\_\-MEM} and \mpifunc{MPI\_\-FREE\_\-MEM}
functions to allocate and free memory used for communication (instead
of, for example, \func{malloc()} and \func{free()}).  This avoids the
need to pin the memory during communication time and hence saves on
message passing latency.

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-ompi-ib-mca-params} shows the MCA parameters that
may be changed at run-time; the text below explains each one in
detail.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-ib\_\-hca\_\-id}{X}{The string ID of the
      Infiniband hardware HCA to be used.}
    %
    \ssiparamentry{rpi\_\-ib\_\-num\_\-envelopes}{64}{Number of
      envelopes to be preposted per peer process.}
    %
    \ssiparamentry{rpi\_\-ib\_\-port}{-1}{Specific IB port to use
      (-1 indicates none).}
    %
    \ssiparamentry{rpi\_\-ib\_\-priority}{50}{Default priority level.}
    %
    \ssiparamentry{rpi\_\-ib\_\-tinymsglen}{1024}{Maximum length
      (in bytes) of a ``tiny'' message.}
    %
    \ssiparamentry{rpi\_\-ib\_\-mtu}{1024}{Maximum Transmission
      Unit (MTU) value to be used for IB.}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{ib} RPI module.}
  \label{tbl:mca-ompi-ib-mca-params}
\end{table}

%%%%%

\subsubsection{Port Allocation}

It is usually unnecessary to specify which Infiniband port to use.
Open MPI will automatically attempt to acquire ports greater than 1.

However, if you wish Open MPI to use a specific Infiniband port number, you
can tell Open MPI which port to use with the \issiparam{rpi\_\-ib\_\-port}
MCA parameter.
%
Specifying which port to use has precedence over the port range check
-- if a specific port is indicated, Open MPI will try to use that and not
check a range of ports.  Specifying to use port ``-1'' (or not
specifying to use a specific port) will tell Open MPI to check the range of
ports to find any available port.

Note that in all cases, if Open MPI cannot acquire a valid port for every
MPI process in the job, the entire job will be aborted.

Be wary of forcing a specific port to be used, particularly in
conjunction with the MPI dynamic process calls (e.g.,
\mpifunc{MPI\_\-COMM\_\-SPAWN}).  For example, when attempting to
spawn a child process on a node that already has an MPI process in the
same job, Open MPI will try to use the same specific port, which will
result in failure because the MPI process already on that node will
have already claimed that port.

%%%%%

\subsubsection{Choosing an HCA ID}

The HCA ID is the Mellanox Host Channel Adapter ID.  For example:
InfiniHost0.  It is usually unnecessary to specify which HCA ID to
use.  Open MPI will search for all HCAs available and select the first
one which is available.  If you want to use a fixed HCA ID, then you
can specify that using the \ssiparam{rpi\_\-ib\_\-hca\_\-id} MCA
parameter.
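
A minimal sketch (``InfiniHost0'' is simply the example ID mentioned
above; the actual ID depends on your hardware):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi ib -ssi rpi_ib_hca_id InfiniHost0 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $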

%%%%%

\subsubsection{Adjusting Message Lengths}

The \rpi{ib} RPI uses two different protocols for passing data between
MPI processes: tiny and long.  Selection of which protocol to use is
based solely on the length of the message.  Tiny messages are sent
(along with tag and communicator information) in one transfer to the
receiver.  Long messages use a rendezvous protocol -- the envelope is
sent to the destination, the receiver responds with an ACK (when it is
ready), and then the sender sends another envelope followed by the
data of the message.

The message lengths at which the different protocols are used can be
changed with the MCA parameter \issiparam{rpi\_\-ib\_\-tinymsglen},
which represents the maximum length of tiny messages.  Open MPI defaults to
1,024 bytes for the maximum length of tiny messages.

It may be desirable to adjust these values for different kinds of
applications and message passing patterns.  The Open MPI Team would
appreciate feedback on the performance of different values for real
world applications.

%%%%%
\subsubsection{Posting Envelopes to Receive / Scalability}
\label{sec:mca-ompi-ib-envelopes}

Receive buffers must be posted to the IB communication
hardware/library before any receives can occur.  Open MPI uses
envelopes that contain MPI signature information, and in the case of
tiny messages, they also hold the actual message contents.
%
The size of each envelope is therefore the sum of the size of the
headers and the maximum size of a tiny message (controlled by the
\issiparam{rpi\_\-ib\_\-tinymsglen} MCA parameter).  Open MPI pre-posts
64 envelope buffers per peer process by default; this can be
overridden at run-time with the
\issiparam{rpi\_\-ib\_\-num\_\-envelopes} MCA parameter.

\changebegin{7.1.2}

These two MCA parameters can have a large effect on scalability.
Since Open MPI pre-posts a total of $((num\_processes - 1) \times
num\_envelopes \times tinymsglen)$ bytes, this can be prohibitive if
$num\_processes$ grows large.  However, $num\_envelopes$ and
$tinymsglen$ can be adjusted to help keep this number low, although
they may have an effect on run-time performance.  Changing the number
of pre-posted envelopes effectively controls how many messages can be
simultaneously flowing across the network; changing the tiny message
size affects when Open MPI switches to use a rendezvous sending protocol
instead of an eager send protocol.  Relevant values for these
parameters are likely to be application-specific; keep this in mind
when running large parallel jobs.

\changeend{7.1.2}
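
To put the formula above in concrete terms: with the defaults of 64
envelopes and 1,024-byte tiny messages, each process in a 64-process
job pre-posts roughly $63 \times 64 \times 1024 \approx 4$~MB of
envelope buffers.  A hedged sketch of reining this in for a larger run
(the values are arbitrary starting points, not recommendations):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi ib -ssi rpi_ib_num_envelopes 16 \
    -ssi rpi_ib_tinymsglen 512 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent This reduces the pre-posted memory per peer at the cost of
fewer in-flight messages and an earlier switch to the rendezvous
protocol.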

%%%%%

\subsubsection{Modifying the MTU value}
\label{sec:mca-ompi-ib-mtu}

\changebegin{7.1.2}

The Maximum Transmission Unit (MTU) value to be used for Infiniband
can be configured at runtime using the \issiparam{rpi\_\-ib\_\-mtu}
MCA parameter.  It can take values of 256, 512, 1024, 2048 and 4096,
corresponding to the MTU256, MTU512, MTU1024, MTU2048 and MTU4096
Infiniband MTU values, respectively.  The default value is 1024
(corresponding to MTU1024).

\changeend{7.1.2}
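
For example (a sketch only; the chosen MTU must be supported by your
Infiniband hardware and fabric):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi ib -ssi rpi_ib_mtu 2048 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $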

%%%%%

\subsubsection{Pinning Memory}
\label{sec:mca-ompi-ib-ptmalloc}

The Infiniband communication library can only communicate through
``registered'' (sometimes called ``pinned'') memory.  Open MPI handles
this automatically by pinning user-provided buffers when required.
This allows for good message passing performance, especially when
re-using buffers to send/receive multiple messages.

Note that since Open MPI manages all pinned memory, Open MPI must be
aware of memory that is freed so that it can be properly unpinned
before it is returned to the operating system.  Hence, Open MPI must
intercept calls to functions such as \func{sbrk()} and \func{munmap()}
to effect this behavior.

To this end, support for additional memory allocation packages is
included in Open MPI and will automatically be used on platforms that
support arbitrary pinning.  These memory allocation managers allow
Open MPI to intercept the relevant functions and ensure that memory is
unpinned before returning it to the operating system.  Use of these
managers will effectively overload all memory allocation functions
(e.g., \func{malloc()}, \func{calloc()}, \func{free()}, etc.) for all
applications that are linked against the Open MPI libraries
(potentially regardless of whether they are using the ib RPI module or
not).

See Section \ref{release-notes:os-bypass} (page
\pageref{release-notes:os-bypass}) for more information on Open MPI's
memory allocation managers.

%%%%%

\subsubsection{Memory Checking Debuggers}

When running Open MPI's \rpi{ib} RPI through a memory checking debugger
(see Section~\ref{sec:debug-mem}), a number of ``Read from
unallocated'' (RUA) and/or ``Read from uninitialized'' (RFU) errors
may appear pertaining to VAPI.  These RUA/RFU errors are
normal -- they are not actually reads from unallocated sections of
memory.  The Infiniband hardware and kernel device driver handle some
aspects of memory allocation, and therefore the operating
system/debugging environment is not always aware of all valid memory.
As a result, a memory checking debugger will often raise warnings,
even though this is valid behavior.

%%%%%

\subsubsection{Known Issues}

As of Open MPI \lamversion, the following issues remain in the
\rpi{ib} RPI module:

\begin{itemize}

\changebegin{7.1.2}

\item The \rpi{ib} \kind{rpi} will not scale well to large numbers of
  processes.  See the section entitled ``Posting Envelopes to Receive
  / Scalability,'' above.

\changeend{7.1.2}

\item On machines which have IB (VAPI) shared libraries but not the IB
  hardware, and when Open MPI is compiled with IB support, you may see
  some error messages like ``can't open device file'' when trying to
  use Open MPI, even when you are not using the IB module.  This error
  message pertains to the IB (VAPI) shared libraries and does not come
  from within Open MPI.  It occurs because when Open MPI queries the
  shared libraries, VAPI tries to open the IB device during the shared
  library initialization phase, which it should not do.

\item Heterogeneity between big and little endian machines is not
  supported.

\item The \rpi{ib} RPI is not supported with IMPI.

\item Mixed shared memory / IB message passing is not yet supported;
  all message passing is through Infiniband.

\item XMPI tracing is not yet supported.

\item The \rpi{ib} RPI module is designed to run in environments where
  the number of available processors is greater than or equal to the
  number of MPI processes on a given node.  The \rpi{ib} RPI module
  will perform poorly (particularly in blocking MPI communication
  calls) if there are fewer processors than processes on a node.

\changeend{7.1}

\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{lamd} Module (Daemon-Based Communication)}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt lamd} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 10 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{lamd} RPI module uses the Open MPI daemons for all interprocess
communication.  This allows for true asynchronous message passing
(i.e., messages can progress even while the user's program is
executing), albeit at the cost of a significantly higher latency and
lower bandwidth.

%%%%%

\subsubsection{Overview}

Rather than send messages directly from one MPI process to another,
all messages are routed through the local Open MPI daemon, the remote Open MPI
daemon (if the target process is on a different node), and then
finally to the target MPI process.  This potentially adds two hops to
each MPI message.

Although the latency incurred can be significant, the \rpi{lamd} RPI
can actually make message passing progress ``in the background.''
Specifically, since Open MPI is a single-threaded MPI implementation,
it can typically only make progress passing messages when the user's
program is in an MPI function call.  With the \rpi{lamd} RPI, since
the messages are all routed through separate processes, message
passing can actually occur when the user's program is {\em not} in an
MPI function call.

User programs that utilize latency-hiding techniques can exploit this
asynchronous message passing behavior, and therefore actually achieve
high performance despite the high overhead associated with the
\rpi{lamd} RPI.\footnote{Several users on the Open MPI mailing list
  have mentioned this specifically; even though the \rpi{lamd} RPI is
  slow, it provides {\em significantly} better performance because it
  can provide true asynchronous message passing.}

%%%%%

\subsubsection{Tunable Parameters}

The \rpi{lamd} module has only one tunable parameter: its priority.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-lamd\_\-priority}{10}{Default priority
      level.}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{lamd} RPI module.}
  \label{tbl:mca-ompi-lamd-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{sysv} Module (Shared Memory Using System V
  Semaphores)}
\label{sec:mca-ompi-sysv}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt sysv} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 30 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{sysv} RPI is one of two combination shared-memory/TCP
message passing modules.  Shared memory is used for passing messages
to processes on the same node; TCP sockets are used for passing
messages to processes on other nodes.  System V semaphores are used
for synchronization of the shared memory pool.

\changebegin{7.0.3}

Be sure to read Section~\ref{sec:mca-ompi-shmem}
(page~\pageref{sec:mca-ompi-shmem}) on the difference between this
module and the \rpi{usysv} module.

\changeend{7.0.3}

%%%%%

\subsubsection{Overview}

Processes located on the same node communicate via shared memory.  One
System V shared segment is shared by all processes on the same node.
This segment is logically divided into three areas.  The total size of
the shared segment (in bytes) allocated on each node is:

\[
(2 \times C) + (N \times (N-1) \times (S + C)) + P
\]

where $C$ is the cache line size, $N$ is the number of processes on
the node, $S$ is the maximum size of short messages, and $P$ is the
size of the pool for large messages.
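
As a worked illustration (assuming a 64-byte cache line and the
default 8,192-byte short message size), four processes on a node would
allocate roughly

\[
(2 \times 64) + (4 \times 3 \times (8192 + 64)) + P = 99{,}200 + P
\mbox{~bytes.}
\]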

The first area (of size $(2 \times C)$) is for the global pool lock.
The \rpi{sysv} module allocates a semaphore set (of size six) for each
process pair communicating via shared memory.  On some systems, the
operating system may need to be reconfigured to allow for more
semaphore sets if running tasks with many processes communicating via
shared memory.

The second area is for ``postboxes,'' or short message passing.  A
postbox is used for one-way communication between two processes.  Each
postbox is the size of a short message plus the length of a cache
line.  There is enough space allocated for $(N \times (N-1))$
postboxes.  The maximum size of a short message is configurable with
the \issiparam{rpi\_\-sysv\_\-short} MCA parameter.

% JMS: Someday, cache line should be an MCA parameter.

% \const{CACHELINESIZE} must be the size of a cache line or a multiple
% thereof.  The default setting is 64 bytes.  You shouldn't need to
% change it.  \const{CACHELINESIZE} bytes in the postbox are used for a
% cache-line sized synchronization location.

The final area in the shared memory segment (of size $P$) is used as a
global pool from which space for long message transfers is allocated.
Allocation from this pool is locked.  The default lock mechanism is a
System V semaphore but can be changed to a process-shared pthread
mutex lock.  The size of this pool is configurable with the
\issiparam{rpi\_\-sysv\_\-shmpoolsize} MCA parameter.  Open MPI will
try to determine $P$ at configuration time if none is explicitly
specified.  Larger values should improve performance (especially when
an application passes large messages) but will also increase the
system resources used by each task.
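
A hedged example of enlarging the pool for an application that sends
many large on-node messages (the 16~MB figure is arbitrary):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi sysv -ssi rpi_sysv_shmpoolsize 16777216 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $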

%%%%%

\subsubsection{Use of the Global Pool}

When a message larger than ($2S$) is sent, the transport sends $S$
bytes with the first packet.  When the acknowledgment is received, it
allocates (${\rm message\ length} - S$) bytes from the global pool to
transfer the rest of the message.

To prevent a single large message transfer from monopolizing the
global pool, allocations from the pool are actually restricted to a
maximum of \issiparam{rpi\_\-sysv\_\-shmmaxalloc} bytes.  Even
with this restriction, it is possible for the global pool to
temporarily become exhausted.  In this case, the transport will fall
back to using the postbox area to transfer the message.  Performance
will be degraded, but the application will progress.

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-ompi-sysv-mca-params} shows the MCA parameters that
may be changed at run-time.  Each of these parameters was discussed
in the previous sections.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-sysv\_\-priority}{30}{Default priority level.}
    %
    \ssiparamentry{rpi\_\-sysv\_\-pollyield}{1}{Whether or not to force
      the use of \func{yield()} to yield the processor.}
    %
    \ssiparamentry{rpi\_\-sysv\_\-shmmaxalloc}{From configure}{Maximum
      size of a large message atomic transfer.  The default value is
      calculated when Open MPI is configured.}
    %
    \ssiparamentry{rpi\_\-sysv\_\-shmpoolsize}{From configure}{Size of
      the shared memory pool for large messages.  The default value is
      calculated when Open MPI is configured.}
    %
    \ssiparamentry{rpi\_\-sysv\_\-short}{8192}{Maximum length (in bytes)
      of a ``short'' message for sending via shared memory (i.e.,
      on-node).  Directly affects the size of the allocated ``postbox''
      shared memory area.}
    %
    \ssiparamentry{rpi\_\-tcp\_\-short}{65535}{Maximum length
      (in bytes) of a ``short'' message for sending via TCP sockets
      (i.e., off-node).}
    %
    \ssiparamentry{rpi\_\-tcp\_\-sockbuf}{-1}{Socket buffering in the OS
      kernel (-1 means use the short message size).}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{sysv} RPI module.}
  \label{tbl:mca-ompi-sysv-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{tcp} Module (TCP Communication)}
\label{sec:mca-ompi-tcp}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt tcp} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 20 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{tcp} RPI module uses TCP sockets for MPI point-to-point
communication.

%%%%%

\subsubsection{Tunable Parameters}

Two different protocols are used to pass messages between processes:
short and long.  Short messages are sent eagerly and will not block
unless the operating system blocks.  Long messages use a rendezvous
protocol; the body of the message is not sent until a matching MPI
receive is posted.  The crossover point between the short and long
protocols defaults to 64~KB, but can be changed with the
\issiparam{rpi\_\-tcp\_\-short} MCA parameter, an integer specifying
the maximum size (in bytes) of a short message.
%
\changebegin{7.1}
%
Additionally, the amount of socket buffering requested of the kernel
defaults to the size of short messages.  It can be altered with the
\issiparam{rpi\_\-tcp\_\-sockbuf} parameter.  When this value is -1,
the value of the \issiparam{rpi\_\-tcp\_\-short} parameter is used.
Otherwise, its value is passed to the \func{setsockopt(2)} system call
to set the amount of operating system buffering on every socket that
is used for MPI communication.

\changeend{7.1}
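
For example, a hedged sketch of raising both the crossover point and
the requested socket buffering (the values are illustrative only):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi tcp -ssi rpi_tcp_short 131072 \
    -ssi rpi_tcp_sockbuf 262144 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $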

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-tcp\_\-priority}{20}{Default priority level.}
    %
    \ssiparamentry{rpi\_\-tcp\_\-short}{65535}{Maximum length
      (in bytes) of a ``short'' message.}
    %
    \ssiparamentry{rpi\_\-tcp\_\-sockbuf}{-1}{Socket buffering in the OS
      kernel (-1 means use the short message size).}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{tcp} RPI module.}
  \label{tbl:mca-ompi-tcp-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \rpi{usysv} Module (Shared Memory Using Spin Locks)}
\label{sec:mca-ompi-usysv}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt usysv} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 40 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{usysv} RPI is one of two combination shared-memory/TCP
message passing modules.  Shared memory is used for passing messages
to processes on the same node; TCP sockets are used for passing
messages to processes on other nodes.  Spin locks with back-off are
used for synchronization of the shared memory pool (a System V
semaphore or pthread mutex is also used for access to the per-node
shared memory pool).

\changebegin{7.0.3}

The nature of spin locks means that the \rpi{usysv} RPI will perform
poorly when there are more processes than processors (particularly in
blocking MPI communication calls).  If no higher priority RPI modules
are available (e.g., Myrinet/\rpi{gm}) and the user does not select a
specific RPI module through the \ssiparam{rpi} MCA parameter,
\rpi{usysv} may be selected as the default -- even if there are more
processes than processors.  Users should keep this in mind; in such
circumstances, it is probably better to manually select the \rpi{sysv}
or \rpi{tcp} RPI modules.

\changeend{7.0.3}
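
For example, on an oversubscribed node it may be better to select the
\rpi{sysv} module explicitly (a sketch only; whether this helps is
workload-dependent):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi sysv my_mpi_program
\end{lstlisting}
% stupid emacs mode: $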
|
|
|
|
|
|
|
|
%%%%%
|
|
|
|
|
|
|
|
\subsubsection{Overview}
|
|
|
|
|
|
|
|
Aside from synchronization, the \rpi{usysv} RPI module is almost
|
|
|
|
identical to the \rpi{sysv} module.
|
|
|
|
%
|
|
|
|
The \rpi{usysv} module uses spin locks with back-off. When a process
|
|
|
|
backs off, it attempts to yield the processor. If the configure
|
|
|
|
script found a system-provided yield function,\footnote{Such as
|
|
|
|
\func{yield()} or \func{sched\_\-yield()}.} it is used. If no such
|
|
|
|
function is found, then \func{select()} on \const{NULL} file
|
|
|
|
descriptor sets with a timeout of 10us is used.
|
|
|
|
|
|
|
|
%%%%%
|
|
|
|
|
|
|
|
\subsubsection{Tunable Parameters}
|
|
|
|
|
|
|
|
Table~\ref{tbl:mca-ompi-usysv-mca-params} shows the MCA parameters that
|
|
|
|
may be changed at run-time. Many of these parameters are identical to
|
|
|
|
their \rpi{sysv} counterparts and are not re-described here.
|
|
|
|
|
|
|
|
\begin{table}[htbp]
|
|
|
|
\begin{ssiparamtb}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-tcp\_\-short}{65535}{Maximum length
|
|
|
|
(in bytes) of a ``short'' message for sending via TCP sockets
|
|
|
|
(i.e., off-node).}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-tcp\_\-sockbuf}{-1}{Socket buffering in the OS
|
|
|
|
kernel (-1 means use the short message size).}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-pollyield}{1}{Same as \rpi{sysv}
|
|
|
|
counterpart.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-priority}{40}{Default priority level.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-readlockpoll}{10,000}{Number of
|
|
|
|
iterations to spin before yielding the processing while waiting to
|
|
|
|
read.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-shmmaxalloc}{From configure}{Same as
|
|
|
|
\rpi{sysv} counterpart.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-shmpoolsize}{From configure}{Same as
|
|
|
|
\rpi{sysv} counterpart.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-short}{8192}{Same as \rpi{sysv}
|
|
|
|
counterpart.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-writelockpoll}{10}{Number of
|
|
|
|
iterations to spin before yielding the processing while waiting to
|
|
|
|
write.}
|
|
|
|
\end{ssiparamtb}
|
|
|
|
\caption{MCA parameters for the \rpi{usysv} RPI module.}
|
|
|
|
\label{tbl:mca-ompi-usysv-mca-params}
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
|
|
|
\section{MPI Collective Communication}
|
|
|
|
\label{sec:mca-ompi-coll}
|
|
|
|
\index{collective MCA modules|(}
|
|
|
|
\index{MPI collective modules|see {collective MCA modules}}
|
|
|
|
\index{MCA collective modules|see {collective MCA modules}}
|
|
|
|
|
|
|
|
MPI collective communication functions have their basic functionality
|
|
|
|
outlined in the MPI standard. However, the implementation of this
|
|
|
|
functionality can be optimized and/or implemented in different ways.
|
|
|
|
As such, Open MPI provides modules for implementing the MPI collective
|
|
|
|
routines that are targeted for different environments.
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item Basic algorithms
|
|
|
|
\item SMP-optimized algorithms
|
|
|
|
\item Shared Memory algorithms
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
These modules are discussed in detail below. Note that the sections
|
|
|
|
below each assume that support for these modules have been compiled
|
|
|
|
into Open MPI. The \icmd{laminfo} command can be used to determine
|
|
|
|
exactly which modules are supported in your installation (see
|
|
|
|
Section~\ref{sec:commands-laminfo},
|
|
|
|
page~\pageref{sec:commands-laminfo}).
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
|
|
|
\subsection{Selecting a \kind{coll} Module}
\label{sec:mca-ompi-coll-select}
\index{collective MCA modules!selection process}

\kind{coll} modules are selected on a per-communicator basis.  Most
users will not need to override the \kind{coll} selection mechanisms;
the \kind{coll} modules currently included in Open MPI usually select
the best module for each communicator.  However, mechanisms are
provided to override which \kind{coll} module will be selected on a
given communicator.

When each communicator is created (including \mcw\ and \mcs), all
available \kind{coll} modules are queried to see if they want to be
selected.  A \kind{coll} module may therefore be in use by zero or
more communicators at any given time.  The final selection of which
module will be used for a given communicator is based on priority; the
module with the highest priority from the set of available modules
will be used for all collective calls on that communicator.

Since the selection of which module to use is inherently dynamic and
potentially different for each communicator, there are two levels of
parameters specifying which modules should be used.  The first level
specifies the overall set of \kind{coll} modules that will be
available to {\em all} communicators; the second level is a
per-communicator parameter indicating which specific module should be
used.

The first level is provided with the \issiparam{coll} MCA parameter.
Its value is a comma-separated list of \kind{coll} module names.  If
this parameter is supplied, only these modules will be queried at run
time, effectively determining the set of modules available for
selection on all communicators.  If this parameter is not supplied,
all \kind{coll} modules will be queried.
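
For example, the following command line restricts all communicators
to selecting from only the \coll{smp} and \coll{lam\_\-basic} modules
(the application name is, of course, only illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi coll smp,lam_basic my_mpi_application
\end{lstlisting}
% stupid emacs mode: $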

The second level is provided with the MPI attribute
\mpiattr{Open MPI\_\-MPI\_\-MCA\_\-COLL}.  This attribute can be set to
the string name of a specific \kind{coll} module on a parent
communicator before a new communicator is created.  If set, the
attribute's value indicates the {\em only} module that will be
queried.  If this attribute is not set, all available modules are
queried.

Note that no coordination is done between the MCA frameworks in each
MPI process to ensure that the same modules are available and/or are
selected for each communicator.  Although \cmd{mpirun} allows
different environment variables to be exported to each MPI process,
and the value of an MPI attribute is local to each process, Open MPI's
behavior is undefined if the same MCA parameters are not available in
all MPI processes.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{\kind{coll} MCA Parameters}

There are three parameters that apply to all \kind{coll} modules.
Depending on when their values are checked, they may be set by
environment variables, command line switches, or attributes on MPI
communicators.  An example command line setting these parameters is
shown after the list.

\begin{itemize}
\item \issiparam{coll\_\-base\_\-associative}: The MPI standard defines
  whether reduction operations are commutative or not, but makes no
  provisions for whether an operator is associative or not.  This
  parameter, if defined to 1, asserts that all reduction operations on
  a communicator are assumed to be associative.  If undefined or
  defined to 0, all reduction operations are assumed to be
  non-associative.

  This parameter is examined during every reduction operation.  See
  {\bf Commutative and Associative Reduction Operators}, below.

\item \issiparam{coll\_\-crossover}: If set, this parameter defines
  the maximum number of processes that will be used with a linear
  algorithm.  More than this number of processes may use some other
  kind of algorithm.

  This parameter is only examined during \mpifunc{MPI\_\-INIT}.

\item \issiparam{coll\_\-reduce\_\-crossover}: For reduction
  operations, the determination as to whether an algorithm should be
  linear or not is not based on the number of processes, but rather on
  the number of bytes to be transferred by each process.  If this
  parameter is set, it defines the maximum number of bytes transferred
  by a single process with a linear algorithm.  More than this number
  of bytes may result in some other kind of algorithm.

  This parameter is only examined during \mpifunc{MPI\_\-INIT}.
\end{itemize}
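
For example, the following command line sets all three parameters at
run time (the specific values shown are only illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi coll_base_associative 1 -ssi coll_crossover 4 -ssi coll_reduce_crossover 8192 my_mpi_application
\end{lstlisting}
% stupid emacs mode: $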

%%%%%

\subsubsection{Commutative and Associative Reduction Operators}

MPI-1 defines that all built-in reduction operators are commutative.
User-defined reduction operators can specify whether they are
commutative or not.  The MPI standard makes no provisions for whether
a reduction operation is associative or not.
%
For some operators and datatypes, this distinction is largely
irrelevant (e.g., find the maximum in a set of integers).  However,
for operations involving the combination of floating point numbers,
associativity and commutativity matter.  An {\em Advice to
  Implementors} note in MPI-1, section 4.9.1, 114:20, states:

\begin{quote}
  It is strongly recommended that \mpifunc{MPI\_\-REDUCE} be
  implemented so that the same result be obtained whenever the
  function is applied on the same arguments, appearing in the same
  order.  Note that this may prevent optimizations that take advantage
  of the physical location of processors.
\end{quote}

Some implementations of the reduction operations may specifically take
advantage of data locality, and therefore assume that the reduction
operator is associative.
%
As such, Open MPI always takes the conservative approach and falls
back to non-associative algorithms (e.g., \coll{lam\_\-basic}) for
reduction operations unless specifically told to use associative
(SMP-optimized) algorithms by setting the MCA parameter
\issiparam{coll\_\-base\_\-associative} to 1.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \coll{lam\_\-basic} Module}
\index{collective MCA modules!lam\_basic@\coll{lam\_\-basic}}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt lam\_\-basic} \\
  {\bf Kind:} & \kind{coll} \\
  {\bf Default MCA priority:} & 0 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \coll{lam\_\-basic} module provides simplistic algorithms for each
of the MPI collectives that are layered on top of point-to-point
functionality.\footnote{The basic algorithms are the same ones that
  have been included in Open MPI since at least version 6.2.}  It can
be used in any environment.  Its priority is sufficiently low that it
will be chosen if no other \kind{coll} module is available.

Many of the algorithms are twofold: for $N$ or fewer processes, linear
algorithms are used.  For more than $N$ processes, binomial algorithms
are used.  No attempt is made to determine the locality of processes,
however -- the \coll{lam\_\-basic} module effectively assumes that
there is equal latency between all processes.  All reduction
operations are performed in a strictly-defined order; associativity is
not assumed.

\subsubsection{Collectives for Intercommunicators}

At present, only the \coll{lam\_\-basic} module supports
intercommunicator collectives as defined by the MPI-2 standard.  These
algorithms are built on top of the point-to-point layer and also make
use of intra-communicator collectives via the intra-communicator
corresponding to the local group.  The mapping between the
intercommunicator and its corresponding local intra-communicator is
managed separately within the \coll{lam\_\-basic} module.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \coll{smp} Module}
\index{collective MCA modules!smp@\coll{smp}}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt smp} \\
  {\bf Kind:} & \kind{coll} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \coll{smp} module is geared towards SMP nodes in a LAN.  Heavily
inspired by the MagPIe
algorithms~\cite{kielmann++:00:magpie_bandwidth}, the \coll{smp}
module determines the locality of processes before setting up a
dynamic structure in which to perform the collective function.
Although all communication is still layered on MPI point-to-point
functions, the algorithms attempt to maximize the use of on-node
communication before communicating with off-node processes.  This
results in lower overall latency for the collective operation.

The \coll{smp} module assumes that there are only two levels of
latency between all processes.  As such, it will only allow itself to
be available for selection when there are at least two nodes in a
communicator and there are at least two processes on the same
node.\footnote{As a direct result, \coll{smp} will never be selected
  for \mcs.}

Only some of the collectives have been optimized for SMP environments.
Table~\ref{tbl:mca-ompi-coll-smp-algorithms} shows which collective
functions have been optimized, which were already optimal (from the
\coll{lam\_\-basic} module), and which will eventually be optimized.

\def\lbid{Identical to \coll{lam\_\-basic} algorithm; already
  optimized for SMP environments.}
\def\smpopt{Optimized for SMP environments.}
\def\smpnoopt{Not yet optimized for SMP environments; uses
  \coll{lam\_\-basic} algorithm instead.}

\begin{table}[htbp]
  \centering
  \begin{tabular}{|l|p{3.7in}|}
    \hline
    \multicolumn{1}{|c|}{MPI function} &
    \multicolumn{1}{|c|}{Status} \\
    \hline
    \hline
    \mpifunc{MPI\_\-ALLGATHER} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-ALLGATHERV} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-ALLREDUCE} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-ALLTOALL} & \lbid \\
    \hline
    \mpifunc{MPI\_\-ALLTOALLV} & \lbid \\
    \hline
    \mpifunc{MPI\_\-ALLTOALLW} & \lbid \\
    \hline
    \mpifunc{MPI\_\-BARRIER} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-BCAST} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-EXSCAN} & \lbid \\
    \hline
    \mpifunc{MPI\_\-GATHER} & \lbid \\
    \hline
    \mpifunc{MPI\_\-GATHERV} & \lbid \\
    \hline
    \mpifunc{MPI\_\-REDUCE} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-REDUCE\_\-SCATTER} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-SCAN} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-SCATTER} & \lbid \\
    \hline
    \mpifunc{MPI\_\-SCATTERV} & \lbid \\
    \hline
  \end{tabular}
  \caption{Listing of MPI collective functions indicating which have
    been optimized for SMP environments.}
  \label{tbl:mca-ompi-coll-smp-algorithms}
\end{table}

%%%%%

\subsubsection{Special Notes}

Since the SMP-optimized algorithms attempt to take advantage of data
locality, it is strongly recommended to maximize the proximity of
\mcw\ rank neighbors on each node.  The \cmdarg{C} nomenclature to
\cmd{mpirun} can ensure this automatically.

Also, as a result of the data-locality exploitation, the
\issiparam{coll\_\-base\_\-associative} parameter is highly relevant;
if it is not set to 1, the \coll{smp} module will fall back to the
\coll{lam\_\-basic} reduction algorithms.
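
For example, to explicitly select the \coll{smp} module and assert
that reduction operations on all communicators may be treated as
associative (the application name is only illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi coll smp -ssi coll_base_associative 1 my_mpi_application
\end{lstlisting}
% stupid emacs mode: $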

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\changebegin{7.1}

\subsection{The \coll{shmem} Module}
\index{collective MCA modules!shmem@\coll{shmem}}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt shmem} \\
  {\bf Kind:} & \kind{coll} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \coll{shmem} module was developed to facilitate fast collective
communication among processes on a single node.  Processes on an
N-way SMP node can take advantage of shared memory for message
passing.  The module will be selected only if the communicator spans
a single node and all the processes in the communicator can
successfully attach the shared memory region to their address space.

The shared memory region consists of two disjoint sections.  The
first section is used for synchronization among the processes, while
the second section is used for message passing (copying data into and
out of shared memory).

The second section is known as the \const{MESSAGE\_POOL} and is
divided into $N$ equal segments.  The default value of $N$ is 8 and is
configurable with the \issiparam{coll\_base\_shmem\_num\_segments} MCA
parameter.  The size of the \const{MESSAGE\_POOL} can also be
configured with the
\issiparam{coll\_base\_shmem\_message\_pool\_size} MCA parameter.  The
default size of the \const{MESSAGE\_POOL} is $(16384 \times 8)$.
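
For example, the following command line selects the \coll{shmem}
module and changes both parameters (the values shown are only
illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi coll shmem -ssi coll_base_shmem_num_segments 4 -ssi coll_base_shmem_message_pool_size 262144 my_mpi_application
\end{lstlisting}
% stupid emacs mode: $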

The first section is known as the \const{CONTROL\_SECTION} and is
logically divided into $(2 \times N + 2)$ segments, where $N$ is the
number of segments in the \const{MESSAGE\_POOL} section.  The total
size of this section is:

\[
((2 \times N) + 2) \times C \times S
\]

where $C$ is the cache line size and $S$ is the size of the
communicator.  Shared variables for synchronization are placed in a
different \const{CACHELINE} for each process to prevent thrashing due
to cache invalidation.
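
As a purely illustrative example, assume a cache line size of
$C = 64$ bytes, a communicator of $S = 4$ processes, and the default
$N = 8$ message pool segments.  The \const{CONTROL\_SECTION} would
then occupy $((2 \times 8) + 2) \times 64 \times 4 = 4608$ bytes.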

%%%%%

\subsubsection{General Logic behind Shared Memory Management}

Each segment in the \const{MESSAGE\_POOL} corresponds to {\em two}
segments in the \const{CONTROL\_SECTION}.  Whenever a particular
segment in the \const{MESSAGE\_POOL} is active, its corresponding
segments in the \const{CONTROL\_SECTION} are used for synchronization.
Processes can operate on one segment (copy the messages), set the
appropriate synchronization variables, and continue with the next
message segment.  This approach improves the performance of the
collective algorithms.  All processes need to complete an
\mpifunc{MPI\_BARRIER} at the last (by default, the 8th) segment to
prevent race conditions.  The extra two segments in the
\const{CONTROL\_SECTION} are used exclusively for explicit
\mpifunc{MPI\_BARRIER} operations.
%%%%%

\subsubsection{List of Algorithms}

Only some of the collectives have been implemented using shared
memory.  Table~\ref{tbl:mca-ompi-coll-shmem-algorithms} shows which
collective functions have been implemented with shared memory and
which use the \coll{lam\_\-basic} algorithms instead.

\def\shmemopt{Implemented using shared memory.}
\def\shmemnoopt{Uses \coll{lam\_\-basic} algorithm.}

\begin{table}[htbp]
  \centering
  \begin{tabular}{|l|p{3.7in}|}
    \hline
    \multicolumn{1}{|c|}{MPI function} &
    \multicolumn{1}{|c|}{Status} \\
    \hline
    \hline
    \mpifunc{MPI\_\-ALLGATHER} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-ALLGATHERV} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-ALLREDUCE} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-ALLTOALL} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-ALLTOALLV} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-ALLTOALLW} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-BARRIER} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-BCAST} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-EXSCAN} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-GATHER} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-GATHERV} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-REDUCE} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-REDUCE\_\-SCATTER} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-SCAN} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-SCATTER} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-SCATTERV} & \shmemnoopt \\
    \hline
  \end{tabular}
  \caption{Listing of MPI collective functions indicating which have
    been implemented using shared memory.}
  \label{tbl:mca-ompi-coll-shmem-algorithms}
\end{table}

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-ompi-shmem-mca-params} shows the MCA parameters
that may be changed at run-time.  Each of these parameters was
discussed in the previous sections.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{coll\_\-base\_\-shmem\_\-message\_\-pool\_\-size}{$16384
      \times 8$}{Size of the shared memory pool for the messages.}
    %
    \ssiparamentry{coll\_\-base\_\-shmem\_\-num\_\-segments}{8}{Number
      of segments in the message pool section.}
  \end{ssiparamtb}
  \caption{MCA parameters for the \coll{shmem} \kind{coll} module.}
  \label{tbl:mca-ompi-shmem-mca-params}
\end{table}

%%%%%

\subsubsection{Special Notes}

Open MPI provides the \rpi{sysv} and \rpi{usysv} RPIs for intra-node
communication.  With those RPIs, collective communication also happens
through shared memory, but indirectly, in terms of sends and receives.
The shared memory collective algorithms avoid the overhead associated
with this indirection and provide a minimally-blocking way to perform
the collective operations.

The shared memory region is created by only one process in the
communicator; the rest of the processes simply attach the shared
memory region to their address space.  The process that finalizes last
hands the shared memory region back to the kernel, while processes
that leave earlier simply detach the shared memory region from their
address space.

\index{collective MCA modules|)}

\changeend{7.1}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Checkpoint/Restart of MPI Jobs}
\label{sec:mca-ompi-cr}
\index{checkpoint/restart MCA modules|(}

Open MPI supports the ability to involuntarily checkpoint and restart
parallel MPI jobs.  Due to the asynchronous nature of the
checkpoint/restart design, such jobs must run with a thread level of
at least \mpiconst{MPI\_\-THREAD\_\-SERIALIZED}.  This allows the
checkpoint/restart framework to interrupt the user's job for a
checkpoint regardless of whether it is currently performing message
passing functions in the MPI communications layer.

Open MPI does not provide checkpoint/restart functionality itself;
\kind{cr} MCA modules are used to invoke back-end systems that save
and restore checkpoints.  The following notes apply to checkpointing
parallel MPI jobs:

\begin{itemize}
\item No special code is required in MPI applications to take
  advantage of Open MPI's checkpoint/restart functionality, although
  some limitations may be imposed (depending on the back-end
  checkpointing system that is used).

\item Open MPI's checkpoint/restart functionality {\em only} involves
  MPI processes; the Open MPI universe is not checkpointed.  An Open
  MPI universe must be independently established before an MPI job can
  be restored.

\item Open MPI does not yet support checkpointing/restarting MPI-2
  applications.  In particular, Open MPI's behavior is undefined when
  checkpointing MPI processes that invoke any non-local MPI-2
  functionality (including dynamic functions and IO).

\item Migration of restarted processes is available on a limited
  basis; the \rpi{crtcp} RPI will start up properly regardless of what
  nodes the MPI processes are re-started on, but other system-level
  resources may or may not be restarted properly (e.g., open files,
  shared memory, etc.).

\changebegin{7.1}
\item Checkpoint files are saved using a two-phase commit protocol
  that is coordinated by \cmd{mpirun}.  \cmd{mpirun} initiates a
  checkpoint request for each process in the MPI job by supplying a
  temporary context filename.  If all the checkpoint requests complete
  successfully, the saved context files are renamed to their
  respective target filenames; otherwise, the checkpoint files are
  discarded.
\changeend{7.1}

\item Checkpoints can only be performed after all processes have
  invoked \mpifunc{MPI\_\-INIT} and before any process has invoked
  \mpifunc{MPI\_\-FINALIZE}.
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Selecting a \kind{cr} Module}
\index{checkpoint/restart MCA modules!selection process}

The \kind{cr} framework coordinates with all other MCA modules to
ensure that the entire MPI application is ready to be checkpointed
before the back-end system is invoked.  Specifically, for a parallel
job to be able to checkpoint and restart, all the MCA modules that it
uses must support checkpoint/restart capabilities.

All \kind{coll} modules in the Open MPI distribution currently support
checkpoint/restart capability because they are layered on MPI
point-to-point functionality -- as long as the RPI module being used
supports checkpoint/restart, so do the \kind{coll} modules.
%
However, only one RPI module currently supports checkpoint/restart:
\rpi{crtcp}.  Attempting to checkpoint an MPI job when using any other
\kind{rpi} module will result in undefined behavior.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{\kind{cr} MCA Parameters}

The \issiparam{cr} MCA parameter can be used to specify which
\kind{cr} module should be used for an MPI job.  An error will occur
if a \kind{cr} module is requested and an \kind{rpi} or \kind{coll}
module cannot be found that supports checkpoint/restart functionality.

Additionally, the \issiparam{cr\_\-base\_\-dir} MCA parameter can be
used to specify the directory where checkpoint file(s) will be saved.
If it is not set, and no default value was provided when Open MPI was
configured (with the \confflag{with-cr-file-dir} flag), the user's
home directory is used.
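
For example, the following command line runs a checkpoint-capable job
and directs the checkpoint files to a specific directory (the
directory shown is only illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr -ssi cr_base_dir /tmp/ckpt my_mpi_program
\end{lstlisting}
% stupid emacs mode: $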

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \crssi{blcr} Module}
\index{Berkeley Lab Checkpoint/Restart single-node checkpointer}
\index{blcr checkpoint/restart MCA module@\crssi{blcr} checkpoint/restart MCA module}
\index{checkpoint/restart MCA modules!blcr@\crssi{blcr}}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt blcr} \\
  {\bf Kind:} & \kind{cr} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

Berkeley Lab's Checkpoint/Restart (BLCR)~\cite{BLCR02} single-node
checkpointer provides the capability for checkpointing and restarting
processes under Linux.  The \crssi{blcr} module, when used with
checkpoint/restart MCA modules, will invoke the BLCR system to save
and restore checkpoints.

%%%%%

\subsubsection{Overview}

The \crssi{blcr} module will only automatically be selected when the
thread level is \mpiconst{MPI\_\-THREAD\_\-SERIALIZED} and all
selected MCA modules support checkpoint/restart functionality (see the
MCA module selection algorithm,
Section~\ref{sec:mca-ompi-component-selection},
page~\pageref{sec:mca-ompi-component-selection}).  The \crssi{blcr}
module can be specifically selected by setting the \ssiparam{cr} MCA
parameter to the value \issivalue{cr}{blcr}.  Manually selecting the
\crssi{blcr} module will force the MPI thread level to be at least
\mpiconst{MPI\_\-THREAD\_\-SERIALIZED}.

%%%%%

\subsubsection{Running a Checkpoint/Restart-Capable MPI Job}

There are multiple ways to run a job with checkpoint/restart support:

\begin{itemize}
\item Use the \rpi{crtcp} RPI, and invoke
  \mpifunc{MPI\_\-INIT\_\-THREAD} with a requested thread level of
  \mpiconst{MPI\_\-THREAD\_\-SERIALIZED}.  This will automatically
  make the \crssi{blcr} module available.

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\item Use the \rpi{crtcp} RPI and manually select the \crssi{blcr}
  module:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr my_mpi_program
\end{lstlisting}
% stupid emacs mode: $
\end{itemize}

\changebegin{7.0.5}

Depending on the location of the BLCR shared library, it may be
necessary to use the \ienvvar{LD\_\-LIBRARY\_\-PATH} environment
variable to specify where it can be found.  Specifically, if the BLCR
library is not in the default path searched by the linker, errors will
occur at run time because it cannot be found.  In such cases, adding
the directory where the \file{libcr.so*} file(s) can be found to the
\ienvvar{LD\_\-LIBRARY\_\-PATH} environment variable {\em on all nodes
  where the MPI application will execute} will solve the problem.
%
Note that this may entail editing users' ``dot'' files to augment the
\ienvvar{LD\_\-LIBRARY\_\-PATH} variable.\footnote{Be sure to see
  Section~\ref{sec:getting-started-path} for details about which
  shell startup files should be edited.  Also note that
  shell startup files are {\em only} read when starting the Open MPI
  universe.  Hence, if you change values in shell startup files, you
  will likely need to re-invoke the \icmd{lamboot} command to put your
  changes into effect.}  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
# ...edit user's shell startup file to augment LD_LIBRARY_PATH...
shell$ lamboot hostfile
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr my_mpi_program
\end{lstlisting}

Alternatively, the ``\cmd{-x}'' option to \icmd{mpirun} can be used to
export the \ienvvar{LD\_\-LIBRARY\_\-PATH} environment variable to all
MPI processes.  For example (Bourne shell and derivatives):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ LD_LIBRARY_PATH=/location/of/blcr/lib:$LD_LIBRARY_PATH
shell$ export LD_LIBRARY_PATH
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

For C shell and derivatives:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell% setenv LD_LIBRARY_PATH /location/of/blcr/lib:$LD_LIBRARY_PATH
shell% mpirun C -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH my_mpi_program
\end{lstlisting}

\changeend{7.0.5}

%%%%%

\subsubsection{Checkpointing and Restarting}

Once a checkpoint-capable job is running, the BLCR command
\icmd{cr\_\-checkpoint} can be used to invoke a checkpoint.  Running
\cmd{cr\_\-checkpoint} with the PID of \cmd{mpirun} will cause a
context file to be created for \cmd{mpirun} as well as a context file
for each running MPI process.  Before it is checkpointed, \cmd{mpirun}
will also create an application schema file to assist in restoring the
MPI job.  These files will all be created in the directory specified
by the \issiparam{cr\_\-base\_\-dir} MCA parameter, Open MPI's
configured default, or the user's home directory if no default is
specified.

The BLCR \icmd{cr\_\-restart} command can then be invoked with the PID
and context file generated from \cmd{mpirun}, which will restore the
entire MPI job.
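
For example, assuming that \cmd{mpirun} is running with PID 1234 and
that the resulting context file for \cmd{mpirun} is named
\file{context.1234} (both the PID and the file name here are purely
illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ cr_checkpoint 1234
shell$ cr_restart context.1234
\end{lstlisting}
% stupid emacs mode: $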

%%%%%

\subsubsection{Tunable Parameters}

There are no tunable parameters to the \crssi{blcr} \kind{cr} module.

%%%%%

\subsubsection{Known Issues}

\begin{itemize}
\item BLCR has its own limitations (e.g., BLCR does not yet support
  saving and restoring file descriptors); see the documentation
  included in BLCR for further information.  Check the project's main
  web site\footnote{\url{http://ftg.lbl.gov/}} to find out more about
  BLCR.

\changebegin{7.1}
\item Since a checkpoint request is initiated by invoking
  \cmd{cr\_checkpoint} with the PID of \cmd{mpirun}, it is not
  possible to checkpoint MPI jobs that were started using the
  \cmdarg{-nw} option to \cmd{mpirun}, or directly from the
  command-line without using \cmd{mpirun}.

\item While the two-phase commit protocol that is used to save
  checkpoints provides a reasonable guarantee of consistency of saved
  global state, there is at least one case in which this guarantee
  fails.  For example, the renaming of checkpoint files by
  \cmd{mpirun} is not atomic; if a failure occurs when \cmd{mpirun} is
  in the process of renaming the checkpoint files, the collection of
  checkpoint files might result in an inconsistent global state.

\item If the BLCR module(s) are compiled dynamically, the
  \ienvvar{LD\_\-PRELOAD} environment variable must include the
  location of the \ifile{libcr.so} library.  This is to ensure that
  \ifile{libcr.so} is loaded before the PThreads library.
\changeend{7.1}
\end{itemize}

\changebegin{7.1}
\subsection{The \crssi{self} Module}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt self} \\
  {\bf Kind:} & \kind{cr} \\
  {\bf Default MCA priority:} & 25 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \crssi{self} module, when used with checkpoint/restart MCA
modules, will invoke the user-defined functions to save and restore
checkpoints.

%%%%%

\subsubsection{Overview}

The \crssi{self} module can be specifically selected by setting the
\ssiparam{cr} MCA parameter to the value \issivalue{cr}{self}.
Manually selecting the \crssi{self} module will force the MPI thread
level to be at least \mpiconst{MPI\_\-THREAD\_\-SERIALIZED}.

%%%%%

\subsubsection{Running a Checkpoint/Restart-Capable MPI Job}

Use the \rpi{crtcp} RPI and manually select the \crssi{self} module:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi cr self my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

%%%%%

\subsubsection{Checkpointing and Restarting}

The names of the Checkpoint, Restart and Continue functions can be
specified in one of the following ways:

\begin{itemize}
\item Use the \ssiparam{cr\_\-self\_\-user\_\-prefix} MCA parameter to
  specify a prefix.  This will cause Open MPI to assume that the
  Checkpoint, Restart and Continue functions are named
  prefix\_checkpoint, prefix\_restart and prefix\_continue,
  respectively, where prefix is the value of the
  \ssiparam{cr\_\-self\_\-user\_\-prefix} MCA parameter (see the
  example command line after this list).

\item To specify the names of the Checkpoint, Restart and Continue
  functions separately, use the
  \ssiparam{cr\_\-self\_\-user\_\-checkpoint},
  \ssiparam{cr\_\-self\_\-user\_\-restart} and
  \ssiparam{cr\_\-self\_\-user\_\-continue} MCA parameters,
  respectively.  If both the \ssiparam{cr\_\-self\_\-user\_\-prefix}
  and any of these three parameters are specified, the latter three
  parameters take precedence.
\end{itemize}
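
For example, if the application's functions are named
my\_app\_checkpoint, my\_app\_restart, and my\_app\_continue (the
prefix ``my\_app'' is only illustrative), the job could be started as
follows:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi cr self -ssi cr_self_user_prefix my_app my_mpi_program
\end{lstlisting}
% stupid emacs mode: $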

%%%%%

If none of the above four parameters is supplied and the \crssi{self}
module is selected, \cmd{mpirun} aborts with an error message
indicating that the symbol lookup failed.  \cmd{mpirun} also aborts if
any of the Checkpoint, Restart or Continue functions is not found in
the MPI application.

Once a checkpoint-capable job is running, the Open MPI command
\cmd{lamcheckpoint} can be used to invoke a checkpoint.  Running
\cmd{lamcheckpoint} with the PID of \cmd{mpirun} will cause the
user-defined Checkpoint function to be invoked.  Before it is
checkpointed, \cmd{mpirun} will also create an application schema file
to assist in restoring the MPI job.  These files will all be created
in the directory specified by the \issiparam{cr\_\-base\_\-dir} MCA
parameter, Open MPI's configured default, or the user's home directory
if no default is specified.

\cmd{lamboot} can be used to restart an MPI application.  In case the
\crssi{self} module is selected, the second argument to \cmd{lamboot}
is the set of arguments to be passed to the new \cmd{mpirun}.

%%%%%

\subsubsection{Tunable Parameters}

There are no tunable parameters to the \crssi{self} \kind{cr} module.

%%%%%

\subsubsection{Known Issues}

\begin{itemize}
\item Since a checkpoint request is initiated by invoking
  \cmd{lamcheckpoint} with the PID of \cmd{mpirun}, it is not possible
  to checkpoint MPI jobs that were started using the \cmdarg{-nw}
  option to \cmd{mpirun}, or directly from the command-line without
  using \cmd{mpirun}.
\end{itemize}
\changeend{7.1}

\index{checkpoint/restart MCA modules|)}