% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
%                         University Research and Technology
%                         Corporation.  All rights reserved.
% Copyright (c) 2004-2005 The University of Tennessee and The University
%                         of Tennessee Research Foundation.  All rights
%                         reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
%                         University of Stuttgart.  All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
%                         All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%
\chapter{Open MPI Components}
\label{sec:mca-ompi}

{\Huge JMS Needs massive overhaul}

There are multiple types of MPI components:

\begin{enumerate}
\item \mcafw{rpi}: MPI point-to-point communication, also known as the
  Open MPI Request Progression Interface (RPI).

\item \mcafw{coll}: MPI collective communication.

\item \mcafw{cr}: Checkpoint/restart support for MPI programs.
\end{enumerate}

Each of these types, and the modules that are available in the default
Open MPI distribution, is discussed in detail below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{General MPI MCA Parameters}
\label{sec:mca-ompi:mpi-params}

\changebegin{7.1}

The default hostmap file is \ifile{\$sysconf/lam-hostmap} (typically
\file{\$prefix/etc/lam-hostmap.txt}).  This file is only useful in
environments with multiple TCP networks, and is typically populated by
the system administrator (see the Open MPI Installation Guide for more
details on this file).

The MCA parameter \issiparam{mpi\_hostmap} can be used to specify an
alternate hostmap file.  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi mpi_hostmap my_hostmap.txt my_mpi_application
\end{lstlisting}
% stupid emacs mode: $

This tells Open MPI to use the hostmap \file{my\_hostmap.txt} instead of
\file{\$sysconf/lam-hostmap.txt}.  The special filename
``\cmdarg{none}'' can also be used to indicate that no address
remapping should be performed.

\changeend{7.1}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{MPI Module Selection Process}
\label{sec:mca-ompi-component-selection}

The modules used in an MPI process may be related to, or dependent
upon, external factors.  For example, the \rpi{gm} RPI cannot be used
for MPI point-to-point communication unless there is Myrinet hardware
present on the node.  The \crssi{blcr} checkpoint/restart module
cannot be used unless thread support was included.  And so on.  As
such, it is important for users to understand the module selection
algorithm.

\begin{enumerate}
\item Set the thread level to be what was requested, either via
  \mpifunc{MPI\_\-INIT\_\-THREAD} or the environment variable
  \ienvvar{Open MPI\_\-MPI\_\-THREAD\_\-LEVEL}.

\item Query relevant modules and make lists of the resulting available
  modules.  ``Relevant'' means either a specific module (or set of
  modules) if the user specified them through MCA parameters (see the
  example after this list), or all modules if not specified.

\item Eliminate all modules that do not support the current MPI thread
  level.

\item If no \kind{rpi} modules remain, try a lower thread support
  level until all levels have been tried.  If no thread support level
  can provide an \kind{rpi} module, abort.

\item Select the highest priority \kind{rpi} module.  Reset the thread
  level (if necessary) to be at least the lower bound of thread levels
  that the selected \kind{rpi} module supports.

\item Eliminate all \kind{coll} and \kind{cr} modules that cannot
  operate at the current thread level.

\item If no \kind{coll} modules remain, abort.  Final selection of
  \kind{coll} modules is discussed in
  Section~\ref{sec:mca-ompi-coll-select}
  (page~\pageref{sec:mca-ompi-coll-select}).

\item If no \kind{cr} modules remain and checkpoint/restart support
  was specifically requested, abort.  Otherwise, select the highest
  priority \kind{cr} module.
\end{enumerate}
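
For example, the query in the second step can be restricted to a
single \kind{rpi} module by naming it with the \issiparam{rpi} MCA
parameter.  A minimal sketch (the application name is a placeholder):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi tcp my_mpi_application
\end{lstlisting}
% stupid emacs mode: $

\noindent Only the \rpi{tcp} module is then queried and considered
during selection; if it cannot be used at any thread level, the
selection algorithm above will abort rather than fall back to a
different RPI.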

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{MPI Point-to-point Communication (Request Progression
  Interface / RPI)}
\label{sec:mca-ompi-rpi}

Open MPI provides multiple MCA modules for MPI point-to-point
communication.  Also known as the Request Progression Interface (RPI),
these modules are used for all aspects of MPI point-to-point
communication in an MPI application.  Some of the modules require
external hardware and/or software (e.g., the native Myrinet RPI module
requires both Myrinet hardware and the GM message passing library).
The \icmd{laminfo} command can be used to determine which RPI modules
are available in an Open MPI installation.

Although one RPI module will likely be the default, the selection of
which RPI module is used can be changed through the MCA parameter
\issiparam{rpi}.  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun -ssi rpi tcp C my_mpi_program
\end{lstlisting}
% Stupid emacs mode: $

\noindent runs the \file{my\_\-mpi\_\-program} executable on all
available CPUs using the \rpi{tcp} RPI module, while:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun -ssi rpi gm C my_mpi_program
\end{lstlisting}
% Stupid emacs mode: $

\noindent runs the \file{my\_\-mpi\_\-program} executable on all
available CPUs using the \rpi{gm} RPI module.

It should be noted that the choice of RPI usually does not affect the
\kind{boot} MCA module -- hence, the \cmd{lamboot} command's
requirements on hostnames specified in the boot schema are not
dependent upon the RPI.  For example, if the \rpi{gm} RPI is selected,
\cmd{lamboot} may still require TCP/IP hostnames in the boot schema,
not Myrinet hostnames.  Also note that selecting a particular module
does not guarantee that it will be able to be used.  For example,
selecting the \rpi{gm} RPI module will still cause a run-time failure
if there is no Myrinet hardware present.

The available modules are described in the sections below.  Note that
much of this information (particularly the tunable MCA parameters) is
also available in the \file{lamssi\_\-rpi(7)} manual page.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\changebegin{7.0.3}

\subsection{Two Different Shared Memory RPI Modules}
\label{sec:mca-ompi-shmem}

The \rpi{sysv} (Section~\ref{sec:mca-ompi-sysv},
page~\pageref{sec:mca-ompi-sysv}) and the \rpi{usysv}
(Section~\ref{sec:mca-ompi-usysv}, page~\pageref{sec:mca-ompi-usysv})
modules differ only in the mechanism used to synchronize the transfer
of messages via shared memory.
%
The \rpi{sysv} module uses System V semaphores while the \rpi{usysv}
module uses spin locks with back-off.
%
Both modules use a small number of System V semaphores for
synchronizing both the deallocation of shared structures and access to
the shared pool.

The blocking nature of the \rpi{sysv} module should generally provide
better performance than \rpi{usysv} on oversubscribed nodes (i.e.,
when the number of processes is greater than the number of available
processors).  System V semaphores will effectively force processes to
yield to other processes, allowing at least some degree of
fair/regular scheduling.  In non-oversubscribed environments (i.e.,
where the number of processes is less than or equal to the number of
available processors), the \rpi{usysv} RPI should generally provide
better performance than the \rpi{sysv} RPI because spin locks keep
processors busy-waiting.  This hopefully keeps the operating system
from suspending or swapping out the processes, allowing them to react
immediately when the lock becomes available.

\changeend{7.0.3}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{crtcp} Module (Checkpoint-able TCP
  Communication)}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt crtcp} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 25 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{crtcp} RPI module is almost identical to the \rpi{tcp}
module, described in Section~\ref{sec:mca-ompi-tcp}.  TCP sockets are
used for communication between MPI processes.

%%%%%

\subsubsection{Overview}

The following are the main differences between the \rpi{tcp} and
\rpi{crtcp} RPI modules:

\begin{itemize}
\item The \rpi{crtcp} module can be checkpointed and restarted.  It is
  currently the {\em only} RPI module in Open MPI that supports
  checkpoint/restart functionality.

\item The \rpi{crtcp} module does not have the ``fast'' message
  passing optimization that is in the \rpi{tcp} module.  As a result,
  there is a small performance loss in certain types of MPI
  applications.
\end{itemize}

All other aspects of the \rpi{crtcp} module are the same as the
\rpi{tcp} module.

%%%%%

\subsubsection{Checkpoint/Restart Functionality}

The \rpi{crtcp} module is designed to work in conjunction with a
\kind{cr} module to provide checkpoint/restart functionality.  See
Section~\ref{sec:mca-ompi-cr} for a description of how Open MPI's overall
checkpoint/restart functionality is used.

The \rpi{crtcp} module's checkpoint/restart functionality is invoked
when the \kind{cr} module indicates that it is time to perform a
checkpoint.  The \rpi{crtcp} module then quiesces all ``in-flight''
MPI messages and then allows the checkpoint to be performed.  Upon
restart, TCP connections are re-formed, and message passing processing
continues.  No additional buffers or ``rollback'' mechanisms are
required, nor is any special coding required in the user's MPI
application.

%%%%%

\subsubsection{Tunable Parameters}

\changebegin{7.1}

The \rpi{crtcp} module has the same tunable parameters as the
\rpi{tcp} module (maximum size of a short message and amount of OS
socket buffering), although they have different names:
\issiparam{rpi\_\-crtcp\_\-short},
\issiparam{rpi\_\-crtcp\_\-sockbuf}.

\changeend{7.1}
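
As a hedged illustration (the value shown is arbitrary, not a
recommendation), both parameters can be set on the \cmd{mpirun}
command line in the same way as their \rpi{tcp} counterparts:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi rpi_crtcp_short 131072 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent This selects the \rpi{crtcp} module and raises the maximum
``short'' message size to 128~KB.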

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-crtcp\_\-priority}{25}{Default priority
      level.}
    %
    \ssiparamentry{rpi\_\-crtcp\_\-short}{65535}{Maximum length
      (in bytes) of a ``short'' message.}
    %
    \ssiparamentry{rpi\_\-crtcp\_\-sockbuf}{-1}{Socket buffering in
      the OS kernel (-1 means use the short message size).}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{crtcp} RPI module.}
  \label{tbl:mca-ompi-crtcp-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{gm} Module (Myrinet)}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt gm} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & yes (*) \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{gm} RPI module is for native message passing over Myrinet
networking hardware.  The \rpi{gm} RPI provides low latency, high
bandwidth message passing performance.

\changebegin{7.1}

Be sure to also read the release notes entitled ``Operating System
Bypass Communication: Myrinet and Infiniband'' in the Open MPI
Installation Guide for notes about memory management with Myrinet.
Specifically, it deals with Open MPI's automatic overrides of the
\func{malloc()}, \func{calloc()}, and \func{free()} functions.

\changeend{7.1}

%%%%%

\subsubsection{Overview}

In general, using the \rpi{gm} RPI module is just like using any other
RPI module -- MPI functions will simply use native GM message passing
for their back-end message transport.

Although it is not required, users are strongly encouraged to use the
\mpifunc{MPI\_\-ALLOC\_\-MEM} and \mpifunc{MPI\_\-FREE\_\-MEM}
functions to allocate and free memory (instead of, for example,
\func{malloc()} and \func{free()}).

\changebegin{7.1}

The \rpi{gm} RPI module is marked as ``yes'' for checkpoint / restart
support, but this is only true when the module was configured and
compiled with the \confflag{with-rpi-gm-get} configure flag.  This
enables Open MPI to use the GM 2.x function \func{gm\_\-get()}.  Note
that enabling this feature (with the \ssiparam{rpi\_\-gm\_\-cr} MCA
parameter) slightly decreases the performance of the \rpi{gm} module
(which is why it is disabled by default) because of additional
bookkeeping that is necessary.  The performance difference is actually
barely measurable -- it is well below one microsecond.  It is not the
default behavior simply on principle.

At the time of this writing, there still appeared to be problems with
\func{gm\_\-get()}, so this behavior is disabled by default.  It is
not clear whether the problems with \func{gm\_\-get()} are due to a
problem with Myricom's GM library or a problem in Open MPI itself; the
\confflag{with-rpi-gm-get} option is provided as a ``hedging our
bets'' solution; if the problem does turn out to be with the GM
library, Open MPI users can enable checkpoint support (and slightly
lower long message latency) by using this switch.

\changeend{7.1}

%%%%%
\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-ompi-gm-mca-params} shows the MCA parameters that
may be changed at run-time; the text below explains each one in
detail.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-gm\_\-cr}{0}{Whether to enable checkpoint /
      restart support or not.}
    %
    \ssiparamentry{rpi\_\-gm\_\-fast}{0}{Whether to enable the
      ``fast'' algorithm for sending short messages.  This is an
      unreliable transport and is not recommended for MPI applications
      that do not continually invoke the MPI progression engine.}
    %
    \ssiparamentry{rpi\_\-gm\_\-maxport}{32}{Maximum GM port
      number to check during \mpifunc{MPI\_\-INIT} when looking for an
      available port.}
    %
    \ssiparamentry{rpi\_\-gm\_\-nopin}{0}{Whether to let Open MPI
      register (``pin'') arbitrary buffers or not.}
    %
    \ssiparamentry{rpi\_\-gm\_\-port}{-1}{Specific GM port to use
      (-1 indicates none).}
    %
    \ssiparamentry{rpi\_\-gm\_\-priority}{50}{Default priority level.}
    %
    \ssiparamentry{rpi\_\-gm\_\-tinymsglen}{1024}{Maximum length
      (in bytes) of a ``tiny'' message.}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{gm} RPI module.}
  \label{tbl:mca-ompi-gm-mca-params}
\end{table}

%%%%%
\subsubsection{Port Allocation}

It is usually unnecessary to specify which Myrinet/GM port to use.
Open MPI will automatically attempt to acquire ports greater than 1.

By default, Open MPI will check for any available port between 1 and 8.  If
your Myrinet hardware has more than 8 possible ports, you can change
the upper port number that Open MPI will check with the
\issiparam{rpi\_\-gm\_\-maxport} MCA parameter.

However, if you wish Open MPI to use a specific GM port number (and not
check all the ports from $[1, {\rm maxport}]$), you can tell Open MPI which
port to use with the \issiparam{rpi\_\-gm\_\-port} MCA parameter.
%
Specifying which port to use has precedence over the port range check
-- if a specific port is indicated, Open MPI will try to use that and not
check a range of ports.  Specifying to use port ``-1'' (or not
specifying to use a specific port) will tell Open MPI to check the range of
ports to find any available port.
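
As a hedged illustration (the port numbers are arbitrary and only
valid if the corresponding ports exist and are free on every node),
both parameters are set like any other MCA parameter on the
\cmd{mpirun} command line:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi gm -ssi rpi_gm_port 2 my_mpi_program
shell$ mpirun C -ssi rpi gm -ssi rpi_gm_maxport 16 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent The first command forces GM port 2 on every node; the second
leaves port selection automatic but widens the search range.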

Note that in all cases, if Open MPI cannot acquire a valid port for every
MPI process in the job, the entire job will be aborted.

Be wary of forcing a specific port to be used, particularly in
conjunction with the MPI dynamic process calls (e.g.,
\mpifunc{MPI\_\-COMM\_\-SPAWN}).  For example, when attempting to
spawn a child process on a node that already has an MPI process in the
same job, Open MPI will try to use the same specific port, which will
result in failure because the MPI process already on that node will
have already claimed that port.

%%%%%
\subsubsection{Adjusting Message Lengths}

The \rpi{gm} RPI uses two different protocols for passing data between
MPI processes: tiny and long.  Selection of which protocol to use is
based solely on the length of the message.  Tiny messages are sent
(along with tag and communicator information) in one transfer to the
receiver.  Long messages use a rendezvous protocol -- the envelope is
sent to the destination, the receiver responds with an ACK (when it is
ready), and then the sender sends another envelope followed by the
data of the message.

The message lengths at which the different protocols are used can be
changed with the MCA parameter \issiparam{rpi\_\-gm\_\-tinymsglen},
which represents the maximum length of tiny messages.  Open MPI defaults to
1,024 bytes for the maximum length of tiny messages.
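
For example, the following sketch (the value is arbitrary, not a
recommendation) raises the tiny/long crossover to 4,096 bytes so that
slightly larger messages are still sent in a single transfer:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi gm -ssi rpi_gm_tinymsglen 4096 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $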

It may be desirable to adjust these values for different kinds of
applications and message passing patterns.  The Open MPI Team would
appreciate feedback on the performance of different values for real
world applications.

%%%%%

\subsubsection{Pinning Memory}
\label{sec:mca-ompi-gm-ptmalloc}

The Myrinet native communication library (gm) can only communicate
through ``registered'' (sometimes called ``pinned'') memory.  In most
operating systems, Open MPI handles this automatically by pinning
user-provided buffers when required.  This allows for good message
passing performance, especially when re-using buffers to send/receive
multiple messages.

However, the gm library does not have the ability to pin arbitrary
memory on Solaris systems -- auxiliary buffers must be used.  Although
Open MPI controls all pinned memory, this has a detrimental effect on
the performance of large messages: Open MPI must copy each message from
the application-provided buffer to an auxiliary buffer before it can
be sent (and vice versa for receiving messages).  As such, users are
strongly encouraged to use the \mpifunc{MPI\_\-ALLOC\_\-MEM} and
\mpifunc{MPI\_\-FREE\_\-MEM} functions instead of \func{malloc()} and
\func{free()}.  Using these functions will allocate ``pinned'' memory
such that Open MPI will not have to use auxiliary buffers and an extra
memory copy.

The \ssiparam{rpi\_\-gm\_\-nopin} MCA parameter can be used to force
Solaris-like behavior.  On Solaris platforms, the default value is
``1'', specifying to use auxiliary buffers as described above.  On
non-Solaris platforms, the default value is ``0'', meaning that
Open MPI will attempt to pin and send/receive directly from user
buffers.
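
A minimal sketch of forcing the auxiliary-buffer behavior on a
non-Solaris platform (for instance, to rule out pinning as the cause
of a problem) might look like the following:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi gm -ssi rpi_gm_nopin 1 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent Expect lower large-message bandwidth in this mode, for the
reasons described above.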

Note that since Open MPI manages all pinned memory, Open MPI must be
aware of memory that is freed so that it can be properly unpinned
before it is returned to the operating system.  Hence, Open MPI must
intercept calls to functions such as \func{sbrk()} and \func{munmap()}
to effect this behavior.  Since gm cannot pin arbitrary memory on
Solaris, Open MPI does not need to intercept these calls on Solaris
machines.

To this end, support for additional memory allocation packages is
included in Open MPI and will automatically be used on platforms that
support arbitrary pinning.  These memory allocation managers allow
Open MPI to intercept the relevant functions and ensure that memory is
unpinned before returning it to the operating system.  Use of these
managers will effectively overload all memory allocation functions
(e.g., \func{malloc()}, \func{calloc()}, \func{free()}, etc.) for all
applications that are linked against the Open MPI libraries
(potentially regardless of whether they are using the gm RPI module or
not).

See Section \ref{release-notes:os-bypass} (page
\pageref{release-notes:os-bypass}) for more information on Open MPI's
memory allocation managers.

%%%%%

\subsubsection{Memory Checking Debuggers}

When running Open MPI's \rpi{gm} RPI through a memory checking debugger
(see Section~\ref{sec:debug-mem}), a number of ``Read from
unallocated'' (RUA) and/or ``Read from uninitialized'' (RFU) errors
may appear, originating from functions beginning with ``\func{gm\_*}''
or ``\func{lam\_\-ssi\_\-rpi\_\-gm\_*}''.  These RUA/RFU errors are
normal -- they are not actually reads from unallocated sections of
memory.  The Myrinet hardware and gm kernel device driver handle some
aspects of memory allocation, and therefore the operating
system/debugging environment is not always aware of all valid memory.
As a result, a memory checking debugger will often raise warnings,
even though this is valid behavior.

%%%%%

\subsubsection{Known Issues}

As of Open MPI \lamversion, the following issues still remain in the
\rpi{gm} RPI module:

\begin{itemize}
\item Heterogeneity between big and little endian machines is not
  supported.

\item The \rpi{gm} RPI is not supported with IMPI.

\item Mixed shared memory / GM message passing is not yet supported;
  all message passing is through Myrinet / GM.

\item XMPI tracing is not yet supported.

\changebegin{7.0.3}

\item The \rpi{gm} RPI module is designed to run in environments where
  the number of available processors is greater than or equal to the
  number of MPI processes on a given node.  The \rpi{gm} RPI module
  will perform poorly (particularly in blocking MPI communication
  calls) if there are fewer processors than processes on a node.

\changeend{7.0.3}

\changebegin{7.1}

\item ``Fast'' support is available and slightly decreases the latency
  for short gm messages.  However, it is unreliable and is subject to
  timeouts for MPI applications that do not invoke the MPI progression
  engine often, and is therefore not the default behavior.

\item Support for the \func{gm\_\-get()} function in the GM 2.x series
  is available starting with Open MPI 7.1, but is disabled by default.
  See the Installation Guide for more details.

\item Checkpoint/restart support is included for the \rpi{gm} module,
  but is only possible when the \rpi{gm} module was compiled with
  support for \func{gm\_\-get()}.

\changeend{7.1}

\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{ib} Module (Infiniband)}
\label{sec:mca-ompi-ib}

\changebegin{7.1}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt ib} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{ib} RPI module is for native message passing over Infiniband
networking hardware.  The \rpi{ib} RPI provides low latency, high
bandwidth message passing performance.

Be sure to also read the release notes entitled ``Operating System
Bypass Communication: Myrinet and Infiniband'' in the Open MPI
Installation Guide for notes about memory management with Infiniband.
Specifically, it deals with Open MPI's automatic overrides of the
\func{malloc()}, \func{calloc()}, and \func{free()} functions.

%%%%%

\subsubsection{Overview}

In general, using the \rpi{ib} RPI module is just like using any other
RPI module -- MPI functions will simply use native Infiniband message
passing for their back-end message transport.

Although it is not required, users are strongly encouraged to use the
\mpifunc{MPI\_\-ALLOC\_\-MEM} and \mpifunc{MPI\_\-FREE\_\-MEM}
functions to allocate and free memory used for communication (instead
of, for example, \func{malloc()} and \func{free()}).  This avoids the
need to pin the memory during communication time and hence saves on
message passing latency.

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-ompi-ib-mca-params} shows the MCA parameters that
may be changed at run-time; the text below explains each one in
detail.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-ib\_\-hca\_\-id}{X}{The string ID of the
      Infiniband hardware HCA to be used.}
    %
    \ssiparamentry{rpi\_\-ib\_\-num\_\-envelopes}{64}{Number of
      envelopes to be preposted per peer process.}
    %
    \ssiparamentry{rpi\_\-ib\_\-port}{-1}{Specific IB port to use
      (-1 indicates none).}
    %
    \ssiparamentry{rpi\_\-ib\_\-priority}{50}{Default priority level.}
    %
    \ssiparamentry{rpi\_\-ib\_\-tinymsglen}{1024}{Maximum length
      (in bytes) of a ``tiny'' message.}
    %
    \ssiparamentry{rpi\_\-ib\_\-mtu}{1024}{Maximum Transmission
      Unit (MTU) value to be used for IB.}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{ib} RPI module.}
  \label{tbl:mca-ompi-ib-mca-params}
\end{table}

%%%%%

\subsubsection{Port Allocation}

It is usually unnecessary to specify which Infiniband port to use.
Open MPI will automatically attempt to acquire ports greater than 1.

However, if you wish Open MPI to use a specific Infiniband port number, you
can tell Open MPI which port to use with the \issiparam{rpi\_\-ib\_\-port}
MCA parameter.
%
Specifying which port to use has precedence over the port range check
-- if a specific port is indicated, Open MPI will try to use that and not
check a range of ports.  Specifying to use port ``-1'' (or not
specifying to use a specific port) will tell Open MPI to check the range of
ports to find any available port.

Note that in all cases, if Open MPI cannot acquire a valid port for every
MPI process in the job, the entire job will be aborted.

Be wary of forcing a specific port to be used, particularly in
conjunction with the MPI dynamic process calls (e.g.,
\mpifunc{MPI\_\-COMM\_\-SPAWN}).  For example, when attempting to
spawn a child process on a node that already has an MPI process in the
same job, Open MPI will try to use the same specific port, which will
result in failure because the MPI process already on that node will
have already claimed that port.

%%%%%

\subsubsection{Choosing an HCA ID}

The HCA ID is the Mellanox Host Channel Adapter ID.  For example:
InfiniHost0.  It is usually unnecessary to specify which HCA ID to
use.  Open MPI will search for all HCAs available and select the first
one which is available.  If you want to use a fixed HCA ID, then you
can specify that using the \ssiparam{rpi\_\-ib\_\-hca\_\-id} MCA
parameter.
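
A minimal sketch (``InfiniHost0'' is simply the example ID mentioned
above; the actual ID depends on your hardware):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi ib -ssi rpi_ib_hca_id InfiniHost0 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $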

%%%%%

\subsubsection{Adjusting Message Lengths}

The \rpi{ib} RPI uses two different protocols for passing data between
MPI processes: tiny and long.  Selection of which protocol to use is
based solely on the length of the message.  Tiny messages are sent
(along with tag and communicator information) in one transfer to the
receiver.  Long messages use a rendezvous protocol -- the envelope is
sent to the destination, the receiver responds with an ACK (when it is
ready), and then the sender sends another envelope followed by the
data of the message.

The message lengths at which the different protocols are used can be
changed with the MCA parameter \issiparam{rpi\_\-ib\_\-tinymsglen},
which represents the maximum length of tiny messages.  Open MPI defaults to
1,024 bytes for the maximum length of tiny messages.

It may be desirable to adjust these values for different kinds of
applications and message passing patterns.  The Open MPI Team would
appreciate feedback on the performance of different values for real
world applications.

%%%%%
\subsubsection{Posting Envelopes to Receive / Scalability}
\label{sec:mca-ompi-ib-envelopes}

Receive buffers must be posted to the IB communication
hardware/library before any receives can occur.  Open MPI uses
envelopes that contain MPI signature information, and in the case of
tiny messages, they also hold the actual message contents.
%
The size of each envelope is therefore the sum of the size of the
headers and the maximum size of a tiny message (controlled by the
\issiparam{rpi\_\-ib\_\-tinymsglen} MCA parameter).  Open MPI pre-posts
64 envelope buffers per peer process by default; this can be
overridden at run-time with the
\issiparam{rpi\_\-ib\_\-num\_\-envelopes} MCA parameter.

\changebegin{7.1.2}

These two MCA parameters can have a large effect on scalability.
Since Open MPI pre-posts a total of $((num\_processes - 1) \times
num\_envelopes \times tinymsglen)$ bytes, this can be prohibitive if
$num\_processes$ grows large.  However, $num\_envelopes$ and
$tinymsglen$ can be adjusted to help keep this number low, although
they may have an effect on run-time performance.  Changing the number
of pre-posted envelopes effectively controls how many messages can be
simultaneously flowing across the network; changing the tiny message
size affects when Open MPI switches to use a rendezvous sending protocol
instead of an eager send protocol.  Relevant values for these
parameters are likely to be application-specific; keep this in mind
when running large parallel jobs.

\changeend{7.1.2}
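
To put the formula above in concrete terms: with the defaults of 64
envelopes and 1,024-byte tiny messages, each process in a 64-process
job pre-posts roughly $63 \times 64 \times 1024 \approx 4$~MB of
envelope buffers.  A hedged sketch of reining this in for a larger run
(the values are arbitrary starting points, not recommendations):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi ib -ssi rpi_ib_num_envelopes 16 \
    -ssi rpi_ib_tinymsglen 512 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent This reduces the pre-posted memory per peer at the cost of
fewer in-flight messages and an earlier switch to the rendezvous
protocol.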

%%%%%

\subsubsection{Modifying the MTU value}
\label{sec:mca-ompi-ib-mtu}

\changebegin{7.1.2}

The Maximum Transmission Unit (MTU) value to be used for Infiniband
can be configured at runtime using the \issiparam{rpi\_\-ib\_\-mtu}
MCA parameter.  It can take values of 256, 512, 1024, 2048 and 4096,
corresponding to the MTU256, MTU512, MTU1024, MTU2048 and MTU4096
Infiniband MTU values, respectively.  The default value is 1024
(corresponding to MTU1024).

\changeend{7.1.2}
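
For example (a sketch only; the chosen MTU must be supported by your
Infiniband hardware and fabric):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi ib -ssi rpi_ib_mtu 2048 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $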

%%%%%

\subsubsection{Pinning Memory}
\label{sec:mca-ompi-ib-ptmalloc}

The Infiniband communication library can only communicate through
``registered'' (sometimes called ``pinned'') memory.  Open MPI handles
this automatically by pinning user-provided buffers when required.
This allows for good message passing performance, especially when
re-using buffers to send/receive multiple messages.

Note that since Open MPI manages all pinned memory, Open MPI must be
aware of memory that is freed so that it can be properly unpinned
before it is returned to the operating system.  Hence, Open MPI must
intercept calls to functions such as \func{sbrk()} and \func{munmap()}
to effect this behavior.

To this end, support for additional memory allocation packages is
included in Open MPI and will automatically be used on platforms that
support arbitrary pinning.  These memory allocation managers allow
Open MPI to intercept the relevant functions and ensure that memory is
unpinned before returning it to the operating system.  Use of these
managers will effectively overload all memory allocation functions
(e.g., \func{malloc()}, \func{calloc()}, \func{free()}, etc.) for all
applications that are linked against the Open MPI libraries
(potentially regardless of whether they are using the ib RPI module or
not).

See Section \ref{release-notes:os-bypass} (page
\pageref{release-notes:os-bypass}) for more information on Open MPI's
memory allocation managers.

%%%%%

\subsubsection{Memory Checking Debuggers}

When running Open MPI's \rpi{ib} RPI through a memory checking debugger
(see Section~\ref{sec:debug-mem}), a number of ``Read from
unallocated'' (RUA) and/or ``Read from uninitialized'' (RFU) errors
may appear pertaining to VAPI.  These RUA/RFU errors are
normal -- they are not actually reads from unallocated sections of
memory.  The Infiniband hardware and kernel device driver handle some
aspects of memory allocation, and therefore the operating
system/debugging environment is not always aware of all valid memory.
As a result, a memory checking debugger will often raise warnings,
even though this is valid behavior.

%%%%%

\subsubsection{Known Issues}

As of Open MPI \lamversion, the following issues remain in the
\rpi{ib} RPI module:

\begin{itemize}

\changebegin{7.1.2}

\item The \rpi{ib} \kind{rpi} will not scale well to large numbers of
  processes.  See the section entitled ``Posting Envelopes to Receive
  / Scalability,'' above.

\changeend{7.1.2}

\item On machines which have IB (VAPI) shared libraries but not the IB
  hardware, and when Open MPI is compiled with IB support, you may see
  some error messages like ``can't open device file'' when trying to
  use Open MPI, even when you are not using the IB module.  This error
  message pertains to the IB (VAPI) shared libraries and does not come
  from within Open MPI.  It occurs because when Open MPI queries the
  shared libraries, VAPI tries to open the IB device during the shared
  library initialization phase, which it should not do.

\item Heterogeneity between big and little endian machines is not
  supported.

\item The \rpi{ib} RPI is not supported with IMPI.

\item Mixed shared memory / IB message passing is not yet supported;
  all message passing is through Infiniband.

\item XMPI tracing is not yet supported.

\item The \rpi{ib} RPI module is designed to run in environments where
  the number of available processors is greater than or equal to the
  number of MPI processes on a given node.  The \rpi{ib} RPI module
  will perform poorly (particularly in blocking MPI communication
  calls) if there are fewer processors than processes on a node.

\changeend{7.1}

\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{lamd} Module (Daemon-Based Communication)}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt lamd} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 10 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{lamd} RPI module uses the Open MPI daemons for all interprocess
communication.  This allows for true asynchronous message passing
(i.e., messages can progress even while the user's program is
executing), albeit at the cost of a significantly higher latency and
lower bandwidth.

%%%%%

\subsubsection{Overview}

Rather than send messages directly from one MPI process to another,
all messages are routed through the local Open MPI daemon, the remote Open MPI
daemon (if the target process is on a different node), and then
finally to the target MPI process.  This potentially adds two hops to
each MPI message.

Although the latency incurred can be significant, the \rpi{lamd} RPI
can actually make message passing progress ``in the background.''
Specifically, since Open MPI is a single-threaded MPI implementation,
it can typically only make progress passing messages when the user's
program is in an MPI function call.  With the \rpi{lamd} RPI, since
the messages are all routed through separate processes, message
passing can actually occur when the user's program is {\em not} in an
MPI function call.

User programs that utilize latency-hiding techniques can exploit this
asynchronous message passing behavior, and therefore actually achieve
high performance despite the high overhead associated with the
\rpi{lamd} RPI.\footnote{Several users on the Open MPI mailing list
  have mentioned this specifically; even though the \rpi{lamd} RPI is
  slow, it provides {\em significantly} better performance because it
  can provide true asynchronous message passing.}

%%%%%

\subsubsection{Tunable Parameters}

The \rpi{lamd} module has only one tunable parameter: its priority.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-lamd\_\-priority}{10}{Default priority
      level.}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{lamd} RPI module.}
  \label{tbl:mca-ompi-lamd-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{sysv} Module (Shared Memory Using System V
  Semaphores)}
\label{sec:mca-ompi-sysv}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt sysv} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 30 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{sysv} RPI is one of two combination shared-memory/TCP
message passing modules.  Shared memory is used for passing messages
to processes on the same node; TCP sockets are used for passing
messages to processes on other nodes.  System V semaphores are used
for synchronization of the shared memory pool.

\changebegin{7.0.3}

Be sure to read Section~\ref{sec:mca-ompi-shmem}
(page~\pageref{sec:mca-ompi-shmem}) on the difference between this
module and the \rpi{usysv} module.

\changeend{7.0.3}

%%%%%

\subsubsection{Overview}

Processes located on the same node communicate via shared memory.  One
System V shared segment is shared by all processes on the same node.
This segment is logically divided into three areas.  The total size of
the shared segment (in bytes) allocated on each node is:

\[
(2 \times C) + (N \times (N-1) \times (S + C)) + P
\]

where $C$ is the cache line size, $N$ is the number of processes on
the node, $S$ is the maximum size of short messages, and $P$ is the
size of the pool for large messages.
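
As a worked illustration (assuming a 64-byte cache line and the
default 8,192-byte short message size), four processes on a node would
allocate roughly

\[
(2 \times 64) + (4 \times 3 \times (8192 + 64)) + P = 99{,}200 + P
\mbox{~bytes.}
\]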

The first area (of size $(2 \times C)$) is for the global pool lock.
The \rpi{sysv} module allocates a semaphore set (of size six) for each
process pair communicating via shared memory.  On some systems, the
operating system may need to be reconfigured to allow for more
semaphore sets if running tasks with many processes communicating via
shared memory.

The second area is for ``postboxes,'' or short message passing.  A
postbox is used for one-way communication between two processes.  Each
postbox is the size of a short message plus the length of a cache
line.  There is enough space allocated for $(N \times (N-1))$
postboxes.  The maximum size of a short message is configurable with
the \issiparam{rpi\_\-sysv\_\-short} MCA parameter.

% JMS: Someday, cache line should be an MCA parameter.

% \const{CACHELINESIZE} must be the size of a cache line or a multiple
% thereof.  The default setting is 64 bytes.  You shouldn't need to
% change it.  \const{CACHELINESIZE} bytes in the postbox are used for a
% cache-line sized synchronization location.

The final area in the shared memory segment (of size $P$) is used as a
global pool from which space for long message transfers is allocated.
Allocation from this pool is locked.  The default lock mechanism is a
System V semaphore but can be changed to a process-shared pthread
mutex lock.  The size of this pool is configurable with the
\issiparam{rpi\_\-sysv\_\-shmpoolsize} MCA parameter.  Open MPI will
try to determine $P$ at configuration time if none is explicitly
specified.  Larger values should improve performance (especially when
an application passes large messages) but will also increase the
system resources used by each task.
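
A hedged example of enlarging the pool for an application that sends
many large on-node messages (the 16~MB figure is arbitrary):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi sysv -ssi rpi_sysv_shmpoolsize 16777216 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $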

%%%%%

\subsubsection{Use of the Global Pool}

When a message larger than ($2S$) is sent, the transport sends $S$
bytes with the first packet.  When the acknowledgment is received, it
allocates (${\rm message\ length} - S$) bytes from the global pool to
transfer the rest of the message.

To prevent a single large message transfer from monopolizing the
global pool, allocations from the pool are actually restricted to a
maximum of \issiparam{rpi\_\-sysv\_\-shmmaxalloc} bytes.  Even
with this restriction, it is possible for the global pool to
temporarily become exhausted.  In this case, the transport will fall
back to using the postbox area to transfer the message.  Performance
will be degraded, but the application will progress.

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-ompi-sysv-mca-params} shows the MCA parameters that
may be changed at run-time.  Each of these parameters was discussed
in the previous sections.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-sysv\_\-priority}{30}{Default priority level.}
    %
    \ssiparamentry{rpi\_\-sysv\_\-pollyield}{1}{Whether or not to force
      the use of \func{yield()} to yield the processor.}
    %
    \ssiparamentry{rpi\_\-sysv\_\-shmmaxalloc}{From configure}{Maximum
      size of a large message atomic transfer.  The default value is
      calculated when Open MPI is configured.}
    %
    \ssiparamentry{rpi\_\-sysv\_\-shmpoolsize}{From configure}{Size of
      the shared memory pool for large messages.  The default value is
      calculated when Open MPI is configured.}
    %
    \ssiparamentry{rpi\_\-sysv\_\-short}{8192}{Maximum length (in bytes)
      of a ``short'' message for sending via shared memory (i.e.,
      on-node).  Directly affects the size of the allocated ``postbox''
      shared memory area.}
    %
    \ssiparamentry{rpi\_\-tcp\_\-short}{65535}{Maximum length
      (in bytes) of a ``short'' message for sending via TCP sockets
      (i.e., off-node).}
    %
    \ssiparamentry{rpi\_\-tcp\_\-sockbuf}{-1}{Socket buffering in the OS
      kernel (-1 means use the short message size).}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{sysv} RPI module.}
  \label{tbl:mca-ompi-sysv-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The \rpi{tcp} Module (TCP Communication)}
\label{sec:mca-ompi-tcp}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt tcp} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 20 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{tcp} RPI module uses TCP sockets for MPI point-to-point
communication.

%%%%%

\subsubsection{Tunable Parameters}

Two different protocols are used to pass messages between processes:
short and long.  Short messages are sent eagerly and will not block
unless the operating system blocks.  Long messages use a rendezvous
protocol; the body of the message is not sent until a matching MPI
receive is posted.  The crossover point between the short and long
protocols defaults to 64~KB, but can be changed with the
\issiparam{rpi\_\-tcp\_\-short} MCA parameter, an integer specifying
the maximum size (in bytes) of a short message.
%
\changebegin{7.1}
%
Additionally, the amount of socket buffering requested of the kernel
defaults to the size of short messages.  It can be altered with the
\issiparam{rpi\_\-tcp\_\-sockbuf} parameter.  When this value is -1,
the value of the \issiparam{rpi\_\-tcp\_\-short} parameter is used.
Otherwise, its value is passed to the \func{setsockopt(2)} system call
to set the amount of operating system buffering on every socket that
is used for MPI communication.

\changeend{7.1}
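
For example, a hedged sketch of raising both the crossover point and
the requested socket buffering (the values are illustrative only):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi tcp -ssi rpi_tcp_short 131072 \
    -ssi rpi_tcp_sockbuf 262144 my_mpi_program
\end{lstlisting}
% stupid emacs mode: $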

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{rpi\_\-tcp\_\-priority}{20}{Default priority level.}
    %
    \ssiparamentry{rpi\_\-tcp\_\-short}{65535}{Maximum length
      (in bytes) of a ``short'' message.}
    %
    \ssiparamentry{rpi\_\-tcp\_\-sockbuf}{-1}{Socket buffering in the OS
      kernel (-1 means use the short message size).}
  \end{ssiparamtb}
  \caption{MCA parameters for the \rpi{tcp} RPI module.}
  \label{tbl:mca-ompi-tcp-mca-params}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \rpi{usysv} Module (Shared Memory Using Spin Locks)}
\label{sec:mca-ompi-usysv}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt usysv} \\
  {\bf Kind:} & \kind{rpi} \\
  {\bf Default MCA priority:} & 40 \\
  {\bf Checkpoint / restart:} & no \\
  \hline
\end{tabular}
\vspace{11pt}

The \rpi{usysv} RPI is one of two combination shared-memory/TCP
message passing modules.  Shared memory is used for passing messages
to processes on the same node; TCP sockets are used for passing
messages to processes on other nodes.  Spin locks with back-off are
used for synchronization of the shared memory pool (a System V
semaphore or pthread mutex is also used for access to the per-node
shared memory pool).

\changebegin{7.0.3}

The nature of spin locks means that the \rpi{usysv} RPI will perform
poorly when there are more processes than processors (particularly in
blocking MPI communication calls).  If no higher priority RPI modules
are available (e.g., Myrinet/\rpi{gm}) and the user does not select a
specific RPI module through the \ssiparam{rpi} MCA parameter,
\rpi{usysv} may be selected as the default -- even if there are more
processes than processors.  Users should keep this in mind; in such
circumstances, it is probably better to manually select the \rpi{sysv}
or \rpi{tcp} RPI modules.

\changeend{7.0.3}
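
For example, on an oversubscribed node it may be better to select the
\rpi{sysv} module explicitly (a sketch only; whether this helps is
workload-dependent):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi sysv my_mpi_program
\end{lstlisting}
% stupid emacs mode: $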
|
|
|
|
|
|
|
|
%%%%%
|
|
|
|
|
|
|
|
\subsubsection{Overview}
|
|
|
|
|
|
|
|
Aside from synchronization, the \rpi{usysv} RPI module is almost
|
|
|
|
identical to the \rpi{sysv} module.
|
|
|
|
%
|
|
|
|
The \rpi{usysv} module uses spin locks with back-off. When a process
|
|
|
|
backs off, it attempts to yield the processor. If the configure
|
|
|
|
script found a system-provided yield function,\footnote{Such as
|
|
|
|
\func{yield()} or \func{sched\_\-yield()}.} it is used. If no such
|
|
|
|
function is found, then \func{select()} on \const{NULL} file
|
|
|
|
descriptor sets with a timeout of 10us is used.
|
|
|
|
|
|
|
|
%%%%%
|
|
|
|
|
|
|
|
\subsubsection{Tunable Parameters}
|
|
|
|
|
|
|
|
Table~\ref{tbl:mca-ompi-usysv-mca-params} shows the MCA parameters that
|
|
|
|
may be changed at run-time. Many of these parameters are identical to
|
|
|
|
their \rpi{sysv} counterparts and are not re-described here.
|
|
|
|
|
|
|
|
\begin{table}[htbp]
|
|
|
|
\begin{ssiparamtb}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-tcp\_\-short}{65535}{Maximum length
|
|
|
|
(in bytes) of a ``short'' message for sending via TCP sockets
|
|
|
|
(i.e., off-node).}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-tcp\_\-sockbuf}{-1}{Socket buffering in the OS
|
|
|
|
kernel (-1 means use the short message size).}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-pollyield}{1}{Same as \rpi{sysv}
|
|
|
|
counterpart.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-priority}{40}{Default priority level.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-readlockpoll}{10,000}{Number of
|
|
|
|
iterations to spin before yielding the processing while waiting to
|
|
|
|
read.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-shmmaxalloc}{From configure}{Same as
|
|
|
|
\rpi{sysv} counterpart.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-shmpoolsize}{From configure}{Same as
|
|
|
|
\rpi{sysv} counterpart.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-short}{8192}{Same as \rpi{sysv}
|
|
|
|
counterpart.}
|
|
|
|
%
|
|
|
|
\ssiparamentry{rpi\_\-usysv\_\-writelockpoll}{10}{Number of
|
|
|
|
iterations to spin before yielding the processing while waiting to
|
|
|
|
write.}
|
|
|
|
\end{ssiparamtb}
|
|
|
|
\caption{MCA parameters for the \rpi{usysv} RPI module.}
|
|
|
|
\label{tbl:mca-ompi-usysv-mca-params}
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
|
|
|
\section{MPI Collective Communication}
|
|
|
|
\label{sec:mca-ompi-coll}
|
|
|
|
\index{collective MCA modules|(}
|
|
|
|
\index{MPI collective modules|see {collective MCA modules}}
|
|
|
|
\index{MCA collective modules|see {collective MCA modules}}
|
|
|
|
|
|
|
|
MPI collective communication functions have their basic functionality
|
|
|
|
outlined in the MPI standard. However, the implementation of this
|
|
|
|
functionality can be optimized and/or implemented in different ways.
|
|
|
|
As such, Open MPI provides modules for implementing the MPI collective
|
|
|
|
routines that are targeted for different environments.
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item Basic algorithms
|
|
|
|
\item SMP-optimized algorithms
|
|
|
|
\item Shared Memory algorithms
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
These modules are discussed in detail below. Note that the sections
|
|
|
|
below each assume that support for these modules have been compiled
|
|
|
|
into Open MPI. The \icmd{laminfo} command can be used to determine
|
|
|
|
exactly which modules are supported in your installation (see
|
|
|
|
Section~\ref{sec:commands-laminfo},
|
|
|
|
page~\pageref{sec:commands-laminfo}).
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
|
|
|
\subsection{Selecting a \kind{coll} Module}
\label{sec:mca-ompi-coll-select}
\index{collective MCA modules!selection process}

\kind{coll} modules are selected on a per-communicator basis.  Most
users will not need to override the \kind{coll} selection mechanisms;
the \kind{coll} modules currently included in Open MPI usually select
the best module for each communicator.  However, mechanisms are
provided to override which \kind{coll} module will be selected on a
given communicator.

When each communicator is created (including \mcw\ and \mcs), all
available \kind{coll} modules are queried to see if they want to be
selected.  A \kind{coll} module may therefore be in use by zero or
more communicators at any given time.  The final selection of which
module will be used for a given communicator is based on priority; the
module with the highest priority from the set of available modules
will be used for all collective calls on that communicator.

Since the selection of which module to use is inherently dynamic and
potentially different for each communicator, there are two levels of
parameters specifying which modules should be used.  The first level
specifies the overall set of \kind{coll} modules that will be
available to {\em all} communicators; the second level is a
per-communicator parameter indicating which specific module should be
used.

The first level is provided with the \issiparam{coll} MCA parameter.
Its value is a comma-separated list of \kind{coll} module names.  If
this parameter is supplied, only these modules will be queried at run
time, effectively determining the set of modules available for
selection on all communicators.  If this parameter is not supplied,
all \kind{coll} modules will be queried.
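
For example, the following command line restricts all communicators
to selecting from only the \coll{smp} and \coll{lam\_\-basic} modules
(the application name is, of course, only illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi coll smp,lam_basic my_mpi_application
\end{lstlisting}
% stupid emacs mode: $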

The second level is provided with the MPI attribute
\mpiattr{Open MPI\_\-MPI\_\-MCA\_\-COLL}.  This attribute can be set to
the string name of a specific \kind{coll} module on a parent
communicator before a new communicator is created.  If set, the
attribute's value indicates the {\em only} module that will be
queried.  If this attribute is not set, all available modules are
queried.

Note that no coordination is done between the MCA frameworks in each
MPI process to ensure that the same modules are available and/or are
selected for each communicator.  Although \cmd{mpirun} allows
different environment variables to be exported to each MPI process,
and the value of an MPI attribute is local to each process, Open MPI's
behavior is undefined if the same MCA parameters are not available in
all MPI processes.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{\kind{coll} MCA Parameters}

There are three parameters that apply to all \kind{coll} modules.
Depending on when their values are checked, they may be set by
environment variables, command line switches, or attributes on MPI
communicators.  An example command line setting these parameters is
shown after the list.

\begin{itemize}
\item \issiparam{coll\_\-base\_\-associative}: The MPI standard defines
  whether reduction operations are commutative or not, but makes no
  provisions for whether an operator is associative or not.  This
  parameter, if defined to 1, asserts that all reduction operations on
  a communicator are assumed to be associative.  If undefined or
  defined to 0, all reduction operations are assumed to be
  non-associative.

  This parameter is examined during every reduction operation.  See
  {\bf Commutative and Associative Reduction Operators}, below.

\item \issiparam{coll\_\-crossover}: If set, this parameter defines
  the maximum number of processes that will be used with a linear
  algorithm.  More than this number of processes may use some other
  kind of algorithm.

  This parameter is only examined during \mpifunc{MPI\_\-INIT}.

\item \issiparam{coll\_\-reduce\_\-crossover}: For reduction
  operations, the determination as to whether an algorithm should be
  linear or not is not based on the number of processes, but rather on
  the number of bytes to be transferred by each process.  If this
  parameter is set, it defines the maximum number of bytes transferred
  by a single process with a linear algorithm.  More than this number
  of bytes may result in some other kind of algorithm.

  This parameter is only examined during \mpifunc{MPI\_\-INIT}.
\end{itemize}
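
For example, the following command line sets all three parameters at
run time (the specific values shown are only illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi coll_base_associative 1 -ssi coll_crossover 4 -ssi coll_reduce_crossover 8192 my_mpi_application
\end{lstlisting}
% stupid emacs mode: $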

%%%%%

\subsubsection{Commutative and Associative Reduction Operators}

MPI-1 defines that all built-in reduction operators are commutative.
User-defined reduction operators can specify whether they are
commutative or not.  The MPI standard makes no provisions for whether
a reduction operation is associative or not.
%
For some operators and datatypes, this distinction is largely
irrelevant (e.g., find the maximum in a set of integers).  However,
for operations involving the combination of floating point numbers,
associativity and commutativity matter.  An {\em Advice to
  Implementors} note in MPI-1, section 4.9.1, 114:20, states:

\begin{quote}
  It is strongly recommended that \mpifunc{MPI\_\-REDUCE} be
  implemented so that the same result be obtained whenever the
  function is applied on the same arguments, appearing in the same
  order.  Note that this may prevent optimizations that take advantage
  of the physical location of processors.
\end{quote}

Some implementations of the reduction operations may specifically take
advantage of data locality, and therefore assume that the reduction
operator is associative.
%
As such, Open MPI always takes the conservative approach and falls
back to non-associative algorithms (e.g., \coll{lam\_\-basic}) for
reduction operations unless specifically told to use associative
(SMP-optimized) algorithms by setting the MCA parameter
\issiparam{coll\_\-base\_\-associative} to 1.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \coll{lam\_\-basic} Module}
\index{collective MCA modules!lam\_basic@\coll{lam\_\-basic}}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt lam\_\-basic} \\
  {\bf Kind:} & \kind{coll} \\
  {\bf Default MCA priority:} & 0 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \coll{lam\_\-basic} module provides simplistic algorithms for each
of the MPI collectives that are layered on top of point-to-point
functionality.\footnote{The basic algorithms are the same ones that
  have been included in Open MPI since at least version 6.2.}  It can
be used in any environment.  Its priority is sufficiently low that it
will be chosen if no other \kind{coll} module is available.

Many of the algorithms are twofold: for $N$ or fewer processes, linear
algorithms are used.  For more than $N$ processes, binomial algorithms
are used.  No attempt is made to determine the locality of processes,
however -- the \coll{lam\_\-basic} module effectively assumes that
there is equal latency between all processes.  All reduction
operations are performed in a strictly-defined order; associativity is
not assumed.

\subsubsection{Collectives for Intercommunicators}

At present, only the \coll{lam\_\-basic} module supports
intercommunicator collectives as defined by the MPI-2 standard.  These
algorithms are built on top of the point-to-point layer and also make
use of intra-communicator collectives via the intra-communicator
corresponding to the local group.  The mapping between the
intercommunicator and its corresponding local intra-communicator is
managed separately within the \coll{lam\_\-basic} module.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \coll{smp} Module}
\index{collective MCA modules!smp@\coll{smp}}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt smp} \\
  {\bf Kind:} & \kind{coll} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \coll{smp} module is geared towards SMP nodes in a LAN.  Heavily
inspired by the MagPIe
algorithms~\cite{kielmann++:00:magpie_bandwidth}, the \coll{smp}
module determines the locality of processes before setting up a
dynamic structure in which to perform the collective function.
Although all communication is still layered on MPI point-to-point
functions, the algorithms attempt to maximize the use of on-node
communication before communicating with off-node processes.  This
results in lower overall latency for the collective operation.

The \coll{smp} module assumes that there are only two levels of
latency between all processes.  As such, it will only allow itself to
be available for selection when there are at least two nodes in a
communicator and there are at least two processes on the same
node.\footnote{As a direct result, \coll{smp} will never be selected
  for \mcs.}

Only some of the collectives have been optimized for SMP environments.
Table~\ref{tbl:mca-ompi-coll-smp-algorithms} shows which collective
functions have been optimized, which were already optimal (from the
\coll{lam\_\-basic} module), and which will eventually be optimized.

\def\lbid{Identical to \coll{lam\_\-basic} algorithm; already
  optimized for SMP environments.}
\def\smpopt{Optimized for SMP environments.}
\def\smpnoopt{Not yet optimized for SMP environments; uses
  \coll{lam\_\-basic} algorithm instead.}

\begin{table}[htbp]
  \centering
  \begin{tabular}{|l|p{3.7in}|}
    \hline
    \multicolumn{1}{|c|}{MPI function} &
    \multicolumn{1}{|c|}{Status} \\
    \hline
    \hline
    \mpifunc{MPI\_\-ALLGATHER} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-ALLGATHERV} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-ALLREDUCE} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-ALLTOALL} & \lbid \\
    \hline
    \mpifunc{MPI\_\-ALLTOALLV} & \lbid \\
    \hline
    \mpifunc{MPI\_\-ALLTOALLW} & \lbid \\
    \hline
    \mpifunc{MPI\_\-BARRIER} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-BCAST} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-EXSCAN} & \lbid \\
    \hline
    \mpifunc{MPI\_\-GATHER} & \lbid \\
    \hline
    \mpifunc{MPI\_\-GATHERV} & \lbid \\
    \hline
    \mpifunc{MPI\_\-REDUCE} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-REDUCE\_\-SCATTER} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-SCAN} & \smpopt \\
    \hline
    \mpifunc{MPI\_\-SCATTER} & \lbid \\
    \hline
    \mpifunc{MPI\_\-SCATTERV} & \lbid \\
    \hline
  \end{tabular}
  \caption{Listing of MPI collective functions indicating which have
    been optimized for SMP environments.}
  \label{tbl:mca-ompi-coll-smp-algorithms}
\end{table}

%%%%%

\subsubsection{Special Notes}

Since the SMP-optimized algorithms attempt to take advantage of data
locality, it is strongly recommended to maximize the proximity of
\mcw\ rank neighbors on each node.  The \cmdarg{C} nomenclature to
\cmd{mpirun} can ensure this automatically.

Also, as a result of the data-locality exploitation, the
\issiparam{coll\_\-base\_\-associative} parameter is highly relevant;
if it is not set to 1, the \coll{smp} module will fall back to the
\coll{lam\_\-basic} reduction algorithms.
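
For example, to explicitly select the \coll{smp} module and assert
that reduction operations on all communicators may be treated as
associative (the application name is only illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi coll smp -ssi coll_base_associative 1 my_mpi_application
\end{lstlisting}
% stupid emacs mode: $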

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\changebegin{7.1}

\subsection{The \coll{shmem} Module}
\index{collective MCA modules!shmem@\coll{shmem}}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt shmem} \\
  {\bf Kind:} & \kind{coll} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \coll{shmem} module was developed to facilitate fast collective
communication among processes on a single node.  Processes on an
N-way SMP node can take advantage of shared memory for message
passing.  The module will be selected only if the communicator spans
a single node and all the processes in the communicator can
successfully attach the shared memory region to their address space.

The shared memory region consists of two disjoint sections.  The
first section is used for synchronization among the processes, while
the second section is used for message passing (copying data into and
out of shared memory).

The second section is known as the \const{MESSAGE\_POOL} and is
divided into $N$ equal segments.  The default value of $N$ is 8 and is
configurable with the \issiparam{coll\_base\_shmem\_num\_segments} MCA
parameter.  The size of the \const{MESSAGE\_POOL} can also be
configured with the
\issiparam{coll\_base\_shmem\_message\_pool\_size} MCA parameter.  The
default size of the \const{MESSAGE\_POOL} is $(16384 \times 8)$.
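
For example, the following command line selects the \coll{shmem}
module and changes both parameters (the values shown are only
illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi coll shmem -ssi coll_base_shmem_num_segments 4 -ssi coll_base_shmem_message_pool_size 262144 my_mpi_application
\end{lstlisting}
% stupid emacs mode: $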

The first section is known as the \const{CONTROL\_SECTION} and is
logically divided into $(2 \times N + 2)$ segments, where $N$ is the
number of segments in the \const{MESSAGE\_POOL} section.  The total
size of this section is:

\[
((2 \times N) + 2) \times C \times S
\]

where $C$ is the cache line size and $S$ is the size of the
communicator.  Shared variables for synchronization are placed in a
different \const{CACHELINE} for each process to prevent thrashing due
to cache invalidation.
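
As a purely illustrative example, assume a cache line size of
$C = 64$ bytes, a communicator of $S = 4$ processes, and the default
$N = 8$ message pool segments.  The \const{CONTROL\_SECTION} would
then occupy $((2 \times 8) + 2) \times 64 \times 4 = 4608$ bytes.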

%%%%%

\subsubsection{General Logic behind Shared Memory Management}

Each segment in the \const{MESSAGE\_POOL} corresponds to {\em two}
segments in the \const{CONTROL\_SECTION}.  Whenever a particular
segment in the \const{MESSAGE\_POOL} is active, its corresponding
segments in the \const{CONTROL\_SECTION} are used for synchronization.
Processes can operate on one segment (copy the messages), set the
appropriate synchronization variables, and continue with the next
message segment.  This approach improves the performance of the
collective algorithms.  All processes need to complete an
\mpifunc{MPI\_BARRIER} at the last (by default, the 8th) segment to
prevent race conditions.  The extra two segments in the
\const{CONTROL\_SECTION} are used exclusively for explicit
\mpifunc{MPI\_BARRIER} operations.
%%%%%

\subsubsection{List of Algorithms}

Only some of the collectives have been implemented using shared
memory.  Table~\ref{tbl:mca-ompi-coll-shmem-algorithms} shows which
collective functions have been implemented with shared memory and
which use the \coll{lam\_\-basic} algorithms instead.

\def\shmemopt{Implemented using shared memory.}
\def\shmemnoopt{Uses \coll{lam\_\-basic} algorithm.}

\begin{table}[htbp]
  \centering
  \begin{tabular}{|l|p{3.7in}|}
    \hline
    \multicolumn{1}{|c|}{MPI function} &
    \multicolumn{1}{|c|}{Status} \\
    \hline
    \hline
    \mpifunc{MPI\_\-ALLGATHER} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-ALLGATHERV} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-ALLREDUCE} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-ALLTOALL} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-ALLTOALLV} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-ALLTOALLW} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-BARRIER} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-BCAST} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-EXSCAN} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-GATHER} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-GATHERV} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-REDUCE} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-REDUCE\_\-SCATTER} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-SCAN} & \shmemnoopt \\
    \hline
    \mpifunc{MPI\_\-SCATTER} & \shmemopt \\
    \hline
    \mpifunc{MPI\_\-SCATTERV} & \shmemnoopt \\
    \hline
  \end{tabular}
  \caption{Listing of MPI collective functions indicating which have
    been implemented using shared memory.}
  \label{tbl:mca-ompi-coll-shmem-algorithms}
\end{table}

%%%%%

\subsubsection{Tunable Parameters}

Table~\ref{tbl:mca-ompi-shmem-mca-params} shows the MCA parameters
that may be changed at run-time.  Each of these parameters was
discussed in the previous sections.

\begin{table}[htbp]
  \begin{ssiparamtb}
    %
    \ssiparamentry{coll\_\-base\_\-shmem\_\-message\_\-pool\_\-size}{$16384
      \times 8$}{Size of the shared memory pool for the messages.}
    %
    \ssiparamentry{coll\_\-base\_\-shmem\_\-num\_\-segments}{8}{Number
      of segments in the message pool section.}
  \end{ssiparamtb}
  \caption{MCA parameters for the \coll{shmem} \kind{coll} module.}
  \label{tbl:mca-ompi-shmem-mca-params}
\end{table}

%%%%%

\subsubsection{Special Notes}

Open MPI provides the \rpi{sysv} and \rpi{usysv} RPIs for intra-node
communication.  With those RPIs, collective communication also happens
through shared memory, but indirectly, in terms of sends and receives.
The shared memory collective algorithms avoid the overhead associated
with this indirection and provide a minimally-blocking way to perform
the collective operations.

The shared memory region is created by only one process in the
communicator; the rest of the processes simply attach the shared
memory region to their address space.  The process that finalizes last
hands the shared memory region back to the kernel, while processes
that leave earlier simply detach the shared memory region from their
address space.

\index{collective MCA modules|)}

\changeend{7.1}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Checkpoint/Restart of MPI Jobs}
\label{sec:mca-ompi-cr}
\index{checkpoint/restart MCA modules|(}

Open MPI supports the ability to involuntarily checkpoint and restart
parallel MPI jobs.  Due to the asynchronous nature of the
checkpoint/restart design, such jobs must run with a thread level of
at least \mpiconst{MPI\_\-THREAD\_\-SERIALIZED}.  This allows the
checkpoint/restart framework to interrupt the user's job for a
checkpoint regardless of whether it is currently performing message
passing functions in the MPI communications layer.

Open MPI does not provide checkpoint/restart functionality itself;
\kind{cr} MCA modules are used to invoke back-end systems that save
and restore checkpoints.  The following notes apply to checkpointing
parallel MPI jobs:

\begin{itemize}
\item No special code is required in MPI applications to take
  advantage of Open MPI's checkpoint/restart functionality, although
  some limitations may be imposed (depending on the back-end
  checkpointing system that is used).

\item Open MPI's checkpoint/restart functionality {\em only} involves
  MPI processes; the Open MPI universe is not checkpointed.  An Open
  MPI universe must be independently established before an MPI job can
  be restored.

\item Open MPI does not yet support checkpointing/restarting MPI-2
  applications.  In particular, Open MPI's behavior is undefined when
  checkpointing MPI processes that invoke any non-local MPI-2
  functionality (including dynamic functions and IO).

\item Migration of restarted processes is available on a limited
  basis; the \rpi{crtcp} RPI will start up properly regardless of what
  nodes the MPI processes are re-started on, but other system-level
  resources may or may not be restarted properly (e.g., open files,
  shared memory, etc.).

\changebegin{7.1}
\item Checkpoint files are saved using a two-phase commit protocol
  that is coordinated by \cmd{mpirun}.  \cmd{mpirun} initiates a
  checkpoint request for each process in the MPI job by supplying a
  temporary context filename.  If all the checkpoint requests complete
  successfully, the saved context files are renamed to their
  respective target filenames; otherwise, the checkpoint files are
  discarded.
\changeend{7.1}

\item Checkpoints can only be performed after all processes have
  invoked \mpifunc{MPI\_\-INIT} and before any process has invoked
  \mpifunc{MPI\_\-FINALIZE}.
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Selecting a \kind{cr} Module}
\index{checkpoint/restart MCA modules!selection process}

The \kind{cr} framework coordinates with all other MCA modules to
ensure that the entire MPI application is ready to be checkpointed
before the back-end system is invoked.  Specifically, for a parallel
job to be able to checkpoint and restart, all the MCA modules that it
uses must support checkpoint/restart capabilities.

All \kind{coll} modules in the Open MPI distribution currently support
checkpoint/restart capability because they are layered on MPI
point-to-point functionality -- as long as the RPI module being used
supports checkpoint/restart, so do the \kind{coll} modules.
%
However, only one RPI module currently supports checkpoint/restart:
\rpi{crtcp}.  Attempting to checkpoint an MPI job when using any other
\kind{rpi} module will result in undefined behavior.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{\kind{cr} MCA Parameters}

The \issiparam{cr} MCA parameter can be used to specify which
\kind{cr} module should be used for an MPI job.  An error will occur
if a \kind{cr} module is requested and an \kind{rpi} or \kind{coll}
module cannot be found that supports checkpoint/restart functionality.

Additionally, the \issiparam{cr\_\-base\_\-dir} MCA parameter can be
used to specify the directory where checkpoint file(s) will be saved.
If it is not set, and no default value was provided when Open MPI was
configured (with the \confflag{with-cr-file-dir} flag), the user's
home directory is used.
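
For example, the following command line runs a checkpoint-capable job
and directs the checkpoint files to a specific directory (the
directory shown is only illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr -ssi cr_base_dir /tmp/ckpt my_mpi_program
\end{lstlisting}
% stupid emacs mode: $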

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The \crssi{blcr} Module}
\index{Berkeley Lab Checkpoint/Restart single-node checkpointer}
\index{blcr checkpoint/restart MCA module@\crssi{blcr} checkpoint/restart MCA module}
\index{checkpoint/restart MCA modules!blcr@\crssi{blcr}}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt blcr} \\
  {\bf Kind:} & \kind{cr} \\
  {\bf Default MCA priority:} & 50 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

Berkeley Lab's Checkpoint/Restart (BLCR)~\cite{BLCR02} single-node
checkpointer provides the capability for checkpointing and restarting
processes under Linux.  The \crssi{blcr} module, when used with
checkpoint/restart MCA modules, will invoke the BLCR system to save
and restore checkpoints.

%%%%%

\subsubsection{Overview}

The \crssi{blcr} module will only automatically be selected when the
thread level is \mpiconst{MPI\_\-THREAD\_\-SERIALIZED} and all
selected MCA modules support checkpoint/restart functionality (see the
MCA module selection algorithm,
Section~\ref{sec:mca-ompi-component-selection},
page~\pageref{sec:mca-ompi-component-selection}).  The \crssi{blcr}
module can be specifically selected by setting the \ssiparam{cr} MCA
parameter to the value \issivalue{cr}{blcr}.  Manually selecting the
\crssi{blcr} module will force the MPI thread level to be at least
\mpiconst{MPI\_\-THREAD\_\-SERIALIZED}.

%%%%%

\subsubsection{Running a Checkpoint/Restart-Capable MPI Job}

There are multiple ways to run a job with checkpoint/restart support:

\begin{itemize}
\item Use the \rpi{crtcp} RPI, and invoke
  \mpifunc{MPI\_\-INIT\_\-THREAD} with a requested thread level of
  \mpiconst{MPI\_\-THREAD\_\-SERIALIZED}.  This will automatically
  make the \crssi{blcr} module available.

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\item Use the \rpi{crtcp} RPI and manually select the \crssi{blcr}
  module:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr my_mpi_program
\end{lstlisting}
% stupid emacs mode: $
\end{itemize}

\changebegin{7.0.5}

Depending on the location of the BLCR shared library, it may be
necessary to use the \ienvvar{LD\_\-LIBRARY\_\-PATH} environment
variable to specify where it can be found.  Specifically, if the BLCR
library is not in the default path searched by the linker, errors will
occur at run time because it cannot be found.  In such cases, adding
the directory where the \file{libcr.so*} file(s) can be found to the
\ienvvar{LD\_\-LIBRARY\_\-PATH} environment variable {\em on all nodes
  where the MPI application will execute} will solve the problem.
%
Note that this may entail editing users' ``dot'' files to augment the
\ienvvar{LD\_\-LIBRARY\_\-PATH} variable.\footnote{Be sure to see
  Section~\ref{sec:getting-started-path} for details about which
  shell startup files should be edited.  Also note that
  shell startup files are {\em only} read when starting the Open MPI
  universe.  Hence, if you change values in shell startup files, you
  will likely need to re-invoke the \icmd{lamboot} command to put your
  changes into effect.}  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
# ...edit user's shell startup file to augment LD_LIBRARY_PATH...
shell$ lamboot hostfile
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr my_mpi_program
\end{lstlisting}

Alternatively, the ``\cmd{-x}'' option to \icmd{mpirun} can be used to
export the \ienvvar{LD\_\-LIBRARY\_\-PATH} environment variable to all
MPI processes.  For example (Bourne shell and derivatives):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ LD_LIBRARY_PATH=/location/of/blcr/lib:$LD_LIBRARY_PATH
shell$ export LD_LIBRARY_PATH
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

For C shell and derivatives:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell% setenv LD_LIBRARY_PATH /location/of/blcr/lib:$LD_LIBRARY_PATH
shell% mpirun C -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH my_mpi_program
\end{lstlisting}

\changeend{7.0.5}

%%%%%

\subsubsection{Checkpointing and Restarting}

Once a checkpoint-capable job is running, the BLCR command
\icmd{cr\_\-checkpoint} can be used to invoke a checkpoint.  Running
\cmd{cr\_\-checkpoint} with the PID of \cmd{mpirun} will cause a
context file to be created for \cmd{mpirun} as well as a context file
for each running MPI process.  Before it is checkpointed, \cmd{mpirun}
will also create an application schema file to assist in restoring the
MPI job.  These files will all be created in the directory specified
by the \issiparam{cr\_\-base\_\-dir} MCA parameter, Open MPI's
configured default, or the user's home directory if no default is
specified.

The BLCR \icmd{cr\_\-restart} command can then be invoked with the PID
and context file generated from \cmd{mpirun}, which will restore the
entire MPI job.
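
For example, assuming that \cmd{mpirun} is running with PID 1234 and
that the resulting context file for \cmd{mpirun} is named
\file{context.1234} (both the PID and the file name here are purely
illustrative):

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ cr_checkpoint 1234
shell$ cr_restart context.1234
\end{lstlisting}
% stupid emacs mode: $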

%%%%%

\subsubsection{Tunable Parameters}

There are no tunable parameters to the \crssi{blcr} \kind{cr} module.

%%%%%

\subsubsection{Known Issues}

\begin{itemize}
\item BLCR has its own limitations (e.g., BLCR does not yet support
  saving and restoring file descriptors); see the documentation
  included in BLCR for further information.  Check the project's main
  web site\footnote{\url{http://ftg.lbl.gov/}} to find out more about
  BLCR.

\changebegin{7.1}
\item Since a checkpoint request is initiated by invoking
  \cmd{cr\_checkpoint} with the PID of \cmd{mpirun}, it is not
  possible to checkpoint MPI jobs that were started using the
  \cmdarg{-nw} option to \cmd{mpirun}, or directly from the
  command-line without using \cmd{mpirun}.

\item While the two-phase commit protocol that is used to save
  checkpoints provides a reasonable guarantee of consistency of saved
  global state, there is at least one case in which this guarantee
  fails.  For example, the renaming of checkpoint files by
  \cmd{mpirun} is not atomic; if a failure occurs when \cmd{mpirun} is
  in the process of renaming the checkpoint files, the collection of
  checkpoint files might result in an inconsistent global state.

\item If the BLCR module(s) are compiled dynamically, the
  \ienvvar{LD\_\-PRELOAD} environment variable must include the
  location of the \ifile{libcr.so} library.  This is to ensure that
  \ifile{libcr.so} is loaded before the PThreads library.
\changeend{7.1}
\end{itemize}

\changebegin{7.1}
\subsection{The \crssi{self} Module}

\begin{tabular}{rl}
  \multicolumn{2}{c}{Module Summary} \\
  \hline
  {\bf Name:} & {\tt self} \\
  {\bf Kind:} & \kind{cr} \\
  {\bf Default MCA priority:} & 25 \\
  {\bf Checkpoint / restart:} & yes \\
  \hline
\end{tabular}
\vspace{11pt}

The \crssi{self} module, when used with checkpoint/restart MCA
modules, will invoke the user-defined functions to save and restore
checkpoints.

%%%%%

\subsubsection{Overview}

The \crssi{self} module can be specifically selected by setting the
\ssiparam{cr} MCA parameter to the value \issivalue{cr}{self}.
Manually selecting the \crssi{self} module will force the MPI thread
level to be at least \mpiconst{MPI\_\-THREAD\_\-SERIALIZED}.

%%%%%

\subsubsection{Running a Checkpoint/Restart-Capable MPI Job}

Use the \rpi{crtcp} RPI and manually select the \crssi{self} module:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi cr self my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

%%%%%

\subsubsection{Checkpointing and Restarting}

The names of the Checkpoint, Restart and Continue functions can be
specified in one of the following ways:

\begin{itemize}
\item Use the \ssiparam{cr\_\-self\_\-user\_\-prefix} MCA parameter to
  specify a prefix.  This will cause Open MPI to assume that the
  Checkpoint, Restart and Continue functions are named
  prefix\_checkpoint, prefix\_restart and prefix\_continue,
  respectively, where prefix is the value of the
  \ssiparam{cr\_\-self\_\-user\_\-prefix} MCA parameter (see the
  example command line after this list).

\item To specify the names of the Checkpoint, Restart and Continue
  functions separately, use the
  \ssiparam{cr\_\-self\_\-user\_\-checkpoint},
  \ssiparam{cr\_\-self\_\-user\_\-restart} and
  \ssiparam{cr\_\-self\_\-user\_\-continue} MCA parameters,
  respectively.  If both the \ssiparam{cr\_\-self\_\-user\_\-prefix}
  and any of these three parameters are specified, the latter three
  parameters take precedence.
\end{itemize}
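
For example, if the application's functions are named
my\_app\_checkpoint, my\_app\_restart, and my\_app\_continue (the
prefix ``my\_app'' is only illustrative), the job could be started as
follows:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -ssi rpi crtcp -ssi cr self -ssi cr_self_user_prefix my_app my_mpi_program
\end{lstlisting}
% stupid emacs mode: $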

%%%%%

If none of the above four parameters is supplied and the \crssi{self}
module is selected, \cmd{mpirun} aborts with an error message
indicating that the symbol lookup failed.  \cmd{mpirun} also aborts if
any of the Checkpoint, Restart or Continue functions is not found in
the MPI application.

Once a checkpoint-capable job is running, the Open MPI command
\cmd{lamcheckpoint} can be used to invoke a checkpoint.  Running
\cmd{lamcheckpoint} with the PID of \cmd{mpirun} will cause the
user-defined Checkpoint function to be invoked.  Before it is
checkpointed, \cmd{mpirun} will also create an application schema file
to assist in restoring the MPI job.  These files will all be created
in the directory specified by the \issiparam{cr\_\-base\_\-dir} MCA
parameter, Open MPI's configured default, or the user's home directory
if no default is specified.

\cmd{lamboot} can be used to restart an MPI application.  In case the
\crssi{self} module is selected, the second argument to \cmd{lamboot}
is the set of arguments to be passed to the new \cmd{mpirun}.

%%%%%

\subsubsection{Tunable Parameters}

There are no tunable parameters to the \crssi{self} \kind{cr} module.

%%%%%

\subsubsection{Known Issues}

\begin{itemize}
\item Since a checkpoint request is initiated by invoking
  \cmd{lamcheckpoint} with the PID of \cmd{mpirun}, it is not possible
  to checkpoint MPI jobs that were started using the \cmdarg{-nw}
  option to \cmd{mpirun}, or directly from the command-line without
  using \cmd{mpirun}.
\end{itemize}
\changeend{7.1}

\index{checkpoint/restart MCA modules|)}