% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
%                         University Research and Technology
%                         Corporation.  All rights reserved.
% Copyright (c) 2004-2005 The University of Tennessee and The University
%                         of Tennessee Research Foundation.  All rights
%                         reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
%                         University of Stuttgart.  All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
%                         All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%

\chapter{Modular Component Architecture (MCA) Overview}
\label{sec:mca}
\index{MCA!overview|(}
\index{Modular Component Architecture|see {MCA}}

The Modular Component Architecture (MCA) makes up the core of Open MPI.
It affects the way in which many Open MPI commands, and MPI processes
themselves, are executed.  This chapter provides an overview of what
MCA is and what users need to know about how to use it to maximize
the performance of MPI applications.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Frameworks and Components}
\index{MCA!component frameworks}

The MCA provides component frameworks for the Open MPI run-time
environment (otherwise known as the Open Run-Time Environment, or
ORTE) and the MPI communications layer.  Components are selected from
each type at run-time and used to implement the RTE and MPI library.

% JMS: Right ideas, but needs much overhauling

There are currently four types of components used by
Open MPI:

\begin{itemize}
\item \kind{boot}: Starting the Open MPI run-time environment, used mainly
  with the \cmd{lamboot} command.

\item \kind{coll}: MPI collective communications, only used within MPI
  processes.

\item \kind{cr}: Checkpoint/restart functionality, used both within
  Open MPI commands and MPI processes.

\item \kind{rpi}: MPI point-to-point communications, only used within
  MPI processes.
\end{itemize}

The Open MPI distribution includes instances of each component type,
referred to as modules.  Each module is an implementation of the
component type that can be selected and used at run-time to provide
services to the Open MPI RTE and MPI communications layer.
Chapters~\ref{sec:lam-mca} and~\ref{sec:mca-ompi} list the modules that
are available in the Open MPI distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Terminology}

\begin{description}
\item[Available] The term ``available'' is used to describe a module
  that reports (at run-time) that it is able to run in the current
  environment.  For example, an RPI module may check to see if
  supporting network hardware is present before reporting whether it
  is available.

  Chapters~\ref{sec:lam-mca} and~\ref{sec:mca-ompi} list the modules
  that are included in the Open MPI distribution, and detail the
  requirements that each of them must meet to report itself as
  available.

\item[Selected] The term ``selected'' means that a module has been
  chosen to be used at run-time.  Depending on the module type, zero
  or more modules may be selected.

\item[Scope] Each module selection has a scope depending on the type
  of the module.  ``Scope'' refers to the duration of the module's
  selection.  Table~\ref{tbl:mca-module-scopes} lists the scopes for
  each module type.
\end{description}

\begin{table}[htbp]
  \centering
  \begin{tabular}{|l|p{4in}|}
    \hline
    \multicolumn{1}{|c|}{Type} &
    \multicolumn{1}{c|}{Scope description} \\
    \hline
    \hline
    \kind{boot} & A module is selected at the beginning of
    \cmd{lamboot} (or \cmd{recon}) and is used for the duration of the
    Open MPI universe. \\
    \hline
    \kind{coll} & A module is selected every time an MPI communicator
    is created (including \mpiconst{MPI\_\-COMM\_\-WORLD} and
    \mpiconst{MPI\_\-COMM\_\-SELF}).  It remains in use until that
    communicator has been freed. \\
    \hline
    \kind{cr} & Checkpoint/restart modules are selected at the
    beginning of an MPI job and remain in use until the job
    completes. \\
    \hline
    \kind{rpi} & RPI modules are selected during \mpifunc{MPI\_\-INIT}
    and remain in use until \mpifunc{MPI\_\-FINALIZE} returns. \\
    \hline
  \end{tabular}
  \caption{MCA module types and their corresponding scopes.}
  \label{tbl:mca-module-scopes}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{MCA Parameters}
\label{sec:commands-mca-module-parameters}
\index{MCA!parameter overview}

One of the founding principles of MCA is to allow the passing of
run-time parameters through the MCA framework.  This allows both the
selection of which modules will be used at run-time (by passing
parameters to the MCA framework itself) and the tuning of the run-time
performance of individual modules (by passing parameters to each
module).
%
Although the specific usage of each MCA module parameter is defined by
either the framework or the module that it is passed to, the value of
most parameters is resolved as follows:

\begin{enumerate}
\item If a valid value is provided via a run-time MCA parameter, use
  that.

\item Otherwise, attempt to calculate a meaningful value at run-time
  or use a compiled-in default value.\footnote{Note that many MCA
    modules provide configure flags to set compile-time defaults
    for ``tweakable'' parameters.
    See~\cite{lamteam03:_lam_mpi_install_guide}.}
\end{enumerate}

As such, it is typically possible to set a parameter's default value
when Open MPI is configured/compiled, but use a different value at run
time.

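The resolution order above can be sketched in C.  This is only an
illustration under stated assumptions, not code from Open MPI: the
helper \texttt{resolve\_param} is hypothetical, the run-time value is
assumed to arrive through an environment variable named with the
\envvar{OMPI\_\-MCA\_} prefix followed by the parameter name, and an
empty value is treated as not valid.  The run-time calculation step is
omitted.

\lstset{style=lam-c}
\begin{lstlisting}
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: resolve the value of an MCA parameter.
 *   1. Use the run-time value if a usable one was supplied.
 *   2. Otherwise fall back to the compiled-in default. */
const char *resolve_param(const char *name, const char *compiled_default)
{
  char envname[256];
  const char *value;

  /* Assumed convention: variable name = prefix + parameter name */
  snprintf(envname, sizeof(envname), "OMPI_MCA_%s", name);
  value = getenv(envname);
  if (value != NULL && value[0] != '\0') {
    return value;             /* run-time value wins */
  }
  return compiled_default;    /* nothing usable at run-time */
}
\end{lstlisting}

Called for \mcaparam{boot\_\-rsh\_\-agent} with a compiled-in default
of \texttt{rsh}, this sketch would return the value from the
environment when one is set and \texttt{rsh} otherwise.
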
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Naming Conventions}

MCA parameter names are generally strings containing only letters and
underscores, and can typically be broken down into three parts.  For
example, the parameter \mcaparam{boot\_\-rsh\_\-agent} can be broken
into its three components:

\begin{itemize}
\item MCA module type: The first string of the name.  In this case, it
  is \mcaparam{boot}.

\item MCA module name: The second string of the name, corresponding to
  a specific MCA module.  In this case, it is \mcaparam{rsh}.

\item Parameter name: The last string in the name.  It may be an
  arbitrary string, and include multiple underscores.  In this case,
  it is \mcaparam{agent}.
\end{itemize}

Although the parameter name is technically only the last part of the
string, it is only proper to refer to it within its overall context.
Hence, it is correct to say ``the \mcaparam{boot\_\-rsh\_\-agent}
parameter'' as well as ``the \mcaparam{agent} parameter to the
\boot{rsh} boot module''.

Note that the reserved string \mcaparam{base} may appear as a module
name, referring to the fact that the parameter applies to all modules
of a given type.

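The three-part breakdown can be illustrated with a short C sketch.
The helper \texttt{split\_param\_name} is hypothetical and is not part
of Open MPI.  It assumes that neither the type nor the module name
contains an underscore, so it splits at the first two underscores; a
module name such as \texttt{lam\_basic} would instead require matching
against the list of known module names.

\lstset{style=lam-c}
\begin{lstlisting}
#include <string.h>

/* Hypothetical helper: split "type_module_param" at the first two
 * underscores.  Everything after the second underscore is the
 * parameter name, which may itself contain underscores.
 * Returns 0 on success, or -1 if the name has fewer than three
 * parts or a part does not fit in its buffer. */
int split_param_name(const char *full,
                     char *type, size_t type_len,
                     char *module, size_t module_len,
                     char *param, size_t param_len)
{
  const char *first = strchr(full, '_');
  const char *second = (first != NULL) ? strchr(first + 1, '_') : NULL;
  size_t tlen, mlen;

  if (first == NULL || second == NULL) {
    return -1;
  }
  tlen = (size_t) (first - full);
  mlen = (size_t) (second - first - 1);
  if (tlen >= type_len || mlen >= module_len ||
      strlen(second + 1) >= param_len) {
    return -1;
  }
  memcpy(type, full, tlen);
  type[tlen] = '\0';
  memcpy(module, first + 1, mlen);
  module[mlen] = '\0';
  strcpy(param, second + 1);   /* length was checked above */
  return 0;
}
\end{lstlisting}

For \mcaparam{boot\_\-rsh\_\-agent}, the sketch yields the type
\mcaparam{boot}, the module \mcaparam{rsh}, and the parameter
\mcaparam{agent}.
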
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Setting Parameter Values}

MCA parameters each have a unique name and can take a single string
value.  The parameter/value pairs can be passed through several
mechanisms.  Depending on the target module and the specific
parameter, mechanisms may include:

\begin{itemize}
\item Using command line flags when Open MPI is configured.
\item Setting environment variables before invoking Open MPI commands.
\item Using the \cmdarg{-mca} command line switch to various Open MPI
  commands.
\item Setting attributes on MPI communicators.
\end{itemize}

Users are most likely to utilize the latter three methods.  Each is
described in detail below.  Listings and explanations of available
MCA parameters are provided in Chapters~\ref{sec:lam-mca}
and~\ref{sec:mca-ompi} (pages~\pageref{sec:lam-mca}
and~\pageref{sec:mca-ompi}, respectively), categorized by MCA type and
module.

%%%%%

\subsubsection{Environment Variables}

MCA parameters can be passed via environment variables prefixed with
\envvar{OMPI\_\-MCA\_}.  For example, selecting which RPI module
to use in an MPI job can be accomplished by setting the environment
variable \envvar{OMPI\_\-MCA\_\-rpi} to a valid RPI module name
(e.g., \cmdarg{tcp}).

Note that environment variables must be set {\em before} invoking the
corresponding Open MPI commands that will use them.

%%%%%

\subsubsection{\cmdarg{-mca} Command Line Switch}

Open MPI commands that interact with MCA modules accept the
\cmdarg{-mca} command line switch.  This switch expects two arguments
to follow: the name of the MCA parameter and its corresponding value.
For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca rpi tcp my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent runs \cmd{my\_\-mpi\_\-program} on all available CPUs in
the Open MPI universe using the \rpi{tcp} RPI module.

%%%%%

\subsubsection{Communicator Attributes}

Some MCA types accept MCA parameters via MPI communicator attributes
(notably the MPI collective communication modules).  These parameters
follow the same rules and restrictions as normal MPI attributes.  Note
that for portability between 32-bit and 64-bit systems, care should be
taken when setting and getting attribute values.  The following is an
example of portable attribute C code:

\lstset{style=lam-c}
\begin{lstlisting}
int flag, attribute_val;
void *set_attribute;
void **get_attribute;
MPI_Comm comm = MPI_COMM_WORLD;
/* Keyval of the MCA parameter; the exact name is illustrative */
int keyval = OMPI_MPI_MCA_COLL_BASE_ASSOCIATIVE;

/* Set the value: store it in a pointer-sized variable and pass
   that variable's address */
set_attribute = (void *) 1;
MPI_Comm_set_attr(comm, keyval, &set_attribute);

/* Get the value: MPI returns the address that was stored above */
get_attribute = NULL;
MPI_Comm_get_attr(comm, keyval, &get_attribute, &flag);
if (flag == 1) {
  /* intptr_t (from <stdint.h>) avoids a pointer/int size mismatch */
  attribute_val = (int) (intptr_t) *get_attribute;
  printf("Got the attribute value: %d\n", attribute_val);
}
\end{lstlisting}
% stupid emacs mode: $

By contrast, the following code is neither correct nor portable:

\lstset{style=lam-c}
\begin{lstlisting}
int flag, attribute_val;
MPI_Comm comm = MPI_COMM_WORLD;
int keyval = OMPI_MPI_MCA_COLL_BASE_ASSOCIATIVE;

/* Set the value */
attribute_val = 1;
MPI_Comm_set_attr(comm, keyval, &attribute_val);

/* Get the value: WRONG, MPI writes a pointer-sized value into
   an int */
attribute_val = -1;
MPI_Comm_get_attr(comm, keyval, &attribute_val, &flag);
if (flag == 1)
  printf("Got the attribute value: %d\n", attribute_val);
\end{lstlisting}
% stupid emacs mode: $

\index{MCA!overview|)}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Dynamic Shared Object (DSO) Modules}

\changebegin{7.1}

Open MPI has the capability of building MCA modules statically as part of
the MPI libraries or as dynamic shared objects (DSOs).  DSOs are
discovered and loaded into Open MPI processes at run-time.  This allows
functionality to be added to (or removed from) an existing Open MPI
installation without the need to recompile or re-link user applications.

The default location for DSO MCA modules is \file{\$prefix/lib/lam}.
If otherwise unspecified, this is where Open MPI will look for DSO MCA
modules.  However, the MCA parameter
\imcaparam{base\_\-module\_\-path} can be used to specify a new
colon-delimited path to look for DSO MCA modules.  This allows users
to specify their own location for modules, if desired.

Note that specifying this parameter overrides the default location.
If users wish to augment their search path, they will need to include
the default location in the path specification.

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca base_module_path $prefix/lib/lam:$HOME/my_lam_modules ...
\end{lstlisting}
% stupid emacs mode: $

\changeend{7.1}

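The override behavior of the module search path can be sketched in C.
Both helpers below are hypothetical and only illustrate the rule
described above; they are not how Open MPI itself walks the path.  The
first returns the path that is actually searched, and the second
extracts one directory from a colon-delimited path.

\lstset{style=lam-c}
\begin{lstlisting}
#include <string.h>

/* Hypothetical helper: a value given for the path parameter replaces
 * the default location outright; it is not appended to it. */
const char *effective_module_path(const char *param_value,
                                  const char *default_dir)
{
  if (param_value != NULL && param_value[0] != '\0') {
    return param_value;
  }
  return default_dir;
}

/* Hypothetical helper: copy the idx-th directory (counting from 0)
 * of a colon-delimited path into dir.  Returns 0 on success, or -1
 * if there is no such entry or it does not fit in the buffer. */
int path_entry(const char *path, int idx, char *dir, size_t dir_len)
{
  const char *start = path;
  const char *end;
  size_t len;

  if (idx < 0) {
    return -1;
  }
  while (idx-- > 0) {
    start = strchr(start, ':');
    if (start == NULL) {
      return -1;               /* ran out of entries */
    }
    start++;                   /* skip the colon */
  }
  end = strchr(start, ':');
  len = (end != NULL) ? (size_t) (end - start) : strlen(start);
  if (len >= dir_len) {
    return -1;
  }
  memcpy(dir, start, len);
  dir[len] = '\0';
  return 0;
}
\end{lstlisting}

Under this sketch, a user who supplies only a private directory loses
the default location, which is why the example command line lists both
directories.
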
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Selecting Modules}

As implied by the previous sections, modules are selected at run-time
by examining (in order) user-specified parameters, run-time
calculations, and compiled-in defaults.  The selection process
involves a flexible negotiation phase which can be both tweaked and
arbitrarily overridden by the user and system administrator.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Specifying Modules}

Each MCA type has an implicit MCA parameter corresponding to the type
name, indicating which module(s) are to be considered for selection.
For example, to specify that the \rpi{tcp} RPI module should be used,
set the MCA parameter \mcaparam{rpi} to the value \mcaparam{tcp}:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca rpi tcp my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

The same is true for the other MCA types (\kind{boot}, \kind{cr}, and
\kind{coll}), with the exception that the \kind{coll} parameter accepts
a comma-separated list of modules to be considered as each
MPI communicator is created (including
\mpiconst{MPI\_\-COMM\_\-WORLD}).  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca coll smp,shmem,lam_basic my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent indicates that the \coll{smp}, \coll{shmem}, and
\coll{lam\_\-basic} modules will all potentially be considered for
selection for each MPI communicator.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Setting Priorities}

Although priorities are typically not useful to individual users,
system administrators may use them to set system-wide defaults that
influence the module selection process in Open MPI jobs.

Each module has an associated priority which plays a role in whether
the module is selected or not.  Specifically, if one or more modules
of a given type are available for selection, the modules' priorities
will be at least one of the factors used to determine which module
will finally be selected.  Priorities are in the range $[-1, 100]$,
with $-1$ indicating that the module should not be considered for
selection, and $100$ being the highest priority.  Ties will be broken
arbitrarily by the MCA framework.

A module's priority can be set at run-time through the normal MCA
parameter mechanisms (i.e., environment variables or the
\cmdarg{-mca} switch).  Every module has an implicit priority MCA
parameter in the form \mcaparam{$<$type$>$\_\-$<$module
  name$>$\_\-priority}.

For example, a system administrator may set environment variables in
system-wide shell setup files (e.g., \file{/etc/profile},
\file{/etc/bashrc}, or \file{/etc/csh.cshrc}) to change the default
priorities.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Selection Algorithm}

For each component type, the following general selection algorithm is
used:

\begin{itemize}
\item A list of all available modules is created.  If the user
  specified one or more modules for this type, only those modules are
  queried to see if they are available.  Otherwise, all modules are
  queried.

\item The module with the highest priority (and potentially meeting
  other selection criteria, depending on the module's type) will be
  selected.
\end{itemize}

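The two steps above can be sketched in C as follows.  This is a
simplified illustration, not the MCA implementation: the
\texttt{module\_info} structure and both functions are hypothetical,
type-specific criteria are ignored, and ties go to the first module
found (the framework is only said to break ties arbitrarily).

\lstset{style=lam-c}
\begin{lstlisting}
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one module of a given type. */
struct module_info {
  const char *name;
  int available;   /* nonzero if the module reports it can run here */
  int priority;    /* -1 (never select) through 100 (highest) */
};

/* Return nonzero if name appears in the comma-separated list. */
static int in_list(const char *list, const char *name)
{
  size_t n = strlen(name);
  const char *p = list;

  while (p != NULL && *p != '\0') {
    const char *comma = strchr(p, ',');
    size_t len = (comma != NULL) ? (size_t) (comma - p) : strlen(p);
    if (len == n && strncmp(p, name, n) == 0) {
      return 1;
    }
    p = (comma != NULL) ? comma + 1 : NULL;
  }
  return 0;
}

/* Step 1: consider only the modules the user asked for (all modules
 * if requested is NULL) that report themselves available.
 * Step 2: pick the highest priority; -1 is never selected.
 * Returns NULL if no module qualifies. */
const struct module_info *select_module(const struct module_info *mods,
                                        size_t count,
                                        const char *requested)
{
  const struct module_info *best = NULL;
  size_t i;

  for (i = 0; i < count; i++) {
    if (requested != NULL && !in_list(requested, mods[i].name)) {
      continue;
    }
    if (!mods[i].available || mods[i].priority < 0) {
      continue;
    }
    if (best == NULL || mods[i].priority > best->priority) {
      best = &mods[i];
    }
  }
  return best;
}
\end{lstlisting}

Passing a list such as \texttt{smp,shmem,lam\_basic} as
\texttt{requested} restricts the candidates to those three modules
before priorities are compared.
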
Each MCA type may define its own additional selection rules.  For
example, the selection of \kind{coll}, \kind{cr}, and \kind{rpi}
modules may be interdependent, and may depend on the supported MPI
thread level.  Chapter~\ref{sec:mca-ompi} (page~\pageref{sec:mca-ompi})
details the selection algorithm for MPI MCA modules.