
% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University.
%                         All rights reserved.
% Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
%                         All rights reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
%                         University of Stuttgart.  All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
%                         All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%

\chapter{Modular Component Architecture (MCA) Overview}
\label{sec:mca}
\index{MCA!overview|(}
\index{Modular Component Architecture|see {MCA}}

The Modular Component Architecture (MCA) makes up the core of Open MPI.
It affects the way that many Open MPI commands and MPI processes are
executed.  This chapter provides an overview of what the MCA is and what
users need to know in order to use it to maximize the performance of MPI
applications.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Frameworks and Components}
\index{MCA!component frameworks}

The MCA provides component frameworks for the Open MPI run-time
environment (otherwise known as the Open Run-Time Environment, or
ORTE) and the MPI communications layer.  Components of each type are
selected at run-time and used to implement the RTE and MPI library.

% JMS: Right ideas, but needs much overhauling

There are currently four types of components used by
Open MPI:

\begin{itemize}
\item \kind{boot}: Starting the Open MPI run-time environment, used mainly
  with the \cmd{lamboot} command.

\item \kind{coll}: MPI collective communications, only used within MPI
  processes.

\item \kind{cr}: Checkpoint/restart functionality, used both within
  Open MPI commands and MPI processes.

\item \kind{rpi}: MPI point-to-point communications, only used within
  MPI processes.
\end{itemize}

The Open MPI distribution includes instances of each component type,
referred to as modules.  Each module is an implementation of the
component type which can be selected and used at run-time to provide
services to the Open MPI RTE and MPI communications layer.
Chapters~\ref{sec:lam-mca} and~\ref{sec:mca-ompi} list the modules that
are available in the Open MPI distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Terminology}

\begin{description}
\item[Available] The term ``available'' is used to describe a module
  that reports (at run-time) that it is able to run in the current
  environment.  For example, an RPI module may check whether the
  supporting network hardware is present before reporting whether it
  is available.

  Chapters~\ref{sec:lam-mca} and~\ref{sec:mca-ompi} list the modules
  that are included in the Open MPI distribution, and detail the
  requirements for each of them to indicate whether they are available
  or not.

\item[Selected] The term ``selected'' means that a module has been
  chosen to be used at run-time.  Depending on the module type, zero
  or more modules may be selected.

\item[Scope] Each module selection has a scope depending on the type
  of the module.  ``Scope'' refers to the duration of the module's
  selection.  Table~\ref{tbl:mca-module-scopes} lists the scopes for
  each module type.
\end{description}

\begin{table}[htbp]
  \centering
  \begin{tabular}{|l|p{4in}|}
    \hline
    \multicolumn{1}{|c|}{Type} &
    \multicolumn{1}{|c|}{Scope description} \\
    \hline
    \hline
    \kind{boot} & A module is selected at the beginning of
    \cmd{lamboot} (or \cmd{recon}) and is used for the duration of the
    Open MPI universe. \\
    \hline
    \kind{coll} & A module is selected every time an MPI communicator
    is created (including \mpiconst{MPI\_\-COMM\_\-WORLD} and
    \mpiconst{MPI\_\-COMM\_\-SELF}). It remains in use until that
    communicator has been freed. \\
    \hline
    \kind{cr} & Checkpoint/restart modules are selected at the
    beginning of an MPI job and remain in use until the job
    completes. \\
    \hline
    \kind{rpi} & RPI modules are selected during \mpifunc{MPI\_\-INIT}
    and remain in use until \mpifunc{MPI\_\-FINALIZE} returns. \\
    \hline
  \end{tabular}
  \caption{MCA module types and their corresponding scopes.}
  \label{tbl:mca-module-scopes}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{MCA Parameters}
\label{sec:commands-mca-module-parameters}
\index{MCA!parameter overview}

One of the founding principles of MCA is to allow the passing of
run-time parameters through the MCA framework.  This allows both the
selection of which modules will be used at run-time (by passing
parameters to the MCA framework itself) and the tuning of the run-time
performance of individual modules (by passing parameters to each
module).
%
Although the specific usage of each MCA module parameter is defined by
either the framework or the module that it is passed to, the value of
most parameters will be resolved by the following:

\begin{enumerate}
\item If a valid value is provided via a run-time MCA parameter, use
  that.

\item Otherwise, attempt to calculate a meaningful value at run-time
  or use a compiled-in default value.\footnote{Note that many MCA
    modules provide configure flags to set compile-time defaults
    for ``tweakable'' parameters.
    See~\cite{lamteam03:_lam_mpi_install_guide}.}
\end{enumerate}

As such, it is typically possible to set a parameter's default value
when Open MPI is configured/compiled, but use a different value at run
time.
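
The resolution order above can be sketched in a few lines of C.  This
is a minimal illustration, not Open MPI's internal code: the function
name \texttt{resolve\_param} is hypothetical, and treating an empty
string as ``no valid value'' is an assumption made for the sketch.

\lstset{style=lam-c}
\begin{lstlisting}
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Open MPI): choose the value of a
   parameter.  A non-empty run-time value wins; otherwise fall back to
   the default that was computed or compiled in. */
const char *resolve_param(const char *runtime_value,
                          const char *default_value)
{
    assert(default_value != NULL);   /* a default must always exist */
    if (runtime_value != NULL && runtime_value[0] != '\0') {
        return runtime_value;
    }
    return default_value;
}
\end{lstlisting}

\noindent Under these assumptions, a value supplied at run-time always
wins, and the default is used only when nothing usable was supplied.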

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Naming Conventions}

MCA parameter names are generally strings containing only letters and
underscores, and can typically be broken down into three parts.  For
example, the parameter \mcaparam{boot\_\-rsh\_\-agent} can be broken
into its three components:

\begin{itemize}
\item MCA module type: The first string of the name.  In this case, it
  is \mcaparam{boot}.

\item MCA module name: The second string of the name, corresponding to
  a specific MCA module.  In this case, it is \mcaparam{rsh}.

\item Parameter name: The last string in the name.  It may be an
  arbitrary string, and include multiple underscores.  In this case,
  it is \mcaparam{agent}.
\end{itemize}

Although the parameter name is technically only the last part of the
string, it is only proper to refer to it within its overall context.
Hence, it is correct to say ``the \mcaparam{boot\_\-rsh\_\-agent}
parameter'' as well as ``the \mcaparam{agent} parameter to the
\boot{rsh} boot module''.

Note that the reserved string \mcaparam{base} may appear as a module
name, indicating that the parameter applies to all modules of a given
type.
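
The three-part naming convention can be illustrated with a small C
routine that splits a parameter name at its first two underscores.
This is only a sketch: the structure and function names are
hypothetical, and it assumes that neither the type nor the module name
contains an underscore (a module name that does would need a lookup
against the list of known modules instead).

\lstset{style=lam-c}
\begin{lstlisting}
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the three parts of a parameter name. */
struct mca_name_parts {
    char type[32];      /* e.g., "boot"  */
    char module[32];    /* e.g., "rsh"   */
    char param[64];     /* e.g., "agent" */
};

/* Copy n bytes of src into dst and terminate the string.  Returns 0 on
   success, -1 if the piece is empty or does not fit. */
static int copy_piece(char *dst, size_t dst_len,
                      const char *src, size_t n)
{
    if (n == 0 || n >= dst_len) {
        return -1;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return 0;
}

/* Split "<type>_<module>_<parameter>" at the first two underscores.
   The parameter part may itself contain underscores.  Returns 0 on
   success, -1 if the name does not have three parts. */
int mca_split_name(const char *name, struct mca_name_parts *out)
{
    const char *first, *second;

    assert(name != NULL && out != NULL);
    first = strchr(name, '_');
    if (first == NULL) {
        return -1;
    }
    second = strchr(first + 1, '_');
    if (second == NULL) {
        return -1;
    }
    if (copy_piece(out->type, sizeof(out->type), name,
                   (size_t) (first - name)) != 0 ||
        copy_piece(out->module, sizeof(out->module), first + 1,
                   (size_t) (second - first - 1)) != 0 ||
        copy_piece(out->param, sizeof(out->param), second + 1,
                   strlen(second + 1)) != 0) {
        return -1;
    }
    return 0;
}
\end{lstlisting}

\noindent Given \mcaparam{boot\_\-rsh\_\-agent}, this routine fills in
\mcaparam{boot}, \mcaparam{rsh}, and \mcaparam{agent}, in that order.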

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Setting Parameter Values}

MCA parameters each have a unique name and can take a single string
value.  The parameter/value pairs can be passed by multiple different
mechanisms.  Depending on the target module and the specific
parameter, mechanisms may include:

\begin{itemize}
\item Using command line flags when Open MPI was configured.
\item Setting environment variables before invoking Open MPI commands.
\item Using the \cmdarg{-mca} command line switch to various Open MPI
  commands.
\item Setting attributes on MPI communicators.
\end{itemize}

Users are most likely to utilize the latter three methods.  Each is
described in detail, below.  Listings and explanations of available
MCA parameters are provided in Chapters~\ref{sec:lam-mca}
and~\ref{sec:mca-ompi} (pages~\pageref{sec:lam-mca}
and~\pageref{sec:mca-ompi}, respectively), categorized by MCA type and
module.

%%%%%

\subsubsection{Environment Variables}

MCA parameters can be passed via environment variables prefixed with
\envvar{OMPI\_\-MCA\_}.  For example, selecting which RPI module
to use in an MPI job can be accomplished by setting the environment
variable \envvar{OMPI\_\-MCA\_\-rpi} to a valid RPI module name
(e.g., \cmdarg{tcp}).

Note that environment variables must be set {\em before} invoking the
corresponding Open MPI commands that will use them.
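
A launcher program written in C could set such a variable itself
before starting an Open MPI command.  The following is a hedged sketch
rather than Open MPI code: both helper functions are hypothetical, the
prefix is passed in as an argument instead of being hard-coded, and
\texttt{setenv()} is a POSIX call that may not exist on other
platforms.

\lstset{style=lam-c}
\begin{lstlisting}
#define _POSIX_C_SOURCE 200112L   /* for setenv() in strict C modes */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: build "<prefix><param>" in buf.  Returns 0 on
   success, -1 if the result does not fit. */
int mca_env_name(const char *prefix, const char *param,
                 char *buf, size_t buf_len)
{
    int n;

    assert(prefix != NULL && param != NULL && buf != NULL);
    n = snprintf(buf, buf_len, "%s%s", prefix, param);
    return (n < 0 || (size_t) n >= buf_len) ? -1 : 0;
}

/* Hypothetical helper: export a parameter to the environment so that
   commands started afterwards can see it.  Returns 0 on success. */
int mca_export_param(const char *prefix, const char *param,
                     const char *value)
{
    char name[256];

    assert(value != NULL);
    if (mca_env_name(prefix, param, name, sizeof(name)) != 0) {
        return -1;
    }
    return setenv(name, value, 1);   /* 1 = overwrite existing value */
}
\end{lstlisting}

\noindent For instance, calling \texttt{mca\_export\_param()} with the
prefix \envvar{OMPI\_\-MCA\_}, the parameter \mcaparam{rpi}, and the
value \cmdarg{tcp} sets \envvar{OMPI\_\-MCA\_\-rpi} for any command
that the program launches afterwards.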

%%%%%

\subsubsection{\cmdarg{-mca} Command Line Switch}

Open MPI commands that interact with MCA modules accept the
\cmdarg{-mca} command line switch.  This switch expects two parameters
to follow: the name of the MCA parameter and its corresponding value.
For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca rpi tcp my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent runs \cmd{my\_\-mpi\_\-program} on all available CPUs in
the Open MPI universe using the \rpi{tcp} RPI module.

%%%%%

\subsubsection{Communicator Attributes}

Some MCA types accept MCA parameters via MPI communicator attributes
(notably the MPI collective communication modules).  These parameters
follow the same rules and restrictions as normal MPI attributes.  Note
that for portability between 32-bit and 64-bit systems, care should be
taken when setting and getting attribute values.  The following is an
example of portable C code for setting and getting an attribute:

\lstset{style=lam-c}
\begin{lstlisting}
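int flag, attribute_val;
void *set_attribute;
void **get_attribute;
MPI_Comm comm = MPI_COMM_WORLD;
/* Keyval name is illustrative; use the constant that your
   installation defines. */
int keyval = OMPI_MCA_COLL_BASE_ASSOCIATIVE;

/* Set the value: the attribute stores the address of a
   pointer-sized variable, which must remain valid while in use */
set_attribute = (void *) 1;
MPI_Comm_set_attr(comm, keyval, &set_attribute);

/* Get the value: MPI hands back the stored address */
get_attribute = NULL;
MPI_Comm_get_attr(comm, keyval, &get_attribute, &flag);
if (flag == 1) {
  /* Convert through MPI_Aint, an address-sized integer type */
  attribute_val = (int) (MPI_Aint) *get_attribute;
  printf("Got the attribute value: %d\n", attribute_val);
}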
\end{lstlisting}
% stupid emacs mode: $

Specifically, the following code is neither correct nor portable:

\lstset{style=lam-c}
\begin{lstlisting}
int flag, attribute_val;
MPI_Comm comm = MPI_COMM_WORLD;
/* Keyval name is illustrative; use the constant that your
   installation defines. */
int keyval = OMPI_MCA_COLL_BASE_ASSOCIATIVE;

/* Set the value */
attribute_val = 1;
MPI_Comm_set_attr(comm, keyval, &attribute_val);

/* Get the value (WRONG: MPI stores a pointer-sized value into an int) */
attribute_val = -1;
MPI_Comm_get_attr(comm, keyval, &attribute_val, &flag);
if (flag == 1)
  printf("Got the attribute value: %d\n", attribute_val);
\end{lstlisting}
% stupid emacs mode: $

\index{MCA!overview|)}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Dynamic Shared Object (DSO) Modules}

\changebegin{7.1}

Open MPI can build MCA modules either statically as part of
the MPI libraries or as dynamic shared objects (DSOs).  DSOs are
discovered and loaded into Open MPI processes at run-time.  This allows
adding (or removing) functionality from an existing Open MPI installation
without the need to recompile or re-link user applications.

The default location for DSO MCA modules is \file{\$prefix/lib/lam}.
If otherwise unspecified, this is where Open MPI will look for DSO MCA
modules.  However, the MCA parameter
\imcaparam{base\_\-module\_\-path} can be used to specify a new
colon-delimited path to look for DSO MCA modules.  This allows users
to specify their own location for modules, if desired.

Note that specifying this parameter overrides the default location.
If users wish to augment their search path, they will need to include
the default location in the path specification.

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca base_module_path $prefix/lib/lam:$HOME/my_lam_modules ...
\end{lstlisting}
% stupid emacs mode: $
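
To make the meaning of a colon-delimited path concrete, the following
C sketch walks such a path and visits each directory in the order
given.  It is an illustration only and does not reproduce how Open MPI
itself scans for DSOs; the function name and callback signature are
hypothetical, and skipping empty entries is an assumption.

\lstset{style=lam-c}
\begin{lstlisting}
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: call visit() once for each non-empty entry of
   a colon-delimited path, in order.  The entry is passed as a pointer
   and a length because it is not NUL-terminated.  Returns the number
   of entries found. */
int mca_for_each_dir(const char *path,
                     void (*visit)(const char *dir, size_t len,
                                   void *ctx),
                     void *ctx)
{
    int count = 0;
    const char *start = path;

    assert(path != NULL);
    while (*start != '\0') {
        size_t len = strcspn(start, ":");   /* up to the next ':' */
        if (len > 0) {
            if (visit != NULL) {
                visit(start, len, ctx);
            }
            ++count;
        }
        start += len;
        if (*start == ':') {
            ++start;                        /* skip the delimiter */
        }
    }
    return count;
}
\end{lstlisting}

\noindent A path that lists the default location followed by a user
directory yields two entries, with the default location visited first.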

\changeend{7.1}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Selecting Modules}

As implied by the previous sections, modules are selected at run-time
by examining (in order) user-specified parameters, run-time
calculations, and compiled-in defaults.  The selection process
involves a flexible negotiation phase that can be both tweaked and
arbitrarily overridden by the user and system administrator.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Specifying Modules}

Each MCA type has an implicit MCA parameter, named after the type,
that indicates which module(s) are to be considered for selection.  To
specify that the \rpi{tcp} RPI module should be used, set the MCA
parameter \mcaparam{rpi} to the value \mcaparam{tcp}.  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca rpi tcp my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

The same is true for the other MCA types (\kind{boot}, \kind{cr}, and
\kind{coll}), with the exception that the \kind{coll} type can be used
to specify a comma-separated list of modules to be considered as each
MPI communicator is created (including
\mpiconst{MPI\_\-COMM\_\-WORLD}).  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca coll smp,shmem,lam_basic my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent indicates that the \coll{smp}, \coll{shmem}, and
\coll{lam\_\-basic} modules will all potentially be considered for
selection for each MPI communicator.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Setting Priorities}

Although priorities are typically not useful to individual users,
system administrators may use them to set system-wide defaults that
influence the module selection process in Open MPI jobs.

Each module has an associated priority which plays a role in whether a
module is selected or not.  Specifically, if one or more modules of a
given type are available for selection, the modules' priorities will
be at least one of the factors used to determine which module will
finally be selected.  Priorities are in the range $[-1, 100]$, with
$-1$ indicating that the module should not be considered for
selection, and $100$ being the highest priority.  Ties will be broken
arbitrarily by the MCA framework.

A module's priority can be set at run-time through the normal MCA
parameter mechanisms (i.e., environment variables or using the
\cmdarg{-mca} switch).  Every module has an implicit priority MCA
parameter in the form \mcaparam{$<$type$>$\_\-$<$module
  name$>$\_\-priority}.
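
As a small illustration of this naming pattern and of the priority
range, the following C sketch builds the name of a module's priority
parameter and checks a candidate priority value.  Both functions are
hypothetical helpers written for this example; they are not part of
Open MPI.

\lstset{style=lam-c}
\begin{lstlisting}
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write "<type>_<module>_priority" into buf.
   Returns 0 on success, -1 if the name does not fit. */
int mca_priority_param(const char *type, const char *module,
                       char *buf, size_t buf_len)
{
    int n;

    assert(type != NULL && module != NULL && buf != NULL);
    n = snprintf(buf, buf_len, "%s_%s_priority", type, module);
    return (n < 0 || (size_t) n >= buf_len) ? -1 : 0;
}

/* Hypothetical helper: nonzero if p lies in the range [-1, 100]. */
int mca_priority_is_valid(int p)
{
    return p >= -1 && p <= 100;
}
\end{lstlisting}

\noindent For the \rpi{tcp} RPI module, the first helper produces
\mcaparam{rpi\_\-tcp\_\-priority}.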

For example, a system administrator may set environment variables in
system-wide shell setup files (e.g., \file{/etc/profile},
\file{/etc/bashrc}, or \file{/etc/csh.cshrc}) to change the default
priorities.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Selection Algorithm}

For each component type, the following general selection algorithm is
used:

\begin{itemize}
\item A list of all available modules is created.  If the user
  specified one or more modules for this type, only those modules are
  queried to see if they are available.  Otherwise, all modules are
  queried.

\item The module with the highest priority (and potentially meeting
  other selection criteria, depending on the module's type) will be
  selected.
\end{itemize}
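
The two steps above can be sketched in C as follows.  This is a
simplified model under stated assumptions, not the actual selection
code: the \texttt{mca\_module} structure and both functions are
hypothetical, a priority of $-1$ is treated as ``never select'', and
ties go to the first module encountered, which is just one possible
arbitrary choice.  Type-specific selection criteria are omitted.

\lstset{style=lam-c}
\begin{lstlisting}
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one module of a given type. */
struct mca_module {
    const char *name;
    int priority;      /* -1 .. 100; -1 means "never select" */
    int available;     /* nonzero if the module can run here */
};

/* Return nonzero if name appears in the comma-separated list. */
static int in_list(const char *list, const char *name)
{
    size_t name_len = strlen(name);

    while (*list != '\0') {
        size_t len = strcspn(list, ",");
        if (len == name_len && strncmp(list, name, len) == 0) {
            return 1;
        }
        list += len;
        if (*list == ',') {
            ++list;
        }
    }
    return 0;
}

/* Pick the available module with the highest priority.  If requested
   is non-NULL, only modules named in that comma-separated list are
   considered.  Returns NULL when nothing can be selected. */
const struct mca_module *mca_select(const struct mca_module *mods,
                                    size_t count,
                                    const char *requested)
{
    const struct mca_module *best = NULL;
    size_t i;

    assert(mods != NULL || count == 0);
    for (i = 0; i < count; ++i) {
        if (requested != NULL && !in_list(requested, mods[i].name)) {
            continue;   /* the user did not ask for this module */
        }
        if (!mods[i].available || mods[i].priority < 0) {
            continue;   /* cannot run here, or excluded */
        }
        if (best == NULL || mods[i].priority > best->priority) {
            best = &mods[i];
        }
    }
    return best;
}
\end{lstlisting}

\noindent Passing \texttt{NULL} as the requested list corresponds to
the case where the user specified no modules, so every module is
queried.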

Each MCA type may define its own additional selection rules.  For
example, the selection of \kind{coll}, \kind{cr}, and \kind{rpi}
modules may be interdependent, and may depend on the supported MPI thread
level.  Chapter~\ref{sec:mca-ompi} (page~\pageref{sec:mca-ompi})
details the selection algorithm for MPI MCA modules.