% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University.
%                         All rights reserved.
% Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
%                         All rights reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
%                         University of Stuttgart.  All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
%                         All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%

\chapter{Modular Component Architecture (MCA) Overview}
\label{sec:mca}
\index{MCA!overview|(}
\index{Modular Component Architecture|see {MCA}}

The Modular Component Architecture (MCA) makes up the core of Open
MPI.  It influences the way in which many commands and MPI processes
are executed.  This chapter provides an overview of what MCA is and
what users need to know about how to use it to maximize performance of
MPI applications.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Frameworks and Components}
\index{MCA!component frameworks}

The MCA provides component frameworks for the Open MPI run-time
environment (otherwise known as the Open Run-Time Environment, or
ORTE) and the MPI communications layer.  Components are selected from
each type at run-time and used to effect the RTE and MPI library.

{\Huge JMS Right ideas, but needs much overhauling}

There are currently four types of components used by Open MPI:

\begin{itemize}
\item \kind{boot}: Starting the Open MPI run-time environment, used
  mainly with the \cmd{lamboot} command.
\item \kind{coll}: MPI collective communications, only used within
  MPI processes.
\item \kind{cr}: Checkpoint/restart functionality, used both within
  Open MPI commands and MPI processes.
\item \kind{rpi}: MPI point-to-point communications, only used within
  MPI processes.
\end{itemize}

The Open MPI distribution includes instances of each component type
referred to as modules.
Each module is an implementation of the component type which can be
selected and used at run-time to provide services to the Open MPI RTE
and MPI communications layer.  Chapters~\ref{sec:lam-mca}
and~\ref{sec:mca-ompi} list the modules that are available in the Open
MPI distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Terminology}

\begin{description}
\item[Available] The term ``available'' is used to describe a module
  that reports (at run-time) that it is able to run in the current
  environment.  For example, an RPI module may check to see if
  supporting network hardware is present before reporting that it is
  available or not.

  Chapters~\ref{sec:lam-mca} and~\ref{sec:mca-ompi} list the modules
  that are included in the Open MPI distribution, and detail the
  requirements for each of them to indicate whether they are available
  or not.

\item[Selected] The term ``selected'' means that a module has been
  chosen to be used at run-time.  Depending on the module type, zero
  or more modules may be selected.

\item[Scope] Each module selection has a scope depending on the type
  of the module.  ``Scope'' refers to the duration of the module's
  selection.  Table~\ref{tbl:mca-module-scopes} lists the scopes for
  each module type.
\end{description}

\begin{table}[htbp]
  \centering
  \begin{tabular}{|l|p{4in}|}
    \hline
    \multicolumn{1}{|c|}{Type} &
    \multicolumn{1}{|c|}{Scope description} \\
    \hline
    \hline
    \kind{boot} & A module is selected at the beginning of
    \cmd{lamboot} (or \cmd{recon}) and is used for the duration of the
    Open MPI universe. \\
    \hline
    \kind{coll} & A module is selected every time an MPI communicator
    is created (including \mpiconst{MPI\_\-COMM\_\-WORLD} and
    \mpiconst{MPI\_\-COMM\_\-SELF}).  It remains in use until that
    communicator has been freed.
    \\
    \hline
    \kind{cr} & Checkpoint/restart modules are selected at the
    beginning of an MPI job and remain in use until the job
    completes. \\
    \hline
    \kind{rpi} & RPI modules are selected during
    \mpifunc{MPI\_\-INIT} and remain in use until
    \mpifunc{MPI\_\-FINALIZE} returns. \\
    \hline
  \end{tabular}
  \caption{MCA module types and their corresponding scopes.}
  \label{tbl:mca-module-scopes}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{MCA Parameters}
\label{sec:commands-mca-module-parameters}
\index{MCA!parameter overview}

One of the founding principles of MCA is to allow the passing of
run-time parameters through the MCA framework.  This allows both the
selection of which modules will be used at run-time (by passing
parameters to the MCA framework itself) as well as tuning run-time
performance of individual modules (by passing parameters to each
module).
%
Although the specific usage of each MCA module parameter is defined by
either the framework or the module that it is passed to, the value of
most parameters will be resolved by the following:

\begin{enumerate}
\item If a valid value is provided via a run-time MCA parameter, use
  that.
\item Otherwise, attempt to calculate a meaningful value at run-time
  or use a compiled-in default value.\footnote{Note that many MCA
    modules provide configure flags to set compile-time defaults for
    ``tweakable'' parameters.
    See~\cite{lamteam03:_lam_mpi_install_guide}.}
\end{enumerate}

As such, it is typically possible to set a parameter's default value
when Open MPI is configured/compiled, but use a different value at run
time.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Naming Conventions}

MCA parameter names are generally strings containing only letters and
underscores, and can typically be broken down into three parts.
For example, the parameter \mcaparam{boot\_\-rsh\_\-agent} can be
broken into its three components:

\begin{itemize}
\item MCA module type: The first string of the name.  In this case, it
  is \mcaparam{boot}.
\item MCA module name: The second string of the name, corresponding to
  a specific MCA module.  In this case, it is \mcaparam{rsh}.
\item Parameter name: The last string in the name.  It may be an
  arbitrary string, and include multiple underscores.  In this case,
  it is \mcaparam{agent}.
\end{itemize}

Although the parameter name is technically only the last part of the
string, it is only proper to refer to it within its overall context.
Hence, it is correct to say ``the \mcaparam{boot\_\-rsh\_\-agent}
parameter'' as well as ``the \mcaparam{agent} parameter to the
\boot{rsh} boot module''.

Note that the reserved string \mcaparam{base} may appear as a module
name, referring to the fact that the parameter applies to all modules
of a given type.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Setting Parameter Values}

MCA parameters each have a unique name and can take a single string
value.  The parameter/value pairs can be passed by multiple different
mechanisms.  Depending on the target module and the specific
parameter, mechanisms may include:

\begin{itemize}
\item Using command line flags when Open MPI was configured.
\item Setting environment variables before invoking Open MPI commands.
\item Using the \cmdarg{-mca} command line switch to various Open MPI
  commands.
\item Setting attributes on MPI communicators.
\end{itemize}

Users are most likely to utilize the latter three methods.  Each is
described in detail, below.  Listings and explanations of available
MCA parameters are provided in Chapters~\ref{sec:lam-mca}
and~\ref{sec:mca-ompi} (pages~\pageref{sec:lam-mca}
and~\pageref{sec:mca-ompi}, respectively), categorized by MCA type and
module.
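
The three-part naming convention described in the previous subsection
can be illustrated with a short C sketch.  It is only an illustration:
the \texttt{mca\_param\_parts} structure and the
\texttt{split\_mca\_param} function are hypothetical helpers and are
not part of Open MPI.  The sketch assumes that neither the type nor
the module name contains an underscore, which does not hold for every
module (e.g., \coll{lam\_\-basic}).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type (not an Open MPI API): the three parts of
 * an MCA parameter name such as "boot_rsh_agent". */
struct mca_param_parts {
    char type[32];    /* MCA module type, e.g. "boot" */
    char module[32];  /* MCA module name, e.g. "rsh" (or "base") */
    char name[64];    /* parameter name; may contain underscores */
};

/* Split "type_module_name" at the first two underscores.
 * Assumption: the type and module names contain no underscores.
 * Returns 0 on success, or -1 if the string does not have three
 * non-empty parts or if a part does not fit in its buffer. */
int split_mca_param(const char *full, struct mca_param_parts *out)
{
    assert(full != NULL && out != NULL);

    const char *first = strchr(full, '_');
    if (first == NULL) {
        return -1;
    }
    const char *second = strchr(first + 1, '_');
    if (second == NULL) {
        return -1;
    }

    size_t type_len = (size_t) (first - full);
    size_t module_len = (size_t) (second - first - 1);
    size_t name_len = strlen(second + 1);

    if (type_len == 0 || module_len == 0 || name_len == 0) {
        return -1;
    }
    if (type_len >= sizeof out->type ||
        module_len >= sizeof out->module ||
        name_len >= sizeof out->name) {
        return -1;
    }

    memcpy(out->type, full, type_len);
    out->type[type_len] = '\0';
    memcpy(out->module, first + 1, module_len);
    out->module[module_len] = '\0';
    memcpy(out->name, second + 1, name_len + 1); /* copies the NUL too */
    return 0;
}
```

Under these assumptions, splitting \mcaparam{boot\_\-rsh\_\-agent}
yields the type \mcaparam{boot}, the module \mcaparam{rsh}, and the
parameter name \mcaparam{agent}.  Everything after the second
underscore is kept as the parameter name, so parameter names with
multiple underscores are left intact.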

%%%%%

\subsubsection{Environment Variables}

MCA parameters can be passed via environment variables prefixed with
\envvar{OMPI\_\-MCA}.  For example, selecting which RPI module to use
in an MPI job can be accomplished by setting the environment variable
\envvar{OMPI\_\-MCA\_\-rpi} to a valid RPI module name (e.g.,
\cmdarg{tcp}).

Note that environment variables must be set {\em before} invoking the
corresponding Open MPI commands that will use them.

%%%%%

\subsubsection{\cmdarg{-mca} Command Line Switch}

Open MPI commands that interact with MCA modules accept the
\cmdarg{-mca} command line switch.  This switch expects two parameters
to follow: the name of the MCA parameter and its corresponding value.
For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca rpi tcp my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent runs \cmd{my\_\-mpi\_\-program} on all available CPUs in the
Open MPI universe using the \rpi{tcp} RPI module.

%%%%%

\subsubsection{Communicator Attributes}

Some MCA types accept MCA parameters via MPI communicator attributes
(notably the MPI collective communication modules).  These parameters
follow the same rules and restrictions as normal MPI attributes.  Note
that for portability between 32 and 64 bit systems, care should be
taken when setting and getting attribute values.
The following is an example of portable attribute C code:

\lstset{style=lam-c}
\begin{lstlisting}
int flag, attribute_val;
void *set_attribute;
void **get_attribute;
MPI_Comm comm = MPI_COMM_WORLD;
int keyval = Open MPI_MPI_MCA_COLL_BASE_ASSOCIATIVE;

/* Set the value */
set_attribute = (void *) 1;
MPI_Comm_set_attr(comm, keyval, &set_attribute);

/* Get the value */
get_attribute = NULL;
MPI_Comm_get_attr(comm, keyval, &get_attribute, &flag);
if (flag == 1) {
    attribute_val = (int) *get_attribute;
    printf("Got the attribute value: %d\n", attribute_val);
}
\end{lstlisting}
% stupid emacs mode: $

Specifically, the following code is neither correct nor portable:

\lstset{style=lam-c}
\begin{lstlisting}
int flag, attribute_val;
MPI_Comm comm = MPI_COMM_WORLD;
int keyval = Open MPI_MPI_MCA_COLL_BASE_ASSOCIATIVE;

/* Set the value */
attribute_val = 1;
MPI_Comm_set_attr(comm, keyval, &attribute_val);

/* Get the value */
attribute_val = -1;
MPI_Comm_get_attr(comm, keyval, &attribute_val, &flag);
if (flag == 1)
    printf("Got the attribute value: %d\n", attribute_val);
\end{lstlisting}
% stupid emacs mode: $

\index{MCA!overview|)}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Dynamic Shared Object (DSO) Modules}

\changebegin{7.1}

Open MPI has the capability of building MCA modules statically as part
of the MPI libraries or as dynamic shared objects (DSOs).  DSOs are
discovered and loaded into Open MPI processes at run-time.  This
allows adding (or removing) functionality from an existing Open MPI
installation without the need to recompile or re-link user
applications.

The default location for DSO MCA modules is \file{\$prefix/lib/lam}.
If otherwise unspecified, this is where Open MPI will look for DSO MCA
modules.  However, the MCA parameter \imcaparam{base\_\-module\_\-path}
can be used to specify a new colon-delimited path to look for DSO MCA
modules.
This allows users to specify their own location for modules, if
desired.

Note that specifying this parameter overrides the default location.
If users wish to augment their search path, they will need to include
the default location in the path specification.

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca base_module_path $prefix/lib/lam:$HOME/my_lam_modules ...
\end{lstlisting}
% stupid emacs mode: $

\changeend{7.1}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Selecting Modules}

As implied by the previous sections, modules are selected at run-time
by examining (in order) user-specified parameters, run-time
calculations, and compiled-in defaults.  The selection process
involves a flexible negotiation phase which can be both tweaked and
arbitrarily overridden by the user and system administrator.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Specifying Modules}

Each MCA type has an implicit MCA parameter corresponding to the type
name indicating which module(s) are to be considered for selection.
For example, to specify that the \rpi{tcp} RPI module should be used,
the MCA parameter \mcaparam{rpi} should be set to the value
\mcaparam{tcp}.  For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca rpi tcp my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

The same is true for the other MCA types (\kind{boot}, \kind{cr}, and
\kind{coll}), with the exception that the \kind{coll} type can be used
to specify a comma-separated list of modules to be considered as each
MPI communicator is created (including
\mpiconst{MPI\_\-COMM\_\-WORLD}).
For example:

\lstset{style=lam-cmdline}
\begin{lstlisting}
shell$ mpirun C -mca coll smp,shmem,lam_basic my_mpi_program
\end{lstlisting}
% stupid emacs mode: $

\noindent indicates that the \coll{smp}, \coll{shmem}, and
\coll{lam\_\-basic} modules will potentially all be considered for
selection for each MPI communicator.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Setting Priorities}

Although typically not useful to individual users, system
administrators may use priorities to set system-wide defaults that
influence the module selection process in Open MPI jobs.

Each module has an associated priority which plays a role in whether a
module is selected or not.  Specifically, if one or more modules of a
given type are available for selection, the modules' priorities will
be at least one of the factors used to determine which module will
finally be selected.  Priorities are in the range $[-1, 100]$, with
$-1$ indicating that the module should not be considered for
selection, and $100$ being the highest priority.  Ties will be broken
arbitrarily by the MCA framework.

A module's priority can be set at run-time through the normal MCA
parameter mechanisms (i.e., environment variables or using the
\cmdarg{-mca} parameter).  Every module has an implicit priority MCA
parameter in the form
\mcaparam{$<$type$>$\_\-$<$module name$>$\_\-priority}.

For example, a system administrator may set environment variables in
system-wide shell setup files (e.g., \file{/etc/profile},
\file{/etc/bashrc}, or \file{/etc/csh.cshrc}) to change the default
priorities.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Selection Algorithm}

For each component type, the following general selection algorithm is
used:

\begin{itemize}
\item A list of all available modules is created.  If the user
  specified one or more modules for this type, only those modules are
  queried to see if they are available.  Otherwise, all modules are
  queried.
\item The module with the highest priority (and potentially meeting
  other selection criteria, depending on the module's type) will be
  selected.
\end{itemize}

Each MCA type may define its own additional selection rules.  For
example, the selection of \kind{coll}, \kind{cr}, and \kind{rpi}
modules may be interdependent, and depend on the supported MPI thread
level.  Chapter~\ref{sec:mca-ompi} (page~\pageref{sec:mca-ompi})
details the selection algorithm for MPI MCA modules.
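
The general algorithm above can be sketched in C as follows.  This is
a minimal illustration and not Open MPI's implementation: the
\texttt{module\_info} structure and the \texttt{select\_module}
function are hypothetical, and the sketch ignores the type-specific
rules mentioned above.  It assumes that each module has already
reported whether it is available, treats a priority of $-1$ as ``do
not consider'', and breaks ties by keeping the first module
encountered (the MCA framework itself breaks ties arbitrarily).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one module of a given type. */
struct module_info {
    const char *name;  /* module name, e.g. "tcp" */
    int available;     /* nonzero if the module reports it can run */
    int priority;      /* -1 (never select) through 100 (highest) */
};

/* Return nonzero if `name` may be considered: either the user gave
 * no list (n_requested == 0) or the name appears in the list. */
static int is_requested(const char *name,
                        const char *const *requested,
                        size_t n_requested)
{
    if (n_requested == 0) {
        return 1;
    }
    for (size_t i = 0; i < n_requested; ++i) {
        if (strcmp(name, requested[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Pick the available module with the highest priority, restricted
 * to the user's list when one was given.  Returns NULL when no
 * module qualifies. */
const struct module_info *select_module(const struct module_info *mods,
                                        size_t n_mods,
                                        const char *const *requested,
                                        size_t n_requested)
{
    assert(mods != NULL || n_mods == 0);

    const struct module_info *best = NULL;
    for (size_t i = 0; i < n_mods; ++i) {
        const struct module_info *m = &mods[i];
        if (!is_requested(m->name, requested, n_requested)) {
            continue;  /* the user did not ask for this module */
        }
        if (!m->available || m->priority < 0) {
            continue;  /* cannot run here, or priority is -1 */
        }
        if (best == NULL || m->priority > best->priority) {
            best = m;  /* strictly higher priority wins */
        }
    }
    return best;
}
```

Passing an empty request list corresponds to the case where the user
did not specify any modules, so every module is a candidate.  A
\texttt{NULL} return value corresponds to the case where no module of
the type could be selected.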