% \documentstyle[11pt,psfig]{article}
\documentclass[11pt]{article}
\hoffset=-.7in
\voffset=-.6in
\textwidth=6.5in
\textheight=8.5in
\begin{document}
\vspace*{-1in}
\thispagestyle{empty}
\begin{center}
ARGONNE NATIONAL LABORATORY \\
9700 South Cass Avenue \\
Argonne, IL 60439
\end{center}
\vskip .5 in
\begin{center}
\rule{1.75in}{.01in} \\
\vspace{.1in}
ANL/MCS-TM-XXX \\
\rule{1.75in}{.01in} \\
\vskip 1.3 in
{\Large\bf A Guide to the ROMIO MPI-IO Implementation } \\
by \\ [2ex]
{\large\it Robert Ross, Robert Latham, and Rajeev Thakur}
\vspace{1in}
Mathematics and Computer Science Division
\bigskip
Technical Memorandum No.\ XXX
% \vspace{1.4in}
% Revised May 2004
\end{center}
\vfill
{\small
\noindent
This work was supported by the Mathematical, Information, and
Computational Sciences Division subprogram of the Office of Advanced
Scientific Computing Research, U.S. Department of Energy, under
Contract W-31-109-Eng-38; and by the Scalable I/O Initiative, a
multiagency project funded by the Defense Advanced Research Projects
Agency (Contract DABT63-94-C-0049), the Department of Energy, the
National Aeronautics and Space Administration, and the National
Science Foundation.}
\newpage
%% Line Spacing (e.g., \ls{1} for single, \ls{2} for double, even \ls{1.5})
%%
\newcommand{\ls}[1]
{\dimen0=\fontdimen6\the\font
\lineskip=#1\dimen0
\advance\lineskip.5\fontdimen5\the\font
\advance\lineskip-\dimen0
\lineskiplimit=.9\lineskip
\baselineskip=\lineskip
\advance\baselineskip\dimen0
\normallineskip\lineskip
\normallineskiplimit\lineskiplimit
\normalbaselineskip\baselineskip
\ignorespaces
}
\renewcommand{\baselinestretch}{1}
\newcommand {\ix} {\hspace*{2em}}
\newcommand {\mc} {\multicolumn}
\tableofcontents
\thispagestyle{empty}
\newpage
\pagenumbering{arabic}
\setcounter{page}{1}
\begin{center}
{\bf A Guide to the ROMIO MPI-IO Implementation} \\ [2ex]
by \\ [2ex]
{\it Robert Ross, Robert Latham, and Rajeev Thakur}
\end{center}
\addcontentsline{toc}{section}{Abstract}
\begin{abstract}
\noindent
ROMIO is a high-performance, portable implementation of MPI-IO (the
I/O chapter in the \mbox{MPI Standard}).
This document describes the internals of the ROMIO implementation.
\end{abstract}
\section{Introduction}
The ROMIO MPI-IO implementation, originally written by Rajeev Thakur, has been
in existence since XXX.
... Discussion of the evolution of ROMIO ...
Architecturally, ROMIO is broken up into three layers: a layer implementing
the MPI I/O routines in terms of an abstract device for I/O (ADIO), a layer of
common code implementing a subset of the ADIO interface, and a set of storage
system specific functions that complete the ADIO implementation in terms of
that storage type. These three layers work together to provide I/O support
for MPI applications.
In this document we will discuss the details of the ROMIO implementation,
including the major components, how those components are implemented, and
where those components are located in the ROMIO source tree.
\section{The Directory Structure}
The ROMIO directory structure consists of two main branches, the MPI-IO branch
(mpi-io) and the ADIO branch (adio). The MPI-IO branch contains code that
implements the functions defined in the MPI specification for I/O, such as
MPI\_File\_open. These functions are then written in terms of other functions
that provide an abstract interface to I/O resources, the ADIO functions.
There is an additional glue subdirectory in the MPI-IO branch that defines
functions related to the MPI implementation as a whole, such as how to
allocate MPI\_File structures and how to report errors.
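To make the layering concrete, the following sketch shows roughly how an MPI-IO routine in the mpi-io branch can be expressed in terms of an ADIO call. It is not taken from the ROMIO source: error handling, file view handling, and the conversion of the offset from etype units to bytes are omitted, and the helper MPIO\_File\_resolve is shown only to indicate that the MPI\_File handle is translated into an ADIO\_File structure.
\begin{verbatim}
#include "adio.h"      /* ADIO_File, ADIO_Offset, ADIO_ReadContig, ... */
#include "mpioimpl.h"  /* MPIO_File_resolve */

/* Simplified sketch (not the actual ROMIO code) of how an MPI-IO
 * routine is layered on an ADIO call. */
int read_at_sketch(MPI_File mpi_fh, MPI_Offset offset, void *buf,
                   int count, MPI_Datatype datatype, MPI_Status *status)
{
    int error_code;
    ADIO_File fd = MPIO_File_resolve(mpi_fh);  /* MPI_File -> ADIO_File */

    /* For a contiguous buffer datatype the request maps directly onto
     * the abstract device's contiguous read. */
    ADIO_ReadContig(fd, buf, count, datatype, ADIO_EXPLICIT_OFFSET,
                    offset, status, &error_code);
    return error_code;
}
\end{verbatim}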
Code for the ADIO functions is located under the ADIO branch. This code is
responsible for performing I/O operations on whatever underlying storage is
available. There are two categories of directories in this branch. The first
is the common directory. This directory contains two distinct types of
source: source that is used by all ADIO implementations and source that is
common across many ADIO implementations. This distinction will become more
apparent when we discuss file system implementations.
The second category of directory in the ADIO branch is the file system
specific directory (e.g. ad\_ufs, ad\_pvfs2). These directories provide code
that is specific to a particular file system type and is only built if that
file system type is selected at configure time.
\section{The Configure Process}
... What can be specified, AIO stuff, where romioconf exists, how to add
another Makefile.in into the list.
\section{File System Implementations}
Each file system implementation exists in its own subdirectory under the adio
directory in the source tree. Each of these subdirectories must contain at
least two files, a Makefile.in (describing how to build the code in the
directory) and a C source file describing the mapping of ADIO operations to C
functions.
The common practice is to name this file based on the name of the ADIO
implementation. In the ad\_ufs implementation this file is called ad\_ufs.c,
and contains the following:
\begin{verbatim}
struct ADIOI_Fns_struct ADIO_UFS_operations = {
    ADIOI_UFS_Open,             /* Open */
    ADIOI_GEN_ReadContig,       /* ReadContig */
    ADIOI_GEN_WriteContig,      /* WriteContig */
    ADIOI_GEN_ReadStridedColl,  /* ReadStridedColl */
    ADIOI_GEN_WriteStridedColl, /* WriteStridedColl */
    ADIOI_GEN_SeekIndividual,   /* SeekIndividual */
    ADIOI_GEN_Fcntl,            /* Fcntl */
    ADIOI_GEN_SetInfo,          /* SetInfo */
    ADIOI_GEN_ReadStrided,      /* ReadStrided */
    ADIOI_GEN_WriteStrided,     /* WriteStrided */
    ADIOI_GEN_Close,            /* Close */
    ADIOI_GEN_IreadContig,      /* IreadContig */
    ADIOI_GEN_IwriteContig,     /* IwriteContig */
    ADIOI_GEN_IODone,           /* ReadDone */
    ADIOI_GEN_IODone,           /* WriteDone */
    ADIOI_GEN_IOComplete,       /* ReadComplete */
    ADIOI_GEN_IOComplete,       /* WriteComplete */
    ADIOI_GEN_IreadStrided,     /* IreadStrided */
    ADIOI_GEN_IwriteStrided,    /* IwriteStrided */
    ADIOI_GEN_Flush,            /* Flush */
    ADIOI_GEN_Resize,           /* Resize */
    ADIOI_GEN_Delete,           /* Delete */
};
\end{verbatim}
The ADIOI\_Fns\_struct structure is defined in adio/include/adioi.h. This
structure holds pointers to appropriate functions for a given file system
type. "Generic" functions, defined in adio/common, are denoted by the
"ADIOI\_GEN" prefix, while file system specific functions use a file system
related prefix. In this example, the only file system specific function is
ADIOI\_UFS\_Open. All other operations use the generic versions.
Typically a third file, a header with file system specific defines and
includes, is also provided and named based on the name of the ADIO
implementation (e.g. ad\_ufs.h).
Because the UFS implementation provides its own open function, that code must be provided in the ad\_ufs subdirectory. That function is implemented in adio/ad\_ufs/ad\_ufs\_open.c.
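As a rough illustration of what such a file system specific function looks like, the following sketch mirrors the shape of a POSIX-based Open. It is not the contents of ad\_ufs\_open.c: access mode translation, hint processing, deferred open handling, and error reporting are all simplified.
\begin{verbatim}
#include <fcntl.h>
#include "adio.h"

/* Simplified sketch of a POSIX-style ADIO Open. */
void ufs_open_sketch(ADIO_File fd, int *error_code)
{
    int amode = O_RDWR;              /* derived from fd->access_mode */
    if (fd->access_mode & ADIO_CREATE)
        amode |= O_CREAT;

    fd->fd_sys = open(fd->filename, amode, 0666);
    *error_code = (fd->fd_sys == -1) ? MPI_ERR_IO : MPI_SUCCESS;
}
\end{verbatim}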
\section{Generic Functions}
As we saw in the discussion above, generic ADIO function implementations are
used to minimize the amount of code in the ROMIO tree by sharing common
functionality between ADIO implementations. As the ROMIO implementation has
grown, a few categories of generic implementations have developed. At this
time, these are all lumped into the adio/common subdirectory together, which
can be confusing.
The easiest category of generic functions to understand consists of those that
implement functionality in terms of some other ADIO function.
ADIOI\_GEN\_ReadStridedColl is a good example of this type of function and is
implemented in adio/common/ad\_read\_coll.c. This function implements
collective read operations (e.g. MPI\_File\_read\_at\_all). We will discuss how
it works later in this document, but for the time being it is sufficient to
note that it is written in terms of ADIO ReadStrided or ReadContig calls.
A second category of generic functions consists of those that implement
functionality in terms of POSIX I/O calls. ADIOI\_GEN\_ReadContig (adio/common/ad\_read.c) is
a good example of this type of function. These "generic" functions are the
result of a large number of ADIO implementations that are largely POSIX I/O
based, such as the UFS, XFS, and PANFS implementations. We have discussed
moving these functions into a separate common/posix subdirectory and renaming
them with ADIOI\_POSIX prefixes, but this has not been done as of the writing
of this document.
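The following sketch illustrates the shape of such a POSIX-based contiguous read. It is not the actual ADIOI\_GEN\_ReadContig: status handling, partial reads, and the update of the individual file pointer after the read are omitted.
\begin{verbatim}
#include <unistd.h>
#include "adio.h"

/* Simplified sketch of a POSIX-based ReadContig. */
void gen_read_contig_sketch(ADIO_File fd, void *buf, int count,
                            MPI_Datatype datatype, int file_ptr_type,
                            ADIO_Offset offset, ADIO_Status *status,
                            int *error_code)
{
    MPI_Count dtype_size;
    MPI_Type_size_x(datatype, &dtype_size);
    ADIO_Offset len = (ADIO_Offset) count * dtype_size;

    if (file_ptr_type == ADIO_INDIVIDUAL)
        offset = fd->fp_ind;     /* use the individual file pointer */

    ssize_t nread = pread(fd->fd_sys, buf, (size_t) len, offset);
    *error_code = (nread == -1) ? MPI_ERR_IO : MPI_SUCCESS;
}
\end{verbatim}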
The next category of generic functions holds functions that do not actually
require I/O at all. ADIOI\_GEN\_SeekIndividual (adio/common/ad\_seek.c) is a
good example of this. Since we don't need to actually perform I/O at seek
time, we can just update local variables at each process. In fact, one could
argue that we no longer need the ADIO SeekIndividual function at all - all the
ADIO implementations simply use this generic version (with the exception of
TESTFS, which prints the value as well).
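A sketch of the idea, with the whence handling omitted, is shown below; only the view-related fields of the ADIO\_File structure are touched, and no system calls are made.
\begin{verbatim}
#include "adio.h"

/* Simplified sketch of the generic seek: local bookkeeping only. */
ADIO_Offset gen_seek_individual_sketch(ADIO_File fd, ADIO_Offset offset,
                                       int whence, int *error_code)
{
    ADIO_Offset off;

    (void) whence;   /* whence handling omitted in this sketch */
    off = fd->disp + (ADIO_Offset) fd->etype_size * offset;
    fd->fp_ind = off;            /* update the individual file pointer */
    *error_code = MPI_SUCCESS;
    return off;
}
\end{verbatim}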
The next category of generic functions consists of the "FAKE" functions (e.g.
ADIOI\_FAKE\_IODone implemented in adio/common/ad\_done\_fake.c). These functions
are all related to asynchronous I/O (AIO) operations. These implement the AIO
operations in terms of blocking operations - in other words, they follow the
standard but do not allow for overlap of I/O and computation or communication.
These are used in cases where AIO support is otherwise unavailable or
unimplemented.
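The following sketch shows the general idea using MPI generalized requests. The real code uses ROMIO's own request helpers, and a real query function would also fill in the status object; the callbacks here are illustrative stubs.
\begin{verbatim}
#include "adio.h"

static int query_fn(void *extra, MPI_Status *status) { return MPI_SUCCESS; }
static int free_fn(void *extra)                      { return MPI_SUCCESS; }
static int cancel_fn(void *extra, int complete)      { return MPI_SUCCESS; }

/* Simplified sketch of a "FAKE" nonblocking read: do the blocking read
 * now, then return a request that is already complete, so a later
 * Wait/Test has nothing left to do. */
void fake_iread_contig_sketch(ADIO_File fd, void *buf, int count,
                              MPI_Datatype datatype, int file_ptr_type,
                              ADIO_Offset offset, MPI_Request *request,
                              int *error_code)
{
    ADIO_Status status;

    ADIO_ReadContig(fd, buf, count, datatype, file_ptr_type,
                    offset, &status, error_code);

    MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, request);
    MPI_Grequest_complete(*request);
}
\end{verbatim}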
The final category of generic functions consists of the "na\"{\i}ve" functions (e.g.
ADIOI\_GEN\_WriteStrided\_naive in adio/common/ad\_write\_str\_naive.c). These
functions avoid the use of certain optimizations, such as data sieving.
\subsection{Other Things in adio/common}
... what else is in there?
\subsection{Calling ADIO Functions}
Throughout the code you will see calls to functions such as ADIO\_ReadContig.
There is no such function - this is actually a macro defined in
adio/include/adioi.h that calls the particular function out of the correct
ADIOI\_Fns\_struct for the file being accessed. This is done for convenience.
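The macro has roughly the following shape (simplified from the definition in adio/include/adioi.h); the call is simply forwarded to the function pointer installed for the file system on which the file was opened.
\begin{verbatim}
#define ADIO_ReadContig(fd, buf, count, datatype, file_ptr_type,      \
                        offset, status, error_code)                   \
    (*(fd)->fns->ADIOI_xxx_ReadContig)(fd, buf, count, datatype,      \
                                       file_ptr_type, offset, status, \
                                       error_code)
\end{verbatim}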
There are exceptions, such as ADIO\_Open and ADIO\_Close, which are real functions rather than dispatch macros.
\section{ROMIO Implementation Details}
The ROMIO Implementation relies on some basic concepts in order to operate and
to optimize I/O access. In this section we will discuss these concepts and
how they are implemented within ROMIO. Before we do that though, we will
discuss the core data structure of ROMIO, the ADIO\_File structure.
\subsection{ADIO\_File}
... discussion ...
\subsection{I/O Aggregation and Aggregators}
When performing collective I/O operations, it is often to our advantage to
combine operations or eliminate redundant operations altogether. We call this
combining process "aggregation", and processes that perform these combined
operations aggregators.
Aggregators are defined at the time the file is opened. A collection of MPI
hints can be used to tune what processes become aggregators for a given file
(see ROMIO User's Guide). The aggregators will then interact with the file
system during collective operations.
Note that it is possible to implement a system where ALL I/O operations pass
exclusively through aggregators, including independent I/O operations from
non-aggregators. However, this would require a guarantee of progress from the
aggregators that for portability would mean adding a thread to manage I/O. We
have chosen not to pursue this path at this time, so independent operations
continue to be serviced by the process making the call.
... how implemented ...
Rank 0 in the communicator opening a file \emph{always} processes the
cb\_config\_list hint using ADIOI\_cb\_config\_list\_parse. A previous call to
ADIOI\_cb\_gather\_name\_array had collected the processor names from all hosts
into an array that is cached on the communicator (so we don't have to gather
it more than once). This creates an ordered array of ranks (relative to the
communicator used to open the file) that will be aggregators. This array is
distributed to all processes using ADIOI\_cb\_bcast\_rank\_map. Aggregators are
referenced by their rank in the communicator used to open the file. These
ranks are stored in fd->hints->ranklist[].
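A process can tell whether it is an aggregator by looking for its rank in this list. A minimal sketch (the real code caches this result rather than rescanning the list) is:
\begin{verbatim}
#include "adio.h"
#include "adioi.h"

/* Return nonzero if rank is one of the cb_nodes aggregators recorded
 * in fd->hints->ranklist. */
static int is_aggregator_sketch(ADIO_File fd, int rank)
{
    int i;
    for (i = 0; i < fd->hints->cb_nodes; i++) {
        if (fd->hints->ranklist[i] == rank)
            return 1;
    }
    return 0;
}
\end{verbatim}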
Note that this could be a big list for very large runs. If we were to
restrict aggregators to a rank order subset, we could use a bitfield instead.
If the user specified hints and the conditions for deferred open are met, then a
separate communicator (fd->agg\_comm) is also set up containing all the
aggregators, ordered by their original ranks (not by their position in the rank
list). Otherwise this communicator is set to MPI\_COMM\_NULL; on non-aggregators
it is always MPI\_COMM\_NULL. This communicator is currently only
used at ADIO\_Close (adio/common/ad\_close.c), but could be useful in two-phase
I/O as well (discussed later).
\subsection{Deferred Open}
We do not always want all processes to attempt to actually open a file when
MPI\_File\_open is called. We might want to skip this open because some
processes (non-aggregators) cannot access the file at all and would get an
error, or because we want to avoid a storm of open system calls hitting the
file system all at once. In either case, ROMIO implements a
"deferred open" mode that allows some processes to avoid opening the file
until such time as they perform an independent I/O operation on the file (see
ROMIO User's Guide).
Deferred open has a broad impact on the ROMIO implementation, because with its
addition there are now many places where we must first check whether the file
system specific ADIO Open has been called before performing I/O. This impact is
confined to the MPI-IO layer by guaranteeing that a process has made the file
system specific ADIO Open call before it calls any ADIO read or write
function.
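The general pattern is a check at the top of the independent I/O paths, sketched below; the real code wraps this in a macro and adds error handling, but fd->is\_open and ADIO\_ImmediateOpen convey the idea.
\begin{verbatim}
#include "adio.h"

/* Simplified sketch of the deferred-open check made before an
 * independent read or write. */
static void ensure_opened_sketch(ADIO_File fd, int *error_code)
{
    if (!fd->is_open) {
        /* This process skipped the file system open at MPI_File_open
         * time; perform it now, before the first independent access. */
        ADIO_ImmediateOpen(fd, error_code);
    }
    else {
        *error_code = MPI_SUCCESS;
    }
}
\end{verbatim}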
... how implemented ...
\subsection{Two-Phase I/O}
Two-Phase I/O is a technique for increasing the efficiency of I/O operations
by reordering data between processes, either before writes, or after reads.
ROMIO implements two-phase I/O as part of the generic implementations of
ADIO\_WriteStridedColl and ADIO\_ReadStridedColl. These implementations in turn
rely heavily on the aggregation code to determine what processes will actually
perform I/O on behalf of the application as a whole.
\subsection{Data Sieving}
Data sieving is a single-process technique for reducing the number of I/O
operations used to service an MPI read or write operation: rather than
accessing each desired region separately, a single contiguous region of the
file that contains more than one desired region is accessed at once. Because
I/O operations often require data movement across the network, this is usually
a more efficient way to access data.
Data sieving is implemented in the common strided I/O routines
(adio/common/ad\_write\_str.c and adio/common/ad\_read\_str.c). These functions
use the contig read and write routines to perform actual I/O. In the case of
a write operation, a read/modify/write sequence is used. In that case, as
well as in the atomic mode case, locking is required on the region. Some of
the ADIO implementations do not currently support locking, and in those cases
it would be erroneous to use the generic strided I/O routines.
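The following heavily simplified sketch shows the read-modify-write sequence for a data sieving write. The piece structure, function name, and buffer handling are illustrative only; the real code works from flattened datatype offset/length lists and handles many corner cases.
\begin{verbatim}
#include <stdio.h>    /* SEEK_SET */
#include <string.h>   /* memcpy */
#include "adio.h"
#include "adioi.h"    /* ADIOI_Malloc, ADIOI_WRITE_LOCK, ... */

/* One user-requested piece inside the covering region (illustrative). */
struct piece { ADIO_Offset off; const char *buf; ADIO_Offset len; };

/* Lock the covering region, read it, overlay the user's pieces, and
 * write the whole region back. */
void sieve_write_sketch(ADIO_File fd, ADIO_Offset start, ADIO_Offset end,
                        const struct piece *pieces, int npieces,
                        int *error_code)
{
    ADIO_Offset len = end - start;
    char *tmp = (char *) ADIOI_Malloc(len);
    ADIO_Status status;
    int i;

    ADIOI_WRITE_LOCK(fd, start, SEEK_SET, len);             /* lock   */
    ADIO_ReadContig(fd, tmp, (int) len, MPI_BYTE,           /* read   */
                    ADIO_EXPLICIT_OFFSET, start, &status, error_code);
    for (i = 0; i < npieces; i++)                           /* modify */
        memcpy(tmp + (pieces[i].off - start), pieces[i].buf,
               (size_t) pieces[i].len);
    ADIO_WriteContig(fd, tmp, (int) len, MPI_BYTE,          /* write  */
                     ADIO_EXPLICIT_OFFSET, start, &status, error_code);
    ADIOI_UNLOCK(fd, start, SEEK_SET, len);                 /* unlock */

    ADIOI_Free(tmp);
}
\end{verbatim}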
\subsection{Shared File Pointers}
Because none of the file systems supported by ROMIO currently provides a shared
file pointer mode, ROMIO must implement shared file pointers under the covers
on its own.
Currently ROMIO implements shared file pointers by storing the file pointer
value in a separate file...
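A rough sketch of the idea is shown below. It is not the actual ROMIO code: it assumes the hidden file has been opened as its own ADIO\_File, and it omits error handling and the bookkeeping done by the real shared file pointer routines.
\begin{verbatim}
#include <stdio.h>    /* SEEK_SET */
#include "adio.h"
#include "adioi.h"

/* Fetch-and-add on the shared file pointer kept in a hidden file:
 * lock, read the current value, advance it by the size of this access,
 * write it back, and unlock. */
void shared_fp_fetch_add_sketch(ADIO_File shfp_fd, ADIO_Offset incr,
                                ADIO_Offset *my_offset, int *error_code)
{
    ADIO_Status status;
    ADIO_Offset next;

    ADIOI_WRITE_LOCK(shfp_fd, 0, SEEK_SET, sizeof(ADIO_Offset));
    ADIO_ReadContig(shfp_fd, my_offset, sizeof(ADIO_Offset), MPI_BYTE,
                    ADIO_EXPLICIT_OFFSET, 0, &status, error_code);
    next = *my_offset + incr;
    ADIO_WriteContig(shfp_fd, &next, sizeof(ADIO_Offset), MPI_BYTE,
                     ADIO_EXPLICIT_OFFSET, 0, &status, error_code);
    ADIOI_UNLOCK(shfp_fd, 0, SEEK_SET, sizeof(ADIO_Offset));
}
\end{verbatim}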
Note that the ROMIO team has devised a portable method for implementing shared
file pointers using only MPI functions. However, this method has
not yet been implemented in ROMIO.
The name of the shared file pointer file is selected at the end of mpi-io/open.c.
\subsection{Error Handling}
\subsection{MPI and MPIO Requests}
\section*{Appendix A: ADIO Functions and Semantics}
ADIOI\_Open(ADIO\_File fd, int *error\_code)
Open is used in a strange way in ROMIO, as described previously.
The Open function is used to perform whatever operations are necessary prior
to actually accessing a file using read or write. The file name for the file
is stored in fd->filename prior to Open being called.
Note that when deferred open is in effect, not all processes necessarily call
Open at MPI\_File\_open time; instead, a process may call Open when it first
performs independent I/O. This can result in somewhat unusual error returns to
processes (e.g. learning that a file is not accessible at write time).
ADIOI\_ReadContig(ADIO\_File fd, void *buf, int count, MPI\_Datatype datatype,
int file\_ptr\_type, ADIO\_Offset offset, ADIO\_Status *status, int *error\_code)
ReadContig is used to read a contiguous region from a file into a contiguous
buffer. The datatype (which refers to the buffer) can be assumed to be
contiguous. The offset is in bytes and is an absolute offset if
ADIO\_EXPLICIT\_OFFSET was passed as the file\_ptr\_type or relative to the
current individual file pointer if ADIO\_INDIVIDUAL was passed as
file\_ptr\_type. Open has been called by this process prior to the call to
ReadContig. There is no guarantee that any other processes will call this
function at the same time.
ADIOI\_WriteContig(ADIO\_File fd, void *buf, int count, MPI\_Datatype datatype,
int file\_ptr\_type, ADIO\_Offset offset, ADIO\_Status *status, int *error\_code)
WriteContig is used to write a contiguous region to a file from a contiguous
buffer. The datatype (which refers to the buffer) can be assumed to be
contiguous. The offset is in bytes and is an absolute offset if
ADIO\_EXPLICIT\_OFFSET was passed as the file\_ptr\_type or relative to the
current individual file pointer if ADIO\_INDIVIDUAL was passed as
file\_ptr\_type. Open has been called by this process prior to the call to
WriteContig. There is no guarantee that any other processes will call this
function at the same time.
ADIOI\_ReadStridedColl
ADIOI\_WriteStridedColl
ADIOI\_SeekIndividual
ADIOI\_Fcntl
ADIOI\_SetInfo
ADIOI\_ReadStrided
ADIOI\_WriteStrided
ADIOI\_Close(ADIO\_File fd, int *error\_code)
Close is responsible for releasing any resources associated with an open file.
It is called on all processes that called the corresponding ADIOI Open, which
might not be all the processes that opened the file (due to deferred open).
Thus it is not safe to perform collective communication among all processes in
the communicator during Close, although collective communication between
aggregators would be safe (if desired).
For performance reasons ROMIO does not guarantee that all file data is written
to "storage" at MPI\_File\_close, instead only performing synchronization
operations at MPI\_File\_sync time. As a result, our Close implementations do
not typically call a sync. However, any locally cached data, if any, should
be passed on to the underlying storage system at this time.
Note that ADIOI\_GEN\_Close is implemented in adio/common/adi\_close.c;
ad\_close.c implements ADIO\_Close, which is called by all processes that opened
the file.
ADIOI\_IreadContig
ADIOI\_IwriteContig
ADIOI\_ReadDone
ADIOI\_WriteDone
ADIOI\_ReadComplete
ADIOI\_WriteComplete
ADIOI\_IreadStrided
ADIOI\_IwriteStrided
ADIOI\_Flush
ADIOI\_Resize(ADIO\_File fd, ADIO\_Offset size, int *error\_code)
Resize is called collectively by all processes that opened the file referenced
by fd. It is not required that the Resize implementation block until all
processes have completed resize operations, but each process should be able to
see the correct size with a corresponding MPI\_File\_get\_size operation (an
independent operation that results in an ADIO Fcntl to obtain the file size).
ADIOI\_Delete(char *filename, int *error\_code)
Delete is called independently, and because only a filename is passed, there
is no opportunity to coordinate deletion if an application were to choose to
have all processes call MPI\_File\_delete. In practice this is unlikely to be
an issue.
\section*{Appendix B: Status of ADIO Implementations}
... who wrote what, status, etc.
\section*{Appendix C: Adding a New ADIO Implementation}
\section*{References}
\end{document}