495 строки
20 KiB
TeX
495 строки
20 KiB
TeX
% \documentstyle[11pt,psfig]{article}
|
||
\documentclass[11pt]{article}
|
||
\hoffset=-.7in
|
||
\voffset=-.6in
|
||
\textwidth=6.5in
|
||
\textheight=8.5in
|
||
|
||
\begin{document}
|
||
\vspace*{-1in}
|
||
\thispagestyle{empty}
|
||
\begin{center}
|
||
ARGONNE NATIONAL LABORATORY \\
|
||
9700 South Cass Avenue \\
|
||
Argonne, IL 60439
|
||
\end{center}
|
||
\vskip .5 in
|
||
|
||
\begin{center}
|
||
\rule{1.75in}{.01in} \\
|
||
\vspace{.1in}
|
||
|
||
ANL/MCS-TM-XXX \\
|
||
|
||
\rule{1.75in}{.01in} \\
|
||
|
||
\vskip 1.3 in
|
||
{\Large\bf A Guide to the ROMIO MPI-IO Implementation } \\
|
||
by \\ [2ex]
|
||
{\large\it Robert Ross, Robert Latham, and Rajeev Thakur}
|
||
\vspace{1in}
|
||
|
||
Mathematics and Computer Science Division
|
||
|
||
\bigskip
|
||
|
||
Technical Memorandum No.\ XXX
|
||
|
||
|
||
% \vspace{1.4in}
|
||
% Revised May 2004
|
||
|
||
\end{center}
|
||
|
||
\vfill
|
||
|
||
{\small
|
||
\noindent
|
||
This work was supported by the Mathematical, Information, and
|
||
Computational Sciences Division subprogram of the Office of Advanced
|
||
Scientific Computing Research, U.S. Department of Energy, under
|
||
Contract W-31-109-Eng-38; and by the Scalable I/O Initiative, a
|
||
multiagency project funded by the Defense Advanced Research Projects
|
||
Agency (Contract DABT63-94-C-0049), the Department of Energy, the
|
||
National Aeronautics and Space Administration, and the National
|
||
Science Foundation.}
|
||
|
||
\newpage
|
||
|
||
|
||
%% Line Spacing (e.g., \ls{1} for single, \ls{2} for double, even \ls{1.5})
|
||
%%
|
||
|
||
\newcommand{\ls}[1]
|
||
{\dimen0=\fontdimen6\the\font
|
||
\lineskip=#1\dimen0
|
||
\advance\lineskip.5\fontdimen5\the\font
|
||
\advance\lineskip-\dimen0
|
||
\lineskiplimit=.9\lineskip
|
||
\baselineskip=\lineskip
|
||
\advance\baselineskip\dimen0
|
||
\normallineskip\lineskip
|
||
\normallineskiplimit\lineskiplimit
|
||
\normalbaselineskip\baselineskip
|
||
\ignorespaces
|
||
}
|
||
\renewcommand{\baselinestretch}{1}
|
||
\newcommand {\ix} {\hspace*{2em}}
|
||
\newcommand {\mc} {\multicolumn}
|
||
|
||
|
||
\tableofcontents
|
||
\thispagestyle{empty}
|
||
\newpage
|
||
|
||
\pagenumbering{arabic}
|
||
\setcounter{page}{1}
|
||
\begin{center}
|
||
{\bf Users Guide for ROMIO: A High-Performance,\\[1ex]
|
||
Portable MPI-IO Implementation} \\ [2ex]
|
||
by \\ [2ex]
|
||
{\it Rajeev Thakur, Robert Ross, Ewing Lusk, and William Gropp}
|
||
|
||
\end{center}
|
||
\addcontentsline{toc}{section}{Abstract}
|
||
\begin{abstract}
|
||
\noindent
|
||
ROMIO is a high-performance, portable implementation of MPI-IO (the
|
||
I/O chapter in the \mbox{MPI Standard}).
|
||
This document describes the internals of the ROMIO implementation.
|
||
\end{abstract}
|
||
|
||
\section{Introduction}
|
||
|
||
The ROMIO MPI-IO implementation, originally written by Rajeev Thakur, has been
|
||
in existence since XXX.
|
||
|
||
... Discussion of the evolution of ROMIO ...
|
||
|
||
Architecturally, ROMIO is broken up into three layers: a layer implementing
|
||
the MPI I/O routines in terms of an abstract device for I/O (ADIO), a layer of
|
||
common code implementing a subset of the ADIO interface, and a set of storage
|
||
system specific functions that complete the ADIO implementation in terms of
|
||
that storage type. These three layers work together to provide I/O support
|
||
for MPI applications.
|
||
|
||
In this document we will discuss the details of the ROMIO implementation,
|
||
including the major components, how those components are implemented, and
|
||
where those components are located in the ROMIO source tree.
|
||
|
||
\section{The Directory Structure}
|
||
|
||
The ROMIO directory structure consists of two main branches, the MPI-IO branch
|
||
(mpi-io) and the ADIO branch (adio). The MPI-IO branch contains code that
|
||
implements the functions defined in the MPI specification for I/O, such as
|
||
MPI\_File\_open. These functions are then written in terms of other functions
|
||
that provide an abstract interface to I/O resources, the ADIO functions.
|
||
There is an additional glue subdirectory in the MPI-IO branch that defines
|
||
functions related to the MPI implementation as a whole, such as how to
|
||
allocate MPI\_File structures and how to report errors.
|
||
|
||
Code for the ADIO functions is located under the ADIO branch. This code is
|
||
responsible for performing I/O operations on whatever underlying storage is
|
||
available. There are two categories of directories in this branch. The first
|
||
is the common directory. This directory contains two distinct types of
|
||
source: source that is used by all ADIO implementations and source that is
|
||
common across many ADIO implementations. This distinction will become more
|
||
apparent when we discuss file system implementations.
|
||
|
||
The second category of directory in the ADIO branch is the file system
|
||
specific directory (e.g. ad\_ufs, ad\_pvfs2). These directories provide code
|
||
that is specific to a particular file system type and is only built if that
|
||
file system type is selected at configure time.
|
||
|
||
\section{The Configure Process}
|
||
|
||
... What can be specified, AIO stuff, where romioconf exists, how to add
|
||
another Makefile.in into the list.
|
||
|
||
\section{File System Implementations}
|
||
|
||
Each file system implementation exists in its own subdirectory under the adio
|
||
directory in the source tree. Each of these subdirectories must contain at
|
||
least two files, a Makefile.in (describing how to build the code in the
|
||
directory) and a C source file describing the mapping of ADIO operations to C
|
||
functions.
|
||
|
||
The common practice is to name this file based on the name of the ADIO
|
||
implementation. In the ad\_ufs implementation this file is called ad\_ufs.c,
|
||
and contains the following:
|
||
|
||
\begin{verbatim}
|
||
struct ADIOI_Fns_struct ADIO_UFS_operations = {
|
||
ADIOI_UFS_Open, /* Open */
|
||
ADIOI_GEN_ReadContig, /* ReadContig */
|
||
ADIOI_GEN_WriteContig, /* WriteContig */
|
||
ADIOI_GEN_ReadStridedColl, /* ReadStridedColl */
|
||
ADIOI_GEN_WriteStridedColl, /* WriteStridedColl */
|
||
ADIOI_GEN_SeekIndividual, /* SeekIndividual */
|
||
ADIOI_GEN_Fcntl, /* Fcntl */
|
||
ADIOI_GEN_SetInfo, /* SetInfo */
|
||
ADIOI_GEN_ReadStrided, /* ReadStrided */
|
||
ADIOI_GEN_WriteStrided, /* WriteStrided */
|
||
ADIOI_GEN_Close, /* Close */
|
||
ADIOI_GEN_IreadContig, /* IreadContig */
|
||
ADIOI_GEN_IwriteContig, /* IwriteContig */
|
||
ADIOI_GEN_IODone, /* ReadDone */
|
||
ADIOI_GEN_IODone, /* WriteDone */
|
||
ADIOI_GEN_IOComplete, /* ReadComplete */
|
||
ADIOI_GEN_IOComplete, /* WriteComplete */
|
||
ADIOI_GEN_IreadStrided, /* IreadStrided */
|
||
ADIOI_GEN_IwriteStrided, /* IwriteStrided */
|
||
ADIOI_GEN_Flush, /* Flush */
|
||
ADIOI_GEN_Resize, /* Resize */
|
||
ADIOI_GEN_Delete, /* Delete */
|
||
};
|
||
\end{verbatim}
|
||
|
||
The ADIOI\_Fns\_struct structure is defined in adio/include/adioi.h. This
|
||
structure holds pointers to appropriate functions for a given file system
|
||
type. "Generic" functions, defined in adio/common, are denoted by the
|
||
"ADIOI\_GEN" prefix, while file system specific functions use a file system
|
||
related prefix. In this example, the only file system specific function is
|
||
ADIOI\_UFS\_Open. All other operations use the generic versions.
|
||
|
||
Typically a third file, a header with file system specific defines and
|
||
includes, is also provided and named based on the name of the ADIO
|
||
implementation (e.g. ad\_ufs.h).
|
||
|
||
Because the UFS implementation provides its own open function, that code must be provided in the ad\_ufs subdirectory. That function is implemented in adio/ad\_ufs/ad\_ufs\_open.c.
|
||
|
||
\section{Generic Functions}
|
||
|
||
As we saw in the discussion above, generic ADIO function implementations are
|
||
used to minimize the amount of code in the ROMIO tree by sharing common
|
||
functionality between ADIO implementations. As the ROMIO implementation has
|
||
grown, a few categories of generic implementations have developed. At this
|
||
time, these are all lumped into the adio/common subdirectory together, which
|
||
can be confusing.
|
||
|
||
The easiest category of generic functions to understand is the ones that
|
||
implement functionality in terms of some other ADIO function.
|
||
ADIOI\_GEN\_ReadStridedColl is a good example of this type of function and is
|
||
implemented in adio/common/ad\_read\_coll.c. This function implements
|
||
collective read operations (e.g. MPI\_File\_read\_at\_all). We will discuss how
|
||
it works later in this document, but for the time being it is sufficient to
|
||
note that it is written in terms of ADIO ReadStrided or ReadContig calls.
|
||
|
||
A second category of generic functions are ones that implement functionality
|
||
in terms of POSIX I/O calls. ADIOI\_GEN\_ReadContig (adio/common/ad\_read.c) is
|
||
a good example of this type of function. These "generic" functions are the
|
||
result of a large number of ADIO implementations that are largely POSIX I/O
|
||
based, such as the UFS, XFS, and PANFS implementations. We have discussed
|
||
moving these functions into a separate common/posix subdirectory and renaming
|
||
them with ADIOI\_POSIX prefixes, but this has not been done as of the writing
|
||
of this document.
|
||
|
||
The next category of generic functions holds functions that do not actually
|
||
require I/O at all. ADIOI\_GEN\_SeekIndividual (adio/common/ad\_seek.c) is a
|
||
good example of this. Since we don't need to actually perform I/O at seek
|
||
time, we can just update local variables at each process. In fact, one could
|
||
argue that we no longer need the ADIO SeekIndividual function at all - all the
|
||
ADIO implementations simply use this generic version (with the exception of
|
||
TESTFS, which prints the value as well).
|
||
|
||
The next category of generic functions are the "FAKE" functions (e.g.
|
||
ADIOI\_FAKE\_IODone implemented in adio/common/ad\_done\_fake.c). These functions
|
||
are all related to asynchronous I/O (AIO) operations. These implement the AIO
|
||
operations in terms of blocking operations - in other words, they follow the
|
||
standard but do not allow for overlap of I/O and computation or communication.
|
||
These are used in cases where AIO support is otherwise unavailable or
|
||
unimplemented.
|
||
|
||
The final category of generic functions are the "na<6E><61>ve" functions (e.g.
|
||
ADIOI\_GEN\_WriteStrided\_naive in adio/common/ad\_write\_str\_naive.c). These
|
||
functions avoid the use of certain optimizations, such as data sieving.
|
||
|
||
Other Things in adio/common
|
||
|
||
... what else is in there?
|
||
|
||
\subsection{Calling ADIO Functions}
|
||
|
||
Throughout the code you will see calls to functions such as ADIO\_ReadContig.
|
||
There is no such function - this is actually a macro defined in
|
||
adio/include/adioi.h that calls the particular function out of the correct
|
||
ADIOI\_Fns\_struct for the file being accessed. This is done for convenience.
|
||
|
||
Exceptions!!! ADIO\_Open, ADIO\_Close...
|
||
|
||
\section{ROMIO Implementation Details}
|
||
|
||
The ROMIO Implementation relies on some basic concepts in order to operate and
|
||
to optimize I/O access. In this section we will discuss these concepts and
|
||
how they are implemented within ROMIO. Before we do that though, we will
|
||
discuss the core data structure of ROMIO, the ADIO\_File structure.
|
||
|
||
\subsection{ADIO\_File}
|
||
|
||
... discussion ...
|
||
|
||
\subsection{I/O Aggregation and Aggregators}
|
||
|
||
When performing collective I/O operations, it is often to our advantage to
|
||
combine operations or eliminate redundant operations altogether. We call this
|
||
combining process "aggregation", and processes that perform these combined
|
||
operations aggregators.
|
||
|
||
Aggregators are defined at the time the file is opened. A collection of MPI
|
||
hints can be used to tune what processes become aggregators for a given file
|
||
(see ROMIO User's Guide). The aggregators will then interact with the file
|
||
system during collective operations.
|
||
|
||
Note that it is possible to implement a system where ALL I/O operations pass
|
||
exclusively through aggregators, including independent I/O operations from
|
||
non-aggregators. However, this would require a guarantee of progress from the
|
||
aggregators that for portability would mean adding a thread to manage I/O. We
|
||
have chosen not to pursue this path at this time, so independent operations
|
||
continue to be serviced by the process making the call.
|
||
|
||
... how implemented ...
|
||
|
||
Rank 0 in the communicator opening a file \emph{always} processes the
|
||
cb\_config\_list hint using ADIOI\_cb\_config\_list\_parse. A previous call to
|
||
ADIOI\_cb\_gather\_name\_array had collected the processor names from all hosts
|
||
into an array that is cached on the communicator (so we don't have to gather
|
||
it more than once). This creates an ordered array of ranks (relative to the
|
||
communicator used to open the file) that will be aggregators. This array is
|
||
distributed to all processes using ADIOI\_cb\_bcast\_rank\_map. Aggregators are
|
||
referenced by their rank in the communicator used to open the file. These
|
||
ranks are stored in fd->hints->ranklist[].
|
||
|
||
Note that this could be a big list for very large runs. If we were to
|
||
restrict aggregators to a rank order subset, we could use a bitfield instead.
|
||
|
||
If the user specified hints and met conditions for deferred open, then a
|
||
separate communicator is also set up (fd->agg\_comm) that contains all the
|
||
aggregators, in order of their original ranks (not their order in the rank
|
||
list). Otherwise this communicator is set to MPI\_COMM\_NULL, and in any case
|
||
it is set to this for non-aggregators. This communicator is currently only
|
||
used at ADIO\_Close (adio/common/ad\_close.c), but could be useful in two-phase
|
||
I/O as well (discussed later).
|
||
|
||
|
||
\subsection{Deferred Open}
|
||
|
||
We do not always want all processes to attempt to actually open a file when
|
||
MPI\_File\_open is called. We might want to avoid this open because in fact
|
||
some processes (non-aggregators) cannot access the file at all and would get
|
||
an error, or we might want to avoid this open to avoid a storm of system calls
|
||
hitting the file system all at once. In either case, ROMIO implements a
|
||
"deferred open" mode that allows some processes to avoid opening the file
|
||
until such time as they perform an independent I/O operation on the file (see
|
||
ROMIO User's Guide).
|
||
|
||
Deferred open has a broad impact on the ROMIO implementation, because with its
|
||
addition there are now many places where we must first check to see if we have
|
||
called the file system specific ADIO Open call before performing I/O. This
|
||
impact is limited to the MPI-IO layer by semantically guaranteeing the FS ADIO
|
||
Open call has been made by the process prior to calling a read or write
|
||
function.
|
||
|
||
... how implemented ...
|
||
|
||
\subsection{Two-Phase I/O}
|
||
|
||
Two-Phase I/O is a technique for increasing the efficiency of I/O operations
|
||
by reordering data between processes, either before writes, or after reads.
|
||
|
||
ROMIO implements two-phase I/O as part of the generic implementations of
|
||
ADIO\_WriteStridedColl and ADIO\_ReadStridedColl. These implementations in turn
|
||
rely heavily on the aggregation code to determine what processes will actually
|
||
perform I/O on behalf of the application as a whole.
|
||
|
||
|
||
|
||
\subsection{Data Sieving}
|
||
|
||
Data sieving is a single-process technique for reducing the number of I/O
|
||
operations used to service a MPI read or write operation by accessing a
|
||
contiguous region of the file that contains more than one desired region at
|
||
once. Because often I/O operations require data movement across the network,
|
||
this is usually a more efficient way to access data.
|
||
|
||
Data sieving is implemented in the common strided I/O routines
|
||
(adio/common/ad\_write\_str.c and adio/common/ad\_read\_str.c). These functions
|
||
use the contig read and write routines to perform actual I/O. In the case of
|
||
a write operation, a read/modify/write sequence is used. In that case, as
|
||
well as in the atomic mode case, locking is required on the region. Some of
|
||
the ADIO implementations do not currently support locking, and in those cases
|
||
it would be erroneous to use the generic strided I/O routines.
|
||
|
||
\subsection{Shared File Pointers}
|
||
|
||
Because no file systems supported by ROMIO currently support a shared file
|
||
pointer mode, ROMIO must implement shared file pointers under the covers on
|
||
its own.
|
||
|
||
Currently ROMIO implements shared file pointers by storing the file pointer
|
||
value in a separate file...
|
||
|
||
Note that the ROMIO team has devised a portable method for implementing shared
|
||
file pointers using only MPI functions. However, this method has
|
||
not yet been implemented in ROMIO.
|
||
|
||
file name is selected at end of mpi-io/open.c.
|
||
|
||
\subsection{Error Handling}
|
||
|
||
\subsection{MPI and MPIO Requests}
|
||
|
||
\section*{Appendix A: ADIO Functions and Semantics}
|
||
|
||
ADIOI\_Open(ADIO\_File fd, int *error\_code)
|
||
|
||
Open is used in a strange way in ROMIO, as described previously.
|
||
|
||
The Open function is used to perform whatever operations are necessary prior
|
||
to actually accessing a file using read or write. The file name for the file
|
||
is stored in fd->filename prior to Open being called.
|
||
|
||
Note that when deferred open is in effect, all processes may not immediately
|
||
call Open at MPI\_File\_open time, but instead call open if they perform
|
||
independent I/O. This can result in somewhat unusual error returns to
|
||
processes (e.g. learning that a file is not accessible at write time).
|
||
|
||
ADIOI\_ReadContig(ADIO\_File fd, void *buf, int count, MPI\_Datatype datatype,
|
||
int file\_ptr\_type, ADIO\_Offset offset, ADIO\_Status *status, int *error\_code)
|
||
|
||
ReadContig is used to read a contiguous region from a file into a contiguous
|
||
buffer. The datatype (which refers to the buffer) can be assumed to be
|
||
contiguous. The offset is in bytes and is an absolute offset if
|
||
ADIO\_EXPLICIT\_OFFSET was passed as the file\_ptr\_type or relative to the
|
||
current individual file pointer if ADIO\_INDIVIDUAL was passed as
|
||
file\_ptr\_type. Open has been called by this process prior to the call to
|
||
ReadContig. There is no guarantee that any other processes will call this
|
||
function at the same time.
|
||
|
||
ADIOI\_WriteContig(ADIO\_File fd, void *buf, int count, MPI\_Datatype datatype,
|
||
int file\_ptr\_type, ADIO\_Offset offset, ADIO\_Status *status, int *error\_code)
|
||
|
||
WriteContig is used to write a contiguous region to a file from a contiguous
|
||
buffer. The datatype (which refers to the buffer) can be assumed to be
|
||
contiguous. The offset is in bytes and is an absolute offset if
|
||
ADIO\_EXPLICIT\_OFFSET was passed as the file\_ptr\_type or relative to the
|
||
current individual file pointer if ADIO\_INDIVIDUAL was passed as
|
||
file\_ptr\_type. Open has been called by this process prior to the call to
|
||
WriteContig. There is no guarantee that any other processes will call this
|
||
function at the same time.
|
||
|
||
ADIOI\_ReadStridedColl
|
||
|
||
ADIOI\_WriteStridedColl
|
||
|
||
ADIOI\_SeekIndividual
|
||
|
||
ADIOI\_Fcntl
|
||
|
||
ADIOI\_SetInfo
|
||
|
||
ADIOI\_ReadStrided
|
||
|
||
ADIOI\_WriteStrided
|
||
|
||
ADIOI\_Close(ADIO\_File fd, int *error\_code)
|
||
|
||
Close is responsible for releasing any resources associated with an open file.
|
||
It is called on all processes that called the corresponding ADIOI Open, which
|
||
might not be all the processes that opened the file (due to deferred open).
|
||
Thus it is not safe to perform collective communication among all processes in
|
||
the communicator during Close, although collective communication between
|
||
aggregators would be safe (if desired).
|
||
|
||
For performance reasons ROMIO does not guarantee that all file data is written
|
||
to "storage" at MPI\_File\_close, instead only performing synchronization
|
||
operations at MPI\_File\_sync time. As a result, our Close implementations do
|
||
not typically call a sync. However, any locally cached data, if any, should
|
||
be passed on to the underlying storage system at this time.
|
||
|
||
Note that ADIOI\_GEN\_Close is implemented in adio/common/adi\_close.c;
|
||
ad\_close.c implements ADIO\_Close, which is called by all processes that opened
|
||
the file.
|
||
|
||
ADIOI\_IreadContig
|
||
|
||
ADIOI\_IwriteContig
|
||
|
||
ADIOI\_ReadDone
|
||
|
||
ADIOI\_WriteDone
|
||
|
||
ADIOI\_ReadComplete
|
||
|
||
ADIOI\_WriteComplete
|
||
|
||
ADIOI\_IreadStrided
|
||
|
||
ADIOI\_IwriteStrided
|
||
|
||
ADIOI\_Flush
|
||
|
||
ADIOI\_Resize(ADIO\_File fd, ADIO\_Offset size, int *error\_code)
|
||
|
||
Resize is called collectively by all processes that opened the file referenced
|
||
by fd. It is not required that the Resize implementation block until all
|
||
processes have completed resize operations, but each process should be able to
|
||
see the correct size with a corresponding MPI\_File\_get\_size operation (an
|
||
independent operation that results in an ADIO Fcntl to obtain the file size).
|
||
|
||
ADIOI\_Delete(char *filename, int *error\_code)
|
||
|
||
Delete is called independently, and because only a filename is passed, there
|
||
is no opportunity to coordinate deletion if an application were to choose to
|
||
have all processes call MPI\_File\_delete. That's not likely to be an issue
|
||
though.
|
||
|
||
\section*{Appendix B: Status of ADIO Implementations}
|
||
|
||
... who wrote what, status, etc.
|
||
|
||
Appendix C: Adding a New ADIO Implementation
|
||
|
||
References
|
||
|
||
\end{document}
|