% \documentstyle[11pt,psfig]{article}
\documentstyle[11pt]{article}
\hoffset=-.7in
\voffset=-.6in
\textwidth=6.5in
\textheight=8.5in
\begin{document}
\vspace*{-1in}
\thispagestyle{empty}
\begin{center}
ARGONNE NATIONAL LABORATORY \\
9700 South Cass Avenue \\
Argonne, IL 60439
\end{center}
\vskip .5 in
\begin{center}
\rule{1.75in}{.01in} \\
\vspace{.1in}
ANL/MCS-TM-234 \\
\rule{1.75in}{.01in} \\
\vskip 1.3in
{\Large\bf Users Guide for ROMIO: A High-Performance, \\ [1ex]
Portable MPI-IO Implementation} \\ [4ex]
by \\ [2ex]
{\large\it Rajeev Thakur, Robert Ross, Ewing Lusk, William Gropp, Robert Latham}

\vspace{1in}

Mathematics and Computer Science Division

\bigskip

Technical Memorandum No.\ 234

\vspace{1.4in}

Revised May 2004, November 2007, April 2010
\end{center}
\vfill
{\small
\noindent This work was supported by the Mathematical, Information, and
Computational Sciences Division subprogram of the Office of Advanced
Scientific Computing Research, U.S. Department of Energy, under Contract
W-31-109-Eng-38; and by the Scalable I/O Initiative, a multiagency project
funded by the Defense Advanced Research Projects Agency (Contract
DABT63-94-C-0049), the Department of Energy, the National Aeronautics and
Space Administration, and the National Science Foundation.}
\newpage

%% Line Spacing (e.g., \ls{1} for single, \ls{2} for double, even \ls{1.5})
%%
\newcommand{\ls}[1]
   {\dimen0=\fontdimen6\the\font
    \lineskip=#1\dimen0
    \advance\lineskip.5\fontdimen5\the\font
    \advance\lineskip-\dimen0
    \lineskiplimit=.9\lineskip
    \baselineskip=\lineskip
    \advance\baselineskip\dimen0
    \normallineskip\lineskip
    \normallineskiplimit\lineskiplimit
    \normalbaselineskip\baselineskip
    \ignorespaces
   }
\renewcommand{\baselinestretch}{1}
\newcommand {\ix} {\hspace*{2em}}
\newcommand {\mc} {\multicolumn}

\tableofcontents
\thispagestyle{empty}
\newpage

\pagenumbering{arabic}
\setcounter{page}{1}

\begin{center}
{\bf Users Guide for ROMIO: A High-Performance,\\[1ex]
Portable MPI-IO Implementation} \\ [2ex]
by \\ [2ex]
{\it Rajeev Thakur, Robert Ross, Ewing Lusk, and William Gropp}
\end{center}
\addcontentsline{toc}{section}{Abstract}
\begin{abstract}
\noindent
ROMIO is a high-performance, portable implementation of MPI-IO (the I/O
chapter in the \mbox{MPI Standard}).  This document describes how to
install and use ROMIO version~1.2.4 on various machines.
\end{abstract}

\section{Introduction}

ROMIO\footnote{\tt http://www.mcs.anl.gov/romio} is a high-performance,
portable implementation of MPI-IO (the I/O chapter in
MPI~\cite{mpi97a}).  This document describes how to install and use
ROMIO version~1.2.4 on various machines.

%
% MAJOR CHANGES IN THIS VERSION
%
\section{Major Changes in This Version}
\begin{itemize}
\item Added section describing ROMIO \texttt{MPI\_FILE\_SYNC} and
      \texttt{MPI\_FILE\_CLOSE} behavior to User's Guide
\item Bug removed from PVFS ADIO implementation regarding resize operations
\item Added support for PVFS listio operations (see Section~\ref{sec:hints})
\item Added the following working hints: \texttt{romio\_pvfs\_listio\_read},
      \texttt{romio\_pvfs\_listio\_write}
\end{itemize}

%
% GENERAL INFORMATION
%
\section{General Information}

This version of ROMIO includes everything defined in the MPI I/O chapter
except support for file interoperability and user-defined error handlers
for files (\S~4.13.3).  The subarray and distributed array datatype
constructor functions from Chapter 4 (\S~4.14.4 \& \S~4.14.5) have been
implemented.  They are useful for accessing arrays stored in files.
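As an illustration, the following sketch uses a subarray datatype as the
file view so that each process reads its own block of a two-dimensional
array stored in a file.  The array dimensions, the $2\times 2$ process
grid, and the file name are illustrative only.

\begin{verbatim}
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    int gsizes[2] = {1024, 1024};  /* global array is 1024 x 1024 ints  */
    int lsizes[2] = {512, 512};    /* each process holds a 512x512 block */
    int starts[2];
    int *local;
    MPI_Datatype subarray;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* assume 4 processes, 2x2 grid */

    starts[0] = (rank / 2) * lsizes[0];    /* block's origin in the */
    starts[1] = (rank % 2) * lsizes[1];    /* global array          */

    MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C,
                             MPI_INT, &subarray);
    MPI_Type_commit(&subarray);

    local = (int *) malloc(lsizes[0] * lsizes[1] * sizeof(int));

    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    /* the view makes only this process's block visible in the file */
    MPI_File_set_view(fh, 0, MPI_INT, subarray, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, local, lsizes[0] * lsizes[1], MPI_INT, &status);
    MPI_File_close(&fh);

    free(local);
    MPI_Type_free(&subarray);
    MPI_Finalize();
    return 0;
}
\end{verbatim}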
The functions {\tt MPI\_File\_f2c} and {\tt MPI\_File\_c2f}
(\S~4.12.4) are also implemented.  C, Fortran, and profiling interfaces
are provided for all functions that have been implemented.

This version of ROMIO runs on at least the following machines: IBM SP;
Intel Paragon; HP Exemplar; SGI Origin2000; Cray T3E; NEC SX-4; other
symmetric multiprocessors from HP, SGI, DEC, Sun, and IBM; and networks
of workstations (Sun, SGI, HP, IBM, DEC, Linux, and FreeBSD).  Supported
file systems are IBM PIOFS, Intel PFS, HP/Convex HFS, SGI XFS, NEC SFS,
PVFS, NFS, NTFS, and any Unix file system (UFS).

This version of ROMIO is included in MPICH 1.2.4; an earlier version is
included in at least the following MPI implementations: LAM, HP MPI, SGI
MPI, and NEC MPI.  Note that proper I/O error codes and classes are
returned and the status variable is filled in only when ROMIO is used
with MPICH revision 1.2.1 or later.

You can open files on multiple file systems in the same program.  The
only restriction is that the directory where the file is to be opened
must be accessible from the process opening the file.  For example, a
process running on one workstation may not be able to access a directory
on the local disk of another workstation, and therefore ROMIO will not
be able to open a file in such a directory.  NFS-mounted files can be
accessed.

An MPI-IO file created by ROMIO is no different from any other file
created by the underlying file system.  Therefore, you may use any of
the commands provided by the file system to access the file, for
example, {\tt ls}, {\tt mv}, {\tt cp}, {\tt rm}, {\tt ftp}.

Please read the limitations of this version of ROMIO that are listed in
Section~\ref{sec:limit} of this document (e.g., restriction to
homogeneous environments).

\subsection{ROMIO Optimizations}
\label{sec:opt}

ROMIO implements two I/O optimization techniques that in general result
in improved performance for applications.  The first of these is
\emph{data sieving}~\cite{choudhary:passion}.  Data sieving is a
technique for efficiently accessing noncontiguous regions of data in
files when noncontiguous accesses are not provided as a file system
primitive.  The naive approach to accessing noncontiguous regions is to
use a separate I/O call for each contiguous region in the file.  This
results in a large number of I/O operations, each of which is often for
a very small amount of data.  The cost of performing an I/O operation
across a network, as in parallel I/O systems, is made higher still by
latency.  Thus, this naive approach typically performs very poorly
because of the overhead of the many separate operations.
%
In the data sieving technique, a number of noncontiguous regions are
accessed by reading a block of data containing all of the regions,
including the unwanted data between them (called ``holes'').  The
regions of interest are then extracted from this large block by the
client.  This technique has the advantage of a single I/O call, but
additional data is read from the disk and passed across the network.

There are four hints that can be used to control the application of
data sieving in ROMIO: \texttt{ind\_rd\_buffer\_size},
\texttt{ind\_wr\_buffer\_size}, \texttt{romio\_ds\_read}, and
\texttt{romio\_ds\_write}.  These are discussed in
Section~\ref{sec:hints}.

The second optimization is \emph{two-phase
I/O}~\cite{bordawekar:primitives}.  Two-phase I/O, also called
collective buffering, is an optimization that applies only to
collective I/O operations.
In two-phase I/O, the collection of independent I/O operations that
make up the collective operation is analyzed to determine what data
regions must be transferred (read or written).  These regions are then
split up amongst a set of aggregator processes that will actually
interact with the file system.  In the case of a read, these
aggregators first read their regions from disk and redistribute the
data to the final locations; in the case of a write, data is first
collected from the processes before being written to disk by the
aggregators.

There are five hints that can be used to control the application of
two-phase I/O: \texttt{cb\_config\_list}, \texttt{cb\_nodes},
\texttt{cb\_buffer\_size}, \texttt{romio\_cb\_read}, and
\texttt{romio\_cb\_write}.  These are discussed in
Section~\ref{sec:hints}.

\subsection{Hints}
\label{sec:hints}

If ROMIO doesn't understand a hint, or if the value is invalid, the
hint will be ignored.  The values of the hints being used by ROMIO for
a file can be obtained at any time via {\tt MPI\_File\_get\_info}.

The following hints control the data sieving optimization and are
applicable to all file system types:
\begin{itemize}
\item \texttt{ind\_rd\_buffer\_size} -- Controls the size (in bytes) of
the intermediate buffer used by ROMIO when performing data sieving
during read operations.  Default is \texttt{4194304} (4~Mbytes).
\item \texttt{ind\_wr\_buffer\_size} -- Controls the size (in bytes) of
the intermediate buffer used by ROMIO when performing data sieving
during write operations.  Default is \texttt{524288} (512~Kbytes).
\item \texttt{romio\_ds\_read} -- Determines when ROMIO will choose to
perform data sieving.  Valid values are \texttt{enable},
\texttt{disable}, or \texttt{automatic}.  Default value is
\texttt{automatic}.  In \texttt{automatic} mode ROMIO may choose to
enable or disable data sieving based on heuristics.
\item \texttt{romio\_ds\_write} -- Same as above, but for writes.
\end{itemize}

The following hints control the two-phase (collective buffering)
optimization and are applicable to all file system types:
\begin{itemize}
\item \texttt{cb\_buffer\_size} -- Controls the size (in bytes) of the
intermediate buffer used in two-phase collective I/O.  If the amount of
data that an aggregator will transfer is larger than this value, then
multiple operations are used.  The default is \texttt{4194304}
(4~Mbytes).
\item \texttt{cb\_nodes} -- Controls the maximum number of aggregators
to be used.  By default this is set to the number of unique hosts in
the communicator used when opening the file.
\item \texttt{romio\_cb\_read} -- Controls when collective buffering is
applied to collective read operations.  Valid values are
\texttt{enable}, \texttt{disable}, and \texttt{automatic}.  Default is
\texttt{automatic}.  When enabled, all collective reads will use
collective buffering.  When disabled, all collective reads will be
serviced with individual operations by each process.  When set to
\texttt{automatic}, ROMIO will use heuristics to determine when to
enable the optimization.
\item \texttt{romio\_cb\_write} -- Controls when collective buffering
is applied to collective write operations.  Valid values are
\texttt{enable}, \texttt{disable}, and \texttt{automatic}.  Default is
\texttt{automatic}.  See the description of \texttt{romio\_cb\_read}
for an explanation of the values.
\item \texttt{romio\_no\_indep\_rw} -- This hint controls when
``deferred open'' is used.  When set to \texttt{true}, ROMIO will make
an effort to avoid performing any file operation on non-aggregator
nodes.  The application is expected to use only collective operations.
This is discussed in further detail below.
\item \texttt{cb\_config\_list} -- Provides explicit control over
aggregators.  This is discussed in further detail below.
\end{itemize}
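Hints are passed by attaching key/value pairs (both strings) to an
{\tt MPI\_Info} object that is supplied to {\tt MPI\_File\_open}.  The
following is a minimal sketch of this pattern; the file name and hint
values are illustrative only.

\begin{verbatim}
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Info info;
    MPI_File fh;
    int i, nkeys, flag;
    char key[MPI_MAX_INFO_KEY+1], value[MPI_MAX_INFO_VAL+1];

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* request two aggregators and an 8-Mbyte collective buffer */
    MPI_Info_set(info, "cb_nodes", "2");
    MPI_Info_set(info, "cb_buffer_size", "8388608");

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
    MPI_Info_free(&info);

    /* print the hints ROMIO is actually using for this file */
    MPI_File_get_info(fh, &info);
    MPI_Info_get_nkeys(info, &nkeys);
    for (i = 0; i < nkeys; i++) {
        MPI_Info_get_nthkey(info, i, key);
        MPI_Info_get(info, key, MPI_MAX_INFO_VAL, value, &flag);
        printf("%s = %s\n", key, value);
    }
    MPI_Info_free(&info);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
\end{verbatim}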
For some system configurations, more control is needed to specify
which hardware resources (processors or nodes in an SMP) are preferred
for collective I/O, either for performance reasons or because only
certain resources have access to storage.  The additional MPI\_Info
key name \texttt{cb\_config\_list} specifies a comma-separated list of
strings, each string specifying a particular node and an optional
limit on the number of processes to be used for collective buffering
on this node.  This refers to the same processes that
\texttt{cb\_nodes} refers to, but specifies the available nodes more
precisely.

The format of the value of \texttt{cb\_config\_list} is given by the
following BNF:
\begin{verbatim}
cb_config_list => hostspec [ ',' cb_config_list ]
hostspec       => hostname [ ':' maxprocesses ]
hostname       => <alphanumeric string> | '*'
maxprocesses   => <digits> | '*'
\end{verbatim}
The value \texttt{hostname} identifies a processor.  This name must
match the name returned by
\texttt{MPI\_Get\_processor\_name}\footnote{The MPI standard requires
that the output from this routine identify a particular piece of
hardware; some MPI implementations may not conform to this
requirement.  MPICH does conform to the MPI standard.}
%
for the specified hardware.  The value \texttt{*} as a hostname
matches all processors.  The value of \texttt{maxprocesses} may be any
nonnegative integer (zero is allowed); it specifies the maximum number
of processes that may be used for collective buffering on the
specified host.  If no value is specified, the value one is assumed.
If \texttt{*} is specified for the number of processes, then all MPI
processes with this same hostname will be used.  Leftmost components
of the info value take precedence.

Note: Matching of processor names to \texttt{cb\_config\_list} entries
is performed with string matching functions and is independent of the
listing of machines that the user provides to mpirun/mpiexec.  In
other words, listing the same machine four times in the list of hosts
to run on will not cause a \texttt{*:1} to assign that host four
aggregators, because the matching code will see that the processor
name is the same for all four and will assign exactly one aggregator
to the processor.

The value of this info key must be the same for all processes (i.e.,
the call is collective and each process must receive the same hint
value for these collective buffering hints).  Further, in the ROMIO
implementation the hint is recognized only at \texttt{MPI\_File\_open}
time.  The set of hints used with a file is available through the
routine \texttt{MPI\_File\_get\_info}, as documented in the MPI
standard.  As an additional feature in the ROMIO implementation,
wildcards will be expanded to indicate the precise configuration used
with the file, with the hostnames in the rank order used for the
collective buffering algorithm (\emph{this is not implemented at this
time}).

Here are some examples of how this hint might be used:
\begin{itemize}
\item \texttt{*:1} One process per hostname (i.e., one process per node)
\item \texttt{box12:30,*:0} Thirty processes on one machine, namely
\texttt{box12}, and none anywhere else.
\item \texttt{n01,n11,n21,n31,n41} One process on each of these
specific nodes only.
\end{itemize}

When the values specified by \texttt{cb\_config\_list} conflict with
other hints (e.g., the number of collective buffering nodes specified
by \texttt{cb\_nodes}), the implementation is encouraged to take the
minimum of the two values.  In other words, if
\texttt{cb\_config\_list} specifies ten processors on which I/O should
be performed, but \texttt{cb\_nodes} specifies a smaller number, then
an implementation is encouraged to use only \texttt{cb\_nodes} total
aggregators.  If \texttt{cb\_config\_list} specifies fewer processes
than \texttt{cb\_nodes}, no more than the number in
\texttt{cb\_config\_list} should be used.  The implementation is also
encouraged to assign processes in the order that they are listed in
\texttt{cb\_config\_list}.

The following hint controls the deferred open feature of ROMIO and is
also applicable to all file system types:
\begin{itemize}
\item \texttt{romio\_no\_indep\_rw} -- If the application plans on
performing only collective operations and this hint is set to
``true'', then ROMIO can have just the aggregators open a file.  The
\texttt{cb\_config\_list} and \texttt{cb\_nodes} hints can be given to
further control which nodes are aggregators.
\end{itemize}

The following hints apply to PVFS, PIOFS, and PFS:
\begin{itemize}
\item \texttt{striping\_factor} -- Controls the number of I/O devices
to stripe across.  The default is file system dependent, but for PVFS
it is \texttt{-1}, indicating that the file should be striped across
all I/O devices.
\item \texttt{striping\_unit} -- Controls the striping unit (in
bytes).  For PVFS the default will be the PVFS file system default
strip size.
\item \texttt{start\_iodevice} -- Determines which I/O device data
will first be written to.  This is a number in the range of
0 ... striping\_factor - 1.
\end{itemize}
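Striping is normally fixed when a file is first created, so these
hints are typically supplied along with \texttt{MPI\_MODE\_CREATE} at
open time.  A minimal sketch follows; the path and values are
illustrative, and hints that a file system does not understand are
simply ignored.

\begin{verbatim}
MPI_Info info;
MPI_File fh;

MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "4");    /* stripe across 4 devices */
MPI_Info_set(info, "striping_unit", "65536");  /* 64-Kbyte stripes */

MPI_File_open(MPI_COMM_WORLD, "/pvfs/testfile",
              MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
MPI_Info_free(&info);
\end{verbatim}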
\subsubsection{Hints for PFS}
\label{sec:hints_pfs}
\begin{itemize}
\item \texttt{pfs\_svr\_buf} -- Turns on PFS server buffering.  Valid
values are \texttt{true} and \texttt{false}.  Default is
\texttt{false}.
\end{itemize}

\subsubsection{Hints for XFS}
\label{sec:hints_xfs}
For XFS, control is provided over the direct I/O optimization:
\begin{itemize}
\item \texttt{direct\_read} -- Controls direct I/O for reads.  Valid
values are \texttt{true} and \texttt{false}.  Default is
\texttt{false}.
\item \texttt{direct\_write} -- Controls direct I/O for writes.  Valid
values are \texttt{true} and \texttt{false}.  Default is
\texttt{false}.
\end{itemize}

\subsubsection{Hints for PVFS (v1)}
\label{sec:hints_oldpvfs}
For PVFS, control is provided over the use of the listio interface.
This interface to PVFS allows a collection of noncontiguous regions to
be requested (for reading or writing) with a single operation.  This
can result in substantially higher performance when accessing
noncontiguous regions.  Support for these operations exists in PVFS
versions after 1.5.4 but has not been heavily tested, so use of the
interface is disabled in ROMIO by default at this time.  The hints to
control listio use are:
\begin{itemize}
\item \texttt{romio\_pvfs\_listio\_read} -- Controls use of listio for
reads.  Valid values are \texttt{enable}, \texttt{disable}, and
\texttt{automatic}.  Default is \texttt{disable}.
\item \texttt{romio\_pvfs\_listio\_write} -- Controls use of listio
for writes.  Valid values are \texttt{enable}, \texttt{disable}, and
\texttt{automatic}.  Default is \texttt{disable}.
\end{itemize}

\subsubsection{Hints for PVFS (v2)}
\label{sec:hints_pvfs}
The PVFS v2 file system has many tuning parameters.
\begin{itemize}
\item dtype i/o
\end{itemize}

\subsubsection{Hints for Lustre}
\begin{itemize}
\item \texttt{romio\_lustre\_co\_ratio} -- In the stripe-contiguous
I/O pattern, each OST will be accessed by a group of I/O clients.  CO
means the Client/OST ratio, that is, the maximum number of I/O clients
for each OST.  CO=1 by default.
\item \texttt{romio\_lustre\_coll\_threshold} -- ROMIO will not
perform collective I/O if this hint is set and the I/O request size is
bigger than this value.  When the request size is large, the
collective communication overhead increases and the benefits from
collective I/O become limited.  A value of 0 means always perform
collective I/O.
\item \texttt{romio\_lustre\_cb\_ds\_threshold} -- ROMIO can optimize
collective I/O with a version of data sieving.  If the I/O request is
smaller than this hint's value, though, ROMIO will not try to apply
the data sieving optimization.
\item \texttt{romio\_lustre\_ds\_in\_coll} -- By default, collective
I/O applies read-modify-write to deal with noncontiguous data.
However, this introduces some overhead (I/O operations and locking).
The Lustre developers have run tests where data sieving showed bad
collective write performance for some kinds of workloads.  To avoid
this, the \texttt{romio\_lustre\_ds\_in\_coll} hint can be used to
disable the read-modify-write step in collective I/O.  This
optimization is distinct from the one in independent I/O (controlled
by \texttt{romio\_ds\_read} and \texttt{romio\_ds\_write}).
\end{itemize}

\subsubsection{Hints for PANFS (Panasas)}
PanFS allows users to specify the layout of a file at file-creation
time.  Layout information includes the number of StorageBlades (SBs)
across which the data is stored, the number of SBs across which a
parity stripe is written, and the number of consecutive stripes that
are placed on the same set of SBs.  The \texttt{panfs\_layout\_*}
hints are used only if supplied at file-creation time.
\begin{itemize}
\item \texttt{panfs\_layout\_type} -- Specifies the layout of a file:
2 = RAID0, 3 = RAID5 Parity Stripes
\item \texttt{panfs\_layout\_stripe\_unit} -- The size of the stripe
unit in bytes
\item \texttt{panfs\_layout\_total\_num\_comps} -- The total number of
StorageBlades a file is striped across
\item \texttt{panfs\_layout\_parity\_stripe\_width} -- If the layout
type is RAID5 Parity Stripes, this hint specifies the number of
StorageBlades in a parity stripe
\item \texttt{panfs\_layout\_parity\_stripe\_depth} -- If the layout
type is RAID5 Parity Stripes, this hint specifies the number of
contiguous parity stripes written across the same set of SBs
\item \texttt{panfs\_layout\_visit\_policy} -- If the layout type is
RAID5 Parity Stripes, the policy used to determine the parity stripe
to which a given file offset is written: 1 = Round Robin
\end{itemize}

PanFS supports the ``concurrent write'' (CW) mode, in which groups of
cooperating clients can disable the PanFS consistency mechanisms and
use their own consistency protocol.  Clients participating in
concurrent write mode use application-specific information to improve
performance while maintaining file consistency.  All clients accessing
the file(s) must enable concurrent write mode.  If any client does not
enable concurrent write mode, then the PanFS consistency protocol will
be invoked.  Once a file is opened in CW mode on a machine, attempts
to open a file in non-CW mode will fail with EACCES.
If a file is already opened in non-CW mode, attempts to open the file
in CW mode will fail with EACCES.  The following hint is used to
enable concurrent write mode.
\begin{itemize}
\item \texttt{panfs\_concurrent\_write} -- If set to 1 at file open
time, the file is opened using the PanFS concurrent write mode flag.
Concurrent write mode is not a persistent attribute of the file.
\end{itemize}

Below is an example PanFS layout using the following parameters:
\begin{verbatim}
- panfs_layout_type                = 3
- panfs_layout_total_num_comps     = 100
- panfs_layout_parity_stripe_width = 10
- panfs_layout_parity_stripe_depth = 8
- panfs_layout_visit_policy        = 1

Parity Stripe Group 1   Parity Stripe Group 2  . . .  Parity Stripe Group 10
----------------------  ----------------------        --------------------
SB1  SB2  ... SB10      SB11 SB12 ... SB20     ...    SB91 SB92 ... SB100
----------------------- -----------------------       ---------------------
D1   D2   ... D10       D91  D92  ... D100            D181 D182 ... D190
D11  D12      D20       D101 D102     D110            D191 D192      D193
D21  D22      D30       . . .                         . . .
D31  D32      D40
D41  D42      D50
D51  D52      D60
D61  D62      D70
D71  D72      D80
D81  D82      D90       D171 D172     D180            D261 D262      D270
D271 D272     D273      . . .                         . . .
...
\end{verbatim}

\subsubsection{Systemwide Hints}
\label{sec:system_hints}
A site administrator with knowledge of the storage and networking
capabilities of a machine might be able to come up with a set of hint
values that work better for that machine than the ROMIO default
values.  As an extension to the standard, ROMIO will consult a ``hints
file''.  This file provides an additional mechanism for setting MPI-IO
hints, albeit in a ROMIO-specific manner.  The hints file contains a
list of hints and their values.  ROMIO will use these initial hint
settings, though programs are free to override any of them.

The format of the hints file is a list of hints and their values, one
per line.  A \# character in the first column indicates a comment, and
ROMIO will ignore the entire line.  Here's an example:
\begin{verbatim}
# this is a comment describing the following setting
cb_nodes 32
# these nodes happen to have the best connection to storage
cb_config_list n01,n11,n21,n31,n41
\end{verbatim}

ROMIO will look for these hints in the file \texttt{/etc/romio-hints}.
A user can set the environment variable \texttt{ROMIO\_HINTS} to the
name of a file that ROMIO will use instead.

\subsection{Using ROMIO on NFS}

It is worth first mentioning that in no way do we encourage the use of
ROMIO on NFS volumes.  NFS is not a high-performance protocol, nor are
NFS servers typically very good at handling the types of concurrent
access seen from MPI-IO applications.  Nevertheless, NFS is a very
popular mechanism for providing access to a shared space, and ROMIO
does support MPI-IO to NFS volumes, provided that they are configured
properly.

To use ROMIO on NFS, file locking with {\tt fcntl} must work correctly
on the NFS installation.  On some installations, fcntl locks don't
work.  To get them to work, you need to use Version~3 of NFS, ensure
that the lockd daemon is running on all the machines, and have the
system administrator mount the NFS file system with the ``{\tt noac}''
option (no attribute caching).  Turning off attribute caching may
reduce performance, but it is necessary for correct behavior.

The following are some instructions we received from Ian Wells of HP
for setting the {\tt noac} option on NFS.  We have not tried them
ourselves.  We are including them here because you may find them
useful.
Note that some of the steps may be specific to HP systems, and you may
need root permission to execute some of the commands.
\begin{verbatim}
>1. first confirm you are running nfs version 3
>
>rpcinfo -p `hostname` | grep nfs
>
>ie
>     goedel >rpcinfo -p goedel | grep nfs
>       100003  2   udp  2049  nfs
>       100003  3   udp  2049  nfs
>
>
>2. then edit /etc/fstab for each nfs directory read/written by MPIO
>   on each machine used for multihost MPIO.
>
>   Here is an example of a correct fstab entry for /epm1:
>
>   ie grep epm1 /etc/fstab
>
>     ROOOOT 11>grep epm1 /etc/fstab
>     gershwin:/epm1 /rmt/gershwin/epm1 nfs bg,intr,noac 0 0
>
>   if the noac option is not present, add it
>   and then remount this directory
>   on each of the machines that will be used to share MPIO files
>
>ie
>
>ROOOOT >umount /rmt/gershwin/epm1
>ROOOOT >mount /rmt/gershwin/epm1
>
>3. Confirm that the directory is mounted noac:
>
>ROOOOT >grep gershwin /etc/mnttab
>gershwin:/epm1 /rmt/gershwin/epm1 nfs
>noac,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0 0 0 899911504
\end{verbatim}

\subsubsection{ROMIO, NFS, and Synchronization}

NFS has a ``sync'' option that specifies that the server should put
data on the disk before replying that an operation is complete.  This
means that the actual I/O cost on the server side cannot be hidden
with caching, etc., when this option is selected.  In the ``async''
mode the server can get the data into a buffer (and perhaps put it in
the write queue; this depends on the implementation) and reply right
away.  Obviously if the server were to go down after the reply was
sent but before the data was written, the system would be in a strange
state, which is why so many articles suggest the ``sync'' option.

Some systems default to ``sync'', while others default to ``async'',
and the default can change from version to version of the NFS
software.  If you find that access to an NFS volume through MPI-IO is
particularly slow, this is one thing to check out.

\subsection{Using testfs}

The testfs ADIO implementation provides a harness for testing
components of ROMIO or discovering the underlying I/O access patterns
of an application.  When testfs is specified as the file system type,
no actual files will be opened.  Instead, debugging information will
be displayed on the processes opening the file.  Subsequent I/O
operations on this testfs file will provide additional debugging
information.

The intention of the testfs implementation is that it serve as a
starting point for further instrumentation when debugging new features
or applications.  As such, it is expected that users will want to
modify the ADIO implementation in order to get the specific output
they desire.

\subsection{ROMIO and {\tt MPI\_FILE\_SYNC}}

The MPI specification notes that a call to {\tt MPI\_FILE\_SYNC}
``causes all previous writes to {\tt fh} by the calling process to be
transferred to the storage device.''  Likewise, calls to {\tt
MPI\_FILE\_CLOSE} have this same semantic.  Further, ``if all
processes have made updates to the storage device, then all such
updates become visible to subsequent reads of {\tt fh} by the calling
process.''

The intended use of {\tt MPI\_FILE\_SYNC} is to allow all processes in
the communicator used to open the file to see changes made to the file
by each other (the second part of the specification).  The definition
of ``storage device'' in the specification is vague, and it isn't
necessarily the case that calling {\tt MPI\_FILE\_SYNC} will force
data out to permanent storage.

Since users often use {\tt MPI\_FILE\_SYNC} to attempt to force data
out to permanent storage (i.e., disk), the ROMIO implementation of
this call enforces stronger semantics for most underlying file systems
by calling the appropriate file sync operation when {\tt
MPI\_FILE\_SYNC} is called (e.g., {\tt fsync}).  However, it is still
unwise to assume that the data has all made it to disk, because some
file systems (e.g., NFS) may not force data to disk when a client
system makes a sync call.

For performance reasons we do \emph{not} make this same file system
call at {\tt MPI\_FILE\_CLOSE} time.  At close time ROMIO ensures any
data has been written out to the ``storage device'' (file system) as
defined in the standard, but does not try to push the data beyond this
and into physical storage.  Users should call {\tt MPI\_FILE\_SYNC}
before the close if they wish to encourage the underlying file system
to push data to permanent storage.
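The following fragment sketches this usage (declarations and error
checking omitted).  The sync-barrier-sync idiom from the MPI
consistency semantics is also shown; it makes the writes of each
process visible to subsequent reads by the others.

\begin{verbatim}
MPI_File_write_all(fh, buf, count, MPI_INT, &status);

MPI_File_sync(fh);            /* ROMIO calls, e.g., fsync() here     */
MPI_Barrier(MPI_COMM_WORLD);
MPI_File_sync(fh);            /* writes by all processes now visible */

/* ... reads of data written by other processes are now safe ... */

MPI_File_close(&fh);          /* no file system sync at close; call
                                 MPI_File_sync first if you want the
                                 data pushed toward permanent storage */
\end{verbatim}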
\subsection{ROMIO and {\tt MPI\_FILE\_SET\_SIZE}}

{\tt MPI\_FILE\_SET\_SIZE} is a collective routine used to resize a
file.  It is important to remember that an MPI-IO routine being
collective does not imply that the routine synchronizes the calling
processes in any way (unless this is specified explicitly).

As of 1.2.4, ROMIO implements {\tt MPI\_FILE\_SET\_SIZE} by calling
{\tt ftruncate} from all processes.  Since different processes may
call the function at different times, a resize operation mixed in with
writes or reads could have unexpected results unless external
synchronization is used.  In short, if synchronization after a set
size is needed, the user should add a barrier or similar operation to
ensure the set size has completed.
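For example, a minimal sketch (the handle \texttt{fh} and the new size
are illustrative):

\begin{verbatim}
MPI_File_set_size(fh, new_size);  /* collective, but not synchronizing */
MPI_Barrier(MPI_COMM_WORLD);      /* now every process has resized     */
/* reads and writes that depend on the new size are safe from here on */
\end{verbatim}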
%
% INSTALLATION INSTRUCTIONS
%
\section{Installation Instructions}

Since ROMIO is included in MPICH, LAM, HP MPI, SGI MPI, and NEC MPI,
you don't need to install it separately if you are using any of these
MPI implementations.  If you are using some other MPI, you can
configure and build ROMIO as follows:

Untar the tar file as
\begin{verbatim}
gunzip -c romio.tar.gz | tar xvf -
\end{verbatim}
{\noindent or}
\begin{verbatim}
zcat romio.tar.Z | tar xvf -
\end{verbatim}
{\noindent then}
\begin{verbatim}
cd romio
./configure
make
\end{verbatim}

Some example programs and a Makefile are provided in the {\tt
romio/test} directory.  Run the examples as you would run any MPI
program.  Each program takes the filename as a command-line argument
``{\tt -fname filename}''.

The {\tt configure} script by default configures ROMIO for the file
systems most likely to be used on the given machine.  If you wish, you
can explicitly specify the file systems by using the
``{\tt -file\_system}'' option to configure.  Multiple file systems
can be specified by using `+' as a separator, e.g., \\
\hspace*{.4in} {\tt ./configure -file\_system=xfs+nfs} \\
For the entire list of options to configure, do\\
\hspace*{.4in} {\tt ./configure -h | more} \\
After building a specific version, you can install it in a particular
directory with \\
\hspace*{.4in} {\tt make install PREFIX=/usr/local/romio (or whatever directory you like)} \\
or just\\
\hspace*{.4in} {\tt make install (if you used -prefix at configure time)}

If you intend to leave ROMIO where you built it, you should {\it not}
install it; {\tt make install} is used only to move the necessary
parts of a built ROMIO to another location.  The installed copy will
have the include files, libraries, man pages, and a few other odds and
ends, but not the whole source tree.  It will have a {\tt test}
directory for testing the installation and a location-independent
Makefile built during installation, which users can copy and modify to
compile and link against the installed copy.

To rebuild ROMIO with a different set of configure options, do\\
\hspace*{.4in} {\tt make distclean}\\
to clean everything, including the Makefiles created by
{\tt configure}.  Then run {\tt configure} again with the new options,
followed by {\tt make}.

\subsection{Configuring for Linux and Large Files}

32-bit systems running Linux kernel version 2.4.0 or newer and glibc
version 2.2.0 or newer can support files larger than 2~Gbytes in
size.  This support is currently detected and enabled automatically;
we document the manual steps in case the automatic detection does not
work for some reason.

The two macros {\tt \_FILE\_OFFSET\_BITS=64} and
{\tt \_LARGEFILE64\_SOURCE} tell GNU libc that it should support large
files on 32-bit platforms.  The former changes the size of
{\tt off\_t} (no source changes are needed, but this might affect
interoperability with libraries compiled with a different size of
{\tt off\_t}).  The latter exposes the GNU libc functions open64(),
write64(), read64(), etc.  ROMIO does not make use of the 64-bit
system calls directly at this time, but we add this flag for good
measure.

If your Linux system is relatively new, there is an excellent chance
it is running kernel 2.4.0 or newer and glibc-2.2.0 or newer.  Add the
string
\begin{verbatim}
"-D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
\end{verbatim}
to your CFLAGS environment variable before running {\tt ./configure}.

%
% TESTING ROMIO
%
\section{Testing ROMIO}

To test if the installation works, do\\
\hspace*{.4in} {\tt make testing}\\
in the {\tt romio/test} directory.  This calls a script that runs the
test programs and compares the results with what they should be.  By
default, {\tt make testing} causes the test programs to create files
in the current directory and use whatever file system that corresponds
to.  To test with other file systems, you need to specify a filename
in a directory corresponding to that file system as follows:\\
\hspace*{.4in} {\tt make testing TESTARGS="-fname=/foo/piofs/test"}

%
% COMPILING AND RUNNING MPI-IO PROGRAMS
%
\section{Compiling and Running MPI-IO Programs}

If ROMIO is not already included in the MPI implementation, you need
to include the file {\tt mpio.h} for C or {\tt mpiof.h} for Fortran in
your MPI-IO program.

Note that on HP machines running HPUX and on NEC SX-4, you need to
compile Fortran programs with {\tt mpifort}, because {\tt mpif77} does
not support 8-byte integers.

With MPICH, HP MPI, or NEC MPI, you can compile MPI-IO programs as \\
\hspace*{.4in} {\tt mpicc foo.c}\\
or \\
\hspace*{.4in} {\tt mpif77 foo.f}\\
or\\
\hspace*{.4in} {\tt mpifort foo.f}\\
As mentioned above, mpifort is preferred over mpif77 on HPUX and NEC
because the f77 compilers on those machines do not support 8-byte
integers.

With SGI MPI, you can compile MPI-IO programs as \\
\hspace*{.4in} {\tt cc foo.c -lmpi}\\
or \\
\hspace*{.4in} {\tt f77 foo.f -lmpi}\\
or \\
\hspace*{.4in} {\tt f90 foo.f -lmpi}\\
With LAM, you can compile MPI-IO programs as \\
\hspace*{.4in} {\tt hcc foo.c -lmpi}\\
or \\
\hspace*{.4in} {\tt hf77 foo.f -lmpi}\\
If you have built ROMIO with some other MPI implementation, you can
compile MPI-IO programs by explicitly giving the path to the include
file mpio.h or mpiof.h and explicitly specifying the path to the
library libmpio.a, which is located in
{\tt \$(ROMIO\_HOME)/lib/\$(ARCH)/libmpio.a}.
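As a quick check of the toolchain, the following minimal MPI-IO
program can be compiled and run as described above; each process
writes its rank at a distinct offset in a shared file.  The file name
is illustrative, and the {\tt mpio.h} include is needed only when
ROMIO is not part of the MPI implementation.

\begin{verbatim}
#include <mpi.h>
/* #include "mpio.h"   -- only if ROMIO is not part of your MPI */

int main(int argc, char *argv[])
{
    int rank;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset) rank * sizeof(int), &rank, 1,
                      MPI_INT, &status);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
\end{verbatim}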
Run the program as you would run any MPI program on the machine.  If
you use {\tt mpirun}, make sure you use the correct {\tt mpirun} for
the MPI implementation you are using.  For example, if you are using
MPICH on an SGI machine, make sure that you use MPICH's {\tt mpirun}
and not SGI's {\tt mpirun}.

%
% LIMITATIONS
%
\section{Limitations of This Version of ROMIO \label{sec:limit}}

\begin{itemize}
\item When used with any MPI implementation other than MPICH revision
1.2.1 or later, the {\tt status} argument is not filled in by any
MPI-IO function.  Consequently, {\tt MPI\_Get\_count} and\linebreak
{\tt MPI\_Get\_elements} will not work when passed the {\tt status}
object from an MPI-IO operation.

\item Additionally, when used with any MPI implementation other than
MPICH revision 1.2.1 or later, all MPI-IO functions return only two
possible error codes---{\tt MPI\_SUCCESS} on success and
{\tt MPI\_ERR\_UNKNOWN} on failure.

\item This version works only on a homogeneous cluster of machines,
and only the ``native'' file data representation is supported.

\item Shared file pointers are not supported on PVFS and IBM PIOFS
file systems because they don't support {\tt fcntl} file locks, and
ROMIO uses that feature to implement shared file pointers.

\item On HP machines running HPUX and on NEC SX-4, you need to compile
Fortran programs with {\tt mpifort} instead of {\tt mpif77}, because
the {\tt f77} compilers on these machines don't support 8-byte
integers.

\item The file-open mode {\tt MPI\_MODE\_EXCL} does not work on the
Intel PFS file system, due to a bug in PFS.
\end{itemize}

%
% USAGE TIPS
%
\section{Usage Tips}

\begin{itemize}
\item When using ROMIO with SGI MPI, you may sometimes get an error
message from SGI MPI: ``MPI has run out of internal datatype entries.
Please set the environment variable {\tt MPI\_TYPE\_MAX} for
additional space.''  If you get this error message, add the following
line to your {\tt .cshrc} file:\\
\hspace*{.4in} {\tt setenv MPI\_TYPE\_MAX 65536}\\
Use a larger number if you still get the error message.

\item If a Fortran program uses a file handle created using ROMIO's C
interface, or vice versa, you must use the functions
{\tt MPI\_File\_c2f} or {\tt MPI\_File\_f2c} (see \S~4.12.4
in~\cite{mpi97a}).  Such a situation occurs, for example, if a Fortran
program uses an I/O library written in C with MPI-IO calls.  Similar
functions {\tt MPIO\_Request\_f2c} and {\tt MPIO\_Request\_c2f} are
also provided.  (A short sketch of this conversion appears after this
list.)

\item For Fortran programs on the Intel Paragon, you may need to
provide the complete path to {\tt mpif.h} in the {\tt include}
statement, e.g., \\
\hspace*{.4in} {\tt include '/usr/local/mpich/include/mpif.h'}\\
instead of \\
\hspace*{.4in} {\tt include 'mpif.h'}\\
This is because the {\tt -I} option to the Paragon Fortran compiler
{\tt if77} doesn't work correctly.  It always looks in the default
directories first and, therefore, picks up Intel's {\tt mpif.h}, which
is actually the {\tt mpif.h} of an older version of MPICH.
\end{itemize}
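The following sketch shows the C side of such a mixed-language
program; the function and argument names are illustrative (and the
exact symbol name the Fortran compiler expects is system dependent).

\begin{verbatim}
#include <mpi.h>

/* Called from Fortran, which passes its file handle as an integer. */
void write_from_c(MPI_Fint *fortran_fh, int *buf, int *count)
{
    MPI_File fh = MPI_File_f2c(*fortran_fh);  /* Fortran -> C handle */
    MPI_Status status;

    MPI_File_write(fh, buf, *count, MPI_INT, &status);
}
\end{verbatim}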
%
% MAILING LIST
%
% this mailing list has been dead for a while
%
% REPORTING BUGS
%
\section{Reporting Bugs}

If you have trouble, first check the users guide.  Then check if there
is a list of known bugs and patches on the ROMIO web page at
{\tt http://www.mcs.anl.gov/romio}.  Finally, if you still have
problems, send a detailed message containing:\\
\hspace*{.2in}$\bullet$ the type of system (often {\tt uname -a}),\\
\hspace*{.2in}$\bullet$ the output of {\tt configure},\\
\hspace*{.2in}$\bullet$ the output of {\tt make}, and \\
\hspace*{.2in}$\bullet$ any programs or tests\\
to {\tt romio-maint@mcs.anl.gov}.

%
% ROMIO INTERNALS
%
\section{ROMIO Internals}

A key component of ROMIO that enables such a portable MPI-IO
implementation is an internal abstract I/O device layer called
ADIO~\cite{thak96e}.  Most users of ROMIO will not need to deal with
the ADIO layer at all.  However, ADIO is useful to those who want to
port ROMIO to some other file system.  The ROMIO source code and the
ADIO paper~\cite{thak96e} will help you get started.  MPI-IO
implementation issues are discussed in~\cite{thak99b}.  All
ROMIO-related papers are available online at
{\tt http://www.mcs.anl.gov/romio}.

\section{Learning MPI-IO}

The book {\em Using MPI-2: Advanced Features of the Message-Passing
Interface}~\cite{grop99a}, published by MIT Press, provides a tutorial
introduction to all aspects of MPI-2, including parallel I/O.  It has
lots of example programs.  See
{\tt http://www.mcs.anl.gov/mpi/usingmpi2} for further information
about the book.

%
% MAJOR CHANGES IN PREVIOUS RELEASES
%
\section{Major Changes in Previous Releases}

\subsection{Major Changes in Version 1.2.3}
\begin{itemize}
\item Added explicit control over aggregators for collective
operations (see the description of \texttt{cb\_config\_list}).
\item Added the following working hints: \texttt{cb\_config\_list},
\texttt{romio\_cb\_read}, \texttt{romio\_cb\_write},\newline
\texttt{romio\_ds\_read}.  These additional hints have been added but
are currently ignored by the implementation:
\texttt{romio\_ds\_write}, \texttt{romio\_no\_indep\_rw}.
\item Added NTFS ADIO implementation.
\item Added testfs ADIO implementation for use in debugging.
\item Added delete function to the ADIO interface so that file systems
that need to use their own delete function may do so (e.g., PVFS).
\item Changed version numbering to match the version number of the
MPICH release.
\end{itemize}

\subsection{Major Changes in Version 1.0.3}
\begin{itemize}
\item When used with MPICH 1.2.1, the MPI-IO functions return proper
error codes and classes, and the status object is filled in.

\item On SGI's XFS file system, ROMIO can use direct I/O even if the
user's request does not meet the various restrictions needed to use
direct I/O.  ROMIO does this by doing part of the request with
buffered I/O (until all the restrictions are met) and doing the rest
with direct I/O.  (This feature hasn't been tested rigorously.  Please
check for errors.)

By default, ROMIO will use only buffered I/O.  Direct I/O can be
enabled either by setting the environment variables
{\tt MPIO\_DIRECT\_READ} and/or {\tt MPIO\_DIRECT\_WRITE} to
{\tt TRUE}, or on a per-file basis by using the info keys
{\tt direct\_read} and {\tt direct\_write}.

Direct I/O will result in higher performance only if you are accessing
a high-bandwidth disk system.  Otherwise, buffered I/O is better and
is therefore used as the default.

\item Miscellaneous bug fixes.
\end{itemize}

\subsection{Major Changes in Version 1.0.2}
\begin{itemize}
\item Implemented the shared file pointer functions and the split
collective I/O functions.  Therefore, the main components of the MPI
I/O chapter not yet implemented are file interoperability and error
handling.

\item Added support for using ``direct I/O'' on SGI's XFS file system.
Direct I/O is an optional feature of XFS in which data is moved
directly between the user's buffer and the storage devices, bypassing
the file-system cache.  This can improve performance significantly on
systems with high disk bandwidth.  Without high disk bandwidth,
regular I/O (that uses the file-system cache) performs better.  ROMIO,
therefore, does not use direct I/O by default.  The user can turn on
direct I/O (separately for reading and writing) either by using
environment variables or by using MPI's hints mechanism (info).  To
use the environment-variables method, do
\begin{verbatim}
setenv MPIO_DIRECT_READ TRUE
setenv MPIO_DIRECT_WRITE TRUE
\end{verbatim}
To use the hints method, the two keys are {\tt direct\_read} and
{\tt direct\_write}.  By default their values are {\tt false}.  To
turn on direct I/O, set the values to {\tt true}.  The environment
variables have priority over the info keys.  In other words, if the
environment variables are set to {\tt TRUE}, direct I/O will be used
even if the info keys say {\tt false}, and vice versa.  Note that
direct I/O must be turned on separately for reading and writing.  The
environment-variables method assumes that the environment variables
can be read by each process in the MPI job.  This is not guaranteed by
the MPI Standard, but it works with SGI's MPI and the {\tt ch\_shmem}
device of MPICH.

\item Added support (a new ADIO device, {\tt ad\_pvfs}) for the PVFS
parallel file system for Linux clusters, developed at Clemson
University (see {\tt http://www.parl.clemson.edu/pvfs}).  To use it,
you must first install PVFS and then, when configuring ROMIO, specify
{\tt -file\_system=pvfs} in addition to any other options to
{\tt configure}.  (As usual, you can configure for multiple file
systems by using ``{\tt +}''; for example,
{\tt -file\_system=pvfs+ufs+nfs}.)  You will need to specify the path
to the PVFS include files via the {\tt -cflags} option to
{\tt configure}, for example, \newline
{\tt configure -cflags=-I/usr/pvfs/include}.  You will also need to
specify the full path name of the PVFS library.  The best way to do
this is via the {\tt -lib} option to MPICH's {\tt configure} script
(assuming you are using ROMIO from within MPICH).

\item Uses weak symbols (where available) for building the profiling
version, i.e., the PMPI routines.  As a result, the size of the
library is reduced considerably.

\item The Makefiles use {\em virtual paths} if supported by the make
utility.  GNU {\tt make} supports it, for example.  This feature
allows you to untar the distribution in some directory, say a slow NFS
directory, and compile the library (create the .o files) in another
directory, say on a faster local disk.  For example, if the tar file
has been untarred in an NFS directory called {\tt /home/thakur/romio},
one can compile it in a different directory, say {\tt /tmp/thakur}, as
follows:
\begin{verbatim}
cd /tmp/thakur
/home/thakur/romio/configure
make
\end{verbatim}
The .o files will be created in {\tt /tmp/thakur}; the library will be
created in\newline {\tt /home/thakur/romio/lib/\$ARCH/libmpio.a}.
This method works only if the {\tt make} utility supports {\em virtual
paths}.  If the default {\tt make} utility does not, you can install
GNU {\tt make}, which does, and specify it to {\tt configure} as
\begin{verbatim}
/home/thakur/romio/configure -make=/usr/gnu/bin/gmake (or whatever)
\end{verbatim}

\item Lots of miscellaneous bug fixes and other enhancements.

\item This version is included in MPICH 1.2.0.
If you are using MPICH, you need not download ROMIO separately; it
gets built as part of MPICH.  The previous version of ROMIO is
included in LAM, HP MPI, SGI MPI, and NEC MPI.  NEC has also
implemented the MPI-IO functions missing in ROMIO, and therefore NEC
MPI has a complete implementation of MPI-IO.
\end{itemize}

\subsection{Major Changes in Version 1.0.1}
\begin{itemize}
\item This version is included in MPICH 1.1.1 and HP MPI 1.4.

\item Added support for NEC SX-4 and created a new device
{\tt ad\_sfs} for the NEC SFS file system.

\item New devices {\tt ad\_hfs} for the HP HFS file system and
{\tt ad\_xfs} for the SGI XFS file system.

\item Users no longer need to prefix the filename with the type of
file system; ROMIO determines the file-system type on its own.

\item Added support for 64-bit file sizes on IBM PIOFS, SGI XFS, HP
HFS, and NEC SFS file systems.

\item {\tt MPI\_Offset} is an 8-byte integer on machines that support
8-byte integers.  It is of type {\tt long long} in C and
{\tt integer*8} in Fortran.  With a Fortran 90 compiler, you can use
either {\tt integer*8} or {\tt integer(kind=MPI\_OFFSET\_KIND)}.  If
you {\tt printf} an {\tt MPI\_Offset} in C, remember to use
{\tt \%lld} or {\tt \%ld} as required by your compiler.  (See what is
used in the test program {\tt romio/test/misc.c}.)  On some machines,
ROMIO detects at configure time that {\tt long long} is either not
supported by the C compiler or it doesn't work properly.  In such
cases, configure sets {\tt MPI\_Offset} to {\tt long} in C and
{\tt integer} in Fortran.  This happens on Intel Paragon, Sun4, and
FreeBSD.

\item Added support for passing hints to the implementation via the
{\tt MPI\_Info} parameter.  ROMIO understands the following hints
(keys in an {\tt MPI\_Info} object): \texttt{cb\_buffer\_size},
\texttt{cb\_nodes},\newline \texttt{ind\_rd\_buffer\_size},
\texttt{ind\_wr\_buffer\_size} (on all but IBM PIOFS),
\texttt{striping\_factor} (on PFS and PIOFS), \texttt{striping\_unit}
(on PFS and PIOFS), \texttt{start\_iodevice} (on PFS and PIOFS), and
\texttt{pfs\_svr\_buf} (on PFS only).
\end{itemize}

\newpage

\addcontentsline{toc}{section}{References}
\bibliographystyle{plain}
%% these are the "full" bibliography databases
%\bibliography{/homes/thakur/tex/bib/papers,/homes/robl/projects/papers/pario}
% this is the pared-down one containing only those references used in
% users-guide.tex
% to regenerate, uncomment the full databases above, then run
% ~gropp/bin/citetags users-guide.tex | sort | uniq | \
%   ~gropp/bin/citefind - /homes/thakur/tex/bib/papers.bib \
%   /homes/robl/projects/papers/pario
\bibliography{romio}

\end{document}