% This commit was SVN r7999.
% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
% University Research and Technology
% Corporation. All rights reserved.
% Copyright (c) 2004-2005 The University of Tennessee and The University
% of Tennessee Research Foundation. All rights
% reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
% University of Stuttgart. All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
% All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%
\chapter{Troubleshooting}
\label{sec:troubleshooting}

{\Huge JMS Not bad, but needs tweaking}

Although Open MPI is a robust run-time environment, and its MPI layer
is a mature software system, errors do occur. Particularly when using
Open MPI for the first time, some of the initial, per-user setup can
be confusing (e.g., setting up \ifile{.rhosts} or SSH keys for
password-less remote logins). This chapter aims to identify a few
common problems and their solutions.

Much more information can be found in the Open MPI FAQ on the main
Open MPI web site.\footnote{\url{http://www.open-mpi.org/faq/}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{The Open MPI Mailing Lists}
\label{troubleshooting:mailing-lists}
\index{e-mail lists}
\index{mailing lists}
\index{listserv mailing lists}

There are two mailing lists: one for Open MPI announcements, and
another for questions and user discussion of Open MPI.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Announcements}

This is a low-volume list that is used to announce new versions of
Open MPI, important patches, etc. To subscribe to the Open MPI announcement
list, visit its list information page (you can also use that page to
unsubscribe or change your subscription options):

\vspace{11pt}

\centerline{\url{http://www.lam-mpi.org/mailman/listinfo.cgi/lam-announce}}

\vspace{11pt}

\noindent {\bf NOTE: Users cannot post to this list; all such posts
  are automatically rejected -- only the Open MPI Team can post to this
  list.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{General Discussion / User Questions}

{\bf BEFORE YOU POST TO THIS LIST:} {\em Please} check all the other
resources listed in this chapter first. Search the mailing list to
see if anyone else has already had a similar problem. Re-read the
error message that Open MPI displayed to you (Open MPI can sometimes give {\em
  incredibly} detailed error messages that tell you {\em exactly} how
to fix the problem). This, unfortunately, does not stop some users
from cutting and pasting the entire error message, verbatim (including the
solution to their problem), into a mail message, sending it to the
list, and asking ``How do I fix this problem?'' So please: think (and
read) before you post.\footnote{Our deep apologies if some of the
  information in this section appears to be repetitive and
  condescending. Believe us when we say that we have tried all other
  approaches -- some users simply either do not read the information
  provided, or only read the e-mail address to send ``help!'' e-mails
  to. It is our hope that big, bold print will catch some people's
  eyes and enable them to help themselves rather than having to wait
  for their post to distribute around the world and then further wait
  for someone to reply telling them that the solution to their problem
  was already printed on their screen. Thanks for your time in
  reading all of this!}
\vspace{11pt}

This list is used for general questions and discussion of Open MPI.
Users can post questions, comments, etc. to this list. {\bf Due to
  recent increases in spam, only subscribers are allowed to post to
  the list}. If you are not subscribed to the list, your posts will
be discarded.

To subscribe to or unsubscribe from the list, visit the list information
page:

\vspace{11pt}
\centerline{\url{http://www.lam-mpi.org/mailman/listinfo.cgi/lam}}
\vspace{11pt}

After you have subscribed (and received a confirmation e-mail), you
can send mail to the list at the following address:

\vspace{11pt}
\centerline{{\bf You must be subscribed in order to post to the list}}
\centerline{\url{lam@lam-mpi.org}}
\centerline{{\bf You must be subscribed in order to post to the list}}
\vspace{11pt}

Be sure to include the following information in your e-mail:
\begin{itemize}
\item The \file{config.log} file from the top-level Open MPI directory, if
  available ({\bf please compress!}).

\item The output of ``\icmd{laminfo}\ \ \cmdarg{-all}''.

\item A {\em detailed} description of what is failing. The more
  details that you provide, the better. E-mails saying ``My
  application doesn't work!'' will inevitably be answered with
  requests for more information about {\em exactly what doesn't work};
  so please include as much detailed information in your initial
  e-mail as possible.
\end{itemize}

{\bf NOTE:} People tend to reply only to the list; if you subscribe,
post, and then unsubscribe from the list, you will likely miss
replies.

Also please be aware that the list goes to several hundred people
around the world -- it is not uncommon to move a high-volume exchange
off the list, and only post the final resolution of the problem/bug
fix to the list. This prevents exchanges like ``Did you try X?'',
``Yes, I tried X, and it did not work.'', ``Did you try Y?'', etc.
from cluttering up people's inboxes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Open MPI Run-Time Environment Problems}

Some common problems with the Open MPI run-time environment are listed
below.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Problems with the \icmd{lamboot} Command}

Many first-time Open MPI users do not have their environment
configured correctly for Open MPI to boot. Refer to
Section~\ref{sec:getting-started-lamboot} for the list of conditions
that Open MPI requires in order to boot. User problems with \cmd{lamboot}
typically fall into one of the following categories:
\begin{itemize}
\item \cmd{rsh}/\cmd{ssh} is not set up properly for password-less
  logins to remote nodes.

  {\bf Solution:} Set up \cmd{rsh}/\cmd{ssh} properly for
  password-less remote logins. Consult local documentation or
  internet tutorials for how to set up \file{\$HOME/.rhosts} and SSH
  keys. Note that the Open MPI Team {\bf STRONGLY} discourages the use of
  \cmdarg{+} in \file{.rhosts} or \file{hosts.equiv} files!
\item \cmd{rsh}/\cmd{ssh} prints something on \file{stderr}.

  {\bf Solution:} Clean up system or user ``dot'' files so that
  nothing is printed on \file{stderr} during a remote login.
\item An Open MPI daemon is unable to open a connection back to
  \cmd{lamboot}.

  {\bf Solution:} Many Linux distributions ship with firewalls
  enabled. Open MPI uses random TCP ports to communicate, and
  therefore the firewall must be either disabled or opened between
  machines that will be using Open MPI.
\item Open MPI is unable to open a session directory.

  {\bf Solution:} Open MPI needs to use a per-user, per-session temporary
  directory, typically located under \file{/tmp} (see
  Section~\ref{sec:misc-session-directory},
  page~\pageref{sec:misc-session-directory}). Open MPI must be able to
  read/write in this session directory; check permissions in this
  tree.
\item Open MPI is unable to find the current host in the boot schema.

  {\bf Solution:} Open MPI can only boot a universe that includes the
  current node. If the current node is not listed in the hostfile, or
  is not listed by a name that can be resolved and identified as the
  current node, \cmd{lamboot} (and friends) will abort.
\item Open MPI is unable to resolve all names in the boot schema.

  {\bf Solution:} All names in the boot schema must be resolvable by
  the boot SSI module that is being used. This typically means that
  the boot schema contains IP hostnames that must be resolved to IP
  addresses. Resolution can occur by any valid OS mechanism (e.g.,
  through DNS, local file lookup, etc.). Note that the name ``{\tt
    localhost}'' (or any name that resolves to 127.0.0.1) cannot be
  used in a boot schema that includes more than one host -- otherwise
  the other nodes in the resulting Open MPI universe will not be able to
  contact that host.
\end{itemize}
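
The name-resolution rules in the last item can be checked by hand
before booting. The following Python sketch is one possible way to do
so; it is not part of Open MPI, and the function names ({\tt resolve},
{\tt is\_loopback}, {\tt check\_boot\_schema}) are hypothetical names
chosen for this illustration. It assumes that the host names have
already been extracted from the boot schema into a list, and it uses
the operating system's resolver, which may not behave identically to
the boot SSI module in use.

\begin{verbatim}
import ipaddress
import socket


def resolve(name):
    """Return the set of IP address strings that `name` resolves to.

    Uses the operating system resolver (DNS, local files, etc.).
    """
    infos = socket.getaddrinfo(name, None)
    return {info[4][0] for info in infos}


def is_loopback(address):
    """Return True if `address` is a loopback address (e.g. 127.0.0.1)."""
    # Drop any IPv6 zone identifier ("%eth0") before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_loopback


def check_boot_schema(names, resolver=resolve):
    """Return a list of (name, problem) pairs for the given host names.

    A name is reported if it cannot be resolved, or if it resolves to
    a loopback address while the list contains more than one host.
    """
    problems = []
    for name in names:
        try:
            addresses = resolver(name)
        except OSError as exc:  # socket.gaierror is an OSError
            problems.append((name, "cannot be resolved: %s" % exc))
            continue
        if len(names) > 1 and any(is_loopback(a) for a in addresses):
            problems.append((name, "resolves to a loopback address"))
    return problems
\end{verbatim}

For example, passing a list of two or more host names that includes
``{\tt localhost}'' to {\tt check\_boot\_schema} would be expected to
report that name, whereas a list containing only ``{\tt localhost}''
would not. The {\tt resolver} argument exists only so that the check
can be exercised without network access.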
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{MPI Problems}

For the most part, Open MPI implements the MPI standard similarly to other
MPI implementations. Hence, most MPI programmers are not too
surprised by how Open MPI handles various errors, etc. However, there are
some cases that Open MPI handles in its own unique fashion. In these cases,
Open MPI tries to display a helpful message discussing what happened.

Here is some more background on a few of the messages:
\begin{itemize}
\item ``One of the processes started by mpirun has exited with a
  nonzero exit code.''

  This means that at least one MPI process has exited after invoking
  \mpifunc{MPI\_\-INIT}, but before invoking
  \mpifunc{MPI\_\-FINALIZE}. Open MPI treats this as an error and
  will abort the entire MPI application. The last line of the error
  message indicates the PID, node, and exit status of the failed
  process.
\item ``MPI\_{\tt <function>}: process in local group is dead (rank
  {\tt <N>}, MPI\_\-COMM\_\-WORLD)''

  This means that some MPI function tried to communicate with a peer
  MPI process and discovered that the peer process is dead. Common
  causes of this problem include attempting to communicate with
  processes that have failed (which, in some cases, won't generate the
  ``One of the processes started by mpirun has exited...'' message),
  or that have already invoked \mpifunc{MPI\_\-FINALIZE}. Do not
  initiate communication that could involve processes that have
  already invoked \mpifunc{MPI\_\-FINALIZE}. This may include using
  \mpiconst{MPI\_\-ANY\_\-SOURCE} or collectives on communicators that
  include processes that have already finalized.
\end{itemize}