openmpi/doc/user/troubleshooting.tex
Jeff Squyres 42ec26e640 Update the copyright notices for IU and UTK.
This commit was SVN r7999.
2005-11-05 19:57:48 +00:00


% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
% University Research and Technology
% Corporation. All rights reserved.
% Copyright (c) 2004-2005 The University of Tennessee and The University
% of Tennessee Research Foundation. All rights
% reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
% University of Stuttgart. All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
% All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%
\chapter{Troubleshooting}
\label{sec:troubleshooting}
{\Huge JMS Not bad, but needs tweaking}
Although Open MPI is a robust run-time environment, and its MPI layer
is a mature software system, errors do occur. Particularly when using
Open MPI for the first time, some of the initial, per-user setup can
be confusing (e.g., setting up \ifile{.rhosts} or SSH keys for
password-less remote logins). This section aims to identify a few
common problems and solutions.
Much more information can be found in the Open MPI FAQ on the main
Open MPI web site.\footnote{\url{http://www.open-mpi.org/faq/}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Open MPI Mailing Lists}
\label{troubleshooting:mailing-lists}
\index{e-mail lists}
\index{mailing lists}
\index{listserv mailing lists}
There are two mailing lists: one for Open MPI announcements, and
another for questions and user discussion of Open MPI.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Announcements}
This is a low-volume list that is used to announce new versions of
Open MPI, important patches, etc. To subscribe to the Open MPI announcement
list, visit its list information page (you can also use that page to
unsubscribe or change your subscription options):
\vspace{11pt}
\centerline{\url{http://www.lam-mpi.org/mailman/listinfo.cgi/lam-announce}}
\vspace{11pt}
\noindent {\bf NOTE: Users cannot post to this list; all such posts
are automatically rejected -- only the Open MPI Team can post to this
list.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{General Discussion / User Questions}
{\bf BEFORE YOU POST TO THIS LIST:} {\em Please} check all the other
resources listed in this chapter first. Search the mailing list to
see if anyone else had a similar problem before you did. Re-read the
error message that Open MPI displayed to you (Open MPI can sometimes give {\em
incredibly} detailed error messages that tell you {\em exactly} how
to fix the problem). This, unfortunately, does not stop some users
from cut-n-pasting the entire error message, verbatim (including the
solution to their problem) into a mail message, sending it to the
list, and asking ``How do I fix this problem?'' So please: think (and
read) before you post.\footnote{Our deep apologies if some of the
information in this section appears to be repetitive and
condescending. Believe us when we say that we have tried all other
approaches -- some users simply either do not read the information
provided, or only read the e-mail address to send ``help!'' e-mails
to. It is our hope that big, bold print will catch some people's
eyes and enable them to help themselves rather than having to wait
for their post to distribute around the world and then further wait
for someone to reply telling them that the solution to their problem
was already printed on their screen. Thanks for your time in
reading all of this!}
\vspace{11pt}
This list is used for general questions and discussion of Open MPI.
Users can post questions, comments, etc. to this list. {\bf Due to
recent increases in spam, only subscribers are allowed to post to
the list}. If you are not subscribed to the list, your posts will
be discarded.
To subscribe to or unsubscribe from the list, visit the list information
page:
\vspace{11pt}
\centerline{\url{http://www.lam-mpi.org/mailman/listinfo.cgi/lam}}
\vspace{11pt}
After you have subscribed (and received a confirmation e-mail), you
can send mail to the list at the following address:
\vspace{11pt}
\centerline{{\bf You must be subscribed in order to post to the list}}
\centerline{\url{lam@lam-mpi.org}}
\centerline{{\bf You must be subscribed in order to post to the list}}
\vspace{11pt}
Be sure to include the following information in your e-mail:
\begin{itemize}
\item The \file{config.log} file from the top-level Open MPI directory, if
available ({\bf please compress!}).
\item The output of ``\icmd{laminfo}\ \ \cmdarg{-all}''.
\item A {\em detailed} description of what is failing. The more
details that you provide, the better. E-mails saying ``My
application doesn't work!'' will inevitably be answered with
requests for more information about {\em exactly what doesn't work};
so please include as much detailed information in your initial
e-mail as possible.
\end{itemize}
{\bf NOTE:} People tend to only reply to the list; if you subscribe,
post, and then unsubscribe from the list, you will likely miss
replies.
Also please be aware that the list goes to several hundred people
around the world -- it is not uncommon to move a high-volume exchange
off the list, and only post the final resolution of the problem/bug
fix to the list. This prevents exchanges like ``Did you try X?'',
``Yes, I tried X, and it did not work.'', ``Did you try Y?'', etc.
from cluttering up people's inboxes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Open MPI Run-Time Environment Problems}
Some common problems with the Open MPI run-time environment are listed
below.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Problems with the \icmd{lamboot} Command}
Many first-time Open MPI users do not have their environment
configured correctly for Open MPI to boot. Refer to
Section~\ref{sec:getting-started-lamboot} for the list of conditions
that Open MPI requires in order to boot. User problems with \cmd{lamboot}
typically fall into one of the following categories:
\begin{itemize}
\item \cmd{rsh}/\cmd{ssh} is not set up properly for password-less
logins to remote nodes.
{\bf Solution:} Set up \cmd{rsh}/\cmd{ssh} properly for
password-less remote logins. Consult local documentation or
internet tutorials for how to set up \file{\$HOME/.rhosts} and SSH
keys. Note that the Open MPI Team {\bf STRONGLY} discourages the use of
\cmdarg{+} in \file{.rhosts} or \file{hosts.equiv} files!
\item \cmd{rsh}/\cmd{ssh} prints something on \file{stderr}.
{\bf Solution:} Clean up system or user ``dot'' files so that
nothing is printed on \file{stderr} during a remote login.
\item An Open MPI daemon is unable to open a connection back to
\cmd{lamboot}.
{\bf Solution:} Many Linux distributions ship with firewalls
enabled. Open MPI uses random TCP ports to communicate, and
therefore the firewall must be either disabled or opened between
the machines that will be using Open MPI.
\item Open MPI is unable to open a session directory.
{\bf Solution:} Open MPI needs to use a per-user, per-session temporary
directory, typically located under \file{/tmp} (see
Section~\ref{sec:misc-session-directory},
page~\pageref{sec:misc-session-directory}). Open MPI must be able to
read/write in this session directory; check permissions in this
tree.
\item Open MPI is unable to find the current host in the boot schema.
{\bf Solution:} Open MPI can only boot a universe that includes the
current node. If the current node is not listed in the hostfile, or
is not listed by a name that can be resolved and identified as the
current node, \cmd{lamboot} (and friends) will abort.
\item Open MPI is unable to resolve all names in the boot schema.
{\bf Solution:} All names in the boot schema must be resolvable by
the boot SSI module that is being used. In practice, this typically
means that the names are IP hostnames that must resolve to IP
addresses. Resolution can occur by any valid OS mechanism (e.g.,
through DNS, local file lookup, etc.). Note that the name ``{\tt
localhost}'' (or any address that resolves to 127.0.0.1) cannot be
used in a boot schema that includes more than one host -- otherwise
the other nodes in the resulting Open MPI universe will not be able to
contact that host.
\end{itemize}
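
Two of the conditions above (a usable session directory, and boot
schema names that resolve to something other than the loopback
address) can be checked before booting. The following C fragment is
a minimal sketch of such checks; it is not part of Open MPI, and the
helper names \texttt{dir\_is\_usable} and \texttt{classify\_host} are
hypothetical. It assumes a POSIX system and, for brevity, considers
only IPv4 addresses.

\begin{verbatim}
#define _POSIX_C_SOURCE 200112L

#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the current user can read,
 * write, and search the directory at `path`, 0 otherwise. */
static int dir_is_usable(const char *path)
{
    return access(path, R_OK | W_OK | X_OK) == 0;
}

/* Hypothetical helper: classify one boot schema name.
 * Returns -1 if the name does not resolve to an IPv4 address,
 *          1 if any resolved address lies in 127.0.0.0/8,
 *          0 otherwise. */
static int classify_host(const char *name)
{
    struct addrinfo hints, *res, *p;
    int loopback = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* IPv4 only (assumption) */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(name, NULL, &hints, &res) != 0) {
        return -1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        const struct sockaddr_in *sin =
            (const struct sockaddr_in *) p->ai_addr;
        /* The first octet of every loopback address is 127. */
        if ((ntohl(sin->sin_addr.s_addr) >> 24) == 127) {
            loopback = 1;
        }
    }
    freeaddrinfo(res);
    return loopback;
}
\end{verbatim}

A name for which \texttt{classify\_host} returns 1 should not appear
in a boot schema that lists more than one host, and a return value of
$-1$ indicates a name that the OS cannot resolve. Note that
\texttt{dir\_is\_usable("/tmp")} only checks the parent directory;
the permissions of an existing session directory beneath it still
need to be inspected separately.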
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{MPI Problems}
For the most part, Open MPI implements the MPI standard similarly to other
MPI implementations. Hence, most MPI programmers are not too
surprised by how Open MPI handles various errors, etc. However, there are
some cases that Open MPI handles in its own unique fashion. In these cases
Open MPI tries to display a helpful message discussing what happened.
Here is some more background on a few of these messages:
\begin{itemize}
\item ``One of the processes started by mpirun has exited with a
nonzero exit code.''
This means that at least one MPI process has exited after invoking
\mpifunc{MPI\_\-INIT}, but before invoking
\mpifunc{MPI\_\-FINALIZE}. This is therefore an error, and Open MPI
will abort the entire MPI application. The last line of the error
message indicates the PID, node, and exit status of the failed
process.
\item ``MPI\_{\tt <function>}: process in local group is dead (rank
{\tt <N>}, MPI\_\-COMM\_\-WORLD)''
This means that some MPI function tried to communicate with a peer
MPI process and discovered that the peer process is dead. Common
causes of this problem include attempting to communicate with
processes that have failed (which, in some cases, will not generate the
``One of the processes started by mpirun has exited...'' message)
or that have already invoked \mpifunc{MPI\_\-FINALIZE}. Do not
initiate communication that could involve processes that have
already invoked \mpifunc{MPI\_\-FINALIZE}. This may include using
\mpiconst{MPI\_\-ANY\_\-SOURCE} or collectives on communicators that
include processes that have already finalized.
\end{itemize}
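
The ordering rule described in the second item above can be
illustrated with a short MPI program. This is a minimal sketch that
uses only standard MPI calls; it is not code taken from Open MPI, and
the message pattern (every non-zero rank sends one integer to rank 0)
is an assumption chosen purely for illustration.

\begin{verbatim}
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int i;
        /* Post exactly one receive per sender, so that no
         * MPI_ANY_SOURCE receive is still pending once the
         * senders have finalized. */
        for (i = 1; i < size; ++i) {
            MPI_Recv(&token, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {
        token = rank;
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    /* All communication is complete before any process
     * finalizes. */
    MPI_Finalize();

    /* Return zero only after MPI_Finalize has been invoked. */
    return 0;
}
\end{verbatim}

Because rank 0 posts exactly as many receives as there are senders, no
receive is left outstanding when any process invokes
\mpifunc{MPI\_\-FINALIZE}, and every process exits with a zero status
only after finalizing.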