% -*- latex -*-
%
% Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
%                         University Research and Technology
%                         Corporation.  All rights reserved.
% Copyright (c) 2004-2005 The University of Tennessee and The University
%                         of Tennessee Research Foundation.  All rights
%                         reserved.
% Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
%                         University of Stuttgart.  All rights reserved.
% Copyright (c) 2004-2005 The Regents of the University of California.
%                         All rights reserved.
% $COPYRIGHT$
%
% Additional copyrights may follow
%
% $HEADER$
%

\chapter{Troubleshooting}
\label{sec:troubleshooting}

% JMS: Not bad, but needs tweaking

Although Open MPI is a robust run-time environment, and its MPI layer
is a mature software system, errors do occur.  Particularly when using
Open MPI for the first time, some of the initial, per-user setup can
be confusing (e.g., setting up \ifile{.rhosts} or SSH keys for
password-less remote logins).  This chapter identifies a few common
problems and their solutions.  Much more information can be found in
the Open MPI FAQ on the main Open MPI web
site.\footnote{\url{http://www.open-mpi.org/faq/}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{The Open MPI Mailing Lists}
\label{troubleshooting:mailing-lists}
\index{e-mail lists}
\index{mailing lists}
\index{listserv mailing lists}

There are two mailing lists: one for Open MPI announcements, and
another for questions and user discussion of Open MPI.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Announcements}

This is a low-volume list that is used to announce new versions of
Open MPI, important patches, etc.
To subscribe to the Open MPI announcement list, visit its list
information page (you can also use that page to unsubscribe or change
your subscription options):

\vspace{11pt}
\centerline{\url{http://www.lam-mpi.org/mailman/listinfo.cgi/lam-announce}}
\vspace{11pt}

\noindent {\bf NOTE: Users cannot post to this list; all such posts
  are automatically rejected -- only the Open MPI Team can post to
  this list.}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{General Discussion / User Questions}

{\bf BEFORE YOU POST TO THIS LIST:} {\em Please} check all the other
resources listed in this chapter first.  Search the mailing list
to see if anyone else had a similar problem before you did.  Re-read
the error message that Open MPI displayed to you (Open MPI can
sometimes give {\em incredibly} detailed error messages that tell you
{\em exactly} how to fix the problem).  This, unfortunately, does not
stop some users from cutting and pasting the entire error message,
verbatim (including the solution to their problem), into a mail
message, sending it to the list, and asking ``How do I fix this
problem?''  So please: think (and read) before you
post.\footnote{Our deep apologies if some of the information in this
  section appears to be repetitive and condescending.  Believe us when
  we say that we have tried all other approaches -- some users simply
  either do not read the information provided, or only read the e-mail
  address to send ``help!'' e-mails to.  It is our hope that big, bold
  print will catch some people's eyes and enable them to help
  themselves rather than having to wait for their post to distribute
  around the world and then further wait for someone to reply telling
  them that the solution to their problem was already printed on their
  screen.  Thanks for your time in reading all of this!}

\vspace{11pt}

This list is used for general questions and discussion of Open MPI.
Users can post questions, comments, etc. to this list.
{\bf Due to recent increases in spam, only subscribers are allowed to
  post to the list}.  If you are not subscribed to the list, your
posts will be discarded.  To subscribe to or unsubscribe from the
list, visit the list information page:

\vspace{11pt}
\centerline{\url{http://www.lam-mpi.org/mailman/listinfo.cgi/lam}}
\vspace{11pt}

After you have subscribed (and received a confirmation e-mail), you
can send mail to the list at the following address:

\vspace{11pt}
\centerline{{\bf You must be subscribed in order to post to the list}}
\centerline{\url{lam@lam-mpi.org}}
\centerline{{\bf You must be subscribed in order to post to the list}}
\vspace{11pt}

Be sure to include the following information in your e-mail:

\begin{itemize}
\item The \file{config.log} file from the top-level Open MPI
  directory, if available ({\bf please compress!}).

\item The output of ``\icmd{laminfo}\ \ \cmdarg{-all}''.

\item A {\em detailed} description of what is failing.  The more
  details that you provide, the better.  E-mails saying ``My
  application doesn't work!'' will inevitably be answered with
  requests for more information about {\em exactly what doesn't
    work}, so please include as much detailed information in your
  initial e-mail as possible.
\end{itemize}

{\bf NOTE:} People tend to reply only to the list; if you subscribe,
post, and then unsubscribe from the list, you will likely miss
replies.

Also please be aware that the list goes to several hundred people
around the world -- it is not uncommon to move a high-volume exchange
off the list, and only post the final resolution of the problem/bug
fix to the list.  This prevents exchanges like ``Did you try X?'',
``Yes, I tried X, and it did not work.'', ``Did you try Y?'',
etc. from cluttering up people's inboxes.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Open MPI Run-Time Environment Problems}

Some common problems with the Open MPI run-time environment are listed
below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Problems with the \icmd{lamboot} Command}

Many first-time Open MPI users do not have their environment
configured correctly for Open MPI to boot.  Refer to
Section~\ref{sec:getting-started-lamboot} for the list of conditions
that Open MPI requires to boot properly.  User problems with
\cmd{lamboot} typically fall into one of the following categories:

\begin{itemize}
\item \cmd{rsh}/\cmd{ssh} is not set up properly for password-less
  logins to remote nodes.

  {\bf Solution:} Set up \cmd{rsh}/\cmd{ssh} properly for
  password-less remote logins.  Consult local documentation or
  internet tutorials for how to set up \file{\$HOME/.rhosts} and SSH
  keys.  Note that the Open MPI Team {\bf STRONGLY} discourages the
  use of \cmdarg{+} in \file{.rhosts} or \file{hosts.equiv} files!

\item \cmd{rsh}/\cmd{ssh} prints something on \file{stderr}.

  {\bf Solution:} Clean up system or user ``dot'' files so that
  nothing is printed on \file{stderr} during a remote login.

\item An Open MPI daemon is unable to open a connection back to
  \cmd{lamboot}.

  {\bf Solution:} Many Linux distributions ship with firewalls
  enabled.  Open MPI uses random TCP ports to communicate, and
  therefore firewalls must be either disabled or configured to allow
  traffic between the machines that will be using Open MPI.

\item Open MPI is unable to open a session directory.

  {\bf Solution:} Open MPI needs to use a per-user, per-session
  temporary directory, typically located under \file{/tmp} (see
  Section~\ref{sec:misc-session-directory},
  page~\pageref{sec:misc-session-directory}).  Open MPI must be able
  to read and write in this session directory; check the permissions
  in this tree.

\item Open MPI is unable to find the current host in the boot schema.

  {\bf Solution:} Open MPI can only boot a universe that includes the
  current node.  If the current node is not listed in the hostfile, or
  is not listed by a name that can be resolved and identified as the
  current node, \cmd{lamboot} (and friends) will abort.

\item Open MPI is unable to resolve all names in the boot schema.

  {\bf Solution:} All names in the boot schema must be resolvable by
  the boot SSI module that is being used.  This typically means that
  the names are IP hostnames that must be resolvable to IP addresses.
  Resolution can occur by any valid OS mechanism (e.g., through DNS,
  local file lookup, etc.).  Note that the name ``{\tt localhost}''
  (or any address that resolves to 127.0.0.1) cannot be used in a boot
  schema that includes more than one host -- otherwise the other nodes
  in the resulting Open MPI universe will not be able to contact that
  host.
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{MPI Problems}

For the most part, Open MPI implements the MPI standard similarly to
other MPI implementations.  Hence, most MPI programmers are not too
surprised by how Open MPI handles various errors.  However, there are
some cases that Open MPI handles in its own unique fashion.  In these
cases, Open MPI tries to display a helpful message discussing what
happened.

Here is some more background on a few of the messages:

\begin{itemize}
\item ``One of the processes started by mpirun has exited with a
  nonzero exit code.''

  This means that at least one MPI process has exited after invoking
  \mpifunc{MPI\_\-INIT}, but before invoking
  \mpifunc{MPI\_\-FINALIZE}.  This is therefore an error, and Open MPI
  will abort the entire MPI application.  The last line of the error
  message indicates the PID, node, and exit status of the failed
  process.

\item ``MPI\_{\tt }: process in local group is dead (rank {\tt },
  MPI\_\-COMM\_\-WORLD)''

  This means that some MPI function tried to communicate with a peer
  MPI process and discovered that the peer process is dead.  Common
  causes of this problem include attempting to communicate with
  processes that have failed (which, in some cases, will not generate
  the ``One of the processes started by mpirun has exited...''
  message) or that have already invoked \mpifunc{MPI\_\-FINALIZE}.
  Do not initiate communication that could involve processes that
  have already invoked \mpifunc{MPI\_\-FINALIZE}.  This may include
  using \mpiconst{MPI\_\-ANY\_\-SOURCE} or collectives on
  communicators that include processes that have already finalized.
\end{itemize}
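
\noindent The following minimal C sketch shows one way to structure a
program so that neither of the messages above is triggered.  It is an
illustration only and is not taken from the Open MPI distribution; the
message tag ({\tt 0}) and the payload ({\tt rank * 10}) are arbitrary
choices made for this example.  Every process returns zero only after
invoking \mpifunc{MPI\_\-FINALIZE}, and every send is matched by
exactly one receive that names its source explicitly, so no process
is left waiting on a peer that has already finalized.

\begin{verbatim}
/* Illustrative sketch only; tag 0 and the payload are arbitrary. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value, src;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Post exactly one receive per peer and name the source
           explicitly rather than using MPI_ANY_SOURCE. */
        for (src = 1; src < size; ++src) {
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", value, src);
        }
    } else {
        /* Each non-root rank sends a single message to rank 0. */
        value = rank * 10;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    /* All communication is complete; every process finalizes and
       then exits with status zero. */
    MPI_Finalize();
    return 0;
}
\end{verbatim}

If a process must terminate early because of an application error,
invoking \mpifunc{MPI\_\-ABORT} is the standard way to bring the whole
job down deliberately, rather than letting a single process exit on
its own between \mpifunc{MPI\_\-INIT} and \mpifunc{MPI\_\-FINALIZE}.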