openmpi/opal/mca/btl/tcp/help-mpi-btl-tcp.txt
Jeff Squyres 1953e3406f btl/tcp: add show_help message when peer hangs up
We commonly see messages on the users list where a peer has hung up
because it has crashed.  Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-06 09:40:03 -04:00

94 lines
2.6 KiB
Plaintext

# -*- text -*-
#
# Copyright (c) 2009-2016 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2015-2016 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for Open MPI's TCP support
# (the TCP BTL).
#
[invalid if_inexclude]
WARNING: An invalid value was given for btl_tcp_if_%s. This
value will be ignored.
Local host: %s
Value: %s
Message: %s
#
[invalid minimum port]
WARNING: An invalid value was given for btl_tcp_port_min_%s. Legal
values are in the range [1 .. 2^16-1]. This value will be ignored
(reset to the default value of 1024).
Local host: %s
Value: %d
#
[client connect fail]
WARNING: Open MPI failed to open a TCP connection to a peer MPI
process. This should not happen.
Your Open MPI job may now fail.
Local host: %s
PID: %d
Message: %s
Error: %s (%d)
#
[client handshake fail]
WARNING: Open MPI failed to handshake with a connecting peer MPI
process over TCP. This should not happen.
Your Open MPI job may now fail.
Local host: %s
PID: %d
Message: %s
#
[accept failed]
WARNING: The accept(2) system call failed on a TCP socket. While this
should generally never happen on a well-configured HPC system, the
most common causes when it does occur are:
* The process ran out of file descriptors
* The operating system ran out of file descriptors
* The operating system ran out of memory
Your Open MPI job will likely hang (or crash) until the cause of the
failure is resolved (e.g., more file descriptors and/or memory become
available), and may eventually time out or abort.
Local host: %s
PID: %d
Errno: %d (%s)
#
[unsuported progress thread]
WARNING: Support for the TCP progress thread has not been compiled in.
Falling back to normal progress.
Local host: %s
Value: %s
Message: %s
#
[peer hung up]
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: %s
Local PID: %d
Peer host: %s
#