1
1

btl/tcp: add show_help message when peer hangs up

We commonly see messages on the users list where a peer has hung up
because it has crashed.  Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Этот коммит содержится в:
Jeff Squyres 2016-09-02 17:44:07 -04:00
родитель 95c6f6cfc0
Коммит 1953e3406f
2 изменённых файлов: 32 добавлений и 1 удалений

Просмотреть файл

@ -14,7 +14,7 @@
* reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2015 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2015-2016 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@ -44,8 +44,12 @@
#include "opal/opal_socket_errno.h"
#include "opal/mca/btl/base/btl_base_error.h"
#include "opal/util/show_help.h"
#include "btl_tcp_frag.h"
#include "btl_tcp_endpoint.h"
#include "btl_tcp_proc.h"
static void mca_btl_tcp_frag_eager_constructor(mca_btl_tcp_frag_t* frag)
{
@ -225,6 +229,16 @@ bool mca_btl_tcp_frag_recv(mca_btl_tcp_frag_t* frag, int sd)
btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
mca_btl_tcp_endpoint_close(btl_endpoint);
return false;
case ECONNRESET:
opal_show_help("help-mpi-btl-tcp.txt", "peer hung up",
true, opal_process_info.nodename,
getpid(),
btl_endpoint->endpoint_proc->proc_opal->proc_hostname);
btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
mca_btl_tcp_endpoint_close(btl_endpoint);
return false;
default:
BTL_ERROR(("mca_btl_tcp_frag_recv: readv failed: %s (%d)",
strerror(opal_socket_errno),

Просмотреть файл

@ -74,3 +74,20 @@ Fall back to the normal progress.
Local host: %s
Value: %s
Message: %s
#
[peer hung up]
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: %s
Local PID: %d
Peer host: %s
#