From 323b9f346c1c57e83ec9f7893f6d9e49db0a27ce Mon Sep 17 00:00:00 2001
From: Jeff Squyres
Date: Fri, 8 Aug 2014 17:18:29 +0000
Subject: [PATCH] usnic: update connectivity checker help message

Show an example of using the btl_usnic_connectivity_map option.

Also, mention that another reason for the "total connectivity failure"
may be due to asymmetric / unexpected routing.

Reviewed by Dave Goodell.

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32465.
---
 opal/mca/btl/usnic/help-mpi-btl-usnic.txt | 53 ++++++++++++++++++-----
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/opal/mca/btl/usnic/help-mpi-btl-usnic.txt b/opal/mca/btl/usnic/help-mpi-btl-usnic.txt
index 6e635bcf20..fdef7cca4c 100644
--- a/opal/mca/btl/usnic/help-mpi-btl-usnic.txt
+++ b/opal/mca/btl/usnic/help-mpi-btl-usnic.txt
@@ -174,9 +174,18 @@ Your MPI job is going to abort now.
   Large message size: %u
 
 Note that this behavior usually indicates that the MTU of some network
-link is too small between these two interfaces. You should verify that
-UDP traffic with payloads up to the "large message size" listed above
-can flow between the specified interfaces on these servers.
+link is too small between these two interfaces.
+
+You should verify that UDP traffic with payloads up to the "large
+message size" listed above can flow between these two interfaces. You
+should also verify that Open MPI is choosing to pair IP interfaces
+consistently. For example:
+
+    mpirun --mca btl_usnic_connectivity_map mymap ...
+
+Check the resulting "mymap*" files to see the exact pairing of IP
+interfaces. Inconsistent results may be indicative of underlying
+network misconfigurations.
 #
 [connectivity error: small bad, large ok]
 The Open MPI usNIC BTL was unable to establish full connectivity
@@ -199,15 +208,28 @@ Your MPI job is going to abort now.
 
 This is a very strange network error, and should not occur in most
 situations. You may be experiencing high amounts of congestion, or
-this may indicate some kind of network misconfiguration. You should
-verify that UDP traffic with payloads up to the "large message size"
-listed above can flow between the specified interfaces on these
-servers.
+this may indicate some kind of network misconfiguration.
+
+You should verify that UDP traffic with payloads up to the "large
+message size" listed above can flow between these two interfaces. You
+should also verify that Open MPI is choosing to pair IP interfaces
+consistently. For example:
+
+    mpirun --mca btl_usnic_connectivity_map mymap ...
+
+Check the resulting "mymap*" files to see the exact pairing of IP
+interfaces. Inconsistent results may be indicative of underlying
+network misconfigurations.
 #
 [connectivity error: small bad, large bad]
 The Open MPI usNIC BTL was unable to establish any connectivity
 between at least one pair of interfaces on servers in the MPI job.
-Specifically, no UDP messages seemed to flow between the interfaces.
+This can happen for several reasons, including:
+
+1. No UDP traffic is able to flow between the interfaces listed below.
+2. There is asymmetric routing between the interfaces listed below,
+   leading Open MPI to discard UDP traffic it thinks is from an
+   unexpected source.
 
 Your MPI job is going to abort now.
 
@@ -223,9 +245,18 @@ Your MPI job is going to abort now.
   Large message size: %u
 
 Note that this behavior usually indicates some kind of network
-misconfiguration. You should verify that UDP traffic with payloads up
-to the "large message size" listed above can flow between the
-specified interfaces on these servers.
+misconfiguration.
+
+You should verify that UDP traffic with payloads up to the "large
+message size" listed above can flow between these two interfaces. You
+should also verify that Open MPI is choosing to pair IP interfaces
+consistently. For example:
+
+    mpirun --mca btl_usnic_connectivity_map mymap ...
+
+Check the resulting "mymap*" files to see the exact pairing of IP
+interfaces. Inconsistent results may be indicative of underlying
+network misconfigurations.
 #
 [ibv_create_ah timeout]
 The usnic BTL failed to create addresses for remote peers within the