commit 1ea7bad5a0

When ibv_create_ah() fails due to an address resolution failure, it
really only means that we can't reach that one peer -- so we should
just ignore that one peer. If ibv_create_ah() fails for some other
reason, then give up on the entire usnic_X device.

Change the show_help() message that is displayed when ibv_create_ah()
fails due to address resolution failure; indicate that it's likely a
routing problem. Also opal_output_verbose() the same info, since
show_help() is de-duplicated (and this particular show_help() message
can be squelched).

Fixes Cisco bugs CSCup35851 and CSCup35872.

cmr=v1.8.2:ticket=trac:4734

This commit was SVN r32061.

The following Trac tickets were found above:
  Ticket 4734 --> https://svn.open-mpi.org/trac/ompi/ticket/4734

# -*- text -*-
#
# Copyright (c) 2012-2014 Cisco Systems, Inc.  All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for the Open MPI usnic BTL.
#
[ibv API failed]
Open MPI failed a basic verbs operation on a Cisco usNIC interface.
This is highly unusual and shouldn't happen. It suggests that there
may be something wrong with the usNIC or OpenFabrics configuration on
this server.

In addition to any suggestions listed below, you might want to check
the Linux "memlock" limits on your system (they should probably be
"unlimited"). See this FAQ entry for details:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.

    Server:          %s
    usNIC interface: %s (which is %s)
    Failed function: %s (%s:%d)
    Description:     %s
#
[not enough usnic resources]
There are not enough usNIC resources on a VIC for all the MPI
processes on this server.

This means that you have either not provisioned enough usNICs on this
VIC, or there are not enough total receive, transmit, or completion
queues on the provisioned usNICs. On each VIC in a given server, you
need to provision at least as many usNICs as MPI processes on that
server. In each usNIC, you need to provision at least two each of the
following: send queues, receive queues, and completion queues.

Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.

    Server:          %s
    usNIC interface: %s
    Description:     %s
#
[create ibv resource failed]
Open MPI failed to allocate a usNIC-related resource on a VIC. This
usually means one of two things:

1. You are running something other than this MPI job on this server
   that is consuming usNIC resources.
2. You have run out of locked Linux memory. You should probably set
   the Linux "memlock" limits to "unlimited". See this FAQ entry for
   details:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

This Open MPI job will skip this usNIC interface in the usnic BTL,
which may result in either lower performance or the job aborting.

    Server:          %s
    usNIC interface: %s (which is %s)
    Failed function: %s (%s:%d)
    Description:     %s
#
[async event]
Open MPI detected a fatal error on a usNIC interface. Your MPI job
will now abort; sorry.

    Server:           %s
    usNIC interface:  %s (which is %s)
    Async event code: %s (%d)
#
[internal error during init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.

Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.

    Server:          %s
    usNIC interface: %s (which is %s)
    Failure:         %s (%s:%d)
    Description:     %s
#
[internal error after init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.

    Server:  %s
    Message: %s
    File:    %s
    Line:    %d
    Error:   %s
#
[verbs_port_bw failed]
Open MPI failed to query the supported bandwidth of a usNIC interface.
This is unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.

Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.

    Server:          %s
    usNIC interface: %s (which is %s)
#
[check_reg_mem_basics fail]
The usNIC BTL failed to initialize while trying to register some
memory. This typically can indicate that the "memlock" limits are set
too low. For most HPC installations, the memlock limits should be set
to "unlimited". The failure occurred here:

    Local host:    %s
    Memlock limit: %s

You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
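
For example, on most Linux systems you can check the limit for your
current shell with "ulimit -l", and a system administrator can raise
it for all users via /etc/security/limits.conf (a sketch; adjust the
values to your site's policy):

    shell$ ulimit -l

    # Typical /etc/security/limits.conf entries for HPC installations:
    *   soft   memlock   unlimited
    *   hard   memlock   unlimited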
#
[invalid if_inexclude]
WARNING: An invalid value was given for btl_usnic_if_%s. This
value will be ignored.

    Local host: %s
    Value:      %s
    Message:    %s
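
These parameters are typically given as a comma-delimited list of
usNIC device names and/or CIDR-notation networks (the exact forms
accepted depend on your Open MPI version); for example:

    shell$ mpirun --mca btl_usnic_if_include usnic_0,usnic_1 ...
    shell$ mpirun --mca btl_usnic_if_exclude 10.10.0.0/16 ...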
#
[MTU mismatch]
The MTU does not match on local and remote hosts. All interfaces on
all hosts participating in an MPI job must be configured with the same
MTU. The usNIC interface listed below will not be used to communicate
with this remote host.

    Local host:      %s
    usNIC interface: %s (which is %s)
    Local MTU:       %d
    Remote host:     %s
    Remote MTU:      %d
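
One way to check the MTU that is currently configured on an interface
(assuming the iproute2 "ip" utility is available, and "eth4" is the
Ethernet interface backing the usNIC device in question) is:

    shell$ ip link show eth4

Verify that the "mtu" value reported on every host interface and
switch port in the path is the same.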
#
[rtnetlink init fail]
The usnic BTL failed to initialize the rtnetlink query subsystem.

    Server:        %s
    Error message: %s
#
[connectivity error: small ok, large bad]
The Open MPI usNIC BTL was unable to establish full connectivity
between at least one pair of servers in the MPI job. Specifically,
small UDP messages seem to flow between the servers, but large UDP
messages do not.

Your MPI job is going to abort now.

    Source:
      Hostname / IP:   %s (%s)
      Host interfaces: %s / %s
      MAC address:     %s
    Destination:
      Hostname / IP:   %s (%s)
      MAC address:     %s

    Small message size: %u
    Large message size: %u

Note that this behavior usually indicates that the MTU of some network
link is too small between these two servers. You should verify that
UDP traffic with payloads up to the "large message size" listed above
can flow between these two servers.
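
One simple way to test this (assuming the standard Linux iputils ping
and, for example, a 9000-byte MTU path) is to send pings with the "do
not fragment" bit set and a payload just under the expected MTU:

    shell$ ping -M do -s 8972 <remote-server>

If such pings fail while small pings succeed, some link in the path is
configured with a smaller MTU.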
#
[connectivity error: small bad, large ok]
The Open MPI usNIC BTL was unable to establish full connectivity
between at least one pair of servers in the MPI job. Specifically,
large UDP messages seem to flow between the servers, but small UDP
messages do not.

Your MPI job is going to abort now.

    Source:
      Hostname / IP:   %s (%s)
      Host interfaces: %s / %s
      MAC address:     %s
    Destination:
      Hostname / IP:   %s (%s)
      MAC address:     %s

    Small message size: %u
    Large message size: %u

This is a very strange network error, and should not occur in most
situations. You may be experiencing high amounts of congestion, or
this may indicate some kind of network misconfiguration. You should
verify that UDP traffic with payloads up to the "large message size"
listed above can flow between these two servers.
#
[connectivity error: small bad, large bad]
The Open MPI usNIC BTL was unable to establish any connectivity
between at least one pair of servers in the MPI job. Specifically,
no UDP messages seemed to flow between these two servers.

Your MPI job is going to abort now.

    Source:
      Hostname / IP:   %s (%s)
      Host interfaces: %s / %s
      MAC address:     %s
    Destination:
      Hostname / IP:   %s (%s)
      MAC address:     %s

    Small message size: %u
    Large message size: %u

Note that this behavior usually indicates some kind of network
misconfiguration. You should verify that UDP traffic with payloads up
to the "large message size" listed above can flow between these two
servers.
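
One way to perform a quick manual UDP test between the two servers
(assuming a BSD-style netcat is installed; flags vary between netcat
implementations) is:

    server1$ nc -u -l 7777
    server2$ echo hello | nc -u <server1-ip-address> 7777

If no text appears on server1, check firewalls, VLAN configuration,
and IP routing between the two servers.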
#
[ibv_create_ah timeout]
The usnic BTL failed to create addresses for remote peers within the
specified timeout. When using the usNIC/UDP transport, this usually
means that ARP requests failed to resolve in time. You may be able to
solve the problem by increasing the usnic BTL's ARP timeout. If that
doesn't work, you should diagnose why ARP replies are apparently not
being delivered in a timely manner.

The usNIC interface listed below will be ignored. Your MPI
application will likely run with degraded performance and/or abort.

    Server:              %s
    usNIC interface:     %s:%d (%s)
    Current ARP timeout: %d (btl_usnic_arp_timeout MCA param)
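
For example, you can raise the ARP timeout on the mpirun command line
(choose a value larger than the current timeout shown above):

    shell$ mpirun --mca btl_usnic_arp_timeout <value> ...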
#
[transport mismatch]
The underlying transports used by the usNIC driver stack on multiple
servers do not match. This configuration is unsupported and is almost
certainly not what you want.

This error indicates that the VIC firmware, Linux usNIC kernel driver,
and/or Linux usNIC userspace drivers are not compatible between at
least the following two servers:

    Local server:  %s
    Remote server: %s

The usnic MPI transport will be deactivated in at least the one local
MPI process that reported the problem. This may lead to performance
degradation, and may also result in aborting the overall MPI job.

It is usually easiest to have the same VIC firmware, Linux usNIC
kernel driver, and Linux usNIC userspace driver installed on all
servers.
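
One way to compare versions across servers (assuming the usNIC kernel
module is named "usnic_verbs" and your distribution uses RPM packages)
is to run the following on each server and compare the output:

    shell$ modinfo usnic_verbs | grep -i version
    shell$ rpm -qa | grep -i usnic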
#
[create_ah failed]
WARNING: Open MPI failed to find a route to a peer IP address via a
specific usNIC interface. This usually indicates a problem in the IP
routing between these peers.

Open MPI will skip this usNIC interface when communicating with that
peer, which may result in lower performance to that peer. It may also
result in your job aborting if there are no other network paths to
that peer.

Note that this error message defaults to only printing for the *first*
pair of MPI process peers that exhibit this problem; this same problem
may exist for other peer pairs, too.

    Local interface: %s:%s (which is %s and %s)
    Peer:            %s:%s

NOTE: You can set the MCA param btl_usnic_show_route_failures to 0 to
disable this warning.
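
For example, you can check which route the local server would use to
reach the peer (assuming the iproute2 "ip" utility is available), and
silence this warning, with:

    shell$ ip route get <peer-ip-address>
    shell$ mpirun --mca btl_usnic_show_route_failures 0 ...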