openmpi/ompi/mca/btl/usnic/help-mpi-btl-usnic.txt
Jeff Squyres 3cbdf33b88 This is what r30852 should have been: Consolidate into a single, outer loop of ibv_create_ah() calls
Follow on to SVN trunk r30850: consolidate the ibv_create_ah() calls
into a single loop, MPI_WAITALL-style.  That is, call the (effectively
non-blocking) ibv_create_ah() for each endpoint.  If we get
NULL+EAGAIN, it means that the UDP ARP is still ongoing down in the
kernel, so just try again later.  We put these all into a single loop
because it allows us to parallelize the ARP progress in the kernel.
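
The retry pattern described above can be sketched roughly as follows. This is a minimal illustration, not the usnic BTL's actual code: `sketch_create_ah()` is a hypothetical stand-in that mimics the NULL+EAGAIN behavior attributed to ibv_create_ah(), and `max_passes` is an assumed substitute for a real time-based timeout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical endpoint record; not an Open MPI type. */
typedef struct {
    int polls_until_resolved;  /* simulated ARP latency, in polls */
    void *ah;                  /* non-NULL once a handle exists */
} sketch_endpoint_t;

static int sketch_dummy_ah;    /* placeholder object to point handles at */

/* Stand-in for ibv_create_ah(): returns NULL with errno == EAGAIN
 * while the simulated ARP lookup is still in progress. */
static void *sketch_create_ah(sketch_endpoint_t *ep)
{
    if (ep->polls_until_resolved > 0) {
        --ep->polls_until_resolved;
        errno = EAGAIN;
        return NULL;
    }
    return &sketch_dummy_ah;
}

/* Sweep over all endpoints repeatedly, WAITALL-style, so every pending
 * lookup makes progress in the same pass. Returns the number of
 * endpoints that hold a handle when the loop stops. */
static size_t sketch_create_all(sketch_endpoint_t *eps, size_t n,
                                int max_passes)
{
    size_t done = 0;

    assert(eps != NULL || n == 0);

    /* Count handles that already exist from an earlier call. */
    for (size_t i = 0; i < n; ++i) {
        if (eps[i].ah != NULL) {
            ++done;
        }
    }

    for (int pass = 0; pass < max_passes && done < n; ++pass) {
        for (size_t i = 0; i < n; ++i) {
            if (eps[i].ah != NULL) {
                continue;               /* already resolved */
            }
            eps[i].ah = sketch_create_ah(&eps[i]);
            if (eps[i].ah != NULL) {
                ++done;
            } else if (errno != EAGAIN) {
                return done;            /* hard failure: stop early */
            }
            /* EAGAIN: leave it pending and retry on the next pass */
        }
    }
    return done;
}
```

Because each pass touches every unresolved endpoint before any of them is retried, slow lookups overlap instead of being waited on one at a time.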

cmr=v1.7.5:ticket=trac:4253

This commit was SVN r30879.

The following SVN revision numbers were found above:
  r30850 --> open-mpi/ompi@3641500442
  r30852 --> open-mpi/ompi@4e282a3295

The following Trac tickets were found above:
  Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
2014-02-27 17:19:50 +00:00

289 lines
9.2 KiB
Plaintext

# -*- text -*-
#
# Copyright (c) 2012-2014 Cisco Systems, Inc. All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for the Open MPI usnic BTL.
#
[ibv API failed]
Open MPI failed a basic verbs operation on a Cisco usNIC device. This
is highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
In addition to any suggestions listed below, you might want to check
the Linux "memlock" limits on your system (they should probably be
"unlimited"). See this FAQ entry for details:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Open MPI will skip this device/port in the usnic BTL, which may result
in either lower performance or your job aborting.
Server: %s
Device: %s
Port: %d
Failed function: %s (%s:%d)
Description: %s
#
[not enough usnic resources]
There are not enough usNIC resources on a VIC for all the MPI
processes on this server.
This means that you have either not provisioned enough usNICs on this
VIC, or there are not enough total receive, transmit, or completion
queues on the provisioned usNICs. On each VIC in a given server, you
need to provision at least as many usNICs as MPI processes on that
server. In each usNIC, you need to provision at least two each of the
following: send queues, receive queues, and completion queues.
Open MPI will skip this device in the usnic BTL, which may result in
either lower performance or your job aborting.
Server: %s
Device: %s
Description: %s
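
The provisioning rule in this message can be expressed as a small check. This is only a sketch: the struct and function names are hypothetical, and the real usnic BTL may count resources differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one provisioned usNIC. */
typedef struct {
    int send_queues;
    int recv_queues;
    int completion_queues;
} sketch_usnic_t;

/* True if a VIC has at least one usNIC per local MPI process and every
 * usNIC has at least two send, receive, and completion queues. */
static bool sketch_vic_has_enough(const sketch_usnic_t *usnics,
                                  int num_usnics, int num_procs)
{
    assert(usnics != NULL || num_usnics == 0);

    if (num_usnics < num_procs) {
        return false;   /* fewer usNICs than local MPI processes */
    }
    for (int i = 0; i < num_usnics; ++i) {
        if (usnics[i].send_queues < 2 ||
            usnics[i].recv_queues < 2 ||
            usnics[i].completion_queues < 2) {
            return false;   /* this usNIC is short on some queue type */
        }
    }
    return true;
}
```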
#
[create ibv resource failed]
Open MPI failed to allocate a usNIC-related resource on a VIC. This
usually means one of two things:
1. You are running something other than this MPI job on this server
that is consuming usNIC resources.
2. You have run out of locked Linux memory. You should probably set
the Linux "memlock" limits to "unlimited". See this FAQ entry for
details:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
This Open MPI job will skip this device/port in the usnic BTL, which
may result in either lower performance or the job aborting.
Server: %s
Device: %s
Failed function: %s (%s:%d)
Description: %s
#
[async event]
Open MPI detected a fatal error on a usNIC port. Your MPI job will
now abort; sorry.
Server: %s
Device: %s
Port: %d
Async event code: %s (%d)
#
[internal error during init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
Open MPI will skip this device/port in the usnic BTL, which may result
in either lower performance or your job aborting.
Server: %s
Device: %s
Port: %d
Failure: %s (%s:%d)
Description: %s
#
[internal error after init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
Server: %s
Message: %s
File: %s
Line: %d
Error: %s
#
[ibv API failed after init]
Open MPI failed a basic verbs operation on a Cisco usNIC device. This
is highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
Your MPI job may behave erratically, hang, and/or abort.
Server: %s
Failure: %s (%s:%d)
Description: %s
#
[verbs_port_bw failed]
Open MPI failed to query the supported bandwidth of a port on a Cisco
usNIC device. This is unusual and shouldn't happen. It suggests that
there may be something wrong with the usNIC or OpenFabrics
configuration on this server.
Open MPI will skip this device/port in the usnic BTL, which may result
in either lower performance or your job aborting.
Server: %s
Device: %s
Port: %d
#
[eager_limit too high]
The eager_limit in the usnic BTL is too high for a device that Open
MPI tried to use. The usnic BTL eager_limit value is the largest
message payload that Open MPI will send in a single datagram.
You are seeing this message because the eager_limit was set to a value
larger than the MPI message payload capacity of a single UD datagram.
The max payload size is smaller than the size of individual datagrams
because each datagram also contains MPI control metadata, meaning that
some bytes in the datagram must be reserved for overhead.
Open MPI will skip this device/port in the usnic BTL, which may result
in either lower performance or your job aborting.
Server: %s
Device: %s
Port: %d
Max payload allowed: %d
Specified eager_limit: %d
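
The relationship between datagram size, overhead, and eager_limit can be sketched as below. The function names are hypothetical, and the overhead is passed in as a parameter because the actual number of reserved bytes is not stated in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest MPI payload that fits in one datagram once overhead_bytes
 * have been set aside for control metadata. */
static size_t sketch_max_payload(size_t datagram_bytes,
                                 size_t overhead_bytes)
{
    assert(overhead_bytes <= datagram_bytes);
    return datagram_bytes - overhead_bytes;
}

/* True if the requested eager_limit fits in a single datagram. */
static bool sketch_eager_limit_ok(size_t eager_limit,
                                  size_t datagram_bytes,
                                  size_t overhead_bytes)
{
    return eager_limit <= sketch_max_payload(datagram_bytes,
                                             overhead_bytes);
}
```

For example, with an illustrative 9000-byte datagram and 64 bytes of assumed overhead, the largest acceptable eager_limit would be 8936.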
#
[check_reg_mem_basics fail]
The usNIC BTL failed to initialize while trying to register some
memory. This typically indicates that the "memlock" limits are set
too low. For most HPC installations, the memlock limits should be set
to "unlimited". The failure occurred here:
Local host: %s
Memlock limit: %s
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
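
To see what limit a process is actually running with, the soft "memlock" limit can be read with the standard getrlimit() call. The sketch below is an illustration only; the helper names are hypothetical and this is not necessarily how the usnic BTL produces the "Memlock limit" field.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Render a limit value as text, using "unlimited" for RLIM_INFINITY. */
static void sketch_format_limit(rlim_t value, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (value == RLIM_INFINITY) {
        snprintf(buf, len, "unlimited");
    } else {
        snprintf(buf, len, "%llu", (unsigned long long) value);
    }
}

/* Query the soft RLIMIT_MEMLOCK limit of the calling process.
 * Returns 0 on success, -1 if getrlimit() fails. */
static int sketch_memlock_limit(char *buf, size_t len)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        return -1;
    }
    sketch_format_limit(rl.rlim_cur, buf, len);
    return 0;
}
```

The limit must be checked in the environment the MPI processes really run in (for example, under the job launcher), since it can differ from an interactive login shell.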
#
[invalid if_inexclude]
WARNING: An invalid value was given for btl_usnic_if_%s. This
value will be ignored.
Local host: %s
Value: %s
Message: %s
#
[MTU mismatch]
The MTU does not match on local and remote hosts. All interfaces on all
hosts participating in an MPI job must be configured with the same MTU.
The device and port listed below will not be used to communicate with this
remote host.
Local host: %s
Device/port: %s/%d
Local MTU: %d
Remote host: %s
Remote MTU: %d
#
[bad value for btl_usnic_vendor_part_ids]
A non-numeric value was specified for the btl_usnic_vendor_part_ids
MCA parameter. This parameter is supposed to be a comma-delimited
list of decimal verbs vendor part IDs. This usnic BTL will be ignored
for this job.
btl_usnic_vendor_part_ids value: %s
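
A comma-delimited list of decimal IDs, as this message describes, could be validated along these lines. The function is a hypothetical sketch and not the parser Open MPI uses; in particular, treating an empty string as an empty list and tolerating a trailing comma are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a string such as "207,1234" into out[].
 * Returns the number of IDs parsed, or -1 if any element is not a
 * plain decimal number, overflows, or the list exceeds max_out. */
static int sketch_parse_part_ids(const char *str, uint32_t *out,
                                 int max_out)
{
    int count = 0;
    const char *p = str;

    assert(str != NULL && out != NULL);

    while (*p != '\0') {
        char *end;
        unsigned long v;

        /* Reject signs, whitespace, and empty elements up front. */
        if (!isdigit((unsigned char) *p)) {
            return -1;
        }
        errno = 0;
        v = strtoul(p, &end, 10);
        if (errno != 0 || v > UINT32_MAX) {
            return -1;      /* out of range */
        }
        if (*end != ',' && *end != '\0') {
            return -1;      /* trailing non-numeric characters */
        }
        if (count == max_out) {
            return -1;      /* more IDs than the caller can hold */
        }
        out[count++] = (uint32_t) v;
        p = (*end == ',') ? end + 1 : end;
    }
    return count;
}
```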
#
[rtnetlink init fail]
The usnic BTL failed to initialize the rtnetlink query subsystem.
Server: %s
Error message: %s
#
[connectivity error: small ok, large bad]
The Open MPI usNIC BTL was unable to establish full connectivity
between at least one pair of servers in the MPI job. Specifically,
small UDP messages seem to flow between the servers, but large UDP
messages do not.
Your MPI job is going to abort now.
Source:
Hostname / IP: %s (%s)
Host interfaces: %s / %s
MAC address: %s
Destination:
Hostname / IP: %s (%s)
MAC address: %s
Small message size: %u
Large message size: %u
Note that this behavior usually indicates that the MTU of some network
link is too small between these two servers. You should verify that
UDP traffic with payloads up to the "large message size" listed above
can flow between these two servers.
#
[connectivity error: small bad, large ok]
The Open MPI usNIC BTL was unable to establish full connectivity
between at least one pair of servers in the MPI job. Specifically,
large UDP messages seem to flow between the servers, but small UDP
messages do not.
Your MPI job is going to abort now.
Source:
Hostname / IP: %s (%s)
Host interfaces: %s / %s
MAC address: %s
Destination:
Hostname / IP: %s (%s)
MAC address: %s
Small message size: %u
Large message size: %u
This is a very strange network error, and should not occur in most
situations. You may be experiencing high amounts of congestion, or
this may indicate some kind of network misconfiguration. You should
verify that UDP traffic with payloads up to the "large message size"
listed above can flow between these two servers.
#
[connectivity error: small bad, large bad]
The Open MPI usNIC BTL was unable to establish any connectivity
between at least one pair of servers in the MPI job. Specifically,
no UDP messages seemed to flow between these two servers.
Your MPI job is going to abort now.
Source:
Hostname / IP: %s (%s)
Host interfaces: %s / %s
MAC address: %s
Destination:
Hostname / IP: %s (%s)
MAC address: %s
Small message size: %u
Large message size: %u
Note that this behavior usually indicates some kind of network
misconfiguration. You should verify that UDP traffic with payloads up
to the "large message size" listed above can flow between these two
servers.
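
The three connectivity messages above correspond to the possible failing outcomes of a small-message and a large-message probe. A minimal sketch of that mapping follows; the function name is hypothetical, while the returned strings are the topic names used in this file.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Map the results of the two probes to a help topic name.
 * Returns NULL when both probes succeeded and no error applies. */
static const char *sketch_connectivity_topic(bool small_ok, bool large_ok)
{
    if (small_ok && large_ok) {
        return NULL;
    }
    if (small_ok) {
        return "connectivity error: small ok, large bad";
    }
    if (large_ok) {
        return "connectivity error: small bad, large ok";
    }
    return "connectivity error: small bad, large bad";
}
```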
#
[ibv_create_ah timeout]
The usnic BTL failed to create addresses for remote peers within the
specified timeout. When using the usNIC/UDP transport, this usually
means that ARP requests failed to resolve in time. You may be able to
solve the problem by increasing the usnic BTL's ARP timeout. If that
doesn't work, you should diagnose why ARP replies are apparently not
being delivered in a timely manner.
The usNIC interface listed below will be ignored. Your MPI
application will likely run with degraded performance and/or
abort.
Server: %s
Device: %s:%d (%s)
Current ARP timeout: %d (btl_usnic_arp_timeout MCA param)