openmpi/opal/mca/btl/usnic/help-mpi-btl-usnic.txt
Commit dc18c32437 (Jeff Squyres): usnic: fix resource check
The math for checking the number of QPs and CQs per usNIC/VF was
incorrect, allowing you to run MPI processes even when usNICs (i.e.,
VIC VFs) had fewer QPs and CQs than were necessary.  This led to a
confusing error later when fi_enable(3) failed (because we lazily
create QPs).  Fixing the math here ensures that we actually print a
helpful error message telling the user specifically what is wrong.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-04-22 15:58:27 -07:00
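
The corrected logic amounts to comparing the per-VF queue limits
against what a single MPI process actually needs.  The following is a
minimal sketch of that idea using hypothetical names; the real check
in the usnic BTL is more involved:

    #include <stdbool.h>

    /* Illustrative only -- not the actual Open MPI code.  Reject a
     * usNIC (VIC VF) if it cannot supply the queue pairs (QPs) and
     * completion queues (CQs) that one MPI process requires; the old
     * math compared the wrong quantities, so undersized VFs passed
     * the check and fi_enable(3) failed much later. */
    static bool vf_has_enough_resources(int qps_per_vf, int cqs_per_vf,
                                        int qps_needed, int cqs_needed)
    {
        return qps_per_vf >= qps_needed && cqs_per_vf >= cqs_needed;
    }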


# -*- text -*-
#
# Copyright (c) 2012-2016 Cisco Systems, Inc. All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for the Open MPI usnic BTL.
#
[not enough usnic resources]
There are not enough usNIC resources on a VIC for all the MPI
processes on this server.
This means that you have either not provisioned enough usNICs on this
VIC, or there are not enough total receive, transmit, or completion
queues on the provisioned usNICs. On each VIC in a given server, you
need to provision at least as many usNICs as MPI processes on that
server. In each usNIC, you need to provision enough of each of the
following: send queues, receive queues, and completion queues.
Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.
Server: %s
usNIC interface: %s
Description: %s
#
[libfabric API failed]
Open MPI failed a basic API operation on a Cisco usNIC interface.
This is highly unusual and shouldn't happen. It suggests that there
may be something wrong with the usNIC configuration on this server.
In addition to any suggestions listed below, you might want to check
the Linux "memlock" limits on your system (they should probably be
"unlimited"). See this FAQ entry for details:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.
Server: %s
usNIC interface: %s
Failed function: %s (%s:%d)
Return status: %d
Description: %s
#
[async event]
Open MPI detected a fatal error on a usNIC interface. Your MPI job
will now abort; sorry.
Server: %s
usNIC interface: %s
Async event code: %s (%d)
#
[internal error during init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.
Server: %s
usNIC interface: %s
Failure at: %s (%s:%d)
Error: %d (%s)
#
[internal error after init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
Server: %s
File: %s
Line: %d
Message: %s
#
[check_reg_mem_basics fail]
The usNIC BTL failed to initialize while trying to register some
memory. This typically indicates that the "memlock" limits are set
too low. For most HPC installations, the memlock limits should be set
to "unlimited". The failure occurred here:
Local host: %s
Memlock limit: %s
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
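If it is unclear what limit a process actually inherits (for example,
under a resource manager), a small standalone program can query it
with getrlimit(2); this is only an illustration and is not part of
Open MPI.  RLIM_INFINITY corresponds to the "unlimited" setting
recommended above:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* RLIMIT_MEMLOCK is the per-process locked-memory
         * ("memlock") limit */
        if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        if (rl.rlim_cur == RLIM_INFINITY) {
            printf("memlock limit: unlimited\n");
        } else {
            printf("memlock limit: %llu bytes\n",
                   (unsigned long long) rl.rlim_cur);
        }
        return 0;
    }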
#
[invalid if_inexclude]
WARNING: An invalid value was given for btl_usnic_if_%s. This
value will be ignored.
Local host: %s
Value: %s
Message: %s
#
[device present but not up]
Open MPI has found a usNIC device that is present / listed in Linux,
but in a "down" state. It will not be used by this MPI job.
You may wish to check this device, especially if it is unexpectedly
down.
Local server: %s
Device name: %s
#
[MTU mismatch]
The MTU does not match on local and remote hosts. All interfaces on
all hosts participating in an MPI job must be configured with the same
MTU. The usNIC interface listed below will not be used to communicate
with this remote host.
Local host: %s
usNIC interface: %s
Local MTU: %d
Remote host: %s
Remote MTU: %d
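For example, on Linux the configured MTU of an interface can be
confirmed with a command along these lines (the interface name is
illustrative):

    ip link show dev eth0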
#
[connectivity error: small ok, large bad]
The Open MPI usNIC BTL was unable to establish full connectivity
between at least one pair of interfaces on servers in the MPI job.
Specifically, small UDP messages seem to flow between the interfaces,
but large UDP messages do not.
Your MPI job is going to abort now.
Source:
  Hostname / IP:  %s (%s)
  Host interface: %s
Destination:
  Hostname / IP:  %s (%s)
Small message size: %u
Large message size: %u
Note that this behavior usually indicates that the MTU of some network
link is too small between these two interfaces.
You should verify that UDP traffic with payloads up to the "large
message size" listed above can flow between these two interfaces. You
should also verify that Open MPI is choosing to pair IP interfaces
consistently. For example:
mpirun --mca btl_usnic_connectivity_map mymap ...
Check the resulting "mymap*" files to see the exact pairing of IP
interfaces. Inconsistent results may be indicative of underlying
network misconfigurations.
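One rough way to probe the path MTU between the two hosts (using ICMP
rather than UDP, so treat the result only as a hint) is to send
non-fragmentable pings from the source host; the size and destination
below are illustrative, with 8972 corresponding to a 9000-byte MTU
minus 28 bytes of IP/ICMP headers:

    ping -M do -s 8972 -c 3 <destination IP>

Pings near the expected MTU will fail on any link along the path
whose MTU is smaller.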
#
[connectivity error: small bad, large ok]
The Open MPI usNIC BTL was unable to establish full connectivity
between at least one pair of interfaces on servers in the MPI job.
Specifically, large UDP messages seem to flow between the interfaces,
but small UDP messages do not.
Your MPI job is going to abort now.
Source:
  Hostname / IP:  %s (%s)
  Host interface: %s
Destination:
  Hostname / IP:  %s (%s)
Small message size: %u
Large message size: %u
This is a very strange network error, and should not occur in most
situations. You may be experiencing high amounts of congestion, or
this may indicate some kind of network misconfiguration.
You should verify that UDP traffic with payloads up to the "large
message size" listed above can flow between these two interfaces. You
should also verify that Open MPI is choosing to pair IP interfaces
consistently. For example:
mpirun --mca btl_usnic_connectivity_map mymap ...
Check the resulting "mymap*" files to see the exact pairing of IP
interfaces. Inconsistent results may be indicative of underlying
network misconfigurations.
#
[connectivity error: small bad, large bad]
The Open MPI usNIC BTL was unable to establish any connectivity
between at least one pair of interfaces on servers in the MPI job.
This can happen for several reasons, including:
1. No UDP traffic is able to flow between the interfaces listed below.
2. There is asymmetric routing between the interfaces listed below,
leading Open MPI to discard UDP traffic it thinks is from an
unexpected source.
Your MPI job is going to abort now.
Source:
  Hostname / IP:  %s (%s)
  Host interface: %s
Destination:
  Hostname / IP:  %s (%s)
Small message size: %u
Large message size: %u
Note that this behavior usually indicates some kind of network
misconfiguration.
You should verify that UDP traffic with payloads up to the "large
message size" listed above can flow between these two interfaces. You
should also verify that Open MPI is choosing to pair IP interfaces
consistently. For example:
mpirun --mca btl_usnic_connectivity_map mymap ...
Check the resulting "mymap*" files to see the exact pairing of IP
interfaces. Inconsistent results may be indicative of underlying
network misconfigurations.
#
[fi_av_insert timeout]
The usnic BTL failed to create addresses for remote peers within the
specified timeout. This usually means that ARP requests failed to
resolve in time. You may be able to solve the problem by increasing
the usnic BTL's ARP timeout. If that doesn't work, you should
diagnose why ARP replies are apparently not being delivered in a
timely manner.
The usNIC interface listed below will be ignored. Your MPI
application will likely either run with degraded performance or
abort.
Server: %s
usNIC interface: %s
Current ARP timeout: %d (btl_usnic_arp_timeout MCA param)
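For example, to raise the timeout on the mpirun command line (the
value shown is purely illustrative):

    mpirun --mca btl_usnic_arp_timeout 30 ...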
#
[fi_av_eq too small]
The usnic BTL was told to create an address resolution queue that was
too small via the mca_btl_usnic_av_eq_num MCA parameter. This
parameter controls how many peer address resolutions can be
outstanding at a time. Larger values allow more concurrent address
resolutions, but consume more memory.
Server: %s
av_eq_num param value: %d
av_eq_num minimum value: %d
Your job will likely either perform poorly or abort.
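If a larger queue is needed, the parameter can be raised on the
mpirun command line; the value below is purely illustrative, and the
name assumes the same btl_usnic_ prefix form used by the other
parameters in this file:

    mpirun --mca btl_usnic_av_eq_num 4096 ...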
#
[unreachable peer IP]
WARNING: Open MPI failed to find a route to a peer IP address via a
specific usNIC interface. This usually indicates a problem in the IP
routing between these peers.
Open MPI will skip this usNIC interface when communicating with that
peer, which may result in lower performance to that peer. It may also
result in your job aborting if there are no other network paths to
that peer.
Note that this error message defaults to only printing for the *first*
pair of MPI process peers that exhibit this problem; this same problem
may exist for other peer pairs, too.
Local interface: %s:%s (which is %s)
Peer: %s:%s
NOTE: You can set the MCA param btl_usnic_show_route_failures to 0 to
disable this warning.
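For example:

    mpirun --mca btl_usnic_show_route_failures 0 ...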
#
[cannot write to map file]
WARNING: The usnic BTL was unable to open the requested output
connectivity map file. Your job will continue, but this output
connectivity map file will not be written.
Local server: %s
Output map file: %s
Working directory: %s
Error: %s (%d)
#
[received too many short packets]
WARNING: The usnic BTL received a significant number of abnormally
short packets on a single network interface. This may be due to
corruption or congestion in the network fabric. It may be useful to
run a physical/layer 0 diagnostic.
Your job will continue, but if this poor network behavior continues,
you may experience lower-than-expected performance due to overheads
caused by higher-than-usual retransmission rates (to compensate for
the corrupted received packets).
Local server: %s
usNIC interface: %s
Short packets received so far: %d
You will only receive this warning once per MPI process per job.
If you know that your network environment is lossy/heavily congested
such that short/corrupted packets are expected, you can disable this
warning by setting the btl_usnic_max_short_packets MCA parameter to 0.
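For example:

    mpirun --mca btl_usnic_max_short_packets 0 ...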
#
[non-receive completion error]
WARNING: The usnic BTL has detected an error in the completion of a
non-receive event. This is highly unusual, and may indicate an error
in the usNIC subsystem on this server.
Your MPI job will continue, but you should monitor the job and ensure
that it behaves correctly.
Local server: %s
usNIC interface: %s
Channel index: %d
Completion status: %s (%d)
Work request ID: %p
Opcode: %s (%d)
If this error keeps happening, you should contact Cisco technical
support.
#
[transport mismatch]
Open MPI has found two servers with different underlying usNIC
transports. This is an unsupported configuration; all usNIC devices
must have the same underlying transport in order to use the usNIC BTL.
Local server / transport: %s / %s
Remote server / transport: %s / %s