# -*- text -*-
#
# Copyright (c) 2012-2014 Cisco Systems, Inc. All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for the Open MPI usnic BTL.
#
[not enough usnic resources]
There are not enough usNIC resources on a VIC for all the MPI
processes on this server.

This means that you have either not provisioned enough usNICs on this
VIC, or there are not enough total receive, transmit, or completion
queues on the provisioned usNICs. On each VIC in a given server, you
need to provision at least as many usNICs as MPI processes on that
server. In each usNIC, you need to provision at least two each of the
following: send queues, receive queues, and completion queues.

Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.

  Server:          %s
  usNIC interface: %s
  Description:     %s
#
[libfabric API failed]
Open MPI failed a basic API operation on a Cisco usNIC interface.
This is highly unusual and shouldn't happen. It suggests that there
may be something wrong with the usNIC configuration on this server.

In addition to any suggestions listed below, you might want to check
the Linux "memlock" limits on your system (they should probably be
"unlimited"). See this FAQ entry for details:

  http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
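
For example, assuming a bash-like shell, you can check the current
memlock limit with:

  ulimit -l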

Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.

  Server:          %s
  usNIC interface: %s
  Failed function: %s (%s:%d)
  Return status:   %d
  Description:     %s
#
[async event]
Open MPI detected a fatal error on a usNIC interface. Your MPI job
will now abort; sorry.

  Server:           %s
  usNIC interface:  %s
  Async event code: %s (%d)
#
[internal error during init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.

Open MPI will skip this usNIC interface in the usnic BTL, which may
result in either lower performance or your job aborting.

  Server:          %s
  usNIC interface: %s
  Failure at:      %s (%s:%d)
  Error:           %d (%s)
#
[internal error after init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.

  Server:  %s
  Message: %s
  File:    %s
  Line:    %d
  Error:   %s
#
[check_reg_mem_basics fail]
The usNIC BTL failed to initialize while trying to register some
memory. This typically indicates that the "memlock" limits are set
too low. For most HPC installations, the memlock limits should be set
to "unlimited". The failure occurred here:

  Local host:    %s
  Memlock limit: %s

You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:

  http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
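
For example, on many Linux systems the memlock limits can be raised
by adding lines like the following to /etc/security/limits.conf (a
sketch only; the exact file and syntax may vary by distribution):

  *  soft  memlock  unlimited
  *  hard  memlock  unlimited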
#
[invalid if_inexclude]
WARNING: An invalid value was given for btl_usnic_if_%s. This
value will be ignored.

  Local host: %s
  Value:      %s
  Message:    %s
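
Note that these parameters typically take a comma-delimited list of
interface names and/or CIDR network specifications; for example (the
value shown is only illustrative):

  mpirun --mca btl_usnic_if_include 10.10.0.0/16 ...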
#
[device present but not up]
Open MPI has found a usNIC device that is present / listed in Linux,
but in a "down" state. It will not be used by this MPI job.

You may wish to check this device, especially if it is unexpectedly
down.

  Local server: %s
  Device name:  %s
#
[MTU mismatch]
The MTU does not match on local and remote hosts. All interfaces on
all hosts participating in an MPI job must be configured with the same
MTU. The usNIC interface listed below will not be used to communicate
with this remote host.

  Local host:      %s
  usNIC interface: %s
  Local MTU:       %d
  Remote host:     %s
  Remote MTU:      %d
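
One way to check an interface's configured MTU on Linux (assuming the
iproute2 "ip" tool is available; the interface name below is only a
placeholder) is:

  ip link show <interface>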
#
[connectivity error: small ok, large bad]
The Open MPI usNIC BTL was unable to establish full connectivity
between at least one pair of interfaces on servers in the MPI job.
Specifically, small UDP messages seem to flow between the interfaces,
but large UDP messages do not.

Your MPI job is going to abort now.

  Source:
    Hostname / IP:  %s (%s)
    Host interface: %s
  Destination:
    Hostname / IP:  %s (%s)

  Small message size: %u
  Large message size: %u

Note that this behavior usually indicates that the MTU of some network
link between these two interfaces is too small.

You should verify that UDP traffic with payloads up to the "large
message size" listed above can flow between these two interfaces. You
should also verify that Open MPI is choosing to pair IP interfaces
consistently. For example:

  mpirun --mca btl_usnic_connectivity_map mymap ...

Check the resulting "mymap*" files to see the exact pairing of IP
interfaces. Inconsistent results may be indicative of underlying
network misconfigurations.
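
Although "ping" uses ICMP rather than UDP, a quick proxy check for
MTU problems (a sketch assuming a Linux iputils "ping" and a
9000-byte IP MTU; adjust the payload size for your network) is to
send large packets with fragmentation disabled:

  ping -M do -s 8972 <destination_ip>

If large pings fail while small pings succeed, an intermediate link
likely has a smaller MTU than expected.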
#
[connectivity error: small bad, large ok]
The Open MPI usNIC BTL was unable to establish full connectivity
between at least one pair of interfaces on servers in the MPI job.
Specifically, large UDP messages seem to flow between the interfaces,
but small UDP messages do not.

Your MPI job is going to abort now.

  Source:
    Hostname / IP:  %s (%s)
    Host interface: %s
  Destination:
    Hostname / IP:  %s (%s)

  Small message size: %u
  Large message size: %u

This is a very strange network error, and should not occur in most
situations. You may be experiencing high amounts of congestion, or
this may indicate some kind of network misconfiguration.

You should verify that UDP traffic with payloads up to the "large
message size" listed above can flow between these two interfaces. You
should also verify that Open MPI is choosing to pair IP interfaces
consistently. For example:

  mpirun --mca btl_usnic_connectivity_map mymap ...

Check the resulting "mymap*" files to see the exact pairing of IP
interfaces. Inconsistent results may be indicative of underlying
network misconfigurations.
#
[connectivity error: small bad, large bad]
The Open MPI usNIC BTL was unable to establish any connectivity
between at least one pair of interfaces on servers in the MPI job.
This can happen for several reasons, including:

1. No UDP traffic is able to flow between the interfaces listed below.
2. There is asymmetric routing between the interfaces listed below,
   leading Open MPI to discard UDP traffic it thinks is from an
   unexpected source.

Your MPI job is going to abort now.

  Source:
    Hostname / IP:  %s (%s)
    Host interface: %s
  Destination:
    Hostname / IP:  %s (%s)

  Small message size: %u
  Large message size: %u

Note that this behavior usually indicates some kind of network
misconfiguration.

You should verify that UDP traffic with payloads up to the "large
message size" listed above can flow between these two interfaces. You
should also verify that Open MPI is choosing to pair IP interfaces
consistently. For example:

  mpirun --mca btl_usnic_connectivity_map mymap ...

Check the resulting "mymap*" files to see the exact pairing of IP
interfaces. Inconsistent results may be indicative of underlying
network misconfigurations.
#
[fi_av_insert timeout]
The usnic BTL failed to create addresses for remote peers within the
specified timeout. This usually means that ARP requests failed to
resolve in time. You may be able to solve the problem by increasing
the usnic BTL's ARP timeout. If that doesn't work, you should
diagnose why ARP replies are apparently not being delivered in a
timely manner.

The usNIC interface listed below will be ignored. Your MPI
application will likely run with degraded performance or abort.

  Server:              %s
  usNIC interface:     %s
  Current ARP timeout: %d (btl_usnic_arp_timeout MCA param)
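
For example, you can raise the timeout on the mpirun command line
(the value shown here is only illustrative):

  mpirun --mca btl_usnic_arp_timeout 30 ...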
#
[unreachable peer IP]
WARNING: Open MPI failed to find a route to a peer IP address via a
specific usNIC interface. This usually indicates a problem in the IP
routing between these peers.

Open MPI will skip this usNIC interface when communicating with that
peer, which may result in lower performance to that peer. It may also
result in your job aborting if there are no other network paths to
that peer.

Note that, by default, this error message is only printed for the
*first* pair of MPI process peers that exhibits this problem; the
same problem may exist for other peer pairs, too.

  Local interface: %s:%s (which is %s)
  Peer:            %s:%s

NOTE: You can set the MCA param btl_usnic_show_route_failures to 0 to
disable this warning.
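
For example:

  mpirun --mca btl_usnic_show_route_failures 0 ...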
#
[cannot write to map file]
WARNING: The usnic BTL was unable to open the requested output
connectivity map file. Your job will continue, but the map file will
not be written.

  Local server:      %s
  Output map file:   %s
  Working directory: %s
  Error:             %s (%d)
#
[received too many short packets]
WARNING: The usnic BTL received a significant number of abnormally
short packets on a single network interface. This may be due to
corruption or congestion in the network fabric. It may be useful to
run a physical/layer 0 diagnostic.

Your job will continue, but if this poor network behavior continues,
you may experience lower-than-expected performance due to overheads
caused by higher-than-usual retransmission rates (to compensate for
the corrupted received packets).

  Local server:       %s
  usNIC interface:    %s
  # of short packets
  received so far:    %d

You will only receive this warning once per MPI process per job.

If you know that your network environment is lossy/heavily congested
such that short/corrupted packets are expected, you can disable this
warning by setting the btl_usnic_max_short_packets MCA parameter to 0.
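
For example:

  mpirun --mca btl_usnic_max_short_packets 0 ...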
#
[non-receive completion error]
WARNING: The usnic BTL has detected an error in the completion of a
non-receive event. This is highly unusual, and may indicate an error
in the usNIC subsystem on this server.

Your MPI job will continue, but you should monitor the job and ensure
that it behaves correctly.

  Local server:      %s
  usNIC interface:   %s
  Channel index:     %d
  Completion status: %s (%d)
  Work request ID:   %p
  Opcode:            %s (%d)

If this error keeps happening, you should contact Cisco technical
support.
#
[transport mismatch]
Open MPI has found two servers with different underlying usNIC
transports. This is an unsupported configuration; all usNIC devices
must have the same underlying transport in order to use the usNIC BTL.

  Local server / transport:  %s / %s
  Remote server / transport: %s / %s