c42ab8ea37
Commit from a long-standing Mercurial tree that ended up incorporating a lot of things:

 * A few fixes for CPC interface changes in all the CPCs
 * Attempts (but not yet finished) to fix shutdown problems in the IB CM CPC
 * #1319: add CTS support (i.e., initiator guarantees to send first message; automatically activated for iWARP over the RDMA CM CPC)
 * Some variable and function renamings to make this be generic (e.g., alloc_credit_frag became alloc_control_frag)
 * CPCs no longer post receive buffers; they only post a single receive buffer for the CTS if they use CTS. Instead, the main BTL now posts the main sets of receive buffers.
 * CPCs allocate a CTS buffer only if they're about to make a connection
 * RDMA CM improvements:
   * Use threaded mode openib fd monitoring to wait for RDMA CM events
   * Synchronize endpoint finalization and disconnection between main thread and service thread to avoid/fix some race conditions
   * Converted several structs to be OBJs so that we can use reference counting to know when to invoke destructors
   * Make some new OBJs have opal_list_item_t's as their base, thereby eliminating the need for the local list_item_t type
   * Renamed many variables to be internally consistent
   * Centralize the decision in an inline function as to whether this process or the remote process is supposed to be the initiator
   * Add oodles of OPAL_OUTPUT statements for debugging (hard-wired to output stream -1; to be activated by developers if they want/need them)
   * Use rdma_create_qp() instead of ibv_create_qp()
 * openib fd monitoring improvements:
   * Renamed a bunch of functions and variables to be a little more obvious as to their true function
   * Use pipes to communicate between main thread and service thread
   * Add ability for main thread to invoke a function back on the service thread
   * Ensure that initiator_depth and responder_resources are set properly by putting max_qp_rd_atom and max_qp_init_rd_atom in the modex (see rdma_connect(3))
   * Ensure that the source IP address is set in rdma_resolve() so that we select the correct OpenFabrics source port
   * Make new MCA param: openib_btl_connect_rdmacm_resolve_timeout
 * Other improvements:
   * btl_openib_device_type MCA param: can be "iw" or "ib" or "all" (or "infiniband" or "iwarp")
   * Somewhat improved error handling
   * Bunches of spelling fixes in comments, VERBOSE, and OUTPUT statements
   * Oodles of little coding style fixes
   * Changed shutdown ordering of btl; the device is now an OBJ with ref counting for destruction
   * Added some more show_help error messages
   * Change configury to only build IBCM / RDMACM if we have threads (because we need a progress thread)

This commit was SVN r19686.

The following Trac tickets were found above:
  Ticket 1210 --> https://svn.open-mpi.org/trac/ompi/ticket/1210
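The commit message mentions using pipes so that the main thread can ask the service thread to invoke a function. The following is a minimal sketch of that general technique, not the code from this commit: the names service_cmd_t, service_post(), service_handle_one(), and example_bump() are all hypothetical. It assumes both ends of the pipe live in the same process, so a function pointer can be passed through the pipe as raw bytes.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical command record: a function to call plus its argument. */
typedef void (*service_fn_t)(void *context);

typedef struct {
    service_fn_t fn;
    void *context;
} service_cmd_t;

/* Main-thread side: queue one call for the service thread.
 * Returns 0 on success, -1 on error.  The record is far smaller than
 * PIPE_BUF, so POSIX guarantees the write is atomic and records from
 * different writers never interleave. */
int service_post(int write_fd, service_fn_t fn, void *context)
{
    service_cmd_t cmd = { fn, context };
    ssize_t n = write(write_fd, &cmd, sizeof(cmd));
    return (n == (ssize_t) sizeof(cmd)) ? 0 : -1;
}

/* Service-thread side: block until one command arrives, then run it.
 * Returns 1 if a command ran, 0 on end-of-file, -1 on error. */
int service_handle_one(int read_fd)
{
    service_cmd_t cmd;
    size_t got = 0;

    while (got < sizeof(cmd)) {
        ssize_t n = read(read_fd, (char *) &cmd + got, sizeof(cmd) - got);
        if (n == 0) {
            return 0;               /* all write ends were closed */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal: retry */
            }
            return -1;
        }
        got += (size_t) n;
    }
    cmd.fn(cmd.context);
    return 1;
}

/* Example callback (illustration only): increment an int counter. */
void example_bump(void *context)
{
    ++*(int *) context;
}
```

In a real service thread, service_handle_one() would typically be called only after poll() or select() reports the read end of the pipe as readable, alongside the other monitored file descriptors.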
69 lines
1.7 KiB
Plaintext
# -*- text -*-
#
# Copyright (c) 2008 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for Open MPI's OpenFabrics RDMA CM
# support (the openib BTL).
#
[no valid ip]
It appears that an OpenFabrics device does not have an IP address
associated with it. The OpenFabrics RDMA CM connection scheme
*requires* IP addresses to be set up in order to function properly.

Local host: %s
Local device: %s
#
[could not find matching endpoint]
The OpenFabrics device in an MPI process received an RDMA CM connect
request for a peer that it could not identify as part of this MPI job.
This should not happen. Your process is likely to abort; sorry.

Local host: %s
Local device: %s
Remote address: %s
Remote TCP port: %d
#
[illegal tcp port]
The btl_openib_connect_rdmacm_port MCA parameter was used to specify
an illegal TCP port value. TCP ports must be between 0 and 65535
(ports below 1024 can only be used by root).

TCP port: %d

This value was ignored.
#
[illegal timeout]
The btl_openib_connect_rdmacm_resolve_timeout parameter was used to
specify an illegal timeout value. Timeout values are specified in
milliseconds and must be greater than 0.

Timeout value: %d

This value was ignored.
#
[rdma cm device removal]
The RDMA CM reported that the device Open MPI was trying to use has
been removed.

Local host: %s
Local device: %s

Your MPI job will now abort; sorry.
#
[rdma cm event error]
The RDMA CM returned an event error while attempting to make a
connection. This type of error usually indicates a network
configuration error.

Local host: %s
Local device: %s
Error name: %s
Peer: %s

Your MPI job will now abort; sorry.
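
The [illegal tcp port] and [illegal timeout] messages describe range checks whose failure causes the supplied value to be ignored. A minimal sketch of such checks is shown below. It is not taken from the openib BTL source: the function names are hypothetical, and it assumes that 0 is accepted as a port value and that 65535 is the largest legal port (TCP ports are 16-bit).

```c
#include <stdbool.h>

/* A TCP port is a 16-bit value, so anything outside 0..65535 is illegal.
 * Assumption: 0 is accepted, as the help text's lower bound suggests. */
bool rdmacm_port_is_legal(int port)
{
    return port >= 0 && port <= 65535;
}

/* Timeouts are given in milliseconds and must be strictly positive. */
bool rdmacm_timeout_is_legal(int timeout_ms)
{
    return timeout_ms > 0;
}

/* Return the candidate if it is legal; otherwise keep the fallback,
 * which mirrors the "This value was ignored." line in the help text. */
int rdmacm_port_or_default(int candidate, int fallback)
{
    return rdmacm_port_is_legal(candidate) ? candidate : fallback;
}

int rdmacm_timeout_or_default(int candidate_ms, int fallback_ms)
{
    return rdmacm_timeout_is_legal(candidate_ms) ? candidate_ms : fallback_ms;
}
```

A caller would print the matching help message whenever one of the *_is_legal() checks fails, then continue with the fallback value.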