Merge pull request #6270 from jsquyres/pr/remove-openib-and-affiliated-stuff

So long, openib, and thanks for all the fish.
Jeff Squyres 2019-02-07 09:29:31 -05:00 committed by GitHub
parents ead2efb136 99553eb1b9
commit f53a4f2d5b
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
99 changed files: 5 additions and 25722 deletions

README

@ -623,7 +623,6 @@ MPI Functionality and Features
- portals4
(2) The ob1 PML and the following BTLs support MPI_THREAD_MULTIPLE:
- openib (see exception below)
- self
- sm
- smcuda
@ -632,10 +631,6 @@ MPI Functionality and Features
- usnic
- vader (shared memory)
The openib BTL's RDMACM based connection setup mechanism is also not
thread safe. The default UDCM method should be used for
applications requiring MPI_THREAD_MULTIPLE support.
Currently, MPI File operations are not thread safe even if MPI is
initialized for MPI_THREAD_MULTIPLE support.
@ -794,7 +789,8 @@ Network Support
- In prior versions of Open MPI, InfiniBand and RoCE support was
provided through the openib BTL and ob1 PML plugins. Starting with
Open MPI 4.0.0, InfiniBand support through the openib plugin is both
-  deprecated and superseded by the ucx PML component.
+  deprecated and superseded by the ucx PML component. The openib BTL
+  was removed in Open MPI v5.0.0.
While the openib BTL depended on libibverbs, the UCX PML depends on
the UCX library.
@ -809,15 +805,6 @@ Network Support
for OpenSHMEM support, and "--mca osc ucx" for MPI RMA (one-sided)
operations.
- Although the ob1 PML+openib BTL is still the default for iWARP and
RoCE devices, it will reject InfiniBand defaults (by default) so
that they will use the ucx PML. If using the openib BTL is still
desired, set the following MCA parameters:
# Note that "vader" is Open MPI's shared memory BTL
$ mpirun --mca pml ob1 --mca btl openib,vader,self \
--mca btl_openib_allow_ib 1 ...
- The usnic BTL is support for Cisco's usNIC device ("userspace NIC")
on Cisco UCS servers with the Virtualized Interface Card (VIC).
Although the usNIC is accessed via the OpenFabrics Libfabric API
@ -850,8 +837,8 @@ Network Support
http://lwn.net/Articles/343351/
-- The use of fork() with OpenFabrics-based networks (i.e., the openib
-  BTL) is only partially supported, and only on Linux kernels >=
+- The use of fork() with OpenFabrics-based networks (i.e., the UCX
+  PML) is only partially supported, and only on Linux kernels >=
v2.6.15 with libibverbs v1.1 or later (first released as part of
OFED v1.2), per restrictions imposed by the OFED network stack.
@ -1206,51 +1193,6 @@ NETWORKING SUPPORT / OPTIONS
--with-usnic
Abort configure if Cisco usNIC support cannot be built.
--with-verbs=<directory>
Specify the directory where the verbs (also known as OpenFabrics
verbs, or Linux verbs, and previously known as OpenIB) libraries and
header files are located. This option is generally only necessary
if the verbs headers and libraries are not in default
compiler/linker search paths.
The Verbs library usually implies operating system bypass networks,
such as InfiniBand, usNIC, iWARP, and RoCE (aka "IBoIP").
--with-verbs-libdir=<directory>
Look in directory for the verbs libraries. By default, Open MPI
will look in <verbs_directory>/lib and <verbs_ directory>/lib64,
which covers most cases. This option is only needed for special
configurations.
--with-verbs-usnic
Note that this option is no longer necessary in recent Linux distro
versions. If your Linux distro uses the "rdma-core" package (instead
of a standalone "libibverbs" package), not only do you not need this
option, you shouldn't use it, either. More below.
This option will activate support in Open MPI for disabling a
dire-sounding warning message from libibverbs that Cisco usNIC
devices are not supported (because Cisco usNIC devices are supported
through libfabric, not libibverbs). This libibverbs warning can
also be suppressed by installing the "no op" libusnic_verbs plugin
for libibverbs (see https://github.com/cisco/libusnic_verbs, or
download binaries from cisco.com).
This option is disabled by default for two reasons:
1. It causes libopen-pal.so to depend on libibverbs.so, which is
undesirable to many downstream packagers.
2. As mentioned above, recent versions of the libibverbs library
(included in the "rdma-core" package) do not have the bug that
will emit dire-sounding warnings about usnic devices. Indeed,
the --with-verbs-usnic option will enable code in Open MPI that
is actually incompatible with rdma-core (i.e., cause Open MPI to
fail to compile).
If you enable --with-verbs-usnic and your system uses the rdma-core
package, configure will safely abort with a helpful message telling
you that you should not use --with-verbs-usnic.
RUN-TIME SYSTEM SUPPORT


@ -114,4 +114,3 @@ libmca_opal_common_ofi_so_version=0:0:0
libmca_opal_common_sm_so_version=0:0:0
libmca_opal_common_ucx_so_version=0:0:0
libmca_opal_common_ugni_so_version=0:0:0
libmca_opal_common_verbs_so_version=0:0:0


@ -1,485 +0,0 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2006-2016 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2006-2017 Los Alamos National Security, LLC. All rights
# reserved.
# Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
# Copyright (c) 2010-2012 Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
# Copyright (c) 2014 Bull SAS. All rights reserved.
# Copyright (c) 2014-2016 Research Organization for Information Science
# and Technology (RIST). All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# OPAL_CHECK_OPENFABRICS(prefix, [action-if-found], [action-if-not-found])
# --------------------------------------------------------
# check if OPENIB support can be found. sets prefix_{CPPFLAGS,
# LDFLAGS, LIBS} as needed and runs action-if-found if there is
# support, otherwise executes action-if-not-found
AC_DEFUN([OPAL_CHECK_OPENFABRICS],[
OPAL_VAR_SCOPE_PUSH([$1_msg])
# Setup the --with switches to allow users to specify where
# verbs stuff lives.
AC_REQUIRE([OPAL_CHECK_VERBS_DIR])
if test -z "$opal_check_openib_happy" ; then
#
# Add padding to OpenIB header
#
AC_ARG_ENABLE([openib-control-hdr-padding],
[AC_HELP_STRING([--enable-openib-control-hdr-padding],
[Add padding bytes to the openib BTL control header (default:disabled)])])
AC_MSG_CHECKING([if want to add padding to the openib BTL control header])
if test "$enable_openib_control_hdr_padding" = "yes"; then
AC_MSG_RESULT([yes])
ompi_openib_pad_hdr=1
elif test "$enable_openib_control_hdr_padding" = "no"; then
AC_MSG_RESULT([no])
ompi_openib_pad_hdr=0
else
#
# Enable padding for SPARC platforms by default because the
# btl will segv otherwise. Keep padding disabled for other
# platforms since there are some performance implications with
# padding on for those platforms.
#
case "${host}" in
sparc*)
AC_MSG_RESULT([yes (enabled by default on SPARC)])
ompi_openib_pad_hdr=1
;;
*)
AC_MSG_RESULT([no])
ompi_openib_pad_hdr=0
;;
esac
fi
AC_DEFINE_UNQUOTED([OPAL_OPENIB_PAD_HDR], [$ompi_openib_pad_hdr],
[Add padding bytes to the openib BTL control header])
AS_IF([test "$opal_want_verbs" = "no"],
[opal_check_openib_happy="no"],
[opal_check_openib_happy="yes"])
ompi_check_openib_$1_save_CPPFLAGS="$CPPFLAGS"
ompi_check_openib_$1_save_LDFLAGS="$LDFLAGS"
ompi_check_openib_$1_save_LIBS="$LIBS"
AS_IF([test "$opal_check_openib_happy" = "yes"],
[AC_CHECK_HEADERS(
fcntl.h sys/poll.h,
[],
[AC_MSG_WARN([fcntl.h sys/poll.h not found. Can not build component.])
opal_check_openib_happy="no"])])
AS_IF([test "$opal_check_openib_happy" = "yes"],
[OPAL_CHECK_PACKAGE([opal_check_openib],
[infiniband/verbs.h],
[ibverbs],
[ibv_open_device],
[],
[$opal_verbs_dir],
[$opal_verbs_libdir],
[opal_check_openib_happy="yes"],
[opal_check_openib_happy="no"])])
CPPFLAGS="$CPPFLAGS $opal_check_openib_CPPFLAGS"
LDFLAGS="$LDFLAGS $opal_check_openib_LDFLAGS"
LIBS="$LIBS $opal_check_openib_LIBS"
AS_IF([test "$opal_check_openib_happy" = "yes"],
[AC_CACHE_CHECK(
[number of arguments to ibv_create_cq],
[ompi_cv_func_ibv_create_cq_args],
[AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[#include <infiniband/verbs.h> ]],
[[ibv_create_cq(NULL, 0, NULL, NULL, 0);]])],
[ompi_cv_func_ibv_create_cq_args=5],
[AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[#include <infiniband/verbs.h> ]],
[[ibv_create_cq(NULL, 0, NULL);]])],
[ompi_cv_func_ibv_create_cq_args=3],
[ompi_cv_func_ibv_create_cq_args="unknown"])])])
AS_IF([test "$ompi_cv_func_ibv_create_cq_args" = "unknown"],
[AC_MSG_WARN([Can not determine number of args to ibv_create_cq.])
AC_MSG_WARN([Not building component.])
opal_check_openib_happy="no"],
[AC_DEFINE_UNQUOTED([OPAL_IBV_CREATE_CQ_ARGS],
[$ompi_cv_func_ibv_create_cq_args],
[Number of arguments to ibv_create_cq])])])
#
# OpenIB dynamic SL
#
AC_ARG_ENABLE([openib-dynamic-sl],
[AC_HELP_STRING([--enable-openib-dynamic-sl],
[Enable openib BTL to query Subnet Manager for IB SL (default: enabled)])])
# Set these up so that we can do an AC_DEFINE below
# (unconditionally)
opal_check_openib_have_xrc=0
opal_check_openib_have_xrc_domains=0
opal_check_openib_have_opensm_devel=0
# If we have the openib stuff available, find out what we've got
AS_IF([test "$opal_check_openib_happy" = "yes"],
[AC_CHECK_DECLS([IBV_EVENT_CLIENT_REREGISTER, IBV_ACCESS_SO, IBV_ATOMIC_HCA], [], [],
[#include <infiniband/verbs.h>])
AC_CHECK_FUNCS([ibv_get_device_list ibv_resize_cq])
# struct ibv_device.transport_type was added in OFED v1.2
AC_CHECK_MEMBERS([struct ibv_device.transport_type], [], [],
[#include <infiniband/verbs.h>])
# We have to check functions both exits *and* are declared
# since some distro ship broken ibverbs devel headers
# IBV_DEVICE_XRC is common to all OFED versions
# ibv_create_xrc_rcv_qp was added in OFED 1.3
# ibv_cmd_open_xrcd (aka XRC Domains) was added in OFED 3.12
if test "$enable_connectx_xrc" = "yes"; then
AC_CHECK_DECLS([IBV_DEVICE_XRC],
[opal_check_openib_have_xrc=1
opal_check_openib_have_xrc_domains=1],
[],
[#include <infiniband/verbs.h>])
fi
if test "$enable_connectx_xrc" = "yes" \
&& test $opal_check_openib_have_xrc -eq 1; then
AC_CHECK_DECLS([ibv_create_xrc_rcv_qp],
[AC_CHECK_FUNCS([ibv_create_xrc_rcv_qp],
[], [opal_check_openib_have_xrc=0])],
[opal_check_openib_have_xrc=0],
[#include <infiniband/driver.h>])
fi
if test "$enable_connectx_xrc" = "yes" \
&& test $opal_check_openib_have_xrc_domains -eq 1; then
AC_CHECK_DECLS([ibv_cmd_open_xrcd],
[AC_CHECK_DECLS([IBV_SRQT_XRC],
[AC_CHECK_FUNCS([ibv_cmd_open_xrcd],
[], [opal_check_openib_have_xrc_domains=0])],
[opal_check_openib_have_xrc_domains=0],
[#include <infiniband/verbs.h>])],
[opal_check_openib_have_xrc_domains=0],
[#include <infiniband/driver.h>])
# XRC and XRC Domains should be considered as exclusive
if test "$opal_check_openib_have_xrc" -eq 1 && \
test "$opal_check_openib_have_xrc_domains" -eq 1; then
opal_check_openib_have_xrc=0
fi
fi
if test "no" != "$enable_openib_dynamic_sl"; then
# We need ib_types.h file, which is installed with opensm-devel
# package. However, ib_types.h has a bad include directive,
# which will cause AC_CHECK_HEADER to fail.
# So instead, we will look for another file that is also
# installed as part of opensm-devel package and included in
# ib_types.h, but it doesn't include any other IB-related files.
AC_CHECK_HEADER([infiniband/complib/cl_types_osd.h],
[AC_CHECK_LIB([osmcomp], [cl_map_init],
[opal_check_openib_have_opensm_devel=1],[])],
[],
[])
# Abort if dynamic SL support was explicitly requested but opensm-devel
# package wasn't found. Otherwise, OMPI will be built w/o dynamic SL.
AC_MSG_CHECKING([if can use dynamic SL support])
AS_IF([test "$opal_check_openib_have_opensm_devel" = "1"],
[AC_MSG_RESULT([yes])],
[AC_MSG_RESULT([no])
AS_IF([test "$enable_openib_dynamic_sl" = "yes"],
[AC_MSG_WARN([--enable-openib-dynamic-sl was specified but the])
AC_MSG_WARN([appropriate header/library files could not be found])
AC_MSG_WARN([Please install opensm-devel if you need dynamic SL support])
AC_MSG_ERROR([Cannot continue])])])
fi
# Check support for RDMAoE devices
$1_have_rdmaoe=0
AC_CHECK_DECLS([IBV_LINK_LAYER_ETHERNET],
[$1_have_rdmaoe=1], [],
[#include <infiniband/verbs.h>])
AC_MSG_CHECKING([if RDMAoE support is enabled])
AC_DEFINE_UNQUOTED([OPAL_HAVE_RDMAOE], [$$1_have_rdmaoe], [Enable RDMAoE support])
if test "1" = "$$1_have_rdmaoe"; then
AC_MSG_RESULT([yes])
else
AC_MSG_RESULT([no])
fi
])
# Check to see if <infiniband/driver.h> works. It is known to
# create problems on some platforms with some compilers (e.g.,
# RHEL4U3 with the PGI 32 bit compiler). Use undocumented (in AC
# 2.63) feature of AC_CHECK_HEADERS: if you explicitly pass in
# AC_INCLUDES_DEFAULT as the 4th arg to AC_CHECK_HEADERS, the test
# will fail if the header is present but not compilable, *but it
# will not print the big scary warning*. See
# http://lists.gnu.org/archive/html/autoconf/2008-10/msg00143.html.
AS_IF([test "$opal_check_openib_happy" = "yes"],
[AC_CHECK_HEADERS([infiniband/driver.h], [], [],
[AC_INCLUDES_DEFAULT])])
AC_MSG_CHECKING([if ConnectX XRC support is enabled])
AC_DEFINE_UNQUOTED([OPAL_HAVE_CONNECTX_XRC], [$opal_check_openib_have_xrc],
[Enable features required for ConnectX XRC support])
if test "1" = "$opal_check_openib_have_xrc"; then
AC_MSG_RESULT([yes])
else
AC_MSG_RESULT([no])
fi
AC_MSG_CHECKING([if ConnectIB XRC support is enabled])
AC_DEFINE_UNQUOTED([OPAL_HAVE_CONNECTX_XRC_DOMAINS], [$opal_check_openib_have_xrc_domains],
[Enable features required for XRC domains support])
if test "1" = "$opal_check_openib_have_xrc_domains"; then
AC_MSG_RESULT([yes])
else
AC_MSG_RESULT([no])
fi
AC_MSG_CHECKING([if dynamic SL is enabled])
AC_DEFINE_UNQUOTED([OPAL_ENABLE_DYNAMIC_SL], [$opal_check_openib_have_opensm_devel],
[Enable features required for dynamic SL support])
if test "1" = "$opal_check_openib_have_opensm_devel"; then
AC_MSG_RESULT([yes])
$1_LIBS="-losmcomp $$1_LIBS"
else
AC_MSG_RESULT([no])
fi
AS_IF([test -z "$opal_verbs_dir"],
[openib_include_dir="/usr/include"],
[openib_include_dir="$opal_verbs_dir/include"])
opal_check_openib_CPPFLAGS="$opal_check_openib_CPPFLAGS -I$openib_include_dir/infiniband"
CPPFLAGS="$ompi_check_openib_$1_save_CPPFLAGS"
LDFLAGS="$ompi_check_openib_$1_save_LDFLAGS"
LIBS="$ompi_check_openib_$1_save_LIBS"
OPAL_SUMMARY_ADD([[Transports]],[[OpenFabrics Verbs]],[$1],[$opal_check_openib_happy])
OPAL_VAR_SCOPE_POP
fi
$1_have_xrc=$opal_check_openib_have_xrc
$1_have_xrc_domains=$opal_check_openib_have_xrc_domains
$1_have_opensm_devel=$opal_check_openib_have_opensm_devel
AS_IF([test "$opal_check_openib_happy" = "yes"],
[$1_CPPFLAGS="[$]$1_CPPFLAGS $opal_check_openib_CPPFLAGS"
$1_LDFLAGS="[$]$1_LDFLAGS $opal_check_openib_LDFLAGS"
$1_LIBS="[$]$1_LIBS $opal_check_openib_LIBS"
$2],
[AS_IF([test "$opal_want_verbs" = "yes"],
[AC_MSG_WARN([Verbs support requested (via --with-verbs) but not found.])
AC_MSG_WARN([If you are using libibverbs v1.0 (i.e., OFED v1.0 or v1.1), you *MUST* have both the libsysfs headers and libraries installed. Later versions of libibverbs do not require libsysfs.])
AC_MSG_ERROR([Aborting.])])
$3])
])
AC_DEFUN([OPAL_CHECK_OPENFABRICS_CM_ARGS],[
#
# ConnectX XRC support - disabled see issue #3890
#
dnl AC_ARG_ENABLE([openib-connectx-xrc],
dnl [AC_HELP_STRING([--enable-openib-connectx-xrc],
dnl [Enable ConnectX XRC support in the openib BTL. (default: disabled)])],
dnl [enable_connectx_xrc="$enableval"], [enable_connectx_xrc="no"])
enable_connectx_xrc="no"
#
# Unconnect Datagram (UD) based connection manager
#
AC_ARG_ENABLE([openib-udcm],
[AC_HELP_STRING([--enable-openib-udcm],
[Enable datagram connection support in openib BTL (default: enabled)])],
[enable_openib_udcm="$enableval"], [enable_openib_udcm="yes"])
# Per discussion with Ralph and Nathan, disable UDCM for now.
# It's borked and needs some surgery to get back on its feet.
# enable_openib_udcm=no
#
# Openfabrics RDMACM
#
AC_ARG_ENABLE([openib-rdmacm],
[AC_HELP_STRING([--enable-openib-rdmacm],
[Enable Open Fabrics RDMACM support in openib BTL (default: enabled)])])
AC_ARG_ENABLE([openib-rdmacm-ibaddr],
[AC_HELP_STRING([--enable-openib-rdmacm-ibaddr],
[Enable Open Fabrics RDMACM with IB addressing support in openib BTL (default: disabled)])],
[enable_openib_rdmacm=yes])
])dnl
AC_DEFUN([OPAL_CHECK_OPENFABRICS_CM],[
AC_REQUIRE([OPAL_CHECK_OPENFABRICS_CM_ARGS])
$1_have_udcm=0
$1_have_rdmacm=0
ompi_check_openib_$1_save_CPPFLAGS="$CPPFLAGS"
ompi_check_openib_$1_save_LDFLAGS="$LDFLAGS"
ompi_check_openib_$1_save_LIBS="$LIBS"
# add back in all the InfiniBand flags so that these tests might work...
CPPFLAGS="$CPPFLAGS $$1_CPPFLAGS"
LDFLAGS="$LDFLAGS $$1_LDFLAGS"
LIBS="$LIBS $$1_LIBS"
AS_IF([test "$opal_check_openib_happy" = "yes"],
[# Do we have a recent enough RDMA CM? Need to have the
# rdma_get_peer_addr (inline) function (originally appeared
# in OFED v1.3).
if test "$enable_openib_rdmacm" != "no"; then
AC_CHECK_HEADERS([rdma/rdma_cma.h],
[AC_CHECK_LIB([rdmacm], [rdma_create_id],
[AC_MSG_CHECKING([for rdma_get_peer_addr])
$1_msg=no
AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include "rdma/rdma_cma.h"
]], [[void *ret = (void*) rdma_get_peer_addr((struct rdma_cm_id*)0);]])],
[$1_have_rdmacm=1
$1_msg=yes])
AC_MSG_RESULT([$$1_msg])])])
if test "1" = "$$1_have_rdmacm"; then
$1_LIBS="-lrdmacm $$1_LIBS"
else
AS_IF([test "$enable_openib_rdmacm" = "yes"],
[AC_MSG_WARN([--enable-openib-rdmacm was specified but the])
AC_MSG_WARN([appropriate files could not be found])
AC_MSG_WARN([Please install librdmacm and librdmacm-devel or disable rdmacm support])
AC_MSG_ERROR([Cannot continue])])
fi
fi
# is udcm enabled
if test "$enable_openib_udcm" = "yes"; then
$1_have_udcm=1
fi
])
CPPFLAGS="$ompi_check_openib_$1_save_CPPFLAGS"
LDFLAGS="$ompi_check_openib_$1_save_LDFLAGS"
LIBS="$ompi_check_openib_$1_save_LIBS"
AC_MSG_CHECKING([if UD CM is enabled])
AC_DEFINE_UNQUOTED([OPAL_HAVE_UDCM], [$$1_have_udcm],
[Whether UD CM is available or not])
if test "1" = "$$1_have_udcm"; then
AC_MSG_RESULT([yes])
else
AC_MSG_RESULT([no])
fi
AC_MSG_CHECKING([if OpenFabrics RDMACM support is enabled])
AC_DEFINE_UNQUOTED([OPAL_HAVE_RDMACM], [$$1_have_rdmacm],
[Whether RDMA CM is available or not])
if test "1" = "$$1_have_rdmacm"; then
AC_MSG_RESULT([yes])
else
AC_MSG_RESULT([no])
fi
])dnl
AC_DEFUN([OPAL_CHECK_EXP_VERBS],[
OPAL_VAR_SCOPE_PUSH([have_struct_ibv_exp_send_wr])
AC_MSG_CHECKING([whether expanded verbs are available])
AC_TRY_COMPILE([#include <infiniband/verbs_exp.h>], [struct ibv_exp_send_wr;],
[have_struct_ibv_exp_send_wr=1
AC_MSG_RESULT([yes])],
[have_struct_ibv_exp_send_wr=0
AC_MSG_RESULT([no])])
AC_DEFINE_UNQUOTED([HAVE_EXP_VERBS], [$have_struct_ibv_exp_send_wr], [Experimental verbs])
AC_CHECK_DECLS([IBV_EXP_ATOMIC_HCA_REPLY_BE, IBV_EXP_QP_CREATE_ATOMIC_BE_REPLY, ibv_exp_create_qp, ibv_exp_query_device, IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG],
[], [], [#include <infiniband/verbs_exp.h>])
AC_CHECK_MEMBERS([struct ibv_exp_device_attr.ext_atom, struct ibv_exp_device_attr.exp_atomic_cap], [], [],
[[#include <infiniband/verbs_exp.h>]])
AS_IF([test '$have_struct_ibv_exp_send_wr' = 1], [$1], [$2])
OPAL_VAR_SCOPE_POP
])dnl
AC_DEFUN([OPAL_CHECK_MLNX_OPENFABRICS],[
$1_have_mverbs=0
$1_have_mqe=0
AS_IF([test "$opal_check_openib_happy" = "yes"],
[OPAL_CHECK_PACKAGE([$1],
[infiniband/mverbs.h],
[mverbs],
[ibv_m_query_device],
["$$1_LIBS"],
[$opal_verbs_dir],
[$opal_verbs_libdir],
[$1_have_mverbs=1],
[])])
AS_IF([test "$opal_check_openib_happy" = "yes"],
[OPAL_CHECK_PACKAGE([$1],
[infiniband/mqe.h],
[mqe],
[mqe_context_create],
["$$1_LIBS"],
[$opal_verbs_dir],
[$opal_verbs_libdir],
[$1_have_mqe=1],
[])])
AC_MSG_CHECKING([if Mellanox OpenFabrics VERBS is enabled])
AC_DEFINE_UNQUOTED([OPAL_HAVE_MVERBS], [$$1_have_mverbs],
[Whether MVERBS is available or not])
AS_IF([test "1" = "$$1_have_mverbs"],
[AC_MSG_RESULT([yes])],
[AC_MSG_RESULT([no])])
# save the CPPFLAGS since we would have to update it for next test
ompi_check_mellanox_openfabrics_$1_save_CPPFLAGS="$CPPFLAGS"
# If openfabrics custom directory have been defined, we have
# to use it for MACRO test that uses mverbs.h file.
#
if test ! -z "$ompi_check_verbs_dir" ; then
CPPFLAGS="-I${opal_verbs_dir}/include $CPPFLAGS"
fi
AS_IF([test "1" = "$$1_have_mverbs"],
[AC_CHECK_DECLS([IBV_M_WR_CALC_RDMA_WRITE_WITH_IMM],
[AC_DEFINE_UNQUOTED([OPAL_HAVE_IBOFFLOAD_CALC_RDMA], [1],
[Whether IBV_M_WR_CALC_SEND is defined or not])],
[AC_DEFINE_UNQUOTED([OPAL_HAVE_IBOFFLOAD_CALC_RDMA], [0],
[Whether IBV_M_WR_CALC_SEND is defined or not])],
[#include <infiniband/mverbs.h>])])
# restoring the CPPFLAGS
CPPFLAGS="$ompi_check_mellanox_openfabrics_$1_save_CPPFLAGS"
AC_MSG_CHECKING([if Mellanox OpenFabrics MQE is enabled])
AC_DEFINE_UNQUOTED([OPAL_HAVE_MQE], [$$1_have_mqe],
[Whether MQE is available or not])
AS_IF([test "1" = "$$1_have_mqe"],
[AC_MSG_RESULT([yes])],
[AC_MSG_RESULT([no])])
AS_IF([test "1" = "$$1_have_mverbs" && test "1" = $$1_have_mqe],
[$2], [$3])
])dnl
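
The header-padding default chosen near the top of the removed OPAL_CHECK_OPENFABRICS macro can be restated outside of Autoconf. The following is a minimal sketch, not code from Open MPI: the function name pad_hdr_default is hypothetical, and it only mirrors the three-way decision visible in the macro (an explicit "yes", an explicit "no", otherwise padding is enabled for SPARC hosts only).

```shell
#!/bin/sh
# Hypothetical helper, not part of Open MPI.
# $1: value given to --enable-openib-control-hdr-padding ("yes", "no", or empty)
# $2: host triplet as configure would report it in ${host}
# Prints 1 if padding would be enabled, 0 otherwise.
pad_hdr_default() {
    case "$1" in
        yes) echo 1 ;;              # explicitly enabled
        no)  echo 0 ;;              # explicitly disabled
        *)                          # not specified: decide by platform
            case "$2" in
                sparc*) echo 1 ;;   # the macro enables padding on SPARC
                *)      echo 0 ;;   # disabled elsewhere for performance
            esac
            ;;
    esac
}
```

For example, `pad_hdr_default "" sparc64-unknown-linux-gnu` selects padding, while the same call with an x86_64 triplet does not.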


@ -1,120 +0,0 @@
dnl -*- shell-script -*-
dnl
dnl Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
dnl University Research and Technology
dnl Corporation. All rights reserved.
dnl Copyright (c) 2004-2005 The University of Tennessee and The University
dnl of Tennessee Research Foundation. All rights
dnl reserved.
dnl Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
dnl University of Stuttgart. All rights reserved.
dnl Copyright (c) 2004-2005 The Regents of the University of California.
dnl All rights reserved.
dnl Copyright (c) 2006-2014 Cisco Systems, Inc. All rights reserved.
dnl Copyright (c) 2006-2011 Los Alamos National Security, LLC. All rights
dnl reserved.
dnl Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
dnl Copyright (c) 2010-2012 Oracle and/or its affiliates. All rights reserved.
dnl Copyright (c) 2015 Research Organization for Information Science
dnl and Technology (RIST). All rights reserved.
dnl $COPYRIGHT$
dnl
dnl Additional copyrights may follow
dnl
dnl $HEADER$
dnl
# Internal helper macro to look for the verbs libdir
# --------------------------------------------------------
AC_DEFUN([_OPAL_CHECK_VERBS_LIBDIR],[
AS_IF([test -d "$1"],
[AS_IF([test "x`ls $1/libibverbs.* 2> /dev/null`" != "x"],
[opal_verbs_libdir="$1"])
])
])
# Internal helper macro to look for the verbs dir
# --------------------------------------------------------
AC_DEFUN([_OPAL_CHECK_VERBS_DIR],[
AS_IF([test -d "$1"],
[AS_IF([test -f "$1/include/infiniband/verbs.h"],
[opal_verbs_dir="$1"])
])
])
# OPAL_CHECK_VERBS_DIR
# --------------------------------------------------------
# Add --with-verbs options, and if directories are specified,
# sanity check them.
#
# At the end of this macro:
#
# 1. $opal_want_verbs will be set to:
# "yes" if --with-verbs or --with-verbs=DIR was specified
# "no" if --without-verbs was specified)
# "optional" if neither --with-verbs* nor --without-verbs was specified
#
# 2. $opal_verbs_dir and $opal_verbs_libdir with either both be set or
# both be empty.
#
AC_DEFUN([OPAL_CHECK_VERBS_DIR],[
# Add --with options
AC_ARG_WITH([verbs],
[AC_HELP_STRING([--with-verbs(=DIR)],
[Build verbs support, optionally adding DIR/include, DIR/lib, and DIR/lib64 to the search path for headers and libraries])])
AC_ARG_WITH([verbs-libdir],
[AC_HELP_STRING([--with-verbs-libdir=DIR],
[Search for verbs libraries in DIR])])
# Sanity check the --with values
OPAL_CHECK_WITHDIR([verbs], [$with_verbs],
[include/infiniband/verbs.h])
OPAL_CHECK_WITHDIR([verbs-libdir], [$with_verbs_libdir],
[libibverbs.*])
# Set standardized shell variables for OFED lovin' components to
# use. Either both of $opal_verbs_dir and
# $verbs_libdir will be set, or neither will be set.
opal_want_verbs=no
AS_IF([test -z "$with_verbs"],
[opal_want_verbs=optional],
[AS_IF([test "$with_verbs" = "no"],
[opal_want_verbs=no],
[opal_want_verbs=yes])
])
opal_verbs_dir=
AS_IF([test -n "$with_verbs" && test "$with_verbs" != "yes" && test "$with_verbs" != "no"],
[opal_verbs_dir=$with_verbs])
opal_verbs_libdir=
AS_IF([test -n "$with_verbs_libdir" && test "$with_verbs_libdir" != "yes" && test "$with_verbs_libdir" != "no"],
[opal_verbs_libdir=$with_verbs_libdir])
# If the top dir was specified but the libdir was not, look for
# it. Note that if the user needs a specific libdir (i.e., if our
# hueristic ordering below is not sufficient), they need to
# specify it.
AS_IF([test -z "$opal_verbs_libdir" && test -n "$opal_verbs_dir"],
[_OPAL_CHECK_VERBS_LIBDIR(["$opal_verbs_dir/lib"])])
AS_IF([test -z "$opal_verbs_libdir" && test -n "$opal_verbs_dir"],
[_OPAL_CHECK_VERBS_LIBDIR(["$opal_verbs_dir/lib64"])])
AS_IF([test -z "$opal_verbs_libdir" && test -n "$opal_verbs_dir"],
[_OPAL_CHECK_VERBS_LIBDIR(["$opal_verbs_dir/lib32"])])
AS_IF([test -z "$opal_verbs_libdir" && test -n "$opal_verbs_dir"],
[AC_MSG_WARN([Could not find libibverbs in the usual locations under $opal_verbs_dir])
AC_MSG_ERROR([Cannot continue])
])
# If the libdir was specified, but the top dir was not, look for
# it. Note that if the user needs a specific top dir (i.e., if
# our hueristic below is not sufficient), they need to specify it.
AS_IF([test -z "$opal_verbs" && test -n "$opal_verbs_libdir"],
[_OPAL_CHECK_VERBS_DIR([`dirname "$opal_verbs_libdir"`])])
AS_IF([test -z "$opal_verbs_dir" && test -n "$opal_verbs_libdir"],
[AC_MSG_WARN([Could not find verbs.h in the usual locations under $opal_verbs_dir])
AC_MSG_ERROR([Cannot continue])
])
])
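
The directory handling in the removed OPAL_CHECK_VERBS_DIR macro comes down to two decisions: how the --with-verbs value maps to $opal_want_verbs, and which library subdirectory is picked when only the top-level directory is given. A minimal plain-shell sketch of both follows. The function names want_verbs and find_verbs_libdir are hypothetical and do not exist in Open MPI; the search order (lib, lib64, lib32) is taken from the macro above.

```shell
#!/bin/sh
# Hypothetical helper: map the --with-verbs value to the three states
# documented for $opal_want_verbs.
want_verbs() {
    case "$1" in
        "") echo optional ;;  # neither --with-verbs nor --without-verbs given
        no) echo no ;;        # --without-verbs
        *)  echo yes ;;       # --with-verbs or --with-verbs=DIR
    esac
}

# Hypothetical helper: given a top-level directory, print the first of
# lib, lib64, lib32 that contains a libibverbs.* file.
# Returns 1 if none of them does.
find_verbs_libdir() {
    for sub in lib lib64 lib32; do
        if [ -d "$1/$sub" ] && ls "$1/$sub"/libibverbs.* >/dev/null 2>&1; then
            echo "$1/$sub"
            return 0
        fi
    done
    return 1
}
```

A caller could treat a non-zero return from `find_verbs_libdir` the way the macro does, by warning that libibverbs was not found under the given directory and aborting.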


@ -158,7 +158,6 @@ AC_SUBST(libmca_opal_common_ofi_so_version)
AC_SUBST(libmca_opal_common_cuda_so_version)
AC_SUBST(libmca_opal_common_sm_so_version)
AC_SUBST(libmca_opal_common_ugni_so_version)
AC_SUBST(libmca_opal_common_verbs_so_version)
AC_SUBST(libmca_orte_common_alps_so_version)
AC_SUBST(libmca_ompi_common_ompio_so_version)
AC_SUBST(libmca_ompi_common_monitoring_so_version)


@ -4,7 +4,6 @@ enable_debug=yes
enable_mem_profile=no
enable_contrib_no_build=libnbc
enable_ft_thread=no
with_verbs=/usr
CXXFLAGS="-m64 -mcpu=power6 -mtune=power6 -O0 -g3 -ggdb"
CCASFLAGS="-m64 -mcpu=power6 -mtune=power6 -O0 -g3 -ggdb"
FCFLAGS="-m64 -mcpu=power6 -mtune=power6 -O0 -g3 -ggdb"


@ -4,7 +4,6 @@ enable_debug=yes
enable_mem_profile=no
enable_contrib_no_build=libnbc
enable_ft_thread=no
with_verbs=/usr
CXXFLAGS="-m64 -mcpu=power7 -mtune=power7 -O0 -g3 -ggdb"
CCASFLAGS="-m64 -mcpu=power7 -mtune=power7 -O0 -g3 -ggdb"
FCFLAGS="-m64 -mcpu=power7 -mtune=power7 -O0 -g3 -ggdb"


@ -4,7 +4,6 @@ enable_debug=yes
enable_mem_profile=no
enable_contrib_no_build=libnbc
enable_ft_thread=no
with_verbs=/usr
CXXFLAGS="-m32 -mcpu=powerpc64 -mtune=powerpc64 -O0 -g3 -ggdb"
CCASFLAGS="-m32 -mcpu=powerpc64 -mtune=powerpc64 -O0 -g3 -ggdb"
FCFLAGS="-m32 -mcpu=powerpc64 -mtune=powerpc64 -O0 -g3 -ggdb"


@ -4,7 +4,6 @@ enable_debug=yes
enable_mem_profile=no
enable_contrib_no_build=libnbc
enable_ft_thread=no
with_verbs=/usr
CXXFLAGS="-m64 -mcpu=powerpc64 -mtune=powerpc64 -O0 -g3 -ggdb"
CCASFLAGS="-m64 -mcpu=powerpc64 -mtune=powerpc64 -O0 -g3 -ggdb"
FCFLAGS="-m64 -mcpu=powerpc64 -mtune=powerpc64 -O0 -g3 -ggdb"


@ -3,7 +3,6 @@ enable_mem_profile=no
enable_debug=no
enable_contrib_no_build=libnbc
enable_ft_thread=no
with_verbs=/usr
enable_shared=yes
enable_static=no
CXXFLAGS="-m64 -mcpu=power6 -mtune=power6 -O3"


@ -3,7 +3,6 @@ enable_mem_profile=no
enable_debug=no
enable_contrib_no_build=libnbc
enable_ft_thread=no
with_verbs=/usr
enable_shared=yes
enable_static=no
CXXFLAGS="-m64 -mcpu=power7 -mtune=power7 -O3"


@ -3,7 +3,6 @@ enable_mem_profile=no
enable_debug=no
enable_contrib_no_build=libnbc
enable_ft_thread=no
with_verbs=/usr
enable_shared=yes
enable_static=no
CXXFLAGS="-m32 -mcpu=powerpc64 -mtune=powerpc64 -O3"


@ -3,7 +3,6 @@ enable_mem_profile=no
enable_debug=no
enable_contrib_no_build=libnbc
enable_ft_thread=no
with_verbs=/usr
enable_shared=yes
enable_static=no
CXXFLAGS="-m64 -mcpu=powerpc64 -mtune=powerpc64 -O3"


@ -24,7 +24,6 @@ enable_mca_no_build=btl-tcp,btl-sm,rcache-udreg
enable_mca_direct=pml-ob1
with_memory_manager=no
with_tm=no
with_verbs=no
with_devel_headers=yes
with_portals=no
with_valgrind=no


@ -21,7 +21,6 @@ enable_io_romio=no
enable_contrib_no_build=libnbc
with_memory_manager=no
with_tm=no
with_verbs=no
with_devel_headers=yes
with_portals=no
with_valgrind=no


@ -1,6 +1,3 @@
# do not use IB verbs
with_verbs=no
enable_dlopen=no
enable_mem_profile=no
enable_binaries=yes


@ -1,6 +1,3 @@
# do not use IB verbs
with_verbs=no
enable_dlopen=no
enable_mem_profile=no


@ -1,9 +1,6 @@
# (c) 2012-2014 Los Alamos National Security, LLC. All rights reserved.
# Common Cray XE/XK-6 options (used by all builds)
# do not use IB verbs
with_verbs=no
# enable XPMEM enhanced shared memory (needs for Vader BTL)
with_xpmem=/opt/cray/xpmem/0.1-2.0400.30792.5.6.gem


@ -32,9 +32,6 @@ with_tm=no
# Enable PMI support for direct launch
with_pmi=yes
# Always build ibverbs support
with_verbs=yes
# Install the development headers
with_devel_headers=yes


@ -29,7 +29,6 @@ enable_pci=no
enable_libpci=no
with_pmi=no
with_slurm=no
with_verbs=no
MIC_LDFLAGS="-L$XCOMPOSER/compiler/lib/mic -Wl,-rpath=$XCOMPOSER/compiler/lib/mic"


@ -17,7 +17,6 @@ created.
- with_slurm
- with_tm
- with_pmi
- with_verbs
- NOTE: common had "with_devel_headers=yes" in it that was not propagated.
This option should not be used in production as per Open MPI developer
mailing list guidance.
@ -58,9 +57,6 @@ created.
- change: comment "Disable components not needed on TOSS platforms with
high-speed networks" to "Disable components not needed on TOSS Ethernet-
connected clusters"
- change: with_verbs=no
- change: comment "Always build ibverbs support" to "Do not build ibverbs
support"
- toss3-wc-optimized.conf
- copy of toss3-hfi-optimized.conf with the following changes:
- change: comment "Add the interface for out-of-band communication and set
@ -83,7 +79,6 @@ created.
- enable_mca_no_build=crs,filem,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm,ess-cnos,grpcomm-cnos,plm-rsh,btl-tcp,oob-ud,ras-simulator,mpool-fake
- enable_mca_static=btl:ugni,btl:self,btl:vader,pml:ob1
- enable_mca_directpml-ob1
- with_verbs=no
- with_tm=no
- enable_orte_static_ports=no
- enable_pty_support=no


@ -20,9 +20,6 @@ enable_mca_static=btl:ugni,btl:self,btl:vader,pml:ob1
# enable direct calling for ob1
enable_mca_direct=pml-ob1
# do not use IB verbs
with_verbs=no
# do not use torque
with_tm=no


@ -16,6 +16,3 @@ with_pmi=yes
# Enable lustre support in romio
with_io_romio_flags=--with-file-system=ufs+nfs+lustre
# Always build ibverbs support
with_verbs=yes


@ -16,6 +16,3 @@ with_pmi=yes
# Enable lustre support in romio
with_io_romio_flags=--with-file-system=ufs+nfs+lustre
# Always build ibverbs support
with_verbs=yes


@ -16,6 +16,3 @@ with_pmi=yes
# Enable lustre support in romio
with_io_romio_flags=--with-file-system=ufs+nfs+lustre
# Always build ibverbs support
with_verbs=yes


@ -16,6 +16,3 @@ with_pmi=yes
# Enable lustre support in romio
with_io_romio_flags=--with-file-system=ufs+nfs+lustre
# Always build ibverbs support
with_verbs=yes


@ -16,6 +16,3 @@ with_pmi=yes
# Enable lustre support in romio
with_io_romio_flags=--with-file-system=ufs+nfs+lustre
# Do not build ibverbs support
with_verbs=no


@ -1,7 +1,6 @@
enable_mca_no_build=coll-ml,btl-uct
enable_debug_symbols=yes
enable_orterun_prefix_by_default=yes
with_verbs=no
with_devel_headers=yes
enable_oshmem=yes
enable_oshmem_fortran=yes


@ -53,10 +53,7 @@ ac_cv_func_usleep=${ac_cv_func_usleep=no}
ac_cv_func_vm_read_overwrite=${ac_cv_func_vm_read_overwrite=no}
ac_cv_func_waitpid=${ac_cv_func_waitpid=no}
if test "with_verbs" != "no" ; then
enable_mca_direct=pml-ob1
enable_mca_no_built="$enable_mca_no_build,btl-sm"
elif test "with_portals4" != "no" ; then
if test "with_portals4" != "no" ; then
enable_mca_no_build="$enable_mca_no_build,pml-ob1,btl,bml,mpool,rcache"
enable_mca_direct=pml-cm,mtl-portals4
fi


@ -4,6 +4,5 @@ enable_contrib_no_build=libnbc
enable_heterogeneous=no
enable_mem_debug=no
enable_mem_profile=no
with_verbs=no
with_gm=no
with_mx=no


@ -24,7 +24,6 @@ with_alps=yes
with_tm=no
with_slurm=no
with_xpmem=yes
with_verbs=no
enable_mca_no_build=crs,filem,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-example,crcp,pml-cm,ess-cnos,grpcomm-cnos,plm-rsh,btl-tcp,oob-ud,ras-simulator,mpool-fake,maffinity-first_use,maffinity-libnuma,paffinity-linux
enable_mca_static=btl:ugni,btl:self,btl:vader,pml:ob1,coll:ml
#enable_mca_direct=pml-ob1


@ -1,131 +0,0 @@
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2007-2014 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2011 NVIDIA Corporation. All rights reserved.
# Copyright (c) 2011 Mellanox Technologies. All rights reserved.
# Copyright (c) 2012 Oak Ridge National Laboratory. All rights reserved
# Copyright (c) 2013 Intel, Inc. All rights reserved.
# Copyright (c) 2016 Research Organization for Information Science
# and Technology (RIST). All rights reserved.
# Copyright (c) 2017 IBM Corporation. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
AM_CPPFLAGS = $(btl_openib_CPPFLAGS)
AM_LFLAGS = -Pbtl_openib_ini_yy
LEX_OUTPUT_ROOT = lex.btl_openib_ini_yy
amca_paramdir = $(AMCA_PARAM_SETS_DIR)
dist_amca_param_DATA = btl-openib-benchmark
dist_opaldata_DATA = \
help-mpi-btl-openib.txt \
connect/help-mpi-btl-openib-cpc-base.txt \
mca-btl-openib-device-params.ini
sources = \
btl_openib.c \
btl_openib.h \
btl_openib_component.c \
btl_openib_endpoint.c \
btl_openib_endpoint.h \
btl_openib_frag.c \
btl_openib_frag.h \
btl_openib_proc.c \
btl_openib_proc.h \
btl_openib_eager_rdma.h \
btl_openib_lex.h \
btl_openib_lex.l \
btl_openib_mca.c \
btl_openib_mca.h \
btl_openib_ini.c \
btl_openib_ini.h \
btl_openib_async.c \
btl_openib_async.h \
btl_openib_xrc.c \
btl_openib_xrc.h \
btl_openib_ip.h \
btl_openib_ip.c \
btl_openib_put.c \
btl_openib_get.c \
btl_openib_atomic.c \
connect/base.h \
connect/btl_openib_connect_base.c \
connect/btl_openib_connect_empty.c \
connect/btl_openib_connect_empty.h \
connect/connect.h
# If we have rdmacm support, build that CPC
if MCA_btl_openib_have_rdmacm
sources += \
connect/btl_openib_connect_rdmacm.c \
connect/btl_openib_connect_rdmacm.h
dist_opaldata_DATA += connect/help-mpi-btl-openib-cpc-rdmacm.txt
endif
# If we have udcm support, build that CPC
if MCA_btl_openib_have_udcm
sources += \
connect/btl_openib_connect_udcm.c \
connect/btl_openib_connect_udcm.h
# dist_opaldata_DATA += connect/help-mpi-btl-openib-cpc-ud.txt
endif
# If we have dynamic SL support, build those files
if MCA_btl_openib_have_dynamic_sl
sources += \
connect/btl_openib_connect_sl.c \
connect/btl_openib_connect_sl.h
endif
# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).
if MCA_BUILD_opal_btl_openib_DSO
lib =
lib_sources =
component = mca_btl_openib.la
component_sources = $(sources)
else
lib = libmca_btl_openib.la
lib_sources = $(sources)
component =
component_sources =
endif
mcacomponentdir = $(opallibdir)
mcacomponent_LTLIBRARIES = $(component)
mca_btl_openib_la_SOURCES = $(component_sources)
mca_btl_openib_la_LDFLAGS = -module -avoid-version $(btl_openib_LDFLAGS)
mca_btl_openib_la_LIBADD = $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la \
$(btl_openib_LIBS) \
$(OPAL_TOP_BUILDDIR)/opal/mca/common/verbs/lib@OPAL_LIB_PREFIX@mca_common_verbs.la
if OPAL_cuda_support
mca_btl_openib_la_LIBADD += \
$(OPAL_TOP_BUILDDIR)/opal/mca/common/cuda/lib@OPAL_LIB_PREFIX@mca_common_cuda.la
endif
noinst_LTLIBRARIES = $(lib)
libmca_btl_openib_la_SOURCES = $(lib_sources)
libmca_btl_openib_la_LDFLAGS= -module -avoid-version $(btl_openib_LDFLAGS)
libmca_btl_openib_la_LIBADD = $(btl_openib_LIBS)
maintainer-clean-local:
rm -f btl_openib_lex.c


@ -1,19 +0,0 @@
#
# These values are suitable for benchmarking with the openib and sm
# btls with a small number of MPI processes. If you're only going to
# use one process per node, remove "sm". These values are *NOT*
# scalable to large numbers of processes!
#
btl=openib,self,sm
btl_openib_max_btls=20
btl_openib_rd_num=128
btl_openib_rd_low=75
btl_openib_rd_win=50
btl_openib_max_eager_rdma=32
mpool_base_use_mem_hooks=1
mpi_leave_pinned=1
#
# Note that we are not limiting the max free list size, so for netpipe
# (for example), this is no problem. But we may want to explore the
# parameter space for other popular benchmarks.
#
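
The removed btl-openib-benchmark file above is a flat list of `name=value` MCA parameter settings with `#` comment lines. As a minimal sketch of that file format only, the following C helper splits one such line into its name and value. The function name `sketch_parse_param_line` is hypothetical and this is not Open MPI's actual parameter-file parser.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): split one "name=value" line.
 * Returns 1 and fills key/value on success; returns 0 for comment lines,
 * blank lines, lines without '=', or when a buffer is too small. */
int sketch_parse_param_line(const char *line, char *key, size_t key_len,
                            char *value, size_t value_len)
{
    const char *eq;
    size_t klen, vlen;

    /* Skip leading blanks, then ignore comments and empty lines. */
    while (*line == ' ' || *line == '\t') {
        line++;
    }
    if (*line == '#' || *line == '\n' || *line == '\r' || *line == '\0') {
        return 0;
    }

    eq = strchr(line, '=');
    if (eq == NULL || eq == line) {
        return 0; /* no '=' or an empty name */
    }

    klen = (size_t)(eq - line);
    if (klen >= key_len) {
        return 0;
    }
    memcpy(key, line, klen);
    key[klen] = '\0';

    /* The value runs up to the end of the line, excluding the newline. */
    vlen = strcspn(eq + 1, "\r\n");
    if (vlen >= value_len) {
        return 0;
    }
    memcpy(value, eq + 1, vlen);
    value[vlen] = '\0';
    return 1;
}
```

Feeding it the line `btl_openib_rd_num=128` yields the name `btl_openib_rd_num` and the value `128`, while the `#` comment lines of the file are skipped.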

File diff suppressed because it is too large.


@ -1,933 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2009 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2007 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2011 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2013-2014 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015-2018 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_BTL_IB_H
#define MCA_BTL_IB_H
#include "opal_config.h"
#include <sys/types.h>
#include <string.h>
#include <infiniband/verbs.h>
/* Open MPI includes */
#include "opal/class/opal_pointer_array.h"
#include "opal/class/opal_hash_table.h"
#include "opal/util/arch.h"
#include "opal/util/output.h"
#include "opal/mca/event/event.h"
#include "opal/threads/threads.h"
#include "opal/mca/btl/btl.h"
#include "opal/mca/rcache/rcache.h"
#include "opal/mca/mpool/mpool.h"
#include "opal/mca/btl/base/btl_base_error.h"
#include "opal/mca/btl/base/base.h"
#include "opal/runtime/opal_progress_threads.h"
#include "connect/connect.h"
BEGIN_C_DECLS
#define HAVE_XRC (OPAL_HAVE_CONNECTX_XRC || OPAL_HAVE_CONNECTX_XRC_DOMAINS)
#define ENABLE_DYNAMIC_SL OPAL_ENABLE_DYNAMIC_SL
#define MCA_BTL_IB_LEAVE_PINNED 1
#define IB_DEFAULT_GID_PREFIX 0xfe80000000000000ll
#define MCA_BTL_IB_PKEY_MASK 0x7fff
#define MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT (256)
/*--------------------------------------------------------------------*/
#if OPAL_ENABLE_DEBUG
#define ATTACH() do { \
int i = 0; \
opal_output(0, "WAITING TO DEBUG ATTACH"); \
while (i == 0) sleep(5); \
} while(0);
#else
#define ATTACH()
#endif
/*--------------------------------------------------------------------*/
/**
* Infiniband (IB) BTL component.
*/
enum {
BTL_OPENIB_HP_CQ,
BTL_OPENIB_LP_CQ,
BTL_OPENIB_MAX_CQ,
};
typedef enum {
MCA_BTL_OPENIB_TRANSPORT_IB,
MCA_BTL_OPENIB_TRANSPORT_IWARP,
MCA_BTL_OPENIB_TRANSPORT_RDMAOE,
MCA_BTL_OPENIB_TRANSPORT_UNKNOWN,
MCA_BTL_OPENIB_TRANSPORT_SIZE
} mca_btl_openib_transport_type_t;
typedef enum {
MCA_BTL_OPENIB_PP_QP,
MCA_BTL_OPENIB_SRQ_QP,
MCA_BTL_OPENIB_XRC_QP
} mca_btl_openib_qp_type_t;
struct mca_btl_openib_pp_qp_info_t {
int32_t rd_win;
int32_t rd_rsv;
}; typedef struct mca_btl_openib_pp_qp_info_t mca_btl_openib_pp_qp_info_t;
struct mca_btl_openib_srq_qp_info_t {
int32_t sd_max;
/* The init value for rd_curr_num variables of all SRQs */
int32_t rd_init;
/* The watermark (threshold): if the number of WQEs in the SRQ drops below this value,
the SRQ limit event (IBV_EVENT_SRQ_LIMIT_REACHED) is generated on the corresponding SRQ.
As a result, the maximum number of pre-posted WQEs on the SRQ is increased */
int32_t srq_limit;
}; typedef struct mca_btl_openib_srq_qp_info_t mca_btl_openib_srq_qp_info_t;
struct mca_btl_openib_qp_info_t {
mca_btl_openib_qp_type_t type;
size_t size;
int32_t rd_num;
int32_t rd_low;
union {
mca_btl_openib_pp_qp_info_t pp_qp;
mca_btl_openib_srq_qp_info_t srq_qp;
} u;
}; typedef struct mca_btl_openib_qp_info_t mca_btl_openib_qp_info_t;
#define BTL_OPENIB_QP_TYPE(Q) (mca_btl_openib_component.qp_infos[(Q)].type)
#define BTL_OPENIB_QP_TYPE_PP(Q) \
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_PP_QP)
#define BTL_OPENIB_QP_TYPE_SRQ(Q) \
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_SRQ_QP)
#define BTL_OPENIB_QP_TYPE_XRC(Q) \
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_XRC_QP)
typedef enum {
BTL_OPENIB_RQ_SOURCE_DEVICE_INI = MCA_BASE_VAR_SOURCE_MAX,
} btl_openib_receive_queues_source_t;
typedef enum {
BTL_OPENIB_DT_IB,
BTL_OPENIB_DT_IWARP,
BTL_OPENIB_DT_ALL
} btl_openib_device_type_t;
/* Structure for managing all BTL SRQs */
typedef struct mca_btl_openib_srq_manager_t {
opal_mutex_t lock;
/* The keys of this hash table are the addresses of
SRQ structures, and the elements are pointers to the
BTL modules associated with those SRQs */
opal_hash_table_t srq_addr_table;
} mca_btl_openib_srq_manager_t;
struct mca_btl_openib_component_t {
mca_btl_base_component_3_0_0_t super; /**< base BTL component */
int ib_max_btls;
/**< maximum number of devices available to openib component */
int ib_num_btls;
/**< number of devices available to the openib component */
int ib_allowed_btls;
/**< number of devices allowed to the openib component */
struct mca_btl_openib_module_t **openib_btls;
/**< array of available BTLs */
opal_pointer_array_t devices; /**< array of available devices */
int devices_count;
int ib_free_list_num;
/**< initial size of free lists */
int ib_free_list_max;
/**< maximum size of free lists */
int ib_free_list_inc;
/**< number of elements to alloc when growing free lists */
opal_list_t ib_procs;
/**< list of ib proc structures */
opal_event_t ib_send_event;
/**< event structure for sends */
opal_event_t ib_recv_event;
/**< event structure for recvs */
opal_mutex_t ib_lock;
/**< lock for accessing module state */
char* ib_mpool_hints;
/**< hints for selecting an mpool component */
char *ib_rcache_name;
/**< name of ib registration cache */
uint8_t num_pp_qps; /**< number of pp qp's */
uint8_t num_srq_qps; /**< number of srq qp's */
uint8_t num_xrc_qps; /**< number of xrc qp's */
uint8_t num_qps; /**< total number of qp's */
opal_hash_table_t ib_addr_table; /**< used only for XRC: hash table that
keeps a table of all LIDs/subnets */
mca_btl_openib_qp_info_t* qp_infos;
size_t eager_limit; /**< Eager send limit of first fragment, in Bytes */
size_t max_send_size; /**< Maximum send size, in Bytes */
uint32_t max_hw_msg_size;/**< Maximum message size for RDMA protocols in Bytes */
uint32_t reg_mru_len; /**< Length of the registration cache most recently used list */
uint32_t use_srq; /**< Use the Shared Receive Queue (SRQ mode) */
uint32_t ib_cq_size[BTL_OPENIB_MAX_CQ]; /**< Max outstanding CQE on the CQ */
int ib_max_inline_data; /**< Max size of inline data */
unsigned int ib_pkey_val;
unsigned int ib_psn;
unsigned int ib_qp_ous_rd_atom;
uint32_t ib_mtu;
unsigned int ib_min_rnr_timer;
unsigned int ib_timeout;
unsigned int ib_retry_count;
unsigned int ib_rnr_retry;
unsigned int ib_max_rdma_dst_ops;
unsigned int ib_service_level;
#if (ENABLE_DYNAMIC_SL)
unsigned int ib_path_record_service_level;
#endif
int use_eager_rdma;
int eager_rdma_threshold; /**< After this number of msg, use RDMA for short messages, always */
int eager_rdma_num;
int32_t max_eager_rdma;
unsigned int btls_per_lid;
unsigned int max_lmc;
int apm_lmc;
int apm_ports;
unsigned int buffer_alignment; /**< Preferred communication buffer alignment in Bytes (must be power of two) */
opal_atomic_int32_t error_counter; /**< Counts the number of error events that we got on all devices */
opal_event_base_t *async_evbase; /**< Async event base */
bool use_async_event_thread; /**< Use the async event handler */
mca_btl_openib_srq_manager_t srq_manager; /**< Hash table for all BTL SRQs */
/* declare as an int instead of btl_openib_device_type_t since there is no
guarantee about the size of an enum. this value will be registered as an
integer with the MCA variable system */
int device_type;
bool allow_ib;
char *if_include;
char **if_include_list;
char *if_exclude;
char **if_exclude_list;
char *ipaddr_include;
char *ipaddr_exclude;
/* MCA param btl_openib_receive_queues */
char *receive_queues;
/* Whether we got a non-default value of btl_openib_receive_queues */
mca_base_var_source_t receive_queues_source;
/** Colon-delimited list of filenames for device parameters */
char *device_params_file_names;
/** Whether we're in verbose mode or not */
bool verbose;
/** Whether we want a warning if no device-specific parameters are
found in INI files */
bool warn_no_device_params_found;
/** Whether we want a warning if non default GID prefix is not configured
on multiport setup */
bool warn_default_gid_prefix;
/** Whether we want a warning if the user specifies a non-existent
device and/or port via btl_openib_if_[in|ex]clude MCA params */
bool warn_nonexistent_if;
/** Whether we want to abort if there's not enough registered
memory available */
bool abort_not_enough_reg_mem;
/** Dummy argv-style list; a copy of names from the
if_[in|ex]clude list that we use for error checking (to ensure
that they all exist) */
char **if_list;
bool use_message_coalescing;
unsigned int cq_poll_ratio;
unsigned int cq_poll_progress;
unsigned int cq_poll_batch;
unsigned int eager_rdma_poll_ratio;
int rdma_qp;
int credits_qp; /* qp used for software flow control */
bool cpc_explicitly_defined;
/**< free list of frags only; used for pinning user memory */
opal_free_list_t send_user_free;
/**< free list of frags only; used for pinning user memory */
opal_free_list_t recv_user_free;
/**< frags for coalesced messages */
opal_free_list_t send_free_coalesced;
/** Default receive queues */
char* default_recv_qps;
/** GID index to use */
int gid_index;
/* Whether we want to allow connecting processes from different subnets.
* set to 'no' by default */
bool allow_different_subnets;
/** Whether we want a dynamically resizing srq, enabled by default */
bool enable_srq_resize;
bool allow_max_memory_registration;
int memory_registration_verbose_level;
int memory_registration_verbose;
int ignore_locality;
#if OPAL_CUDA_SUPPORT
bool cuda_async_send;
bool cuda_async_recv;
bool cuda_have_gdr;
bool driver_have_gdr;
bool cuda_want_gdr;
#endif /* OPAL_CUDA_SUPPORT */
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
bool rroce_enable;
#endif
unsigned int num_default_gid_btls; /* number of BTLs in the default subnet */
}; typedef struct mca_btl_openib_component_t mca_btl_openib_component_t;
OPAL_MODULE_DECLSPEC extern mca_btl_openib_component_t mca_btl_openib_component;
typedef mca_btl_base_recv_reg_t mca_btl_openib_recv_reg_t;
/**
* Common information for all ports that is sent in the modex message
*/
typedef struct mca_btl_openib_modex_message_t {
/** The subnet ID of this port */
uint64_t subnet_id;
/** LID of this port */
uint16_t lid;
/** APM LID for this port */
uint16_t apm_lid;
/** The MTU used by this port */
uint8_t mtu;
/** vendor id define device type and tuning */
uint32_t vendor_id;
/** vendor part id define device type and tuning */
uint32_t vendor_part_id;
/** Transport type of remote port */
uint8_t transport_type;
/** Dummy field used to calculate the real length */
uint8_t end;
} mca_btl_openib_modex_message_t;
#define MCA_BTL_OPENIB_MODEX_MSG_NTOH(hdr) \
do { \
(hdr).subnet_id = ntoh64((hdr).subnet_id); \
(hdr).lid = ntohs((hdr).lid); \
} while (0)
#define MCA_BTL_OPENIB_MODEX_MSG_HTON(hdr) \
do { \
(hdr).subnet_id = hton64((hdr).subnet_id); \
(hdr).lid = htons((hdr).lid); \
} while (0)
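
The two macros above convert only the multi-byte `subnet_id` and `lid` fields between host and network byte order. As a rough illustration of what that conversion does, here is a self-contained sketch on a reduced stand-in struct. The names `modex_sketch`, `sketch_hton64`, `sketch_modex_hton` and `sketch_modex_ntoh` are hypothetical, and the 64-bit swap is built from POSIX `htonl()` instead of Open MPI's own `hton64`/`ntoh64`, whose implementation is not shown in this diff.

```c
#include <arpa/inet.h> /* htons, ntohs, htonl (POSIX) */
#include <stdint.h>

/* Reduced, hypothetical stand-in for mca_btl_openib_modex_message_t:
 * the two converted fields plus one single-byte field. */
struct modex_sketch {
    uint64_t subnet_id;
    uint16_t lid;
    uint8_t mtu;
};

/* Hypothetical 64-bit conversion built from htonl(); not Open MPI's hton64.
 * The same byte swap converts in both directions. */
uint64_t sketch_hton64(uint64_t v)
{
    if (htonl(1) == 1) {
        return v; /* big-endian host: already in network order */
    }
    return ((uint64_t)htonl((uint32_t)(v & 0xffffffffu)) << 32) |
           (uint64_t)htonl((uint32_t)(v >> 32));
}

/* Host to network order, field by field. */
void sketch_modex_hton(struct modex_sketch *m)
{
    m->subnet_id = sketch_hton64(m->subnet_id);
    m->lid = htons(m->lid);
    /* mtu is a single byte and needs no conversion */
}

/* Network to host order, field by field. */
void sketch_modex_ntoh(struct modex_sketch *m)
{
    m->subnet_id = sketch_hton64(m->subnet_id);
    m->lid = ntohs(m->lid);
}
```

A sender would call `sketch_modex_hton()` before publishing the message and a receiver would call `sketch_modex_ntoh()` after reading it, so both sides agree on the values regardless of host endianness.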
typedef struct mca_btl_openib_device_qp_t {
opal_free_list_t send_free; /**< free lists of send buffer descriptors */
opal_free_list_t recv_free; /**< free lists of receive buffer descriptors */
} mca_btl_openib_device_qp_t;
struct mca_btl_base_endpoint_t;
typedef struct mca_btl_openib_device_t {
opal_object_t super;
struct ibv_device *ib_dev; /* the ib device */
#if OPAL_ENABLE_PROGRESS_THREADS == 1
struct ibv_comp_channel *ib_channel; /* Channel event for the device */
opal_thread_t thread; /* Progress thread */
volatile bool progress; /* Progress status */
#endif
opal_mutex_t device_lock; /* device level lock */
struct ibv_context *ib_dev_context;
#if HAVE_DECL_IBV_EXP_QUERY_DEVICE
struct ibv_exp_device_attr ib_exp_dev_attr;
#endif
struct ibv_device_attr ib_dev_attr;
struct ibv_pd *ib_pd;
struct ibv_cq *ib_cq[BTL_OPENIB_MAX_CQ];
uint32_t cq_size[BTL_OPENIB_MAX_CQ];
mca_mpool_base_module_t *mpool;
mca_rcache_base_module_t *rcache;
/* MTU for this device */
uint32_t mtu;
/* Whether this device supports eager RDMA */
uint8_t use_eager_rdma;
uint8_t btls; /** < number of btls using this device */
opal_pointer_array_t *endpoints;
opal_pointer_array_t *device_btls;
uint16_t hp_cq_polls;
uint16_t eager_rdma_polls;
bool pollme;
volatile bool got_fatal_event;
volatile bool got_port_event;
#if HAVE_XRC
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
struct ibv_xrcd *xrcd;
#else
struct ibv_xrc_domain *xrc_domain;
#endif
int xrc_fd;
#endif
opal_atomic_int32_t non_eager_rdma_endpoints;
opal_atomic_int32_t eager_rdma_buffers_count;
struct mca_btl_base_endpoint_t **eager_rdma_buffers;
/**< frags for control messages */
opal_free_list_t send_free_control;
/* QP types and attributes that will be used on this device */
mca_btl_openib_device_qp_t *qps;
/* Maximum value supported by this device for max_inline_data */
uint32_t max_inline_data;
/* Registration limit and current count */
uint64_t mem_reg_max, mem_reg_max_total, mem_reg_active;
/* Device is ready for use */
bool ready_for_use;
/* Async event */
opal_event_t async_event;
} mca_btl_openib_device_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_device_t);
struct mca_btl_openib_module_pp_qp_t {
int32_t dummy;
}; typedef struct mca_btl_openib_module_pp_qp_t mca_btl_openib_module_pp_qp_t;
struct mca_btl_openib_module_srq_qp_t {
struct ibv_srq *srq;
opal_atomic_int32_t rd_posted;
opal_atomic_int32_t sd_credits; /* the max number of outstanding sends on a QP when using SRQ */
/* i.e. the number of frags that can be outstanding (down counter) */
opal_list_t pending_frags[2]; /**< list of high/low prio frags */
/** The number of receive buffers that can currently be posted.
The value may be increased in the IBV_EVENT_SRQ_LIMIT_REACHED
event handler. The value starts at (rd_num / 4) and is increased up to rd_num */
int32_t rd_curr_num;
/** We post additional WQEs only if the number of WQEs (in a specific SRQ) is less than this value.
The value is increased together with rd_curr_num. The value is unique for every SRQ. */
int32_t rd_low_local;
/** Flag indicating whether we want to receive
IBV_EVENT_SRQ_LIMIT_REACHED events for dynamically resizing the SRQ */
bool srq_limit_event_flag;
/**< Unlike the "--mca enable_srq_resize" parameter, which says whether we want
to start with a small number of pre-posted receive buffers (rd_curr_num) and increase that number as needed
(the max of this value is rd_num * the whole size of SRQ), "srq_limit_event_flag" says whether we want a limit event
from the device when the defined SRQ limit is reached (a signal to the main thread). We clear this flag once rd_curr_num
has been increased up to rd_num.
To avoid lock/unlock operations in the critical path, we only set
srq_limit_event_flag in the asynchronous thread. This way we post receive buffers
in the main thread only, and only after posting do we set (if srq_limit_event_flag is true)
the limit for the IBV_EVENT_SRQ_LIMIT_REACHED event. */
}; typedef struct mca_btl_openib_module_srq_qp_t mca_btl_openib_module_srq_qp_t;
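
The comments in this struct describe a dynamically resized SRQ: `rd_curr_num` starts at `rd_num / 4`, grows on each IBV_EVENT_SRQ_LIMIT_REACHED event, and the limit event is no longer requested once `rd_curr_num` reaches `rd_num`. The following sketch models only that bookkeeping. The type and function names are hypothetical, and doubling `rd_curr_num` on each event is an assumption for illustration, since the growth step is not stated in this header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the resize state; not the removed Open MPI struct. */
struct srq_sketch {
    int32_t rd_num;            /* full size of the SRQ */
    int32_t rd_curr_num;       /* buffers that may currently be posted */
    bool srq_limit_event_flag; /* whether a limit event is still wanted */
};

/* Start with a quarter of the full size, as the header comment describes. */
void srq_sketch_init(struct srq_sketch *s, int32_t rd_num)
{
    s->rd_num = rd_num;
    s->rd_curr_num = rd_num / 4;
    if (s->rd_curr_num < 1) {
        s->rd_curr_num = 1;
    }
    s->srq_limit_event_flag = (s->rd_curr_num < rd_num);
}

/* Model of handling one SRQ-limit-reached event. */
void srq_sketch_on_limit_reached(struct srq_sketch *s)
{
    /* ASSUMPTION: grow by doubling; cap at rd_num without overflowing. */
    if (s->rd_curr_num > s->rd_num / 2) {
        s->rd_curr_num = s->rd_num;
    } else {
        s->rd_curr_num *= 2;
    }
    /* Once the SRQ is at full size, further limit events are not needed. */
    s->srq_limit_event_flag = (s->rd_curr_num < s->rd_num);
}
```

With `rd_num = 128`, the model goes 32, 64, 128 and then stops requesting limit events.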
struct mca_btl_openib_module_qp_t {
union {
mca_btl_openib_module_pp_qp_t pp_qp;
mca_btl_openib_module_srq_qp_t srq_qp;
} u;
}; typedef struct mca_btl_openib_module_qp_t mca_btl_openib_module_qp_t;
/**
* IB BTL Interface
*/
struct mca_btl_openib_module_t {
/* Base BTL module */
mca_btl_base_module_t super;
bool btl_inited;
bool srqs_created;
/** Common information about all ports */
mca_btl_openib_modex_message_t port_info;
/** Array of CPCs on this port */
opal_btl_openib_connect_base_module_t **cpcs;
/** Number of elements in the cpcs array */
uint8_t num_cpcs;
mca_btl_openib_device_t *device;
uint8_t port_num; /**< ID of the PORT */
uint16_t pkey_index;
struct ibv_port_attr ib_port_attr;
uint16_t lid; /**< lid that is actually used (for LMC) */
int apm_port; /**< Alternative port that may be used for APM */
uint8_t src_path_bits; /**< offset from base lid (for LMC) */
opal_atomic_int32_t num_peers;
opal_mutex_t ib_lock; /**< module level lock */
size_t eager_rdma_frag_size; /**< length of eager frag */
opal_atomic_int32_t eager_rdma_channels; /**< number of open RDMA channels */
mca_btl_base_module_error_cb_fn_t error_cb; /**< error handler */
mca_btl_openib_module_qp_t * qps;
int local_procs; /** number of local procs */
bool atomic_ops_be; /** atomic result is big endian */
bool allowed; /** is this port allowed */
};
typedef struct mca_btl_openib_module_t mca_btl_openib_module_t;
extern mca_btl_openib_module_t mca_btl_openib_module;
struct mca_btl_base_registration_handle_t {
uint32_t rkey;
uint32_t lkey;
};
struct mca_btl_openib_reg_t {
mca_rcache_base_registration_t base;
struct ibv_mr *mr;
mca_btl_base_registration_handle_t btl_handle;
};
typedef struct mca_btl_openib_reg_t mca_btl_openib_reg_t;
#if OPAL_ENABLE_PROGRESS_THREADS == 1
extern void* mca_btl_openib_progress_thread(opal_object_t*);
#endif
/**
* Register a callback function that is called on error.
*
* @param btl (IN) BTL module
* @return Status indicating if cleanup was successful
*/
int mca_btl_openib_register_error_cb(
struct mca_btl_base_module_t* btl,
mca_btl_base_module_error_cb_fn_t cbfunc
);
/**
* Cleanup any resources held by the BTL.
*
* @param btl BTL instance.
* @return OPAL_SUCCESS or error status on failure.
*/
extern int mca_btl_openib_finalize(
struct mca_btl_base_module_t* btl
);
/**
* PML->BTL notification of change in the process list.
*
* @param btl (IN) BTL module
* @param nprocs (IN) Number of processes
* @param procs (IN) Set of processes
* @param peers (OUT) Set of (optional) peer addressing info.
* @param reachable (IN/OUT) Set of processes that are reachable via this BTL.
* @return OPAL_SUCCESS or error status on failure.
*
*/
extern int mca_btl_openib_add_procs(
struct mca_btl_base_module_t* btl,
size_t nprocs,
struct opal_proc_t **procs,
struct mca_btl_base_endpoint_t** peers,
opal_bitmap_t* reachable
);
/**
* PML->BTL notification of change in the process list.
*
* @param btl (IN) BTL instance
* @param nprocs (IN) Number of processes.
* @param procs (IN) Set of processes.
* @param peers (IN) Set of peer data structures.
* @return Status indicating if cleanup was successful
*
*/
extern int mca_btl_openib_del_procs(
struct mca_btl_base_module_t* btl,
size_t nprocs,
struct opal_proc_t **procs,
struct mca_btl_base_endpoint_t** peers
);
/**
* PML->BTL Initiate a send of the specified size.
*
* @param btl (IN) BTL instance
* @param btl_peer (IN) BTL peer addressing
* @param descriptor (IN) Descriptor of data to be transmitted.
* @param tag (IN) Tag.
*/
extern int mca_btl_openib_send(
struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* btl_peer,
struct mca_btl_base_descriptor_t* descriptor,
mca_btl_base_tag_t tag
);
/**
* PML->BTL Initiate an immediate send of the specified size.
*
* @param btl (IN) BTL instance
* @param ep (IN) Endpoint
* @param convertor (IN) Datatypes converter
* @param header (IN) PML header
* @param header_size (IN) PML header size
* @param payload_size (IN) Payload size
* @param order (IN) Order
* @param flags (IN) Flags
* @param tag (IN) Tag
* @param descriptor (OUT) Messages descriptor
*/
extern int mca_btl_openib_sendi( struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* ep,
struct opal_convertor_t* convertor,
void* header,
size_t header_size,
size_t payload_size,
uint8_t order,
uint32_t flags,
mca_btl_base_tag_t tag,
mca_btl_base_descriptor_t** descriptor
);
/* forward declaration for internal put/get */
struct mca_btl_openib_put_frag_t;
struct mca_btl_openib_get_frag_t;
/**
* @brief Schedule a put fragment with the HCA (internal)
*
* @param btl (IN) BTL instance
* @param ep (IN) BTL endpoint
* @param frag (IN) Fragment prepared by mca_btl_openib_put
*
* If the fragment can not be scheduled due to resource limitations then
* the fragment will be put on the pending put fragment list and retried
* when another get/put fragment has completed.
*/
int mca_btl_openib_put_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
struct mca_btl_openib_put_frag_t *frag);
/**
* @brief Schedule an RDMA write with the HCA
*
* @param btl (IN) BTL instance
* @param ep (IN) BTL endpoint
* @param local_address (IN) Source address
* @param remote_address (IN) Destination address
* @param local_handle (IN) Registration handle for region containing the region {local_address, size}
* @param remote_handle (IN) Registration handle for region containing the region {remote_address, size}
* @param size (IN) Number of bytes to write
* @param flags (IN) Transfer flags
* @param order (IN) Ordering
* @param cbfunc (IN) Function to call on completion
* @param cbcontext (IN) Context for completion callback
* @param cbdata (IN) Data for completion callback
*
* @return OPAL_ERR_BAD_PARAM if a bad parameter was passed
* @return OPAL_SUCCESS if the operation was successfully scheduled
*
* This function will attempt to schedule a put operation with the HCA.
*/
int mca_btl_openib_put (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint, void *local_address,
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);
/**
* @brief Schedule a get fragment with the HCA (internal)
*
* @param btl (IN) BTL instance
* @param ep (IN) BTL endpoint
* @param qp (IN) ID of queue pair to schedule the get on
* @param frag (IN) Fragment prepared by mca_btl_openib_get
*
* If the fragment can not be scheduled due to resource limitations then
* the fragment will be put on the pending get fragment list and retried
* when another get/put fragment has completed.
*/
int mca_btl_openib_get_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
struct mca_btl_openib_get_frag_t *frag);
/**
* @brief Schedule an RDMA read with the HCA
*
* @param btl (IN) BTL instance
* @param ep (IN) BTL endpoint
* @param local_address (IN) Destination address
* @param remote_address (IN) Source address
* @param local_handle (IN) Registration handle for region containing the region {local_address, size}
* @param remote_handle (IN) Registration handle for region containing the region {remote_address, size}
* @param size (IN) Number of bytes to read
* @param flags (IN) Transfer flags
* @param order (IN) Ordering
* @param cbfunc (IN) Function to call on completion
* @param cbcontext (IN) Context for completion callback
* @param cbdata (IN) Data for completion callback
*
* @return OPAL_ERR_BAD_PARAM if a bad parameter was passed
* @return OPAL_SUCCESS if the operation was successfully scheduled
*
* This function will attempt to schedule a get operation with the HCA.
*/
int mca_btl_openib_get (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint, void *local_address,
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);
/**
* Initiate an asynchronous fetching atomic operation.
* Completion Semantics: if this function returns a 1 then the operation
* is complete. a return of OPAL_SUCCESS indicates
* the atomic operation has been queued with the
* network.
*
* @param btl (IN) BTL module
* @param endpoint (IN) BTL addressing information
* @param local_address (OUT) Local address to store the result in
* @param remote_address (IN) Remote address to perform the operation on (registered remotely)
* @param local_handle (IN) Local registration handle for region containing
* (local_address, local_address + 8)
* @param remote_handle (IN) Remote registration handle for region containing
* (remote_address, remote_address + 8)
* @param op (IN) Operation to perform
* @param operand (IN) Operand for the operation
* @param flags (IN) Flags for this put operation
* @param order (IN) Ordering
* @param cbfunc (IN) Function to call on completion (if queued)
* @param cbcontext (IN) Context for the callback
* @param cbdata (IN) Data for callback
*
* @retval OPAL_SUCCESS The operation was successfully queued
* @retval 1 The operation is complete
* @retval OPAL_ERROR The operation was NOT successfully queued
* @retval OPAL_ERR_OUT_OF_RESOURCE Insufficient resources to queue the atomic
* operation. Try again later
* @retval OPAL_ERR_NOT_AVAILABLE Atomic operation can not be performed due to
* alignment restrictions or the operation {op} is not supported
* by the hardware.
*
* After the operation is complete the remote address specified by {remote_address} and
* {remote_handle} will be updated with (*remote_address) = (*remote_address) op operand.
* {local_address} will be updated with the previous value stored in {remote_address}.
* The btl will guarantee consistency of atomic operations performed via the btl. Note,
* however, that not all btls will provide consistency between btl atomic operations and
* cpu atomics.
*/
int mca_btl_openib_atomic_fop (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address,
struct mca_btl_base_registration_handle_t *local_handle,
struct mca_btl_base_registration_handle_t *remote_handle, mca_btl_base_atomic_op_t op,
uint64_t operand, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata);
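
The fetching atomic documented above updates the remote location with `(*remote_address) op operand` and returns the previous value in `local_address`. As a purely local analogy (no network, no HCA), the same result semantics can be sketched with C11 atomics. The function names are hypothetical, addition stands in for `op`, and the compare-and-swap variant mirrors the operation documented in the next comment block.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical local model of a fetching add: the target is updated
 * atomically and the value it held before the update is returned. */
uint64_t sketch_fetch_add(_Atomic uint64_t *remote, uint64_t operand)
{
    return atomic_fetch_add(remote, operand);
}

/* Hypothetical local model of compare-and-swap: store `value` only if the
 * target equals `compare`; return the previous value in either case. */
uint64_t sketch_cswap(_Atomic uint64_t *remote, uint64_t compare,
                      uint64_t value)
{
    uint64_t expected = compare;
    /* On failure, expected is overwritten with the current contents;
     * on success it still holds compare, which was the previous value. */
    atomic_compare_exchange_strong(remote, &expected, value);
    return expected;
}
```

Starting from a cell holding 10, `sketch_fetch_add(&cell, 5)` returns 10 and leaves 15 in the cell. A following `sketch_cswap(&cell, 15, 99)` returns 15 and stores 99, while a compare against a stale value leaves the cell unchanged and returns its current contents.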
/**
* Initiate an asynchronous compare and swap operation.
* Completion Semantics: if this function returns a 1 then the operation
* is complete. a return of OPAL_SUCCESS indicates
* the atomic operation has been queued with the
* network.
*
* @param btl (IN) BTL module
* @param endpoint (IN) BTL addressing information
* @param local_address (OUT) Local address to store the result in
* @param remote_address (IN) Remote address to perform the operation on (registered remotely)
* @param local_handle (IN) Local registration handle for region containing
* (local_address, local_address + 8)
* @param remote_handle (IN) Remote registration handle for region containing
* (remote_address, remote_address + 8)
* @param compare (IN) Operand for the operation
* @param value (IN) Value to store on success
* @param flags (IN) Flags for this atomic operation
* @param order (IN) Ordering
* @param cbfunc (IN) Function to call on completion (if queued)
* @param cbcontext (IN) Context for the callback
* @param cbdata (IN) Data for callback
*
* @retval OPAL_SUCCESS The operation was successfully queued
* @retval 1 The operation is complete
* @retval OPAL_ERROR The operation was NOT successfully queued
* @retval OPAL_ERR_OUT_OF_RESOURCE Insufficient resources to queue the atomic
* operation. Try again later
* @retval OPAL_ERR_NOT_AVAILABLE Atomic operation can not be performed due to
* alignment restrictions or because compare-and-swap is not supported
* by the hardware.
*
* After the operation is complete the remote address specified by {remote_address} and
* {remote_handle} will be updated with {value} if *remote_address == compare.
* {local_address} will be updated with the previous value stored in {remote_address}.
* The btl will guarantee consistency of atomic operations performed via the btl. Note,
* however, that not all btls will provide consistency between btl atomic operations and
* cpu atomics.
*/
int mca_btl_openib_atomic_cswap (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address,
struct mca_btl_base_registration_handle_t *local_handle,
struct mca_btl_base_registration_handle_t *remote_handle, uint64_t compare,
uint64_t value, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata);
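/**
* Allocate a descriptor.
*
* @param btl (IN) BTL module
* @param endpoint (IN) BTL addressing information
* @param order (IN) Ordering
* @param size (IN) Requested descriptor size.
* @param flags (IN) Flags for the descriptor
*/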
extern mca_btl_base_descriptor_t* mca_btl_openib_alloc(
struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* endpoint,
uint8_t order,
size_t size,
uint32_t flags);
/**
* Return a segment allocated by this BTL.
*
* @param btl (IN) BTL module
* @param des (IN) Allocated descriptor.
*/
extern int mca_btl_openib_free(
struct mca_btl_base_module_t* btl,
mca_btl_base_descriptor_t* des);
/**
* Pack data and return a descriptor that can be
* used for send/put.
*
* @param btl (IN) BTL module
* @param peer (IN) BTL peer addressing
*/
mca_btl_base_descriptor_t* mca_btl_openib_prepare_src(
struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* peer,
struct opal_convertor_t* convertor,
uint8_t order,
size_t reserve,
size_t* size,
uint32_t flags
);
extern void mca_btl_openib_frag_progress_pending_put_get(
struct mca_btl_base_endpoint_t*, const int);
/**
* Fault Tolerance Event Notification Function
*
* @param state (IN) Checkpoint State
* @return OPAL_SUCCESS or failure status
*/
extern int mca_btl_openib_ft_event(int state);
/**
* Show an error during init, particularly when running out of
* registered memory.
*/
void mca_btl_openib_show_init_error(const char *file, int line,
const char *func, const char *dev);
/**
* Post receive buffers to the Shared Receive Queue of a queue pair
*
* @param openib_btl (IN) BTL module
* @param qp (IN) Index of the queue pair whose SRQ is replenished
* @return OPAL_SUCCESS or failure status
*/
int mca_btl_openib_post_srr(mca_btl_openib_module_t* openib_btl, const int qp);
/**
* Get the transport name that corresponds to a BTL transport type.
*/
const char* btl_openib_get_transport_name(mca_btl_openib_transport_type_t transport_type);
/**
* Get an endpoint for a process
*
* @param btl (IN) BTL module
* @param proc (IN) opal process object
*
* This function returns the existing endpoint if there is one; otherwise it
* allocates a new endpoint and returns it.
*/
struct mca_btl_base_endpoint_t *mca_btl_openib_get_ep (struct mca_btl_base_module_t *btl,
struct opal_proc_t *proc);
/**
* Get the transport type of a BTL.
*/
mca_btl_openib_transport_type_t mca_btl_openib_get_transport_type(mca_btl_openib_module_t* openib_btl);
static inline int qp_cq_prio(const int qp)
{
if(0 == qp)
return BTL_OPENIB_HP_CQ; /* smallest qp is always HP */
/* If the size for this qp is <= the eager limit, make it a
high priority QP. Otherwise, make it a low priority QP. */
return (mca_btl_openib_component.qp_infos[qp].size <=
mca_btl_openib_component.eager_limit) ?
BTL_OPENIB_HP_CQ : BTL_OPENIB_LP_CQ;
}
#define BTL_OPENIB_RDMA_QP(QP) \
((QP) == mca_btl_openib_component.rdma_qp)
/**
* Run function as part of opal_progress()
*
* @param[in] fn function to run
* @param[in] arg function data
*/
int mca_btl_openib_run_in_main (void *(*fn)(void *), void *arg);
END_C_DECLS
#endif /* MCA_BTL_IB_H */
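
The qp_cq_prio() helper in the header above maps a queue pair index to a completion queue priority: QP 0 is always high priority, and any other QP is high priority only if its buffer size does not exceed the eager limit. The following is a minimal sketch of that rule. The enum, the size array, and the function name are hypothetical stand-ins for the component state and are not the Open MPI definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for BTL_OPENIB_HP_CQ / BTL_OPENIB_LP_CQ. */
enum sketch_cq_prio { SKETCH_HP_CQ = 0, SKETCH_LP_CQ = 1 };

/* qp_sizes[i] is the assumed buffer size configured for queue pair i. */
static enum sketch_cq_prio sketch_qp_cq_prio(const size_t *qp_sizes,
                                             size_t eager_limit, int qp)
{
    /* The first (smallest) queue pair is always high priority. */
    if (0 == qp) {
        return SKETCH_HP_CQ;
    }
    /* Buffers that fit within the eager limit stay high priority. */
    return (qp_sizes[qp] <= eager_limit) ? SKETCH_HP_CQ : SKETCH_LP_CQ;
}
```

With sizes {128, 4096, 65536} and an eager limit of 12288, the first two queue pairs map to the high-priority CQ and the third to the low-priority CQ.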


@ -1,508 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2008-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved
* Copyright (c) 2013-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <infiniband/verbs.h>
#include <fcntl.h>
#include <sys/poll.h>
#include <unistd.h>
#include <errno.h>
#include "opal/util/show_help.h"
#include "opal/util/proc.h"
#include "opal/mca/btl/base/base.h"
#include "btl_openib.h"
#include "btl_openib_mca.h"
#include "btl_openib_async.h"
#include "btl_openib_proc.h"
#include "btl_openib_endpoint.h"
static opal_list_t ignore_qp_err_list;
static opal_mutex_t ignore_qp_err_list_lock;
static opal_atomic_int32_t btl_openib_async_device_count = 0;
struct mca_btl_openib_async_poll {
int active_poll_size;
int poll_size;
struct pollfd *async_pollfd;
};
typedef struct mca_btl_openib_async_poll mca_btl_openib_async_poll;
typedef struct {
opal_list_item_t super;
struct ibv_qp *qp;
} mca_btl_openib_qp_list;
OBJ_CLASS_INSTANCE(mca_btl_openib_qp_list, opal_list_item_t, NULL, NULL);
static const char *openib_event_to_str (enum ibv_event_type event);
/* Convert an event type to its name as a string.
* OpenFabrics does not provide a function that does this.
*/
static const char *openib_event_to_str (enum ibv_event_type event)
{
switch (event) {
case IBV_EVENT_CQ_ERR:
return "IBV_EVENT_CQ_ERR";
case IBV_EVENT_QP_FATAL:
return "IBV_EVENT_QP_FATAL";
case IBV_EVENT_QP_REQ_ERR:
return "IBV_EVENT_QP_REQ_ERR";
case IBV_EVENT_QP_ACCESS_ERR:
return "IBV_EVENT_QP_ACCESS_ERR";
case IBV_EVENT_PATH_MIG:
return "IBV_EVENT_PATH_MIG";
case IBV_EVENT_PATH_MIG_ERR:
return "IBV_EVENT_PATH_MIG_ERR";
case IBV_EVENT_DEVICE_FATAL:
return "IBV_EVENT_DEVICE_FATAL";
case IBV_EVENT_SRQ_ERR:
return "IBV_EVENT_SRQ_ERR";
case IBV_EVENT_PORT_ERR:
return "IBV_EVENT_PORT_ERR";
case IBV_EVENT_COMM_EST:
return "IBV_EVENT_COMM_EST";
case IBV_EVENT_PORT_ACTIVE:
return "IBV_EVENT_PORT_ACTIVE";
case IBV_EVENT_SQ_DRAINED:
return "IBV_EVENT_SQ_DRAINED";
case IBV_EVENT_LID_CHANGE:
return "IBV_EVENT_LID_CHANGE";
case IBV_EVENT_PKEY_CHANGE:
return "IBV_EVENT_PKEY_CHANGE";
case IBV_EVENT_SM_CHANGE:
return "IBV_EVENT_SM_CHANGE";
case IBV_EVENT_QP_LAST_WQE_REACHED:
return "IBV_EVENT_QP_LAST_WQE_REACHED";
#if HAVE_DECL_IBV_EVENT_CLIENT_REREGISTER
case IBV_EVENT_CLIENT_REREGISTER:
return "IBV_EVENT_CLIENT_REREGISTER";
#endif
case IBV_EVENT_SRQ_LIMIT_REACHED:
return "IBV_EVENT_SRQ_LIMIT_REACHED";
default:
return "UNKNOWN";
}
}
/* QP to endpoint */
static mca_btl_openib_endpoint_t * qp2endpoint(struct ibv_qp *qp, mca_btl_openib_device_t *device)
{
mca_btl_openib_endpoint_t *ep;
int ep_i, qp_i;
for(ep_i = 0; ep_i < opal_pointer_array_get_size(device->endpoints); ep_i++) {
ep = opal_pointer_array_get_item(device->endpoints, ep_i);
for(qp_i = 0; qp_i < mca_btl_openib_component.num_qps; qp_i++) {
if (qp == ep->qps[qp_i].qp->lcl_qp)
return ep;
}
}
return NULL;
}
#if OPAL_HAVE_CONNECTX_XRC
/* XRC receive QP to endpoint */
static mca_btl_openib_endpoint_t * xrc_qp2endpoint(uint32_t qp_num, mca_btl_openib_device_t *device)
{
mca_btl_openib_endpoint_t *ep;
int ep_i;
for(ep_i = 0; ep_i < opal_pointer_array_get_size(device->endpoints); ep_i++) {
ep = opal_pointer_array_get_item(device->endpoints, ep_i);
if (qp_num == ep->xrc_recv_qp_num)
return ep;
}
return NULL;
}
#endif
/* Function inits mca_btl_openib_async_poll */
/* The main idea of the SRQ resizing algorithm:
We create an SRQ with size = rd_num, but to use resources efficiently
the number of WQEs that we post is rd_curr_num < rd_num. This value is
increased as needed in the IBV_EVENT_SRQ_LIMIT_REACHED event handler (i.e., in this function).
The device raises the event when the number of WQEs in the SRQ drops below srq_limit. */
static int btl_openib_async_srq_limit_event(struct ibv_srq* srq)
{
int qp, rc = OPAL_SUCCESS;
mca_btl_openib_module_t *openib_btl = NULL;
opal_mutex_t *lock = &mca_btl_openib_component.srq_manager.lock;
opal_hash_table_t *srq_addr_table = &mca_btl_openib_component.srq_manager.srq_addr_table;
opal_mutex_lock(lock);
if (OPAL_SUCCESS != opal_hash_table_get_value_ptr(srq_addr_table,
&srq, sizeof(struct ibv_srq*), (void*) &openib_btl)) {
/* If there isn't any element with the key in the table =>
we assume that SRQ was destroyed and don't serve the event */
goto srq_limit_event_exit;
}
for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++) {
if (!BTL_OPENIB_QP_TYPE_PP(qp)) {
if(openib_btl->qps[qp].u.srq_qp.srq == srq) {
break;
}
}
}
if(qp >= mca_btl_openib_component.num_qps) {
BTL_ERROR(("Open MPI tried to access a shared receive queue (SRQ) on the device %s that was not found. This should not happen, and is a fatal error. Your MPI job will now abort.\n", ibv_get_device_name(openib_btl->device->ib_dev)));
rc = OPAL_ERROR;
goto srq_limit_event_exit;
}
/* dynamically re-size the SRQ to be larger */
openib_btl->qps[qp].u.srq_qp.rd_curr_num <<= 1;
if(openib_btl->qps[qp].u.srq_qp.rd_curr_num >=
mca_btl_openib_component.qp_infos[qp].rd_num) {
openib_btl->qps[qp].u.srq_qp.rd_curr_num = mca_btl_openib_component.qp_infos[qp].rd_num;
openib_btl->qps[qp].u.srq_qp.rd_low_local = mca_btl_openib_component.qp_infos[qp].rd_low;
openib_btl->qps[qp].u.srq_qp.srq_limit_event_flag = false;
goto srq_limit_event_exit;
}
openib_btl->qps[qp].u.srq_qp.rd_low_local <<= 1;
openib_btl->qps[qp].u.srq_qp.srq_limit_event_flag = true;
srq_limit_event_exit:
opal_mutex_unlock(lock);
return rc;
}
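
The growth step performed by btl_openib_async_srq_limit_event() above can be sketched in isolation, as a rough illustration. The struct and function names below are hypothetical, and the reading of srq_limit_event_flag as "the limit event should be armed again" is an assumption drawn only from how the handler sets it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the per-QP SRQ bookkeeping. */
struct sketch_srq {
    int32_t rd_num;        /* configured maximum number of receive WQEs */
    int32_t rd_low;        /* configured low watermark at full size */
    int32_t rd_curr_num;   /* number of WQEs currently in use */
    int32_t rd_low_local;  /* low watermark in effect while growing */
    bool    rearm_limit_event; /* assumed meaning of srq_limit_event_flag */
};

/* Double the working size; clamp to the configured maximum. */
static void sketch_srq_grow(struct sketch_srq *srq)
{
    srq->rd_curr_num *= 2;
    if (srq->rd_curr_num >= srq->rd_num) {
        /* Fully grown: use the configured values, stop asking for events. */
        srq->rd_curr_num = srq->rd_num;
        srq->rd_low_local = srq->rd_low;
        srq->rearm_limit_event = false;
        return;
    }
    /* Still growing: scale the watermark too and keep the event armed. */
    srq->rd_low_local *= 2;
    srq->rearm_limit_event = true;
}
```

Starting from 32 posted WQEs with a maximum of 256, three limit events are enough to reach the full size, after which the sketch stops requesting further events.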
/* Handle async device events */
static void btl_openib_async_device (int fd, short flags, void *arg)
{
mca_btl_openib_device_t *device = (mca_btl_openib_device_t *) arg;
struct ibv_async_event event;
int event_type;
if (ibv_get_async_event((struct ibv_context *)device->ib_dev_context,&event) < 0) {
if (EWOULDBLOCK != errno) {
BTL_ERROR(("Failed to get async event"));
}
return;
}
event_type = event.event_type;
#if OPAL_HAVE_CONNECTX_XRC
/* Is it an XRC event? */
bool xrc_event = false;
if (IBV_XRC_QP_EVENT_FLAG & event.event_type) {
xrc_event = true;
/* Clear the bit and handle as usual */
event_type ^= IBV_XRC_QP_EVENT_FLAG;
}
#endif
switch(event_type) {
case IBV_EVENT_PATH_MIG:
BTL_ERROR(("Alternative path migration event reported"));
if (APM_ENABLED) {
BTL_ERROR(("Trying to find additional path..."));
#if OPAL_HAVE_CONNECTX_XRC
if (xrc_event)
mca_btl_openib_load_apm_xrc_rcv(event.element.xrc_qp_num,
xrc_qp2endpoint(event.element.xrc_qp_num, device));
else
#endif
mca_btl_openib_load_apm(event.element.qp,
qp2endpoint(event.element.qp, device));
}
break;
case IBV_EVENT_DEVICE_FATAL:
/* Set the flag to fatal */
device->got_fatal_event = true;
/* It is not critical to protect the counter */
OPAL_THREAD_ADD_FETCH32(&mca_btl_openib_component.error_counter, 1);
/* fall through */
case IBV_EVENT_CQ_ERR:
case IBV_EVENT_QP_FATAL:
if (event_type == IBV_EVENT_QP_FATAL) {
mca_btl_openib_qp_list *qp_item;
bool in_ignore_list = false;
BTL_VERBOSE(("QP is in err state %p", (void *)event.element.qp));
/* look through ignore list */
opal_mutex_lock (&ignore_qp_err_list_lock);
OPAL_LIST_FOREACH(qp_item, &ignore_qp_err_list, mca_btl_openib_qp_list) {
if (qp_item->qp == event.element.qp) {
BTL_VERBOSE(("QP %p is in error ignore list",
(void *)event.element.qp));
in_ignore_list = true;
break;
}
}
opal_mutex_unlock (&ignore_qp_err_list_lock);
if (in_ignore_list) {
break;
}
}
/* fall through */
case IBV_EVENT_QP_REQ_ERR:
case IBV_EVENT_QP_ACCESS_ERR:
case IBV_EVENT_PATH_MIG_ERR:
case IBV_EVENT_SRQ_ERR:
opal_show_help("help-mpi-btl-openib.txt", "of error event",
true,opal_process_info.nodename, (int)getpid(),
event_type,
openib_event_to_str((enum ibv_event_type)event_type));
break;
case IBV_EVENT_PORT_ERR:
opal_show_help("help-mpi-btl-openib.txt", "of error event",
true,opal_process_info.nodename, (int)getpid(),
event_type,
openib_event_to_str((enum ibv_event_type)event_type));
/* Set the flag to indicate port error */
device->got_port_event = true;
OPAL_THREAD_ADD_FETCH32(&mca_btl_openib_component.error_counter, 1);
break;
case IBV_EVENT_COMM_EST:
case IBV_EVENT_PORT_ACTIVE:
case IBV_EVENT_SQ_DRAINED:
case IBV_EVENT_LID_CHANGE:
case IBV_EVENT_PKEY_CHANGE:
case IBV_EVENT_SM_CHANGE:
case IBV_EVENT_QP_LAST_WQE_REACHED:
#if HAVE_DECL_IBV_EVENT_CLIENT_REREGISTER
case IBV_EVENT_CLIENT_REREGISTER:
#endif
break;
/* The event is signaled when the number of preposted receive WQEs drops
below the predefined threshold, srq_limit */
case IBV_EVENT_SRQ_LIMIT_REACHED:
(void) btl_openib_async_srq_limit_event (event.element.srq);
break;
default:
opal_show_help("help-mpi-btl-openib.txt", "of unknown event",
true,opal_process_info.nodename, (int)getpid(),
event_type);
}
ibv_ack_async_event(&event);
}
static void apm_update_attr(struct ibv_qp_attr *attr, enum ibv_qp_attr_mask *mask)
{
*mask = IBV_QP_ALT_PATH|IBV_QP_PATH_MIG_STATE;
attr->alt_ah_attr.dlid = attr->ah_attr.dlid + 1;
attr->alt_ah_attr.src_path_bits = attr->ah_attr.src_path_bits + 1;
attr->alt_ah_attr.static_rate = attr->ah_attr.static_rate;
attr->alt_ah_attr.sl = attr->ah_attr.sl;
attr->alt_pkey_index = attr->pkey_index;
attr->alt_port_num = attr->port_num;
attr->alt_timeout = attr->timeout;
attr->path_mig_state = IBV_MIG_REARM;
BTL_VERBOSE(("New APM LMC loaded: alt_src_port:%d, dlid: %d, src_bits %d, old_src_bits: %d, old_dlid %d",
attr->alt_port_num, attr->alt_ah_attr.dlid,
attr->alt_ah_attr.src_path_bits, attr->ah_attr.src_path_bits, attr->ah_attr.dlid));
}
static int apm_update_port(mca_btl_openib_endpoint_t *ep,
struct ibv_qp_attr *attr, enum ibv_qp_attr_mask *mask)
{
size_t port_i;
uint16_t apm_lid = 0;
if (attr->port_num == ep->endpoint_btl->apm_port) {
/* all ports were used */
BTL_ERROR(("APM: already all ports were used port_num %d apm_port %d",
attr->port_num, ep->endpoint_btl->apm_port));
return OPAL_ERROR;
}
/* look for an alternative LID on the remote side */
for(port_i = 0; port_i < ep->endpoint_proc->proc_port_count; port_i++) {
if (ep->endpoint_proc->proc_ports[port_i].pm_port_info.lid == attr->ah_attr.dlid - mca_btl_openib_component.apm_lmc) {
apm_lid = ep->endpoint_proc->proc_ports[port_i].pm_port_info.apm_lid;
}
}
if (0 == apm_lid) {
/* Was APM disabled on one of the sides? */
BTL_VERBOSE(("APM: Was disabled ? dlid %d %d %d", attr->ah_attr.dlid, attr->ah_attr.src_path_bits, ep->endpoint_btl->src_path_bits));
return OPAL_ERROR;
}
/* We assume that the LMC is the same on all ports */
attr->alt_ah_attr.static_rate = attr->ah_attr.static_rate;
attr->alt_ah_attr.sl = attr->ah_attr.sl;
attr->alt_pkey_index = attr->pkey_index;
attr->alt_timeout = attr->timeout;
attr->path_mig_state = IBV_MIG_REARM;
*mask = IBV_QP_ALT_PATH|IBV_QP_PATH_MIG_STATE;
attr->alt_port_num = ep->endpoint_btl->apm_port;
attr->alt_ah_attr.src_path_bits = ep->endpoint_btl->src_path_bits;
attr->alt_ah_attr.dlid = apm_lid;
BTL_VERBOSE(("New APM port loaded: alt_src_port:%d, dlid: %d, src_bits: %d:%d, old_dlid %d",
attr->alt_port_num, attr->alt_ah_attr.dlid,
attr->ah_attr.src_path_bits, attr->alt_ah_attr.src_path_bits,
attr->ah_attr.dlid));
return OPAL_SUCCESS;
}
/* Load new dlid to the QP */
void mca_btl_openib_load_apm(struct ibv_qp *qp, mca_btl_openib_endpoint_t *ep)
{
struct ibv_qp_init_attr qp_init_attr;
struct ibv_qp_attr attr;
enum ibv_qp_attr_mask mask = 0;
struct mca_btl_openib_module_t *btl;
BTL_VERBOSE(("APM: Loading alternative path"));
assert (NULL != ep);
btl = ep->endpoint_btl;
if (ibv_query_qp(qp, &attr, mask, &qp_init_attr))
BTL_ERROR(("Failed to ibv_query_qp, qp num: %d", qp->qp_num));
if (mca_btl_openib_component.apm_lmc &&
attr.ah_attr.src_path_bits - btl->src_path_bits < mca_btl_openib_component.apm_lmc) {
BTL_VERBOSE(("APM LMC: src: %d btl_src: %d lmc_max: %d",
attr.ah_attr.src_path_bits,
btl->src_path_bits,
mca_btl_openib_component.apm_lmc));
apm_update_attr(&attr, &mask);
} else {
if (mca_btl_openib_component.apm_ports) {
/* Try to migrate to next port */
if (OPAL_SUCCESS != apm_update_port(ep, &attr, &mask))
return;
} else {
BTL_ERROR(("Failed to load alternative path, all %d were used",
attr.ah_attr.src_path_bits - btl->src_path_bits));
}
}
if (ibv_modify_qp(qp, &attr, mask))
BTL_ERROR(("Failed to ibv_modify_qp, qp num: %d, errno says: %s (%d)",
qp->qp_num, strerror(errno), errno));
}
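
The LMC branch of mca_btl_openib_load_apm(), combined with apm_update_attr(), picks the next alternate path by incrementing both the destination LID and the source path bits for as long as unused LMC paths remain. The following is a minimal sketch of just that arithmetic. The struct and function are hypothetical and omit the verbs attributes (rate, SL, pkey index, timeout, migration state) that the real code copies as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the path fields used by the LMC branch. */
struct sketch_path {
    uint16_t dlid;          /* destination LID */
    uint8_t  src_path_bits; /* source path bits */
};

/* Fill *alt with the next LMC path and return true, or return false when
 * APM-over-LMC is disabled (lmc_paths == 0) or all paths have been used. */
static bool sketch_next_lmc_path(const struct sketch_path *cur,
                                 uint8_t base_src_path_bits,
                                 uint8_t lmc_paths,
                                 struct sketch_path *alt)
{
    if (0 == lmc_paths ||
        (cur->src_path_bits - base_src_path_bits) >= lmc_paths) {
        return false;
    }
    alt->dlid = (uint16_t)(cur->dlid + 1);
    alt->src_path_bits = (uint8_t)(cur->src_path_bits + 1);
    return true;
}
```

When this sketch returns false, the original code falls back to port-based migration through apm_update_port() if apm_ports is set.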
#if OPAL_HAVE_CONNECTX_XRC
void mca_btl_openib_load_apm_xrc_rcv(uint32_t qp_num, mca_btl_openib_endpoint_t *ep)
{
struct ibv_qp_init_attr qp_init_attr;
struct ibv_qp_attr attr;
enum ibv_qp_attr_mask mask = 0;
struct mca_btl_openib_module_t *btl;
BTL_VERBOSE(("APM XRC: Loading alternative path"));
assert (NULL != ep);
btl = ep->endpoint_btl;
if (ibv_query_xrc_rcv_qp(btl->device->xrc_domain, qp_num, &attr, mask, &qp_init_attr))
BTL_ERROR(("Failed to ibv_query_xrc_rcv_qp, qp num: %d", qp_num));
if (mca_btl_openib_component.apm_lmc &&
attr.ah_attr.src_path_bits - btl->src_path_bits < mca_btl_openib_component.apm_lmc) {
apm_update_attr(&attr, &mask);
} else {
if (mca_btl_openib_component.apm_ports) {
/* Try to migrate to next port */
if (OPAL_SUCCESS != apm_update_port(ep, &attr, &mask))
return;
} else {
BTL_ERROR(("Failed to load alternative path, all %d were used",
attr.ah_attr.src_path_bits - btl->src_path_bits));
}
}
ibv_modify_xrc_rcv_qp(btl->device->xrc_domain, qp_num, &attr, mask);
/* Maybe the qp already was modified by other process - ignoring error */
}
#endif
int mca_btl_openib_async_init (void)
{
if (!mca_btl_openib_component.use_async_event_thread ||
mca_btl_openib_component.async_evbase) {
return OPAL_SUCCESS;
}
mca_btl_openib_component.async_evbase = opal_progress_thread_init (NULL);
OBJ_CONSTRUCT(&ignore_qp_err_list, opal_list_t);
OBJ_CONSTRUCT(&ignore_qp_err_list_lock, opal_mutex_t);
/* Set the error counter to zero */
mca_btl_openib_component.error_counter = 0;
return OPAL_SUCCESS;
}
void mca_btl_openib_async_fini (void)
{
if (mca_btl_openib_component.async_evbase) {
OPAL_LIST_DESTRUCT(&ignore_qp_err_list);
OBJ_DESTRUCT(&ignore_qp_err_list_lock);
opal_progress_thread_finalize (NULL);
mca_btl_openib_component.async_evbase = NULL;
}
}
void mca_btl_openib_async_add_device (mca_btl_openib_device_t *device)
{
if (mca_btl_openib_component.async_evbase) {
if (1 == OPAL_THREAD_ADD_FETCH32 (&btl_openib_async_device_count, 1)) {
mca_btl_openib_async_init ();
}
opal_event_set (mca_btl_openib_component.async_evbase, &device->async_event,
device->ib_dev_context->async_fd, OPAL_EV_READ | OPAL_EV_PERSIST,
btl_openib_async_device, device);
opal_event_add (&device->async_event, 0);
}
}
void mca_btl_openib_async_rem_device (mca_btl_openib_device_t *device)
{
if (mca_btl_openib_component.async_evbase) {
opal_event_del (&device->async_event);
if (0 == OPAL_THREAD_ADD_FETCH32 (&btl_openib_async_device_count, -1)) {
mca_btl_openib_async_fini ();
}
}
}
void mca_btl_openib_async_add_qp_ignore (struct ibv_qp *qp)
{
if (mca_btl_openib_component.async_evbase) {
mca_btl_openib_qp_list *new_qp = OBJ_NEW(mca_btl_openib_qp_list);
if (OPAL_UNLIKELY(NULL == new_qp)) {
/* can't allocate a small object. not much more can be done */
return;
}
BTL_VERBOSE(("Ignoring errors on QP %p", (void *) qp));
new_qp->qp = qp;
opal_mutex_lock (&ignore_qp_err_list_lock);
opal_list_append (&ignore_qp_err_list, (opal_list_item_t *) new_qp);
opal_mutex_unlock (&ignore_qp_err_list_lock);
}
}


@ -1,59 +0,0 @@
/*
* Copyright (c) 2007-2008 Mellanox Technologies. All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2015 Los Alamos National Security, LLC. All rights
* received.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_BTL_OPENIB_ASYNC_H
#define MCA_BTL_OPENIB_ASYNC_H
#include "btl_openib_endpoint.h"
void mca_btl_openib_load_apm(struct ibv_qp *qp, mca_btl_openib_endpoint_t *ep);
#if OPAL_HAVE_CONNECTX_XRC
void mca_btl_openib_load_apm_xrc_rcv(uint32_t qp_num, mca_btl_openib_endpoint_t *ep);
#endif
#define APM_ENABLED (0 != mca_btl_openib_component.apm_lmc || 0 != mca_btl_openib_component.apm_ports)
/**
* Initialize the async event base
*/
int mca_btl_openib_async_init (void);
/**
* Finalize the async event base
*/
void mca_btl_openib_async_fini (void);
/**
* Register a device with the async event base
*
* @param[in] device device to register
*/
void mca_btl_openib_async_add_device (mca_btl_openib_device_t *device);
/**
* Deregister a device with the async event base
*
* @param[in] device device to deregister
*/
void mca_btl_openib_async_rem_device (mca_btl_openib_device_t *device);
/**
* Ignore error events on a queue pair
*
* @param[in] qp queue pair to ignore
*/
void mca_btl_openib_async_add_qp_ignore (struct ibv_qp *qp);
#endif


@ -1,140 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2014-2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
#include "btl_openib_proc.h"
#include "btl_openib_xrc.h"
#if HAVE_DECL_IBV_ATOMIC_HCA
static int mca_btl_openib_atomic_internal (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, enum ibv_wr_opcode opcode,
int64_t operand, int64_t operand2, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
mca_btl_openib_get_frag_t* frag = NULL;
int qp = order;
int32_t rkey;
int rc;
frag = to_get_frag(alloc_recv_user_frag());
if (OPAL_UNLIKELY(NULL == frag)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
if (MCA_BTL_NO_ORDER == qp) {
qp = mca_btl_openib_component.rdma_qp;
}
/* set base descriptor flags */
to_base_frag(frag)->base.order = qp;
/* free this descriptor when the operation is complete */
to_base_frag(frag)->base.des_flags = MCA_BTL_DES_FLAGS_BTL_OWNERSHIP;
/* set up scatter-gather entry */
to_com_frag(frag)->sg_entry.length = 8;
to_com_frag(frag)->sg_entry.lkey = local_handle->lkey;
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t) local_address;
to_com_frag(frag)->endpoint = endpoint;
/* set up rdma callback */
frag->cb.func = cbfunc;
frag->cb.context = cbcontext;
frag->cb.data = cbdata;
frag->cb.local_handle = local_handle;
/* set up descriptor */
frag->sr_desc.wr.atomic.remote_addr = remote_address;
frag->sr_desc.opcode = opcode;
frag->sr_desc.wr.atomic.compare_add = operand;
frag->sr_desc.wr.atomic.swap = operand2;
rkey = remote_handle->rkey;
#if OPAL_ENABLE_HETEROGENEOUS_SUPPORT
if((endpoint->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN)
!= (opal_proc_local_get()->proc_arch & OPAL_ARCH_ISBIGENDIAN)) {
rkey = opal_swap_bytes4 (rkey);
}
#endif
frag->sr_desc.wr.atomic.rkey = rkey;
/* NTH: the SRQ# is set in mca_btl_openib_get_internal */
if (endpoint->endpoint_state != MCA_BTL_IB_CONNECTED) {
OPAL_THREAD_LOCK(&endpoint->endpoint_lock);
rc = check_endpoint_state(endpoint, &to_base_frag(frag)->base, &endpoint->pending_get_frags);
OPAL_THREAD_UNLOCK(&endpoint->endpoint_lock);
if (OPAL_ERR_RESOURCE_BUSY == rc) {
return OPAL_SUCCESS;
}
if (OPAL_SUCCESS != rc) {
MCA_BTL_IB_FRAG_RETURN (frag);
return rc;
}
}
rc = mca_btl_openib_get_internal (btl, endpoint, frag);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
if (OPAL_LIKELY(OPAL_ERR_OUT_OF_RESOURCE == rc)) {
rc = OPAL_SUCCESS;
OPAL_THREAD_SCOPED_LOCK(&endpoint->endpoint_lock,
opal_list_append(&endpoint->pending_get_frags, (opal_list_item_t*)frag));
} else {
MCA_BTL_IB_FRAG_RETURN (frag);
}
}
return rc;
}
int mca_btl_openib_atomic_fop (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address,
struct mca_btl_base_registration_handle_t *local_handle,
struct mca_btl_base_registration_handle_t *remote_handle, mca_btl_base_atomic_op_t op,
uint64_t operand, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
if (OPAL_UNLIKELY(MCA_BTL_ATOMIC_ADD != op || (MCA_BTL_ATOMIC_FLAG_32BIT & flags))) {
return OPAL_ERR_NOT_SUPPORTED;
}
return mca_btl_openib_atomic_internal (btl, endpoint, local_address, remote_address, local_handle,
remote_handle, IBV_WR_ATOMIC_FETCH_AND_ADD, operand, 0,
flags, order, cbfunc, cbcontext, cbdata);
}
int mca_btl_openib_atomic_cswap (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address,
struct mca_btl_base_registration_handle_t *local_handle,
struct mca_btl_base_registration_handle_t *remote_handle, uint64_t compare,
uint64_t value, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
if (OPAL_UNLIKELY(MCA_BTL_ATOMIC_FLAG_32BIT & flags)) {
return OPAL_ERR_NOT_SUPPORTED;
}
return mca_btl_openib_atomic_internal (btl, endpoint, local_address, remote_address, local_handle,
remote_handle, IBV_WR_ATOMIC_CMP_AND_SWP, compare, value,
flags, order, cbfunc, cbcontext, cbdata);
}
#endif
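
The two entry points in the file above accept only 64-bit operands, and mca_btl_openib_atomic_fop() accepts only the ADD operation; everything else is rejected before a work request is built. The sketch below imitates those checks and the documented post-conditions, using C11 atomics on local memory instead of verbs work requests. All names are hypothetical, and unlike the real functions it completes synchronously with no callback.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical return codes, operations, and flag for this sketch only. */
enum { SKETCH_OK = 0, SKETCH_ERR_NOT_SUPPORTED = -1 };
enum sketch_op { SKETCH_OP_ADD, SKETCH_OP_AND };
#define SKETCH_FLAG_32BIT 0x1

/* Fetch-and-op: only 64-bit ADD is accepted, as in the code above. */
static int sketch_fetch_op(_Atomic uint64_t *remote, enum sketch_op op,
                           uint64_t operand, int flags, uint64_t *result)
{
    if (SKETCH_OP_ADD != op || (flags & SKETCH_FLAG_32BIT)) {
        return SKETCH_ERR_NOT_SUPPORTED;
    }
    /* *remote += operand; *result receives the previous value. */
    *result = atomic_fetch_add(remote, operand);
    return SKETCH_OK;
}

/* Compare-and-swap: store value only if *remote == compare. */
static int sketch_cswap(_Atomic uint64_t *remote, uint64_t compare,
                        uint64_t value, int flags, uint64_t *result)
{
    uint64_t observed = compare;
    if (flags & SKETCH_FLAG_32BIT) {
        return SKETCH_ERR_NOT_SUPPORTED;
    }
    /* On failure, observed is overwritten with the current value, so in
     * both cases it ends up holding what was stored in *remote before. */
    atomic_compare_exchange_strong(remote, &observed, value);
    *result = observed;
    return SKETCH_OK;
}
```

In both helpers the caller always gets the previous contents of the target word back, which is how a caller can tell whether a compare-and-swap took effect.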

File diff not shown because it is too large.


@ -1,118 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2015-2018 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_OPENIB_EAGER_RDMA_BUF_H
#define MCA_BTL_OPENIB_EAGER_RDMA_BUF_H
#include "opal_config.h"
#include "btl_openib.h"
BEGIN_C_DECLS
struct mca_btl_openib_eager_rdma_local_t {
opal_ptr_t base; /**< buffer for RDMAing eager messages */
void *alloc_base; /**< allocated base */
mca_btl_openib_recv_frag_t *frags;
mca_btl_openib_reg_t *reg;
uint16_t head; /**< RDMA buffer to poll */
uint16_t tail; /**< Needed for credit management */
opal_atomic_int32_t credits; /**< number of RDMA credits */
int32_t rd_win;
#if OPAL_ENABLE_DEBUG
uint32_t seq;
#endif
opal_mutex_t lock; /**< guard access to RDMA buffer */
int32_t rd_low;
};
typedef struct mca_btl_openib_eager_rdma_local_t mca_btl_openib_eager_rdma_local_t;
struct mca_btl_openib_eager_rdma_remote_t {
opal_ptr_t base; /**< address of remote buffer */
uint32_t rkey; /**< RKey for accessing remote buffer */
opal_atomic_int32_t head; /**< RDMA buffer to post to */
opal_atomic_int32_t tokens; /**< number of rdma tokens */
#if OPAL_ENABLE_DEBUG
uint32_t seq;
#endif
};
typedef struct mca_btl_openib_eager_rdma_remote_t mca_btl_openib_eager_rdma_remote_t;
#define MCA_BTL_OPENIB_RDMA_FRAG(F) \
(openib_frag_type(F) == MCA_BTL_OPENIB_FRAG_EAGER_RDMA)
#define EAGER_RDMA_BUFFER_REMOTE (0)
#define EAGER_RDMA_BUFFER_LOCAL (0xff)
#ifdef WORDS_BIGENDIAN
#define MCA_BTL_OPENIB_RDMA_FRAG_GET_SIZE(F) ((F)->u.size >> 8)
#define MCA_BTL_OPENIB_RDMA_FRAG_SET_SIZE(F, S) \
((F)->u.size = (S) << 8)
#else
#define MCA_BTL_OPENIB_RDMA_FRAG_GET_SIZE(F) ((F)->u.size & 0x00ffffff)
#define MCA_BTL_OPENIB_RDMA_FRAG_SET_SIZE(F, S) \
((F)->u.size = (S) & 0x00ffffff)
#endif
#define MCA_BTL_OPENIB_RDMA_FRAG_LOCAL(F) \
(((volatile uint8_t*)(F)->ftr->u.buf)[3] != EAGER_RDMA_BUFFER_REMOTE)
#define MCA_BTL_OPENIB_RDMA_FRAG_REMOTE(F) \
(!MCA_BTL_OPENIB_RDMA_FRAG_LOCAL(F))
#define MCA_BTL_OPENIB_RDMA_MAKE_REMOTE(F) do { \
((volatile uint8_t*)(F)->u.buf)[3] = EAGER_RDMA_BUFFER_REMOTE; \
}while (0)
#define MCA_BTL_OPENIB_RDMA_MAKE_LOCAL(F) do { \
((volatile uint8_t*)(F)->u.buf)[3] = EAGER_RDMA_BUFFER_LOCAL; \
}while (0)
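
The size and ownership macros above share one 32-bit footer word: the fragment size occupies 24 bits, and byte index 3 of the same word carries the local/remote marker. The endian-specific variants appear to be chosen so that the marker always sits in the last byte in memory. The sketch below reproduces that layout with a runtime endianness check in place of WORDS_BIGENDIAN. The union and helper names are hypothetical, and the exact layout of the real footer type is an assumption.

```c
#include <stdint.h>

/* Hypothetical footer word: 24-bit size plus a one-byte ownership marker. */
union sketch_footer {
    uint32_t size;
    uint8_t  buf[4];
};

#define SKETCH_BUFFER_REMOTE 0x00
#define SKETCH_BUFFER_LOCAL  0xff

/* Runtime replacement for the WORDS_BIGENDIAN preprocessor test. */
static int sketch_is_big_endian(void)
{
    const union sketch_footer probe = { .size = 1 };
    return 0 == probe.buf[0];
}

/* Store the size so that buf[3] is left free (zero) for the marker. */
static void sketch_set_size(union sketch_footer *f, uint32_t size)
{
    f->size = sketch_is_big_endian() ? (size << 8) : (size & 0x00ffffffu);
}

/* Read the size back, ignoring whatever the marker byte holds. */
static uint32_t sketch_get_size(const union sketch_footer *f)
{
    return sketch_is_big_endian() ? (f->size >> 8) : (f->size & 0x00ffffffu);
}

static void sketch_make_local(union sketch_footer *f)
{
    f->buf[3] = SKETCH_BUFFER_LOCAL;
}

static int sketch_is_local(const union sketch_footer *f)
{
    return SKETCH_BUFFER_REMOTE != f->buf[3];
}
```

Setting the size also resets the marker byte to the remote value, so in this sketch the marker has to be written after the size.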
#define MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG(E, I) \
(&(E)->eager_rdma_local.frags[(I)])
#define MCA_BTL_OPENIB_RDMA_NEXT_INDEX(I) do { \
(I) = ((I) + 1); \
if((I) == \
mca_btl_openib_component.eager_rdma_num) \
(I) = 0; \
} while (0)
#if OPAL_ENABLE_DEBUG
/**
* @brief read and increment the remote head index and generate a sequence
* number
*/
#define MCA_BTL_OPENIB_RDMA_MOVE_INDEX(HEAD, OLD_HEAD, SEQ) \
do { \
(SEQ) = OPAL_THREAD_ADD_FETCH32(&(HEAD), 1) - 1; \
(OLD_HEAD) = (SEQ) % mca_btl_openib_component.eager_rdma_num; \
} while(0)
#else
/**
* @brief read and increment the remote head index
*/
#define MCA_BTL_OPENIB_RDMA_MOVE_INDEX(HEAD, OLD_HEAD) \
do { \
(OLD_HEAD) = (OPAL_THREAD_ADD_FETCH32((opal_atomic_int32_t *) &(HEAD), 1) - 1) % mca_btl_openib_component.eager_rdma_num; \
} while(0)
#endif
END_C_DECLS
#endif
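
The index macros in the header above implement a fixed-size ring: MCA_BTL_OPENIB_RDMA_NEXT_INDEX advances a private index with wraparound, while MCA_BTL_OPENIB_RDMA_MOVE_INDEX atomically claims the next slot of a shared head counter. A minimal sketch of both follows, using C11 atomics rather than OPAL_THREAD_ADD_FETCH32. The function names and the ring size parameter are hypothetical; the real macros read the size from mca_btl_openib_component.eager_rdma_num.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Advance a privately owned ring index, wrapping at ring_size. */
static uint16_t sketch_next_index(uint16_t index, uint16_t ring_size)
{
    index = (uint16_t)(index + 1);
    if (index == ring_size) {
        index = 0;
    }
    return index;
}

/* Atomically claim the next slot of a shared head counter. The counter
 * itself keeps growing; only the returned slot is reduced modulo the ring
 * size. If seq is not NULL it receives the pre-increment counter value,
 * which plays the role of the debug sequence number in the macros above. */
static int32_t sketch_claim_slot(_Atomic int32_t *head, int32_t ring_size,
                                 int32_t *seq)
{
    const int32_t old = atomic_fetch_add(head, 1);
    if (NULL != seq) {
        *seq = old;
    }
    return old % ring_size;
}
```

Because the shared counter is never reset, two threads that claim concurrently always receive distinct sequence numbers, and the slots repeat only after a full pass around the ring.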

File diff not shown because it is too large.


@ -1,720 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2013 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007-2009 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2007-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2010-2012 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2015 NVIDIA Corporation. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_IB_ENDPOINT_H
#define MCA_BTL_IB_ENDPOINT_H
#include <errno.h>
#include <string.h>
#include "opal/class/opal_list.h"
#include "opal/mca/event/event.h"
#include "opal/util/output.h"
#include "opal/mca/btl/btl.h"
#include "opal/mca/btl/base/btl_base_error.h"
#include "btl_openib.h"
#include "btl_openib_frag.h"
#include "btl_openib_eager_rdma.h"
#include "connect/base.h"
#define QP_TX_BATCH_COUNT 64
BEGIN_C_DECLS
struct mca_btl_openib_frag_t;
struct mca_btl_openib_proc_modex_t;
/**
* State of IB endpoint connection.
*/
typedef enum {
/* Defines the state in which this BTL instance
* has started the process of connection */
MCA_BTL_IB_CONNECTING,
/* Waiting for ack from endpoint */
MCA_BTL_IB_CONNECT_ACK,
/* Waiting for final connection ACK from endpoint */
MCA_BTL_IB_WAITING_ACK,
/* Connected ... both sender & receiver have
* buffers associated with this connection */
MCA_BTL_IB_CONNECTED,
/* Connection is closed, there are no resources
* associated with this */
MCA_BTL_IB_CLOSED,
/* The maximum number of retries has been used.
* Report failure on send to upper layer */
MCA_BTL_IB_FAILED
} mca_btl_openib_endpoint_state_t;
typedef struct mca_btl_openib_rem_qp_info_t {
/* Remote QP number */
uint32_t rem_qp_num;
/* Remote process's packet sequence number (PSN) */
uint32_t rem_psn;
} mca_btl_openib_rem_qp_info_t;
typedef struct mca_btl_openib_rem_srq_info_t {
/* Remote SRQ number */
uint32_t rem_srq_num;
} mca_btl_openib_rem_srq_info_t;
typedef struct mca_btl_openib_rem_info_t {
/* Local identifier of the remote process */
uint16_t rem_lid;
/* subnet id of remote process */
uint64_t rem_subnet_id;
/* MTU of remote process */
uint32_t rem_mtu;
/* index of remote endpoint in endpoint array */
uint32_t rem_index;
/* Remote QPs */
mca_btl_openib_rem_qp_info_t *rem_qps;
/* Remote xrc_srq info, used only with XRC connections */
mca_btl_openib_rem_srq_info_t *rem_srqs;
/* Vendor id of remote HCA */
uint32_t rem_vendor_id;
/* Vendor part id of remote HCA */
uint32_t rem_vendor_part_id;
/* Transport type of remote port */
mca_btl_openib_transport_type_t rem_transport_type;
} mca_btl_openib_rem_info_t;
/**
* Aggregates all per-peer QP info for an endpoint
*/
typedef struct mca_btl_openib_endpoint_pp_qp_t {
opal_atomic_int32_t sd_credits; /**< this rank's view of the credits
* available for sending:
* this is the credits granted by the
* remote peer which has some relation to the
* number of receive buffers posted remotely
*/
opal_atomic_int32_t rd_posted; /**< number of descriptors posted to the NIC */
opal_atomic_int32_t rd_credits; /**< number of credits to return to peer */
opal_atomic_int32_t cm_received; /**< Credit messages received */
opal_atomic_int32_t cm_return; /**< how many credits to return */
opal_atomic_int32_t cm_sent; /**< Outstanding number of credit messages */
} mca_btl_openib_endpoint_pp_qp_t;
/**
* Aggregates all srq qp info for an endpoint
*/
typedef struct mca_btl_openib_endpoint_srq_qp_t {
int32_t dummy;
} mca_btl_openib_endpoint_srq_qp_t;
typedef struct mca_btl_openib_qp_t {
struct ibv_qp *lcl_qp;
uint32_t lcl_psn;
opal_atomic_int32_t sd_wqe; /**< number of available send wqe entries */
opal_atomic_int32_t sd_wqe_inflight;
int wqe_count;
int users;
opal_mutex_t lock;
} mca_btl_openib_qp_t;
typedef struct mca_btl_openib_endpoint_qp_t {
mca_btl_openib_qp_t *qp;
opal_list_t no_credits_pending_frags[2]; /**< put fragments here if there are no credits
available */
opal_list_t no_wqe_pending_frags[2]; /**< put fragments here if there are no WQEs
available */
opal_atomic_int32_t rd_credit_send_lock; /**< Lock credit send fragment */
mca_btl_openib_send_control_frag_t *credit_frag;
size_t ib_inline_max; /**< max size of inline send */
union {
mca_btl_openib_endpoint_srq_qp_t srq_qp;
mca_btl_openib_endpoint_pp_qp_t pp_qp;
} u;
} mca_btl_openib_endpoint_qp_t;
/**
* An abstraction that represents a connection to an endpoint process.
* An instance of mca_btl_base_endpoint_t is associated w/ each process
* and BTL pair at startup. However, connections to the endpoint
* are established dynamically on an as-needed basis.
*/
struct mca_btl_base_endpoint_t {
opal_list_item_t super;
/** BTL module that created this connection */
struct mca_btl_openib_module_t* endpoint_btl;
/** proc structure corresponding to endpoint */
struct mca_btl_openib_proc_t* endpoint_proc;
/** local CPC to connect to this endpoint */
opal_btl_openib_connect_base_module_t *endpoint_local_cpc;
/** hook for local CPC to hang endpoint-specific data */
void *endpoint_local_cpc_data;
/** If endpoint_local_cpc->cbm_uses_cts is true and this endpoint
is iWARP, then endpoint_initiator must be true on the side
that actually initiates the QP, false on the other side. This
bool is used to know which way to send the first CTS
message. */
bool endpoint_initiator;
/** pointer to remote proc's CPC data (essentially its CPC modex
message) */
opal_btl_openib_connect_base_module_data_t *endpoint_remote_cpc_data;
/** current state of the connection */
mca_btl_openib_endpoint_state_t endpoint_state;
/** number of connection retries attempted */
size_t endpoint_retries;
/** timestamp of when the first connection was attempted */
double endpoint_tstamp;
/** lock for concurrent access to endpoint state */
opal_mutex_t endpoint_lock;
/** list of pending frags due to lazy connection establishment
for this endpoint */
opal_list_t pending_lazy_frags;
mca_btl_openib_endpoint_qp_t *qps;
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
struct ibv_qp *xrc_recv_qp;
#else
uint32_t xrc_recv_qp_num; /* in xrc we will use it as recv qp */
#endif
uint32_t xrc_recv_psn;
/** list of pending rget ops */
opal_list_t pending_get_frags;
/** list of pending rput ops */
opal_list_t pending_put_frags;
/** number of available get tokens */
opal_atomic_int32_t get_tokens;
/** subnet id of this endpoint */
uint64_t subnet_id;
/** used only for xrc; pointer to struct that keeps remote port
info */
struct ib_address_t *ib_addr;
/** number of eager received */
opal_atomic_int32_t eager_recv_count;
/** info about remote RDMA buffer */
mca_btl_openib_eager_rdma_remote_t eager_rdma_remote;
/** info about local RDMA buffer */
mca_btl_openib_eager_rdma_local_t eager_rdma_local;
/** index of the endpoint in endpoints array */
int32_t index;
/** does the endpoint require network byte ordering? */
bool nbo;
/** use eager rdma for this peer? */
bool use_eager_rdma;
/** information about the remote port */
mca_btl_openib_rem_info_t rem_info;
/** Frag for initial wireup CTS protocol; will be NULL if CPC
indicates that it does not want to use CTS */
mca_btl_openib_recv_frag_t endpoint_cts_frag;
/** Memory registration info for the CTS frag */
struct ibv_mr *endpoint_cts_mr;
/** Whether we've posted receives on this EP or not (only used in
CTS protocol) */
bool endpoint_posted_recvs;
/** Whether we've received the CTS from the peer or not (only used
in CTS protocol) */
bool endpoint_cts_received;
/** Whether we've sent out the CTS to the peer or not (only used in
CTS protocol) */
bool endpoint_cts_sent;
};
typedef struct mca_btl_base_endpoint_t mca_btl_base_endpoint_t;
typedef mca_btl_base_endpoint_t mca_btl_openib_endpoint_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_endpoint_t);
static inline int32_t qp_get_wqe(mca_btl_openib_endpoint_t *ep, const int qp)
{
return OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].qp->sd_wqe, -1);
}
static inline int32_t qp_put_wqe(mca_btl_openib_endpoint_t *ep, const int qp)
{
return OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].qp->sd_wqe, 1);
}
static inline int32_t qp_inc_inflight_wqe(mca_btl_openib_endpoint_t *ep, const int qp, mca_btl_openib_com_frag_t *frag)
{
frag->n_wqes_inflight = 0;
return OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].qp->sd_wqe_inflight, 1);
}
static inline void qp_inflight_wqe_to_frag(mca_btl_openib_endpoint_t *ep, const int qp, mca_btl_openib_com_frag_t *frag)
{
frag->n_wqes_inflight = ep->qps[qp].qp->sd_wqe_inflight;
ep->qps[qp].qp->sd_wqe_inflight = 0;
}
static inline int qp_frag_to_wqe(mca_btl_openib_endpoint_t *ep, const int qp, mca_btl_openib_com_frag_t *frag)
{
int n;
n = frag->n_wqes_inflight;
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].qp->sd_wqe, n);
frag->n_wqes_inflight = 0;
return n;
}
static inline int qp_need_signal(mca_btl_openib_endpoint_t *ep, const int qp, size_t size, int rdma)
{
/* note that size here is payload only */
if (ep->qps[qp].qp->sd_wqe <= 0 ||
size + sizeof(mca_btl_openib_header_t) + (rdma ? sizeof(mca_btl_openib_footer_t) : 0) > ep->qps[qp].ib_inline_max ||
(!BTL_OPENIB_QP_TYPE_PP(qp) && ep->endpoint_btl->qps[qp].u.srq_qp.sd_credits <= 0)) {
ep->qps[qp].qp->wqe_count = QP_TX_BATCH_COUNT;
return 1;
}
if (0 < --ep->qps[qp].qp->wqe_count) {
return 0;
}
ep->qps[qp].qp->wqe_count = QP_TX_BATCH_COUNT;
return 1;
}
static inline void qp_reset_signal_count(mca_btl_openib_endpoint_t *ep, const int qp)
{
ep->qps[qp].qp->wqe_count = QP_TX_BATCH_COUNT;
}
int mca_btl_openib_endpoint_send(mca_btl_base_endpoint_t*,
mca_btl_openib_send_frag_t*);
int mca_btl_openib_endpoint_post_send(mca_btl_openib_endpoint_t*,
mca_btl_openib_send_frag_t*);
void mca_btl_openib_endpoint_send_credits(mca_btl_base_endpoint_t*, const int);
void mca_btl_openib_endpoint_connect_eager_rdma(mca_btl_openib_endpoint_t*);
int mca_btl_openib_endpoint_post_recvs(mca_btl_openib_endpoint_t*);
/* the endpoint lock must be held with OPAL_THREAD_LOCK for both CTS and cpc complete */
void mca_btl_openib_endpoint_send_cts(mca_btl_openib_endpoint_t *endpoint);
void mca_btl_openib_endpoint_cpc_complete(mca_btl_openib_endpoint_t*);
void mca_btl_openib_endpoint_connected(mca_btl_openib_endpoint_t*);
void mca_btl_openib_endpoint_init(mca_btl_openib_module_t*,
mca_btl_base_endpoint_t*,
opal_btl_openib_connect_base_module_t *local_cpc,
struct mca_btl_openib_proc_modex_t *remote_proc_info,
opal_btl_openib_connect_base_module_data_t *remote_cpc_data);
/*
* Invoke an error on the btl associated with an endpoint. If we
* don't have an endpoint, then just use the first one on the
* component list of BTLs.
*/
void *mca_btl_openib_endpoint_invoke_error(void *endpoint);
static inline int post_recvs(mca_btl_base_endpoint_t *ep, const int qp,
const int num_post)
{
int i, rc;
struct ibv_recv_wr *bad_wr, *wr_list = NULL, *wr = NULL;
mca_btl_openib_module_t *openib_btl = ep->endpoint_btl;
if(0 == num_post)
return OPAL_SUCCESS;
for(i = 0; i < num_post; i++) {
opal_free_list_item_t* item;
item = opal_free_list_wait (&openib_btl->device->qps[qp].recv_free);
to_base_frag(item)->base.order = qp;
to_com_frag(item)->endpoint = ep;
if(NULL == wr)
wr = wr_list = &to_recv_frag(item)->rd_desc;
else
wr = wr->next = &to_recv_frag(item)->rd_desc;
OPAL_OUTPUT((-1, "Posting recv (QP num %d): WR ID %p, SG addr %p, len %d, lkey %d",
ep->qps[qp].qp->lcl_qp->qp_num,
(void*) ((uintptr_t*)wr->wr_id),
(void*)((uintptr_t*) wr->sg_list[0].addr),
wr->sg_list[0].length,
wr->sg_list[0].lkey));
}
wr->next = NULL;
rc = ibv_post_recv(ep->qps[qp].qp->lcl_qp, wr_list, &bad_wr);
if (0 == rc)
return OPAL_SUCCESS;
BTL_ERROR(("error %d posting receive on qp %d", rc, qp));
return OPAL_ERROR;
}
static inline int mca_btl_openib_endpoint_post_rr_nolock(
mca_btl_base_endpoint_t *ep, const int qp)
{
int rd_rsv = mca_btl_openib_component.qp_infos[qp].u.pp_qp.rd_rsv;
int rd_num = mca_btl_openib_component.qp_infos[qp].rd_num;
int rd_low = mca_btl_openib_component.qp_infos[qp].rd_low;
int cqp = mca_btl_openib_component.credits_qp, rc;
int cm_received = 0, num_post = 0;
assert(BTL_OPENIB_QP_TYPE_PP(qp));
if(ep->qps[qp].u.pp_qp.rd_posted <= rd_low)
num_post = rd_num - ep->qps[qp].u.pp_qp.rd_posted;
assert(num_post >= 0);
if(ep->qps[qp].u.pp_qp.cm_received >= (rd_rsv >> 2))
cm_received = ep->qps[qp].u.pp_qp.cm_received;
if((rc = post_recvs(ep, qp, num_post)) != OPAL_SUCCESS) {
return rc;
}
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].u.pp_qp.rd_posted, num_post);
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].u.pp_qp.rd_credits, num_post);
/* post buffers for credit management on credit management qp */
if((rc = post_recvs(ep, cqp, cm_received)) != OPAL_SUCCESS) {
return rc;
}
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].u.pp_qp.cm_return, cm_received);
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].u.pp_qp.cm_received, -cm_received);
assert(ep->qps[qp].u.pp_qp.rd_credits <= rd_num &&
ep->qps[qp].u.pp_qp.rd_credits >= 0);
return OPAL_SUCCESS;
}
static inline int mca_btl_openib_endpoint_post_rr(
mca_btl_base_endpoint_t *ep, const int qp)
{
int ret;
OPAL_THREAD_LOCK(&ep->endpoint_lock);
ret = mca_btl_openib_endpoint_post_rr_nolock(ep, qp);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
return ret;
}
static inline __opal_attribute_always_inline__ bool btl_openib_credits_send_trylock (mca_btl_openib_endpoint_t *ep, int qp)
{
int32_t _tmp_value = 0;
return OPAL_ATOMIC_COMPARE_EXCHANGE_STRONG_32(&ep->qps[qp].rd_credit_send_lock, &_tmp_value, 1);
}
#define BTL_OPENIB_CREDITS_SEND_UNLOCK(E, Q) \
OPAL_ATOMIC_SWAP_32 (&(E)->qps[(Q)].rd_credit_send_lock, 0)
#define BTL_OPENIB_GET_CREDITS(FROM, TO) \
TO = OPAL_ATOMIC_SWAP_32(&FROM, 0)
static inline bool check_eager_rdma_credits(const mca_btl_openib_endpoint_t *ep)
{
return (ep->eager_rdma_local.credits > ep->eager_rdma_local.rd_win) ? true :
false;
}
static inline bool
check_send_credits(const mca_btl_openib_endpoint_t *ep, const int qp)
{
if(!BTL_OPENIB_QP_TYPE_PP(qp))
return false;
return (ep->qps[qp].u.pp_qp.rd_credits >=
mca_btl_openib_component.qp_infos[qp].u.pp_qp.rd_win) ? true : false;
}
static inline void send_credits(mca_btl_openib_endpoint_t *ep, int qp)
{
if(BTL_OPENIB_QP_TYPE_PP(qp)) {
if(check_send_credits(ep, qp))
goto try_send;
} else {
qp = mca_btl_openib_component.credits_qp;
}
if(!check_eager_rdma_credits(ep))
return;
try_send:
if(btl_openib_credits_send_trylock(ep, qp))
mca_btl_openib_endpoint_send_credits(ep, qp);
}
static inline int check_endpoint_state(mca_btl_openib_endpoint_t *ep,
mca_btl_base_descriptor_t *des, opal_list_t *pending_list)
{
int rc = OPAL_ERR_RESOURCE_BUSY;
switch(ep->endpoint_state) {
case MCA_BTL_IB_CLOSED:
rc = ep->endpoint_local_cpc->cbm_start_connect(ep->endpoint_local_cpc, ep);
if (OPAL_SUCCESS == rc) {
rc = OPAL_ERR_RESOURCE_BUSY;
}
/* fall through */
default:
opal_list_append(pending_list, (opal_list_item_t *)des);
break;
case MCA_BTL_IB_FAILED:
rc = OPAL_ERR_UNREACH;
break;
case MCA_BTL_IB_CONNECTED:
rc = OPAL_SUCCESS;
break;
}
return rc;
}
static inline __opal_attribute_always_inline__ int
ib_send_flags(uint32_t size, mca_btl_openib_endpoint_qp_t *qp, int do_signal)
{
if (do_signal) {
return IBV_SEND_SIGNALED |
((size <= qp->ib_inline_max) ? IBV_SEND_INLINE : 0);
} else {
return ((size <= qp->ib_inline_max) ? IBV_SEND_INLINE : 0);
}
}
static inline int
acquire_eager_rdma_send_credit(mca_btl_openib_endpoint_t *endpoint)
{
if(OPAL_THREAD_ADD_FETCH32(&endpoint->eager_rdma_remote.tokens, -1) < 0) {
OPAL_THREAD_ADD_FETCH32(&endpoint->eager_rdma_remote.tokens, 1);
return OPAL_ERR_OUT_OF_RESOURCE;
}
return OPAL_SUCCESS;
}
static inline int post_send(mca_btl_openib_endpoint_t *ep,
mca_btl_openib_send_frag_t *frag, const bool rdma, int do_signal)
{
mca_btl_openib_module_t *openib_btl = ep->endpoint_btl;
mca_btl_base_segment_t *seg = &to_base_frag(frag)->segment;
struct ibv_sge *sg = &to_com_frag(frag)->sg_entry;
struct ibv_send_wr *sr_desc = &to_out_frag(frag)->sr_desc;
struct ibv_send_wr *bad_wr;
int qp = to_base_frag(frag)->base.order;
sg->length = seg->seg_len + sizeof(mca_btl_openib_header_t) +
(rdma ? sizeof(mca_btl_openib_footer_t) : 0) + frag->coalesced_length;
sr_desc->send_flags = ib_send_flags(sg->length, &(ep->qps[qp]), do_signal);
if(ep->nbo)
BTL_OPENIB_HEADER_HTON(*frag->hdr);
if(rdma) {
int32_t head;
mca_btl_openib_footer_t* ftr =
(mca_btl_openib_footer_t*)(((char*)frag->hdr) + sg->length +
BTL_OPENIB_FTR_PADDING(sg->length) - sizeof(mca_btl_openib_footer_t));
sr_desc->opcode = IBV_WR_RDMA_WRITE;
MCA_BTL_OPENIB_RDMA_FRAG_SET_SIZE(ftr, sg->length);
MCA_BTL_OPENIB_RDMA_MAKE_LOCAL(ftr);
#if OPAL_ENABLE_DEBUG
/* NTH: generate the sequence from the remote head index to ensure that the
* wrong sequence isn't set. The way this code used to look the sequence number
* and head were updated independently and it led to false positives for incorrect
* sequence numbers. */
MCA_BTL_OPENIB_RDMA_MOVE_INDEX(ep->eager_rdma_remote.head, head, ftr->seq);
#else
MCA_BTL_OPENIB_RDMA_MOVE_INDEX(ep->eager_rdma_remote.head, head);
#endif
if(ep->nbo)
BTL_OPENIB_FOOTER_HTON(*ftr);
sr_desc->wr.rdma.rkey = ep->eager_rdma_remote.rkey;
sr_desc->wr.rdma.remote_addr =
ep->eager_rdma_remote.base.lval +
head * openib_btl->eager_rdma_frag_size +
sizeof(mca_btl_openib_header_t) +
mca_btl_openib_component.eager_limit +
sizeof(mca_btl_openib_footer_t);
sr_desc->wr.rdma.remote_addr -= sg->length + BTL_OPENIB_FTR_PADDING(sg->length);
} else {
if(BTL_OPENIB_QP_TYPE_PP(qp)) {
sr_desc->opcode = IBV_WR_SEND;
} else {
sr_desc->opcode = IBV_WR_SEND_WITH_IMM;
#if !defined(WORDS_BIGENDIAN) && OPAL_ENABLE_HETEROGENEOUS_SUPPORT
sr_desc->imm_data = htonl(ep->rem_info.rem_index);
#else
sr_desc->imm_data = ep->rem_info.rem_index;
#endif
}
}
#if HAVE_XRC
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
if(BTL_OPENIB_QP_TYPE_XRC(qp))
sr_desc->qp_type.xrc.remote_srqn = ep->rem_info.rem_srqs[qp].rem_srq_num;
#else
if(BTL_OPENIB_QP_TYPE_XRC(qp))
sr_desc->xrc_remote_srq_num = ep->rem_info.rem_srqs[qp].rem_srq_num;
#endif
#endif
assert(sg->addr == (uint64_t)(uintptr_t)frag->hdr);
if (sr_desc->send_flags & IBV_SEND_SIGNALED) {
qp_inflight_wqe_to_frag(ep, qp, to_com_frag(frag));
} else {
qp_inc_inflight_wqe(ep, qp, to_com_frag(frag));
}
return ibv_post_send(ep->qps[qp].qp->lcl_qp, sr_desc, &bad_wr);
}
/* called with the endpoint lock held */
static inline int mca_btl_openib_endpoint_credit_acquire (struct mca_btl_base_endpoint_t *endpoint, int qp,
int prio, size_t size, bool *do_rdma,
mca_btl_openib_send_frag_t *frag, bool queue_frag)
{
mca_btl_openib_module_t *openib_btl = endpoint->endpoint_btl;
mca_btl_openib_header_t *hdr = frag->hdr;
size_t eager_limit;
int32_t cm_return;
eager_limit = mca_btl_openib_component.eager_limit +
sizeof(mca_btl_openib_header_coalesced_t) +
sizeof(mca_btl_openib_control_header_t);
if (!(prio && size < eager_limit && acquire_eager_rdma_send_credit(endpoint) == OPAL_SUCCESS)) {
*do_rdma = false;
prio = !prio;
if (BTL_OPENIB_QP_TYPE_PP(qp)) {
if (OPAL_THREAD_ADD_FETCH32(&endpoint->qps[qp].u.pp_qp.sd_credits, -1) < 0) {
OPAL_THREAD_ADD_FETCH32(&endpoint->qps[qp].u.pp_qp.sd_credits, 1);
if (queue_frag) {
opal_list_append(&endpoint->qps[qp].no_credits_pending_frags[prio],
(opal_list_item_t *)frag);
}
return OPAL_ERR_OUT_OF_RESOURCE;
}
} else {
if(OPAL_THREAD_ADD_FETCH32(&openib_btl->qps[qp].u.srq_qp.sd_credits, -1) < 0) {
OPAL_THREAD_ADD_FETCH32(&openib_btl->qps[qp].u.srq_qp.sd_credits, 1);
if (queue_frag) {
OPAL_THREAD_LOCK(&openib_btl->ib_lock);
opal_list_append(&openib_btl->qps[qp].u.srq_qp.pending_frags[prio],
(opal_list_item_t *)frag);
OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
}
return OPAL_ERR_OUT_OF_RESOURCE;
}
}
} else {
/* High priority frag. Try to send over eager RDMA */
*do_rdma = true;
}
/* Set all credits */
BTL_OPENIB_GET_CREDITS(endpoint->eager_rdma_local.credits, hdr->credits);
if (hdr->credits) {
hdr->credits |= BTL_OPENIB_RDMA_CREDITS_FLAG;
}
if (!*do_rdma) {
if (BTL_OPENIB_QP_TYPE_PP(qp) && 0 == hdr->credits) {
BTL_OPENIB_GET_CREDITS(endpoint->qps[qp].u.pp_qp.rd_credits, hdr->credits);
}
} else {
hdr->credits |= (qp << 11);
}
BTL_OPENIB_GET_CREDITS(endpoint->qps[qp].u.pp_qp.cm_return, cm_return);
/* cm_seen is only 8 bits, but cm_return is 32 bits */
if(cm_return > 255) {
hdr->cm_seen = 255;
cm_return -= 255;
OPAL_THREAD_ADD_FETCH32(&endpoint->qps[qp].u.pp_qp.cm_return, cm_return);
} else {
hdr->cm_seen = cm_return;
}
return OPAL_SUCCESS;
}
/* called with the endpoint lock held. */
static inline void mca_btl_openib_endpoint_credit_release (struct mca_btl_base_endpoint_t *endpoint, int qp,
bool do_rdma, mca_btl_openib_send_frag_t *frag)
{
mca_btl_openib_header_t *hdr = frag->hdr;
if (BTL_OPENIB_IS_RDMA_CREDITS(hdr->credits)) {
OPAL_THREAD_ADD_FETCH32(&endpoint->eager_rdma_local.credits, BTL_OPENIB_CREDITS(hdr->credits));
}
if (do_rdma) {
OPAL_THREAD_ADD_FETCH32(&endpoint->eager_rdma_remote.tokens, 1);
} else {
if(BTL_OPENIB_QP_TYPE_PP(qp)) {
OPAL_THREAD_ADD_FETCH32 (&endpoint->qps[qp].u.pp_qp.rd_credits, hdr->credits);
OPAL_THREAD_ADD_FETCH32(&endpoint->qps[qp].u.pp_qp.sd_credits, 1);
} else if (BTL_OPENIB_QP_TYPE_SRQ(qp)) {
mca_btl_openib_module_t *openib_btl = endpoint->endpoint_btl;
OPAL_THREAD_ADD_FETCH32(&openib_btl->qps[qp].u.srq_qp.sd_credits, 1);
}
}
}
END_C_DECLS
#endif
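
The credit helpers in the header above (acquire_eager_rdma_send_credit and the sd_credits paths in mca_btl_openib_endpoint_credit_acquire) share one pattern: decrement the counter optimistically, then give the credit back if the result went negative. The following is a minimal sketch of that pattern using C11 atomics. The names credit_pool_t, credit_try_acquire and credit_release are hypothetical and exist only for illustration, and <stdatomic.h> is used here as a stand-in for the OPAL_THREAD_ADD_FETCH32 macro, whose exact behavior is not reproduced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-endpoint credit counter. */
typedef struct {
    atomic_int credits;
} credit_pool_t;

/* Try to take one credit. The counter is decremented first; if that drives
 * it below zero, no credit was really available, so the decrement is undone
 * and the caller is told to queue or retry. */
static bool credit_try_acquire(credit_pool_t *pool)
{
    assert(pool != NULL);
    /* atomic_fetch_sub returns the value held *before* the subtraction,
     * so subtract 1 again to get the post-decrement value. */
    if (atomic_fetch_sub(&pool->credits, 1) - 1 < 0) {
        atomic_fetch_add(&pool->credits, 1);   /* roll back */
        return false;
    }
    return true;
}

/* Return one credit to the pool, e.g. when a send could not be posted. */
static void credit_release(credit_pool_t *pool)
{
    assert(pool != NULL);
    atomic_fetch_add(&pool->credits, 1);
}
```

With a pool initialised to one credit via atomic_init, the first credit_try_acquire call succeeds and the second fails while leaving the counter at zero. The counter can go transiently negative between the decrement and the rollback, which is why a signed type is assumed here.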


@ -1,222 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2015 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2012 Oracle and/or its affiliates. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "btl_openib_frag.h"
#include "btl_openib_eager_rdma.h"
int mca_btl_openib_frag_init(opal_free_list_item_t* item, void* ctx)
{
mca_btl_openib_frag_init_data_t* init_data = (mca_btl_openib_frag_init_data_t *) ctx;
mca_btl_openib_frag_t *frag = to_base_frag(item);
if(MCA_BTL_OPENIB_FRAG_RECV == frag->type) {
to_recv_frag(frag)->qp_idx = init_data->order;
to_com_frag(frag)->sg_entry.length =
mca_btl_openib_component.qp_infos[init_data->order].size +
sizeof(mca_btl_openib_header_t) +
sizeof(mca_btl_openib_header_coalesced_t) +
sizeof(mca_btl_openib_control_header_t);
}
if(MCA_BTL_OPENIB_FRAG_SEND == frag->type)
to_send_frag(frag)->qp_idx = init_data->order;
frag->list = init_data->list;
return OPAL_SUCCESS;
}
static void base_constructor(mca_btl_openib_frag_t *frag)
{
frag->base.order = MCA_BTL_NO_ORDER;
}
static void com_constructor(mca_btl_openib_com_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
mca_btl_openib_reg_t* reg =
(mca_btl_openib_reg_t*)base_frag->base.super.registration;
frag->registration = reg;
if(reg) {
frag->sg_entry.lkey = reg->mr->lkey;
}
frag->n_wqes_inflight = 0;
}
static void out_constructor(mca_btl_openib_out_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->base.des_segments = &base_frag->segment;
base_frag->base.des_segment_count = 1;
frag->sr_desc.wr_id = (uint64_t)(uintptr_t)frag;
frag->sr_desc.sg_list = &to_com_frag(frag)->sg_entry;
frag->sr_desc.num_sge = 1;
frag->sr_desc.opcode = IBV_WR_SEND;
frag->sr_desc.send_flags = IBV_SEND_SIGNALED;
frag->sr_desc.next = NULL;
}
static void in_constructor(mca_btl_openib_in_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->base.des_segments = &base_frag->segment;
base_frag->base.des_segment_count = 1;
}
static void send_constructor(mca_btl_openib_send_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->type = MCA_BTL_OPENIB_FRAG_SEND;
frag->chdr = (mca_btl_openib_header_t*)base_frag->base.super.ptr;
frag->hdr = (mca_btl_openib_header_t*)
(((unsigned char*)base_frag->base.super.ptr) +
sizeof(mca_btl_openib_header_coalesced_t) +
sizeof(mca_btl_openib_control_header_t));
base_frag->segment.seg_addr.pval = frag->hdr + 1;
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t)frag->hdr;
frag->coalesced_length = 0;
OBJ_CONSTRUCT(&frag->coalesced_frags, opal_list_t);
}
static void recv_constructor(mca_btl_openib_recv_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->type = MCA_BTL_OPENIB_FRAG_RECV;
frag->hdr = (mca_btl_openib_header_t*)base_frag->base.super.ptr;
base_frag->segment.seg_addr.pval =
((unsigned char* )frag->hdr) + sizeof(mca_btl_openib_header_t);
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t)frag->hdr;
frag->rd_desc.wr_id = (uint64_t)(uintptr_t)frag;
frag->rd_desc.sg_list = &to_com_frag(frag)->sg_entry;
frag->rd_desc.num_sge = 1;
frag->rd_desc.next = NULL;
}
static void send_control_constructor(mca_btl_openib_send_control_frag_t *frag)
{
to_base_frag(frag)->type = MCA_BTL_OPENIB_FRAG_CONTROL;
/* adjusting headers because there is no coalesce header in control messages */
frag->hdr = frag->chdr;
to_base_frag(frag)->segment.seg_addr.pval = frag->hdr + 1;
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t)frag->hdr;
}
static void put_constructor(mca_btl_openib_put_frag_t *frag)
{
to_base_frag(frag)->type = MCA_BTL_OPENIB_FRAG_SEND_USER;
to_out_frag(frag)->sr_desc.opcode = IBV_WR_RDMA_WRITE;
frag->cb.func = NULL;
}
static void get_constructor(mca_btl_openib_get_frag_t *frag)
{
to_base_frag(frag)->type = MCA_BTL_OPENIB_FRAG_RECV_USER;
frag->sr_desc.wr_id = (uint64_t)(uintptr_t)frag;
frag->sr_desc.sg_list = &to_com_frag(frag)->sg_entry;
frag->sr_desc.num_sge = 1;
frag->sr_desc.opcode = IBV_WR_RDMA_READ;
frag->sr_desc.send_flags = IBV_SEND_SIGNALED;
frag->sr_desc.next = NULL;
}
static void coalesced_constructor(mca_btl_openib_coalesced_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->type = MCA_BTL_OPENIB_FRAG_COALESCED;
base_frag->base.des_segments = &base_frag->segment;
base_frag->base.des_segment_count = 1;
}
OBJ_CLASS_INSTANCE(
mca_btl_openib_frag_t,
mca_btl_base_descriptor_t,
base_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_com_frag_t,
mca_btl_openib_frag_t,
com_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_out_frag_t,
mca_btl_openib_com_frag_t,
out_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_in_frag_t,
mca_btl_openib_com_frag_t,
in_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_send_frag_t,
mca_btl_openib_out_frag_t,
send_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_recv_frag_t,
mca_btl_openib_in_frag_t,
recv_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_send_control_frag_t,
mca_btl_openib_send_frag_t,
send_control_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_put_frag_t,
mca_btl_openib_out_frag_t,
put_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_get_frag_t,
mca_btl_openib_in_frag_t,
get_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_coalesced_frag_t,
mca_btl_openib_frag_t,
coalesced_constructor,
NULL);
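
The OBJ_CLASS_INSTANCE declarations above chain each fragment class to a parent, so a put fragment is first set up by the base, com and out constructors and only then by put_constructor, which replaces the default IBV_WR_SEND opcode with IBV_WR_RDMA_WRITE. The sketch below models that parent-first construction order in plain C. It is a simplified illustration and not OPAL's object system: class_desc_t, construct and every toy_* name are hypothetical, and the hierarchy is shortened to three levels.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*ctor_fn)(void *obj);

/* Hypothetical class descriptor: a parent link plus an optional constructor. */
typedef struct class_desc {
    const struct class_desc *parent;
    ctor_fn ctor;
} class_desc_t;

/* Run constructors from the root class down to the most-derived class. */
static void construct(const class_desc_t *cls, void *obj)
{
    if (NULL == cls) {
        return;
    }
    construct(cls->parent, obj);      /* parents first */
    if (NULL != cls->ctor) {
        cls->ctor(obj);
    }
}

/* Toy fragment with one field per level of the hierarchy. */
typedef struct {
    int order;
    int opcode;
    int type;
} toy_frag_t;

enum { TOY_NO_ORDER = -1 };
enum { TOY_OP_SEND = 0, TOY_OP_RDMA_WRITE = 1 };
enum { TOY_TYPE_NONE = 0, TOY_TYPE_SEND_USER = 3 };

static void toy_base_ctor(void *obj)
{
    ((toy_frag_t *)obj)->order = TOY_NO_ORDER;
}

static void toy_out_ctor(void *obj)
{
    ((toy_frag_t *)obj)->opcode = TOY_OP_SEND;   /* default for outgoing frags */
}

static void toy_put_ctor(void *obj)
{
    toy_frag_t *frag = obj;
    frag->type = TOY_TYPE_SEND_USER;
    frag->opcode = TOY_OP_RDMA_WRITE;            /* overrides the parent's default */
}

/* Toy hierarchy: base -> out -> put */
static const class_desc_t toy_base_class = { NULL, toy_base_ctor };
static const class_desc_t toy_out_class = { &toy_base_class, toy_out_ctor };
static const class_desc_t toy_put_class = { &toy_out_class, toy_put_ctor };
```

Calling construct(&toy_put_class, &frag) leaves frag.order at TOY_NO_ORDER and frag.opcode at TOY_OP_RDMA_WRITE, because the most-derived constructor runs last and may overwrite what its parents set.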


@ -1,422 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2013 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2009 IBM Corporation. All rights reserved.
* Copyright (c) 2006-2015 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2010-2012 Oracle and/or its affiliates. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_IB_FRAG_H
#define MCA_BTL_IB_FRAG_H
#include "opal_config.h"
#include "opal/align.h"
#include "opal/mca/btl/btl.h"
#include <infiniband/verbs.h>
BEGIN_C_DECLS
struct mca_btl_openib_reg_t;
struct mca_btl_openib_header_t {
mca_btl_base_tag_t tag;
uint8_t cm_seen;
uint16_t credits;
#if OPAL_OPENIB_PAD_HDR
uint8_t padding[4];
#endif
};
typedef struct mca_btl_openib_header_t mca_btl_openib_header_t;
#define BTL_OPENIB_RDMA_CREDITS_FLAG (1<<15)
#define BTL_OPENIB_IS_RDMA_CREDITS(I) ((I)&BTL_OPENIB_RDMA_CREDITS_FLAG)
#define BTL_OPENIB_CREDITS(I) ((I)&~BTL_OPENIB_RDMA_CREDITS_FLAG)
#define BTL_OPENIB_HEADER_HTON(h) \
do { \
(h).credits = htons((h).credits); \
} while (0)
#define BTL_OPENIB_HEADER_NTOH(h) \
do { \
(h).credits = ntohs((h).credits); \
} while (0)
typedef struct mca_btl_openib_header_coalesced_t {
mca_btl_base_tag_t tag;
uint32_t size;
uint32_t alloc_size;
#if OPAL_OPENIB_PAD_HDR
uint8_t padding[4];
#endif
} mca_btl_openib_header_coalesced_t;
#define BTL_OPENIB_HEADER_COALESCED_NTOH(h) \
do { \
(h).size = ntohl((h).size); \
(h).alloc_size = ntohl((h).alloc_size); \
} while(0)
#define BTL_OPENIB_HEADER_COALESCED_HTON(h) \
do { \
(h).size = htonl((h).size); \
(h).alloc_size = htonl((h).alloc_size); \
} while(0)
#if OPAL_OPENIB_PAD_HDR
/* BTL_OPENIB_FTR_PADDING
* This macro is used to keep the pointer to openib footers aligned for
* systems like SPARC64 that take a big performance hit when addresses
* are not aligned (and by default sigbus instead of coercing the type on
* an unaligned address).
*
* We ensure alignment of a packet's structures when OPAL_OPENIB_PAD_HDR
* is set to 1. When this is the case, several structures are padded
* to ensure alignment, and the mca_btl_openib_footer_t structure itself
* uses the BTL_OPENIB_FTR_PADDING macro to shift the location of the
* pointer to ensure proper alignment after the PML Header and data.
* For example sending a 1 byte data packet the memory layout without
* footer alignment would look something like the following:
*
* 0x00 : mca_btl_openib_header_coalesced_t (12 bytes + 4 byte pad)
* 0x10 : mca_btl_openib_control_header_t (1 byte + 7 byte pad)
* 0x18 : mca_btl_openib_header_t (4 bytes + 4 byte pad)
* 0x20 : PML Header and data (16 bytes PML + 1 byte data)
* 0x29 : mca_btl_openib_footer_t (4 bytes + 4 byte pad)
* 0x31 : end of packet
*
* By applying the BTL_OPENIB_FTR_PADDING() in the progress_one_device
* and post_send routines we adjust the pointer to mca_btl_openib_footer_t
* from 0x29 to 0x2C thus correctly aligning the start of the
* footer pointer. This adjustment will cause the padding field of
* mca_btl_openib_footer_t to overlap with the neighboring memory but since
* we never use the padding we do not end up inadvertently overwriting
* memory that does not belong to the fragment.
*/
#define BTL_OPENIB_FTR_PADDING(size) \
OPAL_ALIGN_PAD_AMOUNT(size, sizeof(uint64_t))
/* BTL_OPENIB_ALIGN_COALESCE_HDR
* This macro is used in btl_openib.c, while creating a coalesce fragment,
* to align the coalesce headers.
*/
#define BTL_OPENIB_ALIGN_COALESCE_HDR(ptr) \
OPAL_ALIGN_PTR(ptr, sizeof(uint32_t), unsigned char*)
/* BTL_OPENIB_COALESCE_HDR_PADDING
* This macro is used in btl_openib_component.c, while parsing an incoming
* coalesce fragment, to determine the padding amount used to align the
* mca_btl_openib_header_coalesced_t.
*/
#define BTL_OPENIB_COALESCE_HDR_PADDING(ptr) \
OPAL_ALIGN_PAD_AMOUNT(ptr, sizeof(uint32_t))
#else
#define BTL_OPENIB_FTR_PADDING(size) 0
#define BTL_OPENIB_ALIGN_COALESCE_HDR(ptr) ptr
#define BTL_OPENIB_COALESCE_HDR_PADDING(ptr) 0
#endif
struct mca_btl_openib_footer_t {
#if OPAL_ENABLE_DEBUG
uint32_t seq;
#endif
union {
uint32_t size;
uint8_t buf[4];
} u;
#if OPAL_OPENIB_PAD_HDR
#if OPAL_ENABLE_DEBUG
/* This footer needs to be a multiple of 8 bytes. Adding the seq field
* throws that off, and the padding cannot simply be removed because it
* is needed to adjust the alignment and to avoid overwriting other
* packets.
*/
uint8_t padding[12];
#else
uint8_t padding[8];
#endif
#endif
};
typedef struct mca_btl_openib_footer_t mca_btl_openib_footer_t;
#ifdef WORDS_BIGENDIAN
#define MCA_BTL_OPENIB_FTR_SIZE_REVERSE(ftr)
#else
#define MCA_BTL_OPENIB_FTR_SIZE_REVERSE(ftr) \
do { \
uint8_t tmp = (ftr).u.buf[0]; \
(ftr).u.buf[0]=(ftr).u.buf[2]; \
(ftr).u.buf[2]=tmp; \
} while (0)
#endif
#if OPAL_ENABLE_DEBUG
#define BTL_OPENIB_FOOTER_SEQ_HTON(h) ((h).seq = htonl((h).seq))
#define BTL_OPENIB_FOOTER_SEQ_NTOH(h) ((h).seq = ntohl((h).seq))
#else
#define BTL_OPENIB_FOOTER_SEQ_HTON(h)
#define BTL_OPENIB_FOOTER_SEQ_NTOH(h)
#endif
#define BTL_OPENIB_FOOTER_HTON(h) \
do { \
BTL_OPENIB_FOOTER_SEQ_HTON(h); \
MCA_BTL_OPENIB_FTR_SIZE_REVERSE(h); \
} while (0)
#define BTL_OPENIB_FOOTER_NTOH(h) \
do { \
BTL_OPENIB_FOOTER_SEQ_NTOH(h); \
MCA_BTL_OPENIB_FTR_SIZE_REVERSE(h); \
} while (0)
#define MCA_BTL_OPENIB_CONTROL_CREDITS 0
#define MCA_BTL_OPENIB_CONTROL_RDMA 1
#define MCA_BTL_OPENIB_CONTROL_COALESCED 2
#define MCA_BTL_OPENIB_CONTROL_CTS 3
struct mca_btl_openib_control_header_t {
uint8_t type;
#if OPAL_OPENIB_PAD_HDR
uint8_t padding[7];
#endif
};
typedef struct mca_btl_openib_control_header_t mca_btl_openib_control_header_t;
struct mca_btl_openib_eager_rdma_header_t {
mca_btl_openib_control_header_t control;
uint32_t rkey;
opal_ptr_t rdma_start;
};
typedef struct mca_btl_openib_eager_rdma_header_t mca_btl_openib_eager_rdma_header_t;
#define BTL_OPENIB_EAGER_RDMA_CONTROL_HEADER_HTON(h) \
do { \
(h).rkey = htonl((h).rkey); \
(h).rdma_start.lval = hton64((h).rdma_start.lval); \
} while (0)
#define BTL_OPENIB_EAGER_RDMA_CONTROL_HEADER_NTOH(h) \
do { \
(h).rkey = ntohl((h).rkey); \
(h).rdma_start.lval = ntoh64((h).rdma_start.lval); \
} while (0)
struct mca_btl_openib_rdma_credits_header_t {
mca_btl_openib_control_header_t control;
#if OPAL_OPENIB_PAD_HDR
uint8_t padding[1];
#endif
uint8_t qpn;
uint16_t rdma_credits;
};
typedef struct mca_btl_openib_rdma_credits_header_t mca_btl_openib_rdma_credits_header_t;
#define BTL_OPENIB_RDMA_CREDITS_HEADER_HTON(h) \
do { \
(h).rdma_credits = htons((h).rdma_credits); \
} while (0)
#define BTL_OPENIB_RDMA_CREDITS_HEADER_NTOH(h) \
do { \
(h).rdma_credits = ntohs((h).rdma_credits); \
} while (0)
enum mca_btl_openib_frag_type_t {
MCA_BTL_OPENIB_FRAG_RECV,
MCA_BTL_OPENIB_FRAG_RECV_USER,
MCA_BTL_OPENIB_FRAG_SEND,
MCA_BTL_OPENIB_FRAG_SEND_USER,
MCA_BTL_OPENIB_FRAG_EAGER_RDMA,
MCA_BTL_OPENIB_FRAG_CONTROL,
MCA_BTL_OPENIB_FRAG_COALESCED
};
typedef enum mca_btl_openib_frag_type_t mca_btl_openib_frag_type_t;
#define openib_frag_type(f) (to_base_frag(f)->type)
/**
* IB fragment derived type.
*/
/* base openib frag */
typedef struct mca_btl_openib_frag_t {
mca_btl_base_descriptor_t base;
mca_btl_base_segment_t segment;
mca_btl_openib_frag_type_t type;
opal_free_list_t* list;
} mca_btl_openib_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_frag_t);
#define to_base_frag(f) ((mca_btl_openib_frag_t*)(f))
/* frag used for communication */
typedef struct mca_btl_openib_com_frag_t {
mca_btl_openib_frag_t super;
struct ibv_sge sg_entry;
struct mca_btl_openib_reg_t *registration;
struct mca_btl_base_endpoint_t *endpoint;
/* number of unsignaled frags sent before this frag. */
uint32_t n_wqes_inflight;
} mca_btl_openib_com_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_com_frag_t);
#define to_com_frag(f) ((mca_btl_openib_com_frag_t*)(f))
typedef struct mca_btl_openib_out_frag_t {
mca_btl_openib_com_frag_t super;
struct ibv_send_wr sr_desc;
} mca_btl_openib_out_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_out_frag_t);
#define to_out_frag(f) ((mca_btl_openib_out_frag_t*)(f))
typedef struct mca_btl_openib_com_frag_t mca_btl_openib_in_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_in_frag_t);
#define to_in_frag(f) ((mca_btl_openib_in_frag_t*)(f))
typedef struct mca_btl_openib_send_frag_t {
mca_btl_openib_out_frag_t super;
mca_btl_openib_header_t *hdr, *chdr;
mca_btl_openib_footer_t *ftr;
uint8_t qp_idx;
uint32_t coalesced_length;
opal_list_t coalesced_frags;
} mca_btl_openib_send_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_send_frag_t);
#define to_send_frag(f) ((mca_btl_openib_send_frag_t*)(f))
typedef struct mca_btl_openib_recv_frag_t {
mca_btl_openib_in_frag_t super;
mca_btl_openib_header_t *hdr;
mca_btl_openib_footer_t *ftr;
struct ibv_recv_wr rd_desc;
uint8_t qp_idx;
} mca_btl_openib_recv_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_recv_frag_t);
#define to_recv_frag(f) ((mca_btl_openib_recv_frag_t*)(f))
typedef struct mca_btl_openib_put_frag_t {
mca_btl_openib_out_frag_t super;
struct {
mca_btl_base_rdma_completion_fn_t func;
mca_btl_base_registration_handle_t *local_handle;
void *context;
void *data;
} cb;
} mca_btl_openib_put_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_put_frag_t);
#define to_put_frag(f) ((mca_btl_openib_put_frag_t*)(f))
typedef struct mca_btl_openib_get_frag_t {
mca_btl_openib_in_frag_t super;
struct ibv_send_wr sr_desc;
struct {
mca_btl_base_rdma_completion_fn_t func;
mca_btl_base_registration_handle_t *local_handle;
void *context;
void *data;
} cb;
} mca_btl_openib_get_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_get_frag_t);
#define to_get_frag(f) ((mca_btl_openib_get_frag_t*)(f))
typedef struct mca_btl_openib_send_frag_t mca_btl_openib_send_control_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_send_control_frag_t);
#define to_send_control_frag(f) ((mca_btl_openib_send_control_frag_t*)(f))
typedef struct mca_btl_openib_coalesced_frag_t {
mca_btl_openib_frag_t super;
mca_btl_openib_send_frag_t *send_frag;
mca_btl_openib_header_coalesced_t *hdr;
bool sent;
} mca_btl_openib_coalesced_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_coalesced_frag_t);
#define to_coalesced_frag(f) ((mca_btl_openib_coalesced_frag_t*)(f))
/*
* Allocate an IB send descriptor
*
*/
static inline mca_btl_openib_send_control_frag_t *
alloc_control_frag(mca_btl_openib_module_t *btl)
{
return to_send_control_frag(opal_free_list_wait (&btl->device->send_free_control));
}
static inline uint8_t frag_size_to_order(mca_btl_openib_module_t* btl,
size_t size)
{
int qp;
for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++)
if(mca_btl_openib_component.qp_infos[qp].size >= size)
return qp;
return MCA_BTL_NO_ORDER;
}
static inline mca_btl_openib_com_frag_t *alloc_send_user_frag(void)
{
return to_com_frag(opal_free_list_get (&mca_btl_openib_component.send_user_free));
}
static inline mca_btl_openib_com_frag_t *alloc_recv_user_frag(void)
{
return to_com_frag(opal_free_list_get (&mca_btl_openib_component.recv_user_free));
}
static inline mca_btl_openib_coalesced_frag_t *alloc_coalesced_frag(void)
{
return to_coalesced_frag(opal_free_list_get (&mca_btl_openib_component.send_free_coalesced));
}
#define MCA_BTL_IB_FRAG_RETURN(frag) \
do { \
opal_free_list_return (to_base_frag(frag)->list, \
(opal_free_list_item_t*)(frag)); \
} while(0)
#define MCA_BTL_OPENIB_CLEAN_PENDING_FRAGS(list) \
do { \
opal_list_item_t *_frag_item; \
while (NULL != (_frag_item = opal_list_remove_first(list))) { \
MCA_BTL_IB_FRAG_RETURN(_frag_item); \
} \
} while (0)
struct mca_btl_openib_module_t;
struct mca_btl_openib_frag_init_data_t {
uint8_t order;
opal_free_list_t* list;
};
typedef struct mca_btl_openib_frag_init_data_t mca_btl_openib_frag_init_data_t;
int mca_btl_openib_frag_init(opal_free_list_item_t* item, void* ctx);
END_C_DECLS
#endif
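
The fragment types in the header above rely on "inheritance by first member": each derived struct embeds its parent as its first field, so a pointer to any fragment can be cast to the base type to read shared fields such as the type tag. A minimal sketch of that pattern follows. The names frag_type_sketch_t, base_frag_sketch_t, send_frag_sketch_t, TO_BASE_FRAG_SKETCH and frag_type_of are hypothetical and are not part of Open MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tag, standing in for mca_btl_openib_frag_type_t. */
typedef enum { FRAG_SKETCH_SEND, FRAG_SKETCH_RECV } frag_type_sketch_t;

/* Base "class": holds the fields every fragment shares. */
typedef struct base_frag_sketch {
    frag_type_sketch_t type;
} base_frag_sketch_t;

/* Derived "class": the base must be the FIRST member so that a pointer
 * to the derived struct is also a valid pointer to the base. */
typedef struct send_frag_sketch {
    base_frag_sketch_t super;
    unsigned qp_idx;
} send_frag_sketch_t;

/* Up-cast helper in the style of to_base_frag(). */
#define TO_BASE_FRAG_SKETCH(f) ((base_frag_sketch_t *)(f))

/* Read the type tag from any fragment, whatever its derived type. */
static frag_type_sketch_t frag_type_of(void *frag)
{
    assert(frag != NULL);
    return TO_BASE_FRAG_SKETCH(frag)->type;
}
```

The cast is only well defined because the C standard guarantees that a pointer to a struct, suitably converted, points to its first member; moving "super" away from the first position would silently break every such macro.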

View file

@ -1,167 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2013 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2008-2012 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2009 IBM Corporation. All rights reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved
* Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "btl_openib_frag.h"
#include "btl_openib_endpoint.h"
#include "btl_openib_proc.h"
#include "btl_openib_xrc.h"
/*
* RDMA READ remote buffer to local buffer address.
*/
int mca_btl_openib_get (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep, void *local_address,
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata)
{
mca_btl_openib_get_frag_t* frag = NULL;
int qp = order;
int rc;
if (OPAL_UNLIKELY(size > btl->btl_get_limit)) {
return OPAL_ERR_BAD_PARAM;
}
frag = to_get_frag(alloc_recv_user_frag());
if (OPAL_UNLIKELY(NULL == frag)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
if (MCA_BTL_NO_ORDER == qp) {
qp = mca_btl_openib_component.rdma_qp;
}
/* set base descriptor flags */
to_base_frag(frag)->base.order = qp;
/* free this descriptor when the operation is complete */
to_base_frag(frag)->base.des_flags = MCA_BTL_DES_FLAGS_BTL_OWNERSHIP;
/* set up scatter-gather entry */
to_com_frag(frag)->sg_entry.length = size;
to_com_frag(frag)->sg_entry.lkey = local_handle->lkey;
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t) local_address;
to_com_frag(frag)->endpoint = ep;
/* set up rdma callback */
frag->cb.func = cbfunc;
frag->cb.context = cbcontext;
frag->cb.data = cbdata;
frag->cb.local_handle = local_handle;
/* set up descriptor */
frag->sr_desc.wr.rdma.remote_addr = remote_address;
/* the opcode may have been changed by an atomic operation */
frag->sr_desc.opcode = IBV_WR_RDMA_READ;
#if OPAL_ENABLE_HETEROGENEOUS_SUPPORT
if((ep->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN)
!= (opal_proc_local_get()->proc_arch & OPAL_ARCH_ISBIGENDIAN)) {
frag->sr_desc.wr.rdma.rkey = opal_swap_bytes4 (remote_handle->rkey);
} else
#endif
{
frag->sr_desc.wr.rdma.rkey = remote_handle->rkey;
}
if (ep->endpoint_state != MCA_BTL_IB_CONNECTED) {
OPAL_THREAD_LOCK(&ep->endpoint_lock);
rc = check_endpoint_state(ep, &to_base_frag(frag)->base, &ep->pending_get_frags);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
if (OPAL_ERR_RESOURCE_BUSY == rc) {
return OPAL_SUCCESS;
}
if (OPAL_SUCCESS != rc) {
MCA_BTL_IB_FRAG_RETURN (frag);
return rc;
}
}
rc = mca_btl_openib_get_internal (btl, ep, frag);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
if (OPAL_LIKELY(OPAL_ERR_OUT_OF_RESOURCE == rc)) {
rc = OPAL_SUCCESS;
OPAL_THREAD_LOCK(&ep->endpoint_lock);
opal_list_append(&ep->pending_get_frags, (opal_list_item_t*)frag);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
} else {
MCA_BTL_IB_FRAG_RETURN (frag);
}
}
return rc;
}
int mca_btl_openib_get_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
mca_btl_openib_get_frag_t *frag)
{
int qp = to_base_frag(frag)->base.order;
struct ibv_send_wr *bad_wr;
#if HAVE_XRC
if (MCA_BTL_XRC_ENABLED && BTL_OPENIB_QP_TYPE_XRC(qp)) {
/* NTH: the remote SRQ number is only available once the endpoint is connected. By
* setting the value here instead of mca_btl_openib_get we guarantee the rem_srqs
* array is initialized. */
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
frag->sr_desc.qp_type.xrc.remote_srqn = ep->rem_info.rem_srqs[qp].rem_srq_num;
#else
frag->sr_desc.xrc_remote_srq_num = ep->rem_info.rem_srqs[qp].rem_srq_num;
#endif
}
#endif
/* check for a send wqe */
if (qp_get_wqe(ep, qp) < 0) {
qp_put_wqe(ep, qp);
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* check for a get token */
if (OPAL_THREAD_ADD_FETCH32(&ep->get_tokens,-1) < 0) {
qp_put_wqe(ep, qp);
OPAL_THREAD_ADD_FETCH32(&ep->get_tokens,1);
return OPAL_ERR_OUT_OF_RESOURCE;
}
qp_inflight_wqe_to_frag(ep, qp, to_com_frag(frag));
qp_reset_signal_count(ep, qp);
if (ibv_post_send(ep->qps[qp].qp->lcl_qp, &frag->sr_desc, &bad_wr)) {
qp_put_wqe(ep, qp);
OPAL_THREAD_ADD_FETCH32(&ep->get_tokens,1);
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}

View file

@ -1,664 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2008 Mellanox Technologies. All rights reserved.
* Copyright (c) 2012-2017 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved
* Copyright (c) 2014-2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <string.h>
#include <ctype.h>
#include <stdlib.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#include "opal/util/show_help.h"
#include "opal/util/string_copy.h"
#include "btl_openib.h"
#include "btl_openib_lex.h"
#include "btl_openib_ini.h"
static const char *ini_filename = NULL;
static bool initialized = false;
static opal_list_t devices;
static char *key_buffer = NULL;
static size_t key_buffer_len = 0;
/*
* Struct to hold the section name, vendor ID, and list of vendor part
* ID's and a corresponding set of values (parsed from an INI file).
*/
typedef struct parsed_section_values_t {
char *name;
uint32_t *vendor_ids;
int vendor_ids_len;
uint32_t *vendor_part_ids;
int vendor_part_ids_len;
opal_btl_openib_ini_values_t values;
} parsed_section_values_t;
/*
* Struct to hold the final values. Different from above in a few ways:
*
* - The vendor and part IDs will always be set properly
* - There will only be one part ID (i.e., the above struct is
* exploded into multiple of these for ease of searching)
* - There is a super of opal_list_item_t so that we can have a list
* of these
*/
typedef struct device_values_t {
opal_list_item_t super;
char *section_name;
uint32_t vendor_id;
uint32_t vendor_part_id;
opal_btl_openib_ini_values_t values;
} device_values_t;
static void device_values_constructor(device_values_t *s);
static void device_values_destructor(device_values_t *s);
OBJ_CLASS_INSTANCE(device_values_t,
opal_list_item_t,
device_values_constructor,
device_values_destructor);
/*
* Local functions
*/
static int parse_file(char *filename);
static int parse_line(parsed_section_values_t *item);
static void reset_section(bool had_previous_value, parsed_section_values_t *s);
static void reset_values(opal_btl_openib_ini_values_t *v);
static int save_section(parsed_section_values_t *s);
/*
* Read the INI files for device-specific values and save them in
* internal data structures for later lookup.
*/
int opal_btl_openib_ini_init(void)
{
int ret = OPAL_ERR_NOT_FOUND;
char *colon;
char separator = ':';
OBJ_CONSTRUCT(&devices, opal_list_t);
colon = strchr(mca_btl_openib_component.device_params_file_names, separator);
if (NULL == colon) {
/* If we've only got 1 file (i.e., no colons found), parse it
and be done */
ret = parse_file(mca_btl_openib_component.device_params_file_names);
} else {
/* Otherwise, loop over all the files and parse them */
char *orig = strdup(mca_btl_openib_component.device_params_file_names);
char *str = orig;
while (NULL != (colon = strchr(str, ':'))) {
*colon = '\0';
ret = parse_file(str);
/* Note that NOT_FOUND and SUCCESS are not fatal errors
and we keep going. Other errors are treated as
fatal */
if (OPAL_ERR_NOT_FOUND != ret && OPAL_SUCCESS != ret) {
break;
}
str = colon + 1;
}
/* Parse the last file if we didn't have a fatal error above */
if (OPAL_ERR_NOT_FOUND == ret || OPAL_SUCCESS == ret) {
ret = parse_file(str);
}
/* All done */
free(orig);
}
/* Return SUCCESS unless we got a fatal error */
initialized = true;
return (OPAL_SUCCESS == ret || OPAL_ERR_NOT_FOUND == ret) ?
OPAL_SUCCESS : ret;
}
/*
* The component found a device and is querying to see if an INI file
* specified any parameters for it.
*/
int opal_btl_openib_ini_query(uint32_t vendor_id, uint32_t vendor_part_id,
opal_btl_openib_ini_values_t *values)
{
int ret;
device_values_t *h;
if (!initialized) {
if (OPAL_SUCCESS != (ret = opal_btl_openib_ini_init())) {
return ret;
}
}
if (mca_btl_openib_component.verbose) {
BTL_OUTPUT(("Querying INI files for vendor 0x%04x, part ID %d",
vendor_id, vendor_part_id));
}
reset_values(values);
/* Iterate over all the saved devices */
OPAL_LIST_FOREACH(h, &devices, device_values_t) {
if (vendor_id == h->vendor_id &&
vendor_part_id == h->vendor_part_id) {
/* Found it! */
/* NOTE: There is a bug in the PGI 6.2 series that causes
the compiler to choke when copying structs containing
bool members by value. So do a memcpy here instead. */
memcpy(values, &h->values, sizeof(h->values));
if (mca_btl_openib_component.verbose) {
BTL_OUTPUT(("Found corresponding INI values: %s",
h->section_name));
}
return OPAL_SUCCESS;
}
}
/* If we fall through to here, we didn't find it */
if (mca_btl_openib_component.verbose) {
BTL_OUTPUT(("Did not find corresponding INI values"));
}
return OPAL_ERR_NOT_FOUND;
}
/*
* The component is shutting down; release all internal state
*/
int opal_btl_openib_ini_finalize(void)
{
if (initialized) {
OPAL_LIST_DESTRUCT(&devices);
initialized = false;
}
return OPAL_SUCCESS;
}
/**************************************************************************/
/*
* Parse a single file
*/
static int parse_file(char *filename)
{
int val;
int ret = OPAL_SUCCESS;
bool showed_no_section_warning = false;
bool showed_unexpected_tokens_warning = false;
parsed_section_values_t section;
reset_section(false, &section);
/* Open the file */
ini_filename = filename;
btl_openib_ini_yyin = fopen(filename, "r");
if (NULL == btl_openib_ini_yyin) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:file not found",
true, filename);
ret = OPAL_ERR_NOT_FOUND;
goto cleanup;
}
/* Do the parsing */
btl_openib_ini_parse_done = false;
btl_openib_ini_yynewlines = 1;
btl_openib_ini_init_buffer(btl_openib_ini_yyin);
while (!btl_openib_ini_parse_done) {
val = btl_openib_ini_yylex();
switch (val) {
case BTL_OPENIB_INI_PARSE_DONE:
/* This will also set btl_openib_ini_parse_done to true, so just
break here */
break;
case BTL_OPENIB_INI_PARSE_NEWLINE:
/* blank line! ignore it */
break;
case BTL_OPENIB_INI_PARSE_SECTION:
/* We're starting a new section; if we have previously
parsed a section, go see if we can use its values. */
save_section(&section);
reset_section(true, &section);
section.name = strdup(btl_openib_ini_yytext);
break;
case BTL_OPENIB_INI_PARSE_SINGLE_WORD:
if (NULL == section.name) {
/* Warn that there is no current section, and ignore
this parameter */
if (!showed_no_section_warning) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:not in a section", true);
showed_no_section_warning = true;
}
/* Parse it and then dump it */
parse_line(&section);
reset_section(true, &section);
} else {
parse_line(&section);
}
break;
default:
/* anything else is an error */
if (!showed_unexpected_tokens_warning) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:unexpected token", true);
showed_unexpected_tokens_warning = true;
}
break;
}
}
save_section(&section);
fclose(btl_openib_ini_yyin);
btl_openib_ini_yylex_destroy ();
cleanup:
reset_section(true, &section);
if (NULL != key_buffer) {
free(key_buffer);
key_buffer = NULL;
key_buffer_len = 0;
}
return ret;
}
/*
* Parse a single line in the INI file
*/
static int parse_line(parsed_section_values_t *sv)
{
int val, ret = OPAL_SUCCESS;
char *value = NULL;
bool showed_unknown_field_warning = false;
/* Save the key name */
if (key_buffer_len < strlen(btl_openib_ini_yytext) + 1) {
char *tmp;
key_buffer_len = strlen(btl_openib_ini_yytext) + 1;
tmp = (char *) realloc(key_buffer, key_buffer_len);
if (NULL == tmp) {
free(key_buffer);
key_buffer_len = 0;
key_buffer = NULL;
return OPAL_ERR_TEMP_OUT_OF_RESOURCE;
}
key_buffer = tmp;
}
opal_string_copy(key_buffer, btl_openib_ini_yytext, key_buffer_len);
/* The first thing we have to see is an "=" */
val = btl_openib_ini_yylex();
if (btl_openib_ini_parse_done || BTL_OPENIB_INI_PARSE_EQUAL != val) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:expected equals", true);
return OPAL_ERROR;
}
/* Next we get the value */
val = btl_openib_ini_yylex();
if (BTL_OPENIB_INI_PARSE_SINGLE_WORD != val && BTL_OPENIB_INI_PARSE_VALUE != val) {
return OPAL_ERROR;
}
value = strdup(btl_openib_ini_yytext);
/* Now we need to see the newline */
val = btl_openib_ini_yylex();
/* If we did not get EOL or EOF, something is wrong */
if (BTL_OPENIB_INI_PARSE_NEWLINE != val && BTL_OPENIB_INI_PARSE_DONE != val) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:expected newline", true);
free(value);
return OPAL_ERROR;
}
/* Ok, we got a good parse. Now figure out what it is and save
the value. Note that the flex already took care of trimming
all whitespace at the beginning and ending of the value. */
if (0 == strcasecmp(key_buffer, "vendor_id")) {
if (OPAL_SUCCESS != (ret = opal_btl_openib_ini_intify_list(value, &sv->vendor_ids,
&sv->vendor_ids_len))) {
/* Do not leak the strdup'ed value on the error path */
free(value);
return ret;
}
}
else if (0 == strcasecmp(key_buffer, "vendor_part_id")) {
if (OPAL_SUCCESS != (ret = opal_btl_openib_ini_intify_list(value, &sv->vendor_part_ids,
&sv->vendor_part_ids_len))) {
/* Do not leak the strdup'ed value on the error path */
free(value);
return ret;
}
}
else if (0 == strcasecmp(key_buffer, "mtu")) {
/* Single value */
sv->values.mtu = (uint32_t) opal_btl_openib_ini_intify(value);
sv->values.mtu_set = true;
}
else if (0 == strcasecmp(key_buffer, "use_eager_rdma")) {
/* Single value */
sv->values.use_eager_rdma = (uint32_t) opal_btl_openib_ini_intify(value);
sv->values.use_eager_rdma_set = true;
}
else if (0 == strcasecmp(key_buffer, "receive_queues")) {
/* Single value (already strdup'ed) */
sv->values.receive_queues = value;
value = NULL;
}
else if (0 == strcasecmp(key_buffer, "max_inline_data")) {
/* Single value */
sv->values.max_inline_data = (int32_t) opal_btl_openib_ini_intify(value);
sv->values.max_inline_data_set = true;
}
else if (0 == strcasecmp(key_buffer, "rdmacm_reject_causes_connect_error")) {
/* Single value */
sv->values.rdmacm_reject_causes_connect_error =
(bool) opal_btl_openib_ini_intify(value);
sv->values.rdmacm_reject_causes_connect_error_set = true;
}
else if (0 == strcasecmp(key_buffer, "ignore_device")) {
/* Single value */
sv->values.ignore_device = (bool) opal_btl_openib_ini_intify(value);
sv->values.ignore_device_set = true;
}
else {
/* Have no idea what this parameter is. Not an error -- just
ignore it */
if (!showed_unknown_field_warning) {
opal_show_help("help-mpi-btl-openib.txt",
"ini file:unknown field", true,
ini_filename, btl_openib_ini_yynewlines,
key_buffer);
showed_unknown_field_warning = true;
}
}
/* All done */
if (NULL != value) {
free(value);
}
return ret;
}
/*
* Construct a device_values_t and set all of its values to known states
*/
static void device_values_constructor(device_values_t *s)
{
s->section_name = NULL;
s->vendor_id = 0;
s->vendor_part_id = 0;
reset_values(&s->values);
}
/*
* Destruct a device_values_t and free any memory that it has
*/
static void device_values_destructor(device_values_t *s)
{
if (NULL != s->section_name) {
free(s->section_name);
}
if (NULL != s->values.receive_queues) {
free(s->values.receive_queues);
}
}
/*
* Reset a parsed section; free any memory that it may have had
*/
static void reset_section(bool had_previous_value, parsed_section_values_t *s)
{
if (had_previous_value) {
if (NULL != s->name) {
free(s->name);
}
if (NULL != s->vendor_ids) {
free(s->vendor_ids);
}
if (NULL != s->vendor_part_ids) {
free(s->vendor_part_ids);
}
}
s->name = NULL;
s->vendor_ids = NULL;
s->vendor_ids_len = 0;
s->vendor_part_ids = NULL;
s->vendor_part_ids_len = 0;
reset_values(&s->values);
}
/*
* Reset the values to known states
*/
static void reset_values(opal_btl_openib_ini_values_t *v)
{
v->mtu = 0;
v->mtu_set = false;
v->use_eager_rdma = 0;
v->use_eager_rdma_set = false;
v->receive_queues = NULL;
v->max_inline_data = 0;
v->max_inline_data_set = false;
v->rdmacm_reject_causes_connect_error = false;
v->rdmacm_reject_causes_connect_error_set = false;
v->ignore_device = false;
v->ignore_device_set = false;
}
/*
* If we have a valid section, see if we have a matching section
* somewhere (i.e., same vendor ID and vendor part ID). If we do,
* update the values. If not, save the values in a new instance and
* add it to the list.
*/
static int save_section(parsed_section_values_t *s)
{
int i, j;
device_values_t *h;
bool found;
/* Is the parsed section valid? */
if (NULL == s->name || 0 == s->vendor_ids_len ||
0 == s->vendor_part_ids_len) {
return OPAL_ERR_BAD_PARAM;
}
/* Iterate over each of the vendor/part IDs in the parsed
values */
for (i = 0; i < s->vendor_ids_len; ++i) {
for (j = 0; j < s->vendor_part_ids_len; ++j) {
found = false;
/* Iterate over all the saved devices */
OPAL_LIST_FOREACH(h, &devices, device_values_t) {
if (s->vendor_ids[i] == h->vendor_id &&
s->vendor_part_ids[j] == h->vendor_part_id) {
/* Found a match. Update any newly-set values. */
if (s->values.mtu_set) {
h->values.mtu = s->values.mtu;
h->values.mtu_set = true;
}
if (s->values.use_eager_rdma_set) {
h->values.use_eager_rdma = s->values.use_eager_rdma;
h->values.use_eager_rdma_set = true;
}
if (NULL != s->values.receive_queues) {
/* Free any previous string so that it is not leaked */
free(h->values.receive_queues);
h->values.receive_queues =
strdup(s->values.receive_queues);
}
if (s->values.max_inline_data_set) {
h->values.max_inline_data = s->values.max_inline_data;
h->values.max_inline_data_set = true;
}
if (s->values.rdmacm_reject_causes_connect_error_set) {
h->values.rdmacm_reject_causes_connect_error =
s->values.rdmacm_reject_causes_connect_error;
h->values.rdmacm_reject_causes_connect_error_set =
true;
}
if (s->values.ignore_device_set) {
h->values.ignore_device = s->values.ignore_device;
h->values.ignore_device_set = true;
}
found = true;
break;
}
}
/* Did we find/update it in the existing list? If not,
create a new one. */
if (!found) {
h = OBJ_NEW(device_values_t);
h->section_name = strdup(s->name);
h->vendor_id = s->vendor_ids[i];
h->vendor_part_id = s->vendor_part_ids[j];
/* NOTE: There is a bug in the PGI 6.2 series that
causes the compiler to choke when copying structs
containing bool members by value. So do a memcpy
here instead. */
memcpy(&h->values, &s->values, sizeof(s->values));
/* Need to strdup the string, though */
if (NULL != s->values.receive_queues) {
h->values.receive_queues = strdup(s->values.receive_queues);
}
opal_list_append(&devices, &h->super);
}
}
}
/* All done */
return OPAL_SUCCESS;
}
/*
* Do string-to-integer conversion, for both hex and decimal numbers
*/
int opal_btl_openib_ini_intify(char *str)
{
return strtol (str, NULL, 0);
}
/*
* Take a comma-delimited list and intify them all
*/
int opal_btl_openib_ini_intify_list(char *value, uint32_t **values, int *len)
{
char *comma;
char *str = value;
*len = 0;
/* Comma-delimited list of values */
comma = strchr(str, ',');
if (NULL == comma) {
/* If we only got one value (i.e., no comma found), then
just make an array of one value and save it */
*values = (uint32_t *) malloc(sizeof(uint32_t));
if (NULL == *values) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
(*values)[0] = (uint32_t) opal_btl_openib_ini_intify(str);
*len = 1;
} else {
int newsize = 1;
/* Count how many values there are and allocate enough space
for them */
while (NULL != comma) {
++newsize;
str = comma + 1;
comma = strchr(str, ',');
}
*values = (uint32_t *) malloc(sizeof(uint32_t) * newsize);
if (NULL == *values) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* Iterate over the values and save them */
str = value;
comma = strchr(str, ',');
while (NULL != comma) {
*comma = '\0';
(*values)[*len] = (uint32_t) opal_btl_openib_ini_intify(str);
++(*len);
str = comma + 1;
comma = strchr(str, ',');
}
/* Get the last value (i.e., the value after the last
comma, because it won't have been snarfed in the
loop) */
(*values)[*len] = (uint32_t) opal_btl_openib_ini_intify(str);
++(*len);
}
return OPAL_SUCCESS;
}
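
opal_btl_openib_ini_intify_list() splits its input in place by overwriting commas with NUL bytes and does not check whether each token held any digits. As an illustration only, the same comma-delimited parsing can be sketched without modifying the input and with basic validation. The function name parse_uint32_list and its return convention (0 on success, -1 on failure) are hypothetical and are not part of Open MPI.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a comma-delimited list of decimal or 0x-prefixed hex integers.
 * The input string is not modified. On success, *out receives a
 * malloc'ed array (the caller frees it) and *count its length. */
static int parse_uint32_list(const char *text, uint32_t **out, size_t *count)
{
    assert(text != NULL && out != NULL && count != NULL);

    /* There is one more value than there are commas. */
    size_t n = 1;
    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == ',') {
            ++n;
        }
    }

    uint32_t *vals = malloc(n * sizeof(*vals));
    if (vals == NULL) {
        return -1;
    }

    const char *cursor = text;
    for (size_t i = 0; i < n; ++i) {
        char *end = NULL;
        /* Base 0 lets strtoul auto-detect decimal, hex (0x) and octal. */
        unsigned long v = strtoul(cursor, &end, 0);
        if (end == cursor) {            /* token contained no digits */
            free(vals);
            return -1;
        }
        while (*end == ' ' || *end == '\t') {
            ++end;                      /* tolerate trailing blanks */
        }
        /* A comma must follow every value except the last one. */
        if ((i + 1 < n && *end != ',') || (i + 1 == n && *end != '\0')) {
            free(vals);
            return -1;
        }
        vals[i] = (uint32_t) v;
        cursor = end + 1;
    }

    *out = vals;
    *count = n;
    return 0;
}
```

Counting commas up front keeps the allocation to a single malloc, which mirrors the two-pass structure of the original function.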

View file

@ -1,70 +0,0 @@
/*
* Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2008 Mellanox Technologies. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_PTL_IB_PARAMS_H
#define MCA_PTL_IB_PARAMS_H
#include "btl_openib.h"
/*
* Struct to hold the settable values that may be specified in the INI
* file
*/
typedef struct opal_btl_openib_ini_values_t {
uint32_t mtu;
bool mtu_set;
uint32_t use_eager_rdma;
bool use_eager_rdma_set;
char *receive_queues;
int32_t max_inline_data;
bool max_inline_data_set;
bool rdmacm_reject_causes_connect_error;
bool rdmacm_reject_causes_connect_error_set;
bool ignore_device;
bool ignore_device_set;
} opal_btl_openib_ini_values_t;
BEGIN_C_DECLS
/**
* Read in the INI files containing device params
*/
int opal_btl_openib_ini_init(void);
/**
* Query the read-in params for a given device
*/
int opal_btl_openib_ini_query(uint32_t vendor_id,
uint32_t vendor_part_id,
opal_btl_openib_ini_values_t *values);
/**
* Shut down / release all internal state
*/
int opal_btl_openib_ini_finalize(void);
/**
* String-to-int converters with decimal/hex autodetection
*/
int opal_btl_openib_ini_intify(char *string);
int opal_btl_openib_ini_intify_list(char *str, uint32_t **values, int *len);
END_C_DECLS
#endif

View file

@ -1,433 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2008 Chelsio, Inc. All rights reserved.
* Copyright (c) 2008-2010 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
* Copyright (c) 2017 Los Alamos National Security, LLC. All rights
* reserved.
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#include "opal_config.h"
#include <infiniband/verbs.h>
#if OPAL_HAVE_RDMACM
#include <rdma/rdma_cma.h>
#include <stdlib.h>
#include <stdio.h>
#include "opal/util/argv.h"
#include "opal/util/if.h"
#include "opal/util/proc.h"
#include "opal/util/show_help.h"
#include "connect/connect.h"
#endif
/* Always want to include this file */
#include "btl_openib_endpoint.h"
#include "btl_openib_ip.h"
#if OPAL_HAVE_RDMACM
/*
* The cruft below maintains the linked list of rdma ipv4 addresses and their
* associated rdma device names and device port numbers.
*/
struct rdma_addr_list {
opal_list_item_t super;
uint32_t addr;
uint32_t subnet;
char addr_str[16];
char dev_name[IBV_SYSFS_NAME_MAX];
uint8_t dev_port;
};
typedef struct rdma_addr_list rdma_addr_list_t;
static OBJ_CLASS_INSTANCE(rdma_addr_list_t, opal_list_item_t,
NULL, NULL);
static opal_list_t *myaddrs = NULL;
#if OPAL_ENABLE_DEBUG
static char *stringify(uint32_t addr)
{
static char line[64];
memset(line, 0, sizeof(line));
snprintf(line, sizeof(line) - 1, "%d.%d.%d.%d (0x%x)",
#if defined(WORDS_BIGENDIAN)
(addr >> 24),
(addr >> 16) & 0xff,
(addr >> 8) & 0xff,
addr & 0xff,
#else
addr & 0xff,
(addr >> 8) & 0xff,
(addr >> 16) & 0xff,
(addr >> 24),
#endif
addr);
return line;
}
#endif
/* Note that each device port can have multiple IP addresses
* associated with it (aka IP aliasing). However, the openib module
* only knows about (device,port) tuples -- not IP addresses (only the
* RDMA CM CPC knows which IP addresses are associated with each
* (device,port) tuple). Thus, any searching of device list for the
* IP Address or subnets may not work as one might expect. The
* current behavior is to return the IP address (or subnet) of the
* *first* instance of the device on the list. This behavior is
* uniform for subnet and IP addresses and thus should not cause any
* mismatches. If this behavior is not preferred by the user, the MCA
* parameters to include/exclude specific IP addresses can be used to
* precisely specify which addresses are used (e.g., to effect
* specific subnet routing).
*/
uint64_t mca_btl_openib_get_ip_subnet_id(struct ibv_device *ib_dev,
uint8_t port)
{
struct rdma_addr_list *addr;
/* In the off chance that the user forces a non-RDMACM CPC and an
* IP-based mechanism, the list will be uninitialized. Return 0
* to prevent crashes, and the lack of it actually working will be
* caught at a later stage.
*/
if (NULL == myaddrs) {
return 0;
}
OPAL_LIST_FOREACH(addr, myaddrs, struct rdma_addr_list) {
if (!strcmp(addr->dev_name, ib_dev->name) &&
port == addr->dev_port) {
return addr->subnet;
}
}
return 0;
}
/* This function should not be necessary, as rdma_get_local_addr would
* be more correct in returning the IP address given the cm_id (and
* not necessitate having to do a list look up). Unfortunately, the
* subnet and IP address look up needs to match or there could be a
* mismatch if IP Aliases are being used. For more information on
* this, please read comment above mca_btl_openib_get_ip_subnet_id.
*/
uint32_t mca_btl_openib_rdma_get_ipv4addr(struct ibv_context *verbs,
uint8_t port)
{
struct rdma_addr_list *addr;
/* Sanity check */
if (NULL == myaddrs) {
return 0;
}
BTL_VERBOSE(("Looking for %s:%d in IP address list",
ibv_get_device_name(verbs->device), port));
OPAL_LIST_FOREACH(addr, myaddrs, struct rdma_addr_list) {
if (!strcmp(addr->dev_name, verbs->device->name) &&
port == addr->dev_port) {
BTL_VERBOSE(("FOUND: %s:%d is %s",
ibv_get_device_name(verbs->device), port,
stringify(addr->addr)));
return addr->addr;
}
}
return 0;
}
static int dev_specified(char *name, int port)
{
char **list;
if (NULL != mca_btl_openib_component.if_include) {
int i;
list = opal_argv_split(mca_btl_openib_component.if_include, ',');
for (i = 0; NULL != list[i]; i++) {
char **temp = opal_argv_split(list[i], ':');
if (0 == strcmp(name, temp[0]) &&
(NULL == temp[1] || port == atoi(temp[1]))) {
return 0;
}
}
return 1;
}
if (NULL != mca_btl_openib_component.if_exclude) {
int i;
list = opal_argv_split(mca_btl_openib_component.if_exclude, ',');
for (i = 0; NULL != list[i]; i++) {
char **temp = opal_argv_split(list[i], ':');
if (0 == strcmp(name, temp[0]) &&
(NULL == temp[1] || port == atoi(temp[1]))) {
return 1;
}
}
}
return 0;
}
static int ipaddr_specified(struct sockaddr_in *ipaddr, uint32_t netmask)
{
uint32_t all = ~((uint32_t) 0);
if (NULL != mca_btl_openib_component.ipaddr_include) {
char **list;
int i;
list = opal_argv_split(mca_btl_openib_component.ipaddr_include, ',');
for (i = 0; NULL != list[i]; i++) {
uint32_t subnet, list_subnet;
struct in_addr ipae;
char **temp = opal_argv_split(list[i], '/');
if (NULL == temp || NULL == temp[0] || NULL == temp[1] ||
NULL != temp[2]) {
opal_show_help("help-mpi-btl-openib.txt",
"invalid ipaddr_inexclude", true, "include",
opal_process_info.nodename, list[i],
"Invalid specification (missing \"/\")");
if (NULL != temp) {
opal_argv_free(temp);
}
continue;
}
if (1 != inet_pton(ipaddr->sin_family, temp[0], &ipae)) {
opal_show_help("help-mpi-btl-openib.txt",
"invalid ipaddr_inexclude", true, "include",
opal_process_info.nodename, list[i],
"Invalid specification (inet_pton() failed)");
opal_argv_free(temp);
continue;
}
list_subnet = ntohl(ipae.s_addr) & ~(all >> atoi(temp[1]));
subnet = ntohl(ipaddr->sin_addr.s_addr) & ~(all >> netmask);
opal_argv_free(temp);
if (subnet == list_subnet) {
return 0;
}
}
return 1;
}
if (NULL != mca_btl_openib_component.ipaddr_exclude) {
char **list;
int i;
list = opal_argv_split(mca_btl_openib_component.ipaddr_exclude, ',');
for (i = 0; NULL != list[i]; i++) {
uint32_t subnet, list_subnet;
struct in_addr ipae;
char **temp = opal_argv_split(list[i], '/');
if (NULL == temp || NULL == temp[0] || NULL == temp[1] ||
NULL != temp[2]) {
opal_show_help("help-mpi-btl-openib.txt",
"invalid ipaddr_inexclude", true, "exclude",
opal_process_info.nodename, list[i],
"Invalid specification (missing \"/\")");
if (NULL != temp) {
opal_argv_free(temp);
}
continue;
}
if (1 != inet_pton(ipaddr->sin_family, temp[0], &ipae)) {
opal_show_help("help-mpi-btl-openib.txt",
"invalid ipaddr_inexclude", true, "exclude",
opal_process_info.nodename, list[i],
"Invalid specification (inet_pton() failed)");
opal_argv_free(temp);
continue;
}
list_subnet = ntohl(ipae.s_addr) & ~(all >> atoi(temp[1]));
subnet = ntohl(ipaddr->sin_addr.s_addr) & ~(all >> netmask);
opal_argv_free(temp);
if (subnet == list_subnet) {
return 1;
}
}
}
return 0;
}
static int add_rdma_addr(struct sockaddr *ipaddr, uint32_t netmask)
{
struct sockaddr_in *sinp;
struct rdma_cm_id *cm_id;
struct rdma_event_channel *ch;
int rc = OPAL_SUCCESS;
struct rdma_addr_list *myaddr;
uint32_t all = ~((uint32_t) 0);
/* Ensure that this IP address is not in 127.0.0.1/8. If it is,
skip it because we never want loopback addresses to be
considered RDMA devices that remote peers can use to connect
to.
This check is necessary because of a change that almost went
into RDMA CM in OFED 1.5.1. We asked for a delay so that we
could get a release of Open MPI out that includes the
127-ignoring logic; hence, this change will likely be in a
future version of OFED (perhaps OFED 1.6?).
OMPI uses rdma_bind_addr() to determine if a local IP address
is an RDMA device or not. If it succeeds and we get a non-NULL
verbs pointer back in the return, we say that it's a valid RDMA
device. Up through OFED 1.5, rdma_bind_addr(127.0.0.1), would
succeed, but the verbs pointer returned would be NULL. Hence,
we knew it was loopback, and therefore we skipped it.
The proposed RDMA CM change would return a non-NULL/valid verbs
pointer when binding to 127.0.0.1/8. This, of course, screws
up OMPI because we then advertise 127.0.0.1 in the modex as an
address that remote peers can use to contact this process via
RDMA. Hence, we have to specifically exclude 127.0.0.1/8 --
don't even bother trying to rdma_bind_addr() to it because we
know we don't want loopback addresses at all. */
sinp = (struct sockaddr_in *)ipaddr;
if ((sinp->sin_addr.s_addr & htonl(0xff000000)) == htonl(0x7f000000)) {
rc = OPAL_SUCCESS;
goto out1;
}
ch = rdma_create_event_channel();
if (NULL == ch) {
BTL_VERBOSE(("failed creating RDMA CM event channel"));
rc = OPAL_ERROR;
goto out1;
}
rc = rdma_create_id(ch, &cm_id, NULL, RDMA_PS_TCP);
if (rc) {
BTL_VERBOSE(("rdma_create_id returned %d", rc));
rc = OPAL_ERROR;
goto out2;
}
/* Bind the newly created cm_id to the IP address. This will,
amongst other things, verify that the device is verbs
capable */
rc = rdma_bind_addr(cm_id, ipaddr);
if (rc || !cm_id->verbs) {
rc = OPAL_SUCCESS;
goto out3;
}
/* Verify that the device has not been excluded */
rc = dev_specified(cm_id->verbs->device->name, cm_id->port_num);
if (rc) {
rc = OPAL_SUCCESS;
goto out3;
}
/* Verify that the device has a valid IP address */
if (0 == ((struct sockaddr_in *)ipaddr)->sin_addr.s_addr ||
ipaddr_specified((struct sockaddr_in *)ipaddr, netmask)) {
rc = OPAL_SUCCESS;
goto out3;
}
myaddr = OBJ_NEW(rdma_addr_list_t);
if (NULL == myaddr) {
BTL_ERROR(("malloc failed!"));
rc = OPAL_ERROR;
goto out3;
}
myaddr->addr = sinp->sin_addr.s_addr;
myaddr->subnet = ntohl(myaddr->addr) & ~(all >> netmask);
inet_ntop(sinp->sin_family, &sinp->sin_addr,
myaddr->addr_str, sizeof(myaddr->addr_str));
memcpy(myaddr->dev_name, cm_id->verbs->device->name, IBV_SYSFS_NAME_MAX);
myaddr->dev_port = cm_id->port_num;
BTL_VERBOSE(("Adding addr %s (0x%x) subnet 0x%x as %s:%d",
myaddr->addr_str, myaddr->addr, myaddr->subnet,
myaddr->dev_name, myaddr->dev_port));
opal_list_append(myaddrs, &(myaddr->super));
out3:
rdma_destroy_id(cm_id);
out2:
rdma_destroy_event_channel(ch);
out1:
return rc;
}
int mca_btl_openib_build_rdma_addr_list(void)
{
int rc = OPAL_SUCCESS, i;
myaddrs = OBJ_NEW(opal_list_t);
if (NULL == myaddrs) {
BTL_ERROR(("malloc failed!"));
return OPAL_ERROR;
}
for (i = opal_ifbegin(); i >= 0; i = opal_ifnext(i)) {
struct sockaddr ipaddr;
uint32_t netmask;
opal_ifindextoaddr(i, &ipaddr, sizeof(struct sockaddr));
opal_ifindextomask(i, &netmask, sizeof(uint32_t));
if (ipaddr.sa_family == AF_INET) {
rc = add_rdma_addr(&ipaddr, netmask);
if (OPAL_SUCCESS != rc) {
break;
}
}
}
return rc;
}
void mca_btl_openib_free_rdma_addr_list(void)
{
if (NULL != myaddrs) {
OPAL_LIST_RELEASE(myaddrs);
myaddrs = NULL;
}
}
#else
/* !OPAL_HAVE_RDMACM case */
uint64_t mca_btl_openib_get_ip_subnet_id(struct ibv_device *ib_dev,
uint8_t port)
{
return 0;
}
uint32_t mca_btl_openib_rdma_get_ipv4addr(struct ibv_context *verbs,
uint8_t port)
{
return 0;
}
int mca_btl_openib_build_rdma_addr_list(void)
{
return OPAL_SUCCESS;
}
void mca_btl_openib_free_rdma_addr_list(void)
{
}
#endif
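The include/exclude filtering that dev_specified() applies to "device[:port]" lists can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the openib BTL's code: spec_list_matches() and device_excluded() are hypothetical names, the lists are passed in as plain strings rather than read from mca_btl_openib_component, and the entries are parsed in place so that no temporary arrays need to be freed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if (name, port) matches an entry of a
 * comma-separated list of "device" or "device:port" specifications.
 * An entry without a port matches every port of that device. */
static int spec_list_matches(const char *list, const char *name, int port)
{
    const char *p = list;
    size_t name_len = strlen(name);

    while ('\0' != *p) {
        size_t len = strcspn(p, ",");            /* length of this entry */
        const char *colon = memchr(p, ':', len); /* optional ":port" part */
        size_t dev_len = (NULL != colon) ? (size_t) (colon - p) : len;

        if (dev_len == name_len && 0 == strncmp(p, name, name_len)) {
            if (NULL == colon || atoi(colon + 1) == port) {
                return 1;
            }
        }
        p += len;
        if (',' == *p) {
            ++p;                                 /* skip the separator */
        }
    }
    return 0;
}

/* Hypothetical helper mirroring the return convention of dev_specified():
 * 0 means the device may be used, 1 means it is filtered out.  As in the
 * original, an include list takes precedence over an exclude list. */
static int device_excluded(const char *include, const char *exclude,
                           const char *name, int port)
{
    if (NULL != include) {
        return !spec_list_matches(include, name, port);
    }
    if (NULL != exclude) {
        return spec_list_matches(exclude, name, port);
    }
    return 0;
}
```

For example, with an include list of "mlx4_0:1" (an illustrative device name), device_excluded() reports port 1 of mlx4_0 as usable and every other device or port as filtered out.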

View file

@ -1,55 +0,0 @@
/*
* Copyright (c) 2008 Chelsio, Inc. All rights reserved.
* Copyright (c) 2008-2010 Cisco Systems, Inc. All rights reserved.
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_BTL_OPENIB_IP_H
#define MCA_BTL_OPENIB_IP_H
#include "opal_config.h"
BEGIN_C_DECLS
/**
* Get an IP equivalent of a subnet ID.
*
* @param ib_dev (IN) IBV device
* @param port (IN) physical port of the IBV device
* @return Value of the IPv4 Address bitwise-and'ed with the Netmask
*/
extern uint64_t mca_btl_openib_get_ip_subnet_id(struct ibv_device *ib_dev,
uint8_t port);
/**
* Get the IPv4 address of the specified HCA/RNIC device and physical port.
*
* @param verbs (IN) cm_id verbs of the IBV device
* @param port (IN) physical port of the IBV device
* @return IPv4 Address
*/
extern uint32_t mca_btl_openib_rdma_get_ipv4addr(struct ibv_context *verbs,
uint8_t port);
/**
* Create a list of all available IBV devices and each device's
* relevant information. This is necessary for
* mca_btl_openib_rdma_get_ipv4addr to work.
*
* @return OPAL_SUCCESS or failure status
*/
extern int mca_btl_openib_build_rdma_addr_list(void);
/**
* Free the list of all available IBV devices created by
* mca_btl_openib_build_rdma_addr_list.
*/
extern void mca_btl_openib_free_rdma_addr_list(void);
END_C_DECLS
#endif
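The subnet ID documented above is the IPv4 address bitwise-and'ed with its netmask, and the implementation derives the mask from a prefix length as ~(all >> netmask). A minimal standalone sketch of that arithmetic, together with the 127.0.0.0/8 loopback test used before binding, might look like the following. The helper names (prefix_to_mask, subnet_of, is_loopback) are hypothetical, and the explicit handling of a /32 prefix is an addition of this sketch, since shifting a 32-bit value by 32 bits is undefined behavior in C.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical helper: convert a prefix length (0..32) into a
 * host-byte-order netmask, e.g. 24 -> 0xffffff00. */
static uint32_t prefix_to_mask(unsigned prefix_len)
{
    const uint32_t all = ~((uint32_t) 0);

    if (prefix_len >= 32) {
        return all;              /* avoid an undefined 32-bit shift */
    }
    return ~(all >> prefix_len); /* prefix_len == 0 yields mask 0 */
}

/* Hypothetical helper: subnet of an address given in network byte
 * order (as stored in sockaddr_in.sin_addr.s_addr).  The result is in
 * host byte order. */
static uint32_t subnet_of(uint32_t addr_be, unsigned prefix_len)
{
    return ntohl(addr_be) & prefix_to_mask(prefix_len);
}

/* Hypothetical helper: nonzero if the network-byte-order address lies
 * in the loopback range 127.0.0.0/8. */
static int is_loopback(uint32_t addr_be)
{
    return (addr_be & htonl(0xff000000u)) == htonl(0x7f000000u);
}
```

Two addresses are treated as being on the same subnet when subnet_of() returns the same value for both, which is the comparison the include/exclude IP address filters rely on.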

View file

@ -1,74 +0,0 @@
/* -*- C -*-
*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_INI_LEX_H_
#define BTL_OPENIB_INI_LEX_H_
#include "opal_config.h"
#ifdef malloc
#undef malloc
#endif
#ifdef realloc
#undef realloc
#endif
#ifdef free
#undef free
#endif
#include <stdio.h>
BEGIN_C_DECLS
int btl_openib_ini_yylex(void);
int btl_openib_ini_init_buffer(FILE *file);
int btl_openib_ini_yylex_destroy(void);
extern FILE *btl_openib_ini_yyin;
extern bool btl_openib_ini_parse_done;
extern char *btl_openib_ini_yytext;
extern int btl_openib_ini_yynewlines;
/*
* Make lex-generated files not issue compiler warnings
*/
#define YY_STACK_USED 0
#define YY_ALWAYS_INTERACTIVE 0
#define YY_NEVER_INTERACTIVE 0
#define YY_MAIN 0
#define YY_NO_UNPUT 1
#define YY_SKIP_YYWRAP 1
enum {
BTL_OPENIB_INI_PARSE_DONE,
BTL_OPENIB_INI_PARSE_ERROR,
BTL_OPENIB_INI_PARSE_NEWLINE,
BTL_OPENIB_INI_PARSE_SECTION,
BTL_OPENIB_INI_PARSE_EQUAL,
BTL_OPENIB_INI_PARSE_SINGLE_WORD,
BTL_OPENIB_INI_PARSE_VALUE,
BTL_OPENIB_INI_PARSE_MAX
};
END_C_DECLS
#endif

View file

@ -1,148 +0,0 @@
%option nounput
%option noinput
%{ /* -*- C -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2012 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <stdio.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#include "btl_openib_lex.h"
BEGIN_C_DECLS
/*
* local functions
*/
static int btl_openib_ini_yywrap(void);
END_C_DECLS
/*
* global variables
*/
int btl_openib_ini_yynewlines = 1;
bool btl_openib_ini_parse_done = false;
char *btl_openib_ini_string = NULL;
%}
WHITE [\f\t\v ]
CHAR [A-Za-z0-9_\-\.]
NAME_CHAR [A-Za-z0-9_\-\.\\\/]
%x comment
%x section_name
%x section_end
%x value
%%
{WHITE}*\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
#.*\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
"//".*\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
"/*" { BEGIN(comment);
return BTL_OPENIB_INI_PARSE_NEWLINE; }
<comment>[^*\n]* ; /* Eat up non '*'s */
<comment>"*"+[^*/\n]* ; /* Eat '*'s not followed by a '/' */
<comment>\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
<comment>"*"+"/" { BEGIN(INITIAL); /* Done with block comment */
return BTL_OPENIB_INI_PARSE_NEWLINE; }
{WHITE}*\[{WHITE}* { BEGIN(section_name); }
<section_name>({NAME_CHAR}|{WHITE})*{NAME_CHAR}/{WHITE}*\] {
BEGIN(section_end);
return BTL_OPENIB_INI_PARSE_SECTION; }
<section_name>\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_ERROR; }
<section_name>. { return BTL_OPENIB_INI_PARSE_ERROR; }
<section_end>{WHITE}*\]{WHITE}*\n { BEGIN(INITIAL);
++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
{WHITE}*"="{WHITE}* { BEGIN(value);
return BTL_OPENIB_INI_PARSE_EQUAL; }
{WHITE}+ ; /* whitespace */
{CHAR}+ { return BTL_OPENIB_INI_PARSE_SINGLE_WORD; }
<value>{WHITE}*\n { BEGIN(INITIAL);
++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
<value>[^\n]*[^\t \n]/[\t ]* {
return BTL_OPENIB_INI_PARSE_VALUE; }
. { return BTL_OPENIB_INI_PARSE_ERROR; }
%%
/* Old flex (2.5.4a? and older) does not define a destroy function */
#if !defined(YY_FLEX_SUBMINOR_VERSION)
#define YY_FLEX_SUBMINOR_VERSION 0
#endif
#if (YY_FLEX_MAJOR_VERSION < 2) || (YY_FLEX_MAJOR_VERSION == 2 && (YY_FLEX_MINOR_VERSION < 5 || (YY_FLEX_MINOR_VERSION == 5 && YY_FLEX_SUBMINOR_VERSION < 5)))
int btl_openib_ini_yylex_destroy(void)
{
if (NULL != YY_CURRENT_BUFFER) {
yy_delete_buffer(YY_CURRENT_BUFFER);
#if defined(YY_CURRENT_BUFFER_LVALUE)
YY_CURRENT_BUFFER_LVALUE = NULL;
#else
YY_CURRENT_BUFFER = NULL;
#endif /* YY_CURRENT_BUFFER_LVALUE */
}
return YY_NULL;
}
#endif
static int btl_openib_ini_yywrap(void)
{
btl_openib_ini_parse_done = true;
return 1;
}
/*
* Ensure that we have a valid yybuffer to use. Specifically, if this
* scanner is invoked a second time, finish_parsing() (above) will
* have been executed, and the current buffer will have been freed.
* Flex doesn't recognize this fact because as far as it's concerned,
* its internal state was already initialized, so it thinks it should
* have a valid buffer. Hence, here we ensure to give it a valid
* buffer.
*/
int btl_openib_ini_init_buffer(FILE *file)
{
YY_BUFFER_STATE buf = yy_create_buffer(file, YY_BUF_SIZE);
yy_switch_to_buffer(buf);
return 0;
}

View file

@ -1,799 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2015 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2013-2015 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2016 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <string.h>
#include "opal/util/bit_ops.h"
#include "opal/util/printf.h"
#include "opal/mca/common/verbs/common_verbs.h"
#include "opal/mca/installdirs/installdirs.h"
#include "opal/util/os_dirpath.h"
#include "opal/util/output.h"
#include "opal/util/show_help.h"
#include "opal/util/proc.h"
#include "btl_openib.h"
#include "btl_openib_mca.h"
#include "btl_openib_ini.h"
#include "connect/base.h"
#ifdef HAVE_IBV_FORK_INIT
#define OPAL_HAVE_IBV_FORK_INIT 1
#else
#define OPAL_HAVE_IBV_FORK_INIT 0
#endif
/*
* Local flags
*/
enum {
REGINT_NEG_ONE_OK = 0x01,
REGINT_GE_ZERO = 0x02,
REGINT_GE_ONE = 0x04,
REGINT_NONZERO = 0x08,
REGINT_MAX = 0x88
};
enum {
REGSTR_EMPTY_OK = 0x01,
REGSTR_MAX = 0x88
};
static mca_base_var_enum_value_t ib_mtu_values[] = {
{IBV_MTU_256, "256B"},
{IBV_MTU_512, "512B"},
{IBV_MTU_1024, "1k"},
{IBV_MTU_2048, "2k"},
{IBV_MTU_4096, "4k"},
{0, NULL}
};
static mca_base_var_enum_value_t device_type_values[] = {
{BTL_OPENIB_DT_IB, "infiniband"},
{BTL_OPENIB_DT_IB, "ib"},
{BTL_OPENIB_DT_IWARP, "iwarp"},
{BTL_OPENIB_DT_IWARP, "iw"},
{BTL_OPENIB_DT_ALL, "all"},
{0, NULL}
};
static int btl_openib_cq_size;
static bool btl_openib_have_fork_support = OPAL_HAVE_IBV_FORK_INIT;
/*
* utility routine for string parameter registration
*/
static int reg_string(const char* param_name,
const char* deprecated_param_name,
const char* param_desc,
const char* default_value, char **storage,
int flags)
{
int index;
assert (NULL != storage);
/* The MCA variable system will not change this pointer */
*storage = (char *) default_value;
index = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
param_name, param_desc, MCA_BASE_VAR_TYPE_STRING,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY, storage);
if (NULL != deprecated_param_name) {
(void) mca_base_var_register_synonym(index, "ompi", "btl", "openib",
deprecated_param_name,
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
if (0 != (flags & REGSTR_EMPTY_OK) && (NULL == *storage || 0 == strlen(*storage))) {
opal_output(0, "Bad parameter value for parameter \"%s\"",
param_name);
return OPAL_ERR_BAD_PARAM;
}
return OPAL_SUCCESS;
}
/*
* utility routine for integer parameter registration
*/
static int reg_int(const char* param_name,
const char* deprecated_param_name,
const char* param_desc,
int default_value, int *storage, int flags)
{
int index;
*storage = default_value;
index = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
param_name, param_desc, MCA_BASE_VAR_TYPE_INT,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY, storage);
if (NULL != deprecated_param_name) {
(void) mca_base_var_register_synonym(index, "ompi", "btl", "openib",
deprecated_param_name,
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
if (0 != (flags & REGINT_NEG_ONE_OK) && -1 == *storage) {
return OPAL_SUCCESS;
}
if ((0 != (flags & REGINT_GE_ZERO) && *storage < 0) ||
(0 != (flags & REGINT_GE_ONE) && *storage < 1) ||
(0 != (flags & REGINT_NONZERO) && 0 == *storage)) {
opal_output(0, "Bad parameter value for parameter \"%s\"",
param_name);
return OPAL_ERR_BAD_PARAM;
}
return OPAL_SUCCESS;
}
/*
* utility routine for unsigned integer parameter registration
*/
static int reg_uint(const char* param_name,
const char* deprecated_param_name,
const char* param_desc,
unsigned int default_value, unsigned int *storage,
int flags)
{
int index;
*storage = default_value;
index = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
param_name, param_desc, MCA_BASE_VAR_TYPE_UNSIGNED_INT,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY, storage);
if (NULL != deprecated_param_name) {
(void) mca_base_var_register_synonym(index, "ompi", "btl", "openib",
deprecated_param_name,
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
if ((0 != (flags & REGINT_GE_ONE) && *storage < 1) ||
(0 != (flags & REGINT_NONZERO) && 0 == *storage)) {
opal_output(0, "Bad parameter value for parameter \"%s\"",
param_name);
return OPAL_ERR_BAD_PARAM;
}
return OPAL_SUCCESS;
}
/*
* utility routine for boolean parameter registration
*/
static int reg_bool(const char* param_name,
const char* deprecated_param_name,
const char* param_desc,
bool default_value, bool *storage)
{
int index;
*storage = default_value;
index = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
param_name, param_desc, MCA_BASE_VAR_TYPE_BOOL,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY, storage);
if (NULL != deprecated_param_name) {
(void) mca_base_var_register_synonym(index, "ompi", "btl", "openib",
deprecated_param_name,
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
return OPAL_SUCCESS;
}
/*
* Register and check all MCA parameters
*/
int btl_openib_register_mca_params(void)
{
mca_base_var_enum_t *new_enum;
char *default_qps;
uint32_t mid_qp_size;
char *msg, *str;
int ret, tmp;
ret = OPAL_SUCCESS;
#define CHECK(expr) do {\
tmp = (expr); \
if (OPAL_SUCCESS != tmp) ret = tmp; \
} while (0)
/* register openib component parameters */
CHECK(reg_bool("verbose", NULL,
"Output some verbose OpenIB BTL information "
"(0 = no output, nonzero = output)", false,
&mca_btl_openib_component.verbose));
CHECK(reg_bool("warn_no_device_params_found",
"warn_no_hca_params_found",
"Warn when no device-specific parameters are found in the INI file specified by the btl_openib_device_param_files MCA parameter "
"(0 = do not warn; any other value = warn)",
true, &mca_btl_openib_component.warn_no_device_params_found));
CHECK(reg_bool("warn_default_gid_prefix", NULL,
"Warn when there is more than one active port and at least one of them is connected to the network with only the default GID prefix configured "
"(0 = do not warn; any other value = warn)",
true, &mca_btl_openib_component.warn_default_gid_prefix));
CHECK(reg_bool("warn_nonexistent_if", NULL,
"Warn if non-existent devices and/or ports are specified in the btl_openib_if_[in|ex]clude MCA parameters "
"(0 = do not warn; any other value = warn)",
true, &mca_btl_openib_component.warn_nonexistent_if));
/* If we print a warning about not having enough registered memory
available, do we want to abort? */
CHECK(reg_bool("abort_not_enough_reg_mem", NULL,
"If there is not enough registered memory available on the system for Open MPI to function properly, Open MPI will issue a warning. If this MCA parameter is set to true, then Open MPI will also abort all MPI jobs "
"(0 = warn, but do not abort; any other value = warn and abort)",
false, &mca_btl_openib_component.abort_not_enough_reg_mem));
CHECK(reg_uint("poll_cq_batch", NULL,
"Retrieve up to poll_cq_batch completions from CQ",
MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT, &mca_btl_openib_component.cq_poll_batch,
REGINT_GE_ONE));
opal_asprintf(&str, "%s/mca-btl-openib-device-params.ini",
opal_install_dirs.opaldatadir);
if (NULL == str) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
CHECK(reg_string("device_param_files", "hca_param_files",
"Colon-delimited list of INI-style files that contain device vendor/part-specific parameters (use semicolon for Windows)",
str, &mca_btl_openib_component.device_params_file_names,
0));
free(str);
(void)mca_base_var_enum_create("btl_openib_device_types", device_type_values, &new_enum);
mca_btl_openib_component.device_type = BTL_OPENIB_DT_ALL;
tmp = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"device_type", "Specify to only use IB or iWARP "
"network adapters (infiniband = only use InfiniBand "
"HCAs; iwarp = only use iWARP NICs; all = use any "
"available adapters)", MCA_BASE_VAR_TYPE_INT, new_enum,
0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY,
&mca_btl_openib_component.device_type);
if (0 > tmp) ret = tmp;
OBJ_RELEASE(new_enum);
/*
* Provide a way for the user to override the policy of ignoring IB HCAs
*/
mca_btl_openib_component.allow_ib = false;
tmp = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"allow_ib",
"Override policy since Open MPI 4.0 of ignoring IB HCAs for openib BTL",
MCA_BASE_VAR_TYPE_BOOL, NULL,
0, 0, OPAL_INFO_LVL_5,
MCA_BASE_VAR_SCOPE_READONLY,
&mca_btl_openib_component.allow_ib);
CHECK(reg_int("max_btls", NULL,
"Maximum number of device ports to use "
"(-1 = use all available, otherwise must be >= 1)",
-1, &mca_btl_openib_component.ib_max_btls,
REGINT_NEG_ONE_OK | REGINT_GE_ONE));
CHECK(reg_int("free_list_num", NULL,
"Initial size of free lists "
"(must be >= 1)",
8, &mca_btl_openib_component.ib_free_list_num,
REGINT_GE_ONE));
CHECK(reg_int("free_list_max", NULL,
"Maximum size of free lists "
"(-1 = infinite, otherwise must be >= 1)",
-1, &mca_btl_openib_component.ib_free_list_max,
REGINT_NEG_ONE_OK | REGINT_GE_ONE));
CHECK(reg_int("free_list_inc", NULL,
"Increment size of free lists "
"(must be >= 1)",
32, &mca_btl_openib_component.ib_free_list_inc,
REGINT_GE_ONE));
CHECK(reg_string("mpool_hints", NULL, "hints for selecting a memory pool (default: none)",
NULL, &mca_btl_openib_component.ib_mpool_hints,
0));
CHECK(reg_string("rcache", NULL,
"Name of the registration cache to be used (it is unlikely that you will ever want to change this)",
"grdma", &mca_btl_openib_component.ib_rcache_name,
0));
CHECK(reg_int("reg_mru_len", NULL,
"Length of the registration cache most recently used list "
"(must be >= 1)",
16, (int*) &mca_btl_openib_component.reg_mru_len,
REGINT_GE_ONE));
CHECK(reg_int("cq_size", "ib_cq_size",
"Minimum size of the OpenFabrics completion queue "
"(CQs are automatically sized based on the number "
"of peer MPI processes; this value determines the "
"*minimum* size of all CQs)",
8192, &btl_openib_cq_size, REGINT_GE_ONE));
mca_btl_openib_component.ib_cq_size[BTL_OPENIB_LP_CQ] =
mca_btl_openib_component.ib_cq_size[BTL_OPENIB_HP_CQ] = (uint32_t) btl_openib_cq_size;
CHECK(reg_int("max_inline_data", "ib_max_inline_data",
"Maximum size of inline data segment "
"(-1 = run-time probe to discover max value, otherwise must be >= 0). "
"If not explicitly set, use max_inline_data from "
"the INI file containing device-specific parameters",
-1, &mca_btl_openib_component.ib_max_inline_data,
REGINT_NEG_ONE_OK | REGINT_GE_ZERO));
CHECK(reg_uint("pkey", "ib_pkey_val",
"OpenFabrics partition key (pkey) value. "
"Unsigned integer decimal or hex values are allowed (e.g., \"3\" or \"0x3f\") and will be masked against the maximum allowable IB partition key value (0x7fff)",
0, &mca_btl_openib_component.ib_pkey_val, 0));
CHECK(reg_uint("psn", "ib_psn",
"OpenFabrics packet sequence starting number "
"(must be >= 0)",
0, &mca_btl_openib_component.ib_psn, 0));
CHECK(reg_uint("ib_qp_ous_rd_atom", NULL,
"InfiniBand outstanding atomic reads "
"(must be >= 0)",
4, &mca_btl_openib_component.ib_qp_ous_rd_atom, 0));
opal_asprintf(&msg, "OpenFabrics MTU, in bytes (if not specified in INI files). Valid values are: %d=256 bytes, %d=512 bytes, %d=1024 bytes, %d=2048 bytes, %d=4096 bytes",
IBV_MTU_256,
IBV_MTU_512,
IBV_MTU_1024,
IBV_MTU_2048,
IBV_MTU_4096);
if (NULL == msg) {
/* Don't try to recover from this */
return OPAL_ERR_OUT_OF_RESOURCE;
}
mca_btl_openib_component.ib_mtu = 0;
(void) mca_base_var_enum_create("btl_openib_mtus", ib_mtu_values, &new_enum);
tmp = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"mtu", msg, MCA_BASE_VAR_TYPE_INT, new_enum,
0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY,
&mca_btl_openib_component.ib_mtu);
if (0 <= tmp) {
(void) mca_base_var_register_synonym(tmp, "ompi", "btl", "openib", "ib_mtu",
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
} else {
ret = tmp;
}
OBJ_RELEASE(new_enum);
free(msg);
CHECK(reg_uint("ib_min_rnr_timer", NULL, "InfiniBand minimum "
"\"receiver not ready\" timer, in seconds "
"(must be >= 0 and <= 31)",
25, &mca_btl_openib_component.ib_min_rnr_timer, 0));
CHECK(reg_uint("ib_timeout", NULL,
"InfiniBand transmit timeout, plugged into formula: 4.096 microseconds * (2^btl_openib_ib_timeout) "
"(must be >= 0 and <= 31)",
20, &mca_btl_openib_component.ib_timeout, 0));
CHECK(reg_uint("ib_retry_count", NULL,
"InfiniBand transmit retry count "
"(must be >= 0 and <= 7)",
7, &mca_btl_openib_component.ib_retry_count, 0));
CHECK(reg_uint("ib_rnr_retry", NULL,
"InfiniBand \"receiver not ready\" "
"retry count; applies *only* to SRQ/XRC queues. PP queues "
"use RNR retry values of 0 because Open MPI performs "
"software flow control to guarantee that RNRs never occur "
"(must be >= 0 and <= 7; 7 = \"infinite\")",
7, &mca_btl_openib_component.ib_rnr_retry, 0));
CHECK(reg_uint("ib_max_rdma_dst_ops", NULL, "InfiniBand maximum pending RDMA "
"destination operations "
"(must be >= 0)",
4, &mca_btl_openib_component.ib_max_rdma_dst_ops, 0));
CHECK(reg_uint("ib_service_level", NULL, "InfiniBand service level "
"(must be >= 0 and <= 15)",
0, &mca_btl_openib_component.ib_service_level, 0));
#if (ENABLE_DYNAMIC_SL)
CHECK(reg_uint("ib_path_record_service_level", NULL,
"Enable getting InfiniBand service level from PathRecord "
"(must be >= 0, 0 = disabled, positive = try to get the "
"service level from PathRecord)",
0, &mca_btl_openib_component.ib_path_record_service_level, 0));
#endif
CHECK(reg_int("use_eager_rdma", NULL, "Use RDMA for eager messages "
"(-1 = use device default, 0 = do not use eager RDMA, "
"1 = use eager RDMA)",
-1, &mca_btl_openib_component.use_eager_rdma, 0));
CHECK(reg_int("eager_rdma_threshold", NULL,
"Use RDMA for short messages after this number of "
"messages are received from a given peer "
"(must be >= 1)",
16, &mca_btl_openib_component.eager_rdma_threshold, REGINT_GE_ONE));
CHECK(reg_int("max_eager_rdma", NULL, "Maximum number of peers allowed to use "
"RDMA for short messages (RDMA is used for all long "
"messages, except if explicitly disabled, such as "
"with the \"dr\" pml) "
"(must be >= 0)",
16, &mca_btl_openib_component.max_eager_rdma, REGINT_GE_ZERO));
CHECK(reg_int("eager_rdma_num", NULL, "Number of RDMA buffers to allocate "
"for small messages "
"(must be >= 1)",
16, &mca_btl_openib_component.eager_rdma_num, REGINT_GE_ONE));
mca_btl_openib_component.eager_rdma_num++;
CHECK(reg_uint("btls_per_lid", NULL, "Number of BTLs to create for each "
"InfiniBand LID "
"(must be >= 1)",
1, &mca_btl_openib_component.btls_per_lid, REGINT_GE_ONE));
CHECK(reg_uint("max_lmc", NULL, "Maximum number of LIDs to use for each device port "
"(must be >= 0, where 0 = use all available)",
1, &mca_btl_openib_component.max_lmc, 0));
CHECK(reg_int("enable_apm_over_lmc", NULL, "Maximum number of alternative paths for each device port "
"(must be >= -1, where 0 = disable apm, -1 = all available alternative paths )",
0, &mca_btl_openib_component.apm_lmc, REGINT_NEG_ONE_OK|REGINT_GE_ZERO));
CHECK(reg_int("enable_apm_over_ports", NULL, "Enable alternative path migration (APM) over different ports of the same device "
"(must be >= 0, where 0 = disable APM over ports, 1 = enable APM over ports of the same device)",
0, &mca_btl_openib_component.apm_ports, REGINT_GE_ZERO));
CHECK(reg_bool("use_async_event_thread", NULL,
"If nonzero, use the thread that will handle InfiniBand asynchronous events",
true, &mca_btl_openib_component.use_async_event_thread));
CHECK(reg_bool("enable_srq_resize", NULL,
"Enable/Disable on demand SRQ resize. "
"(0 = without resizing, nonzero = with resizing)", 1,
&mca_btl_openib_component.enable_srq_resize));
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
CHECK(reg_bool("rroce_enable", NULL,
"Enable/Disable routing between different subnets "
"(0 = disable, nonzero = enable)", false,
&mca_btl_openib_component.rroce_enable));
#endif
CHECK(reg_uint("buffer_alignment", NULL,
"Preferred communication buffer alignment, in bytes "
"(must be > 0 and power of two)",
64, &mca_btl_openib_component.buffer_alignment, 0));
CHECK(reg_bool("use_message_coalescing", NULL,
"If nonzero, use message coalescing", false,
&mca_btl_openib_component.use_message_coalescing));
CHECK(reg_uint("cq_poll_ratio", NULL,
"How often to poll high priority CQ versus low priority CQ",
100, &mca_btl_openib_component.cq_poll_ratio, REGINT_GE_ONE));
CHECK(reg_uint("eager_rdma_poll_ratio", NULL,
"How often to poll eager RDMA channel versus CQ",
100, &mca_btl_openib_component.eager_rdma_poll_ratio, REGINT_GE_ONE));
CHECK(reg_uint("hp_cq_poll_per_progress", NULL,
"Max number of completion events to process for each call "
"of BTL progress engine",
10, &mca_btl_openib_component.cq_poll_progress, REGINT_GE_ONE));
CHECK(reg_uint("max_hw_msg_size", NULL,
"Maximum size (in bytes) of a single fragment of a long message when using the RDMA protocols (must be > 0 and <= hw capabilities).",
0, &mca_btl_openib_component.max_hw_msg_size, 0));
CHECK(reg_bool("allow_max_memory_registration", NULL,
"Allow maximum possible memory to register with HCA",
1, &mca_btl_openib_component.allow_max_memory_registration));
/* Help debug memory registration issues */
CHECK(reg_int("memory_registration_verbose", NULL,
"Output some verbose memory registration information "
"(0 = no output, nonzero = output)", 0,
&mca_btl_openib_component.memory_registration_verbose_level, 0));
CHECK(reg_int("ignore_locality", NULL,
"Ignore any locality information and use all devices "
"(0 = use locality information and use only close devices, nonzero = ignore locality information)", 0,
&mca_btl_openib_component.ignore_locality, REGINT_GE_ZERO));
/* Info only */
tmp = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"have_fork_support",
"Whether the OpenFabrics stack supports applications that invoke the \"fork()\" system call or not (0 = no, 1 = yes). "
"Note that this value does NOT indicate whether the system being run on supports \"fork()\" with OpenFabrics applications or not.",
MCA_BASE_VAR_TYPE_BOOL, NULL, 0,
MCA_BASE_VAR_FLAG_DEFAULT_ONLY,
OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_CONSTANT,
&btl_openib_have_fork_support);
mca_btl_openib_module.super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_DEFAULT;
mca_btl_openib_module.super.btl_eager_limit = 12 * 1024;
mca_btl_openib_module.super.btl_rndv_eager_limit = 12 * 1024;
mca_btl_openib_module.super.btl_max_send_size = 64 * 1024;
mca_btl_openib_module.super.btl_rdma_pipeline_send_length = 1024 * 1024;
mca_btl_openib_module.super.btl_rdma_pipeline_frag_size = 1024 * 1024;
mca_btl_openib_module.super.btl_min_rdma_pipeline_size = 256 * 1024;
mca_btl_openib_module.super.btl_flags = MCA_BTL_FLAGS_RDMA |
MCA_BTL_FLAGS_NEED_ACK | MCA_BTL_FLAGS_NEED_CSUM | MCA_BTL_FLAGS_HETEROGENEOUS_RDMA |
MCA_BTL_FLAGS_SEND;
#if HAVE_DECL_IBV_ATOMIC_HCA
mca_btl_openib_module.super.btl_flags |= MCA_BTL_FLAGS_ATOMIC_FOPS;
mca_btl_openib_module.super.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_ADD | MCA_BTL_ATOMIC_SUPPORTS_CSWAP;
#endif
/* Default to bandwidth auto-detection */
mca_btl_openib_module.super.btl_bandwidth = 0;
mca_btl_openib_module.super.btl_latency = 4;
#if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
/* Default is enabling CUDA asynchronous send copies */
CHECK(reg_bool("cuda_async_send", NULL,
"Enable or disable CUDA async send copies "
"(true = async; false = sync)",
true, &mca_btl_openib_component.cuda_async_send));
/* Default is enabling CUDA asynchronous receive copies */
CHECK(reg_bool("cuda_async_recv", NULL,
"Enable or disable CUDA async recv copies "
"(true = async; false = sync)",
false, &mca_btl_openib_component.cuda_async_recv));
/* Also make the max send size larger for better GPU buffer performance */
mca_btl_openib_module.super.btl_max_send_size = 128 * 1024;
/* Turn off message coalescing - not sure if it works with GPU buffers */
mca_btl_openib_component.use_message_coalescing = 0;
/* Indicates if library was built with GPU Direct RDMA support. Not changeable. */
mca_btl_openib_component.cuda_have_gdr = OPAL_INT_TO_BOOL(OPAL_CUDA_GDR_SUPPORT);
(void) mca_base_component_var_register(&mca_btl_openib_component.super.btl_version, "have_cuda_gdr",
"Whether CUDA GPU Direct RDMA support is built into library or not",
MCA_BASE_VAR_TYPE_BOOL, NULL, 0,
MCA_BASE_VAR_FLAG_DEFAULT_ONLY,
OPAL_INFO_LVL_5,
MCA_BASE_VAR_SCOPE_CONSTANT,
&mca_btl_openib_component.cuda_have_gdr);
/* Indicates if driver has GPU Direct RDMA support. Not changeable. */
if (OPAL_SUCCESS == opal_os_dirpath_access("/sys/kernel/mm/memory_peers/nv_mem/version", S_IRUSR)) {
mca_btl_openib_component.driver_have_gdr = 1;
} else {
mca_btl_openib_component.driver_have_gdr = 0;
}
(void) mca_base_component_var_register(&mca_btl_openib_component.super.btl_version, "have_driver_gdr",
"Whether InfiniBand driver has GPU Direct RDMA support",
MCA_BASE_VAR_TYPE_BOOL, NULL, 0,
MCA_BASE_VAR_FLAG_DEFAULT_ONLY,
OPAL_INFO_LVL_5,
MCA_BASE_VAR_SCOPE_CONSTANT,
&mca_btl_openib_component.driver_have_gdr);
/* Default for GPU Direct RDMA is off for now */
CHECK(reg_bool("want_cuda_gdr", NULL,
"Enable or disable CUDA GPU Direct RDMA support "
"(true = enabled; false = disabled)",
false, &mca_btl_openib_component.cuda_want_gdr));
if (mca_btl_openib_component.cuda_want_gdr && !mca_btl_openib_component.cuda_have_gdr) {
opal_show_help("help-mpi-btl-openib.txt",
"CUDA_no_gdr_support", true,
opal_process_info.nodename);
return OPAL_ERROR;
}
if (mca_btl_openib_component.cuda_want_gdr && !mca_btl_openib_component.driver_have_gdr) {
opal_show_help("help-mpi-btl-openib.txt",
"driver_no_gdr_support", true,
opal_process_info.nodename);
return OPAL_ERROR;
}
#if OPAL_CUDA_GDR_SUPPORT
if (mca_btl_openib_component.cuda_want_gdr) {
mca_btl_openib_module.super.btl_flags |= MCA_BTL_FLAGS_CUDA_GET;
mca_btl_openib_module.super.btl_cuda_eager_limit = SIZE_MAX; /* magic number - indicates set it to minimum */
mca_btl_openib_module.super.btl_cuda_rdma_limit = 30000; /* default switchover is 30,000 to pipeline */
} else {
mca_btl_openib_module.super.btl_cuda_eager_limit = 0; /* Turns off any of the GPU Direct RDMA code */
mca_btl_openib_module.super.btl_cuda_rdma_limit = 0; /* Unused */
}
#endif /* OPAL_CUDA_GDR_SUPPORT */
#endif /* OPAL_CUDA_SUPPORT */
CHECK(mca_btl_base_param_register(
&mca_btl_openib_component.super.btl_version,
&mca_btl_openib_module.super));
/* setup all the qp stuff */
/* round mid_qp_size to smallest power of two */
mid_qp_size = opal_next_poweroftwo (mca_btl_openib_module.super.btl_eager_limit / 4) >> 1;
/* mid_qp_size = MAX (mid_qp_size, 1024); ?! */
if(mid_qp_size <= 128) {
mid_qp_size = 1024;
}
opal_asprintf(&default_qps,
"S,128,256,192,128:S,%u,1024,1008,64:S,%u,1024,1008,64:S,%u,1024,1008,64",
mid_qp_size,
(uint32_t)mca_btl_openib_module.super.btl_eager_limit,
(uint32_t)mca_btl_openib_module.super.btl_max_send_size);
if (NULL == default_qps) {
/* Don't try to recover from this */
return OPAL_ERR_OUT_OF_RESOURCE;
}
if (NULL != mca_btl_openib_component.default_recv_qps) {
free(mca_btl_openib_component.default_recv_qps);
}
mca_btl_openib_component.default_recv_qps = default_qps;
CHECK(reg_string("receive_queues", NULL,
"Colon-delimited, comma-delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4",
default_qps, &mca_btl_openib_component.receive_queues,
0
));
CHECK(reg_string("if_include", NULL,
"Comma-delimited list of devices/ports to be used (e.g. \"mthca0,mthca1:2\"; empty value means to use all ports found). Mutually exclusive with btl_openib_if_exclude.",
NULL, &mca_btl_openib_component.if_include,
0));
CHECK(reg_string("if_exclude", NULL,
"Comma-delimited list of device/ports to be excluded (empty value means to not exclude any ports). Mutually exclusive with btl_openib_if_include.",
NULL, &mca_btl_openib_component.if_exclude,
0));
CHECK(reg_string("ipaddr_include", NULL,
"Comma-delimited list of IP Addresses to be used (e.g. \"192.168.1.0/24\"). Mutually exclusive with btl_openib_ipaddr_exclude.",
NULL, &mca_btl_openib_component.ipaddr_include,
0));
CHECK(reg_string("ipaddr_exclude", NULL,
"Comma-delimited list of IP Addresses to be excluded (e.g. \"192.168.1.0/24\"). Mutually exclusive with btl_openib_ipaddr_include.",
NULL, &mca_btl_openib_component.ipaddr_exclude,
0));
CHECK(reg_int("gid_index", NULL,
"GID index to use on verbs device ports",
0, &mca_btl_openib_component.gid_index,
REGINT_GE_ZERO));
CHECK(reg_bool("allow_different_subnets", NULL,
"Allow connecting processes from different IB subnets. "
"(0 = do not allow; 1 = allow)",
false, &mca_btl_openib_component.allow_different_subnets));
/* Register any MCA params for the connect pseudo-components */
if (OPAL_SUCCESS == ret) {
ret = opal_btl_openib_connect_base_register();
}
return btl_openib_verify_mca_params();
}
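The default receive-queue specification built above can be reproduced outside Open MPI with a small sketch. The helper `next_pow2_above()` is a hypothetical stand-in for `opal_next_poweroftwo()`, and it assumes that function returns the smallest power of two strictly greater than its argument; under that assumption, the shift by one rounds `btl_eager_limit / 4` down to a power of two. The other function names are also illustrative and not part of the removed code.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for opal_next_poweroftwo(): ASSUMED to return the
 * smallest power of two strictly greater than value (1 for value == 0). */
static uint32_t next_pow2_above(uint32_t value)
{
    uint32_t power2 = 1;
    while (value > 0) {
        value >>= 1;
        power2 <<= 1;
    }
    return power2;
}

/* Mirror the mid-size queue computation from the registration code:
 * derive a power of two from eager_limit / 4 and fall back to 1024
 * when the result is 128 or smaller. */
static uint32_t mid_qp_size_for(uint32_t eager_limit)
{
    uint32_t mid = next_pow2_above(eager_limit / 4) >> 1;
    if (mid <= 128) {
        mid = 1024;
    }
    return mid;
}

/* Format the four-queue default spec into buf; returns snprintf()'s result. */
static int format_default_qps(char *buf, size_t len,
                              uint32_t eager_limit, uint32_t max_send_size)
{
    return snprintf(buf, len,
                    "S,128,256,192,128:S,%u,1024,1008,64:"
                    "S,%u,1024,1008,64:S,%u,1024,1008,64",
                    (unsigned) mid_qp_size_for(eager_limit),
                    (unsigned) eager_limit, (unsigned) max_send_size);
}
```

With the non-CUDA defaults set earlier in this function (a 12 KiB eager limit and a 64 KiB maximum send size), this sketch yields a 2048-byte mid-size queue followed by queues of 12288 and 65536 bytes.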
int btl_openib_verify_mca_params (void)
{
if (mca_btl_openib_component.cq_poll_batch > MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT) {
mca_btl_openib_component.cq_poll_batch = MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT;
}
#if !HAVE_IBV_FORK_INIT
if (1 == mca_btl_openib_component.want_fork_support) {
opal_show_help("help-mpi-btl-openib.txt",
"ibv_fork requested but not supported", true,
opal_process_info.nodename);
return OPAL_ERR_BAD_PARAM;
}
#endif
mca_btl_openib_component.ib_pkey_val &= MCA_BTL_IB_PKEY_MASK;
if (mca_btl_openib_component.ib_min_rnr_timer > 31) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_min_rnr_timer > 31",
"btl_openib_ib_min_rnr_timer reset to 31");
mca_btl_openib_component.ib_min_rnr_timer = 31;
}
if (mca_btl_openib_component.ib_timeout > 31) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_timeout > 31",
"btl_openib_ib_timeout reset to 31");
mca_btl_openib_component.ib_timeout = 31;
}
if (mca_btl_openib_component.ib_retry_count > 7) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_retry_count > 7",
"btl_openib_ib_retry_count reset to 7");
mca_btl_openib_component.ib_retry_count = 7;
}
if (mca_btl_openib_component.ib_rnr_retry > 7) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_rnr_retry > 7",
"btl_openib_ib_rnr_retry reset to 7");
mca_btl_openib_component.ib_rnr_retry = 7;
}
if (mca_btl_openib_component.ib_service_level > 15) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_service_level > 15",
"btl_openib_ib_service_level reset to 15");
mca_btl_openib_component.ib_service_level = 15;
}
if(mca_btl_openib_component.buffer_alignment <= 1 ||
(mca_btl_openib_component.buffer_alignment & (mca_btl_openib_component.buffer_alignment - 1))) {
opal_show_help("help-mpi-btl-openib.txt", "wrong buffer alignment",
true, mca_btl_openib_component.buffer_alignment, opal_process_info.nodename, 64);
mca_btl_openib_component.buffer_alignment = 64;
}
#if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
if (mca_btl_openib_component.cuda_async_send) {
mca_btl_openib_module.super.btl_flags |= MCA_BTL_FLAGS_CUDA_COPY_ASYNC_SEND;
} else {
mca_btl_openib_module.super.btl_flags &= ~MCA_BTL_FLAGS_CUDA_COPY_ASYNC_SEND;
}
if (mca_btl_openib_component.cuda_async_recv) {
mca_btl_openib_module.super.btl_flags |= MCA_BTL_FLAGS_CUDA_COPY_ASYNC_RECV;
} else {
mca_btl_openib_module.super.btl_flags &= ~MCA_BTL_FLAGS_CUDA_COPY_ASYNC_RECV;
}
#if 0 /* Disable this check for now while fork support code is worked out. */
/* Cannot have fork support and GDR on at the same time. If the user asks for both,
* then print a message and return error. If the user does not explicitly ask for
* fork support, then turn it off in the presence of GDR. */
if (mca_btl_openib_component.cuda_want_gdr && mca_btl_openib_component.cuda_have_gdr &&
mca_btl_openib_component.driver_have_gdr) {
if (1 == opal_common_verbs_want_fork_support) {
opal_show_help("help-mpi-btl-openib.txt", "no_fork_with_gdr",
true, opal_process_info.nodename);
return OPAL_ERR_BAD_PARAM;
}
}
#endif /* Workaround */
if (0 != mca_btl_openib_module.super.btl_cuda_max_send_size) {
opal_show_help("help-mpi-btl-openib.txt", "do_not_set_openib_value",
true, opal_process_info.nodename);
mca_btl_openib_module.super.btl_cuda_max_send_size = 0;
}
#endif
return OPAL_SUCCESS;
}
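The verification routine above applies two recurring checks: clamping a parameter to its maximum legal value, and rejecting buffer alignments that are not a power of two greater than one. The following is a minimal standalone sketch of both. The function names are hypothetical, and the sketch leaves out the `opal_show_help()` warnings that the removed code prints.

```c
#include <stdbool.h>
#include <stdint.h>

/* Clamp *value to max. Returns true if the value had to be changed, so a
 * caller could decide whether to warn the user (hypothetical helper). */
static bool clamp_to_max(uint32_t *value, uint32_t max)
{
    if (*value > max) {
        *value = max;
        return true;
    }
    return false;
}

/* Same condition as the buffer_alignment check, inverted: an alignment is
 * accepted only if it is greater than 1 and has exactly one bit set. */
static bool alignment_is_valid(uint32_t alignment)
{
    return alignment > 1 && 0 == (alignment & (alignment - 1));
}
```

Note that this condition rejects an alignment of 1, even though the parameter's help text only asks for a value that is greater than 0 and a power of two.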


@ -1,22 +0,0 @@
/*
* Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_IB_MCA_H
#define MCA_BTL_IB_MCA_H
BEGIN_C_DECLS
/**
* Function to register MCA params and check for sane values
*/
int btl_openib_register_mca_params(void);
int btl_openib_verify_mca_params (void);
END_C_DECLS
#endif


@ -1,405 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007-2015 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2014-2017 Intel, Inc. All rights reserved.
* Copyright (c) 2015-2018 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2015 Mellanox Technologies. All rights reserved.
* Copyright (c) 2016-2017 Los Alamos National Security, LLC. All rights
* reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "opal/util/arch.h"
#include "opal/mca/pmix/pmix.h"
#include "btl_openib.h"
#include "btl_openib_proc.h"
#include "connect/base.h"
#include "connect/connect.h"
static void mca_btl_openib_proc_btl_construct(mca_btl_openib_proc_btlptr_t* elem);
static void mca_btl_openib_proc_btl_destruct(mca_btl_openib_proc_btlptr_t* elem);
OBJ_CLASS_INSTANCE(mca_btl_openib_proc_btlptr_t,
opal_list_item_t, mca_btl_openib_proc_btl_construct,
mca_btl_openib_proc_btl_destruct);
static void mca_btl_openib_proc_btl_construct(mca_btl_openib_proc_btlptr_t* elem)
{
elem->openib_btl = NULL;
}
static void mca_btl_openib_proc_btl_destruct(mca_btl_openib_proc_btlptr_t* elem)
{
elem->openib_btl = NULL;
}
static void mca_btl_openib_proc_construct(mca_btl_openib_proc_t* proc);
static void mca_btl_openib_proc_destruct(mca_btl_openib_proc_t* proc);
OBJ_CLASS_INSTANCE(mca_btl_openib_proc_t,
opal_list_item_t, mca_btl_openib_proc_construct,
mca_btl_openib_proc_destruct);
void mca_btl_openib_proc_construct(mca_btl_openib_proc_t* ib_proc)
{
ib_proc->proc_opal = 0;
ib_proc->proc_ports = NULL;
ib_proc->proc_port_count = 0;
ib_proc->proc_endpoints = 0;
ib_proc->proc_endpoint_count = 0;
OBJ_CONSTRUCT(&ib_proc->proc_lock, opal_mutex_t);
OBJ_CONSTRUCT(&ib_proc->openib_btls, opal_list_t);
}
/*
* Cleanup ib proc instance
*/
void mca_btl_openib_proc_destruct(mca_btl_openib_proc_t* ib_proc)
{
/* release resources */
if(NULL != ib_proc->proc_endpoints) {
free(ib_proc->proc_endpoints);
}
if (NULL != ib_proc->proc_ports) {
int i, j;
for (i = 0; i < ib_proc->proc_port_count; ++i) {
for (j = 0; j < ib_proc->proc_ports[i].pm_cpc_data_count; ++j) {
if (NULL != ib_proc->proc_ports[i].pm_cpc_data[j].cbm_modex_message) {
free(ib_proc->proc_ports[i].pm_cpc_data[j].cbm_modex_message);
}
}
}
free(ib_proc->proc_ports);
}
OBJ_DESTRUCT(&ib_proc->proc_lock);
OPAL_LIST_DESTRUCT(&ib_proc->openib_btls);
}
/*
* Look for an existing IB process instance based on the associated
* opal_proc_t instance.
*/
static mca_btl_openib_proc_t* ibproc_lookup_no_lock(opal_proc_t* proc)
{
mca_btl_openib_proc_t* ib_proc;
OPAL_LIST_FOREACH(ib_proc, &mca_btl_openib_component.ib_procs, mca_btl_openib_proc_t) {
if(ib_proc->proc_opal == proc) {
return ib_proc;
}
}
return NULL;
}
static mca_btl_openib_proc_t* ibproc_lookup_and_lock(opal_proc_t* proc)
{
mca_btl_openib_proc_t* ib_proc;
/* get the process from the list */
opal_mutex_lock(&mca_btl_openib_component.ib_lock);
ib_proc = ibproc_lookup_no_lock(proc);
opal_mutex_unlock(&mca_btl_openib_component.ib_lock);
if( NULL != ib_proc ){
/* if we were able to find it - lock it.
* NOTE: we want to lock it outside of list locked region */
opal_mutex_lock(&ib_proc->proc_lock);
}
return ib_proc;
}
static inline void unpack8(char **src, uint8_t *value)
{
/* Copy one character */
*value = (uint8_t) **src;
/* Move the src pointer ahead by one */
++*src;
}
/*
* Create an IB process structure. There is a one-to-one correspondence
* between an opal_proc_t and a mca_btl_openib_proc_t instance. We
* cache additional data (specifically the list of
* mca_btl_openib_endpoint_t instances, and published addresses)
* associated with a given destination on this data structure.
*/
mca_btl_openib_proc_t* mca_btl_openib_proc_get_locked(opal_proc_t* proc)
{
mca_btl_openib_proc_t *ib_proc = NULL, *ib_proc_ret = NULL;
size_t msg_size;
uint32_t size;
int rc, i, j;
void *message;
char *offset;
int modex_message_size;
mca_btl_openib_modex_message_t dummy;
bool is_new = false;
/* Check if we have already created an IB proc
* structure for this ompi process */
ib_proc = ibproc_lookup_and_lock(proc);
if (NULL != ib_proc) {
/* Gotcha! */
return ib_proc;
}
/* All initialization has to be an atomic operation. We make the following assumptions:
* - we let all concurrent threads try to do the initialization;
* - when one has finished it locks ib_lock and checks if corresponding
* process is still missing;
* - if so - new proc is added, otherwise - initialized proc struct is released.
*/
/* First time, gotta create a new IB proc
* out of the opal_proc ... */
ib_proc = OBJ_NEW(mca_btl_openib_proc_t);
if (NULL == ib_proc) {
return NULL;
}
/* Initialize the number of peer endpoints */
ib_proc->proc_endpoint_count = 0;
ib_proc->proc_opal = proc;
/* query for the peer address info */
OPAL_MODEX_RECV(rc, &mca_btl_openib_component.super.btl_version,
&proc->proc_name, &message, &msg_size);
if (OPAL_SUCCESS != rc) {
BTL_VERBOSE(("[%s:%d] opal_modex_recv failed for peer %s",
__FILE__, __LINE__,
OPAL_NAME_PRINT(proc->proc_name)));
goto no_err_exit;
}
if (0 == msg_size) {
goto no_err_exit;
}
/* Message was packed in btl_openib_component.c; the format is
listed in a comment in that file */
modex_message_size = ((char *) &(dummy.end)) - ((char*) &dummy);
/* Unpack the number of modules in the message */
offset = (char *) message;
unpack8(&offset, &(ib_proc->proc_port_count));
BTL_VERBOSE(("unpack: %d btls", ib_proc->proc_port_count));
if (ib_proc->proc_port_count > 0) {
ib_proc->proc_ports = (mca_btl_openib_proc_modex_t *)
malloc(sizeof(mca_btl_openib_proc_modex_t) *
ib_proc->proc_port_count);
} else {
ib_proc->proc_ports = NULL;
}
/* Loop over unpacking all the ports */
for (i = 0; i < ib_proc->proc_port_count; i++) {
/* Unpack the modex comment message struct */
size = modex_message_size;
memcpy(&(ib_proc->proc_ports[i].pm_port_info), offset, size);
#if !defined(WORDS_BIGENDIAN) && OPAL_ENABLE_HETEROGENEOUS_SUPPORT
MCA_BTL_OPENIB_MODEX_MSG_NTOH(ib_proc->proc_ports[i].pm_port_info);
#endif
offset += size;
BTL_VERBOSE(("unpacked btl %d: modex message, offset now %d",
i, (int)(offset-((char*)message))));
/* Unpack the number of CPCs that follow */
unpack8(&offset, &(ib_proc->proc_ports[i].pm_cpc_data_count));
BTL_VERBOSE(("unpacked btl %d: number of cpcs to follow %d (offset now %d)",
i, ib_proc->proc_ports[i].pm_cpc_data_count,
(int)(offset-((char*)message))));
ib_proc->proc_ports[i].pm_cpc_data = (opal_btl_openib_connect_base_module_data_t *)
calloc(ib_proc->proc_ports[i].pm_cpc_data_count,
sizeof(opal_btl_openib_connect_base_module_data_t));
if (NULL == ib_proc->proc_ports[i].pm_cpc_data) {
goto err_exit;
}
/* Unpack the CPCs */
for (j = 0; j < ib_proc->proc_ports[i].pm_cpc_data_count; ++j) {
uint8_t u8;
opal_btl_openib_connect_base_module_data_t *cpcd;
cpcd = ib_proc->proc_ports[i].pm_cpc_data + j;
unpack8(&offset, &u8);
BTL_VERBOSE(("unpacked btl %d: cpc %d: index %d (offset now %d)",
i, j, u8, (int)(offset-(char*)message)));
cpcd->cbm_component =
opal_btl_openib_connect_base_get_cpc_byindex(u8);
BTL_VERBOSE(("unpacked btl %d: cpc %d: component %s",
i, j, cpcd->cbm_component->cbc_name));
unpack8(&offset, &cpcd->cbm_priority);
unpack8(&offset, &cpcd->cbm_modex_message_len);
BTL_VERBOSE(("unpacked btl %d: cpc %d: priority %d, msg len %d (offset now %d)",
i, j, cpcd->cbm_priority,
cpcd->cbm_modex_message_len,
(int)(offset-(char*)message)));
if (cpcd->cbm_modex_message_len > 0) {
cpcd->cbm_modex_message = malloc(cpcd->cbm_modex_message_len);
if (NULL == cpcd->cbm_modex_message) {
BTL_ERROR(("Failed to malloc"));
goto err_exit;
}
memcpy(cpcd->cbm_modex_message, offset,
cpcd->cbm_modex_message_len);
offset += cpcd->cbm_modex_message_len;
BTL_VERBOSE(("unpacked btl %d: cpc %d: blob unpacked %d %x (offset now %d)",
i, j,
((uint32_t*)cpcd->cbm_modex_message)[0],
((uint32_t*)cpcd->cbm_modex_message)[1],
(int)(offset-((char*)message))));
}
}
}
if (0 == ib_proc->proc_port_count) {
ib_proc->proc_endpoints = NULL;
goto no_err_exit;
} else {
ib_proc->proc_endpoints = (volatile mca_btl_base_endpoint_t**)
malloc(ib_proc->proc_port_count *
sizeof(mca_btl_base_endpoint_t*));
}
if (NULL == ib_proc->proc_endpoints) {
goto err_exit;
}
BTL_VERBOSE(("unpacking done!"));
/* Finally add this process to the initialized procs list */
opal_mutex_lock(&mca_btl_openib_component.ib_lock);
ib_proc_ret = ibproc_lookup_no_lock(proc);
if (NULL == ib_proc_ret) {
/* if process can't be found in this list - insert it locked
* it is safe to lock ib_proc here because this thread is
* the only one who knows about it so far */
opal_mutex_lock(&ib_proc->proc_lock);
opal_list_append(&mca_btl_openib_component.ib_procs, &ib_proc->super);
ib_proc_ret = ib_proc;
is_new = true;
} else {
/* otherwise - another thread inserted it first; release our ib_proc */
OBJ_RELEASE(ib_proc);
}
opal_mutex_unlock(&mca_btl_openib_component.ib_lock);
/* if we haven't inserted the process - lock it here so we
* don't hold mca_btl_openib_component.ib_lock while locking it */
if( !is_new ){
opal_mutex_lock(&ib_proc_ret->proc_lock);
}
return ib_proc_ret;
err_exit:
BTL_ERROR(("%d: error exit from mca_btl_openib_proc_get_locked", OPAL_PROC_MY_NAME.vpid));
no_err_exit:
OBJ_RELEASE(ib_proc);
return NULL;
}
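The function above follows a "build outside the lock, re-check under the lock" pattern: any thread may construct a candidate entry, but only the first one to reach the list lock inserts it, and later threads discard their copy. Below is a minimal sketch of that pattern using POSIX mutexes. All names here are hypothetical, and the per-entry lock that the removed code also takes before returning is left out for brevity.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical cache entry keyed by an opaque pointer. */
typedef struct entry {
    struct entry *next;
    const void *key;
} entry_t;

static entry_t *cache_head = NULL;
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold cache_lock. */
static entry_t *lookup_no_lock(const void *key)
{
    for (entry_t *e = cache_head; NULL != e; e = e->next) {
        if (e->key == key) {
            return e;
        }
    }
    return NULL;
}

static entry_t *get_or_create(const void *key)
{
    /* Fast path: the entry already exists. */
    pthread_mutex_lock(&cache_lock);
    entry_t *found = lookup_no_lock(key);
    pthread_mutex_unlock(&cache_lock);
    if (NULL != found) {
        return found;
    }

    /* Potentially slow construction happens without the list lock. */
    entry_t *fresh = calloc(1, sizeof(*fresh));
    if (NULL == fresh) {
        return NULL;
    }
    fresh->key = key;

    /* Re-check under the lock: another thread may have won the race. */
    pthread_mutex_lock(&cache_lock);
    found = lookup_no_lock(key);
    if (NULL == found) {
        fresh->next = cache_head;
        cache_head = fresh;
        found = fresh;
        fresh = NULL;   /* ownership moved to the list */
    }
    pthread_mutex_unlock(&cache_lock);

    free(fresh);        /* discards the losing copy; free(NULL) is a no-op */
    return found;
}
```

The design trades some duplicated construction work under contention for a short critical section, since the list lock is never held while an entry is being built.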
int mca_btl_openib_proc_remove(opal_proc_t *proc,
mca_btl_base_endpoint_t *endpoint)
{
size_t i;
mca_btl_openib_proc_t* ib_proc = NULL;
/* Remove endpoint from the openib BTL version of the proc as
well */
ib_proc = ibproc_lookup_and_lock(proc);
if (NULL != ib_proc) {
for (i = 0; i < ib_proc->proc_endpoint_count; ++i) {
if (ib_proc->proc_endpoints[i] == endpoint) {
ib_proc->proc_endpoints[i] = NULL;
if (i == ib_proc->proc_endpoint_count - 1) {
--ib_proc->proc_endpoint_count;
}
opal_mutex_unlock(&ib_proc->proc_lock);
return OPAL_SUCCESS;
}
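}
/* endpoint not found: release the lock taken by ibproc_lookup_and_lock() */
opal_mutex_unlock(&ib_proc->proc_lock);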
}
return OPAL_ERR_NOT_FOUND;
}
/*
* Note that this routine must be called with the lock on the process
* already held. Insert a btl instance into the proc array and assign
* it an address.
*/
int mca_btl_openib_proc_insert(mca_btl_openib_proc_t* module_proc,
mca_btl_base_endpoint_t* module_endpoint)
{
/* insert into endpoint array */
#ifndef WORDS_BIGENDIAN
/* if we are little endian and our peer is not so lucky, then we
need to put all information sent to him in big endian (aka
Network Byte Order) and expect all information received to
be in NBO. Since big endian machines always send and receive
in NBO, we don't care so much about that case. */
if (module_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN) {
module_endpoint->nbo = true;
}
#endif
/* only allow eager rdma if the peers agree on the size of a long */
if((module_proc->proc_opal->proc_arch & OPAL_ARCH_LONGISxx) !=
(opal_proc_local_get()->proc_arch & OPAL_ARCH_LONGISxx)) {
module_endpoint->use_eager_rdma = false;
}
module_endpoint->endpoint_proc = module_proc;
module_proc->proc_endpoints[module_proc->proc_endpoint_count++] = module_endpoint;
return OPAL_SUCCESS;
}
int mca_btl_openib_proc_reg_btl(mca_btl_openib_proc_t* ib_proc,
mca_btl_openib_module_t* openib_btl)
{
mca_btl_openib_proc_btlptr_t* elem;
OPAL_LIST_FOREACH(elem, &ib_proc->openib_btls, mca_btl_openib_proc_btlptr_t) {
if(elem->openib_btl == openib_btl) {
/* this is normal return meaning that this BTL has already touched this ib_proc */
return OPAL_ERR_RESOURCE_BUSY;
}
}
elem = OBJ_NEW(mca_btl_openib_proc_btlptr_t);
if( NULL == elem ){
return OPAL_ERR_OUT_OF_RESOURCE;
}
elem->openib_btl = openib_btl;
opal_list_append(&ib_proc->openib_btls, &elem->super);
return OPAL_SUCCESS;
}
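The unpacking loop in mca_btl_openib_proc_get_locked() walks a byte buffer of one-byte counts and length-prefixed blobs, advancing a raw pointer and trusting the sender for the lengths. As an illustration only, the sketch below shows a bounds-checked variant of the same idea. The `cursor_t` type and both functions are hypothetical and do not reflect the real modex wire format, which also carries a fixed-size port structure and per-CPC index and priority bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical read cursor over a received message. */
typedef struct {
    const uint8_t *pos;   /* next unread byte */
    const uint8_t *end;   /* one past the last valid byte */
} cursor_t;

/* Read one byte; returns 0 on success, -1 if the buffer is exhausted. */
static int cursor_unpack8(cursor_t *c, uint8_t *value)
{
    if (c->pos >= c->end) {
        return -1;
    }
    *value = *c->pos++;
    return 0;
}

/* Read a blob prefixed by a one-byte length into out (capacity out_cap).
 * Returns the blob length, or -1 if the message is truncated or the
 * destination is too small. */
static int cursor_unpack_blob(cursor_t *c, uint8_t *out, size_t out_cap)
{
    uint8_t len;
    if (0 != cursor_unpack8(c, &len)) {
        return -1;
    }
    if ((size_t) (c->end - c->pos) < len || out_cap < len) {
        return -1;
    }
    memcpy(out, c->pos, len);
    c->pos += len;
    return (int) len;
}
```

A caller would initialize the cursor with the start and end of the received message and stop unpacking at the first negative return value.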


@ -1,115 +0,0 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2008 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2015 Mellanox Technologies. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_IB_PROC_H
#define MCA_BTL_IB_PROC_H
#include "opal/class/opal_object.h"
#include "opal/util/proc.h"
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
BEGIN_C_DECLS
/* Must forward reference this to avoid include file loop */
struct opal_btl_openib_connect_base_module_data_t;
/**
* Data received from the modex. For each openib BTL module/port in
* the peer, we'll receive two things:
*
* 1. Data about the peer's port
* 2. An array of CPCs that the peer has available on that port, each
* of which has its own meta data
*
* Hence, these two items need to be bundled together.
*/
typedef struct mca_btl_openib_proc_modex_t {
/** Information about the peer's port */
mca_btl_openib_modex_message_t pm_port_info;
/** Array of the peer's CPCs available on this port */
opal_btl_openib_connect_base_module_data_t *pm_cpc_data;
/** Length of the pm_cpc_data array */
uint8_t pm_cpc_data_count;
} mca_btl_openib_proc_modex_t;
/**
* The list element to hold pointers to openib_btls that are using this
* ib_proc.
*/
struct mca_btl_openib_proc_btlptr_t {
opal_list_item_t super;
mca_btl_openib_module_t* openib_btl;
};
typedef struct mca_btl_openib_proc_btlptr_t mca_btl_openib_proc_btlptr_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_proc_btlptr_t);
/**
* Represents the state of a remote process and the set of addresses
* that it exports. Also cache an instance of mca_btl_base_endpoint_t for
* each BTL instance that attempts to open a connection to the process.
*/
struct mca_btl_openib_proc_t {
/** allow proc to be placed on a list */
opal_list_item_t super;
/** pointer to corresponding opal_proc_t */
const opal_proc_t *proc_opal;
/** modex messages from this proc; one for each port in the peer */
mca_btl_openib_proc_modex_t *proc_ports;
/** length of proc_ports array */
uint8_t proc_port_count;
/** list of openib_btl's that touched this proc **/
opal_list_t openib_btls;
/** array of endpoints that have been created to access this proc */
volatile struct mca_btl_base_endpoint_t **proc_endpoints;
/** number of endpoints (length of proc_endpoints array) */
volatile size_t proc_endpoint_count;
/** lock to protect against concurrent access to proc state */
opal_mutex_t proc_lock;
};
typedef struct mca_btl_openib_proc_t mca_btl_openib_proc_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_proc_t);
mca_btl_openib_proc_t* mca_btl_openib_proc_get_locked(opal_proc_t* proc);
int mca_btl_openib_proc_insert(mca_btl_openib_proc_t*, mca_btl_base_endpoint_t*);
int mca_btl_openib_proc_remove(opal_proc_t* proc,
mca_btl_base_endpoint_t* module_endpoint);
int mca_btl_openib_proc_reg_btl(mca_btl_openib_proc_t* ib_proc,
mca_btl_openib_module_t* openib_btl);
END_C_DECLS
#endif


@ -1,175 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2013 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2008-2012 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2009 IBM Corporation. All rights reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved
* Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "btl_openib_frag.h"
#include "btl_openib_endpoint.h"
#include "btl_openib_proc.h"
#include "btl_openib_xrc.h"
/*
* RDMA WRITE local buffer to remote buffer address.
*/
int mca_btl_openib_put (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep, void *local_address,
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata)
{
mca_btl_openib_put_frag_t *frag = NULL;
int rc, qp = order;
if (MCA_BTL_NO_ORDER == qp) {
qp = mca_btl_openib_component.rdma_qp;
}
if (OPAL_UNLIKELY((btl->btl_put_local_registration_threshold < size && !local_handle) || !remote_handle ||
size > btl->btl_put_limit)) {
return OPAL_ERR_BAD_PARAM;
}
frag = to_put_frag(alloc_send_user_frag ());
if (OPAL_UNLIKELY(NULL == frag)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* set base descriptor flags */
to_base_frag(frag)->base.order = qp;
/* free this descriptor when the operation is complete */
to_base_frag(frag)->base.des_flags = MCA_BTL_DES_FLAGS_BTL_OWNERSHIP;
/* set up scatter-gather entry */
to_com_frag(frag)->sg_entry.length = size;
if (local_handle) {
to_com_frag(frag)->sg_entry.lkey = local_handle->lkey;
} else {
/* lkey is not required for inline RDMA write */
to_com_frag(frag)->sg_entry.lkey = 0;
}
to_com_frag(frag)->sg_entry.addr = (uint64_t)(intptr_t) local_address;
to_com_frag(frag)->endpoint = ep;
/* set up rdma callback */
frag->cb.func = cbfunc;
frag->cb.context = cbcontext;
frag->cb.data = cbdata;
frag->cb.local_handle = local_handle;
/* post descriptor */
to_out_frag(frag)->sr_desc.opcode = IBV_WR_RDMA_WRITE;
to_out_frag(frag)->sr_desc.wr.rdma.remote_addr = remote_address;
qp_inflight_wqe_to_frag(ep, qp, to_com_frag(frag));
qp_reset_signal_count(ep, qp);
#if OPAL_ENABLE_HETEROGENEOUS_SUPPORT
if ((ep->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN)
!= (opal_proc_local_get()->proc_arch & OPAL_ARCH_ISBIGENDIAN)) {
to_out_frag(frag)->sr_desc.wr.rdma.rkey = opal_swap_bytes4(remote_handle->rkey);
} else
#endif
{
to_out_frag(frag)->sr_desc.wr.rdma.rkey = remote_handle->rkey;
}
if (ep->endpoint_state != MCA_BTL_IB_CONNECTED) {
OPAL_THREAD_LOCK(&ep->endpoint_lock);
rc = check_endpoint_state(ep, &to_base_frag(frag)->base, &ep->pending_put_frags);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
if (OPAL_ERR_RESOURCE_BUSY == rc) {
/* descriptor was queued pending connection */
return OPAL_SUCCESS;
}
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
MCA_BTL_IB_FRAG_RETURN (frag);
return rc;
}
}
rc = mca_btl_openib_put_internal (btl, ep, frag);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
if (OPAL_LIKELY(OPAL_ERR_OUT_OF_RESOURCE == rc)) {
rc = OPAL_SUCCESS;
/* queue the fragment for when resources are available */
OPAL_THREAD_LOCK(&ep->endpoint_lock);
opal_list_append(&ep->pending_put_frags, (opal_list_item_t*)frag);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
} else {
MCA_BTL_IB_FRAG_RETURN (frag);
}
}
return rc;
}
int mca_btl_openib_put_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
mca_btl_openib_put_frag_t *frag)
{
int qp = to_base_frag(frag)->base.order;
struct ibv_send_wr *bad_wr;
int rc;
/* NTH: the inline send size and remote SRQ number are only available once the endpoint is
* connected. By setting these values here instead of mca_btl_openib_put we guarantee
* both fields are initialized */
to_out_frag(frag)->sr_desc.send_flags = ib_send_flags (to_com_frag(frag)->sg_entry.length,
&(ep->qps[qp]), 1);
#if HAVE_XRC
if (MCA_BTL_XRC_ENABLED && BTL_OPENIB_QP_TYPE_XRC(qp)) {
#if OPAL_HAVE_CONNECTX_XRC
to_out_frag(frag)->sr_desc.xrc_remote_srq_num = ep->rem_info.rem_srqs[qp].rem_srq_num;
#elif OPAL_HAVE_CONNECTX_XRC_DOMAINS
to_out_frag(frag)->sr_desc.qp_type.xrc.remote_srqn = ep->rem_info.rem_srqs[qp].rem_srq_num;
#else
#error "that should never happen"
#endif
}
#endif
/* check for a send wqe */
if (qp_get_wqe(ep, qp) < 0) {
qp_put_wqe(ep, qp);
return OPAL_ERR_OUT_OF_RESOURCE;
}
qp_inflight_wqe_to_frag(ep, qp, to_com_frag(frag));
qp_reset_signal_count(ep, qp);
if (0 != (rc = ibv_post_send(ep->qps[qp].qp->lcl_qp, &to_out_frag(frag)->sr_desc, &bad_wr))) {
qp_put_wqe(ep, qp);
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}


@ -1,211 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2007-2008 Mellanox Technologies. All rights reserved.
* Copyright (c) 2009 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2014 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <infiniband/verbs.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#include <dlfcn.h>
#include "opal/mca/btl/base/base.h"
#include "opal/util/printf.h"
#include "btl_openib_xrc.h"
#include "btl_openib.h"
#if HAVE_XRC
#define SIZE_OF3(A, B, C) (sizeof(A) + sizeof(B) + sizeof(C))
static void ib_address_constructor(ib_address_t *ib_addr);
static void ib_address_destructor(ib_address_t *ib_addr);
OBJ_CLASS_INSTANCE(ib_address_t,
opal_list_item_t,
ib_address_constructor,
ib_address_destructor);
/* This func. opens XRC domain */
int mca_btl_openib_open_xrc_domain(struct mca_btl_openib_device_t *device)
{
int len;
char *xrc_file_name;
const char *dev_name;
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
struct ibv_xrcd_init_attr xrcd_attr;
#endif
dev_name = ibv_get_device_name(device->ib_dev);
len = opal_asprintf(&xrc_file_name,
"%s"OPAL_PATH_SEP"openib_xrc_domain_%s",
opal_process_info.job_session_dir, dev_name);
if (0 > len) {
BTL_ERROR(("Failed to allocate memory for XRC file name: %s\n",
strerror(errno)));
return OPAL_ERROR;
}
device->xrc_fd = open(xrc_file_name, O_CREAT, S_IWUSR|S_IRUSR);
if (0 > device->xrc_fd) {
BTL_ERROR(("Failed to open XRC domain file %s, errno says %s\n",
xrc_file_name,strerror(errno)));
free(xrc_file_name);
return OPAL_ERROR;
}
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
memset(&xrcd_attr, 0, sizeof xrcd_attr);
xrcd_attr.comp_mask = IBV_XRCD_INIT_ATTR_FD | IBV_XRCD_INIT_ATTR_OFLAGS;
xrcd_attr.fd = device->xrc_fd;
xrcd_attr.oflags = O_CREAT;
device->xrcd = ibv_open_xrcd(device->ib_dev_context, &xrcd_attr);
if (NULL == device->xrcd) {
#else
device->xrc_domain = ibv_open_xrc_domain(device->ib_dev_context, device->xrc_fd, O_CREAT);
if (NULL == device->xrc_domain) {
#endif
BTL_ERROR(("Failed to open XRC domain\n"));
close(device->xrc_fd);
free(xrc_file_name);
return OPAL_ERROR;
}
/* the file name is only needed to open the descriptor */
free(xrc_file_name);
return OPAL_SUCCESS;
}
/* This func. closes XRC domain */
int mca_btl_openib_close_xrc_domain(struct mca_btl_openib_device_t *device)
{
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
if (NULL == device->xrcd) {
#else
if (NULL == device->xrc_domain) {
#endif
/* No XRC domain, just exit */
return OPAL_SUCCESS;
}
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
if (ibv_close_xrcd(device->xrcd)) {
#else
if (ibv_close_xrc_domain(device->xrc_domain)) {
#endif
BTL_ERROR(("Failed to close XRC domain, errno %d says %s\n",
errno, strerror(errno)));
return OPAL_ERROR;
}
/* do we need to check exit status */
if (close(device->xrc_fd)) {
BTL_ERROR(("Failed to close XRC file descriptor, errno %d says %s\n",
errno, strerror(errno)));
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}
static void ib_address_constructor(ib_address_t *ib_addr)
{
ib_addr->key = NULL;
ib_addr->subnet_id = 0;
ib_addr->lid = 0;
ib_addr->status = MCA_BTL_IB_ADDR_CLOSED;
ib_addr->qp = NULL;
ib_addr->max_wqe = 0;
/* NTH: make the addr_lock recursive because mca_btl_openib_endpoint_connected can call
* into the CPC with the lock held. The alternative would be to drop the lock but the
* lock is never obtained in a critical path. */
OBJ_CONSTRUCT(&ib_addr->addr_lock, opal_recursive_mutex_t);
OBJ_CONSTRUCT(&ib_addr->pending_ep, opal_list_t);
}
static void ib_address_destructor(ib_address_t *ib_addr)
{
if (NULL != ib_addr->key) {
free(ib_addr->key);
}
OBJ_DESTRUCT(&ib_addr->addr_lock);
OBJ_DESTRUCT(&ib_addr->pending_ep);
}
static int ib_address_init(ib_address_t *ib_addr, uint16_t lid, uint64_t s_id, opal_jobid_t ep_jobid)
{
ib_addr->key = malloc(SIZE_OF3(s_id, lid, ep_jobid));
if (NULL == ib_addr->key) {
BTL_ERROR(("Failed to allocate memory for key\n"));
return OPAL_ERROR;
}
memset(ib_addr->key, 0, SIZE_OF3(s_id, lid, ep_jobid));
/* creating the key = lid + s_id + ep_jobid */
memcpy(ib_addr->key, &lid, sizeof(lid));
memcpy((void*)((char*)ib_addr->key + sizeof(lid)), &s_id, sizeof(s_id));
memcpy((void*)((char*)ib_addr->key + sizeof(lid) + sizeof(s_id)),
&ep_jobid, sizeof(ep_jobid));
/* caching lid and subnet id */
ib_addr->subnet_id = s_id;
ib_addr->lid = lid;
return OPAL_SUCCESS;
}
/* Create new entry in hash table for subnet_id and lid,
* update the endpoint pointer.
* Before call to this function you need to protect with
*/
int mca_btl_openib_ib_address_add_new (uint16_t lid, uint64_t s_id,
opal_jobid_t ep_jobid, mca_btl_openib_endpoint_t *ep)
{
void *tmp;
int ret = OPAL_SUCCESS;
struct ib_address_t *ib_addr = OBJ_NEW(ib_address_t);
ret = ib_address_init(ib_addr, lid, s_id, ep_jobid);
if (OPAL_SUCCESS != ret ) {
BTL_ERROR(("XRC Internal error. Failed to init ib_addr\n"));
OBJ_RELEASE(ib_addr);
return ret;
}
/* is it already in the table ?*/
OPAL_THREAD_LOCK(&mca_btl_openib_component.ib_lock);
if (OPAL_SUCCESS != opal_hash_table_get_value_ptr(&mca_btl_openib_component.ib_addr_table,
ib_addr->key,
SIZE_OF3(s_id, lid, ep_jobid), &tmp)) {
/* It is a new one, so put it in the table */
ret = opal_hash_table_set_value_ptr(&mca_btl_openib_component.ib_addr_table,
ib_addr->key, SIZE_OF3(s_id, lid, ep_jobid), (void*)ib_addr);
if (OPAL_SUCCESS != ret) {
BTL_ERROR(("XRC Internal error."
" Failed to add element to mca_btl_openib_component.ib_addr_table\n"));
OPAL_THREAD_UNLOCK(&mca_btl_openib_component.ib_lock);
OBJ_RELEASE(ib_addr);
return ret;
}
/* update the endpoint with pointer to ib address */
ep->ib_addr = ib_addr;
} else {
/* so we have this one in the table, just add the pointer to the endpoint */
ep->ib_addr = (ib_address_t *)tmp;
assert(lid == ep->ib_addr->lid && s_id == ep->ib_addr->subnet_id);
OBJ_RELEASE(ib_addr);
}
OPAL_THREAD_UNLOCK(&mca_btl_openib_component.ib_lock);
return ret;
}
#endif
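
The key construction in ib_address_init() above can be sketched in isolation. This is only an illustrative sketch, not the removed implementation: demo_jobid_t, DEMO_KEY_SIZE, demo_make_key and demo_key_equal are hypothetical names, and a 32-bit job id is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for opal_jobid_t; the width is an assumption. */
typedef uint32_t demo_jobid_t;

/* Size of the packed key: lid, then subnet id, then job id. */
#define DEMO_KEY_SIZE (sizeof(uint16_t) + sizeof(uint64_t) + sizeof(demo_jobid_t))

/* Pack lid, subnet id and job id back to back, in that order, into a
 * freshly allocated, zero-filled buffer. Returns NULL if allocation
 * fails. The caller owns the buffer and must free() it. */
static void *demo_make_key(uint16_t lid, uint64_t subnet_id, demo_jobid_t jobid)
{
    unsigned char *key = calloc(1, DEMO_KEY_SIZE);
    if (NULL == key) {
        return NULL;
    }
    memcpy(key, &lid, sizeof(lid));
    memcpy(key + sizeof(lid), &subnet_id, sizeof(subnet_id));
    memcpy(key + sizeof(lid) + sizeof(subnet_id), &jobid, sizeof(jobid));
    return key;
}

/* Two keys name the same address only if all DEMO_KEY_SIZE bytes match. */
static int demo_key_equal(const void *a, const void *b)
{
    assert(NULL != a && NULL != b);
    return 0 == memcmp(a, b, DEMO_KEY_SIZE);
}
```

Packing the fields with memcpy rather than through a struct avoids padding bytes, so a byte-wise comparison or hash of the key depends only on the three values.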


@ -1,58 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2007-2008 Mellanox Technologies. All rights reserved.
* Copyright (c) 2014 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2016 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_BTL_OPENIB_XRC_H
#define MCA_BTL_OPENIB_XRC_H
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
#if HAVE_XRC
#define MCA_BTL_XRC_ENABLED (mca_btl_openib_component.num_xrc_qps)
#else
#define MCA_BTL_XRC_ENABLED 0
#endif
typedef enum {
MCA_BTL_IB_ADDR_CONNECTING = 100,
MCA_BTL_IB_ADDR_CONNECTED,
MCA_BTL_IB_ADDR_CLOSED
} mca_btl_openib_ib_addr_state_t;
struct ib_address_t {
opal_list_item_t super;
void *key; /* the key: [LID (16 bit), subnet id (64 bit), job id] */
uint64_t subnet_id; /* caching subnet_id */
uint16_t lid; /* caching lid */
opal_list_t pending_ep; /* list of endpoints that use this ib_address */
mca_btl_openib_qp_t *qp; /* pointer to qp that will be used
for communication with the
destination */
uint32_t remote_xrc_rcv_qp_num; /* remote xrc qp number */
opal_mutex_t addr_lock; /* protection */
mca_btl_openib_ib_addr_state_t status; /* ib port status */
int32_t max_wqe;
};
typedef struct ib_address_t ib_address_t;
int mca_btl_openib_open_xrc_domain(struct mca_btl_openib_device_t *device);
int mca_btl_openib_close_xrc_domain(struct mca_btl_openib_device_t *device);
int mca_btl_openib_ib_address_add_new (uint16_t lid, uint64_t s_id,
opal_jobid_t ep_jobid, mca_btl_openib_endpoint_t *ep);
#endif


@ -1,4 +0,0 @@
# Ignore symbols in this component that are auto-generated and we
# can't do anything about them (e.g., flex/bison symbols).
btl_openib_ini_yyleng
btl_openib_ini_yytext


@ -1,117 +0,0 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2008-2011 Mellanox Technologies. All rights reserved.
# Copyright (c) 2011 Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
# Copyright (c) 2015 Research Organization for Information Science
# and Technology (RIST). All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# MCA_btl_openib_POST_CONFIG([should_build])
# ------------------------------------------
AC_DEFUN([MCA_opal_btl_openib_POST_CONFIG], [
AM_CONDITIONAL([MCA_btl_openib_have_xrc], [test $1 -eq 1 && test "x$btl_openib_have_xrc" = "x1"])
AM_CONDITIONAL([MCA_btl_openib_have_rdmacm], [test $1 -eq 1 && test "x$btl_openib_have_rdmacm" = "x1"])
AM_CONDITIONAL([MCA_btl_openib_have_dynamic_sl], [test $1 -eq 1 && test "x$btl_openib_have_opensm_devel" = "x1"])
AM_CONDITIONAL([MCA_btl_openib_have_udcm], [test $1 -eq 1 && test "x$btl_openib_have_udcm" = "x1"])
])
# MCA_btl_openib_CONFIG([action-if-can-compile],
# [action-if-cant-compile])
# ------------------------------------------------
AC_DEFUN([MCA_opal_btl_openib_CONFIG],[
AC_CONFIG_FILES([opal/mca/btl/openib/Makefile])
OPAL_VAR_SCOPE_PUSH([cpcs btl_openib_LDFLAGS_save btl_openib_LIBS_save])
cpcs="oob"
OPAL_CHECK_OPENFABRICS([btl_openib],
[btl_openib_happy="yes"
OPAL_CHECK_OPENFABRICS_CM([btl_openib])],
[btl_openib_happy="no"])
OPAL_CHECK_EXP_VERBS([btl_openib], [], [])
AS_IF([test "$btl_openib_happy" = "yes"],
[# With the new openib flags, look for ibv_fork_init
btl_openib_LDFLAGS_save="$LDFLAGS"
btl_openib_LIBS_save="$LIBS"
LDFLAGS="$LDFLAGS $btl_openib_LDFLAGS"
LIBS="$LIBS $btl_openib_LIBS"
AC_CHECK_FUNCS([ibv_fork_init])
LDFLAGS="$btl_openib_LDFLAGS_save"
LIBS="$btl_openib_LIBS_save"
$1],
[$2])
AS_IF([test "$btl_openib_happy" = "yes"],
[if test "x$btl_openib_have_xrc" = "x1"; then
cpcs="$cpcs xoob"
fi
if test "x$btl_openib_have_rdmacm" = "x1"; then
cpcs="$cpcs rdmacm"
if test "$enable_openib_rdmacm_ibaddr" = "yes"; then
AC_MSG_CHECKING([IB addressing])
AC_EGREP_CPP(
yes,
[
#include <infiniband/ib.h>
#ifdef AF_IB
yes
#endif
],
[
AC_CHECK_HEADERS(
[rdma/rsocket.h],
[
AC_MSG_RESULT([yes])
AC_DEFINE(BTL_OPENIB_RDMACM_IB_ADDR, 1, rdmacm IB_AF addressing support)
],
[
AC_MSG_RESULT([no])
AC_DEFINE(BTL_OPENIB_RDMACM_IB_ADDR, 0, rdmacm without IB_AF addressing support)
AC_MSG_WARN([There is no IB_AF addressing support by lib rdmacm.])
]
)],
[
AC_MSG_RESULT([no])
AC_DEFINE(BTL_OPENIB_RDMACM_IB_ADDR, 0, rdmacm without IB_AF addressing support)
AC_MSG_WARN([There is no IB_AF addressing support by lib rdmacm.])
])
else
AC_DEFINE(BTL_OPENIB_RDMACM_IB_ADDR, 0, rdmacm without IB_AF addressing support)
fi
fi
if test "x$btl_openib_have_udcm" = "x1"; then
cpcs="$cpcs udcm"
fi
AC_MSG_CHECKING([which openib btl cpcs will be built])
AC_MSG_RESULT([$cpcs])])
# make sure that CUDA-aware checks have been done
AC_REQUIRE([OPAL_CHECK_CUDA])
# substitute in the things needed to build openib
AC_SUBST([btl_openib_CFLAGS])
AC_SUBST([btl_openib_CPPFLAGS])
AC_SUBST([btl_openib_LDFLAGS])
AC_SUBST([btl_openib_LIBS])
OPAL_VAR_SCOPE_POP
])dnl


@ -1,105 +0,0 @@
/*
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2013 Mellanox Technologies, Inc.
* All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_BASE_H
#define BTL_OPENIB_CONNECT_BASE_H
#include "opal/mca/btl/openib/connect/connect.h"
#ifdef OPAL_HAVE_RDMAOE
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl) \
(((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) || \
(IBV_LINK_LAYER_ETHERNET == ((btl)->ib_port_attr.link_layer))) ? \
true : false)
#else
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl) \
((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) ? \
true : false)
#endif
BEGIN_C_DECLS
/*
* Forward declaration to resolve circular dependency
*/
struct mca_btl_base_endpoint_t;
/*
* Open function
*/
int opal_btl_openib_connect_base_register(void);
/*
* Component-wide CPC init
*/
int opal_btl_openib_connect_base_init(void);
/*
* Query CPCs to see if they want to run on a specific module
*/
int opal_btl_openib_connect_base_select_for_local_port
(mca_btl_openib_module_t *btl);
/*
* Forward reference to avoid an include file loop
*/
struct mca_btl_openib_proc_modex_t;
/*
* Select function
*/
int opal_btl_openib_connect_base_find_match
(mca_btl_openib_module_t *btl,
struct mca_btl_openib_proc_modex_t *peer_port,
opal_btl_openib_connect_base_module_t **local_cpc,
opal_btl_openib_connect_base_module_data_t **remote_cpc_data);
/*
* Find a CPC's index so that we can send it in the modex
*/
int opal_btl_openib_connect_base_get_cpc_index
(opal_btl_openib_connect_base_component_t *cpc);
/*
* Lookup a CPC by its index (received from the modex)
*/
opal_btl_openib_connect_base_component_t *
opal_btl_openib_connect_base_get_cpc_byindex(uint8_t index);
/*
* Allocate a CTS frag
*/
int opal_btl_openib_connect_base_alloc_cts(
struct mca_btl_base_endpoint_t *endpoint);
/*
* Free a CTS frag
*/
int opal_btl_openib_connect_base_free_cts(
struct mca_btl_base_endpoint_t *endpoint);
/*
* Start a new connection to an endpoint
*/
int opal_btl_openib_connect_base_start(
opal_btl_openib_connect_base_module_t *cpc,
struct mca_btl_base_endpoint_t *endpoint);
/*
* Component-wide CPC finalize
*/
void opal_btl_openib_connect_base_finalize(void);
END_C_DECLS
#endif


@ -1,541 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2007 Mellanox Technologies, Inc. All rights reserved.
* Copyright (c) 2012-2015 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
*
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "btl_openib.h"
#include "btl_openib_proc.h"
#include "connect/base.h"
#include "connect/btl_openib_connect_empty.h"
#if OPAL_HAVE_RDMACM
#include "connect/btl_openib_connect_rdmacm.h"
#endif
#if OPAL_HAVE_UDCM
#include "connect/btl_openib_connect_udcm.h"
#endif
#include "opal/util/argv.h"
#include "opal/util/output.h"
#include "opal/util/proc.h"
#include "opal/util/show_help.h"
#include "opal/util/printf.h"
#include "opal/util/sys_limits.h"
#include "opal/align.h"
/*
* Array of all possible connection functions
*/
static opal_btl_openib_connect_base_component_t *all[] = {
/* Always have an entry here so that the CP indexes will always be
the same: OOB has been removed, so use the "empty" CPC */
&opal_btl_openib_connect_empty,
/* Always have an entry here so that the CP indexes will always be
the same: XOOB has been removed, so use the "empty" CPC */
&opal_btl_openib_connect_empty,
/* Always have an entry here so that the CP indexes will always be
the same: if RDMA CM is not available, use the "empty" CPC */
#if OPAL_HAVE_RDMACM
&opal_btl_openib_connect_rdmacm,
#else
&opal_btl_openib_connect_empty,
#endif
/* Always have an entry here so that the CP indexes will always be
the same: if UD CM is not enabled, use the "empty" CPC */
#if OPAL_HAVE_UDCM
&opal_btl_openib_connect_udcm,
#else
&opal_btl_openib_connect_empty,
#endif
NULL
};
/* increase this count if any more cpcs are added */
static opal_btl_openib_connect_base_component_t *available[5];
static int num_available = 0;
static char *btl_openib_cpc_include;
static char *btl_openib_cpc_exclude;
/*
* Register MCA parameters
*/
int opal_btl_openib_connect_base_register(void)
{
int i, j, save;
char **temp = NULL, *string = NULL, *all_cpc_names = NULL;
/* Make an MCA parameter to select which connect module to use */
for (i = 0; NULL != all[i]; ++i) {
/* The CPC name "empty" is reserved for "fake" CPC modules */
if (0 != strcmp(all[i]->cbc_name, "empty")) {
opal_argv_append_nosize(&temp, all[i]->cbc_name);
}
}
all_cpc_names = opal_argv_join(temp, ',');
opal_argv_free(temp);
opal_asprintf(&string,
"Method used to select OpenFabrics connections (valid values: %s)",
all_cpc_names);
btl_openib_cpc_include = NULL;
(void) mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"cpc_include", string, MCA_BASE_VAR_TYPE_STRING,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY,
&btl_openib_cpc_include);
free(string);
opal_asprintf(&string,
"Method used to exclude OpenFabrics connections (valid values: %s)",
all_cpc_names);
btl_openib_cpc_exclude = NULL;
(void) mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"cpc_exclude", string, MCA_BASE_VAR_TYPE_STRING,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY,
&btl_openib_cpc_exclude);
free(string);
/* Parse the cpc_[in|ex]clude parameters to come up with a list of
CPCs that are available */
/* If we have an "include" list, then find all those CPCs and put
them in available[] */
if (NULL != btl_openib_cpc_include) {
mca_btl_openib_component.cpc_explicitly_defined = true;
temp = opal_argv_split(btl_openib_cpc_include, ',');
for (save = j = 0; NULL != temp[j]; ++j) {
for (i = 0; NULL != all[i]; ++i) {
if (0 == strcmp(temp[j], all[i]->cbc_name)) {
opal_output(-1, "include: saving %s", all[i]->cbc_name);
available[save++] = all[i];
++num_available;
break;
}
}
if (NULL == all[i]) {
opal_show_help("help-mpi-btl-openib-cpc-base.txt",
"cpc name not found", true,
"include", opal_process_info.nodename,
"include", btl_openib_cpc_include, temp[j],
all_cpc_names);
opal_argv_free(temp);
free(all_cpc_names);
return OPAL_ERR_NOT_FOUND;
}
}
opal_argv_free(temp);
}
/* Otherwise, if we have an "exclude" list, take all the CPCs that
are not in that list and put them in available[] */
else if (NULL != btl_openib_cpc_exclude) {
mca_btl_openib_component.cpc_explicitly_defined = true;
temp = opal_argv_split(btl_openib_cpc_exclude, ',');
/* First: error check -- ensure that all the names are valid */
for (j = 0; NULL != temp[j]; ++j) {
for (i = 0; NULL != all[i]; ++i) {
if (0 == strcmp(temp[j], all[i]->cbc_name)) {
break;
}
}
if (NULL == all[i]) {
opal_show_help("help-mpi-btl-openib-cpc-base.txt",
"cpc name not found", true,
"exclude", opal_process_info.nodename,
"exclude", btl_openib_cpc_exclude, temp[j],
all_cpc_names);
opal_argv_free(temp);
free(all_cpc_names);
return OPAL_ERR_NOT_FOUND;
}
}
/* Now do the exclude */
for (save = i = 0; NULL != all[i]; ++i) {
for (j = 0; NULL != temp[j]; ++j) {
if (0 == strcmp(temp[j], all[i]->cbc_name)) {
break;
}
}
if (NULL == temp[j]) {
opal_output(-1, "exclude: saving %s", all[i]->cbc_name);
available[save++] = all[i];
++num_available;
}
}
opal_argv_free(temp);
}
/* If there's no include/exclude list, copy all[] into available[] */
else {
opal_output(-1, "no include or exclude: saving all");
memcpy(available, all, sizeof(all));
num_available = (sizeof(all) /
sizeof(opal_btl_openib_connect_base_module_t *)) - 1;
}
/* Call the register function on all the CPCs so that they may
setup any MCA params specific to the connection type */
for (i = 0; NULL != available[i]; ++i) {
if (NULL != available[i]->cbc_register) {
available[i]->cbc_register();
}
}
free (all_cpc_names);
return OPAL_SUCCESS;
}
/*
* Called once during openib BTL component initialization to allow CPC
* components to initialize.
*/
int opal_btl_openib_connect_base_init(void)
{
int i, rc;
/* Call each available CPC component's open function, if it has
one. If the CPC component open function returns OPAL_SUCCESS,
keep it. If it returns ERR_NOT_SUPPORTED, remove it from the
available[] array. If it returns something else, return that
error upward. */
for (i = num_available = 0; NULL != available[i]; ++i) {
if (NULL == available[i]->cbc_init) {
available[num_available++] = available[i];
opal_output(-1, "found available cpc (NULL init): %s",
available[i]->cbc_name);
continue;
}
rc = available[i]->cbc_init();
if (OPAL_SUCCESS == rc) {
available[num_available++] = available[i];
opal_output(-1, "found available cpc (SUCCESS init): %s",
available[i]->cbc_name);
continue;
} else if (OPAL_ERR_NOT_SUPPORTED == rc) {
continue;
} else {
return rc;
}
}
available[num_available] = NULL;
return (num_available > 0) ? OPAL_SUCCESS : OPAL_ERR_NOT_AVAILABLE;
}
/*
* Find all the CPCs that are eligible for a single local port (i.e.,
* openib module).
*/
int opal_btl_openib_connect_base_select_for_local_port(mca_btl_openib_module_t *btl)
{
char *msg = NULL;
int i, rc, cpc_index, len;
opal_btl_openib_connect_base_module_t **cpcs;
cpcs = (opal_btl_openib_connect_base_module_t **) calloc(num_available,
sizeof(opal_btl_openib_connect_base_module_t *));
if (NULL == cpcs) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* Go through all available CPCs and query them to see if they
want to run on this module. If they do, save them to a running
array. */
for (len = 1, i = 0; NULL != available[i]; ++i) {
len += strlen(available[i]->cbc_name) + 2;
}
msg = (char *) malloc(len);
if (NULL == msg) {
free(cpcs);
return OPAL_ERR_OUT_OF_RESOURCE;
}
msg[0] = '\0';
for (cpc_index = i = 0; NULL != available[i]; ++i) {
if (i > 0) {
strcat(msg, ", ");
}
strcat(msg, available[i]->cbc_name);
rc = available[i]->cbc_query(btl, &cpcs[cpc_index]);
if (OPAL_ERR_NOT_SUPPORTED == rc || OPAL_ERR_UNREACH == rc) {
continue;
} else if (OPAL_SUCCESS != rc) {
free(cpcs);
free(msg);
return rc;
}
opal_output(-1, "match cpc for local port: %s",
available[i]->cbc_name);
/* If the CPC wants to use the CTS protocol, check to ensure
that QP 0 is PP; if it's not, we can't use this CPC (or the
CTS protocol) */
if (cpcs[cpc_index]->cbm_uses_cts &&
!BTL_OPENIB_QP_TYPE_PP(0)) {
BTL_VERBOSE(("this CPC is only supported when the first btl_openib_receive_queues QP is a PP QP"));
continue;
}
/* This CPC has indicated that it wants to run on this openib
BTL module. Woo hoo! */
++cpc_index;
}
/* If we got an empty array, then no CPCs were eligible. Doh! */
if (0 == cpc_index) {
opal_show_help("help-mpi-btl-openib-cpc-base.txt",
"no cpcs for port", true,
opal_process_info.nodename,
ibv_get_device_name(btl->device->ib_dev),
btl->port_num, msg);
free(cpcs);
free(msg);
return OPAL_ERR_NOT_SUPPORTED;
}
free(msg);
/* We got at least one eligible CPC; save the array into the
module's port_info */
btl->cpcs = cpcs;
btl->num_cpcs = cpc_index;
return OPAL_SUCCESS;
}
/*
* This function is invoked when determining whether we have a CPC in
* common with a specific remote port. We already know that the
* subnet ID is the same between a specific local port and the target
* remote port; now we need to know if we can find a CPC in common
* between the two.
*
* If yes, be sure to find the *same* CPC on both sides. We know
* which CPCs are available on each side, and we know the priorities
* that were assigned on both sides. So find a CPC that is common to
* both sides and has the highest overall priority (between both
* sides).
*
* Return the matching CPC, or NULL if not found.
*/
int
opal_btl_openib_connect_base_find_match(mca_btl_openib_module_t *btl,
mca_btl_openib_proc_modex_t *peer_port,
opal_btl_openib_connect_base_module_t **ret_local_cpc,
opal_btl_openib_connect_base_module_data_t **ret_remote_cpc_data)
{
int i, j, max = -1;
opal_btl_openib_connect_base_module_t *local_cpc, *local_selected = NULL;
opal_btl_openib_connect_base_module_data_t *local_cpcd, *remote_cpcd,
*remote_selected = NULL;
/* Iterate over all the CPCs on the local module */
for (i = 0; i < btl->num_cpcs; ++i) {
local_cpc = btl->cpcs[i];
local_cpcd = &(local_cpc->data);
/* Iterate over all the CPCs on the remote port */
for (j = 0; j < peer_port->pm_cpc_data_count; ++j) {
remote_cpcd = &(peer_port->pm_cpc_data[j]);
/* Are the components the same? */
if (local_cpcd->cbm_component == remote_cpcd->cbm_component) {
/* If so, update the max priority found so far */
if (max < local_cpcd->cbm_priority) {
max = local_cpcd->cbm_priority;
local_selected = local_cpc;
remote_selected = remote_cpcd;
}
if (max < remote_cpcd->cbm_priority) {
max = remote_cpcd->cbm_priority;
local_selected = local_cpc;
remote_selected = remote_cpcd;
}
}
}
}
/* All done! */
if (NULL != local_selected) {
*ret_local_cpc = local_selected;
*ret_remote_cpc_data = remote_selected;
opal_output(-1, "find_match: found match!");
return OPAL_SUCCESS;
} else {
opal_output(-1, "find_match: did NOT find match!");
return OPAL_ERR_NOT_FOUND;
}
}
/*
* Lookup a CPC component's index in the all[] array so that we can
* send it in the modex
*/
int opal_btl_openib_connect_base_get_cpc_index(opal_btl_openib_connect_base_component_t *cpc)
{
int i;
for (i = 0; NULL != all[i]; ++i) {
if (all[i] == cpc) {
return i;
}
}
/* Not found */
return -1;
}
/*
* Lookup a CPC by its index (received from the modex)
*/
opal_btl_openib_connect_base_component_t *
opal_btl_openib_connect_base_get_cpc_byindex(uint8_t index)
{
return (index >= (sizeof(all) /
sizeof(opal_btl_openib_connect_base_module_t *))) ?
NULL : all[index];
}
int opal_btl_openib_connect_base_alloc_cts(mca_btl_base_endpoint_t *endpoint)
{
opal_free_list_item_t *fli;
int length = sizeof(mca_btl_openib_header_t) +
sizeof(mca_btl_openib_header_coalesced_t) +
sizeof(mca_btl_openib_control_header_t) +
sizeof(mca_btl_openib_footer_t) +
mca_btl_openib_component.qp_infos[mca_btl_openib_component.credits_qp].size;
int align_it = 0;
int page_size;
page_size = opal_getpagesize();
if (length >= page_size / 2) { align_it = 1; }
if (align_it) {
// I think this is only active for ~64k+ buffers anyway, but I'm not
// positive, so I'm only increasing the buffer size and alignment if
// it's not too small. That way we'd avoid wasting excessive memory
// in case this code was active for tiny buffers.
length = OPAL_ALIGN(length, page_size, int);
}
/* Explicitly don't use the mpool registration */
fli = &(endpoint->endpoint_cts_frag.super.super.base.super);
fli->registration = NULL;
if (!align_it) {
fli->ptr = malloc(length);
} else {
/* posix_memalign leaves the pointer undefined on failure */
if (0 != posix_memalign((void**)&(fli->ptr), page_size, length)) {
fli->ptr = NULL;
}
}
if (NULL == fli->ptr) {
BTL_ERROR(("malloc failed"));
return OPAL_ERR_OUT_OF_RESOURCE;
}
endpoint->endpoint_cts_mr =
ibv_reg_mr(endpoint->endpoint_btl->device->ib_pd,
fli->ptr, length,
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ);
OPAL_OUTPUT((-1, "registered memory %p, length %d", fli->ptr, length));
if (NULL == endpoint->endpoint_cts_mr) {
free(fli->ptr);
BTL_ERROR(("Failed to reg mr!"));
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* NOTE: We do not need to register this memory with the
opal_memory subsystem, because this is OMPI-controlled memory
-- we do not need to worry about this memory being freed out
from underneath us. */
/* Copy the lkey where it needs to go */
endpoint->endpoint_cts_frag.super.sg_entry.lkey =
endpoint->endpoint_cts_mr->lkey;
endpoint->endpoint_cts_frag.super.sg_entry.length = length;
/* Construct the rest of the recv_frag_t */
OBJ_CONSTRUCT(&(endpoint->endpoint_cts_frag), mca_btl_openib_recv_frag_t);
endpoint->endpoint_cts_frag.super.super.base.order =
mca_btl_openib_component.credits_qp;
endpoint->endpoint_cts_frag.super.endpoint = endpoint;
OPAL_OUTPUT((-1, "Got a CTS frag for peer %s, addr %p, length %d, lkey %d",
opal_get_proc_hostname(endpoint->endpoint_proc->proc_opal),
(void*) endpoint->endpoint_cts_frag.super.sg_entry.addr,
endpoint->endpoint_cts_frag.super.sg_entry.length,
endpoint->endpoint_cts_frag.super.sg_entry.lkey));
return OPAL_SUCCESS;
}
int opal_btl_openib_connect_base_free_cts(mca_btl_base_endpoint_t *endpoint)
{
/* NOTE: We don't need to deregister this memory with opal_memory
because it was not registered there in the first place (see
comment above, near call to ibv_reg_mr). */
if (NULL != endpoint->endpoint_cts_mr) {
ibv_dereg_mr(endpoint->endpoint_cts_mr);
endpoint->endpoint_cts_mr = NULL;
}
if (NULL != endpoint->endpoint_cts_frag.super.super.base.super.ptr) {
free(endpoint->endpoint_cts_frag.super.super.base.super.ptr);
endpoint->endpoint_cts_frag.super.super.base.super.ptr = NULL;
OPAL_OUTPUT((-1, "Freeing CTS frag"));
}
return OPAL_SUCCESS;
}
/*
* Called to start a connection
*/
int opal_btl_openib_connect_base_start(
opal_btl_openib_connect_base_module_t *cpc,
mca_btl_base_endpoint_t *endpoint)
{
/* If the CPC uses the CTS protocol, provide a frag buffer for the
CPC to post. Must allocate these frags up here in the main
thread because the FREE_LIST_WAIT is not thread safe. */
if (cpc->cbm_uses_cts) {
int rc;
rc = opal_btl_openib_connect_base_alloc_cts(endpoint);
if (OPAL_SUCCESS != rc) {
return rc;
}
}
return cpc->cbm_start_connect(cpc, endpoint);
}
/*
* Called during openib btl component close
*/
void opal_btl_openib_connect_base_finalize(void)
{
int i;
for (i = 0 ; i < num_available ; ++i) {
if (NULL != available[i]->cbc_finalize) {
available[i]->cbc_finalize();
}
}
}
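
The selection rule described for opal_btl_openib_connect_base_find_match() above (pick the component common to both sides with the highest priority seen on either side) can be sketched on its own. This is a simplified sketch under stated assumptions: demo_cpc_t and demo_find_match are hypothetical, components are reduced to integer ids, and priorities are assumed to be non-negative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified CPC descriptor: a component id and a priority. */
typedef struct {
    int component;
    int priority;
} demo_cpc_t;

/* Among all (local, remote) pairs that share a component, choose the pair
 * whose larger priority is highest. On success, store the indexes of the
 * chosen pair and return 0. Return -1 if no component is common. */
static int demo_find_match(const demo_cpc_t *local, size_t nlocal,
                           const demo_cpc_t *remote, size_t nremote,
                           size_t *local_idx, size_t *remote_idx)
{
    int best = -1;
    int found = 0;

    assert(NULL != local_idx && NULL != remote_idx);
    for (size_t i = 0; i < nlocal; ++i) {
        for (size_t j = 0; j < nremote; ++j) {
            if (local[i].component != remote[j].component) {
                continue;
            }
            /* Either side's priority can raise the running maximum. */
            int prio = local[i].priority > remote[j].priority
                           ? local[i].priority : remote[j].priority;
            if (prio > best) {
                best = prio;
                *local_idx = i;
                *remote_idx = j;
                found = 1;
            }
        }
    }
    return found ? 0 : -1;
}
```

Because both peers evaluate the same set of pairs with the same priorities, each side would be expected to arrive at the same component, which is the property the comment in the removed code asks for.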


@ -1,46 +0,0 @@
/*
* Copyright (c) 2008 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
#include "connect/connect.h"
static void empty_component_register(void);
static int empty_component_init(void);
static int empty_component_query(mca_btl_openib_module_t *btl,
opal_btl_openib_connect_base_module_t **cpc);
opal_btl_openib_connect_base_component_t opal_btl_openib_connect_empty = {
"empty",
empty_component_register,
empty_component_init,
empty_component_query,
NULL
};
static void empty_component_register(void)
{
/* Nothing to do */
}
static int empty_component_init(void)
{
/* Never let this CPC run */
return OPAL_ERR_NOT_SUPPORTED;
}
static int empty_component_query(mca_btl_openib_module_t *btl,
opal_btl_openib_connect_base_module_t **cpc)
{
/* Never let this CPC run */
return OPAL_ERR_NOT_SUPPORTED;
}
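
The role of the "empty" component above can be illustrated with a small table sketch: placeholder entries occupy the slots of unavailable methods so that the indexes of the remaining ones never change, and the placeholder always reports itself as unsupported so it is never selected. All names here (demo_component_t, demo_all, demo_index_of, demo_collect_usable) are hypothetical, and the return codes 0 and -1 stand in for OPAL_SUCCESS and OPAL_ERR_NOT_SUPPORTED.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified component: a name and an init hook. */
typedef struct {
    const char *name;
    int (*init)(void);
} demo_component_t;

static int demo_init_unsupported(void) { return -1; } /* never usable */
static int demo_init_ok(void) { return 0; }

static demo_component_t demo_empty  = { "empty",  demo_init_unsupported };
static demo_component_t demo_rdmacm = { "rdmacm", demo_init_ok };
static demo_component_t demo_udcm   = { "udcm",   demo_init_ok };

/* Slots 0 and 1 hold placeholders, so "rdmacm" stays at index 2 and
 * "udcm" at index 3 regardless of what is built. NULL terminates. */
static demo_component_t *demo_all[] = {
    &demo_empty, &demo_empty, &demo_rdmacm, &demo_udcm, NULL
};

/* Return the first index of a component in demo_all, or -1 if absent. */
static int demo_index_of(const demo_component_t *c)
{
    for (int i = 0; NULL != demo_all[i]; ++i) {
        if (demo_all[i] == c) {
            return i;
        }
    }
    return -1;
}

/* Copy every component whose init hook succeeds into out (at most cap
 * entries) and return how many were kept. Placeholders drop out here. */
static size_t demo_collect_usable(demo_component_t **out, size_t cap)
{
    size_t n = 0;
    for (int i = 0; NULL != demo_all[i] && n < cap; ++i) {
        if (0 == demo_all[i]->init()) {
            out[n++] = demo_all[i];
        }
    }
    return n;
}
```

A stable index matters when peers exchange the index rather than the name: two builds with different optional methods still agree on what index 2 or 3 refers to.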


@ -1,20 +0,0 @@
/*
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_EMPTY_H
#define BTL_OPENIB_CONNECT_EMPTY_H
#include "opal_config.h"
#include "connect/connect.h"
extern opal_btl_openib_connect_base_component_t opal_btl_openib_connect_empty;
#endif

File diff suppressed because it is too large.


@ -1,20 +0,0 @@
/*
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_RDMACM_H
#define BTL_OPENIB_CONNECT_RDMACM_H
#include "opal_config.h"
#include "connect/connect.h"
extern opal_btl_openib_connect_base_component_t opal_btl_openib_connect_rdmacm;
#endif


@ -1,469 +0,0 @@
/*
* Copyright (c) 2011 Mellanox Technologies. All rights reserved.
*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2014 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "opal/util/show_help.h"
#include "opal/util/sys_limits.h"
#include "opal/util/proc.h"
#include "connect/btl_openib_connect_sl.h"
#include <infiniband/iba/ib_types.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#define SL_NOT_PRESENT 0xFF
#define MAX_GET_SL_REC_RETRIES 20
/* Despite the _MS suffix, this is compared against a microsecond delta (2 seconds) */
#define GET_SL_REC_RETRIES_TIMEOUT_MS 2000000
static struct mca_btl_openib_sa_qp_cache {
/* There will be a MR with the one send and receive buffer together */
/* The send buffer is first, the receive buffer is second */
/* The receive buffer in a UD queue pair needs room for the 40 byte GRH */
/* The buffers are first in the structure for page alignment */
char send_recv_buffer[MAD_BLOCK_SIZE * 2 + 40];
struct mca_btl_openib_sa_qp_cache *next;
struct ibv_context *context;
char *device_name;
uint32_t port_num;
struct ibv_qp *qp;
struct ibv_ah *ah;
struct ibv_cq *cq;
struct ibv_mr *mr;
struct ibv_pd *pd;
struct ibv_recv_wr rwr;
struct ibv_sge rsge;
uint8_t sl_values[65536]; /* 64K */
} *sa_qp_cache = 0;
static int init_ud_qp(
struct ibv_context *context_arg,
struct mca_btl_openib_sa_qp_cache *cache);
static void init_sa_mad(
struct mca_btl_openib_sa_qp_cache *cache,
ib_sa_mad_t *sa_mad,
struct ibv_send_wr *swr,
struct ibv_sge *ssge,
uint16_t lid,
uint16_t rem_lid);
static int get_pathrecord_info(
struct mca_btl_openib_sa_qp_cache *cache,
ib_sa_mad_t *sa_mad,
ib_sa_mad_t *sar,
struct ibv_send_wr *swr,
uint16_t lid,
uint16_t rem_lid);
static int init_device(
struct ibv_context *context_arg,
struct mca_btl_openib_sa_qp_cache *cache,
uint32_t port_num);
/*=================================================================*/
static void free_sa_qp_cache(void)
{
struct mca_btl_openib_sa_qp_cache *cache, *tmp;
cache = sa_qp_cache;
while (NULL != cache) {
/* free cache data */
if (cache->device_name)
free(cache->device_name);
if (NULL != cache->qp)
ibv_destroy_qp(cache->qp);
if (NULL != cache->ah)
ibv_destroy_ah(cache->ah);
if (NULL != cache->cq)
ibv_destroy_cq(cache->cq);
if (NULL != cache->mr)
ibv_dereg_mr(cache->mr);
if (NULL != cache->pd)
ibv_dealloc_pd(cache->pd);
tmp = cache->next;
free(cache);
cache = tmp;
}
sa_qp_cache = NULL;
}
/*=================================================================*/
static int init_ud_qp(struct ibv_context *context_arg,
struct mca_btl_openib_sa_qp_cache *cache)
{
struct ibv_qp_init_attr iattr;
struct ibv_qp_attr mattr;
int rc;
/* create cq */
cache->cq = ibv_create_cq(cache->context, 4, NULL, NULL, 0);
if (NULL == cache->cq) {
BTL_ERROR(("error creating cq, errno says %s", strerror(errno)));
opal_show_help("help-mpi-btl-openib.txt", "init-fail-create-q",
true, opal_process_info.nodename,
__FILE__, __LINE__, "ibv_create_cq",
strerror(errno), errno,
ibv_get_device_name(context_arg->device));
return OPAL_ERROR;
}
/* create qp */
memset(&iattr, 0, sizeof(iattr));
iattr.send_cq = cache->cq;
iattr.recv_cq = cache->cq;
iattr.cap.max_send_wr = 1;
iattr.cap.max_recv_wr = 1;
iattr.cap.max_send_sge = 1;
iattr.cap.max_recv_sge = 1;
iattr.qp_type = IBV_QPT_UD;
cache->qp = ibv_create_qp(cache->pd, &iattr);
if (NULL == cache->qp) {
BTL_ERROR(("error creating qp %s (%d)", strerror(errno), errno));
return OPAL_ERROR;
}
/* modify qp to IBV_QPS_INIT */
memset(&mattr, 0, sizeof(mattr));
mattr.qp_state = IBV_QPS_INIT;
mattr.port_num = cache->port_num;
mattr.qkey = ntohl(IB_QP1_WELL_KNOWN_Q_KEY);
rc = ibv_modify_qp(cache->qp, &mattr,
IBV_QP_STATE |
IBV_QP_PKEY_INDEX |
IBV_QP_PORT |
IBV_QP_QKEY);
if (rc) {
BTL_ERROR(("Error modifying QP[%x] to IBV_QPS_INIT errno says: %s [%d]",
cache->qp->qp_num, strerror(errno), errno));
return OPAL_ERROR;
}
/* modify qp to IBV_QPS_RTR */
memset(&mattr, 0, sizeof(mattr));
mattr.qp_state = IBV_QPS_RTR;
rc = ibv_modify_qp(cache->qp, &mattr, IBV_QP_STATE);
if (rc) {
BTL_ERROR(("Error modifying QP[%x] to IBV_QPS_RTR errno says: %s [%d]",
cache->qp->qp_num, strerror(errno), errno));
return OPAL_ERROR;
}
/* modify qp to IBV_QPS_RTS */
mattr.qp_state = IBV_QPS_RTS;
rc = ibv_modify_qp(cache->qp, &mattr, IBV_QP_STATE | IBV_QP_SQ_PSN);
if (rc) {
BTL_ERROR(("Error modifying QP[%x] to IBV_QPS_RTS errno says: %s [%d]",
cache->qp->qp_num, strerror(errno), errno));
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}
/*=================================================================*/
static void init_sa_mad(struct mca_btl_openib_sa_qp_cache *cache,
ib_sa_mad_t *sa_mad,
struct ibv_send_wr *swr,
struct ibv_sge *ssge,
uint16_t lid,
uint16_t rem_lid)
{
ib_path_rec_t *path_record = (ib_path_rec_t*)sa_mad->data;
memset(swr, 0, sizeof(*swr));
memset(ssge, 0, sizeof(*ssge));
/* Initialize the standard MAD header. */
memset(sa_mad, 0, MAD_BLOCK_SIZE);
ib_mad_init_new((ib_mad_t *)sa_mad, /* mad header pointer */
IB_MCLASS_SUBN_ADM, /* management class */
(uint8_t) 2, /* version */
IB_MAD_METHOD_GET, /* method */
hton64((uint64_t)lid << 48 | /* transaction ID */
(uint64_t)rem_lid << 32 |
(uint64_t)cache->qp->qp_num << 8),
IB_MAD_ATTR_PATH_RECORD, /* attribute ID */
0); /* attribute modifier */
sa_mad->comp_mask = IB_PR_COMPMASK_DLID | IB_PR_COMPMASK_SLID;
path_record->dlid = htons(rem_lid);
path_record->slid = htons(lid);
swr->sg_list = ssge;
swr->num_sge = 1;
swr->opcode = IBV_WR_SEND;
swr->wr.ud.ah = cache->ah;
swr->wr.ud.remote_qpn = ntohl(IB_QP1);
swr->wr.ud.remote_qkey = ntohl(IB_QP1_WELL_KNOWN_Q_KEY);
swr->send_flags = IBV_SEND_SIGNALED | IBV_SEND_SOLICITED;
ssge->addr = (uint64_t)(void *)sa_mad;
ssge->length = MAD_BLOCK_SIZE;
ssge->lkey = cache->mr->lkey;
}
/*=================================================================*/
static int get_pathrecord_info(struct mca_btl_openib_sa_qp_cache *cache,
ib_sa_mad_t *req_mad,
ib_sa_mad_t *resp_mad,
struct ibv_send_wr *swr,
uint16_t lid,
uint16_t rem_lid)
{
struct ibv_send_wr *bswr;
struct ibv_wc wc;
struct timeval get_sl_rec_last_sent, get_sl_rec_last_poll;
struct ibv_recv_wr *brwr;
int got_sl_value, get_sl_rec_retries, rc, ne, i;
ib_path_rec_t *req_path_record = ib_sa_mad_get_payload_ptr(req_mad);
ib_path_rec_t *resp_path_record = ib_sa_mad_get_payload_ptr(resp_mad);
got_sl_value = 0;
get_sl_rec_retries = 0;
rc = ibv_post_recv(cache->qp, &(cache->rwr), &brwr);
if (0 != rc) {
BTL_ERROR(("error posting receive on QP [0x%x] rc says: %s [%d]",
cache->qp->qp_num, strerror(rc), rc));
return OPAL_ERROR;
}
while (0 == got_sl_value) {
rc = ibv_post_send(cache->qp, swr, &bswr);
if (0 != rc) {
BTL_ERROR(("error posting send on QP [0x%x] rc says: %s [%d]",
cache->qp->qp_num, strerror(rc), rc));
return OPAL_ERROR;
}
gettimeofday(&get_sl_rec_last_sent, NULL);
while (0 == got_sl_value) {
ne = ibv_poll_cq(cache->cq, 1, &wc);
if (ne > 0 && IBV_WC_RECV == wc.opcode) {
/* We only care about the status of receive work requests. */
/* If the status of the send work request was anything other */
/* than success, we'll eventually retransmit, so ignore them. */
if (0 == resp_mad->status &&
req_path_record->slid == htons(lid) &&
req_path_record->dlid == htons(rem_lid) &&
IBV_WC_SUCCESS == wc.status &&
wc.byte_len >= MAD_BLOCK_SIZE &&
resp_mad->trans_id == req_mad->trans_id) {
/* Everything matches, so we have the desired SL */
cache->sl_values[rem_lid] = ib_path_rec_sl(resp_path_record);
got_sl_value = 1;
break;
}
/* Probably bad status, unlikely bad lid match. We will */
/* ignore response and let it time out so that we do a */
/* retry, but after a delay. Need to repost receive WR. */
rc = ibv_post_recv(cache->qp, &(cache->rwr), &brwr);
if (0 != rc) {
BTL_ERROR(("error posting receive on QP[%x] rc says: %s [%d]",
cache->qp->qp_num, strerror(rc), rc));
return OPAL_ERROR;
}
} else if (0 == ne) { /* poll did not find anything */
gettimeofday(&get_sl_rec_last_poll, NULL);
i = get_sl_rec_last_poll.tv_sec - get_sl_rec_last_sent.tv_sec;
i = (i * 1000000) +
get_sl_rec_last_poll.tv_usec - get_sl_rec_last_sent.tv_usec;
if (i > GET_SL_REC_RETRIES_TIMEOUT_MS) {
get_sl_rec_retries++;
BTL_VERBOSE(("[%d/%d] retries to get PathRecord",
get_sl_rec_retries, MAX_GET_SL_REC_RETRIES));
if (get_sl_rec_retries > MAX_GET_SL_REC_RETRIES) {
BTL_ERROR(("No response from SA after %d retries",
MAX_GET_SL_REC_RETRIES));
return OPAL_ERROR;
}
/* Need to retransmit request. We must make a new TID */
/* so the SM doesn't see it as the same request. */
req_mad->trans_id += hton64(1);
break;
}
usleep(100); /* otherwise pause before polling again */
} else if (ne < 0) {
BTL_ERROR(("error polling CQ returned %d\n", ne));
return OPAL_ERROR;
}
}
}
return 0;
}
/*=================================================================*/
static int init_device(struct ibv_context *context_arg,
struct mca_btl_openib_sa_qp_cache *cache,
uint32_t port_num)
{
struct ibv_ah_attr aattr;
struct ibv_port_attr pattr;
int rc;
cache->context = ibv_open_device(context_arg->device);
if (NULL == cache->context) {
BTL_ERROR(("error obtaining device context for %s errno says %s",
ibv_get_device_name(context_arg->device), strerror(errno)));
return OPAL_ERROR;
}
cache->device_name = strdup(ibv_get_device_name(cache->context->device));
cache->port_num = port_num;
/* init all sl_values to be SL_NOT_PRESENT */
memset(&cache->sl_values, SL_NOT_PRESENT, sizeof(cache->sl_values));
cache->next = sa_qp_cache;
sa_qp_cache = cache;
/* allocate the protection domain for the device */
cache->pd = ibv_alloc_pd(cache->context);
if (NULL == cache->pd) {
BTL_ERROR(("error allocating protection domain for %s errno says %s",
ibv_get_device_name(context_arg->device), strerror(errno)));
return OPAL_ERROR;
}
/* register memory region */
cache->mr = ibv_reg_mr(cache->pd, cache->send_recv_buffer,
sizeof(cache->send_recv_buffer),
IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_LOCAL_WRITE);
if (NULL == cache->mr) {
BTL_ERROR(("error registering memory region, errno says %s", strerror(errno)));
return OPAL_ERROR;
}
/* init the ud qp */
rc = init_ud_qp(context_arg, cache);
if (OPAL_ERROR == rc) {
return OPAL_ERROR;
}
rc = ibv_query_port(cache->context, cache->port_num, &pattr);
if (rc) {
BTL_ERROR(("error getting port attributes for device %s "
"port number %d errno says %s",
ibv_get_device_name(context_arg->device),
cache->port_num, strerror(errno)));
return OPAL_ERROR;
}
/* create address handle */
memset(&aattr, 0, sizeof(aattr));
aattr.dlid = pattr.sm_lid;
aattr.sl = pattr.sm_sl;
aattr.port_num = cache->port_num;
cache->ah = ibv_create_ah(cache->pd, &aattr);
if (NULL == cache->ah) {
BTL_ERROR(("error creating address handle: %s", strerror(errno)));
return OPAL_ERROR;
}
memset(&(cache->rwr), 0, sizeof(cache->rwr));
cache->rwr.num_sge = 1;
cache->rwr.sg_list = &(cache->rsge);
memset(&(cache->rsge), 0, sizeof(cache->rsge));
cache->rsge.addr = (uint64_t)(void *)
(cache->send_recv_buffer + MAD_BLOCK_SIZE);
cache->rsge.length = MAD_BLOCK_SIZE + 40;
cache->rsge.lkey = cache->mr->lkey;
return 0;
}
/*=================================================================*/
static int get_pathrecord_sl(struct ibv_context *context_arg,
uint32_t port_num,
uint16_t lid,
uint16_t rem_lid)
{
struct ibv_send_wr swr;
ib_sa_mad_t *req_mad, *resp_mad;
struct ibv_sge ssge;
struct mca_btl_openib_sa_qp_cache *cache;
size_t page_size = (size_t)opal_getpagesize();
int rc;
/* search for a cached item */
for (cache = sa_qp_cache; cache; cache = cache->next) {
if (0 == strcmp(cache->device_name,
ibv_get_device_name(context_arg->device))
&& cache->port_num == port_num) {
break;
}
}
if (NULL == cache) {
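/* init new cache */
if (posix_memalign((void **)(&cache), page_size,
sizeof(struct mca_btl_openib_sa_qp_cache))) {
BTL_ERROR(("error in posix_memalign SA cache"));
return OPAL_ERROR;
}
/* posix_memalign does not zero memory; clear the handles so that
free_sa_qp_cache() is safe after a partial init_device() failure */
memset(cache, 0, sizeof(*cache));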
/* one time setup for each device/port combination */
rc = init_device(context_arg, cache, port_num);
if (0 != rc) {
return rc;
}
}
/* if the destination lid SL value is not in the cache, go get it */
if (SL_NOT_PRESENT == cache->sl_values[rem_lid]) {
/* sa_mad is first buffer, where we build the SA Get request to send */
req_mad = (ib_sa_mad_t *)(cache->send_recv_buffer);
init_sa_mad(cache, req_mad, &swr, &ssge, lid, rem_lid);
/* resp_mad is the receive buffer (40 byte offset is for GRH) */
resp_mad = (ib_sa_mad_t *)(cache->send_recv_buffer + MAD_BLOCK_SIZE + 40);
rc = get_pathrecord_info(cache, req_mad, resp_mad, &swr, lid, rem_lid);
if (0 != rc) {
return rc;
}
}
/* now all we do is return the cached value */
return cache->sl_values[rem_lid];
}
/*=================================================================*/
int btl_openib_connect_get_pathrecord_sl(struct ibv_context *context_arg,
uint32_t port_num,
uint16_t lid,
uint16_t rem_lid)
{
int rc = get_pathrecord_sl(context_arg, port_num, lid, rem_lid);
if (OPAL_ERROR == rc) {
free_sa_qp_cache();
}
return rc;
}
/*=================================================================*/
void btl_openib_connect_sl_finalize(void)
{
free_sa_qp_cache();
}
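Two pieces of arithmetic in the file above are easy to misread: the transaction ID layout built in init_sa_mad() and the timeout check in get_pathrecord_info(). The sketch below reproduces both in isolation. The `sketch_*` names are hypothetical, the 24-bit mask on the QP number is an added safeguard that the original does not have, and reading the low byte as room for the retry increment is an interpretation of the code rather than something it states.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/*
 * Pack the local LID, remote LID and QP number into a 64-bit
 * transaction ID (host byte order), using the same shifts as
 * init_sa_mad():
 *   bits 48..63  local LID
 *   bits 32..47  remote LID
 *   bits  8..31  QP number (QP numbers are 24 bits wide)
 *   bits  0..7   left at zero; retries add 1 to the ID
 */
static uint64_t sketch_make_tid(uint16_t lid, uint16_t rem_lid,
                                uint32_t qp_num)
{
    assert(qp_num <= 0xFFFFFFu);
    return ((uint64_t)lid << 48) |
           ((uint64_t)rem_lid << 32) |
           ((uint64_t)(qp_num & 0xFFFFFFu) << 8);
}

/*
 * Elapsed time between two gettimeofday() samples, in microseconds.
 * The original compares this kind of delta against 2000000, which is
 * therefore a two second timeout.
 */
static int64_t sketch_elapsed_usec(const struct timeval *start,
                                   const struct timeval *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000 +
           (int64_t)(end->tv_usec - start->tv_usec);
}
```

The original converts the ID to network byte order with hton64() before placing it in the MAD header; the sketch stops at the host-order value.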


@ -1,26 +0,0 @@
/*
* Copyright (c) 2011 Mellanox Technologies. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_SL_H
#define BTL_OPENIB_CONNECT_SL_H
BEGIN_C_DECLS
int btl_openib_connect_get_pathrecord_sl(
struct ibv_context *context_arg,
uint32_t port_num,
uint16_t lid,
uint16_t rem_lid);
void btl_openib_connect_sl_finalize(void);
END_C_DECLS
#endif /* BTL_OPENIB_CONNECT_SL_H */
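btl_openib_connect_get_pathrecord_sl() is backed by a per-port table of 65536 bytes, one per possible 16-bit LID, where 0xFF marks an entry that has not been looked up yet. Since an InfiniBand service level is a 4-bit value, the sentinel cannot collide with a real SL. A minimal sketch of that lookup pattern follows; the `sketch_*` names and the callback standing in for the SA query are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SL_NOT_PRESENT 0xFF   /* same sentinel value as SL_NOT_PRESENT */
#define SKETCH_NUM_LIDS       65536  /* one slot per possible 16-bit LID */

/* Hypothetical slow path: in the original this is the SA PathRecord query. */
typedef uint8_t (*sketch_sl_query_fn_t)(uint16_t rem_lid);

typedef struct {
    uint8_t  sl_values[SKETCH_NUM_LIDS];
    unsigned queries;                /* how many times the slow path ran */
} sketch_sl_cache_t;

/* Mark every entry as "not looked up yet". */
static void sketch_sl_cache_init(sketch_sl_cache_t *cache)
{
    memset(cache->sl_values, SKETCH_SL_NOT_PRESENT, sizeof(cache->sl_values));
    cache->queries = 0;
}

/* Return the cached SL for rem_lid, running the query only on a miss. */
static uint8_t sketch_sl_lookup(sketch_sl_cache_t *cache, uint16_t rem_lid,
                                sketch_sl_query_fn_t query)
{
    assert(NULL != query);
    if (SKETCH_SL_NOT_PRESENT == cache->sl_values[rem_lid]) {
        cache->sl_values[rem_lid] = query(rem_lid);
        cache->queries++;
    }
    return cache->sl_values[rem_lid];
}

/* Demonstration query: derives a value in the valid SL range 0..15. */
static uint8_t sketch_fake_query(uint16_t rem_lid)
{
    return (uint8_t)(rem_lid & 0x0F);
}
```

A flat array costs 64 KiB per device/port pair but makes every hit a single indexed load.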

File diff suppressed because it is too large.


@ -1,22 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
* Copyright (c) 2011 Los Alamos National Security, LLC. All
* right reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_UD_H
#define BTL_OPENIB_CONNECT_UD_H
#include "opal_config.h"
#include "connect/connect.h"
extern opal_btl_openib_connect_base_component_t opal_btl_openib_connect_udcm;
#endif


@ -1,355 +0,0 @@
/*
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
/**
* @file
*
* This interface is designed to hide the back-end details of how IB
* RC connections are made from the rest of the openib BTL. There are
* module-like instances of the implemented functionality (dlopen and
* friends are not used, but all the functionality is accessed through
* struct's of function pointers, so you can swap between multiple
* different implementations at run time, just like real components).
* Hence, these entities are referred to as "Connect
* Pseudo-Components" (CPCs).
*
* The CPCs are referenced by their names (e.g., "oob", "rdma_cm").
*
* CPCs are split into components and modules, similar to all other
* MCA frameworks in this code base.
*
* Before diving into the CPC interface, let's discuss some
* terminology and mappings of data structures:
*
* - a BTL module represents a network port (in the case of the openib
* BTL, a LID)
* - a CPC module represents one way to make connections to a BTL module
* - hence, a BTL module has potentially multiple CPC modules
* associated with it
* - an endpoint represents a connection between a local BTL module and
* a remote BTL module (in the openib BTL, because of BSRQ, an
* endpoint can contain multiple QPs)
* - when an endpoint is created, one of the CPC modules associated
* with the local BTL is selected and associated with the endpoint
* (obviously, it is a CPC module that is common between the local
* and remote BTL modules)
* - endpoints may be created and destroyed during the MPI job
* - endpoints are created lazily, during the first communication
* between two peers
* - endpoints are destroyed when two MPI processes become
* disconnected (e.g., MPI-2 dynamics or MPI_FINALIZE)
* - hence, BTL modules and CPC modules outlive endpoints.
* Specifically, BTL modules and CPC modules live from MPI_INIT to
* MPI_FINALIZE. endpoints come and go as MPI semantics demand it.
* - therefore, CPC modules need to cache information on endpoints that
* are specific to that connection.
*
* Component interface:
*
* - component_register(): The openib BTL's component_open() function
* calls the connect_base_register() function, which scans all
* compiled-in CPC's. If they have component_register() functions,
* they are called (component_register() functions are only allowed to
* register MCA parameters).
*
* NOTE: The connect_base_register() function will process the
* btl_openib_cpc_include and btl_openib_cpc_exclude MCA parameters
* and automatically include/exclude CPCs as relevant. If a CPC is
* excluded, none of its other interface functions will be invoked for
* the duration of the process.
*
* - component_init(): The openib BTL's component_init() function
* calls connect_base_init(), which will invoke this init function on
* each CPC to see if it wants to run at all. CPCs can gracefully
* remove themselves from consideration in this process by returning
* OPAL_ERR_NOT_SUPPORTED.
*
* - component_query(): The openib BTL's init_one_port() calls the
* connect_base_select_for_local_port() function, which, for each LID
* on that port, calls the component_query() function on every
* available CPC on that LID. This function is intended to see if a
* CPC can run on a specific openib BTL module (i.e., LID). If it
* can, the CPC is supposed to create a CPC module that is specific to
* that BTL/LID and return it. If it cannot, it should return
* OPAL_ERR_NOT_SUPPORTED and be gracefully skipped for this
* OpenFabrics port.
*
* component_finalize(): The openib BTL's component_close() function
* calls connect_base_finalize(), which, in turn, calls the
* component_finalize() function on all available CPCs. Note that all
* CPC modules will have been finalized by this point; the CPC
* component_finalize() function is a chance for the CPC to clean up
* any component-specific resources.
*
* Module interface:
*
* cbm_component member: A pointer pointing to the single, global
* instance of the CPC component. This member is used for creating a
* unique index representing the modules' component so that it can be
* shared with remote peer processes.
*
* cbm_priority member: An integer between 0 and 100, inclusive,
* representing the priority of this CPC.
*
* cbm_modex_message member: A pointer to a blob buffer that will be
* included in the modex message for this port for this CPC (it is
* assumed that this blob is a) only understandable by the
* corresponding CPC in the peer process, and b) contains specific
* addressing/contact information for *this* port's CPC module).
*
* cbm_modex_message_len member: The length of the cbm_modex_message
* blob, in bytes.
*
* cbm_endpoint_init(): Called during endpoint creation, allowing a
* CPC module to cache information on the endpoint. A pointer to the
* endpoint's CPC module is already cached on the endpoint.
*
* cbm_start_connect(): initiate a connection to a remote peer. The
* CPC is responsible for setting itself up for asynchronous operation
* for progressing the outgoing connection request.
*
* cbm_endpoint_finalize(): Called during endpoint destruction,
* allowing the CPC module to destroy anything that it cached on the
* endpoint.
*
* cbm_finalize(): shut down all asynchronous handling and clean up
* any state that was setup for this CPC module/BTL. Some CPCs setup
* asynchronous support on a per-HCA/NIC basis (vs. per-port/LID). It
* is the responsibility of the CPC to figure out such issues (e.g.,
* via reference counting) -- there is no notification from the
* upper-level BTL about when an entire HCA/NIC is no longer being
* used. There is only this function, which tells when a specific
* CPC/BTL module is no longer being used.
*
* cbm_uses_cts: a bool that indicates whether the CPC will use the
* CTS protocol or not.
* - if true: the CPC will post the fragment on
* endpoint->endpoint_cts_frag as a receive buffer and will *not*
* call opal_btl_openib_post_recvs().
* - if false: the CPC will call opal_btl_openib_post_recvs() before
* calling opal_btl_openib_cpc_complete().
*
* There are two functions in the main openib BTL that the CPC may
* call:
*
* - opal_btl_openib_post_recvs(endpoint): once a QP is locally
* connected to the remote side (but we don't know if the remote side
* is connected to us yet), this function is invoked to post buffers
* on the QP, setup credits for the endpoint, etc. This function is
* *only* invoked if the CPC's cbm_uses_cts is false.
*
* - opal_btl_openib_cpc_complete(endpoint): once a CPC knows
* that a QP is connected on *both* sides, this function is invoked to
* tell the main openib BTL "ok, you can use this connection now."
* (e.g., the main openib BTL will either invoke the CTS protocol or
* start sending out fragments that were queued while the connection
* was establishing, etc.).
*/
#ifndef BTL_OPENIB_CONNECT_H
#define BTL_OPENIB_CONNECT_H
BEGIN_C_DECLS
#define BCF_MAX_NAME 64
/**
* Must forward declare these structs to avoid include file loops.
*/
struct mca_btl_openib_hca_t;
struct mca_btl_openib_module_t;
struct mca_btl_base_endpoint_t;
/**
* This struct is defined below
*/
struct opal_btl_openib_connect_base_module_t;
/************************************************************************/
/**
* Function to register MCA params in the connect functions. It
* returns no value, so it cannot fail.
*/
typedef void (*opal_btl_openib_connect_base_component_register_fn_t)(void);
/**
* This function is invoked once by the openib BTL component during
* startup. It is intended to have CPC component-wide startup.
*
* Return value:
*
* - OPAL_SUCCESS: this CPC component will be used in selection during
* this process.
*
* - OPAL_ERR_NOT_SUPPORTED: this CPC component will be silently
* ignored in this process.
*
* - Other OPAL_ERR_* values: the error will be propagated upwards,
* likely causing a fatal error (and/or the openib BTL component
* being ignored).
*/
typedef int (*opal_btl_openib_connect_base_component_init_fn_t)(void);
/**
* Query the CPC to see if it wants to run on a specific port (i.e., a
* specific BTL module). If the component init function previously
* returned OPAL_SUCCESS, this function is invoked once per BTL module
* creation (i.e., for each port found by an MPI process). If this
* CPC wants to be used on this BTL module, it returns a CPC module
* that is specific to this BTL module.
*
* The BTL module in question is passed to the function; all of its
* attributes can be used to query to see if it's eligible for this
* CPC.
*
* If it is eligible, the CPC is responsible for creating a
* corresponding CPC module, filling in all the relevant fields on the
* modules, and for setting itself up to run (per above) and returning
* a CPC module (this is effectively the "module_init" function).
* Note that the module priority must be between 0 and 100
* (inclusive). When multiple CPCs are eligible for a single module,
* the CPC with the highest priority will be used.
*
* Return value:
*
* - OPAL_SUCCESS if this CPC is eligible for and was able to be setup
* for this BTL module. It is assumed that the CPC is now completely
* setup to run on this openib module (per description above).
*
* - OPAL_ERR_NOT_SUPPORTED if this CPC cannot support this BTL
* module. This is not an error; it's just the CPC saying "sorry, I
* cannot support this BTL module."
*
* - Other OPAL_ERR_* code: an error occurred.
*/
typedef int (*opal_btl_openib_connect_base_func_component_query_t)
(struct mca_btl_openib_module_t *btl,
struct opal_btl_openib_connect_base_module_t **cpc);
/**
* This function is invoked once by the openib BTL component during
* shutdown. It is intended to have CPC component-wide shutdown.
*/
typedef int (*opal_btl_openib_connect_base_component_finalize_fn_t)(void);
/**
* CPC component struct
*/
struct opal_btl_openib_connect_base_component_t {
/** Name of this set of connection functions */
char cbc_name[BCF_MAX_NAME];
/** Register function. Can be NULL. */
opal_btl_openib_connect_base_component_register_fn_t cbc_register;
/** CPC component init function. Can be NULL. */
opal_btl_openib_connect_base_component_init_fn_t cbc_init;
/** Query the CPC component to get a CPC module corresponding to
an openib BTL module. Cannot be NULL. */
opal_btl_openib_connect_base_func_component_query_t cbc_query;
/** CPC component finalize function. Can be NULL. */
opal_btl_openib_connect_base_component_finalize_fn_t cbc_finalize;
};
/**
* Convenience typedef
*/
typedef struct opal_btl_openib_connect_base_component_t opal_btl_openib_connect_base_component_t;
/************************************************************************/
/**
* Function called when an endpoint has been created and has been
* associated with a CPC.
*/
typedef int (*opal_btl_openib_connect_base_module_endpoint_init_fn_t)
(struct mca_btl_base_endpoint_t *endpoint);
/**
* Function to initiate a connection to a remote process.
*/
typedef int (*opal_btl_openib_connect_base_module_start_connect_fn_t)
(struct opal_btl_openib_connect_base_module_t *cpc,
struct mca_btl_base_endpoint_t *endpoint);
/**
* Function called when an endpoint is being destroyed.
*/
typedef int (*opal_btl_openib_connect_base_module_endpoint_finalize_fn_t)
(struct mca_btl_base_endpoint_t *endpoint);
/**
* Function to finalize the CPC module. It is called once when the
* CPC module's corresponding openib BTL module is being finalized.
*/
typedef int (*opal_btl_openib_connect_base_module_finalize_fn_t)
(struct mca_btl_openib_module_t *btl,
struct opal_btl_openib_connect_base_module_t *cpc);
/**
* Meta data about a CPC module. This is in a standalone struct
* because it is used in both the CPC module struct and the
* openib_btl_proc_t struct to hold information received from the
* modex.
*/
typedef struct opal_btl_openib_connect_base_module_data_t {
/** Pointer back to the component. Used by the base and openib
btl to calculate this module's index for the modex. */
opal_btl_openib_connect_base_component_t *cbm_component;
/** Priority of the CPC module (must be >=0 and <=100) */
uint8_t cbm_priority;
/** Blob that the CPC wants to include in the openib modex message
for a specific port, or NULL if the CPC does not want to
include a message in the modex. */
void *cbm_modex_message;
/** Length of the cbm_modex_message blob (0 if
cbm_modex_message==NULL). The message is intended to be short
because the size of the modex broadcast is a function of
sum(cbm_modex_message_len[i]) for
i=(0...total_num_ports_in_MPI_job). For example, IBCM imposes its
own [very short] limits (per IBTA volume 1, chapter 12). */
uint8_t cbm_modex_message_len;
} opal_btl_openib_connect_base_module_data_t;
/**
* Struct for holding CPC module and associated meta data
*/
typedef struct opal_btl_openib_connect_base_module_t {
/** Meta data about the module */
opal_btl_openib_connect_base_module_data_t data;
/** Endpoint initialization function */
opal_btl_openib_connect_base_module_endpoint_init_fn_t cbm_endpoint_init;
/** Connect function */
opal_btl_openib_connect_base_module_start_connect_fn_t cbm_start_connect;
/** Endpoint finalization function */
opal_btl_openib_connect_base_module_endpoint_finalize_fn_t cbm_endpoint_finalize;
/** Finalize the cpc module */
opal_btl_openib_connect_base_module_finalize_fn_t cbm_finalize;
/** Whether this module will use the CTS protocol or not. This
directly states whether this module will call
mca_btl_openib_endpoint_post_recvs() or not: true = this
module will *not* call _post_recvs() and instead will post the
receive buffer provided at endpoint->endpoint_cts_frag on qp
0. */
bool cbm_uses_cts;
} opal_btl_openib_connect_base_module_t;
END_C_DECLS
#endif
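The header above states that module priorities lie between 0 and 100 inclusive and that, when several CPCs are eligible for one BTL module, the highest priority wins. The sketch below shows only that local selection step; matching against the CPCs offered by the remote peer is left out. The `sketch_*` names are hypothetical, a NULL slot stands for a component that declined the port, and resolving ties in favour of the earlier candidate is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a CPC module: just a name and a priority. */
typedef struct {
    const char *name;
    uint8_t     priority;   /* must be >= 0 and <= 100 */
} sketch_cpc_module_t;

/*
 * Return the candidate with the highest priority, or NULL when no
 * candidate is usable.  NULL entries are skipped.
 */
static const sketch_cpc_module_t *
sketch_select_cpc(const sketch_cpc_module_t *const *candidates, size_t n)
{
    const sketch_cpc_module_t *best = NULL;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (NULL == candidates[i]) {
            continue;       /* this component declined the port */
        }
        assert(candidates[i]->priority <= 100);
        if (NULL == best || candidates[i]->priority > best->priority) {
            best = candidates[i];
        }
    }
    return best;
}
```

A NULL result corresponds to the "no cpcs for port" situation, in which the port cannot be used by the BTL.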


@ -1,57 +0,0 @@
# -*- text -*-
#
# Copyright (c) 2008-2009 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for Open MPI's OpenFabrics IB CPC
# support.
#
[no cpcs for port]
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: %s
Local device: %s
Local port: %d
CPCs attempted: %s
#
[cpc name not found]
An invalid CPC name was specified via the btl_openib_cpc_%s MCA
parameter.
Local host: %s
btl_openib_cpc_%s value: %s
Invalid name: %s
All possible valid names: %s
#
[inline truncated]
WARNING: The btl_openib_max_inline_data MCA parameter was used to
specify how much inline data should be used, but a device reduced this
value. This is not an error; it simply means that your run will use
a smaller inline data value than was requested.
Local host: %s
Local device: %s
Local port: %d
Requested value: %d
Value used by device: %d
#
[ibv_create_qp failed]
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.
For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: %s
Local device: %s
Queue pair type: %s


@ -1,67 +0,0 @@
# -*- text -*-
#
# Copyright (c) 2008 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for Open MPI's OpenFabrics RDMA CM
# support (the openib BTL).
#
[could not find matching endpoint]
The OpenFabrics device in an MPI process received an RDMA CM connect
request for a peer that it could not identify as part of this MPI job.
This should not happen. Your process is likely to abort; sorry.
Local host: %s
Local device: %s
Remote address: %s
Remote TCP port: %d
#
[illegal tcp port]
The btl_openib_connect_rdmacm_port MCA parameter was used to specify
an illegal TCP port value. TCP ports must be between 0 and 65535
(ports below 1024 can only be used by root).
TCP port: %d
This value was ignored.
#
[illegal retry count]
The btl_openib_connect_rdmacm_retry_count MCA parameter was used to specify
an illegal retry count.
Retry count: %d
#
[illegal timeout]
The btl_openib_connect_rdmacm_resolve_timeout parameter was used to
specify an illegal timeout value. Timeout values are specified in
milliseconds and must be greater than 0.
Timeout value: %d
This value was ignored.
#
[rdma cm device removal]
The RDMA CM returned that the device Open MPI was trying to use has
been removed.
Local host: %s
Local device: %s
Your MPI job will now abort, sorry.
#
[rdma cm event error]
The RDMA CM returned an event error while attempting to make a
connection. This type of error usually indicates a network
configuration error.
Local host: %s
Local device: %s
Error name: %s
Peer: %s
Your MPI job will now abort, sorry.
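The "illegal tcp port" and "illegal timeout" messages above imply simple range checks, with the bad value ignored rather than treated as fatal. A minimal sketch of such checks is given below. TCP port numbers are 16-bit, so 65535 is the largest usable value. The `sketch_*` names are hypothetical, and accepting port 0 (commonly "let the system choose") is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* A TCP port is a 16-bit value; 0 is accepted here by assumption. */
static bool sketch_port_is_valid(long port)
{
    return port >= 0 && port <= 65535;
}

/* Timeouts are given in milliseconds and must be strictly positive. */
static bool sketch_timeout_is_valid(long timeout_ms)
{
    return timeout_ms > 0;
}

/*
 * Return the requested port when it is valid; otherwise ignore it and
 * return the fallback, mirroring "This value was ignored."
 */
static long sketch_port_or_default(long requested, long fallback)
{
    assert(sketch_port_is_valid(fallback));
    return sketch_port_is_valid(requested) ? requested : fallback;
}
```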


@ -1,725 +0,0 @@
# -*- text -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2006 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2006-2011 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2007-2009 Mellanox Technologies. All rights reserved.
# Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
# Copyright (c) 2013-2014 NVIDIA Corporation. All rights reserved.
# Copyright (c) 2018 Los Alamos National Security, LLC. All rights
# reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for Open MPI's OpenFabrics support
# (the openib BTL).
#
[ini file:file not found]
The Open MPI OpenFabrics (openib) BTL component was unable to find or
read an INI file that was requested via the
btl_openib_device_param_files MCA parameter. Please check this file
and/or modify the btl_openib_device_param_files MCA parameter:
%s
#
[ini file:not in a section]
In parsing the OpenFabrics (openib) BTL parameter file, values were
found that were not in a valid INI section. These values will be
ignored. Please re-check this file:
%s
At line %d, near the following text:
%s
#
[ini file:unexpected token]
In parsing the OpenFabrics (openib) BTL parameter file, unexpected
tokens were found (this may cause significant portions of the INI file
to be ignored). Please re-check this file:
%s
At line %d, near the following text:
%s
#
[ini file:expected equals]
In parsing the OpenFabrics (openib) BTL parameter file, unexpected
tokens were found (this may cause significant portions of the INI file
to be ignored). An equals sign ("=") was expected but was not found.
Please re-check this file:
%s
At line %d, near the following text:
%s
#
[ini file:expected newline]
In parsing the OpenFabrics (openib) BTL parameter file, unexpected
tokens were found (this may cause significant portions of the INI file
to be ignored). A newline was expected but was not found. Please
re-check this file:
%s
At line %d, near the following text:
%s
#
[ini file:unknown field]
In parsing the OpenFabrics (openib) BTL parameter file, an
unrecognized field name was found. Please re-check this file:
%s
At line %d, the field named:
%s
This field, and any other unrecognized fields, will be skipped.
#
[no device params found]
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: %s
Device name: %s
Device vendor ID: 0x%04x
Device vendor part ID: %d
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
#
[init-fail-no-mem]
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occurred
here:
Local host: %s
OMPI source: %s:%d
Function: %s()
Device: %s
Memlock limit: %s
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
#
[init-fail-create-q]
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue. This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(e.g., a shared receive queue is specified in the
btl_openib_receive_queues MCA parameter for a device that does not
support it). The failure occurred here:
Local host: %s
OMPI source: %s:%d
Function: %s()
Error: %s (errno=%d)
Device: %s
You may need to consult with your system administrator to get this
problem fixed.
#
[pp rnr retry exceeded]
The OpenFabrics "receiver not ready" retry count on a per-peer
connection between two MPI processes has been exceeded. In general,
this should not happen because Open MPI uses flow control on per-peer
connections to ensure that receivers are always ready when data is
sent.
This error usually means one of two things:
1. There is something awry within the network fabric itself.
2. A bug in Open MPI has caused flow control to malfunction.
#1 is usually more likely. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: %s
Local device: %s
Peer host: %s
You may need to consult with your system administrator to get this
problem fixed.
#
[srq rnr retry exceeded]
The OpenFabrics "receiver not ready" retry count on a shared receive
queue or XRC receive queue has been exceeded. This error can occur if
the mca_btl_openib_ib_rnr_retry is set to a value less than 7 (where 7 is
the default value and effectively means "infinite retry"). If your
rnr_retry value is 7, there might be something awry within the network
fabric itself. In this case, you should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: %s
Local device: %s
Peer host: %s
You may need to consult with your system administrator to get this
problem fixed.
#
[pp retry exceeded]
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 20). The actual timeout value used is calculated as:
4.096 microseconds * (2^btl_openib_ib_timeout)
See the InfiniBand spec 1.2 (section 12.7.34) for more details.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: %s
Local device: %s
Peer host: %s
You may need to consult with your system administrator to get this
problem fixed.
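As a worked example of the timeout formula quoted above, the following Python sketch evaluates 4.096 microseconds * (2^btl_openib_ib_timeout). The function name is hypothetical and is not part of Open MPI; only the formula and the default exponent of 20 come from the help text.

```python
def local_ack_timeout_seconds(ib_timeout: int) -> float:
    """Return 4.096 microseconds * 2**ib_timeout, expressed in seconds."""
    if ib_timeout < 0:
        raise ValueError("timeout exponent must be non-negative")
    return 4.096e-6 * (2 ** ib_timeout)


if __name__ == "__main__":
    # The help text names 20 as the default value of btl_openib_ib_timeout.
    print(f"{local_ack_timeout_seconds(20):.3f} seconds")
```

With the default exponent of 20 the local ACK timeout is therefore roughly 4.3 seconds per attempt.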
#
[no active ports found]
WARNING: There is at least one non-excluded OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: %s
#
[error in device init]
WARNING: There was an error initializing an OpenFabrics device.
Local host: %s
Local device: %s
#
[no devices right type]
WARNING: No OpenFabrics devices of the right type were found within
the requested bus distance. The OpenFabrics BTL will be ignored for
this run.
Local host: %s
Requested type: %s
If the "requested type" is "<any>", this usually means that *no*
OpenFabrics devices were found within the requested bus distance.
Note that starting with Open MPI 4.0, only iWARP and RoCE devices are considered
for selection by default. Set the btl_openib_allow_ib MCA
parameter to "true" to allow use of InfiniBand devices.
#
[default subnet prefix]
WARNING: There is more than one active port on host '%s', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes have a different subnet ID
value.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
#
[ibv_fork requested but not supported]
WARNING: fork() support was requested for the OpenFabrics (openib)
BTL, but it is not supported on the host %s. Deactivating the
OpenFabrics BTL.
#
[ibv_fork_init fail]
WARNING: fork() support was requested for the OpenFabrics (openib)
BTL, but the library call ibv_fork_init() failed on the host %s.
Deactivating the OpenFabrics BTL.
#
[wrong buffer alignment]
Wrong buffer alignment %d configured on host '%s'. The alignment must be
greater than zero and a power of two. Using the default %d instead.
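The alignment rule in this message (greater than zero and a power of two) can be sketched in Python as follows. Both function names are hypothetical; this is not the openib BTL's own check, only an illustration of the stated rule and of falling back to a default.

```python
def is_valid_alignment(alignment: int) -> bool:
    """True when alignment is greater than zero and a power of two."""
    # A positive power of two has exactly one bit set, so clearing the
    # lowest set bit (n & (n - 1)) must leave zero.
    return alignment > 0 and (alignment & (alignment - 1)) == 0


def effective_alignment(requested: int, default: int) -> int:
    """Use the requested alignment if valid, otherwise the default."""
    return requested if is_valid_alignment(requested) else default
```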
#
[of error event]
The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.
Local host: %s
MPI process PID: %d
Error number: %d (%s)
This error may indicate connectivity problems within the fabric;
please contact your system administrator.
#
[of unknown event]
The OpenFabrics stack has reported an unknown network error event.
Open MPI will try to continue, but the job may end up failing.
Local host: %s
MPI process PID: %d
Error number: %d
This error may indicate that you are using an OpenFabrics library
version that is not currently supported by Open MPI. You might try
recompiling Open MPI against your OpenFabrics library installation to
get more information.
#
[specified include and exclude]
ERROR: You have specified more than one of the btl_openib_if_include,
btl_openib_if_exclude, btl_openib_ipaddr_include, or btl_openib_ipaddr_exclude
MCA parameters. These four parameters are mutually exclusive; you can only
specify one.
For reference, the values that you specified are:
btl_openib_if_include: %s
btl_openib_if_exclude: %s
btl_openib_ipaddr_include: %s
btl_openib_ipaddr_exclude: %s
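A minimal sketch of the mutual-exclusion rule described above, written in Python. The four parameter names come from the help text; the function names and the device name in the usage note are hypothetical, and this is not how the openib BTL itself performed the check.

```python
EXCLUSIVE_PARAMS = (
    "btl_openib_if_include",
    "btl_openib_if_exclude",
    "btl_openib_ipaddr_include",
    "btl_openib_ipaddr_exclude",
)


def conflicting_params(settings: dict) -> list:
    """Return the mutually exclusive parameters that have a non-empty value."""
    return [name for name in EXCLUSIVE_PARAMS if settings.get(name)]


def is_valid_selection(settings: dict) -> bool:
    """At most one of the four parameters may be specified."""
    return len(conflicting_params(settings)) <= 1
```

For example, a settings dictionary that sets both btl_openib_if_include and btl_openib_if_exclude (say, to a made-up device name such as "dev0") would be rejected, while setting only one of them would pass.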
#
[nonexistent port]
WARNING: One or more nonexistent OpenFabrics devices/ports were
specified:
Host: %s
MCA parameter: mca_btl_if_%sclude
Nonexistent entities: %s
These entities will be ignored. You can disable this warning by
setting the btl_openib_warn_nonexistent_if MCA parameter to 0.
#
[invalid mca param value]
WARNING: An invalid MCA parameter value was found for the OpenFabrics
(openib) BTL.
Problem: %s
Resolution: %s
#
[no qps in receive_queues]
WARNING: No queue pairs were defined in the btl_openib_receive_queues
MCA parameter. At least one queue pair must be defined. The
OpenFabrics (openib) BTL will therefore be deactivated for this run.
Local host: %s
#
[invalid qp type in receive_queues]
WARNING: An invalid queue pair type was specified in the
btl_openib_receive_queues MCA parameter. The OpenFabrics (openib) BTL
will be deactivated for this run.
Valid queue pair types are "P" for per-peer and "S" for shared receive
queue.
Local host: %s
btl_openib_receive_queues: %s
Bad specification: %s
#
[invalid pp qp specification]
WARNING: An invalid per-peer receive queue specification was detected
as part of the btl_openib_receive_queues MCA parameter. The
OpenFabrics (openib) BTL will therefore be deactivated for this run.
Per-peer receive queues require between 2 and 5 parameters:
1. Buffer size in bytes (mandatory)
2. Number of buffers (mandatory)
3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
4. Credit window size (optional; defaults to (low_watermark / 2),
must be > 0)
5. Number of buffers reserved for credit messages (optional;
defaults to (num_buffers*2-1)/credit_window)
Example: P,128,256,128,16
- 128 byte buffers
- 256 buffers to receive incoming MPI messages
- When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
- If the number of available credits reaches 16, send an explicit
credit message to the sender
- Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
reserved for explicit credit messages
Local host: %s
Bad queue specification: %s
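The defaulting rules for per-peer queues listed above can be illustrated with a short Python sketch. The class and function names are hypothetical, and integer (truncating) division is an assumption that matches the worked value ((256 * 2) - 1) / 16 = 31 in the message; this is not the parser the openib BTL used.

```python
from dataclasses import dataclass


@dataclass
class PerPeerQueue:
    buffer_size: int
    num_buffers: int
    low_watermark: int
    credit_window: int
    reserved: int


def parse_pp_spec(spec: str) -> PerPeerQueue:
    """Parse a 'P,...' specification, applying the documented defaults."""
    fields = spec.split(",")
    if fields[0] != "P":
        raise ValueError("not a per-peer specification")
    values = [int(v) for v in fields[1:]]
    if not 2 <= len(values) <= 5:
        raise ValueError("per-peer queues take between 2 and 5 parameters")
    size, num = values[0], values[1]
    low = values[2] if len(values) > 2 else num // 2
    window = values[3] if len(values) > 3 else low // 2
    if window <= 0:
        raise ValueError("credit window must be > 0")
    reserved = values[4] if len(values) > 4 else (num * 2 - 1) // window
    return PerPeerQueue(size, num, low, window, reserved)
```

Parsing the example "P,128,256,128,16" from the message yields 31 reserved credit buffers, in agreement with the text.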
#
[invalid srq specification]
WARNING: An invalid shared receive queue specification was detected as
part of the btl_openib_receive_queues MCA parameter. The OpenFabrics
(openib) BTL will therefore be deactivated for this run.
Shared receive queues can take between 2 and 6 parameters:
1. Buffer size in bytes (mandatory)
2. Number of buffers (mandatory)
3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
4. Maximum number of outstanding sends a sender can have (optional;
defaults to (low_watermark / 4))
5. Start value of number of receive buffers that will be pre-posted (optional; defaults to (num_buffers / 4))
6. Event limit buffer count watermark (optional; defaults to (3/16 of start value of buffers number))
Example: S,1024,256,128,32,32,8
- 1024 byte buffers
- 256 buffers to receive incoming MPI messages
- When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
- A sender will not send to a peer unless it has fewer than 32
outstanding sends to that peer.
- 32 receive buffers will be preposted.
- When the number of unused shared receive buffers reaches 8, more
buffers (32 in this case) will be posted.
Local host: %s
Bad queue specification: %s
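A similar Python sketch for shared receive queues, applying the six documented parameters and their defaults. The names are hypothetical, and rounding down the 3/16 default is an assumption, since the message does not say how fractions are handled.

```python
from dataclasses import dataclass


@dataclass
class SharedReceiveQueue:
    buffer_size: int
    num_buffers: int
    low_watermark: int
    max_pending_sends: int
    initial_buffers: int
    event_limit: int


def parse_srq_spec(spec: str) -> SharedReceiveQueue:
    """Parse an 'S,...' specification, applying the documented defaults."""
    fields = spec.split(",")
    if fields[0] != "S":
        raise ValueError("not a shared receive queue specification")
    values = [int(v) for v in fields[1:]]
    if not 2 <= len(values) <= 6:
        raise ValueError("shared receive queues take between 2 and 6 parameters")
    size, num = values[0], values[1]
    low = values[2] if len(values) > 2 else num // 2
    pending = values[3] if len(values) > 3 else low // 4
    initial = values[4] if len(values) > 4 else num // 4
    # Assumption: the 3/16 default is rounded down to an integer.
    limit = values[5] if len(values) > 5 else (3 * initial) // 16
    return SharedReceiveQueue(size, num, low, pending, initial, limit)
```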
#
[rd_num must be > rd_low]
WARNING: The number of buffers for a queue pair specified via the
btl_openib_receive_queues MCA parameter must be greater than the low
buffer count watermark. The OpenFabrics (openib) BTL will therefore
be deactivated for this run.
Local host: %s
Bad queue specification: %s
#
[rd_num must be >= rd_init]
WARNING: The number of buffers for a queue pair specified via the
btl_openib_receive_queues MCA parameter (parameter #2) must be
greater than or equal to the initial SRQ size (parameter #5).
The OpenFabrics (openib) BTL will therefore be deactivated for this run.
Local host: %s
Bad queue specification: %s
#
[srq_limit must be > rd_num]
WARNING: The number of buffers for a queue pair specified via the
btl_openib_receive_queues MCA parameter (parameter #2) must be greater than the limit
buffer count (parameter #6). The OpenFabrics (openib) BTL will therefore
be deactivated for this run.
Local host: %s
Bad queue specification: %s
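The three size constraints described in the preceding messages (buffer count greater than the low watermark, at least the initial SRQ size, and greater than the limit count) can be checked together. The sketch below is in Python; the variable names rd_num, rd_low, rd_init, and srq_limit are borrowed from the message titles, while the function itself is hypothetical.

```python
def check_srq_sizes(rd_num: int, rd_low: int, rd_init: int, srq_limit: int) -> list:
    """Return a list of violated constraints (empty when all hold)."""
    problems = []
    if not rd_num > rd_low:
        problems.append("rd_num must be > rd_low")
    if not rd_num >= rd_init:
        problems.append("rd_num must be >= rd_init")
    if not rd_num > srq_limit:
        problems.append("rd_num must be > srq_limit")
    return problems
```

The values from the example S,1024,256,128,32,32,8 (rd_num=256, rd_low=128, rd_init=32, srq_limit=8) satisfy all three constraints.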
#
[biggest qp size is too small]
WARNING: The largest queue pair buffer size specified in the
btl_openib_receive_queues MCA parameter is smaller than the maximum
send size (i.e., the btl_openib_max_send_size MCA parameter), meaning
that no queue is large enough to receive the largest possible incoming
message fragment. The OpenFabrics (openib) BTL will therefore be
deactivated for this run.
Local host: %s
Largest buffer size: %d
Maximum send fragment size: %d
#
[biggest qp size is too big]
WARNING: The largest queue pair buffer size specified in the
btl_openib_receive_queues MCA parameter is larger than the maximum
send size (i.e., the btl_openib_max_send_size MCA parameter). This
means that memory will be wasted because the largest possible incoming
message fragment will not fill a buffer allocated for incoming
fragments.
Local host: %s
Largest buffer size: %d
Maximum send fragment size: %d
#
[freelist too small]
WARNING: The maximum freelist size that was specified was too small
for the requested receive queue sizes. The maximum freelist size must
be at least equal to the sum of the largest number of buffers posted
to a single queue plus the corresponding number of reserved/credit
buffers for that queue. It is suggested that the maximum be quite a
bit larger than this for performance reasons.
Local host: %s
Specified freelist size: %d
Minimum required freelist size: %d
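One reading of the rule above is: take the queue with the most posted buffers and add that queue's reserved/credit buffers. The Python sketch below implements that reading only; the function name and the (num_buffers, reserved) pair layout are assumptions, and the real calculation in the openib BTL may have differed.

```python
def minimum_freelist_size(queues) -> int:
    """queues: iterable of (num_buffers, reserved_buffers) pairs.

    Picks the queue with the largest buffer count and adds its
    corresponding reserved/credit buffers.
    """
    num_buffers, reserved = max(queues, key=lambda q: q[0])
    return num_buffers + reserved
```

As the message suggests, the configured maximum should in practice be noticeably larger than this lower bound.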
#
[XRC with PP or SRQ]
WARNING: An invalid queue pair type was specified in the
btl_openib_receive_queues MCA parameter. The OpenFabrics (openib) BTL
will be deactivated for this run.
Note that XRC ("X") queue pairs cannot be used with per-peer ("P") and
SRQ ("S") queue pairs. This restriction may be removed in future
versions of Open MPI.
Local host: %s
btl_openib_receive_queues: %s
#
[XRC with BTLs per LID]
WARNING: An invalid queue pair type was specified in the
btl_openib_receive_queues MCA parameter. The OpenFabrics (openib) BTL
will be deactivated for this run.
XRC ("X") queue pairs cannot be used when (btls_per_lid > 1). This
restriction may be removed in future versions of Open MPI.
Local host: %s
btl_openib_receive_queues: %s
btls_per_lid: %d
#
[XRC on device without XRC support]
WARNING: You configured the OpenFabrics (openib) BTL to run with %d
XRC queues. The device %s does not have XRC capabilities; the
OpenFabrics BTL will ignore this device. If no devices are found with
XRC capabilities, the OpenFabrics BTL will be disabled.
Local host: %s
#
[No XRC support]
WARNING: The Open MPI build was compiled without XRC support, but XRC
("X") queues were specified in the btl_openib_receive_queues MCA
parameter. The OpenFabrics (openib) BTL will therefore be deactivated
for this run.
Local host: %s
btl_openib_receive_queues: %s
#
[non optimal rd_win]
WARNING: The rd_win specification is non-optimal. For maximum performance it is
advisable to configure rd_win larger than (rd_num - rd_low), but currently
rd_win = %d and (rd_num - rd_low) = %d.
#
[apm without lmc]
WARNING: You can't enable APM support with LMC bit configured to 0.
APM support will be disabled.
#
[apm with wrong lmc]
Cannot provide %d alternative paths with LMC bit configured to %d.
#
[apm not enough ports]
WARNING: For APM over ports, Open MPI requires at least 2 active ports, but
only a single active port was found. Disabling APM over ports.
#
[locally conflicting receive_queues]
Open MPI detected two devices on a single server that have different
"receive_queues" parameter values (in the openib BTL). Open MPI
currently only supports one OpenFabrics receive_queues value in an MPI
job, even if you have different types of OpenFabrics adapters on the
same host.
Device 2 (in the details shown below) will be ignored for the duration
of this MPI job.
You can fix this issue by one or more of the following:
1. Set the MCA parameter btl_openib_receive_queues to a value that
is usable by all the OpenFabrics devices that you will use.
2. Use the btl_openib_if_include or btl_openib_if_exclude MCA
parameters to select exactly which OpenFabrics devices to use in
your MPI job.
Finally, note that the "receive_queues" values may have been set by
the Open MPI device default settings file. You may want to look in
this file and see if your devices are getting receive_queues values
from this file:
%s/mca-btl-openib-device-params.ini
Here is more detailed information about the receive_queues value
conflict:
Local host: %s
Device 1: %s (vendor 0x%x, part ID %d)
Receive queues: %s
Device 2: %s (vendor 0x%x, part ID %d)
Receive queues: %s
#
[eager RDMA and progress threads]
WARNING: The openib BTL was directed to use "eager RDMA" for short
messages, but the openib BTL was compiled with progress threads
support. Short eager RDMA is not yet supported with progress threads;
its use has been disabled in this job.
This is a warning only; your job will attempt to continue.
#
[ptmalloc2 with no threads]
WARNING: It appears that ptmalloc2 was compiled into this process via
-lopenmpi-malloc, but there is no thread support. This combination is
known to cause memory corruption in the openib BTL. Open MPI is
therefore disabling the use of the openib BTL in this process for this
run.
Local host: %s
#
[cannot raise btl error]
The OpenFabrics driver in Open MPI tried to raise a fatal error, but
failed. Hopefully there was an error message before this one that
gave some more detailed information.
Local host: %s
Source file: %s
Source line: %d
Your job is now going to abort, sorry.
#
[no iwarp support]
Open MPI does not support iWARP devices with this version of OFED.
You need to upgrade to a later version of OFED (1.3 or later) for Open
MPI to support iWARP devices.
(This message is being displayed because you told Open MPI to use
iWARP devices via the btl_openib_device_type MCA parameter)
#
[invalid ipaddr_inexclude]
WARNING: An invalid value was given for btl_openib_ipaddr_%s. This
value will be ignored.
Local host: %s
Value: %s
Message: %s
#
[unsupported queues configuration]
The Open MPI receive queue configurations for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other. This
generally happens when you are using OpenFabrics devices from
different vendors on the same network. You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.
Local host: %s
Local adapter: %s (vendor 0x%x, part ID %d)
Local queues: %s
Remote host: %s
Remote adapter: (vendor 0x%x, part ID %d)
Remote queues: %s
#
[conflicting transport types]
Open MPI detected two different OpenFabrics transport types in the same InfiniBand network.
Such a mixed network transport configuration is not supported by Open MPI.
Local host: %s
Local adapter: %s (vendor 0x%x, part ID %d)
Local transport type: %s
Remote host: %s
Remote Adapter: (vendor 0x%x, part ID %d)
Remote transport type: %s
#
[gid index too large]
Open MPI tried to use a GID index that was too large for an
OpenFabrics device (i.e., the GID index does not exist on this
device).
Local host: %s
Local adapter: %s
Local port: %d
Requested GID index: %d (specified by the btl_openib_gid_index MCA param)
Max allowable GID index: %d
Use "ibv_devinfo -v" on the local host to see the GID table of this
device.
[reg mem limit low]
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel module
parameters:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: %s
Registerable memory: %lu MiB
Total memory: %lu MiB
%s
[CUDA_no_gdr_support]
You requested to run with CUDA GPU Direct RDMA support but the Open MPI
library was not built with that support. The Open MPI library must be
configured with CUDA 6.0 or later.
Local host: %s
[driver_no_gdr_support]
You requested to run with CUDA GPU Direct RDMA support but this OFED
installation does not have that support. Contact Mellanox to figure
out how to get an OFED stack with that support.
Local host: %s
[no_fork_with_gdr]
You cannot have fork support and CUDA GPU Direct RDMA support enabled at the
same time. Please disable one of them. Deactivating the openib BTL.
Local host: %s
#
[CUDA_gdr_and_nopinned]
You requested to run with CUDA GPU Direct RDMA support but also with
"leave pinned" turned off. This will result in very poor performance
with CUDA GPU Direct RDMA. Either disable GPU Direct RDMA support or
enable "leave pinned" support. Deactivating the openib BTL.
Local host: %s
#
[do_not_set_openib_value]
Open MPI has detected that you have attempted to set the btl_openib_cuda_max_send_size
value. This is not supported. Setting back to default value of 0.
Local host: %s
[ib port not selected]
In Open MPI 4.0 and later, InfiniBand ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: %s
Local adapter: %s
Local port: %d
#

View file

@ -1,351 +0,0 @@
#
# Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2006-2011 Mellanox Technologies. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# This is the default NIC/HCA parameters file for Open MPI's OpenIB
# BTL. If NIC/HCA vendors wish to add their respective values into
# this file (that is distributed with Open MPI), please contact the
# Open MPI development team. See http://www.open-mpi.org/ for
# details.
# This file is in the "ini" style, meaning that it has sections
# identified by section names enclosed in square brackets (e.g.,
# "[Section name]") followed by "key = value" pairs indicating values
# for a specific NIC/HCA vendor and model. NICs/HCAs are identified
# by their vendor ID and vendor part ID, which can be obtained by
# running the diagnostic utility command "ibv_devinfo". The fields
# "vendor_id" and "vendor_part_id" are the vendor ID and vendor part
# ID, respectively.
# The sections in this file only accept a few fields:
# vendor_id: a comma-delimited list of integers of NIC/HCA vendor IDs,
# expressed either in decimal or hexadecimal (e.g., "13" or "0xd").
# Individual values can be taken directly from the output of
# "ibv_devinfo". NIC/HCA vendor ID's correspond to IEEE OUI's, for
# which you can find the canonical list here:
# http://standards.ieee.org/regauth/oui/. Example:
#
# vendor_id = 0x05ad
#
# Note: Several vendors resell Mellanox hardware and put their own firmware
# on the cards, therefore overriding the default Mellanox vendor ID.
#
# Mellanox 0x02c9
# Cisco 0x05ad
# Silverstorm 0x066a
# Voltaire 0x08f1
# HP 0x1708
# Sun 0x03ba
# Bull 0x119f
# vendor_part_id: a comma-delimited list of integers of different
# NIC/HCA models from a single vendor, expressed in either decimal or
# hexadecimal (e.g., "13" or "0xd"). Individual values can be
# obtained from the output of "ibv_devinfo". Example:
#
# vendor_part_id = 25208,25218
# mtu: an integer indicating the maximum transfer unit (MTU) to be
# used with this NIC/HCA. The effective MTU will be the minimum of an
# NIC's/HCA's MTU value and its peer NIC's/HCA's MTU value. Valid
# values are 256, 512, 1024, 2048, and 4096. Example:
#
# mtu = 1024
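The MTU rule described above (the effective MTU is the minimum of the local and peer values, each drawn from a fixed set) can be sketched in Python. The function name is hypothetical; the list of valid values is taken from the comment.

```python
VALID_MTUS = (256, 512, 1024, 2048, 4096)


def effective_mtu(local_mtu: int, peer_mtu: int) -> int:
    """Return the smaller of the two MTUs after validating both."""
    for value in (local_mtu, peer_mtu):
        if value not in VALID_MTUS:
            raise ValueError(f"invalid MTU {value}; expected one of {VALID_MTUS}")
    return min(local_mtu, peer_mtu)
```

For instance, a device configured with mtu = 4096 talking to a peer configured with mtu = 1024 would communicate with an effective MTU of 1024.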
# use_eager_rdma: an integer indicating whether RDMA should be used
# for eager messages. 0 values indicate "no" (false); non-zero values
# indicate "yes" (true). This flag should only be enabled for
# NICs/HCAs that can provide guarantees about ordering of data in
# memory -- that the last byte of an incoming RDMA write will always
# be written last. Certain cards cannot provide this guarantee, while
# others can.
# use_eager_rdma = 1
# receive_queues: a list of "bucket shared receive queues" (BSRQ) that
# are opened between MPI process peer pairs for point-to-point
# communications of messages shorter than the total length required
# for RDMA transfer. The use of multiple RQs, each with different
# sized posted receive buffers can allow [much] better registered
# memory utilization -- MPI messages are sent on the QP with the
# smallest buffer size that will fit the message. Note that flow
# control messages are always sent across the QP with the smallest
# buffer size. Also note that the buffers *must* be listed in
# increasing buffer size. This parameter matches the
# mca_btl_openib_receive_queues MCA parameter; see the ompi_info help
# message and FAQ for a description of its values. BSRQ
# specifications are found in this precedence:
# highest: specifying the mca_btl_openib_receive_queues MCA param
# next: finding a value in this file
# lowest: using the default mca_btl_openib_receive_queues MCA param value
# receive_queues = P,128,256,192,128:S,65536,256,192,128
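The precedence order listed above (an explicitly set MCA parameter first, then a value from this file, then the built-in default) amounts to a simple fallback chain. The Python sketch below illustrates it; the function name and the use of None for "not specified" are assumptions.

```python
from typing import Optional


def resolve_receive_queues(mca_param: Optional[str],
                           ini_value: Optional[str],
                           default: str) -> str:
    """Pick the BSRQ specification by the documented precedence."""
    if mca_param is not None:      # highest: MCA parameter set by the user
        return mca_param
    if ini_value is not None:      # next: value found in the INI file
        return ini_value
    return default                 # lowest: default MCA parameter value
```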
# max_inline_data: an integer specifying the maximum inline data (in
# bytes) supported by the device. -1 means to use a run-time probe to
# figure out the maximum value supported by the device.
# max_inline_data = 1024
# rdmacm_reject_causes_connect_error: a boolean indicating whether
# when an RDMA CM REJECT is issued on the device, instead of getting
# the expected REJECT event back, you might get a CONNECT_ERROR event.
# Open MPI uses RDMA CM REJECT messages in its normal wireup
# procedure; some connections are *expected* to be rejected. However,
# with some older drivers, if process A issues a REJECT, process B
# will receive a CONNECT_ERROR event instead of a REJECT event. So if
# this flag is set to true and we receive a CONNECT_ERROR event on a
# connection where we are expecting a REJECT, then just treat the
# CONNECT_ERROR exactly as we would have treated the REJECT. Setting
# this flag to true allows Open MPI to work around the behavior
# described above. It is [mostly] safe to set this flag to true even
# after a driver has been fixed; the scope of where this flag is used
# is small enough that it *shouldn't* mask real CONNECT_ERROR events.
# rdmacm_reject_causes_connect_error = 1
############################################################################
[default]
# These are the default values, identified by the vendor and part ID
# numbers of 0 and 0. If the queried NIC/HCA does not return vendor and
# part ID numbers that match any of the sections in this file, the
# values in this section are used. Vendor IDs and part IDs can be hex
# or decimal.
vendor_id = 0
vendor_part_id = 0
use_eager_rdma = 0
mtu = 1024
max_inline_data = 128
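To illustrate the lookup described in the comments (match a device by vendor ID and vendor part ID, otherwise fall back to the [default] section), here is a minimal Python sketch using the standard configparser module. It is not the parser Open MPI used; the function names are hypothetical, and the sample text reuses the [default] and [Mellanox ConnectX5] entries from this file.

```python
import configparser

SAMPLE_INI = """
[default]
vendor_id = 0
vendor_part_id = 0
use_eager_rdma = 0
mtu = 1024
max_inline_data = 128

[Mellanox ConnectX5]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4119,4121
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
"""


def _int_list(text: str) -> list:
    # Base 0 lets int() accept both decimal ("13") and hex ("0xd") values.
    return [int(item, 0) for item in text.split(",")]


def find_device_section(ini_text: str, vendor_id: int, part_id: int):
    """Return (section name, values) for a device, or the default section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    for name in parser.sections():
        if name == "default":
            continue
        section = parser[name]
        if (vendor_id in _int_list(section["vendor_id"])
                and part_id in _int_list(section["vendor_part_id"])):
            return name, dict(section)
    return "default", dict(parser["default"])
```

Calling find_device_section(SAMPLE_INI, 0x15b3, 4119) selects the ConnectX5 entry, while an unknown vendor and part ID pair falls back to the default values.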
############################################################################
[Mellanox Tavor Infinihost]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
vendor_part_id = 23108
use_eager_rdma = 1
mtu = 1024
max_inline_data = 128
############################################################################
[Mellanox Arbel InfiniHost III MemFree/Tavor]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
vendor_part_id = 25208,25218
use_eager_rdma = 1
mtu = 1024
max_inline_data = 128
############################################################################
[Mellanox Sinai Infinihost III]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
vendor_part_id = 25204,24204
use_eager_rdma = 1
mtu = 2048
max_inline_data = 128
############################################################################
# A.k.a. ConnectX
[Mellanox Hermon]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 25408,25418,25428,25448,26418,26428,26438,26448,26468,26478,26488,4099,4103,4100
use_eager_rdma = 1
mtu = 2048
max_inline_data = 128
############################################################################
[Mellanox ConnectIB]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4113
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
############################################################################
[Mellanox ConnectX4]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4115,4117
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
############################################################################
[Mellanox ConnectX5]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4119,4121
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
############################################################################
[IBM eHCA 4x and 12x]
vendor_id = 0x5076
vendor_part_id = 0
use_eager_rdma = 1
mtu = 2048
receive_queues = P,128,256,192,128:P,65536,256,192,128
max_inline_data = 0
############################################################################
[IBM eHCA-2 4x and 12x]
vendor_id = 0x5076
vendor_part_id = 1
use_eager_rdma = 1
mtu = 4096
receive_queues = P,128,256,192,128:P,65536,256,192,128
max_inline_data = 0
############################################################################
# See http://lists.openfabrics.org/pipermail/general/2008-June/051920.html
# 0x1fc1 and 0x1077 are PCI ID's; at least one of QL's OUIs is 0x1175
[QLogic InfiniPath 1]
vendor_id = 0x1fc1,0x1077,0x1175
vendor_part_id = 13
use_eager_rdma = 1
mtu = 2048
max_inline_data = 0
[QLogic InfiniPath 2]
vendor_id = 0x1fc1,0x1077,0x1175
vendor_part_id = 16,29216
use_eager_rdma = 1
mtu = 4096
max_inline_data = 0
[QLogic InfiniPath 3]
vendor_id = 0x1fc1,0x1077,0x1175
vendor_part_id = 16,29474
use_eager_rdma = 1
mtu = 4096
max_inline_data = 0
[QLogic FastLinQ QL41000]
vendor_id = 0x1077
vendor_part_id = 32880
receive_queues = P,65536,64
############################################################################
# Chelsio's OUI is 0x0743. 0x1425 is the PCI ID.
[Chelsio T3]
vendor_id = 0x1425
vendor_part_id = 0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0030,0x0031,0x0032,0x0035,0x0036
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,256,192,128
max_inline_data = 64
[Chelsio T4]
vendor_id = 0x1425
vendor_part_id = 0xa000,0x4400,0x4401,0x4402,0x4403,0x4404,0x4405,0x4406,0x4407,0x4408,0x4409,0x440a,0x440b,0x440c,0x440d,0x440e,0x4480,0x4481
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
max_inline_data = 280
[Chelsio T5]
vendor_id = 0x1425
vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
max_inline_data = 280
[Chelsio T6]
vendor_id = 0x1425
vendor_part_id = 0x6400,0x6401,0x6402,0x6403,0x6404,0x6405,0x6406,0x6407,0x6408,0x6409,0x640d,0x6410,0x6411,0x6414,0x6415
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
max_inline_data = 280
############################################################################
# I'm *assuming* that 0x4040 is the PCI ID...
[NetXen]
vendor_id = 0x4040
vendor_part_id = 0x0001,0x0002,0x0003,0x0004,0x0005,0x0024,0x0025,0x0100
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,248,192,128
max_inline_data = 64
############################################################################
# NetEffect's OUI is 0x1255. 0x1678 is the PCI ID. ...but then
# NetEffect was bought by Intel. Intel's OUI is 0x1b21.
[NetEffect/Intel NE020]
vendor_id = 0x1678,0x1255,0x1b21
vendor_part_id = 0x0100,0x0110
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,256,192,128
max_inline_data = 64
[Intel HFI1]
vendor_id = 0x1175
vendor_part_id = 9456,9457
use_eager_rdma = 1
mtu = 4096
max_inline_data = 0
############################################################################
# Intel has several OUI's, including 0x8086. Amusing. :-) Intel has
# advised us (June, 2013) to ignore the Intel Phi OpenFabrics
# device... at least for now.
[Intel Xeon Phi]
vendor_id = 0x8086
vendor_part_id = 0
ignore_device = 1
############################################################################
# IBM Soft iWARP device.
[IBM Soft iWARP]
vendor_id = 0x626d74
vendor_part_id = 0
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
max_inline_data = 72
############################################################################
# Broadcom NetXtreme-E RDMA Ethernet Controller
[Broadcom BCM57XXX]
vendor_id = 0x14e4
vendor_part_id = 0x1605,0x1606,0x1614,0x16c0,0x16c1,0x16ce,0x16cf,0x16d6,0x16d7,0x16d8,0x16d9,0x16df,0x16e2,0x16e3,0x16e5,0x16eb,0x16ed,0x16ef,0x16f0,0x16f1
use_eager_rdma = 1
mtu = 1024
receive_queues = P,65536,256,192,128
max_inline_data = 96
[Broadcom BCM58XXX]
vendor_id = 0x14e4
vendor_part_id = 0xd800,0xd802,0xd804
use_eager_rdma = 1
mtu = 1024
receive_queues = P,65536,256,192,128
max_inline_data = 96
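
Each section above keys a device on two comma-separated ID lists (vendor_id and vendor_part_id) that mix hex and decimal spellings. The sketch below shows one way such a list could be tested for membership. It is not the parser the openib BTL used; the function name id_list_contains is hypothetical, and relying on strtoul() with base 0 is an assumption made for brevity.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of Open MPI): return 1 if `id` appears in a
 * comma-separated list such as "0x2c9,0x5ad" or "25204,24204", else 0.
 * strtoul() with base 0 accepts both the "0x" hex and the plain decimal
 * spellings; note that it would read a decimal token with a leading zero
 * as octal. */
int id_list_contains(const char *list, uint32_t id)
{
    const char *p = list;

    while (*p != '\0') {
        char *end;
        unsigned long value = strtoul(p, &end, 0);

        if (end == p) {
            return 0;               /* malformed token: stop scanning */
        }
        if ((uint32_t) value == id) {
            return 1;
        }
        /* Advance past the separator to the start of the next token */
        p = end;
        while (*p != '\0' && *p != ',') {
            ++p;
        }
        if (*p == ',') {
            ++p;
        }
    }
    return 0;
}
```

With this helper, id_list_contains("4115,4117", 4117) returns 1 and id_list_contains("4113", 4115) returns 0.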

View file

@ -1,7 +0,0 @@
#
# owner/status file
# owner: institution that is responsible for this package
# status: e.g. active, maintenance, unmaintained
#
owner:Chelsio
status:maintenance

View file

@ -1,83 +0,0 @@
#
# Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
# Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
# Copyright (c) 2012-2015 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
AM_CPPFLAGS = $(common_verbs_CPPFLAGS)
headers = \
common_verbs.h
sources = \
common_verbs_basics.c \
common_verbs_devlist.c \
common_verbs_find_max_inline.c \
common_verbs_find_ports.c \
common_verbs_mca.c \
common_verbs_port.c \
common_verbs_qp_type.c
dist_opaldata_DATA = \
help-opal-common-verbs.txt
# To simplify components that link to this library, we will *always*
# have an output libtool library named libmca_<type>_<name>.la -- even
# for case 2) described above (i.e., so there's no conditional logic
# necessary in component Makefile.am's that link to this library).
# Hence, if we're creating a noinst version of this library (i.e.,
# case 2), we sym link it to the libmca_<type>_<name>.la name
# (libtool will do the Right Things under the covers). See the
# all-local and clean-local rules, below, for how this is effected.
lib_LTLIBRARIES =
noinst_LTLIBRARIES =
comp_inst = lib@OPAL_LIB_PREFIX@mca_common_verbs.la
comp_noinst = lib@OPAL_LIB_PREFIX@mca_common_verbs_noinst.la
if MCA_BUILD_opal_common_verbs_DSO
lib_LTLIBRARIES += $(comp_inst)
else
noinst_LTLIBRARIES += $(comp_noinst)
endif
lib@OPAL_LIB_PREFIX@mca_common_verbs_la_SOURCES = $(headers) $(sources)
lib@OPAL_LIB_PREFIX@mca_common_verbs_la_CPPFLAGS = $(common_verbs_CPPFLAGS)
lib@OPAL_LIB_PREFIX@mca_common_verbs_la_LDFLAGS = \
-version-info $(libmca_opal_common_verbs_so_version) \
$(common_verbs_LDFLAGS)
lib@OPAL_LIB_PREFIX@mca_common_verbs_la_LIBADD = $(common_verbs_LIBS)
lib@OPAL_LIB_PREFIX@mca_common_verbs_noinst_la_SOURCES = $(headers) $(sources)
# Conditionally install the header files
if WANT_INSTALL_HEADERS
opaldir = $(opalincludedir)/opal/mca/common/verbs
opal_HEADERS = $(headers)
else
opaldir = $(includedir)
endif
# These two rules will sym link the "noinst" libtool library filename
# to the installable libtool library filename in the case where we are
# compiling this component statically (case 2, described above).
V=0
OMPI_V_LN_SCOMP = $(ompi__v_LN_SCOMP_$V)
ompi__v_LN_SCOMP_ = $(ompi__v_LN_SCOMP_$AM_DEFAULT_VERBOSITY)
ompi__v_LN_SCOMP_0 = @echo " LN_S " `basename $(comp_inst)`;
all-local:
$(OMPI_V_LN_SCOMP) if test -z "$(lib_LTLIBRARIES)"; then \
rm -f "$(comp_inst)"; \
$(LN_S) "$(comp_noinst)" "$(comp_inst)"; \
fi
clean-local:
if test -z "$(lib_LTLIBRARIES)"; then \
rm -f "$(comp_inst)"; \
fi

View file

@ -1,186 +0,0 @@
/*
* Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
* All rights reserved.
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
* Copyright (c) 2012-2015 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2014 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef _COMMON_OFAUTILS_H_
#define _COMMON_OFAUTILS_H_
#include "opal_config.h"
#include <stdint.h>
#include <infiniband/verbs.h>
#include "opal/mca/mca.h"
#include <infiniband/verbs.h>
#include "opal/class/opal_list.h"
BEGIN_C_DECLS
/*
* common_verbs_devlist.c
*/
OPAL_DECLSPEC struct ibv_device **opal_ibv_get_device_list(int *num_devs);
OPAL_DECLSPEC void opal_ibv_free_device_list(struct ibv_device **ib_devs);
/*
* common_verbs_mca.c
*/
extern bool opal_common_verbs_warn_nonexistent_if;
extern int opal_common_verbs_want_fork_support;
OPAL_DECLSPEC void opal_common_verbs_mca_register(mca_base_component_t *component);
/*
* common_verbs_basics.c
*/
bool opal_common_verbs_check_basics(void);
/*
* common_verbs_find_ports.c
*/
typedef struct opal_common_verbs_device_item_t {
opal_object_t super;
struct ibv_device *device;
char *device_name;
struct ibv_context *context;
struct ibv_device_attr device_attr;
/** This field defaults to true, meaning that the destructor for
opal_common_verbs_device_item_t will invoke ibv_close_device()
on the context. An upper layer can reset this field to false,
however, indicating that the destructor should *not* invoke
ibv_close_device() (e.g., if the upper layer has copied the
context and is using it). */
bool destructor_free_context;
} opal_common_verbs_device_item_t;
OBJ_CLASS_DECLARATION(opal_common_verbs_device_item_t);
typedef struct opal_common_verbs_port_item_t {
opal_list_item_t super;
opal_common_verbs_device_item_t *device;
uint8_t port_num;
struct ibv_port_attr port_attr;
} opal_common_verbs_port_item_t;
OBJ_CLASS_DECLARATION(opal_common_verbs_port_item_t);
enum {
OPAL_COMMON_VERBS_FLAGS_RC = 0x1,
OPAL_COMMON_VERBS_FLAGS_NOT_RC = 0x2,
OPAL_COMMON_VERBS_FLAGS_UD = 0x4,
OPAL_COMMON_VERBS_FLAGS_TRANSPORT_IB = 0x8,
OPAL_COMMON_VERBS_FLAGS_TRANSPORT_IWARP = 0x10,
/* Note that these 2 link layer flags will only be useful if
HAVE_DECL_IBV_LINK_LAYER_ETHERNET is nonzero. Otherwise, they will be
ignored. */
OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_IB = 0x80,
OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_ETHERNET = 0x100,
OPAL_COMMON_VERBS_FLAGS_MAX
};
/**
* Find a list of ibv_device ports that match specific criteria.
*
* @param if_include (IN): comma-delimited list of interfaces to use
* @param if_exclude (IN): comma-delimited list of interfaces to NOT use
* @param flags (IN): bit flags
* @param verbose_stream (IN): stream to send opal_output_verbose messages to
*
* The ports will adhere to the if_include / if_exclude lists (only
* one can be specified). The lists are comma-delimited tokens in one
* of two forms:
*
* interface_name
* interface_name:port
*
* Hence, an if_include list could be the following: "mlx4_0,mthca0:1".
*
* The flags provide logical OR behavior -- a port will be included if
* it includes any of the capabilities/characteristics listed in the
* flags.
*
* Note that if the verbose_stream is >=0, output will be sent to that
* stream with a verbose level of 5.
*
* A valid list will always be returned. It will contain zero or more
* opal_common_verbs_port_item_t items. Each item can be individually
* OBJ_RELEASE'd; the destructor will take care of cleaning up the
* linked opal_common_verbs_device_item_t properly (i.e., when all
* port_items referring to it have been freed).
*/
OPAL_DECLSPEC opal_list_t *
opal_common_verbs_find_ports(const char *if_include,
const char *if_exclude,
int flags,
int verbose_stream);
/*
* Trivial function to compute the bandwidth on an ibv_port.
*
* Will return OPAL_ERR_NOT_FOUND if it can't figure out the bandwidth
* (and the bandwidth parameter value will be undefined). Otherwise,
* will return OPAL_SUCCESS and set bandwidth to an appropriate value.
*/
OPAL_DECLSPEC int
opal_common_verbs_port_bw(struct ibv_port_attr *port_attr,
uint32_t *bandwidth);
/*
* Trivial function to switch on the verbs MTU enum and return a
* numeric value.
*/
OPAL_DECLSPEC int
opal_common_verbs_mtu(struct ibv_port_attr *port_attr);
/*
* Find the max_inline_data value for a given device
*/
OPAL_DECLSPEC int
opal_common_verbs_find_max_inline(struct ibv_device *device,
struct ibv_context *context,
struct ibv_pd *pd,
uint32_t *max_inline_arg);
/*
* Test a device to see if it can handle a specific QP type (RC and/or
* UD). Will return the logical AND if multiple types are specified
* (e.g., if (RC|UD) are in flags, then will return OPAL_SUCCESS only
* if *both* types can be created on the device).
*
* Flags can be the logical OR of OPAL_COMMON_VERBS_FLAGS_RC and/or
* OPAL_COMMON_VERBS_FLAGS_UD. All other values are ignored.
*/
OPAL_DECLSPEC int opal_common_verbs_qp_test(struct ibv_context *device_context,
int flags);
/*
* ibv_fork_init testing - if fork support is requested, then ibv_fork_init
* should be called right at the beginning of the verbs initialization flow,
* before any ibv_create_* call.
*
* Known limitations:
* If ibv_fork_init is called after ibv_create_* functions, it will have no effect.
* OMPI initializes verbs many times during initialization in the following verbs components:
* oob/ud, btl/openib, mtl/mxm, pml/yalla, oshmem/ikrit, ompi/mca/coll/{fca,hcoll}
*
* So, ibv_fork_init should be called once, at the beginning of the init flow of every
* verbs component, to properly request fork support.
*
*/
int opal_common_verbs_fork_test(void);
END_C_DECLS
#endif
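
The comment on opal_common_verbs_find_ports() describes the if_include / if_exclude tokens ("interface_name" or "interface_name:port") and notes that only one of the two lists may be given. A minimal standalone sketch of that selection rule follows. The name port_selected and the fixed 256-byte buffer are assumptions for illustration only, and the sketch does not touch libibverbs.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical illustration of the include/exclude rule.
 * `include` and `exclude` are NULL-terminated token arrays; at most one of
 * them is expected to be non-NULL.  A token matches either the bare device
 * name or "name:port". */
bool port_selected(const char *const *include, const char *const *exclude,
                   const char *dev_name, int port)
{
    const char *const *list = (include != NULL) ? include : exclude;
    char with_port[256];
    bool listed = false;

    if (list == NULL) {
        return true;                /* no filter at all: take every port */
    }

    snprintf(with_port, sizeof(with_port), "%s:%d", dev_name, port);
    for (int i = 0; list[i] != NULL; ++i) {
        if (strcmp(list[i], dev_name) == 0 ||
            strcmp(list[i], with_port) == 0) {
            listed = true;
            break;
        }
    }

    /* Being listed means "keep" for an include list, "drop" for an exclude list */
    return (include != NULL) ? listed : !listed;
}
```

Given the include list {"mlx4_0", "mthca0:1", NULL}, every port of mlx4_0 is selected, while on mthca0 only port 1 is.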

View file

@ -1,109 +0,0 @@
/*
* Copyright (c) 2012-2016 Cisco Systems, Inc. All rights reserved.
*
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <stdio.h>
#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
#ifdef HAVE_SYS_STAT_H
#include <sys/stat.h>
#endif
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#if OPAL_COMMON_VERBS_USNIC_HAPPY
#include "opal/mca/common/verbs_usnic/common_verbs_usnic.h"
#endif
/* This is crummy, but <infiniband/driver.h> doesn't work on all
platforms with all compilers. Specifically, trying to include it
on RHEL4U3 with the PGI 32 bit compiler will cause problems because
certain 64 bit types are not defined. Per advice from Roland D.,
just include the one prototype that we need in this case
(ibv_get_sysfs_path()). */
#include <infiniband/verbs.h>
#ifdef HAVE_INFINIBAND_DRIVER_H
#include <infiniband/driver.h>
#else
const char *ibv_get_sysfs_path(void);
#endif
#include "common_verbs.h"
#include "opal/runtime/opal_params.h"
#include "opal/util/show_help.h"
#include "opal/util/proc.h"
#include "opal/util/printf.h"
/***********************************************************************/
bool opal_common_verbs_check_basics(void)
{
#if defined(__linux__)
int rc;
char *file;
struct stat s;
/* Check to see if $sysfsdir/class/infiniband/ exists */
opal_asprintf(&file, "%s/class/infiniband", ibv_get_sysfs_path());
if (NULL == file) {
return false;
}
rc = stat(file, &s);
free(file);
if (0 != rc || !S_ISDIR(s.st_mode)) {
return false;
}
#endif
/* It exists and is a directory -- good enough */
return true;
}
int opal_common_verbs_fork_test(void)
{
int ret = OPAL_SUCCESS;
/* Make sure that ibv_fork_init() is the first ibv_* function to
be invoked in this process. */
#ifdef HAVE_IBV_FORK_INIT
if (0 != opal_common_verbs_want_fork_support) {
/* Check if fork support is requested by the user */
if (0 != ibv_fork_init()) {
/* If the opal_common_verbs_want_fork_support MCA
* parameter is >0 but the call to ibv_fork_init() failed,
* then return an error code.
*/
if (opal_common_verbs_want_fork_support > 0) {
opal_show_help("help-opal-common-verbs.txt",
"ibv_fork_init fail", true,
opal_proc_local_get()->proc_hostname, errno,
strerror(errno));
ret = OPAL_ERROR;
}
}
}
#endif
#if OPAL_COMMON_VERBS_USNIC_HAPPY
/* Now register any necessary fake libibverbs drivers. We
piggyback loading these fake drivers on the fork test because
they must be loaded before ibv_get_device_list() is invoked.
Note that this routine is in a different common component (see
comments over there for an explanation why). */
opal_common_verbs_usnic_register_fake_drivers();
#endif
return ret;
}
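
The control flow in opal_common_verbs_fork_test() makes the want_fork_support value effectively tri-state: zero skips the call, a positive value turns a failed ibv_fork_init() into an error, and a negative value attempts the call but tolerates failure. The sketch below restates that decision with the libibverbs call replaced by a function pointer. All names in it (fork_support_check, SKETCH_SUCCESS, SKETCH_ERROR, init_ok, init_fail) are hypothetical.

```c
/* Hypothetical return codes standing in for OPAL_SUCCESS / OPAL_ERROR */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* want == 0: do not attempt fork support at all
 * want  > 0: fork support is required; a failed init is an error
 * want  < 0: attempt it, but carry on if the init fails
 * `init_fn` stands in for ibv_fork_init(): 0 on success, nonzero on failure. */
int fork_support_check(int want, int (*init_fn)(void))
{
    if (want == 0) {
        return SKETCH_SUCCESS;
    }
    if (init_fn() != 0 && want > 0) {
        return SKETCH_ERROR;
    }
    return SKETCH_SUCCESS;
}

/* Stub initializers used only to exercise the three cases */
int init_ok(void)   { return 0; }
int init_fail(void) { return 1; }
```

Only the combination of a positive setting and a failing initializer yields SKETCH_ERROR; every other combination yields SKETCH_SUCCESS.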

View file

@ -1,95 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2008 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2012 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2012 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2007 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <infiniband/verbs.h>
/* This is crummy, but <infiniband/driver.h> doesn't work on all
platforms with all compilers. Specifically, trying to include it
on RHEL4U3 with the PGI 32 bit compiler will cause problems because
certain 64 bit types are not defined. Per advice from Roland D.,
just include the one prototype that we need in this case
(ibv_get_sysfs_path()). */
#ifdef HAVE_INFINIBAND_DRIVER_H
#include <infiniband/driver.h>
#else
const char *ibv_get_sysfs_path(void);
#endif
#include "opal/util/output.h"
#include "common_verbs.h"
/*
* Portable wrapper around ibv_get_device_list() / ibv_get_devices().
*/
struct ibv_device **opal_ibv_get_device_list(int *num_devs)
{
struct ibv_device **ib_devs;
#ifdef HAVE_IBV_GET_DEVICE_LIST
ib_devs = ibv_get_device_list(num_devs);
#else
struct dlist *dev_list;
struct ibv_device *ib_dev;
*num_devs = 0;
/* Determine the number of devices available on the host */
dev_list = ibv_get_devices();
if (NULL == dev_list) {
return NULL;
}
dlist_start(dev_list);
dlist_for_each_data(dev_list, ib_dev, struct ibv_device)
(*num_devs)++;
/* Allocate space for the ib devices */
ib_devs = (struct ibv_device**)malloc(*num_devs * sizeof(struct ibv_device*));
if (NULL == ib_devs) {
*num_devs = 0;
opal_output(0, "Failed malloc: %s:%d", __FILE__, __LINE__);
return NULL;
}
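dlist_start(dev_list);
/* Fill the array from index 0 without moving ib_devs itself, so that
   the caller gets back the pointer that malloc() returned */
int idx = 0;
dlist_for_each_data(dev_list, ib_dev, struct ibv_device)
ib_devs[idx++] = ib_dev;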
#endif
return ib_devs;
}
void opal_ibv_free_device_list(struct ibv_device **ib_devs)
{
#ifdef HAVE_IBV_GET_DEVICE_LIST
ibv_free_device_list(ib_devs);
#else
free(ib_devs);
#endif
}

View file

@ -1,108 +0,0 @@
/*
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2008 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2012 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2007 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>
#include <stdint.h>
#include "opal_stdint.h"
#include "opal/types.h"
#include "opal/util/output.h"
#include "opal/util/argv.h"
#include "opal/class/opal_object.h"
#include "opal/util/show_help.h"
#include "opal/util/proc.h"
#include "opal/constants.h"
#include "common_verbs.h"
/* Horrible. :-( Per the thread starting here:
http://lists.openfabrics.org/pipermail/general/2008-June/051822.html,
we can't rely on the value reported by the device to determine the
maximum max_inline_data value. So we have to search by looping
over max_inline_data values and trying to make dummy QPs. Yuck! */
int opal_common_verbs_find_max_inline(struct ibv_device *device,
struct ibv_context *context,
struct ibv_pd *pd,
uint32_t *max_inline_arg)
{
int ret;
struct ibv_qp *qp;
struct ibv_cq *cq;
struct ibv_qp_init_attr init_attr;
uint32_t max_inline_data;
*max_inline_arg = 0;
/* Make a dummy CQ */
#if OPAL_IBV_CREATE_CQ_ARGS == 3
cq = ibv_create_cq(context, 1, NULL);
#else
cq = ibv_create_cq(context, 1, NULL, NULL, 0);
#endif
if (NULL == cq) {
opal_show_help("help-mpi-btl-openib.txt", "init-fail-create-q",
true, opal_proc_local_get()->proc_hostname,
__FILE__, __LINE__, "ibv_create_cq",
strerror(errno), errno,
ibv_get_device_name(device));
return OPAL_ERR_NOT_AVAILABLE;
}
/* Setup the QP attributes */
memset(&init_attr, 0, sizeof(init_attr));
init_attr.qp_type = IBV_QPT_RC;
init_attr.send_cq = cq;
init_attr.recv_cq = cq;
init_attr.srq = 0;
init_attr.cap.max_send_sge = 1;
init_attr.cap.max_recv_sge = 1;
init_attr.cap.max_recv_wr = 1;
/* Loop over max_inline_data values; just check powers of 2 --
that's good enough */
init_attr.cap.max_inline_data = max_inline_data = 1 << 20;
ret = OPAL_ERR_NOT_FOUND;
while (max_inline_data > 0) {
qp = ibv_create_qp(pd, &init_attr);
if (NULL != qp) {
*max_inline_arg = max_inline_data;
ibv_destroy_qp(qp);
ret = OPAL_SUCCESS;
break;
}
max_inline_data >>= 1;
init_attr.cap.max_inline_data = max_inline_data;
}
/* Destroy the temp CQ */
ibv_destroy_cq(cq);
return ret;
}
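
opal_common_verbs_find_max_inline() probes candidate sizes from 1 << 20 downward, halving after each failed ibv_create_qp(), and stops at the first success. The same search can be sketched without libibverbs by abstracting the probe as a callback. The names find_largest_pow2, probe_fn and accept_up_to are hypothetical, and the sketch assumes the probe is monotonic (if a size is accepted, every smaller size is too).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical probe signature: returns true if `candidate` is accepted */
typedef bool (*probe_fn)(uint32_t candidate, void *ctx);

/* Try powers of two from 1 << 20 down to 1.  On success, store the first
 * (that is, largest) accepted value in *found and return true.  If even 1
 * is rejected, leave *found at 0 and return false. */
bool find_largest_pow2(probe_fn probe, void *ctx, uint32_t *found)
{
    *found = 0;
    for (uint32_t candidate = UINT32_C(1) << 20; candidate > 0; candidate >>= 1) {
        if (probe(candidate, ctx)) {
            *found = candidate;
            return true;
        }
    }
    return false;
}

/* Example probe: accepts any candidate up to the limit stored in ctx */
bool accept_up_to(uint32_t candidate, void *ctx)
{
    return candidate <= *(const uint32_t *) ctx;
}
```

With a limit of 280, the search settles on 256, the largest power of two that does not exceed the limit. As in the original loop, a value of 0 is never probed.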

View file

@ -1,505 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2014 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2014 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2012 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2015 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>
#include <stdint.h>
#include "opal_stdint.h"
#include "opal/types.h"
#include "opal/util/output.h"
#include "opal/util/argv.h"
#include "opal/class/opal_object.h"
#include "opal/util/show_help.h"
#include "opal/util/proc.h"
#include "opal/constants.h"
#include "common_verbs.h"
/***********************************************************************/
static void device_item_construct(opal_common_verbs_device_item_t *di)
{
di->device = NULL;
di->device_name = NULL;
di->context = NULL;
di->destructor_free_context = true;
memset(&di->device_attr, 0, sizeof(di->device_attr));
}
static void device_item_destruct(opal_common_verbs_device_item_t *di)
{
if (NULL != di->device_name) {
free(di->device_name);
}
/* Only free the context if a) the device is open, and b) the
upper layer didn't tell us not to */
if (NULL != di->context && di->destructor_free_context) {
ibv_close_device(di->context);
}
/* Zero out all the fields */
device_item_construct(di);
}
OBJ_CLASS_INSTANCE(opal_common_verbs_device_item_t,
opal_object_t,
device_item_construct,
device_item_destruct);
/***********************************************************************/
static void port_item_construct(opal_common_verbs_port_item_t *pi)
{
pi->device = NULL;
pi->port_num = 0;
memset(&pi->port_attr, 0, sizeof(pi->port_attr));
}
static void port_item_destruct(opal_common_verbs_port_item_t *pi)
{
OBJ_RELEASE(pi->device);
/* Zero out all the fields */
port_item_construct(pi);
}
OBJ_CLASS_INSTANCE(opal_common_verbs_port_item_t,
opal_list_item_t,
port_item_construct,
port_item_destruct);
/***********************************************************************/
/*
* Given a list of include or exclude items (never both), determine
* whether we want the current port or not.
*/
static bool want_this_port(char **include_list, char **exclude_list,
opal_common_verbs_device_item_t *di, int port)
{
int i;
char name[1024];
/* If we have no include or exclude list, then we unconditionally
want the port */
if (NULL == include_list && NULL == exclude_list) {
return true;
}
/* Search the include list */
if (NULL != include_list) {
for (i = 0; NULL != include_list[i]; ++i) {
/* First check if we can find the naked device name */
if (strcmp(di->device_name, include_list[i]) == 0) {
return true;
}
/* Now check for the specific port number */
snprintf(name, sizeof(name), "%s:%d", di->device_name, port);
if (strcmp(name, include_list[i]) == 0) {
return true;
}
}
/* Didn't find it. So we don't want it. */
return false;
}
/* Search the exclude list */
else {
for (i = 0; NULL != exclude_list[i]; ++i) {
/* First check if we can find the naked device name */
if (strcmp(di->device_name, exclude_list[i]) == 0) {
return false;
}
/* Now check for the specific port number */
snprintf(name, sizeof(name), "%s:%d", di->device_name, port);
if (strcmp(name, exclude_list[i]) == 0) {
return false;
}
}
/* Didn't find it. So we want it. */
return true;
}
/* Will never get here */
}
/***********************************************************************/
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
static const char *link_layer_to_str(int link_type)
{
switch(link_type) {
case IBV_LINK_LAYER_INFINIBAND: return "IB";
case IBV_LINK_LAYER_ETHERNET: return "IWARP";
case IBV_LINK_LAYER_UNSPECIFIED:
default: return "unspecified";
}
}
#endif
/***********************************************************************/
static void check_sanity(char ***if_sanity_list, const char *dev_name, int port)
{
int i;
char tmp[BUFSIZ], **list;
const char *compare;
if (NULL == if_sanity_list || NULL == *if_sanity_list) {
return;
}
list = *if_sanity_list;
/* A match is found if:
- "dev_name" is in the list and port == -1, or
- "dev_name:port" is in the list
If a match is found, remove that entry from the list. */
memset(tmp, 0, sizeof(tmp));
if (port > 0) {
snprintf(tmp, sizeof(tmp) - 1, "%s:%d", dev_name, port);
compare = tmp;
} else {
compare = dev_name;
}
for (i = 0; NULL != list[i]; ++i) {
if (0 == strcmp(list[i], compare)) {
int count = opal_argv_count(list);
opal_argv_delete(&count, &list, i, 1);
*if_sanity_list = list;
--i;
}
}
}
/***********************************************************************/
/*
* Find a list of ibv_ports matching a set of criteria.
*/
opal_list_t *opal_common_verbs_find_ports(const char *if_include,
const char *if_exclude,
int flags,
int stream)
{
int32_t num_devs;
struct ibv_device **devices;
struct ibv_device *device;
struct ibv_context *device_context;
struct ibv_device_attr device_attr;
struct ibv_port_attr port_attr;
char **if_include_list = NULL, **if_exclude_list = NULL, **if_sanity_list = NULL;
opal_common_verbs_device_item_t *di;
opal_common_verbs_port_item_t *pi;
int rc;
uint32_t j;
opal_list_t *port_list = NULL;
bool want;
/* Sanity check the include/exclude params */
if (NULL != if_include && NULL != if_exclude) {
return NULL;
}
/* Query all the IBV devices on the machine. Use an ompi
compatibility function, because the way to get this list has
changed over the history of the IBV API. */
devices = opal_ibv_get_device_list(&num_devs);
if (0 == num_devs) {
opal_output_verbose(5, stream, "no verbs interfaces found");
return NULL;
}
opal_output_verbose(5, stream, "found %d verbs interface%s",
num_devs, (num_devs != 1) ? "s" : "");
/* Allocate a list to fill */
port_list = OBJ_NEW(opal_list_t);
if (NULL == port_list) {
return NULL;
}
if (NULL != if_include) {
opal_output_verbose(5, stream, "finding verbs interfaces, including %s",
if_include);
if_include_list = opal_argv_split(if_include, ',');
if_sanity_list = opal_argv_copy(if_include_list);
} else if (NULL != if_exclude) {
opal_output_verbose(5, stream, "finding verbs interfaces, excluding %s",
if_exclude);
if_exclude_list = opal_argv_split(if_exclude, ',');
if_sanity_list = opal_argv_copy(if_exclude_list);
}
/* Now loop through all the devices. Get the attributes for each
port on each device to see if they match our selection
criteria. */
for (int32_t i = 0; i < num_devs; ++i) {
/* See if this device is on the include/exclude sanity check
list. If it is, remove it from the sanity check list
(i.e., we should end up with an empty list at the end if
all entries in the sanity check list exist) */
device = devices[i];
check_sanity(&if_sanity_list, ibv_get_device_name(device), -1);
opal_output_verbose(5, stream, "examining verbs interface: %s",
ibv_get_device_name(device));
device_context = ibv_open_device(device);
if (NULL == device_context) {
opal_show_help("help-opal-common-verbs.txt",
"ibv_open_device fail", true,
opal_proc_local_get()->proc_hostname,
ibv_get_device_name(device),
errno, strerror(errno));
goto err_free_port_list;
}
if (ibv_query_device(device_context, &device_attr)){
opal_show_help("help-opal-common-verbs.txt",
"ibv_query_device fail", true,
opal_proc_local_get()->proc_hostname,
ibv_get_device_name(device),
errno, strerror(errno));
goto err_free_port_list;
}
/* Now that we have the attributes of this device, remove all
ports of this device from the sanity check list. Note that
IBV ports are indexed from 1, not 0. */
for (j = 1; j <= device_attr.phys_port_cnt; j++) {
check_sanity(&if_sanity_list, ibv_get_device_name(device), j);
}
/* Check the device-specific flags to see if we want this
device */
want = false;
if (flags & OPAL_COMMON_VERBS_FLAGS_TRANSPORT_IB &&
IBV_TRANSPORT_IB == device->transport_type) {
opal_output_verbose(5, stream, "verbs interface %s has right type (IB)",
ibv_get_device_name(device));
want = true;
}
if (flags & OPAL_COMMON_VERBS_FLAGS_TRANSPORT_IWARP &&
IBV_TRANSPORT_IWARP == device->transport_type) {
opal_output_verbose(5, stream, "verbs interface %s has right type (IWARP)",
ibv_get_device_name(device));
want = true;
}
/* Check for RC or UD QP support */
if (flags & OPAL_COMMON_VERBS_FLAGS_RC) {
rc = opal_common_verbs_qp_test(device_context, flags);
if (OPAL_SUCCESS == rc) {
want = true;
opal_output_verbose(5, stream,
"verbs interface %s supports RC QPs",
ibv_get_device_name(device));
} else {
opal_output_verbose(5, stream,
"verbs interface %s failed to make RC QP",
ibv_get_device_name(device));
}
}
if (flags & OPAL_COMMON_VERBS_FLAGS_UD) {
rc = opal_common_verbs_qp_test(device_context, flags);
if (OPAL_SUCCESS == rc) {
want = true;
opal_output_verbose(5, stream,
"verbs interface %s supports UD QPs",
ibv_get_device_name(device));
} else if (OPAL_ERR_TYPE_MISMATCH == rc) {
opal_output_verbose(5, stream,
"verbs interface %s made an RC QP! we don't want RC-capable devices",
ibv_get_device_name(device));
} else {
opal_output_verbose(5, stream,
"verbs interface %s failed to make UD QP",
ibv_get_device_name(device));
}
}
/* If we didn't want it, go to the next device */
if (!want) {
continue;
}
/* Make a device_item_t to hold the device information */
di = OBJ_NEW(opal_common_verbs_device_item_t);
if (NULL == di) {
goto err_free_port_list;
}
di->device = device;
di->context = device_context;
di->device_attr = device_attr;
di->device_name = strdup(ibv_get_device_name(device));
/* Note IBV ports are 1 based (not 0 based) */
for (j = 1; j <= device_attr.phys_port_cnt; j++) {
/* If we don't want this port (based on if_include /
if_exclude lists), skip it */
if (!want_this_port(if_include_list, if_exclude_list, di, j)) {
opal_output_verbose(5, stream, "verbs interface %s:%d: rejected by include/exclude",
ibv_get_device_name(device), j);
continue;
}
/* Query the port */
if (ibv_query_port(device_context, (uint8_t) j, &port_attr)) {
opal_show_help("help-opal-common-verbs.txt",
"ibv_query_port fail", true,
opal_proc_local_get()->proc_hostname,
ibv_get_device_name(device),
errno, strerror(errno));
goto err_free_port_list;
}
/* We definitely only want ACTIVE ports */
if (IBV_PORT_ACTIVE != port_attr.state) {
opal_output_verbose(5, stream, "verbs interface %s:%d: not ACTIVE",
ibv_get_device_name(device), j);
continue;
}
/* Check the port-specific flags to see if we want this
port */
want = false;
if (0 == flags) {
want = true;
}
if ((flags & (OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_IB |
OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_ETHERNET)) ==
(OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_IB |
OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_ETHERNET)) {
/* If they specified both link layers, then we want this port */
want = true;
} else if ((flags & (OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_IB |
OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_ETHERNET)) == 0) {
/* If they specified neither link layer, then we want this port */
want = true;
}
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
else if (flags & OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_IB) {
if (IBV_LINK_LAYER_INFINIBAND == port_attr.link_layer) {
want = true;
} else {
opal_output_verbose(5, stream, "verbs interface %s:%d has wrong link layer (has %s, want IB)",
ibv_get_device_name(device), j,
link_layer_to_str(port_attr.link_layer));
}
} else if (flags & OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_ETHERNET) {
if (IBV_LINK_LAYER_ETHERNET == port_attr.link_layer) {
want = true;
} else {
opal_output_verbose(5, stream, "verbs interface %s:%d has wrong link layer (has %s, want Ethernet)",
ibv_get_device_name(device), j,
link_layer_to_str(port_attr.link_layer));
}
}
#endif
if (!want) {
continue;
}
/* If we got this far, we want the port. Make an item for it. */
pi = OBJ_NEW(opal_common_verbs_port_item_t);
if (NULL == pi) {
goto err_free_port_list;
}
pi->device = di;
pi->port_num = j;
pi->port_attr = port_attr;
OBJ_RETAIN(di);
/* Add the port item to the list */
opal_list_append(port_list, &pi->super);
opal_output_verbose(5, stream, "found acceptable verbs interface %s:%d",
ibv_get_device_name(device), j);
}
/* We're done with the device; if some ports are using it, its
ref count will be > 0, and therefore the device won't be
deleted here. */
OBJ_RELEASE(di);
}
/* Sanity check that the devices specified in the if_include /
if_exclude lists actually existed. If this is true, then the
sanity list will now be empty. If there are still items left
on the list, then they didn't exist. Bad. Print a warning (if
the warning is not disabled). */
if (0 != opal_argv_count(if_sanity_list)) {
if (opal_common_verbs_warn_nonexistent_if) {
char *str = opal_argv_join(if_sanity_list, ',');
opal_show_help("help-opal-common-verbs.txt", "nonexistent port",
true, opal_proc_local_get()->proc_hostname,
((NULL != if_include) ? "in" : "ex"), str);
free(str);
/* Only warn once per process */
opal_common_verbs_warn_nonexistent_if = false;
}
}
if (NULL != if_sanity_list) {
opal_argv_free(if_sanity_list);
}
opal_argv_free(if_include_list);
opal_argv_free(if_exclude_list);
/* All done! */
opal_ibv_free_device_list(devices);
return port_list;
err_free_port_list:
OPAL_LIST_RELEASE(port_list);
opal_ibv_free_device_list(devices);
if (NULL != if_sanity_list) {
opal_argv_free(if_sanity_list);
}
opal_argv_free(if_include_list);
opal_argv_free(if_exclude_list);
return NULL;
}
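
The link-layer filtering in the port loop above can be summarized as a small pure function. This is a minimal sketch, not the removed implementation: the `SKETCH_*` flag bits and the `sketch_link_layer_t` type are hypothetical stand-ins for the `OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_*` values and `port_attr.link_layer`, and it assumes the link layer is reported (the `HAVE_DECL_IBV_LINK_LAYER_ETHERNET` case).

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; the real
   OPAL_COMMON_VERBS_FLAGS_LINK_LAYER_* values are not shown in the diff. */
enum {
    SKETCH_FLAG_LINK_IB       = 0x1,
    SKETCH_FLAG_LINK_ETHERNET = 0x2
};

/* Hypothetical stand-in for the link layer reported by a port. */
typedef enum {
    SKETCH_LINK_IB,
    SKETCH_LINK_ETHERNET
} sketch_link_layer_t;

/* Return true if a port with the given link layer passes the filter. */
static bool sketch_want_port(int flags, sketch_link_layer_t link)
{
    const int mask = SKETCH_FLAG_LINK_IB | SKETCH_FLAG_LINK_ETHERNET;
    const int requested = flags & mask;

    /* Neither bit or both bits set: no link-layer restriction. */
    if (0 == requested || mask == requested) {
        return true;
    }
    /* Exactly one bit set: the port must match that link layer. */
    if (SKETCH_FLAG_LINK_IB == requested) {
        return SKETCH_LINK_IB == link;
    }
    return SKETCH_LINK_ETHERNET == link;
}
```

Written this way, the separate `0 == flags` check in the loop is visibly covered by the "neither link layer" case.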


@ -1,60 +0,0 @@
/*
* Copyright (c) 2012 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "common_verbs.h"
#include "opal/mca/base/mca_base_var.h"
/***********************************************************************/
static bool registered = false;
static int warn_nonexistent_if_index = -1;
bool opal_common_verbs_warn_nonexistent_if = true;
int opal_common_verbs_want_fork_support = -1;
static void register_internal(void)
{
opal_common_verbs_warn_nonexistent_if = true;
warn_nonexistent_if_index =
mca_base_var_register("opal", "opal_common", "verbs", "warn_nonexistent_if",
"Warn if non-existent devices and/or ports are specified in device include/exclude MCA parameters "
"(0 = do not warn; any other value = warn)",
MCA_BASE_VAR_TYPE_BOOL, NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE,
OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_LOCAL,
&opal_common_verbs_warn_nonexistent_if);
/* A deprecated synonym */
mca_base_var_register_synonym(warn_nonexistent_if_index, "ompi", "ompi_common",
"verbs", "warn_nonexistent_if", MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
mca_base_var_register("opal", "opal_common", "verbs", "want_fork_support",
"Whether fork support is desired or not "
"(negative = try to enable fork support, but continue even "
"if it is not available, 0 = do not enable fork support, "
"positive = try to enable fork support and fail if it is not available)",
MCA_BASE_VAR_TYPE_INT, NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE,
OPAL_INFO_LVL_8, MCA_BASE_VAR_SCOPE_ALL_EQ,
&opal_common_verbs_want_fork_support);
registered = true;
}
void opal_common_verbs_mca_register(mca_base_component_t *component)
{
if (!registered) {
register_internal();
}
/* Make synonym for the common_verbs MCA params. */
mca_base_var_register_synonym(warn_nonexistent_if_index, "ompi", component->mca_type_name,
component->mca_component_name, "warn_nonexistent_if", 0);
}
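
The help string for `want_fork_support` describes a three-way policy (negative, zero, positive). A minimal sketch of that policy as a pure function follows; the function name, the `init_ok` parameter (standing in for whether fork initialization succeeded) and the return convention are assumptions made for illustration, not the removed code.

```c
#include <stdbool.h>

/* Hypothetical sketch of the tri-state policy described by the
   want_fork_support help string.  Returns 0 to continue, -1 to fail.
   "init_ok" stands in for the result of trying to enable fork support. */
static int sketch_fork_policy(int want_fork_support, bool init_ok)
{
    if (0 == want_fork_support) {
        /* 0: do not try to enable fork support at all. */
        return 0;
    }
    if (want_fork_support < 0) {
        /* Negative: try, but continue even if it is unavailable. */
        return 0;
    }
    /* Positive: try, and fail if it is unavailable. */
    return init_ok ? 0 : -1;
}
```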


@ -1,120 +0,0 @@
/*
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2012 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2012 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009-2012 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2011 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2012 Oak Ridge National Laboratory. All rights reserved
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "opal/constants.h"
#include <infiniband/verbs.h>
#include "common_verbs.h"
int opal_common_verbs_port_bw(struct ibv_port_attr *port_attr,
uint32_t *bandwidth)
{
*bandwidth = 0;
/* To calculate the bandwidth available on this port, we have to
look up the values corresponding to port_attr->active_speed and
port_attr->active_width. These are enums corresponding to the IB
spec. The overall formula to get the true link speed is 8/10 or
64/66 of the reported speed (depending on the encoding that is
being used for the particular speed) times the link width
(number of lanes). */
switch (port_attr->active_speed) {
case 1:
/* SDR: 2.5 Gbps * 0.8, in megabits */
*bandwidth = 2000;
break;
case 2:
/* DDR: 5 Gbps * 0.8, in megabits */
*bandwidth = 4000;
break;
case 4:
/* QDR: 10 Gbps * 0.8, in megabits */
*bandwidth = 8000;
break;
case 8:
/* FDR10: 10.3125 Gbps * 64/66, in megabits */
*bandwidth = 10000;
break;
case 16:
/* FDR: 14.0625 Gbps * 64/66, in megabits */
*bandwidth = 13636;
break;
case 32:
/* EDR: 25.78125 Gbps * 64/66, in megabits */
*bandwidth = 25000;
break;
case 64:
/* HDR: 50 Gbps effective data rate per lane, in megabits */
*bandwidth = 50000;
break;
default:
/* Who knows? */
return OPAL_ERR_NOT_FOUND;
}
switch (port_attr->active_width) {
case 1:
/* 1x */
/* unity */
break;
case 2:
/* 4x */
*bandwidth *= 4;
break;
case 4:
/* 8x */
*bandwidth *= 8;
break;
case 8:
/* 12x */
*bandwidth *= 12;
break;
default:
/* Who knows? */
return OPAL_ERR_NOT_FOUND;
}
return OPAL_SUCCESS;
}
int opal_common_verbs_mtu(struct ibv_port_attr *port_attr)
{
if (NULL == port_attr) {
return 0;
}
switch(port_attr->active_mtu) {
case IBV_MTU_256: return 256;
case IBV_MTU_512: return 512;
case IBV_MTU_1024: return 1024;
case IBV_MTU_2048: return 2048;
case IBV_MTU_4096: return 4096;
default: return 0;
}
}
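
The arithmetic behind the hard-coded table in `opal_common_verbs_port_bw()` can be reproduced with integer math. The sketch below is illustrative only: the helper name and its parameters are hypothetical, the per-lane signaling rates come from the comments in the switch statement, and it truncates to whole megabits per lane before multiplying by the lane count so that it matches the table values.

```c
#include <stdint.h>

/* Hypothetical helper: effective bandwidth in Mbps.
   lane_kbps  per-lane signaling rate in kilobits per second
   enc_num    encoding numerator   (8 for 8b/10b, 64 for 64b/66b)
   enc_den    encoding denominator (10 for 8b/10b, 66 for 64b/66b)
   lanes      link width (1, 4, 8 or 12) */
static uint32_t sketch_effective_mbps(uint64_t lane_kbps, uint32_t enc_num,
                                      uint32_t enc_den, uint32_t lanes)
{
    if (0 == enc_den) {
        return 0;
    }
    /* Apply the encoding overhead, then truncate to whole Mbps per lane. */
    uint64_t per_lane_mbps = lane_kbps * enc_num / enc_den / 1000;
    return (uint32_t) (per_lane_mbps * lanes);
}
```

For example, `sketch_effective_mbps(14062500, 64, 66, 4)` models a 4x FDR link: 14.0625 Gbps per lane with 64b/66b encoding gives 13636 Mbps per lane, times four lanes.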


@ -1,104 +0,0 @@
/*
* Copyright (c) 2012-2013 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "opal/constants.h"
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>
#include "common_verbs.h"
/*
* It seems you can't probe a device / port to see if it supports a
* specific type of QP. You just have to try to make it and see if it
* works. This is a short helper function to try to make a QP of a
* specific type and return whether it worked.
*/
static bool make_qp(struct ibv_pd *pd, struct ibv_cq *cq, enum ibv_qp_type type)
{
struct ibv_qp_init_attr qpia;
struct ibv_qp *qp;
memset(&qpia, 0, sizeof(qpia));
qpia.qp_context = NULL;
qpia.send_cq = cq;
qpia.recv_cq = cq;
qpia.srq = NULL;
qpia.cap.max_send_wr = 1;
qpia.cap.max_recv_wr = 1;
qpia.cap.max_send_sge = 1;
qpia.cap.max_recv_sge = 1;
qpia.cap.max_inline_data = 0;
qpia.qp_type = type;
qpia.sq_sig_all = 0;
qp = ibv_create_qp(pd, &qpia);
if (NULL != qp) {
ibv_destroy_qp(qp);
return true;
}
return false;
}
int opal_common_verbs_qp_test(struct ibv_context *device_context, int flags)
{
int rc = OPAL_SUCCESS;
struct ibv_pd *pd = NULL;
struct ibv_cq *cq = NULL;
/* Bozo check */
if (NULL == device_context ||
(0 == (flags & (OPAL_COMMON_VERBS_FLAGS_RC | OPAL_COMMON_VERBS_FLAGS_UD)))) {
return OPAL_ERR_BAD_PARAM;
}
/* Try to make both the PD and CQ */
pd = ibv_alloc_pd(device_context);
if (NULL == pd) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
cq = ibv_create_cq(device_context, 2, NULL, NULL, 0);
if (NULL == cq) {
rc = OPAL_ERR_OUT_OF_RESOURCE;
goto out;
}
/* Now try to make the QP(s) of the desired type(s) */
if (flags & OPAL_COMMON_VERBS_FLAGS_RC &&
!make_qp(pd, cq, IBV_QPT_RC)) {
rc = OPAL_ERR_NOT_SUPPORTED;
goto out;
}
if (flags & OPAL_COMMON_VERBS_FLAGS_NOT_RC &&
make_qp(pd, cq, IBV_QPT_RC)) {
rc = OPAL_ERR_TYPE_MISMATCH;
goto out;
}
if (flags & OPAL_COMMON_VERBS_FLAGS_UD &&
!make_qp(pd, cq, IBV_QPT_UD)) {
rc = OPAL_ERR_NOT_SUPPORTED;
goto out;
}
out:
/* Free the PD and/or CQ */
if (NULL != pd) {
ibv_dealloc_pd(pd);
}
if (NULL != cq) {
ibv_destroy_cq(cq);
}
return rc;
}
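
The mapping from flags to return codes in `opal_common_verbs_qp_test()` is easier to see without the verbs calls. The following is a hedged sketch: the `SK_*` names are hypothetical stand-ins for the `OPAL_COMMON_VERBS_FLAGS_*` bits and `OPAL_*` return codes, and the two booleans replace the actual attempts to create RC and UD queue pairs.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the flag bits and return codes. */
enum { SK_FLAG_RC = 0x1, SK_FLAG_NOT_RC = 0x2, SK_FLAG_UD = 0x4 };
typedef enum {
    SK_SUCCESS,
    SK_BAD_PARAM,
    SK_NOT_SUPPORTED,
    SK_TYPE_MISMATCH
} sk_rc_t;

/* can_rc / can_ud model whether creating a QP of that type succeeds. */
static sk_rc_t sketch_qp_test(int flags, bool can_rc, bool can_ud)
{
    /* At least one of RC or UD must be requested. */
    if (0 == (flags & (SK_FLAG_RC | SK_FLAG_UD))) {
        return SK_BAD_PARAM;
    }
    if ((flags & SK_FLAG_RC) && !can_rc) {
        return SK_NOT_SUPPORTED;
    }
    /* NOT_RC: a device that *can* make an RC QP is rejected. */
    if ((flags & SK_FLAG_NOT_RC) && can_rc) {
        return SK_TYPE_MISMATCH;
    }
    if ((flags & SK_FLAG_UD) && !can_ud) {
        return SK_NOT_SUPPORTED;
    }
    return SK_SUCCESS;
}
```

In this model, a caller asking for `SK_FLAG_UD | SK_FLAG_NOT_RC` gets `SK_TYPE_MISMATCH` from an RC-capable device, which corresponds to the "made an RC QP!" message in the port-finding code.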


@ -1,41 +0,0 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2007-2012 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
# Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# MCA_opal_common_verbs_CONFIG([action-if-can-compile],
# [action-if-cant-compile])
# ------------------------------------------------
AC_DEFUN([MCA_opal_common_verbs_CONFIG],[
AC_CONFIG_FILES([opal/mca/common/verbs/Makefile])
common_verbs_happy="no"
OPAL_CHECK_OPENFABRICS([common_verbs],
[common_verbs_happy="yes"])
AS_IF([test "$common_verbs_happy" = "yes"],
[$1],
[$2])
# substitute in the things needed to build openib
AC_SUBST([common_verbs_CFLAGS])
AC_SUBST([common_verbs_CPPFLAGS])
AC_SUBST([common_verbs_LDFLAGS])
AC_SUBST([common_verbs_LIBS])
])dnl


@ -1,54 +0,0 @@
#
# Copyright (c) 2012-2014 Cisco Systems, Inc. All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
[ibv_open_device fail]
Open MPI failed to open an OpenFabrics device. This is an unusual
error; the system reported the OpenFabrics device as being present,
but then later failed to access it successfully. This usually
indicates either a misconfiguration or a failed OpenFabrics hardware
device.
All OpenFabrics support has been disabled in this MPI process; your
job may or may not continue.
Hostname: %s
Device name: %s
Error (%d): %s
#
[ibv_query_device fail]
Open MPI failed to query an OpenFabrics device. This is an unusual
error; the system reported the OpenFabrics device as being present,
Open MPI was able to open it successfully, but then later failed to
query it successfully. This usually indicates either a
misconfiguration or a failed OpenFabrics hardware device.
All OpenFabrics support has been disabled in this MPI process; your
job may or may not continue.
Hostname: %s
Device name: %s
Error (%d): %s
#
[nonexistent port]
WARNING: One or more nonexistent OpenFabrics devices/ports were
specified:
Host: %s
MCA parameter: ompi_common_verbs_%sclude
Nonexistent entities: %s
These entities will be ignored. You can disable this warning by
setting the ompi_common_verbs_warn_nonexistent_if MCA parameter to 0.
#
[ibv_fork_init fail]
Fork support was requested but the library call ibv_fork_init() failed.
Hostname: %s
Error (%d): %s
#


@ -1,7 +0,0 @@
#
# owner/status file
# owner: institution that is responsible for this package
# status: e.g. active, maintenance, unmaintained
#
owner: MELLANOX
status: maintenance


@ -1,40 +0,0 @@
#
# Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
# Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
# Copyright (c) 2012-2015 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
headers = common_verbs_usnic.h
sources = common_verbs_usnic_fake.c
# This component is always linked statically. It has code that is
# registered as a driver for libibverbs. There is no corresponding
# *un*register API in libibverbs, so this code can never be dlclosed.
# And therefore it must be in the libopen-pal library, not a DSO or
# dependent library.
noinst_LTLIBRARIES = lib@OPAL_LIB_PREFIX@mca_common_verbs_usnic.la
lib@OPAL_LIB_PREFIX@mca_common_verbs_usnic_la_SOURCES = \
$(headers) $(sources)
lib@OPAL_LIB_PREFIX@mca_common_verbs_usnic_la_CPPFLAGS = \
$(common_verbs_usnic_CPPFLAGS)
lib@OPAL_LIB_PREFIX@mca_common_verbs_usnic_la_LDFLAGS = \
$(common_verbs_usnic_LDFLAGS)
lib@OPAL_LIB_PREFIX@mca_common_verbs_usnic_la_LIBADD = \
$(common_verbs_usnic_LIBS)
# Conditionally install the header files
if WANT_INSTALL_HEADERS
opaldir = $(opalincludedir)/opal/mca/common/verbs_usnic
opal_HEADERS = $(headers)
else
opaldir = $(includedir)
endif


@ -1,27 +0,0 @@
/*
* Copyright (c) 2015 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef _COMMON_VERBS_USNIC_H_
#define _COMMON_VERBS_USNIC_H_
#include "opal_config.h"
#include <stdint.h>
#include <infiniband/verbs.h>
BEGIN_C_DECLS
/*
* Register fake verbs drivers
*/
void opal_common_verbs_usnic_register_fake_drivers(void);
END_C_DECLS
#endif


@ -1,135 +0,0 @@
/*
* Copyright (c) 2015 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
/*
* The code in this file prevents spurious libibverbs warnings on
* stderr about devices that it doesn't recognize.
*
* Specifically, Cisco usNIC devices are exposed through the Linux
* InfiniBand kernel interface (i.e., they show up in
* /sys/class/infiniband). However, the userspace side of these
* drivers is not exposed through libibverbs (i.e., there is no
* libibverbs provider/plugin for usNIC). Therefore, when
* ibv_get_device_list() is invoked, libibverbs cannot find a plugin
* for usnic devices. This causes libibverbs to emit a spurious
* warning message on stderr.
*
* To avoid these extra stderr warnings, we insert a fake usnic verbs
* libibverbs provider that safely squelches these warnings.
*
* More specifically: the userspace side of usNIC is exposed through
* libfabric; we don't need libibverbs warnings about not being able
* to find a usnic driver.
*
* Note: this code is statically linked into libopen-pal. It is
* registered via ibv_register_driver(), and there is no corresponding
* *un*register IBV API. Hence, we cannot allow this code to be
* dlclosed (e.g., if it is a DSO or a dependent common library) -- it
* must be in libopen-pal itself, which will stay resident in the MPI
* application.
*/
#include "opal_config.h"
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <string.h>
#include <infiniband/verbs.h>
#ifdef HAVE_INFINIBAND_DRIVER_H
#include <infiniband/driver.h>
#endif
#include "common_verbs_usnic.h"
/***********************************************************************/
#define PCI_VENDOR_ID_CISCO (0x1137)
static struct ibv_context *fake_alloc_context(struct ibv_device *ibdev,
int cmd_fd)
{
/* Nothing to do here */
return NULL;
}
static void fake_free_context(struct ibv_context *ibctx)
{
/* Nothing to do here */
}
/* Put just enough in here to convince libibverbs that this is a valid
device, and a little extra just in case someone looks at this
struct in a debugger. */
static struct ibv_device fake_dev = {
.ops = {
.alloc_context = fake_alloc_context,
.free_context = fake_free_context
},
.name = "fake ibv_device inserted by Open MPI for non-verbs devices"
};
static struct ibv_device *fake_driver_init(const char *uverbs_sys_path,
int abi_version)
{
char value[8];
int vendor;
/* This function should only be invoked for
/sys/class/infiniband/usnic_X devices, but double check just to
be absolutely sure: read the vendor ID and ensure that it is
Cisco. */
if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor",
value, sizeof(value)) < 0) {
return NULL;
}
if (sscanf(value, "%i", &vendor) != 1) {
return NULL;
}
if (vendor == PCI_VENDOR_ID_CISCO) {
return &fake_dev;
}
/* We didn't find a device that we want to support */
return NULL;
}
void opal_common_verbs_usnic_register_fake_drivers(void)
{
/* No need to do this more than once */
static bool already_done = false;
if (already_done) {
return;
}
already_done = true;
/* If there are any usnic devices, then register a fake driver */
DIR *class_dir;
class_dir = opendir("/sys/class/infiniband");
if (NULL == class_dir) {
return;
}
bool found = false;
struct dirent *dent;
while ((dent = readdir(class_dir)) != NULL) {
if (strncmp(dent->d_name, "usnic_", 6) == 0) {
found = true;
break;
}
}
closedir(class_dir);
if (found) {
ibv_register_driver("usnic_verbs", fake_driver_init);
}
}


@ -1,99 +0,0 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2007-2016 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
# Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
#
# This component is a workaround to a bug in libibverbs that prints a
# dire warning that usNIC devices are not supported (of course not --
# usNIC devices provide functionality through libfabric, not
# libibverbs). This component was written before a better workaround
# was created: a "no op" libibverbs plugin for usNIC devices
# (https://github.com/cisco/libusnic_verbs, and is also available in
# binary form on cisco.com).
#
# Hence, this component no longer builds by default. It's still
# available if a user specifically asks for it (e.g., if they do not
# want to install the "no op" libibverbs plugin), but it's not the
# default. This component also has the side-effect of making
# libopen-pal.so depend on libibverbs.so, which can be annoying for
# packagers (which is another reason it isn't built by default any
# more).
#
# This component must be linked statically into libopen-pal because it
# registers a provider for libibverbs at run time, and there's no
# libibverbs API to *un*register a plugin. Hence, we can't allow this
# code to be dlclosed/removed from the process. Hence: it must be
# compiled statically into libopen-pal.
#
AC_DEFUN([MCA_opal_common_verbs_usnic_COMPILE_MODE], [
AC_MSG_CHECKING([for MCA component $2:$3 compile mode])
$4="static"
AC_MSG_RESULT([$$4])
])
# MCA_opal_common_verbs_usnic_CONFIG([action-if-can-compile],
# [action-if-cant-compile])
# ------------------------------------------------
AC_DEFUN([MCA_opal_common_verbs_usnic_CONFIG],[
AC_CONFIG_FILES([opal/mca/common/verbs_usnic/Makefile])
common_verbs_usnic_happy=0
AC_ARG_WITH(verbs-usnic,
AC_HELP_STRING([--with-verbs-usnic],
[Add support in Open MPI to defeat a seemingly dire warning message from libibverbs that Cisco usNIC devices are not supported. This support is not compiled by default because you can also avoid this libibverbs bug by installing the libibverbs_usnic "no op" plugin, available from https://github.com/cisco/libusnic_verbs or in binary form from cisco.com]))
AS_IF([test "$with_verbs_usnic" = "yes"],
[common_verbs_usnic_happy=1])
AS_IF([test $common_verbs_usnic_happy -eq 1],
[OPAL_CHECK_OPENFABRICS([common_verbs_usnic],
[common_verbs_usnic_happy=1],
[common_verbs_usnic_happy=0])
])
AS_IF([test $common_verbs_usnic_happy -eq 1],
[AC_CHECK_MEMBER([struct ibv_device.ops],
[],
[AC_MSG_WARN([--with-verbs-usnic specified, but the verbs.h does not])
AC_MSG_WARN([have the required member fields. It is highly likely])
AC_MSG_WARN([that you do not need --with-verbs-usnic. Try configuring])
AC_MSG_WARN([and building Open MPI without it; if you get warnings])
AC_MSG_WARN([about usnic IB devices anyway, please let us know.])
AC_MSG_WARN([Since you asked for --with-verbs-usnic and we cannot])
AC_MSG_WARN([deliver it, configure will now abort.])
AC_MSG_ERROR([Cannot continue])
],
[#include <infiniband/verbs.h>])
])
AC_DEFINE_UNQUOTED([OPAL_COMMON_VERBS_USNIC_HAPPY],
[$common_verbs_usnic_happy],
[Whether the common/verbs_usnic component is being built or not])
AS_IF([test $common_verbs_usnic_happy -eq 1],
[$1],
[$2])
# substitute in the things needed to build common_verbs_usnic
AC_SUBST([common_verbs_usnic_CPPFLAGS])
AC_SUBST([common_verbs_usnic_LDFLAGS])
AC_SUBST([common_verbs_usnic_LIBS])
])dnl


@ -1,7 +0,0 @@
#
# owner/status file
# owner: institution that is responsible for this package
# status: e.g. active, maintenance, unmaintained
#
owner: Cisco
status: maintenance


@ -69,9 +69,6 @@ AC_DEFUN([MCA_opal_hwloc_hwloc201_POST_CONFIG],[
# MCA_hwloc_hwloc201_CONFIG([action-if-found], [action-if-not-found])
# --------------------------------------------------------------------
AC_DEFUN([MCA_opal_hwloc_hwloc201_CONFIG],[
# Hwloc needs to know if we have Verbs support
AC_REQUIRE([OPAL_CHECK_VERBS_DIR])
AC_CONFIG_FILES([opal/mca/hwloc/hwloc201/Makefile])
OPAL_VAR_SCOPE_PUSH([HWLOC_VERSION opal_hwloc_hwloc201_save_CPPFLAGS opal_hwloc_hwloc201_save_LDFLAGS opal_hwloc_hwloc201_save_LIBS opal_hwloc_hwloc201_save_cairo opal_hwloc_hwloc201_save_xml opal_hwloc_hwloc201_save_mode opal_hwloc_hwloc201_basedir opal_hwloc_hwloc201_file opal_hwloc_hwloc201_save_cflags CPPFLAGS_save LIBS_save opal_hwloc_external])


@ -1,41 +0,0 @@
# Copyright (c) 2014 Mellanox Technologies, Inc.
# All rights reserved.
# Copyright (c) 2017 IBM Corporation. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
AM_CPPFLAGS = $(oshmem_verbs_CPPFLAGS)
sources = \
sshmem_verbs.h \
sshmem_verbs_component.c \
sshmem_verbs_module.c
# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).
if MCA_BUILD_oshmem_sshmem_verbs_DSO
component_noinst =
component_install = mca_sshmem_verbs.la
else
component_noinst = libmca_sshmem_verbs.la
component_install =
endif
mcacomponentdir = $(oshmemlibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_sshmem_verbs_la_SOURCES = $(sources)
mca_sshmem_verbs_la_LDFLAGS = -module -avoid-version $(oshmem_verbs_LDFLAGS)
mca_sshmem_verbs_la_LIBADD = $(top_builddir)/oshmem/liboshmem.la \
$(oshmem_verbs_LIBS) \
$(OPAL_TOP_BUILDDIR)/opal/mca/common/verbs/lib@OPAL_LIB_PREFIX@mca_common_verbs.la
noinst_LTLIBRARIES = $(component_noinst)
libmca_sshmem_verbs_la_SOURCES =$(sources)
libmca_sshmem_verbs_la_LDFLAGS = -module -avoid-version $(oshmem_verbs_LDFLAGS)
libmca_sshmem_verbs_la_LIBADD = $(oshmem_verbs_LIBS)


@ -1,121 +0,0 @@
# -*- shell-script -*-
#
# Copyright (c) 2014 Mellanox Technologies, Inc.
# All rights reserved.
# Copyright (c) 2015 Research Organization for Information Science
# and Technology (RIST). All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# MCA_oshmem_sshmem_verbs_CONFIG(action-if-can-compile,
# [action-if-cant-compile])
# ------------------------------------------------
AC_DEFUN([MCA_oshmem_sshmem_verbs_CONFIG],[
AC_CONFIG_FILES([oshmem/mca/sshmem/verbs/Makefile])
# do we have the verbs shm stuff?
AC_MSG_CHECKING([if want verbs shared memory support])
AC_ARG_ENABLE(verbs-sshmem,
AC_HELP_STRING([--disable-verbs-sshmem],
[disable verbs shared memory support (default: enabled)]))
AS_IF([test "$enable_verbs_sshmem" = "no"],
[AC_MSG_RESULT([no])
oshmem_verbs_sm_build_verbs=0],
[AC_MSG_RESULT([yes])
OPAL_CHECK_OPENFABRICS([oshmem_verbs],
[oshmem_verbs_sm_build_verbs=1],
[oshmem_verbs_sm_build_verbs=0])])
# substitute in the things needed to build
AC_SUBST([oshmem_verbs_CFLAGS])
AC_SUBST([oshmem_verbs_CPPFLAGS])
AC_SUBST([oshmem_verbs_LDFLAGS])
AC_SUBST([oshmem_verbs_LIBS])
# ibv_reg_shared_mr was added in MOFED 1.8
oshmem_have_mpage=0
# If we have the oshmem_verbs stuff available, find out what we've got
AS_IF(
[test "$oshmem_verbs_sm_build_verbs" = "1"],
[
OSHMEM_LIBSHMEM_EXTRA_LDFLAGS="$OSHMEM_LIBSHMEM_EXTRA_LDFLAGS $oshmem_verbs_LDFLAGS"
OSHMEM_LIBSHMEM_EXTRA_LIBS="$OSHMEM_LIBSHMEM_EXTRA_LIBS $oshmem_verbs_LIBS"
oshmem_verbs_save_CPPFLAGS="$CPPFLAGS"
oshmem_verbs_save_LDFLAGS="$LDFLAGS"
oshmem_verbs_save_LIBS="$LIBS"
CPPFLAGS="$CPPFLAGS $oshmem_verbs_CPPFLAGS"
LDFLAGS="$LDFLAGS $oshmem_verbs_LDFLAGS"
LIBS="$LIBS $oshmem_verbs_LIBS"
AC_CHECK_DECLS([IBV_ACCESS_ALLOCATE_MR,IBV_ACCESS_SHARED_MR_USER_READ],
[oshmem_have_mpage=2], [],
[#include <infiniband/verbs.h>])
AC_CHECK_DECLS([IBV_EXP_ACCESS_ALLOCATE_MR,IBV_EXP_ACCESS_SHARED_MR_USER_READ],
[oshmem_have_mpage=3], [],
[#include <infiniband/verbs.h>])
CPPFLAGS="$oshmem_verbs_save_CPPFLAGS"
LDFLAGS="$oshmem_verbs_save_LDFLAGS"
LIBS="$oshmem_verbs_save_LIBS"
if test "x$oshmem_have_mpage" = "x0"; then
oshmem_verbs_sm_build_verbs=0
fi
])
AC_DEFINE_UNQUOTED(MPAGE_ENABLE, $oshmem_have_mpage, [Whether we can use M-PAGE supported since MOFED 1.8])
exp_access_happy=0
exp_reg_mr_happy=0
AS_IF([test "$oshmem_have_mpage" = "3"],
[
oshmem_verbs_save_CFLAGS="$CFLAGS"
CFLAGS="$CFLAGS -Wno-strict-prototypes -Werror"
AC_COMPILE_IFELSE(
[AC_LANG_PROGRAM([[#include <infiniband/verbs_exp.h>]],
[[
struct ibv_exp_reg_shared_mr_in in_smr;
uint64_t access_flags = IBV_EXP_ACCESS_SHARED_MR_USER_READ |
IBV_EXP_ACCESS_SHARED_MR_USER_WRITE |
IBV_EXP_ACCESS_SHARED_MR_GROUP_READ |
IBV_EXP_ACCESS_SHARED_MR_GROUP_WRITE |
IBV_EXP_ACCESS_SHARED_MR_OTHER_READ |
IBV_EXP_ACCESS_SHARED_MR_OTHER_WRITE;
in_smr.exp_access = access_flags;
ibv_exp_reg_shared_mr(&in_smr);
]])], [],
[oshmem_verbs_sm_build_verbs=0])
CFLAGS="$oshmem_verbs_save_CFLAGS"
AC_CHECK_MEMBER([struct ibv_exp_reg_shared_mr_in.exp_access],
[exp_access_happy=1],
[],
[#include <infiniband/verbs_exp.h>])
AC_CHECK_MEMBER([struct ibv_exp_reg_mr_in.create_flags],
[exp_reg_mr_happy=1],
[],
[#include <infiniband/verbs_exp.h>])
])
AC_DEFINE_UNQUOTED(MPAGE_HAVE_SMR_EXP_ACCESS, $exp_access_happy, [exp_access field is part of ibv_exp_reg_shared_mr_in])
AC_DEFINE_UNQUOTED(MPAGE_HAVE_IBV_EXP_REG_MR_CREATE_FLAGS, $exp_reg_mr_happy, [create_flags field is part of ibv_exp_reg_mr_in])
AS_IF([test "$enable_verbs_sshmem" = "yes" && test "$oshmem_verbs_sm_build_verbs" = "0"],
[AC_MSG_WARN([VERBS shared memory support requested but not found])
AC_MSG_ERROR([Cannot continue])])
AS_IF([test "$oshmem_verbs_sm_build_verbs" = "1"], [$1], [$2])
AC_DEFINE_UNQUOTED([OSHMEM_SSHMEM_VERBS],
[$oshmem_verbs_sm_build_verbs],
[Whether we have shared memory support for verbs or not])
])dnl


@ -1,96 +0,0 @@
/*
* Copyright (c) 2014 Mellanox Technologies, Inc.
* All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_SSHMEM_VERBS_EXPORT_H
#define MCA_SSHMEM_VERBS_EXPORT_H
#include "oshmem_config.h"
#include "oshmem/mca/sshmem/sshmem.h"
BEGIN_C_DECLS
#include <infiniband/verbs.h>
#include "opal/class/opal_list.h"
#include "opal/class/opal_value_array.h"
typedef struct openib_device_t {
struct ibv_device **ib_devs;
struct ibv_device *ib_dev;
struct ibv_context *ib_dev_context;
struct ibv_device_attr ib_dev_attr;
struct ibv_pd *ib_pd;
opal_value_array_t ib_mr_array;
struct ibv_mr *ib_mr_shared;
} openib_device_t;
#if (MPAGE_ENABLE > 0)
# if MPAGE_ENABLE < 3
# define IBV_EXP_ACCESS_ALLOCATE_MR IBV_ACCESS_ALLOCATE_MR
# define IBV_EXP_ACCESS_SHARED_MR_USER_READ IBV_ACCESS_SHARED_MR_USER_READ
# define IBV_EXP_ACCESS_SHARED_MR_USER_WRITE IBV_ACCESS_SHARED_MR_USER_WRITE
# define IBV_EXP_ACCESS_NO_RDMA IBV_ACCESS_NO_RDMA
# define ibv_exp_reg_shared_mr ibv_reg_shared_mr_ex
# define ibv_exp_reg_shared_mr_in ibv_reg_shared_mr_in
struct ibv_exp_reg_mr_in {
struct ibv_pd *pd;
void *addr;
size_t length;
uint64_t access;
uint32_t comp_mask;
};
static inline struct ibv_mr *ibv_exp_reg_mr(struct ibv_exp_reg_mr_in *in)
{
return ibv_reg_mr(in->pd, in->addr, in->length, in->access);
}
# endif
static inline void mca_sshmem_verbs_fill_shared_mr(struct ibv_exp_reg_shared_mr_in *mr, struct ibv_pd *pd, uint32_t handle, void *addr, uint64_t access)
{
mr->pd = pd;
mr->addr = addr;
mr->mr_handle = handle;
#if (MPAGE_HAVE_SMR_EXP_ACCESS)
mr->exp_access = access;
#else
mr->access = access;
#endif
mr->comp_mask = 0;
}
#endif /* MPAGE_ENABLE */
/**
* globally exported variable to hold the verbs component.
*/
typedef struct mca_sshmem_verbs_component_t {
/* base component struct */
mca_sshmem_base_component_t super;
/* priority for verbs component */
int priority;
char* hca_name;
int mr_interleave_factor;
int has_shared_mr;
} mca_sshmem_verbs_component_t;
OSHMEM_MODULE_DECLSPEC extern mca_sshmem_verbs_component_t
mca_sshmem_verbs_component;
typedef struct mca_sshmem_verbs_module_t {
mca_sshmem_base_module_t super;
} mca_sshmem_verbs_module_t;
extern mca_sshmem_verbs_module_t mca_sshmem_verbs_module;
END_C_DECLS
#endif /* MCA_SSHMEM_VERBS_EXPORT_H */


@ -1,353 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2014 Mellanox Technologies, Inc.
* All rights reserved.
* Copyright (c) 2014 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2015 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "oshmem_config.h"
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif /* HAVE_UNISTD_H */
#include "opal/constants.h"
#include "opal/util/sys_limits.h"
#include "opal/mca/common/verbs/common_verbs.h"
#include "oshmem/mca/sshmem/sshmem.h"
#include "oshmem/mca/sshmem/base/base.h"
#include "sshmem_verbs.h"
/**
* public string showing the sshmem verbs component version number
*/
const char *mca_sshmem_verbs_component_version_string =
"OSHMEM verbs sshmem MCA component version " OSHMEM_VERSION;
int mca_sshmem_verbs_relocate_backing_file = 0;
char *mca_sshmem_verbs_backing_file_base_dir = NULL;
bool mca_sshmem_verbs_nfs_warning = true;
/**
* local functions
*/
static int verbs_register(void);
static int verbs_open(void);
static int verbs_close(void);
static int verbs_query(mca_base_module_t **module, int *priority);
static int verbs_runtime_query(mca_base_module_t **module,
int *priority,
const char *hint);
/**
* instantiate the public struct with all of our public information
* and pointers to our public functions in it
*/
mca_sshmem_verbs_component_t mca_sshmem_verbs_component = {
/* ////////////////////////////////////////////////////////////////////// */
/* super */
/* ////////////////////////////////////////////////////////////////////// */
{
/**
* common MCA component data
*/
.base_version = {
MCA_SSHMEM_BASE_VERSION_2_0_0,
/* component name and version */
.mca_component_name = "verbs",
MCA_BASE_MAKE_VERSION(component, OSHMEM_MAJOR_VERSION, OSHMEM_MINOR_VERSION,
OSHMEM_RELEASE_VERSION),
.mca_open_component = verbs_open,
.mca_close_component = verbs_close,
.mca_query_component = verbs_query,
.mca_register_component_params = verbs_register,
},
/* MCA v2.0.0 component meta data */
.base_data = {
/* the component is checkpoint ready */
MCA_BASE_METADATA_PARAM_CHECKPOINT
},
.runtime_query = verbs_runtime_query,
},
};
/* ////////////////////////////////////////////////////////////////////////// */
static int
verbs_runtime_query(mca_base_module_t **module,
int *priority,
const char *hint)
{
int rc = OSHMEM_SUCCESS;
openib_device_t my_device;
openib_device_t *device = &my_device;
int num_devs = 0;
int i = 0;
*priority = 0;
*module = NULL;
/* If fork support is requested, try to enable it */
if (OSHMEM_SUCCESS != (rc = opal_common_verbs_fork_test())) {
return OSHMEM_ERROR;
}
memset(device, 0, sizeof(*device));
#ifdef HAVE_IBV_GET_DEVICE_LIST
device->ib_devs = ibv_get_device_list(&num_devs);
#else
#error unsupported ibv_get_device_list in infiniband/verbs.h
#endif
if (num_devs == 0 || !device->ib_devs) {
return OSHMEM_ERR_NOT_SUPPORTED;
}
/* Open device */
if (NULL != mca_sshmem_verbs_component.hca_name) {
for (i = 0; i < num_devs; i++) {
if (0 == strcmp(mca_sshmem_verbs_component.hca_name, ibv_get_device_name(device->ib_devs[i]))) {
device->ib_dev = device->ib_devs[i];
break;
}
}
} else {
device->ib_dev = device->ib_devs[0];
}
if (NULL == device->ib_dev) {
rc = OSHMEM_ERR_NOT_FOUND;
goto out;
}
if (NULL == (device->ib_dev_context = ibv_open_device(device->ib_dev))) {
rc = OSHMEM_ERR_RESOURCE_BUSY;
goto out;
}
/* Obtain device attributes */
if (ibv_query_device(device->ib_dev_context, &device->ib_dev_attr)) {
rc = OSHMEM_ERR_RESOURCE_BUSY;
goto out;
}
/* Allocate the protection domain for the device */
device->ib_pd = ibv_alloc_pd(device->ib_dev_context);
if (NULL == device->ib_pd) {
rc = OSHMEM_ERR_RESOURCE_BUSY;
goto out;
}
/* Allocate memory */
if (!rc) {
void *addr = NULL;
size_t size = (size_t)opal_getpagesize();
struct ibv_mr *ib_mr = NULL;
uint64_t access_flag = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ;
uint64_t exp_access_flag = 0;
OBJ_CONSTRUCT(&device->ib_mr_array, opal_value_array_t);
opal_value_array_init(&device->ib_mr_array, sizeof(struct ibv_mr *));
#if (MPAGE_ENABLE > 0)
exp_access_flag = IBV_EXP_ACCESS_ALLOCATE_MR |
IBV_EXP_ACCESS_SHARED_MR_USER_READ |
IBV_EXP_ACCESS_SHARED_MR_USER_WRITE;
#endif /* MPAGE_ENABLE */
struct ibv_exp_reg_mr_in in = {device->ib_pd, addr, size, access_flag|exp_access_flag, 0};
ib_mr = ibv_exp_reg_mr(&in);
if (NULL == ib_mr) {
rc = OSHMEM_ERR_OUT_OF_RESOURCE;
} else {
device->ib_mr_shared = ib_mr;
opal_value_array_append_item(&device->ib_mr_array, &ib_mr);
}
#if (MPAGE_ENABLE > 0)
if (!rc && (0 != mca_sshmem_verbs_component.has_shared_mr)) {
struct ibv_exp_reg_shared_mr_in in_smr;
access_flag = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ|
IBV_EXP_ACCESS_NO_RDMA;
addr = (void *)mca_sshmem_base_start_address;
mca_sshmem_verbs_fill_shared_mr(&in_smr, device->ib_pd, device->ib_mr_shared->handle, addr, access_flag);
ib_mr = ibv_exp_reg_shared_mr(&in_smr);
if (NULL == ib_mr) {
if (mca_sshmem_verbs_component.has_shared_mr == 1)
rc = OSHMEM_ERR_OUT_OF_RESOURCE;
mca_sshmem_verbs_component.has_shared_mr = 0;
} else {
opal_value_array_append_item(&device->ib_mr_array, &ib_mr);
mca_sshmem_verbs_component.has_shared_mr = 1;
}
}
#else
if (!rc && mca_sshmem_verbs_component.has_shared_mr == 1) {
rc = OSHMEM_ERR_OUT_OF_RESOURCE;
}
mca_sshmem_verbs_component.has_shared_mr = 0;
#endif /* MPAGE_ENABLE */
}
#if !MPAGE_HAVE_IBV_EXP_REG_MR_CREATE_FLAGS
/* disqualify ourselves if we cannot allocate contiguous
* pages at a fixed address
*/
if (mca_sshmem_verbs_component.has_shared_mr == 0)
rc = OSHMEM_ERR_OUT_OF_RESOURCE;
#endif
/* all is well - rainbows and butterflies */
if (!rc) {
*priority = mca_sshmem_verbs_component.priority;
*module = (mca_base_module_t *)&mca_sshmem_verbs_module.super;
}
out:
if (device) {
if (0 < (i = opal_value_array_get_size(&device->ib_mr_array))) {
struct ibv_mr** array;
struct ibv_mr* ib_mr = NULL;
array = OPAL_VALUE_ARRAY_GET_BASE(&device->ib_mr_array, struct ibv_mr *);
/* destruct shared_mr first in order to avoid proc fs race */
for (i--;i >= 0; i--) {
ib_mr = array[i];
ibv_dereg_mr(ib_mr);
opal_value_array_remove_item(&device->ib_mr_array, i);
}
if (device->ib_mr_shared) {
device->ib_mr_shared = NULL;
}
OBJ_DESTRUCT(&device->ib_mr_array);
}
if (device->ib_pd) {
ibv_dealloc_pd(device->ib_pd);
device->ib_pd = NULL;
}
if(device->ib_dev_context) {
ibv_close_device(device->ib_dev_context);
device->ib_dev_context = NULL;
}
if(device->ib_devs) {
ibv_free_device_list(device->ib_devs);
device->ib_devs = NULL;
}
}
return rc;
}
/* ////////////////////////////////////////////////////////////////////////// */
static int
verbs_register(void)
{
int index;
/* ////////////////////////////////////////////////////////////////////// */
/* (default) priority - set high to make verbs the default */
mca_sshmem_verbs_component.priority = 20;
index = mca_base_component_var_register (&mca_sshmem_verbs_component.super.base_version,
"priority", "Priority for sshmem verbs "
"component (default: 20)", MCA_BASE_VAR_TYPE_INT,
NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE,
OPAL_INFO_LVL_3,
MCA_BASE_VAR_SCOPE_ALL_EQ,
&mca_sshmem_verbs_component.priority);
mca_sshmem_verbs_component.hca_name = NULL;
index = mca_base_component_var_register (&mca_sshmem_verbs_component.super.base_version,
"hca_name", "Preferred hca (default: the first)", MCA_BASE_VAR_TYPE_STRING,
NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE,
OPAL_INFO_LVL_3,
MCA_BASE_VAR_SCOPE_READONLY,
&mca_sshmem_verbs_component.hca_name);
if (index) {
(void) mca_base_var_register_synonym(index, "oshmem", "memheap", "base",
"hca_name",
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
/* allow the user to specify an hca port; extract the hca name
* e.g., mlx_4_0:1 is allowed
*/
if (mca_sshmem_verbs_component.hca_name) {
char *p;
p = strchr(mca_sshmem_verbs_component.hca_name, ':');
if (p)
*p = 0;
}
mca_sshmem_verbs_component.mr_interleave_factor = 2;
index = mca_base_component_var_register (&mca_sshmem_verbs_component.super.base_version,
"mr_interleave_factor", "try to give at least N Gbytes spaces between mapped memheaps "
"of other PEs that are local to me (default: 2)", MCA_BASE_VAR_TYPE_INT,
NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE,
OPAL_INFO_LVL_3,
MCA_BASE_VAR_SCOPE_READONLY,
&mca_sshmem_verbs_component.mr_interleave_factor);
if (index) {
(void) mca_base_var_register_synonym(index, "oshmem", "memheap", "base",
"mr_interleave_factor",
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
mca_sshmem_verbs_component.has_shared_mr = -1;
index = mca_base_component_var_register (&mca_sshmem_verbs_component.super.base_version,
"shared_mr", "Shared memory region usage "
"[0 - off, 1 - on, -1 - auto] (default: -1)", MCA_BASE_VAR_TYPE_INT,
NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE,
OPAL_INFO_LVL_4,
MCA_BASE_VAR_SCOPE_ALL_EQ,
&mca_sshmem_verbs_component.has_shared_mr);
return OSHMEM_SUCCESS;
}
/* ////////////////////////////////////////////////////////////////////////// */
static int
verbs_open(void)
{
return OSHMEM_SUCCESS;
}
/* ////////////////////////////////////////////////////////////////////////// */
static int
verbs_query(mca_base_module_t **module, int *priority)
{
*priority = mca_sshmem_verbs_component.priority;
*module = (mca_base_module_t *)&mca_sshmem_verbs_module.super;
return OSHMEM_SUCCESS;
}
/* ////////////////////////////////////////////////////////////////////////// */
static int
verbs_close(void)
{
return OSHMEM_SUCCESS;
}

View file

@ -1,460 +0,0 @@
/*
* Copyright (c) 2014 Mellanox Technologies, Inc.
* All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "oshmem_config.h"
#include <errno.h>
#ifdef HAVE_FCNTL_H
#include <fcntl.h>
#endif /* HAVE_FCNTL_H */
#ifdef HAVE_SYS_MMAN_H
#include <sys/mman.h>
#endif /* HAVE_SYS_MMAN_H */
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif /* HAVE_UNISTD_H */
#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif /* HAVE_SYS_TYPES_H */
#include <string.h>
#ifdef HAVE_NETDB_H
#include <netdb.h>
#endif /* HAVE_NETDB_H */
#include <time.h>
#ifdef HAVE_SYS_STAT_H
#include <sys/stat.h>
#endif /* HAVE_SYS_STAT_H */
#include "opal/constants.h"
#include "opal/util/output.h"
#include "opal/util/path.h"
#include "opal/util/show_help.h"
#include "oshmem/mca/sshmem/sshmem.h"
#include "oshmem/mca/sshmem/base/base.h"
#include "sshmem_verbs.h"
static openib_device_t memheap_device;
/* ////////////////////////////////////////////////////////////////////////// */
/*local functions */
/* local functions */
static int
module_init(void);
static int
segment_create(map_segment_t *ds_buf,
const char *file_name,
size_t size);
static void *
segment_attach(map_segment_t *ds_buf, sshmem_mkey_t *mkey);
static int
segment_detach(map_segment_t *ds_buf, sshmem_mkey_t *mkey);
static int
segment_unlink(map_segment_t *ds_buf);
static int
module_finalize(void);
/*
* verbs sshmem module
*/
mca_sshmem_verbs_module_t mca_sshmem_verbs_module = {
/* super */
{
module_init,
segment_create,
segment_attach,
segment_detach,
segment_unlink,
module_finalize
}
};
/* ////////////////////////////////////////////////////////////////////////// */
static int
module_init(void)
{
/* nothing to do */
return OSHMEM_SUCCESS;
}
/* ////////////////////////////////////////////////////////////////////////// */
static int
module_finalize(void)
{
/* nothing to do */
return OSHMEM_SUCCESS;
}
/* ////////////////////////////////////////////////////////////////////////// */
static int
segment_create(map_segment_t *ds_buf,
const char *file_name,
size_t size)
{
int rc = OSHMEM_SUCCESS;
openib_device_t *device = &memheap_device;
int num_devs = 0;
int i = 0;
assert(ds_buf);
/* init the contents of map_segment_t */
shmem_ds_reset(ds_buf);
memset(device, 0, sizeof(*device));
#ifdef HAVE_IBV_GET_DEVICE_LIST
device->ib_devs = ibv_get_device_list(&num_devs);
#else
#error unsupported ibv_get_device_list in infiniband/verbs.h
#endif
if (num_devs == 0 || !device->ib_devs) {
return OSHMEM_ERR_NOT_SUPPORTED;
}
/* Open device */
if (NULL != mca_sshmem_verbs_component.hca_name) {
for (i = 0; i < num_devs; i++) {
if (0 == strcmp(mca_sshmem_verbs_component.hca_name, ibv_get_device_name(device->ib_devs[i]))) {
device->ib_dev = device->ib_devs[i];
break;
}
}
} else {
device->ib_dev = device->ib_devs[0];
}
if (NULL == device->ib_dev) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error getting device says %d: %s",
errno, strerror(errno))
);
return OSHMEM_ERR_NOT_FOUND;
}
if (NULL == (device->ib_dev_context = ibv_open_device(device->ib_dev))) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error obtaining device context for %s errno says %d: %s",
ibv_get_device_name(device->ib_dev), errno, strerror(errno))
);
return OSHMEM_ERR_RESOURCE_BUSY;
}
/* Obtain device attributes */
if (ibv_query_device(device->ib_dev_context, &device->ib_dev_attr)) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error obtaining device attributes for %s errno says %d: %s",
ibv_get_device_name(device->ib_dev), errno, strerror(errno))
);
return OSHMEM_ERR_RESOURCE_BUSY;
}
/* Allocate the protection domain for the device */
device->ib_pd = ibv_alloc_pd(device->ib_dev_context);
if (NULL == device->ib_pd) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error allocating protection domain for %s errno says %d: %s",
ibv_get_device_name(device->ib_dev), errno, strerror(errno))
);
return OSHMEM_ERR_RESOURCE_BUSY;
}
/* Allocate memory */
if (!rc) {
void *addr = NULL;
struct ibv_mr *ib_mr = NULL;
uint64_t access_flag = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ;
uint64_t exp_access_flag = 0;
OBJ_CONSTRUCT(&device->ib_mr_array, opal_value_array_t);
opal_value_array_init(&device->ib_mr_array, sizeof(struct ibv_mr *));
#if (MPAGE_ENABLE > 0)
exp_access_flag = IBV_EXP_ACCESS_ALLOCATE_MR |
IBV_EXP_ACCESS_SHARED_MR_USER_READ |
IBV_EXP_ACCESS_SHARED_MR_USER_WRITE;
#endif /* MPAGE_ENABLE */
struct ibv_exp_reg_mr_in in = {device->ib_pd, addr, size, access_flag|exp_access_flag, 0};
#if MPAGE_HAVE_IBV_EXP_REG_MR_CREATE_FLAGS
if (0 == mca_sshmem_verbs_component.has_shared_mr) {
in.addr = (void *)mca_sshmem_base_start_address;
in.comp_mask = IBV_EXP_REG_MR_CREATE_FLAGS;
in.create_flags = IBV_EXP_REG_MR_CREATE_CONTIG;
in.exp_access = access_flag;
}
#endif
ib_mr = ibv_exp_reg_mr(&in);
if (NULL == ib_mr) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error to ibv_exp_reg_mr() %llu bytes errno says %d: %s",
(unsigned long long)size, errno, strerror(errno))
);
rc = OSHMEM_ERR_OUT_OF_RESOURCE;
} else {
device->ib_mr_shared = ib_mr;
opal_value_array_append_item(&device->ib_mr_array, &ib_mr);
}
#if (MPAGE_ENABLE > 0)
if (!rc && mca_sshmem_verbs_component.has_shared_mr) {
void *addr = NULL;
access_flag = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ|
IBV_EXP_ACCESS_NO_RDMA;
addr = (void *)mca_sshmem_base_start_address;
struct ibv_exp_reg_shared_mr_in in;
mca_sshmem_verbs_fill_shared_mr(&in, device->ib_pd, device->ib_mr_shared->handle, addr, access_flag);
ib_mr = ibv_exp_reg_shared_mr(&in);
if (NULL == ib_mr) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error to ibv_reg_shared_mr() %llu bytes errno says %d: %s has_shared_mr: %d",
(unsigned long long)size, errno, strerror(errno),
mca_sshmem_verbs_component.has_shared_mr
)
);
rc = OSHMEM_ERR_OUT_OF_RESOURCE;
} else {
opal_value_array_append_item(&device->ib_mr_array, &ib_mr);
}
}
#endif /* MPAGE_ENABLE */
if (!rc) {
OPAL_OUTPUT_VERBOSE(
(70, oshmem_sshmem_base_framework.framework_output,
"ibv device %s shared_mr: %d",
ibv_get_device_name(device->ib_dev),
mca_sshmem_verbs_component.has_shared_mr)
);
if (mca_sshmem_verbs_component.has_shared_mr) {
assert(size == device->ib_mr_shared->length);
ds_buf->type = MAP_SEGMENT_ALLOC_IBV;
ds_buf->seg_id = device->ib_mr_shared->handle;
} else {
ds_buf->type = MAP_SEGMENT_ALLOC_IBV_NOSHMR;
ds_buf->seg_id = MAP_SEGMENT_SHM_INVALID;
}
ds_buf->super.va_base = ib_mr->addr;
ds_buf->seg_size = size;
ds_buf->super.va_end = (void*)((uintptr_t)ds_buf->super.va_base + ds_buf->seg_size);
}
}
OPAL_OUTPUT_VERBOSE(
(70, oshmem_sshmem_base_framework.framework_output,
"%s: %s: create %s "
"(id: %d, addr: %p size: %lu)\n",
mca_sshmem_verbs_component.super.base_version.mca_type_name,
mca_sshmem_verbs_component.super.base_version.mca_component_name,
(rc ? "failure" : "successful"),
ds_buf->seg_id, ds_buf->super.va_base, (unsigned long)ds_buf->seg_size)
);
return rc;
}
/* ////////////////////////////////////////////////////////////////////////// */
/**
* segment_attach can only be called after a successful call to segment_create
*/
static void *
segment_attach(map_segment_t *ds_buf, sshmem_mkey_t *mkey)
{
openib_device_t *device = &memheap_device;
static int mr_count = 0;
void *addr = NULL;
assert(ds_buf);
assert(mkey->va_base == 0);
if (MAP_SEGMENT_SHM_INVALID == (int)(mkey->u.key)) {
return (mkey->va_base);
}
/* workaround mtt problem - request aligned addresses */
++mr_count;
addr = (void *)((uintptr_t)mca_sshmem_base_start_address +
mca_sshmem_verbs_component.mr_interleave_factor * 1024ULL * 1024ULL * 1024ULL * mr_count);
{
struct ibv_mr *ib_mr = NULL;
uint64_t access_flag = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ |
IBV_EXP_ACCESS_NO_RDMA;
struct ibv_exp_reg_shared_mr_in in;
mca_sshmem_verbs_fill_shared_mr(&in, device->ib_pd, mkey->u.key, addr, access_flag);
ib_mr = ibv_exp_reg_shared_mr(&in);
if (NULL == ib_mr) {
mkey->va_base = (void *)-1;
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error to ibv_reg_shared_mr() %llu bytes errno says %d: %s",
(unsigned long long)ds_buf->seg_size, errno, strerror(errno))
);
} else {
if (ib_mr->addr != addr) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"Failed to map shared region to address %p got addr %p. Try to increase 'memheap_mr_interleave_factor' from %d",
addr, ib_mr->addr, mca_sshmem_verbs_component.mr_interleave_factor)
);
}
opal_value_array_append_item(&device->ib_mr_array, &ib_mr);
mkey->va_base = ib_mr->addr;
}
}
OPAL_OUTPUT_VERBOSE(
(70, oshmem_sshmem_base_framework.framework_output,
"%s: %s: attach successful "
"(id: %d, addr: %p size: %lu | va_base: 0x%p len: %d key %llx)\n",
mca_sshmem_verbs_component.super.base_version.mca_type_name,
mca_sshmem_verbs_component.super.base_version.mca_component_name,
ds_buf->seg_id, ds_buf->super.va_base, (unsigned long)ds_buf->seg_size,
mkey->va_base, mkey->len, (unsigned long long)mkey->u.key)
);
/* update returned base pointer with an offset that hides our stuff */
return (mkey->va_base);
}
/* ////////////////////////////////////////////////////////////////////////// */
static int
segment_detach(map_segment_t *ds_buf, sshmem_mkey_t *mkey)
{
int rc = OSHMEM_SUCCESS;
openib_device_t *device = &memheap_device;
int i;
assert(ds_buf);
OPAL_OUTPUT_VERBOSE(
(70, oshmem_sshmem_base_framework.framework_output,
"%s: %s: detaching "
"(id: %d, addr: %p size: %lu)\n",
mca_sshmem_verbs_component.super.base_version.mca_type_name,
mca_sshmem_verbs_component.super.base_version.mca_component_name,
ds_buf->seg_id, ds_buf->super.va_base, (unsigned long)ds_buf->seg_size)
);
if (device) {
if (0 < (i = opal_value_array_get_size(&device->ib_mr_array))) {
struct ibv_mr** array;
struct ibv_mr* ib_mr = NULL;
array = OPAL_VALUE_ARRAY_GET_BASE(&device->ib_mr_array, struct ibv_mr *);
for (i--;i >= 0; i--) {
ib_mr = array[i];
if(ibv_dereg_mr(ib_mr)) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error ibv_dereg_mr(): %d: %s",
errno, strerror(errno))
);
rc = OSHMEM_ERROR;
}
opal_value_array_remove_item(&device->ib_mr_array, i);
}
if (!rc && device->ib_mr_shared) {
device->ib_mr_shared = NULL;
}
OBJ_DESTRUCT(&device->ib_mr_array);
}
if (!rc && device->ib_pd) {
if (ibv_dealloc_pd(device->ib_pd)) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error ibv_dealloc_pd(): %d: %s",
errno, strerror(errno))
);
rc = OSHMEM_ERROR;
} else {
device->ib_pd = NULL;
}
}
if(!rc && device->ib_dev_context) {
if(ibv_close_device(device->ib_dev_context)) {
OPAL_OUTPUT_VERBOSE(
(5, oshmem_sshmem_base_framework.framework_output,
"error ibv_close_device(): %d: %s",
errno, strerror(errno))
);
rc = OSHMEM_ERROR;
} else {
device->ib_dev_context = NULL;
}
}
if(!rc && device->ib_devs) {
ibv_free_device_list(device->ib_devs);
device->ib_devs = NULL;
}
}
/* reset the contents of the map_segment_t associated with this
* shared memory segment.
*/
shmem_ds_reset(ds_buf);
return rc;
}
/* ////////////////////////////////////////////////////////////////////////// */
static int
segment_unlink(map_segment_t *ds_buf)
{
/* not much unlink work needed for verbs */
OPAL_OUTPUT_VERBOSE(
(70, oshmem_sshmem_base_framework.framework_output,
"%s: %s: unlinking "
"(id: %d, addr: %p size: %lu)\n",
mca_sshmem_verbs_component.super.base_version.mca_type_name,
mca_sshmem_verbs_component.super.base_version.mca_component_name,
ds_buf->seg_id, ds_buf->super.va_base, (unsigned long)ds_buf->seg_size)
);
/* don't completely reset. in particular, only reset
* the id and flip the invalid bit. size and name values will remain valid
* across unlinks. other information stored in flags will remain untouched.
*/
ds_buf->seg_id = MAP_SEGMENT_SHM_INVALID;
/* note: this is only changing the valid bit to 0. */
MAP_SEGMENT_INVALIDATE(ds_buf);
return OSHMEM_SUCCESS;
}