
btl/openib: So long / farewell / it's time to say goodnight

So long BTL openib!  After many years of (mostly) faithful service, it
is time to remove the openib BTL.  It has been fully replaced by other
components, such as the UCX PML and OFI MTL.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Jeff Squyres 2019-01-10 09:37:26 -08:00
parent ead2efb136
Commit 8de786f5a4
46 changed files: 4 additions and 22163 deletions

README

@@ -623,7 +623,6 @@ MPI Functionality and Features
  - portals4
  (2) The ob1 PML and the following BTLs support MPI_THREAD_MULTIPLE:
- - openib (see exception below)
  - self
  - sm
  - smcuda
@@ -632,10 +631,6 @@ MPI Functionality and Features
  - usnic
  - vader (shared memory)
- The openib BTL's RDMACM based connection setup mechanism is also not
- thread safe. The default UDCM method should be used for
- applications requiring MPI_THREAD_MULTIPLE support.
  Currently, MPI File operations are not thread safe even if MPI is
  initialized for MPI_THREAD_MULTIPLE support.
@@ -794,7 +789,8 @@ Network Support
  - In prior versions of Open MPI, InfiniBand and RoCE support was
  provided through the openib BTL and ob1 PML plugins. Starting with
  Open MPI 4.0.0, InfiniBand support through the openib plugin is both
- deprecated and superseded by the ucx PML component.
+ deprecated and superseded by the ucx PML component. The openib BTL
+ was removed in Open MPI v5.0.0.
  While the openib BTL depended on libibverbs, the UCX PML depends on
  the UCX library.
@@ -809,15 +805,6 @@ Network Support
  for OpenSHMEM support, and "--mca osc ucx" for MPI RMA (one-sided)
  operations.
- - Although the ob1 PML+openib BTL is still the default for iWARP and
- RoCE devices, it will reject InfiniBand defaults (by default) so
- that they will use the ucx PML. If using the openib BTL is still
- desired, set the following MCA parameters:
- # Note that "vader" is Open MPI's shared memory BTL
- $ mpirun --mca pml ob1 --mca btl openib,vader,self \
- --mca btl_openib_allow_ib 1 ...
  - The usnic BTL is support for Cisco's usNIC device ("userspace NIC")
  on Cisco UCS servers with the Virtualized Interface Card (VIC).
  Although the usNIC is accessed via the OpenFabrics Libfabric API
@@ -850,8 +837,8 @@ Network Support
  http://lwn.net/Articles/343351/
- - The use of fork() with OpenFabrics-based networks (i.e., the openib
- BTL) is only partially supported, and only on Linux kernels >=
+ - The use of fork() with OpenFabrics-based networks (i.e., the UCX
+ PML) is only partially supported, and only on Linux kernels >=
  v2.6.15 with libibverbs v1.1 or later (first released as part of
  OFED v1.2), per restrictions imposed by the OFED network stack.


@@ -1,131 +0,0 @@
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2007-2014 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2011 NVIDIA Corporation. All rights reserved.
# Copyright (c) 2011 Mellanox Technologies. All rights reserved.
# Copyright (c) 2012 Oak Ridge National Laboratory. All rights reserved
# Copyright (c) 2013 Intel, Inc. All rights reserved.
# Copyright (c) 2016 Research Organization for Information Science
# and Technology (RIST). All rights reserved.
# Copyright (c) 2017 IBM Corporation. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
AM_CPPFLAGS = $(btl_openib_CPPFLAGS)
AM_LFLAGS = -Pbtl_openib_ini_yy
LEX_OUTPUT_ROOT = lex.btl_openib_ini_yy
amca_paramdir = $(AMCA_PARAM_SETS_DIR)
dist_amca_param_DATA = btl-openib-benchmark
dist_opaldata_DATA = \
help-mpi-btl-openib.txt \
connect/help-mpi-btl-openib-cpc-base.txt \
mca-btl-openib-device-params.ini
sources = \
btl_openib.c \
btl_openib.h \
btl_openib_component.c \
btl_openib_endpoint.c \
btl_openib_endpoint.h \
btl_openib_frag.c \
btl_openib_frag.h \
btl_openib_proc.c \
btl_openib_proc.h \
btl_openib_eager_rdma.h \
btl_openib_lex.h \
btl_openib_lex.l \
btl_openib_mca.c \
btl_openib_mca.h \
btl_openib_ini.c \
btl_openib_ini.h \
btl_openib_async.c \
btl_openib_async.h \
btl_openib_xrc.c \
btl_openib_xrc.h \
btl_openib_ip.h \
btl_openib_ip.c \
btl_openib_put.c \
btl_openib_get.c \
btl_openib_atomic.c \
connect/base.h \
connect/btl_openib_connect_base.c \
connect/btl_openib_connect_empty.c \
connect/btl_openib_connect_empty.h \
connect/connect.h
# If we have rdmacm support, build that CPC
if MCA_btl_openib_have_rdmacm
sources += \
connect/btl_openib_connect_rdmacm.c \
connect/btl_openib_connect_rdmacm.h
dist_opaldata_DATA += connect/help-mpi-btl-openib-cpc-rdmacm.txt
endif
# If we have udcm support, build that CPC
if MCA_btl_openib_have_udcm
sources += \
connect/btl_openib_connect_udcm.c \
connect/btl_openib_connect_udcm.h
# dist_opaldata_DATA += connect/help-mpi-btl-openib-cpc-ud.txt
endif
# If we have dynamic SL support, build those files
if MCA_btl_openib_have_dynamic_sl
sources += \
connect/btl_openib_connect_sl.c \
connect/btl_openib_connect_sl.h
endif
# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).
if MCA_BUILD_opal_btl_openib_DSO
lib =
lib_sources =
component = mca_btl_openib.la
component_sources = $(sources)
else
lib = libmca_btl_openib.la
lib_sources = $(sources)
component =
component_sources =
endif
mcacomponentdir = $(opallibdir)
mcacomponent_LTLIBRARIES = $(component)
mca_btl_openib_la_SOURCES = $(component_sources)
mca_btl_openib_la_LDFLAGS = -module -avoid-version $(btl_openib_LDFLAGS)
mca_btl_openib_la_LIBADD = $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la \
$(btl_openib_LIBS) \
$(OPAL_TOP_BUILDDIR)/opal/mca/common/verbs/lib@OPAL_LIB_PREFIX@mca_common_verbs.la
if OPAL_cuda_support
mca_btl_openib_la_LIBADD += \
$(OPAL_TOP_BUILDDIR)/opal/mca/common/cuda/lib@OPAL_LIB_PREFIX@mca_common_cuda.la
endif
noinst_LTLIBRARIES = $(lib)
libmca_btl_openib_la_SOURCES = $(lib_sources)
libmca_btl_openib_la_LDFLAGS= -module -avoid-version $(btl_openib_LDFLAGS)
libmca_btl_openib_la_LIBADD = $(btl_openib_LIBS)
maintainer-clean-local:
rm -f btl_openib_lex.c


@@ -1,19 +0,0 @@
#
# These values are suitable for benchmarking with the openib and sm
# btls with a small number of MPI processes. If you're only going to
# use one process per node, remove "sm". These values are *NOT*
# scalable to large numbers of processes!
#
btl=openib,self,sm
btl_openib_max_btls=20
btl_openib_rd_num=128
btl_openib_rd_low=75
btl_openib_rd_win=50
btl_openib_max_eager_rdma=32
mpool_base_use_mem_hooks=1
mpi_leave_pinned=1
#
# Note that we are not limiting the max free list size, so for netpipe
# (for example), this is no problem. But we may want to explore the
# parameter space for other popular benchmarks.
#

File diff not shown because it is too large.


@@ -1,933 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2009 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2007 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2011 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2013-2014 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015-2018 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_BTL_IB_H
#define MCA_BTL_IB_H
#include "opal_config.h"
#include <sys/types.h>
#include <string.h>
#include <infiniband/verbs.h>
/* Open MPI includes */
#include "opal/class/opal_pointer_array.h"
#include "opal/class/opal_hash_table.h"
#include "opal/util/arch.h"
#include "opal/util/output.h"
#include "opal/mca/event/event.h"
#include "opal/threads/threads.h"
#include "opal/mca/btl/btl.h"
#include "opal/mca/rcache/rcache.h"
#include "opal/mca/mpool/mpool.h"
#include "opal/mca/btl/base/btl_base_error.h"
#include "opal/mca/btl/base/base.h"
#include "opal/runtime/opal_progress_threads.h"
#include "connect/connect.h"
BEGIN_C_DECLS
#define HAVE_XRC (OPAL_HAVE_CONNECTX_XRC || OPAL_HAVE_CONNECTX_XRC_DOMAINS)
#define ENABLE_DYNAMIC_SL OPAL_ENABLE_DYNAMIC_SL
#define MCA_BTL_IB_LEAVE_PINNED 1
#define IB_DEFAULT_GID_PREFIX 0xfe80000000000000ll
#define MCA_BTL_IB_PKEY_MASK 0x7fff
#define MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT (256)
/*--------------------------------------------------------------------*/
#if OPAL_ENABLE_DEBUG
#define ATTACH() do { \
int i = 0; \
opal_output(0, "WAITING TO DEBUG ATTACH"); \
while (i == 0) sleep(5); \
} while(0);
#else
#define ATTACH()
#endif
/*--------------------------------------------------------------------*/
/**
* Infiniband (IB) BTL component.
*/
enum {
BTL_OPENIB_HP_CQ,
BTL_OPENIB_LP_CQ,
BTL_OPENIB_MAX_CQ,
};
typedef enum {
MCA_BTL_OPENIB_TRANSPORT_IB,
MCA_BTL_OPENIB_TRANSPORT_IWARP,
MCA_BTL_OPENIB_TRANSPORT_RDMAOE,
MCA_BTL_OPENIB_TRANSPORT_UNKNOWN,
MCA_BTL_OPENIB_TRANSPORT_SIZE
} mca_btl_openib_transport_type_t;
typedef enum {
MCA_BTL_OPENIB_PP_QP,
MCA_BTL_OPENIB_SRQ_QP,
MCA_BTL_OPENIB_XRC_QP
} mca_btl_openib_qp_type_t;
struct mca_btl_openib_pp_qp_info_t {
int32_t rd_win;
int32_t rd_rsv;
}; typedef struct mca_btl_openib_pp_qp_info_t mca_btl_openib_pp_qp_info_t;
struct mca_btl_openib_srq_qp_info_t {
int32_t sd_max;
/* The init value for rd_curr_num variables of all SRQs */
int32_t rd_init;
/* The watermark, threshold - if the number of WQEs in SRQ is less then this value =>
the SRQ limit event (IBV_EVENT_SRQ_LIMIT_REACHED) will be generated on corresponding SRQ.
As result the maximal number of pre-posted WQEs on the SRQ will be increased */
int32_t srq_limit;
}; typedef struct mca_btl_openib_srq_qp_info_t mca_btl_openib_srq_qp_info_t;
struct mca_btl_openib_qp_info_t {
mca_btl_openib_qp_type_t type;
size_t size;
int32_t rd_num;
int32_t rd_low;
union {
mca_btl_openib_pp_qp_info_t pp_qp;
mca_btl_openib_srq_qp_info_t srq_qp;
} u;
}; typedef struct mca_btl_openib_qp_info_t mca_btl_openib_qp_info_t;
#define BTL_OPENIB_QP_TYPE(Q) (mca_btl_openib_component.qp_infos[(Q)].type)
#define BTL_OPENIB_QP_TYPE_PP(Q) \
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_PP_QP)
#define BTL_OPENIB_QP_TYPE_SRQ(Q) \
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_SRQ_QP)
#define BTL_OPENIB_QP_TYPE_XRC(Q) \
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_XRC_QP)
typedef enum {
BTL_OPENIB_RQ_SOURCE_DEVICE_INI = MCA_BASE_VAR_SOURCE_MAX,
} btl_openib_receive_queues_source_t;
typedef enum {
BTL_OPENIB_DT_IB,
BTL_OPENIB_DT_IWARP,
BTL_OPENIB_DT_ALL
} btl_openib_device_type_t;
/* The structer for manage all BTL SRQs */
typedef struct mca_btl_openib_srq_manager_t {
opal_mutex_t lock;
/* The keys of this hash table are addresses of
SRQs structures, and the elements are BTL modules
pointers that associated with these SRQs */
opal_hash_table_t srq_addr_table;
} mca_btl_openib_srq_manager_t;
struct mca_btl_openib_component_t {
mca_btl_base_component_3_0_0_t super; /**< base BTL component */
int ib_max_btls;
/**< maximum number of devices available to openib component */
int ib_num_btls;
/**< number of devices available to the openib component */
int ib_allowed_btls;
/**< number of devices allowed to the openib component */
struct mca_btl_openib_module_t **openib_btls;
/**< array of available BTLs */
opal_pointer_array_t devices; /**< array of available devices */
int devices_count;
int ib_free_list_num;
/**< initial size of free lists */
int ib_free_list_max;
/**< maximum size of free lists */
int ib_free_list_inc;
/**< number of elements to alloc when growing free lists */
opal_list_t ib_procs;
/**< list of ib proc structures */
opal_event_t ib_send_event;
/**< event structure for sends */
opal_event_t ib_recv_event;
/**< event structure for recvs */
opal_mutex_t ib_lock;
/**< lock for accessing module state */
char* ib_mpool_hints;
/**< hints for selecting an mpool component */
char *ib_rcache_name;
/**< name of ib registration cache */
uint8_t num_pp_qps; /**< number of pp qp's */
uint8_t num_srq_qps; /**< number of srq qp's */
uint8_t num_xrc_qps; /**< number of xrc qp's */
uint8_t num_qps; /**< total number of qp's */
opal_hash_table_t ib_addr_table; /**< used only for xrc.hash-table that
keeps table of all lids/subnets */
mca_btl_openib_qp_info_t* qp_infos;
size_t eager_limit; /**< Eager send limit of first fragment, in Bytes */
size_t max_send_size; /**< Maximum send size, in Bytes */
uint32_t max_hw_msg_size;/**< Maximum message size for RDMA protocols in Bytes */
uint32_t reg_mru_len; /**< Length of the registration cache most recently used list */
uint32_t use_srq; /**< Use the Shared Receive Queue (SRQ mode) */
uint32_t ib_cq_size[BTL_OPENIB_MAX_CQ]; /**< Max outstanding CQE on the CQ */
int ib_max_inline_data; /**< Max size of inline data */
unsigned int ib_pkey_val;
unsigned int ib_psn;
unsigned int ib_qp_ous_rd_atom;
uint32_t ib_mtu;
unsigned int ib_min_rnr_timer;
unsigned int ib_timeout;
unsigned int ib_retry_count;
unsigned int ib_rnr_retry;
unsigned int ib_max_rdma_dst_ops;
unsigned int ib_service_level;
#if (ENABLE_DYNAMIC_SL)
unsigned int ib_path_record_service_level;
#endif
int use_eager_rdma;
int eager_rdma_threshold; /**< After this number of msg, use RDMA for short messages, always */
int eager_rdma_num;
int32_t max_eager_rdma;
unsigned int btls_per_lid;
unsigned int max_lmc;
int apm_lmc;
int apm_ports;
unsigned int buffer_alignment; /**< Preferred communication buffer alignment in Bytes (must be power of two) */
opal_atomic_int32_t error_counter; /**< Counts number on error events that we got on all devices */
opal_event_base_t *async_evbase; /**< Async event base */
bool use_async_event_thread; /**< Use the async event handler */
mca_btl_openib_srq_manager_t srq_manager; /**< Hash table for all BTL SRQs */
/* declare as an int instead of btl_openib_device_type_t since there is no
guarantee about the size of an enum. this value will be registered as an
integer with the MCA variable system */
int device_type;
bool allow_ib;
char *if_include;
char **if_include_list;
char *if_exclude;
char **if_exclude_list;
char *ipaddr_include;
char *ipaddr_exclude;
/* MCA param btl_openib_receive_queues */
char *receive_queues;
/* Whether we got a non-default value of btl_openib_receive_queues */
mca_base_var_source_t receive_queues_source;
/** Colon-delimited list of filenames for device parameters */
char *device_params_file_names;
/** Whether we're in verbose mode or not */
bool verbose;
/** Whether we want a warning if no device-specific parameters are
found in INI files */
bool warn_no_device_params_found;
/** Whether we want a warning if non default GID prefix is not configured
on multiport setup */
bool warn_default_gid_prefix;
/** Whether we want a warning if the user specifies a non-existent
device and/or port via btl_openib_if_[in|ex]clude MCA params */
bool warn_nonexistent_if;
/** Whether we want to abort if there's not enough registered
memory available */
bool abort_not_enough_reg_mem;
/** Dummy argv-style list; a copy of names from the
if_[in|ex]clude list that we use for error checking (to ensure
that they all exist) */
char **if_list;
bool use_message_coalescing;
unsigned int cq_poll_ratio;
unsigned int cq_poll_progress;
unsigned int cq_poll_batch;
unsigned int eager_rdma_poll_ratio;
int rdma_qp;
int credits_qp; /* qp used for software flow control */
bool cpc_explicitly_defined;
/**< free list of frags only; used for pining user memory */
opal_free_list_t send_user_free;
/**< free list of frags only; used for pining user memory */
opal_free_list_t recv_user_free;
/**< frags for coalesced massages */
opal_free_list_t send_free_coalesced;
/** Default receive queues */
char* default_recv_qps;
/** GID index to use */
int gid_index;
/* Whether we want to allow connecting processes from different subnets.
* set to 'no' by default */
bool allow_different_subnets;
/** Whether we want a dynamically resizing srq, enabled by default */
bool enable_srq_resize;
bool allow_max_memory_registration;
int memory_registration_verbose_level;
int memory_registration_verbose;
int ignore_locality;
#if OPAL_CUDA_SUPPORT
bool cuda_async_send;
bool cuda_async_recv;
bool cuda_have_gdr;
bool driver_have_gdr;
bool cuda_want_gdr;
#endif /* OPAL_CUDA_SUPPORT */
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
bool rroce_enable;
#endif
unsigned int num_default_gid_btls; /* numbers of btl in the default subnet */
}; typedef struct mca_btl_openib_component_t mca_btl_openib_component_t;
OPAL_MODULE_DECLSPEC extern mca_btl_openib_component_t mca_btl_openib_component;
typedef mca_btl_base_recv_reg_t mca_btl_openib_recv_reg_t;
/**
* Common information for all ports that is sent in the modex message
*/
typedef struct mca_btl_openib_modex_message_t {
/** The subnet ID of this port */
uint64_t subnet_id;
/** LID of this port */
uint16_t lid;
/** APM LID for this port */
uint16_t apm_lid;
/** The MTU used by this port */
uint8_t mtu;
/** vendor id define device type and tuning */
uint32_t vendor_id;
/** vendor part id define device type and tuning */
uint32_t vendor_part_id;
/** Transport type of remote port */
uint8_t transport_type;
/** Dummy field used to calculate the real length */
uint8_t end;
} mca_btl_openib_modex_message_t;
#define MCA_BTL_OPENIB_MODEX_MSG_NTOH(hdr) \
do { \
(hdr).subnet_id = ntoh64((hdr).subnet_id); \
(hdr).lid = ntohs((hdr).lid); \
} while (0)
#define MCA_BTL_OPENIB_MODEX_MSG_HTON(hdr) \
do { \
(hdr).subnet_id = hton64((hdr).subnet_id); \
(hdr).lid = htons((hdr).lid); \
} while (0)
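
Editor's note (not part of the removed file): the two macros above convert only the multi-byte fields that cross the wire, subnet_id (64-bit) and lid (16-bit). The sketch below illustrates the same round trip under stated assumptions. The struct and every demo_* name are hypothetical, and demo_hton64/demo_ntoh64 are portable stand-ins written for this note, not Open MPI's own hton64/ntoh64.

```c
#include <arpa/inet.h>  /* htons, ntohs (POSIX) */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>     /* memcpy */

/* Hypothetical stand-in holding only the two byte-swapped modex fields. */
struct demo_modex_msg {
    uint64_t subnet_id;
    uint16_t lid;
};

/* Assumed helper: lay out v so its most significant byte is first in memory. */
uint64_t demo_hton64(uint64_t v)
{
    unsigned char bytes[8];
    uint64_t out;
    for (int i = 0; i < 8; i++) {
        bytes[i] = (unsigned char)(v >> (56 - 8 * i));
    }
    memcpy(&out, bytes, sizeof(out));
    return out;
}

/* Assumed helper: inverse of demo_hton64. */
uint64_t demo_ntoh64(uint64_t v)
{
    unsigned char bytes[8];
    uint64_t out = 0;
    memcpy(bytes, &v, sizeof(bytes));
    for (int i = 0; i < 8; i++) {
        out = (out << 8) | bytes[i];
    }
    return out;
}

/* Convert the message to network byte order before sending. */
void demo_modex_hton(struct demo_modex_msg *m)
{
    assert(m != NULL);
    m->subnet_id = demo_hton64(m->subnet_id);
    m->lid = htons(m->lid);
}

/* Convert a received message back to host byte order. */
void demo_modex_ntoh(struct demo_modex_msg *m)
{
    assert(m != NULL);
    m->subnet_id = demo_ntoh64(m->subnet_id);
    m->lid = ntohs(m->lid);
}
```

A sender would call demo_modex_hton() just before publishing the message and a receiver demo_modex_ntoh() right after reading it, so both sides see the same values regardless of host endianness.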
typedef struct mca_btl_openib_device_qp_t {
opal_free_list_t send_free; /**< free lists of send buffer descriptors */
opal_free_list_t recv_free; /**< free lists of receive buffer descriptors */
} mca_btl_openib_device_qp_t;
struct mca_btl_base_endpoint_t;
typedef struct mca_btl_openib_device_t {
opal_object_t super;
struct ibv_device *ib_dev; /* the ib device */
#if OPAL_ENABLE_PROGRESS_THREADS == 1
struct ibv_comp_channel *ib_channel; /* Channel event for the device */
opal_thread_t thread; /* Progress thread */
volatile bool progress; /* Progress status */
#endif
opal_mutex_t device_lock; /* device level lock */
struct ibv_context *ib_dev_context;
#if HAVE_DECL_IBV_EXP_QUERY_DEVICE
struct ibv_exp_device_attr ib_exp_dev_attr;
#endif
struct ibv_device_attr ib_dev_attr;
struct ibv_pd *ib_pd;
struct ibv_cq *ib_cq[BTL_OPENIB_MAX_CQ];
uint32_t cq_size[BTL_OPENIB_MAX_CQ];
mca_mpool_base_module_t *mpool;
mca_rcache_base_module_t *rcache;
/* MTU for this device */
uint32_t mtu;
/* Whether this device supports eager RDMA */
uint8_t use_eager_rdma;
uint8_t btls; /** < number of btls using this device */
opal_pointer_array_t *endpoints;
opal_pointer_array_t *device_btls;
uint16_t hp_cq_polls;
uint16_t eager_rdma_polls;
bool pollme;
volatile bool got_fatal_event;
volatile bool got_port_event;
#if HAVE_XRC
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
struct ibv_xrcd *xrcd;
#else
struct ibv_xrc_domain *xrc_domain;
#endif
int xrc_fd;
#endif
opal_atomic_int32_t non_eager_rdma_endpoints;
opal_atomic_int32_t eager_rdma_buffers_count;
struct mca_btl_base_endpoint_t **eager_rdma_buffers;
/**< frags for control massages */
opal_free_list_t send_free_control;
/* QP types and attributes that will be used on this device */
mca_btl_openib_device_qp_t *qps;
/* Maximum value supported by this device for max_inline_data */
uint32_t max_inline_data;
/* Registration limit and current count */
uint64_t mem_reg_max, mem_reg_max_total, mem_reg_active;
/* Device is ready for use */
bool ready_for_use;
/* Async event */
opal_event_t async_event;
} mca_btl_openib_device_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_device_t);
struct mca_btl_openib_module_pp_qp_t {
int32_t dummy;
}; typedef struct mca_btl_openib_module_pp_qp_t mca_btl_openib_module_pp_qp_t;
struct mca_btl_openib_module_srq_qp_t {
struct ibv_srq *srq;
opal_atomic_int32_t rd_posted;
opal_atomic_int32_t sd_credits; /* the max number of outstanding sends on a QP when using SRQ */
/* i.e. the number of frags that can be outstanding (down counter) */
opal_list_t pending_frags[2]; /**< list of high/low prio frags */
/** The number of receive buffers that can be post in the current time.
The value may be increased in the IBV_EVENT_SRQ_LIMIT_REACHED
event handler. The value starts from (rd_num / 4) and increased up to rd_num */
int32_t rd_curr_num;
/** We post additional WQEs only if a number of WQEs (in specific SRQ) is less of this value.
The value increased together with rd_curr_num. The value is unique for every SRQ. */
int32_t rd_low_local;
/** The flag points if we want to get the
IBV_EVENT_SRQ_LIMIT_REACHED events for dynamically resizing SRQ */
bool srq_limit_event_flag;
/**< In difference of the "--mca enable_srq_resize" parameter that says, if we want(or no)
to start with small num of pre-posted receive buffers (rd_curr_num) and to increase this number by needs
(the max of this value is rd_num * the whole size of SRQ), the "srq_limit_event_flag" says if we want to get limit event
from device if the defined srq limit was reached (signal to the main thread) and we put off this flag if the rd_curr_num
was increased up to rd_num.
In order to prevent lock/unlock operation in the critical path we prefer only put-on
the srq_limit_event_flag in asynchronous thread, because in this way we post receive buffers
in the main thread only and only after posting we set (if srq_limit_event_flag is true)
the limit for IBV_EVENT_SRQ_LIMIT_REACHED event. */
}; typedef struct mca_btl_openib_module_srq_qp_t mca_btl_openib_module_srq_qp_t;
struct mca_btl_openib_module_qp_t {
union {
mca_btl_openib_module_pp_qp_t pp_qp;
mca_btl_openib_module_srq_qp_t srq_qp;
} u;
}; typedef struct mca_btl_openib_module_qp_t mca_btl_openib_module_qp_t;
/**
* IB BTL Interface
*/
struct mca_btl_openib_module_t {
/* Base BTL module */
mca_btl_base_module_t super;
bool btl_inited;
bool srqs_created;
/** Common information about all ports */
mca_btl_openib_modex_message_t port_info;
/** Array of CPCs on this port */
opal_btl_openib_connect_base_module_t **cpcs;
/** Number of elements in the cpcs array */
uint8_t num_cpcs;
mca_btl_openib_device_t *device;
uint8_t port_num; /**< ID of the PORT */
uint16_t pkey_index;
struct ibv_port_attr ib_port_attr;
uint16_t lid; /**< lid that is actually used (for LMC) */
int apm_port; /**< Alternative port that may be used for APM */
uint8_t src_path_bits; /**< offset from base lid (for LMC) */
opal_atomic_int32_t num_peers;
opal_mutex_t ib_lock; /**< module level lock */
size_t eager_rdma_frag_size; /**< length of eager frag */
opal_atomic_int32_t eager_rdma_channels; /**< number of open RDMA channels */
mca_btl_base_module_error_cb_fn_t error_cb; /**< error handler */
mca_btl_openib_module_qp_t * qps;
int local_procs; /** number of local procs */
bool atomic_ops_be; /** atomic result is big endian */
bool allowed; /** is this port allowed */
};
typedef struct mca_btl_openib_module_t mca_btl_openib_module_t;
extern mca_btl_openib_module_t mca_btl_openib_module;
struct mca_btl_base_registration_handle_t {
uint32_t rkey;
uint32_t lkey;
};
struct mca_btl_openib_reg_t {
mca_rcache_base_registration_t base;
struct ibv_mr *mr;
mca_btl_base_registration_handle_t btl_handle;
};
typedef struct mca_btl_openib_reg_t mca_btl_openib_reg_t;
#if OPAL_ENABLE_PROGRESS_THREADS == 1
extern void* mca_btl_openib_progress_thread(opal_object_t*);
#endif
/**
* Register a callback function that is called on error..
*
* @param btl (IN) BTL module
* @return Status indicating if cleanup was successful
*/
int mca_btl_openib_register_error_cb(
struct mca_btl_base_module_t* btl,
mca_btl_base_module_error_cb_fn_t cbfunc
);
/**
* Cleanup any resources held by the BTL.
*
* @param btl BTL instance.
* @return OPAL_SUCCESS or error status on failure.
*/
extern int mca_btl_openib_finalize(
struct mca_btl_base_module_t* btl
);
/**
* PML->BTL notification of change in the process list.
*
* @param btl (IN) BTL module
* @param nprocs (IN) Number of processes
* @param procs (IN) Set of processes
* @param peers (OUT) Set of (optional) peer addressing info.
* @param reachable (IN/OUT) Set of processes that are reachable via this BTL.
* @return OPAL_SUCCESS or error status on failure.
*
*/
extern int mca_btl_openib_add_procs(
struct mca_btl_base_module_t* btl,
size_t nprocs,
struct opal_proc_t **procs,
struct mca_btl_base_endpoint_t** peers,
opal_bitmap_t* reachable
);
/**
* PML->BTL notification of change in the process list.
*
* @param btl (IN) BTL instance
* @param nproc (IN) Number of processes.
* @param procs (IN) Set of processes.
* @param peers (IN) Set of peer data structures.
* @return Status indicating if cleanup was successful
*
*/
extern int mca_btl_openib_del_procs(
struct mca_btl_base_module_t* btl,
size_t nprocs,
struct opal_proc_t **procs,
struct mca_btl_base_endpoint_t** peers
);
/**
* PML->BTL Initiate a send of the specified size.
*
* @param btl (IN) BTL instance
* @param btl_peer (IN) BTL peer addressing
* @param descriptor (IN) Descriptor of data to be transmitted.
* @param tag (IN) Tag.
*/
extern int mca_btl_openib_send(
struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* btl_peer,
struct mca_btl_base_descriptor_t* descriptor,
mca_btl_base_tag_t tag
);
/**
* PML->BTL Initiate a immediate send of the specified size.
*
* @param btl (IN) BTL instance
* @param ep (IN) Endpoint
* @param convertor (IN) Datatypes converter
* @param header (IN) PML header
* @param header_size (IN) PML header size
* @param payload_size (IN) Payload size
* @param order (IN) Order
* @param flags (IN) Flags
* @param tag (IN) Tag
* @param descriptor (OUT) Messages descriptor
*/
extern int mca_btl_openib_sendi( struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* ep,
struct opal_convertor_t* convertor,
void* header,
size_t header_size,
size_t payload_size,
uint8_t order,
uint32_t flags,
mca_btl_base_tag_t tag,
mca_btl_base_descriptor_t** descriptor
);
/* forward decaration for internal put/get */
struct mca_btl_openib_put_frag_t;
struct mca_btl_openib_get_frag_t;
/**
* @brief Schedule a put fragment with the HCA (internal)
*
* @param btl (IN) BTL instance
* @param ep (IN) BTL endpoint
* @param frag (IN) Fragment prepared by mca_btl_openib_put
*
* If the fragment can not be scheduled due to resource limitations then
* the fragment will be put on the pending put fragment list and retried
* when another get/put fragment has completed.
*/
int mca_btl_openib_put_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
struct mca_btl_openib_put_frag_t *frag);
/**
* @brief Schedule an RDMA write with the HCA
*
* @param btl (IN) BTL instance
* @param ep (IN) BTL endpoint
* @param local_address (IN) Source address
* @param remote_address (IN) Destination address
* @param local_handle (IN) Registration handle for region containing the region {local_address, size}
* @param remote_handle (IN) Registration handle for region containing the region {remote_address, size}
* @param size (IN) Number of bytes to write
* @param flags (IN) Transfer flags
* @param order (IN) Ordering
* @param cbfunc (IN) Function to call on completion
* @param cbcontext (IN) Context for completion callback
* @param cbdata (IN) Data for completion callback
*
* @return OPAL_ERR_BAD_PARAM if a bad parameter was passed
* @return OPAL_SUCCCESS if the operation was successfully scheduled
*
* This function will attempt to schedule a put operation with the HCA.
*/
int mca_btl_openib_put (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint, void *local_address,
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);
/**
* @brief Schedule a get fragment with the HCA (internal)
*
* @param btl (IN) BTL instance
* @param ep (IN) BTL endpoint
* @param qp (IN) ID of queue pair to schedule the get on
* @param frag (IN) Fragment prepared by mca_btl_openib_get
*
* If the fragment can not be scheduled due to resource limitations then
* the fragment will be put on the pending get fragment list and retried
* when another get/put fragment has completed.
*/
int mca_btl_openib_get_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
struct mca_btl_openib_get_frag_t *frag);
/**
* @brief Schedule an RDMA read with the HCA
*
* @param btl (IN) BTL instance
* @param ep (IN) BTL endpoint
* @param local_address (IN) Destination address
* @param remote_address (IN) Source address
* @param local_handle (IN) Registration handle for region containing the region {local_address, size}
* @param remote_handle (IN) Registration handle for region containing the region {remote_address, size}
* @param size (IN) Number of bytes to read
* @param flags (IN) Transfer flags
* @param order (IN) Ordering
* @param cbfunc (IN) Function to call on completion
* @param cbcontext (IN) Context for completion callback
* @param cbdata (IN) Data for completion callback
*
* @return OPAL_ERR_BAD_PARAM if a bad parameter was passed
* @return OPAL_SUCCCESS if the operation was successfully scheduled
*
* This function will attempt to schedule a get operation with the HCA.
*/
int mca_btl_openib_get (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint, void *local_address,
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);
/**
* Initiate an asynchronous fetching atomic operation.
* Completion Semantics: if this function returns a 1 then the operation
* is complete. a return of OPAL_SUCCESS indicates
* the atomic operation has been queued with the
* network.
*
* @param btl (IN) BTL module
* @param endpoint (IN) BTL addressing information
* @param local_address (OUT) Local address to store the result in
* @param remote_address (IN) Remote address perfom operation on to (registered remotely)
* @param local_handle (IN) Local registration handle for region containing
* (local_address, local_address + 8)
* @param remote_handle (IN) Remote registration handle for region containing
* (remote_address, remote_address + 8)
* @param op (IN) Operation to perform
* @param operand (IN) Operand for the operation
* @param flags (IN) Flags for this put operation
* @param order (IN) Ordering
* @param cbfunc (IN) Function to call on completion (if queued)
* @param cbcontext (IN) Context for the callback
* @param cbdata (IN) Data for callback
*
* @retval OPAL_SUCCESS The operation was successfully queued
* @retval 1 The operation is complete
* @retval OPAL_ERROR The operation was NOT successfully queued
* @retval OPAL_ERR_OUT_OF_RESOURCE Insufficient resources to queue the atomic
* operation. Try again later
* @retval OPAL_ERR_NOT_AVAILABLE Atomic operation can not be performed due to
* alignment restrictions or the operation {op} is not supported
* by the hardware.
*
* After the operation is complete the remote address specified by {remote_address} and
* {remote_handle} will be updated with (*remote_address) = (*remote_address) op operand.
* {local_address} will be updated with the previous value stored in {remote_address}.
* The btl will guarantee consistency of atomic operations performed via the btl. Note,
* however, that not all btls will provide consistency between btl atomic operations and
* cpu atomics.
*/
int mca_btl_openib_atomic_fop (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address,
struct mca_btl_base_registration_handle_t *local_handle,
struct mca_btl_base_registration_handle_t *remote_handle, mca_btl_base_atomic_op_t op,
uint64_t operand, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata);
/**
* Initiate an asynchronous compare and swap operation.
* Completion Semantics: if this function returns a 1 then the operation
* is complete. A return of OPAL_SUCCESS indicates
* the atomic operation has been queued with the
* network.
*
* @param btl (IN) BTL module
* @param endpoint (IN) BTL addressing information
* @param local_address (OUT) Local address to store the result in
* @param remote_address (IN) Remote address to perform the operation on (registered remotely)
* @param local_handle (IN) Local registration handle for region containing
* (local_address, local_address + 8)
* @param remote_handle (IN) Remote registration handle for region containing
* (remote_address, remote_address + 8)
* @param compare (IN) Value to compare against the remote value
* @param value (IN) Value to store on success
* @param flags (IN) Flags for this atomic operation
* @param order (IN) Ordering
* @param cbfunc (IN) Function to call on completion (if queued)
* @param cbcontext (IN) Context for the callback
* @param cbdata (IN) Data for callback
*
* @retval OPAL_SUCCESS The operation was successfully queued
* @retval 1 The operation is complete
* @retval OPAL_ERROR The operation was NOT successfully queued
* @retval OPAL_ERR_OUT_OF_RESOURCE Insufficient resources to queue the atomic
* operation. Try again later
* @retval OPAL_ERR_NOT_AVAILABLE Atomic operation can not be performed due to
* alignment restrictions or the operation is not supported
* by the hardware.
*
* After the operation is complete the remote address specified by {remote_address} and
* {remote_handle} will be updated with {value} if *remote_address == compare.
* {local_address} will be updated with the previous value stored in {remote_address}.
* The btl will guarantee consistency of atomic operations performed via the btl. Note,
* however, that not all btls will provide consistency between btl atomic operations and
* cpu atomics.
*/
int mca_btl_openib_atomic_cswap (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address,
struct mca_btl_base_registration_handle_t *local_handle,
struct mca_btl_base_registration_handle_t *remote_handle, uint64_t compare,
uint64_t value, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata);
/**
* Allocate a descriptor.
*
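* @param btl (IN) BTL module
* @param endpoint (IN) BTL addressing information
* @param order (IN) Ordering
* @param size (IN) Requested descriptor size.
* @param flags (IN) Flags for the descriptor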
*/
extern mca_btl_base_descriptor_t* mca_btl_openib_alloc(
struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* endpoint,
uint8_t order,
size_t size,
uint32_t flags);
/**
* Return a segment allocated by this BTL.
*
* @param btl (IN) BTL module
* @param des (IN) Allocated descriptor.
*/
extern int mca_btl_openib_free(
struct mca_btl_base_module_t* btl,
mca_btl_base_descriptor_t* des);
/**
* Pack data and return a descriptor that can be
* used for send/put.
*
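* @param btl (IN) BTL module
* @param peer (IN) BTL peer addressing
* @param convertor (IN) Data type convertor
* @param order (IN) Ordering
* @param reserve (IN) Additional bytes requested by the upper layer to precede user data
* @param size (IN/OUT) Number of bytes to prepare (IN), number of bytes actually prepared (OUT)
* @param flags (IN) Flags for the descriptor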
*/
mca_btl_base_descriptor_t* mca_btl_openib_prepare_src(
struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* peer,
struct opal_convertor_t* convertor,
uint8_t order,
size_t reserve,
size_t* size,
uint32_t flags
);
extern void mca_btl_openib_frag_progress_pending_put_get(
struct mca_btl_base_endpoint_t*, const int);
/**
* Fault Tolerance Event Notification Function
*
* @param state (IN) Checkpoint State
* @return OPAL_SUCCESS or failure status
*/
extern int mca_btl_openib_ft_event(int state);
/**
* Show an error during init, particularly when running out of
* registered memory.
*/
void mca_btl_openib_show_init_error(const char *file, int line,
const char *func, const char *dev);
/**
* Post to Shared Receive Queue with certain priority
*
* @param openib_btl (IN) BTL module
* @param qp (IN) Index of the QP whose shared receive queue is posted to
* @return OPAL_SUCCESS or failure status
*/
int mca_btl_openib_post_srr(mca_btl_openib_module_t* openib_btl, const int qp);
/**
* Get the transport name of a BTL from its transport type.
*/
const char* btl_openib_get_transport_name(mca_btl_openib_transport_type_t transport_type);
/**
* Get an endpoint for a process
*
* @param btl (IN) BTL module
* @param proc (IN) opal process object
*
* This function will return an existing endpoint if one exists otherwise it will allocate
* a new endpoint and return it.
*/
struct mca_btl_base_endpoint_t *mca_btl_openib_get_ep (struct mca_btl_base_module_t *btl,
struct opal_proc_t *proc);
/**
* Get the transport type of a BTL.
*/
mca_btl_openib_transport_type_t mca_btl_openib_get_transport_type(mca_btl_openib_module_t* openib_btl);
static inline int qp_cq_prio(const int qp)
{
if(0 == qp)
return BTL_OPENIB_HP_CQ; /* smallest qp is always HP */
/* If the size for this qp is <= the eager limit, make it a
high priority QP. Otherwise, make it a low priority QP. */
return (mca_btl_openib_component.qp_infos[qp].size <=
mca_btl_openib_component.eager_limit) ?
BTL_OPENIB_HP_CQ : BTL_OPENIB_LP_CQ;
}
#define BTL_OPENIB_RDMA_QP(QP) \
((QP) == mca_btl_openib_component.rdma_qp)
/**
* Run function as part of opal_progress()
*
* @param[in] fn function to run
* @param[in] arg function data
*/
int mca_btl_openib_run_in_main (void *(*fn)(void *), void *arg);
END_C_DECLS
#endif /* MCA_BTL_IB_H */
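
The completion-queue priority rule in qp_cq_prio() above can be sketched on its own as follows. This is only an illustration under stated assumptions: sketch_cq_prio, SKETCH_HP_CQ, SKETCH_LP_CQ, and the array of QP sizes are hypothetical stand-ins for the component state (qp_infos[].size and eager_limit) and are not part of the openib BTL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for BTL_OPENIB_HP_CQ / BTL_OPENIB_LP_CQ. */
enum sketch_cq { SKETCH_HP_CQ = 0, SKETCH_LP_CQ = 1 };

/* Same rule as qp_cq_prio(): QP 0 is always high priority; any other QP
 * is high priority only if its buffer size is <= the eager limit. */
enum sketch_cq sketch_cq_prio(const size_t *qp_sizes, int num_qps,
                              size_t eager_limit, int qp)
{
    assert(qp >= 0 && qp < num_qps);
    if (0 == qp) {
        return SKETCH_HP_CQ;
    }
    return (qp_sizes[qp] <= eager_limit) ? SKETCH_HP_CQ : SKETCH_LP_CQ;
}
```

With example sizes {128, 4096, 65536} and an eager limit of 12288 bytes (values chosen only for illustration), the first two QPs map to the high-priority CQ and the third to the low-priority CQ.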


@ -1,508 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2008-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved
* Copyright (c) 2013-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <infiniband/verbs.h>
#include <fcntl.h>
#include <sys/poll.h>
#include <unistd.h>
#include <errno.h>
#include "opal/util/show_help.h"
#include "opal/util/proc.h"
#include "opal/mca/btl/base/base.h"
#include "btl_openib.h"
#include "btl_openib_mca.h"
#include "btl_openib_async.h"
#include "btl_openib_proc.h"
#include "btl_openib_endpoint.h"
static opal_list_t ignore_qp_err_list;
static opal_mutex_t ignore_qp_err_list_lock;
static opal_atomic_int32_t btl_openib_async_device_count = 0;
struct mca_btl_openib_async_poll {
int active_poll_size;
int poll_size;
struct pollfd *async_pollfd;
};
typedef struct mca_btl_openib_async_poll mca_btl_openib_async_poll;
typedef struct {
opal_list_item_t super;
struct ibv_qp *qp;
} mca_btl_openib_qp_list;
OBJ_CLASS_INSTANCE(mca_btl_openib_qp_list, opal_list_item_t, NULL, NULL);
static const char *openib_event_to_str (enum ibv_event_type event);
/* Convert an event type to its name as a string.
* OpenFabrics does not provide a function that does this job :(
*/
static const char *openib_event_to_str (enum ibv_event_type event)
{
switch (event) {
case IBV_EVENT_CQ_ERR:
return "IBV_EVENT_CQ_ERR";
case IBV_EVENT_QP_FATAL:
return "IBV_EVENT_QP_FATAL";
case IBV_EVENT_QP_REQ_ERR:
return "IBV_EVENT_QP_REQ_ERR";
case IBV_EVENT_QP_ACCESS_ERR:
return "IBV_EVENT_QP_ACCESS_ERR";
case IBV_EVENT_PATH_MIG:
return "IBV_EVENT_PATH_MIG";
case IBV_EVENT_PATH_MIG_ERR:
return "IBV_EVENT_PATH_MIG_ERR";
case IBV_EVENT_DEVICE_FATAL:
return "IBV_EVENT_DEVICE_FATAL";
case IBV_EVENT_SRQ_ERR:
return "IBV_EVENT_SRQ_ERR";
case IBV_EVENT_PORT_ERR:
return "IBV_EVENT_PORT_ERR";
case IBV_EVENT_COMM_EST:
return "IBV_EVENT_COMM_EST";
case IBV_EVENT_PORT_ACTIVE:
return "IBV_EVENT_PORT_ACTIVE";
case IBV_EVENT_SQ_DRAINED:
return "IBV_EVENT_SQ_DRAINED";
case IBV_EVENT_LID_CHANGE:
return "IBV_EVENT_LID_CHANGE";
case IBV_EVENT_PKEY_CHANGE:
return "IBV_EVENT_PKEY_CHANGE";
case IBV_EVENT_SM_CHANGE:
return "IBV_EVENT_SM_CHANGE";
case IBV_EVENT_QP_LAST_WQE_REACHED:
return "IBV_EVENT_QP_LAST_WQE_REACHED";
#if HAVE_DECL_IBV_EVENT_CLIENT_REREGISTER
case IBV_EVENT_CLIENT_REREGISTER:
return "IBV_EVENT_CLIENT_REREGISTER";
#endif
case IBV_EVENT_SRQ_LIMIT_REACHED:
return "IBV_EVENT_SRQ_LIMIT_REACHED";
default:
return "UNKNOWN";
}
}
/* QP to endpoint */
static mca_btl_openib_endpoint_t * qp2endpoint(struct ibv_qp *qp, mca_btl_openib_device_t *device)
{
mca_btl_openib_endpoint_t *ep;
int ep_i, qp_i;
for(ep_i = 0; ep_i < opal_pointer_array_get_size(device->endpoints); ep_i++) {
ep = opal_pointer_array_get_item(device->endpoints, ep_i);
for(qp_i = 0; qp_i < mca_btl_openib_component.num_qps; qp_i++) {
if (qp == ep->qps[qp_i].qp->lcl_qp)
return ep;
}
}
return NULL;
}
#if OPAL_HAVE_CONNECTX_XRC
/* XRC receive QP to endpoint */
static mca_btl_openib_endpoint_t * xrc_qp2endpoint(uint32_t qp_num, mca_btl_openib_device_t *device)
{
mca_btl_openib_endpoint_t *ep;
int ep_i;
for(ep_i = 0; ep_i < opal_pointer_array_get_size(device->endpoints); ep_i++) {
ep = opal_pointer_array_get_item(device->endpoints, ep_i);
if (qp_num == ep->xrc_recv_qp_num)
return ep;
}
return NULL;
}
#endif
/* Function inits mca_btl_openib_async_poll */
/* The main idea of the SRQ resizing algorithm:
We create an SRQ with size = rd_num, but for efficient use of resources
the number of WQEs that we post = rd_curr_num < rd_num. This value is
increased as needed in the IBV_EVENT_SRQ_LIMIT_REACHED event handler (i.e. in this function).
The device raises the event when the number of WQEs in the SRQ drops below srq_limit. */
static int btl_openib_async_srq_limit_event(struct ibv_srq* srq)
{
int qp, rc = OPAL_SUCCESS;
mca_btl_openib_module_t *openib_btl = NULL;
opal_mutex_t *lock = &mca_btl_openib_component.srq_manager.lock;
opal_hash_table_t *srq_addr_table = &mca_btl_openib_component.srq_manager.srq_addr_table;
opal_mutex_lock(lock);
if (OPAL_SUCCESS != opal_hash_table_get_value_ptr(srq_addr_table,
&srq, sizeof(struct ibv_srq*), (void*) &openib_btl)) {
/* If there isn't any element with the key in the table =>
we assume that SRQ was destroyed and don't serve the event */
goto srq_limit_event_exit;
}
for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++) {
if (!BTL_OPENIB_QP_TYPE_PP(qp)) {
if(openib_btl->qps[qp].u.srq_qp.srq == srq) {
break;
}
}
}
if(qp >= mca_btl_openib_component.num_qps) {
BTL_ERROR(("Open MPI tried to access a shared receive queue (SRQ) on the device %s that was not found. This should not happen, and is a fatal error. Your MPI job will now abort.\n", ibv_get_device_name(openib_btl->device->ib_dev)));
rc = OPAL_ERROR;
goto srq_limit_event_exit;
}
/* dynamically re-size the SRQ to be larger */
openib_btl->qps[qp].u.srq_qp.rd_curr_num <<= 1;
if(openib_btl->qps[qp].u.srq_qp.rd_curr_num >=
mca_btl_openib_component.qp_infos[qp].rd_num) {
openib_btl->qps[qp].u.srq_qp.rd_curr_num = mca_btl_openib_component.qp_infos[qp].rd_num;
openib_btl->qps[qp].u.srq_qp.rd_low_local = mca_btl_openib_component.qp_infos[qp].rd_low;
openib_btl->qps[qp].u.srq_qp.srq_limit_event_flag = false;
goto srq_limit_event_exit;
}
openib_btl->qps[qp].u.srq_qp.rd_low_local <<= 1;
openib_btl->qps[qp].u.srq_qp.srq_limit_event_flag = true;
srq_limit_event_exit:
opal_mutex_unlock(lock);
return rc;
}
/* Handle asynchronous device events */
static void btl_openib_async_device (int fd, short flags, void *arg)
{
mca_btl_openib_device_t *device = (mca_btl_openib_device_t *) arg;
struct ibv_async_event event;
int event_type;
if (ibv_get_async_event((struct ibv_context *)device->ib_dev_context,&event) < 0) {
if (EWOULDBLOCK != errno) {
BTL_ERROR(("Failed to get async event"));
}
return;
}
event_type = event.event_type;
#if OPAL_HAVE_CONNECTX_XRC
/* Is it an XRC event? */
bool xrc_event = false;
if (IBV_XRC_QP_EVENT_FLAG & event.event_type) {
xrc_event = true;
/* Clear the bit and handle as usual */
event_type ^= IBV_XRC_QP_EVENT_FLAG;
}
#endif
switch(event_type) {
case IBV_EVENT_PATH_MIG:
BTL_ERROR(("Alternative path migration event reported"));
if (APM_ENABLED) {
BTL_ERROR(("Trying to find additional path..."));
#if OPAL_HAVE_CONNECTX_XRC
if (xrc_event)
mca_btl_openib_load_apm_xrc_rcv(event.element.xrc_qp_num,
xrc_qp2endpoint(event.element.xrc_qp_num, device));
else
#endif
mca_btl_openib_load_apm(event.element.qp,
qp2endpoint(event.element.qp, device));
}
break;
case IBV_EVENT_DEVICE_FATAL:
/* Set the flag to fatal */
device->got_fatal_event = true;
/* It is not critical to protect the counter */
OPAL_THREAD_ADD_FETCH32(&mca_btl_openib_component.error_counter, 1);
/* fall through */
case IBV_EVENT_CQ_ERR:
case IBV_EVENT_QP_FATAL:
if (event_type == IBV_EVENT_QP_FATAL) {
mca_btl_openib_qp_list *qp_item;
bool in_ignore_list = false;
BTL_VERBOSE(("QP is in err state %p", (void *)event.element.qp));
/* look through ignore list */
opal_mutex_lock (&ignore_qp_err_list_lock);
OPAL_LIST_FOREACH(qp_item, &ignore_qp_err_list, mca_btl_openib_qp_list) {
if (qp_item->qp == event.element.qp) {
BTL_VERBOSE(("QP %p is in error ignore list",
(void *)event.element.qp));
in_ignore_list = true;
break;
}
}
opal_mutex_unlock (&ignore_qp_err_list_lock);
if (in_ignore_list) {
break;
}
}
/* fall through */
case IBV_EVENT_QP_REQ_ERR:
case IBV_EVENT_QP_ACCESS_ERR:
case IBV_EVENT_PATH_MIG_ERR:
case IBV_EVENT_SRQ_ERR:
opal_show_help("help-mpi-btl-openib.txt", "of error event",
true,opal_process_info.nodename, (int)getpid(),
event_type,
openib_event_to_str((enum ibv_event_type)event_type));
break;
case IBV_EVENT_PORT_ERR:
opal_show_help("help-mpi-btl-openib.txt", "of error event",
true,opal_process_info.nodename, (int)getpid(),
event_type,
openib_event_to_str((enum ibv_event_type)event_type));
/* Set the flag to indicate port error */
device->got_port_event = true;
OPAL_THREAD_ADD_FETCH32(&mca_btl_openib_component.error_counter, 1);
break;
case IBV_EVENT_COMM_EST:
case IBV_EVENT_PORT_ACTIVE:
case IBV_EVENT_SQ_DRAINED:
case IBV_EVENT_LID_CHANGE:
case IBV_EVENT_PKEY_CHANGE:
case IBV_EVENT_SM_CHANGE:
case IBV_EVENT_QP_LAST_WQE_REACHED:
#if HAVE_DECL_IBV_EVENT_CLIENT_REREGISTER
case IBV_EVENT_CLIENT_REREGISTER:
#endif
break;
/* The event is signaled when the number of preposted receive WQEs drops
below the predefined threshold, srq_limit */
case IBV_EVENT_SRQ_LIMIT_REACHED:
(void) btl_openib_async_srq_limit_event (event.element.srq);
break;
default:
opal_show_help("help-mpi-btl-openib.txt", "of unknown event",
true,opal_process_info.nodename, (int)getpid(),
event_type);
}
ibv_ack_async_event(&event);
}
static void apm_update_attr(struct ibv_qp_attr *attr, enum ibv_qp_attr_mask *mask)
{
*mask = IBV_QP_ALT_PATH|IBV_QP_PATH_MIG_STATE;
attr->alt_ah_attr.dlid = attr->ah_attr.dlid + 1;
attr->alt_ah_attr.src_path_bits = attr->ah_attr.src_path_bits + 1;
attr->alt_ah_attr.static_rate = attr->ah_attr.static_rate;
attr->alt_ah_attr.sl = attr->ah_attr.sl;
attr->alt_pkey_index = attr->pkey_index;
attr->alt_port_num = attr->port_num;
attr->alt_timeout = attr->timeout;
attr->path_mig_state = IBV_MIG_REARM;
BTL_VERBOSE(("New APM LMC loaded: alt_src_port:%d, dlid: %d, src_bits %d, old_src_bits: %d, old_dlid %d",
attr->alt_port_num, attr->alt_ah_attr.dlid,
attr->alt_ah_attr.src_path_bits, attr->ah_attr.src_path_bits, attr->ah_attr.dlid));
}
static int apm_update_port(mca_btl_openib_endpoint_t *ep,
struct ibv_qp_attr *attr, enum ibv_qp_attr_mask *mask)
{
size_t port_i;
uint16_t apm_lid = 0;
if (attr->port_num == ep->endpoint_btl->apm_port) {
/* all ports were used */
BTL_ERROR(("APM: already all ports were used port_num %d apm_port %d",
attr->port_num, ep->endpoint_btl->apm_port));
return OPAL_ERROR;
}
/* look for an alternative LID on the remote side */
for(port_i = 0; port_i < ep->endpoint_proc->proc_port_count; port_i++) {
if (ep->endpoint_proc->proc_ports[port_i].pm_port_info.lid == attr->ah_attr.dlid - mca_btl_openib_component.apm_lmc) {
apm_lid = ep->endpoint_proc->proc_ports[port_i].pm_port_info.apm_lid;
}
}
if (0 == apm_lid) {
/* Was APM disabled on one of the sides? */
BTL_VERBOSE(("APM: Was disabled ? dlid %d %d %d", attr->ah_attr.dlid, attr->ah_attr.src_path_bits, ep->endpoint_btl->src_path_bits));
return OPAL_ERROR;
}
/* We assume that the LMC is the same on all ports */
attr->alt_ah_attr.static_rate = attr->ah_attr.static_rate;
attr->alt_ah_attr.sl = attr->ah_attr.sl;
attr->alt_pkey_index = attr->pkey_index;
attr->alt_timeout = attr->timeout;
attr->path_mig_state = IBV_MIG_REARM;
*mask = IBV_QP_ALT_PATH|IBV_QP_PATH_MIG_STATE;
attr->alt_port_num = ep->endpoint_btl->apm_port;
attr->alt_ah_attr.src_path_bits = ep->endpoint_btl->src_path_bits;
attr->alt_ah_attr.dlid = apm_lid;
BTL_VERBOSE(("New APM port loaded: alt_src_port:%d, dlid: %d, src_bits: %d:%d, old_dlid %d",
attr->alt_port_num, attr->alt_ah_attr.dlid,
attr->ah_attr.src_path_bits, attr->alt_ah_attr.src_path_bits,
attr->ah_attr.dlid));
return OPAL_SUCCESS;
}
/* Load new dlid to the QP */
void mca_btl_openib_load_apm(struct ibv_qp *qp, mca_btl_openib_endpoint_t *ep)
{
struct ibv_qp_init_attr qp_init_attr;
struct ibv_qp_attr attr;
enum ibv_qp_attr_mask mask = 0;
struct mca_btl_openib_module_t *btl;
BTL_VERBOSE(("APM: Loading alternative path"));
assert (NULL != ep);
btl = ep->endpoint_btl;
if (ibv_query_qp(qp, &attr, mask, &qp_init_attr))
BTL_ERROR(("Failed to ibv_query_qp, qp num: %d", qp->qp_num));
if (mca_btl_openib_component.apm_lmc &&
attr.ah_attr.src_path_bits - btl->src_path_bits < mca_btl_openib_component.apm_lmc) {
BTL_VERBOSE(("APM LMC: src: %d btl_src: %d lmc_max: %d",
attr.ah_attr.src_path_bits,
btl->src_path_bits,
mca_btl_openib_component.apm_lmc));
apm_update_attr(&attr, &mask);
} else {
if (mca_btl_openib_component.apm_ports) {
/* Try to migrate to next port */
if (OPAL_SUCCESS != apm_update_port(ep, &attr, &mask))
return;
} else {
BTL_ERROR(("Failed to load alternative path, all %d were used",
attr.ah_attr.src_path_bits - btl->src_path_bits));
}
}
if (ibv_modify_qp(qp, &attr, mask))
BTL_ERROR(("Failed to ibv_modify_qp, qp num: %d, errno says: %s (%d)",
qp->qp_num, strerror(errno), errno));
}
#if OPAL_HAVE_CONNECTX_XRC
void mca_btl_openib_load_apm_xrc_rcv(uint32_t qp_num, mca_btl_openib_endpoint_t *ep)
{
struct ibv_qp_init_attr qp_init_attr;
struct ibv_qp_attr attr;
enum ibv_qp_attr_mask mask = 0;
struct mca_btl_openib_module_t *btl;
BTL_VERBOSE(("APM XRC: Loading alternative path"));
assert (NULL != ep);
btl = ep->endpoint_btl;
if (ibv_query_xrc_rcv_qp(btl->device->xrc_domain, qp_num, &attr, mask, &qp_init_attr))
BTL_ERROR(("Failed to ibv_query_xrc_rcv_qp, qp num: %d", qp_num));
if (mca_btl_openib_component.apm_lmc &&
attr.ah_attr.src_path_bits - btl->src_path_bits < mca_btl_openib_component.apm_lmc) {
apm_update_attr(&attr, &mask);
} else {
if (mca_btl_openib_component.apm_ports) {
/* Try to migrate to next port */
if (OPAL_SUCCESS != apm_update_port(ep, &attr, &mask))
return;
} else {
BTL_ERROR(("Failed to load alternative path, all %d were used",
attr.ah_attr.src_path_bits - btl->src_path_bits));
}
}
ibv_modify_xrc_rcv_qp(btl->device->xrc_domain, qp_num, &attr, mask);
/* Maybe the qp already was modified by other process - ignoring error */
}
#endif
int mca_btl_openib_async_init (void)
{
if (!mca_btl_openib_component.use_async_event_thread ||
mca_btl_openib_component.async_evbase) {
return OPAL_SUCCESS;
}
mca_btl_openib_component.async_evbase = opal_progress_thread_init (NULL);
OBJ_CONSTRUCT(&ignore_qp_err_list, opal_list_t);
OBJ_CONSTRUCT(&ignore_qp_err_list_lock, opal_mutex_t);
/* Set the error counter to zero */
mca_btl_openib_component.error_counter = 0;
return OPAL_SUCCESS;
}
void mca_btl_openib_async_fini (void)
{
if (mca_btl_openib_component.async_evbase) {
OPAL_LIST_DESTRUCT(&ignore_qp_err_list);
OBJ_DESTRUCT(&ignore_qp_err_list_lock);
opal_progress_thread_finalize (NULL);
mca_btl_openib_component.async_evbase = NULL;
}
}
void mca_btl_openib_async_add_device (mca_btl_openib_device_t *device)
{
if (mca_btl_openib_component.async_evbase) {
if (1 == OPAL_THREAD_ADD_FETCH32 (&btl_openib_async_device_count, 1)) {
mca_btl_openib_async_init ();
}
opal_event_set (mca_btl_openib_component.async_evbase, &device->async_event,
device->ib_dev_context->async_fd, OPAL_EV_READ | OPAL_EV_PERSIST,
btl_openib_async_device, device);
opal_event_add (&device->async_event, 0);
}
}
void mca_btl_openib_async_rem_device (mca_btl_openib_device_t *device)
{
if (mca_btl_openib_component.async_evbase) {
opal_event_del (&device->async_event);
if (0 == OPAL_THREAD_ADD_FETCH32 (&btl_openib_async_device_count, -1)) {
mca_btl_openib_async_fini ();
}
}
}
void mca_btl_openib_async_add_qp_ignore (struct ibv_qp *qp)
{
if (mca_btl_openib_component.async_evbase) {
mca_btl_openib_qp_list *new_qp = OBJ_NEW(mca_btl_openib_qp_list);
if (OPAL_UNLIKELY(NULL == new_qp)) {
/* can't allocate a small object; not much more can be done */
return;
}
BTL_VERBOSE(("Ignoring errors on QP %p", (void *) qp));
new_qp->qp = qp;
opal_mutex_lock (&ignore_qp_err_list_lock);
opal_list_append (&ignore_qp_err_list, (opal_list_item_t *) new_qp);
opal_mutex_unlock (&ignore_qp_err_list_lock);
}
}
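
The SRQ growth step performed by btl_openib_async_srq_limit_event() above can be sketched in isolation. This is a minimal model, assuming a trimmed-down state struct: struct sketch_srq and sketch_srq_grow are hypothetical names, and the locking and hash-table lookup done by the real handler are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of the per-SRQ state used by the handler. */
struct sketch_srq {
    int32_t rd_curr_num;   /* receive WQEs currently allowed to be posted */
    int32_t rd_low_local;  /* current low watermark */
    bool    limit_event;   /* mirrors srq_limit_event_flag in the handler */
};

/* Apply one IBV_EVENT_SRQ_LIMIT_REACHED step: double the posted count and
 * the low watermark until the configured maximum rd_num is reached, at
 * which point both are pinned to their configured values. */
void sketch_srq_grow(struct sketch_srq *srq, int32_t rd_num, int32_t rd_low)
{
    assert(srq->rd_curr_num > 0 && rd_num > 0);
    srq->rd_curr_num <<= 1;
    if (srq->rd_curr_num >= rd_num) {
        srq->rd_curr_num = rd_num;
        srq->rd_low_local = rd_low;
        srq->limit_event = false;
        return;
    }
    srq->rd_low_local <<= 1;
    srq->limit_event = true;
}
```

Starting from 32 posted WQEs with rd_num = 256 (illustrative values), three events take the count to 64, 128, and then the 256 cap, after which the flag is cleared.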


@ -1,59 +0,0 @@
/*
* Copyright (c) 2007-2008 Mellanox Technologies. All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2015 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_BTL_OPENIB_ASYNC_H
#define MCA_BTL_OPENIB_ASYNC_H
#include "btl_openib_endpoint.h"
void mca_btl_openib_load_apm(struct ibv_qp *qp, mca_btl_openib_endpoint_t *ep);
#if OPAL_HAVE_CONNECTX_XRC
void mca_btl_openib_load_apm_xrc_rcv(uint32_t qp_num, mca_btl_openib_endpoint_t *ep);
#endif
#define APM_ENABLED (0 != mca_btl_openib_component.apm_lmc || 0 != mca_btl_openib_component.apm_ports)
/**
* Initialize the async event base
*/
int mca_btl_openib_async_init (void);
/**
* Finalize the async event base
*/
void mca_btl_openib_async_fini (void);
/**
* Register a device with the async event base
*
* @param[in] device device to register
*/
void mca_btl_openib_async_add_device (mca_btl_openib_device_t *device);
/**
* Deregister a device with the async event base
*
* @param[in] device device to deregister
*/
void mca_btl_openib_async_rem_device (mca_btl_openib_device_t *device);
/**
* Ignore error events on a queue pair
*
* @param[in] qp queue pair to ignore
*/
void mca_btl_openib_async_add_qp_ignore (struct ibv_qp *qp);
#endif


@ -1,140 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2014-2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
#include "btl_openib_proc.h"
#include "btl_openib_xrc.h"
#if HAVE_DECL_IBV_ATOMIC_HCA
static int mca_btl_openib_atomic_internal (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, enum ibv_wr_opcode opcode,
int64_t operand, int64_t operand2, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
mca_btl_openib_get_frag_t* frag = NULL;
int qp = order;
int32_t rkey;
int rc;
frag = to_get_frag(alloc_recv_user_frag());
if (OPAL_UNLIKELY(NULL == frag)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
if (MCA_BTL_NO_ORDER == qp) {
qp = mca_btl_openib_component.rdma_qp;
}
/* set base descriptor flags */
to_base_frag(frag)->base.order = qp;
/* free this descriptor when the operation is complete */
to_base_frag(frag)->base.des_flags = MCA_BTL_DES_FLAGS_BTL_OWNERSHIP;
/* set up scatter-gather entry */
to_com_frag(frag)->sg_entry.length = 8;
to_com_frag(frag)->sg_entry.lkey = local_handle->lkey;
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t) local_address;
to_com_frag(frag)->endpoint = endpoint;
/* set up rdma callback */
frag->cb.func = cbfunc;
frag->cb.context = cbcontext;
frag->cb.data = cbdata;
frag->cb.local_handle = local_handle;
/* set up descriptor */
frag->sr_desc.wr.atomic.remote_addr = remote_address;
frag->sr_desc.opcode = opcode;
frag->sr_desc.wr.atomic.compare_add = operand;
frag->sr_desc.wr.atomic.swap = operand2;
rkey = remote_handle->rkey;
#if OPAL_ENABLE_HETEROGENEOUS_SUPPORT
if((endpoint->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN)
!= (opal_proc_local_get()->proc_arch & OPAL_ARCH_ISBIGENDIAN)) {
rkey = opal_swap_bytes4 (rkey);
}
#endif
frag->sr_desc.wr.atomic.rkey = rkey;
/* NTH: the SRQ# is set in mca_btl_openib_get_internal */
if (endpoint->endpoint_state != MCA_BTL_IB_CONNECTED) {
OPAL_THREAD_LOCK(&endpoint->endpoint_lock);
rc = check_endpoint_state(endpoint, &to_base_frag(frag)->base, &endpoint->pending_get_frags);
OPAL_THREAD_UNLOCK(&endpoint->endpoint_lock);
if (OPAL_ERR_RESOURCE_BUSY == rc) {
return OPAL_SUCCESS;
}
if (OPAL_SUCCESS != rc) {
MCA_BTL_IB_FRAG_RETURN (frag);
return rc;
}
}
rc = mca_btl_openib_get_internal (btl, endpoint, frag);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
if (OPAL_LIKELY(OPAL_ERR_OUT_OF_RESOURCE == rc)) {
rc = OPAL_SUCCESS;
OPAL_THREAD_SCOPED_LOCK(&endpoint->endpoint_lock,
opal_list_append(&endpoint->pending_get_frags, (opal_list_item_t*)frag));
} else {
MCA_BTL_IB_FRAG_RETURN (frag);
}
}
return rc;
}
int mca_btl_openib_atomic_fop (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address,
struct mca_btl_base_registration_handle_t *local_handle,
struct mca_btl_base_registration_handle_t *remote_handle, mca_btl_base_atomic_op_t op,
uint64_t operand, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
if (OPAL_UNLIKELY(MCA_BTL_ATOMIC_ADD != op || (MCA_BTL_ATOMIC_FLAG_32BIT & flags))) {
return OPAL_ERR_NOT_SUPPORTED;
}
return mca_btl_openib_atomic_internal (btl, endpoint, local_address, remote_address, local_handle,
remote_handle, IBV_WR_ATOMIC_FETCH_AND_ADD, operand, 0,
flags, order, cbfunc, cbcontext, cbdata);
}
int mca_btl_openib_atomic_cswap (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
void *local_address, uint64_t remote_address,
struct mca_btl_base_registration_handle_t *local_handle,
struct mca_btl_base_registration_handle_t *remote_handle, uint64_t compare,
uint64_t value, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
if (OPAL_UNLIKELY(MCA_BTL_ATOMIC_FLAG_32BIT & flags)) {
return OPAL_ERR_NOT_SUPPORTED;
}
return mca_btl_openib_atomic_internal (btl, endpoint, local_address, remote_address, local_handle,
remote_handle, IBV_WR_ATOMIC_CMP_AND_SWP, compare, value,
flags, order, cbfunc, cbcontext, cbdata);
}
#endif
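
The result semantics documented for mca_btl_openib_atomic_fop (with MCA_BTL_ATOMIC_ADD) and mca_btl_openib_atomic_cswap can be modeled on a plain 64-bit variable. This is a single-threaded sketch only: sketch_fetch_add and sketch_cswap are hypothetical helpers, nothing here is atomic, and the returned value stands in for what the BTL stores at local_address on completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of fetch-and-add: *remote = *remote + operand, and the previous
 * remote value is returned (the BTL writes it to local_address). */
uint64_t sketch_fetch_add(uint64_t *remote, uint64_t operand)
{
    assert(remote != NULL);
    uint64_t previous = *remote;
    *remote = previous + operand;
    return previous;
}

/* Model of compare-and-swap: *remote is replaced by value only when it
 * equals compare; the previous remote value is returned either way. */
uint64_t sketch_cswap(uint64_t *remote, uint64_t compare, uint64_t value)
{
    assert(remote != NULL);
    uint64_t previous = *remote;
    if (previous == compare) {
        *remote = value;
    }
    return previous;
}
```

A caller can tell whether the swap happened by comparing the returned previous value with the compare operand it supplied.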

File diff suppressed because it is too large.


@ -1,118 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2015-2018 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_OPENIB_EAGER_RDMA_BUF_H
#define MCA_BTL_OPENIB_EAGER_RDMA_BUF_H
#include "opal_config.h"
#include "btl_openib.h"
BEGIN_C_DECLS
struct mca_btl_openib_eager_rdma_local_t {
opal_ptr_t base; /**< buffer for RDMAing eager messages */
void *alloc_base; /**< allocated base */
mca_btl_openib_recv_frag_t *frags;
mca_btl_openib_reg_t *reg;
uint16_t head; /**< RDMA buffer to poll */
uint16_t tail; /**< Needed for credit management */
opal_atomic_int32_t credits; /**< number of RDMA credits */
int32_t rd_win;
#if OPAL_ENABLE_DEBUG
uint32_t seq;
#endif
opal_mutex_t lock; /**< guard access to RDMA buffer */
int32_t rd_low;
};
typedef struct mca_btl_openib_eager_rdma_local_t mca_btl_openib_eager_rdma_local_t;
struct mca_btl_openib_eager_rdma_remote_t {
opal_ptr_t base; /**< address of remote buffer */
uint32_t rkey; /**< RKey for accessing remote buffer */
opal_atomic_int32_t head; /**< RDMA buffer to post to */
opal_atomic_int32_t tokens; /**< number of rdma tokens */
#if OPAL_ENABLE_DEBUG
uint32_t seq;
#endif
};
typedef struct mca_btl_openib_eager_rdma_remote_t mca_btl_openib_eager_rdma_remote_t;
#define MCA_BTL_OPENIB_RDMA_FRAG(F) \
(openib_frag_type(F) == MCA_BTL_OPENIB_FRAG_EAGER_RDMA)
#define EAGER_RDMA_BUFFER_REMOTE (0)
#define EAGER_RDMA_BUFFER_LOCAL (0xff)
#ifdef WORDS_BIGENDIAN
#define MCA_BTL_OPENIB_RDMA_FRAG_GET_SIZE(F) ((F)->u.size >> 8)
#define MCA_BTL_OPENIB_RDMA_FRAG_SET_SIZE(F, S) \
((F)->u.size = (S) << 8)
#else
#define MCA_BTL_OPENIB_RDMA_FRAG_GET_SIZE(F) ((F)->u.size & 0x00ffffff)
#define MCA_BTL_OPENIB_RDMA_FRAG_SET_SIZE(F, S) \
((F)->u.size = (S) & 0x00ffffff)
#endif
#define MCA_BTL_OPENIB_RDMA_FRAG_LOCAL(F) \
(((volatile uint8_t*)(F)->ftr->u.buf)[3] != EAGER_RDMA_BUFFER_REMOTE)
#define MCA_BTL_OPENIB_RDMA_FRAG_REMOTE(F) \
(!MCA_BTL_OPENIB_RDMA_FRAG_LOCAL(F))
#define MCA_BTL_OPENIB_RDMA_MAKE_REMOTE(F) do { \
((volatile uint8_t*)(F)->u.buf)[3] = EAGER_RDMA_BUFFER_REMOTE; \
}while (0)
#define MCA_BTL_OPENIB_RDMA_MAKE_LOCAL(F) do { \
((volatile uint8_t*)(F)->u.buf)[3] = EAGER_RDMA_BUFFER_LOCAL; \
}while (0)
#define MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG(E, I) \
(&(E)->eager_rdma_local.frags[(I)])
#define MCA_BTL_OPENIB_RDMA_NEXT_INDEX(I) do { \
(I) = ((I) + 1); \
if((I) == \
mca_btl_openib_component.eager_rdma_num) \
(I) = 0; \
} while (0)
#if OPAL_ENABLE_DEBUG
/**
* @brief read and increment the remote head index and generate a sequence
* number
*/
#define MCA_BTL_OPENIB_RDMA_MOVE_INDEX(HEAD, OLD_HEAD, SEQ) \
do { \
(SEQ) = OPAL_THREAD_ADD_FETCH32(&(HEAD), 1) - 1; \
(OLD_HEAD) = (SEQ) % mca_btl_openib_component.eager_rdma_num; \
} while(0)
#else
/**
* @brief read and increment the remote head index
*/
#define MCA_BTL_OPENIB_RDMA_MOVE_INDEX(HEAD, OLD_HEAD) \
do { \
(OLD_HEAD) = (OPAL_THREAD_ADD_FETCH32((opal_atomic_int32_t *) &(HEAD), 1) - 1) % mca_btl_openib_component.eager_rdma_num; \
} while(0)
#endif
END_C_DECLS
#endif
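
The two ring-index macros above, MCA_BTL_OPENIB_RDMA_NEXT_INDEX and the non-debug MCA_BTL_OPENIB_RDMA_MOVE_INDEX, can be sketched as plain functions. This is an approximation under stated assumptions: sketch_next_index and sketch_move_index are hypothetical names, ring_size stands in for mca_btl_openib_component.eager_rdma_num, and C11 atomic_fetch_add (which returns the old value) replaces OPAL_THREAD_ADD_FETCH32 followed by "- 1".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Advance a local ring index by one slot, wrapping to 0 at ring_size. */
int32_t sketch_next_index(int32_t i, int32_t ring_size)
{
    assert(ring_size > 0 && i >= 0 && i < ring_size);
    i = i + 1;
    if (i == ring_size) {
        i = 0;
    }
    return i;
}

/* Atomically claim the current head and advance it; the claimed slot is
 * the pre-increment counter value reduced modulo the ring size. */
int32_t sketch_move_index(_Atomic int32_t *head, int32_t ring_size)
{
    assert(ring_size > 0);
    int32_t seq = atomic_fetch_add(head, 1);
    return seq % ring_size;
}
```

Note the difference in design: the local index is kept inside the ring at all times, while the shared head counter grows monotonically and is only reduced modulo the ring size when a slot is claimed.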

File diff suppressed because it is too large.


@ -1,720 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2013 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007-2009 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2007-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2010-2012 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2015 NVIDIA Corporation. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_IB_ENDPOINT_H
#define MCA_BTL_IB_ENDPOINT_H
#include <errno.h>
#include <string.h>
#include "opal/class/opal_list.h"
#include "opal/mca/event/event.h"
#include "opal/util/output.h"
#include "opal/mca/btl/btl.h"
#include "opal/mca/btl/base/btl_base_error.h"
#include "btl_openib.h"
#include "btl_openib_frag.h"
#include "btl_openib_eager_rdma.h"
#include "connect/base.h"
#define QP_TX_BATCH_COUNT 64
BEGIN_C_DECLS
struct mca_btl_openib_frag_t;
struct mca_btl_openib_proc_modex_t;
/**
* State of IB endpoint connection.
*/
typedef enum {
/* Defines the state in which this BTL instance
* has started the process of connection */
MCA_BTL_IB_CONNECTING,
/* Waiting for ack from endpoint */
MCA_BTL_IB_CONNECT_ACK,
/* Waiting for final connection ACK from endpoint */
MCA_BTL_IB_WAITING_ACK,
/* Connected ... both sender & receiver have
* buffers associated with this connection */
MCA_BTL_IB_CONNECTED,
/* Connection is closed, there are no resources
* associated with this */
MCA_BTL_IB_CLOSED,
/* Maximum number of retries have been used.
* Report failure on send to upper layer */
MCA_BTL_IB_FAILED
} mca_btl_openib_endpoint_state_t;
typedef struct mca_btl_openib_rem_qp_info_t {
uint32_t rem_qp_num;
/* Remote QP number */
uint32_t rem_psn;
/* Remote process's packet sequence number */
} mca_btl_openib_rem_qp_info_t;
typedef struct mca_btl_openib_rem_srq_info_t {
/* Remote SRQ number */
uint32_t rem_srq_num;
} mca_btl_openib_rem_srq_info_t;
typedef struct mca_btl_openib_rem_info_t {
/* Local identifier of the remote process */
uint16_t rem_lid;
/* subnet id of remote process */
uint64_t rem_subnet_id;
/* MTU of remote process */
uint32_t rem_mtu;
/* index of remote endpoint in endpoint array */
uint32_t rem_index;
/* Remote QPs */
mca_btl_openib_rem_qp_info_t *rem_qps;
/* Remote xrc_srq info, used only with XRC connections */
mca_btl_openib_rem_srq_info_t *rem_srqs;
/* Vendor id of remote HCA */
uint32_t rem_vendor_id;
/* Vendor part id of remote HCA */
uint32_t rem_vendor_part_id;
/* Transport type of remote port */
mca_btl_openib_transport_type_t rem_transport_type;
} mca_btl_openib_rem_info_t;
/**
* Aggregates all per-peer QP info for an endpoint
*/
typedef struct mca_btl_openib_endpoint_pp_qp_t {
opal_atomic_int32_t sd_credits; /**< this rank's view of the credits
* available for sending:
* this is the credits granted by the
* remote peer which has some relation to the
* number of receive buffers posted remotely
*/
opal_atomic_int32_t rd_posted; /**< number of descriptors posted to the nic*/
opal_atomic_int32_t rd_credits; /**< number of credits to return to peer */
opal_atomic_int32_t cm_received; /**< Credit messages received */
opal_atomic_int32_t cm_return; /**< how many credits to return */
opal_atomic_int32_t cm_sent; /**< Outstanding number of credit messages */
} mca_btl_openib_endpoint_pp_qp_t;
/**
* Aggregates all srq qp info for an endpoint
*/
typedef struct mca_btl_openib_endpoint_srq_qp_t {
int32_t dummy;
} mca_btl_openib_endpoint_srq_qp_t;
typedef struct mca_btl_openib_qp_t {
struct ibv_qp *lcl_qp;
uint32_t lcl_psn;
opal_atomic_int32_t sd_wqe; /**< number of available send wqe entries */
opal_atomic_int32_t sd_wqe_inflight;
int wqe_count;
int users;
opal_mutex_t lock;
} mca_btl_openib_qp_t;
typedef struct mca_btl_openib_endpoint_qp_t {
mca_btl_openib_qp_t *qp;
opal_list_t no_credits_pending_frags[2]; /**< put fragments here if there are no credits
available */
opal_list_t no_wqe_pending_frags[2]; /**< put fragments here if there is no wqe
available */
opal_atomic_int32_t rd_credit_send_lock; /**< Lock credit send fragment */
mca_btl_openib_send_control_frag_t *credit_frag;
size_t ib_inline_max; /**< max size of inline send*/
union {
mca_btl_openib_endpoint_srq_qp_t srq_qp;
mca_btl_openib_endpoint_pp_qp_t pp_qp;
} u;
} mca_btl_openib_endpoint_qp_t;
/**
* An abstraction that represents a connection to an endpoint process.
* An instance of mca_btl_base_endpoint_t is associated w/ each process
* and BTL pair at startup. However, connections to the endpoint
* are established dynamically on an as-needed basis.
*/
struct mca_btl_base_endpoint_t {
opal_list_item_t super;
/** BTL module that created this connection */
struct mca_btl_openib_module_t* endpoint_btl;
/** proc structure corresponding to endpoint */
struct mca_btl_openib_proc_t* endpoint_proc;
/** local CPC to connect to this endpoint */
opal_btl_openib_connect_base_module_t *endpoint_local_cpc;
/** hook for local CPC to hang endpoint-specific data */
void *endpoint_local_cpc_data;
/** If endpoint_local_cpc->cbm_uses_cts is true and this endpoint
is iWARP, then endpoint_initiator must be true on the side
that actually initiates the QP, false on the other side. This
bool is used to know which way to send the first CTS
message. */
bool endpoint_initiator;
/** pointer to remote proc's CPC data (essentially its CPC modex
message) */
opal_btl_openib_connect_base_module_data_t *endpoint_remote_cpc_data;
/** current state of the connection */
mca_btl_openib_endpoint_state_t endpoint_state;
/** number of connection retries attempted */
size_t endpoint_retries;
/** timestamp of when the first connection was attempted */
double endpoint_tstamp;
/** lock for concurrent access to endpoint state */
opal_mutex_t endpoint_lock;
/** list of pending frags due to lazy connection establishment
for this endpoint */
opal_list_t pending_lazy_frags;
mca_btl_openib_endpoint_qp_t *qps;
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
struct ibv_qp *xrc_recv_qp;
#else
uint32_t xrc_recv_qp_num; /* in xrc we will use it as recv qp */
#endif
uint32_t xrc_recv_psn;
/** list of pending rget ops */
opal_list_t pending_get_frags;
/** list of pending rput ops */
opal_list_t pending_put_frags;
/** number of available get tokens */
opal_atomic_int32_t get_tokens;
/** subnet id of this endpoint*/
uint64_t subnet_id;
/** used only for xrc; pointer to struct that keeps remote port
info */
struct ib_address_t *ib_addr;
/** number of eager received */
opal_atomic_int32_t eager_recv_count;
/** info about remote RDMA buffer */
mca_btl_openib_eager_rdma_remote_t eager_rdma_remote;
/** info about local RDMA buffer */
mca_btl_openib_eager_rdma_local_t eager_rdma_local;
/** index of the endpoint in endpoints array */
int32_t index;
/** does the endpoint require network byte ordering? */
bool nbo;
/** use eager rdma for this peer? */
bool use_eager_rdma;
/** information about the remote port */
mca_btl_openib_rem_info_t rem_info;
/** Frag for initial wireup CTS protocol; will be NULL if CPC
indicates that it does not want to use CTS */
mca_btl_openib_recv_frag_t endpoint_cts_frag;
/** Memory registration info for the CTS frag */
struct ibv_mr *endpoint_cts_mr;
/** Whether we've posted receives on this EP or not (only used in
CTS protocol) */
bool endpoint_posted_recvs;
/** Whether we've received the CTS from the peer or not (only used
in CTS protocol) */
bool endpoint_cts_received;
/** Whether we've sent the CTS to the peer or not (only used in
CTS protocol) */
bool endpoint_cts_sent;
};
typedef struct mca_btl_base_endpoint_t mca_btl_base_endpoint_t;
typedef mca_btl_base_endpoint_t mca_btl_openib_endpoint_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_endpoint_t);
static inline int32_t qp_get_wqe(mca_btl_openib_endpoint_t *ep, const int qp)
{
return OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].qp->sd_wqe, -1);
}
static inline int32_t qp_put_wqe(mca_btl_openib_endpoint_t *ep, const int qp)
{
return OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].qp->sd_wqe, 1);
}
static inline int32_t qp_inc_inflight_wqe(mca_btl_openib_endpoint_t *ep, const int qp, mca_btl_openib_com_frag_t *frag)
{
frag->n_wqes_inflight = 0;
return OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].qp->sd_wqe_inflight, 1);
}
static inline void qp_inflight_wqe_to_frag(mca_btl_openib_endpoint_t *ep, const int qp, mca_btl_openib_com_frag_t *frag)
{
frag->n_wqes_inflight = ep->qps[qp].qp->sd_wqe_inflight;
ep->qps[qp].qp->sd_wqe_inflight = 0;
}
static inline int qp_frag_to_wqe(mca_btl_openib_endpoint_t *ep, const int qp, mca_btl_openib_com_frag_t *frag)
{
int n;
n = frag->n_wqes_inflight;
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].qp->sd_wqe, n);
frag->n_wqes_inflight = 0;
return n;
}
static inline int qp_need_signal(mca_btl_openib_endpoint_t *ep, const int qp, size_t size, int rdma)
{
/* note that size here is payload only */
if (ep->qps[qp].qp->sd_wqe <= 0 ||
size + sizeof(mca_btl_openib_header_t) + (rdma ? sizeof(mca_btl_openib_footer_t) : 0) > ep->qps[qp].ib_inline_max ||
(!BTL_OPENIB_QP_TYPE_PP(qp) && ep->endpoint_btl->qps[qp].u.srq_qp.sd_credits <= 0)) {
ep->qps[qp].qp->wqe_count = QP_TX_BATCH_COUNT;
return 1;
}
if (0 < --ep->qps[qp].qp->wqe_count) {
return 0;
}
ep->qps[qp].qp->wqe_count = QP_TX_BATCH_COUNT;
return 1;
}
static inline void qp_reset_signal_count(mca_btl_openib_endpoint_t *ep, const int qp)
{
ep->qps[qp].qp->wqe_count = QP_TX_BATCH_COUNT;
}
int mca_btl_openib_endpoint_send(mca_btl_base_endpoint_t*,
mca_btl_openib_send_frag_t*);
int mca_btl_openib_endpoint_post_send(mca_btl_openib_endpoint_t*,
mca_btl_openib_send_frag_t*);
void mca_btl_openib_endpoint_send_credits(mca_btl_base_endpoint_t*, const int);
void mca_btl_openib_endpoint_connect_eager_rdma(mca_btl_openib_endpoint_t*);
int mca_btl_openib_endpoint_post_recvs(mca_btl_openib_endpoint_t*);
/* the endpoint lock must be held with OPAL_THREAD_LOCK for both CTS and cpc complete */
void mca_btl_openib_endpoint_send_cts(mca_btl_openib_endpoint_t *endpoint);
void mca_btl_openib_endpoint_cpc_complete(mca_btl_openib_endpoint_t*);
void mca_btl_openib_endpoint_connected(mca_btl_openib_endpoint_t*);
void mca_btl_openib_endpoint_init(mca_btl_openib_module_t*,
mca_btl_base_endpoint_t*,
opal_btl_openib_connect_base_module_t *local_cpc,
struct mca_btl_openib_proc_modex_t *remote_proc_info,
opal_btl_openib_connect_base_module_data_t *remote_cpc_data);
/*
* Invoke an error on the btl associated with an endpoint. If we
* don't have an endpoint, then just use the first one on the
* component list of BTLs.
*/
void *mca_btl_openib_endpoint_invoke_error(void *endpoint);
static inline int post_recvs(mca_btl_base_endpoint_t *ep, const int qp,
const int num_post)
{
int i, rc;
struct ibv_recv_wr *bad_wr, *wr_list = NULL, *wr = NULL;
mca_btl_openib_module_t *openib_btl = ep->endpoint_btl;
if(0 == num_post)
return OPAL_SUCCESS;
for(i = 0; i < num_post; i++) {
opal_free_list_item_t* item;
item = opal_free_list_wait (&openib_btl->device->qps[qp].recv_free);
to_base_frag(item)->base.order = qp;
to_com_frag(item)->endpoint = ep;
if(NULL == wr)
wr = wr_list = &to_recv_frag(item)->rd_desc;
else
wr = wr->next = &to_recv_frag(item)->rd_desc;
OPAL_OUTPUT((-1, "Posting recv (QP num %d): WR ID %p, SG addr %p, len %d, lkey %d",
ep->qps[qp].qp->lcl_qp->qp_num,
(void*) ((uintptr_t*)wr->wr_id),
(void*)((uintptr_t*) wr->sg_list[0].addr),
wr->sg_list[0].length,
wr->sg_list[0].lkey));
}
wr->next = NULL;
rc = ibv_post_recv(ep->qps[qp].qp->lcl_qp, wr_list, &bad_wr);
if (0 == rc)
return OPAL_SUCCESS;
BTL_ERROR(("error %d posting receive on qp %d", rc, qp));
return OPAL_ERROR;
}
static inline int mca_btl_openib_endpoint_post_rr_nolock(
mca_btl_base_endpoint_t *ep, const int qp)
{
int rd_rsv = mca_btl_openib_component.qp_infos[qp].u.pp_qp.rd_rsv;
int rd_num = mca_btl_openib_component.qp_infos[qp].rd_num;
int rd_low = mca_btl_openib_component.qp_infos[qp].rd_low;
int cqp = mca_btl_openib_component.credits_qp, rc;
int cm_received = 0, num_post = 0;
assert(BTL_OPENIB_QP_TYPE_PP(qp));
if(ep->qps[qp].u.pp_qp.rd_posted <= rd_low)
num_post = rd_num - ep->qps[qp].u.pp_qp.rd_posted;
assert(num_post >= 0);
if(ep->qps[qp].u.pp_qp.cm_received >= (rd_rsv >> 2))
cm_received = ep->qps[qp].u.pp_qp.cm_received;
if((rc = post_recvs(ep, qp, num_post)) != OPAL_SUCCESS) {
return rc;
}
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].u.pp_qp.rd_posted, num_post);
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].u.pp_qp.rd_credits, num_post);
/* post buffers for credit management on credit management qp */
if((rc = post_recvs(ep, cqp, cm_received)) != OPAL_SUCCESS) {
return rc;
}
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].u.pp_qp.cm_return, cm_received);
OPAL_THREAD_ADD_FETCH32(&ep->qps[qp].u.pp_qp.cm_received, -cm_received);
assert(ep->qps[qp].u.pp_qp.rd_credits <= rd_num &&
ep->qps[qp].u.pp_qp.rd_credits >= 0);
return OPAL_SUCCESS;
}
static inline int mca_btl_openib_endpoint_post_rr(
mca_btl_base_endpoint_t *ep, const int qp)
{
int ret;
OPAL_THREAD_LOCK(&ep->endpoint_lock);
ret = mca_btl_openib_endpoint_post_rr_nolock(ep, qp);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
return ret;
}
static inline __opal_attribute_always_inline__ bool btl_openib_credits_send_trylock (mca_btl_openib_endpoint_t *ep, int qp)
{
int32_t _tmp_value = 0;
return OPAL_ATOMIC_COMPARE_EXCHANGE_STRONG_32(&ep->qps[qp].rd_credit_send_lock, &_tmp_value, 1);
}
#define BTL_OPENIB_CREDITS_SEND_UNLOCK(E, Q) \
OPAL_ATOMIC_SWAP_32 (&(E)->qps[(Q)].rd_credit_send_lock, 0)
#define BTL_OPENIB_GET_CREDITS(FROM, TO) \
TO = OPAL_ATOMIC_SWAP_32(&FROM, 0)
static inline bool check_eager_rdma_credits(const mca_btl_openib_endpoint_t *ep)
{
return (ep->eager_rdma_local.credits > ep->eager_rdma_local.rd_win) ? true :
false;
}
static inline bool
check_send_credits(const mca_btl_openib_endpoint_t *ep, const int qp)
{
if(!BTL_OPENIB_QP_TYPE_PP(qp))
return false;
return (ep->qps[qp].u.pp_qp.rd_credits >=
mca_btl_openib_component.qp_infos[qp].u.pp_qp.rd_win) ? true : false;
}
static inline void send_credits(mca_btl_openib_endpoint_t *ep, int qp)
{
if(BTL_OPENIB_QP_TYPE_PP(qp)) {
if(check_send_credits(ep, qp))
goto try_send;
} else {
qp = mca_btl_openib_component.credits_qp;
}
if(!check_eager_rdma_credits(ep))
return;
try_send:
if(btl_openib_credits_send_trylock(ep, qp))
mca_btl_openib_endpoint_send_credits(ep, qp);
}
static inline int check_endpoint_state(mca_btl_openib_endpoint_t *ep,
mca_btl_base_descriptor_t *des, opal_list_t *pending_list)
{
int rc = OPAL_ERR_RESOURCE_BUSY;
switch(ep->endpoint_state) {
case MCA_BTL_IB_CLOSED:
rc = ep->endpoint_local_cpc->cbm_start_connect(ep->endpoint_local_cpc, ep);
if (OPAL_SUCCESS == rc) {
rc = OPAL_ERR_RESOURCE_BUSY;
}
/* fall through */
default:
opal_list_append(pending_list, (opal_list_item_t *)des);
break;
case MCA_BTL_IB_FAILED:
rc = OPAL_ERR_UNREACH;
break;
case MCA_BTL_IB_CONNECTED:
rc = OPAL_SUCCESS;
break;
}
return rc;
}
static inline __opal_attribute_always_inline__ int
ib_send_flags(uint32_t size, mca_btl_openib_endpoint_qp_t *qp, int do_signal)
{
if (do_signal) {
return IBV_SEND_SIGNALED |
((size <= qp->ib_inline_max) ? IBV_SEND_INLINE : 0);
} else {
return ((size <= qp->ib_inline_max) ? IBV_SEND_INLINE : 0);
}
}
static inline int
acquire_eager_rdma_send_credit(mca_btl_openib_endpoint_t *endpoint)
{
if(OPAL_THREAD_ADD_FETCH32(&endpoint->eager_rdma_remote.tokens, -1) < 0) {
OPAL_THREAD_ADD_FETCH32(&endpoint->eager_rdma_remote.tokens, 1);
return OPAL_ERR_OUT_OF_RESOURCE;
}
return OPAL_SUCCESS;
}
static inline int post_send(mca_btl_openib_endpoint_t *ep,
mca_btl_openib_send_frag_t *frag, const bool rdma, int do_signal)
{
mca_btl_openib_module_t *openib_btl = ep->endpoint_btl;
mca_btl_base_segment_t *seg = &to_base_frag(frag)->segment;
struct ibv_sge *sg = &to_com_frag(frag)->sg_entry;
struct ibv_send_wr *sr_desc = &to_out_frag(frag)->sr_desc;
struct ibv_send_wr *bad_wr;
int qp = to_base_frag(frag)->base.order;
sg->length = seg->seg_len + sizeof(mca_btl_openib_header_t) +
(rdma ? sizeof(mca_btl_openib_footer_t) : 0) + frag->coalesced_length;
sr_desc->send_flags = ib_send_flags(sg->length, &(ep->qps[qp]), do_signal);
if(ep->nbo)
BTL_OPENIB_HEADER_HTON(*frag->hdr);
if(rdma) {
int32_t head;
mca_btl_openib_footer_t* ftr =
(mca_btl_openib_footer_t*)(((char*)frag->hdr) + sg->length +
BTL_OPENIB_FTR_PADDING(sg->length) - sizeof(mca_btl_openib_footer_t));
sr_desc->opcode = IBV_WR_RDMA_WRITE;
MCA_BTL_OPENIB_RDMA_FRAG_SET_SIZE(ftr, sg->length);
MCA_BTL_OPENIB_RDMA_MAKE_LOCAL(ftr);
#if OPAL_ENABLE_DEBUG
/* NTH: generate the sequence from the remote head index to ensure that the
* wrong sequence isn't set. The way this code used to look the sequence number
* and head were updated independently and it led to false positives for incorrect
* sequence numbers. */
MCA_BTL_OPENIB_RDMA_MOVE_INDEX(ep->eager_rdma_remote.head, head, ftr->seq);
#else
MCA_BTL_OPENIB_RDMA_MOVE_INDEX(ep->eager_rdma_remote.head, head);
#endif
if(ep->nbo)
BTL_OPENIB_FOOTER_HTON(*ftr);
sr_desc->wr.rdma.rkey = ep->eager_rdma_remote.rkey;
sr_desc->wr.rdma.remote_addr =
ep->eager_rdma_remote.base.lval +
head * openib_btl->eager_rdma_frag_size +
sizeof(mca_btl_openib_header_t) +
mca_btl_openib_component.eager_limit +
sizeof(mca_btl_openib_footer_t);
sr_desc->wr.rdma.remote_addr -= sg->length + BTL_OPENIB_FTR_PADDING(sg->length);
} else {
if(BTL_OPENIB_QP_TYPE_PP(qp)) {
sr_desc->opcode = IBV_WR_SEND;
} else {
sr_desc->opcode = IBV_WR_SEND_WITH_IMM;
#if !defined(WORDS_BIGENDIAN) && OPAL_ENABLE_HETEROGENEOUS_SUPPORT
sr_desc->imm_data = htonl(ep->rem_info.rem_index);
#else
sr_desc->imm_data = ep->rem_info.rem_index;
#endif
}
}
#if HAVE_XRC
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
if(BTL_OPENIB_QP_TYPE_XRC(qp))
sr_desc->qp_type.xrc.remote_srqn = ep->rem_info.rem_srqs[qp].rem_srq_num;
#else
if(BTL_OPENIB_QP_TYPE_XRC(qp))
sr_desc->xrc_remote_srq_num = ep->rem_info.rem_srqs[qp].rem_srq_num;
#endif
#endif
assert(sg->addr == (uint64_t)(uintptr_t)frag->hdr);
if (sr_desc->send_flags & IBV_SEND_SIGNALED) {
qp_inflight_wqe_to_frag(ep, qp, to_com_frag(frag));
} else {
qp_inc_inflight_wqe(ep, qp, to_com_frag(frag));
}
return ibv_post_send(ep->qps[qp].qp->lcl_qp, sr_desc, &bad_wr);
}
/* called with the endpoint lock held */
static inline int mca_btl_openib_endpoint_credit_acquire (struct mca_btl_base_endpoint_t *endpoint, int qp,
int prio, size_t size, bool *do_rdma,
mca_btl_openib_send_frag_t *frag, bool queue_frag)
{
mca_btl_openib_module_t *openib_btl = endpoint->endpoint_btl;
mca_btl_openib_header_t *hdr = frag->hdr;
size_t eager_limit;
int32_t cm_return;
eager_limit = mca_btl_openib_component.eager_limit +
sizeof(mca_btl_openib_header_coalesced_t) +
sizeof(mca_btl_openib_control_header_t);
if (!(prio && size < eager_limit && acquire_eager_rdma_send_credit(endpoint) == OPAL_SUCCESS)) {
*do_rdma = false;
prio = !prio;
if (BTL_OPENIB_QP_TYPE_PP(qp)) {
if (OPAL_THREAD_ADD_FETCH32(&endpoint->qps[qp].u.pp_qp.sd_credits, -1) < 0) {
OPAL_THREAD_ADD_FETCH32(&endpoint->qps[qp].u.pp_qp.sd_credits, 1);
if (queue_frag) {
opal_list_append(&endpoint->qps[qp].no_credits_pending_frags[prio],
(opal_list_item_t *)frag);
}
return OPAL_ERR_OUT_OF_RESOURCE;
}
} else {
if(OPAL_THREAD_ADD_FETCH32(&openib_btl->qps[qp].u.srq_qp.sd_credits, -1) < 0) {
OPAL_THREAD_ADD_FETCH32(&openib_btl->qps[qp].u.srq_qp.sd_credits, 1);
if (queue_frag) {
OPAL_THREAD_LOCK(&openib_btl->ib_lock);
opal_list_append(&openib_btl->qps[qp].u.srq_qp.pending_frags[prio],
(opal_list_item_t *)frag);
OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
}
return OPAL_ERR_OUT_OF_RESOURCE;
}
}
} else {
/* High priority frag. Try to send over eager RDMA */
*do_rdma = true;
}
/* Set all credits */
BTL_OPENIB_GET_CREDITS(endpoint->eager_rdma_local.credits, hdr->credits);
if (hdr->credits) {
hdr->credits |= BTL_OPENIB_RDMA_CREDITS_FLAG;
}
if (!*do_rdma) {
if (BTL_OPENIB_QP_TYPE_PP(qp) && 0 == hdr->credits) {
BTL_OPENIB_GET_CREDITS(endpoint->qps[qp].u.pp_qp.rd_credits, hdr->credits);
}
} else {
hdr->credits |= (qp << 11);
}
BTL_OPENIB_GET_CREDITS(endpoint->qps[qp].u.pp_qp.cm_return, cm_return);
/* cm_seen is only 8 bits, but cm_return is 32 bits */
if(cm_return > 255) {
hdr->cm_seen = 255;
cm_return -= 255;
OPAL_THREAD_ADD_FETCH32(&endpoint->qps[qp].u.pp_qp.cm_return, cm_return);
} else {
hdr->cm_seen = cm_return;
}
return OPAL_SUCCESS;
}
/* called with the endpoint lock held. */
static inline void mca_btl_openib_endpoint_credit_release (struct mca_btl_base_endpoint_t *endpoint, int qp,
bool do_rdma, mca_btl_openib_send_frag_t *frag)
{
mca_btl_openib_header_t *hdr = frag->hdr;
if (BTL_OPENIB_IS_RDMA_CREDITS(hdr->credits)) {
OPAL_THREAD_ADD_FETCH32(&endpoint->eager_rdma_local.credits, BTL_OPENIB_CREDITS(hdr->credits));
}
if (do_rdma) {
OPAL_THREAD_ADD_FETCH32(&endpoint->eager_rdma_remote.tokens, 1);
} else {
if(BTL_OPENIB_QP_TYPE_PP(qp)) {
OPAL_THREAD_ADD_FETCH32 (&endpoint->qps[qp].u.pp_qp.rd_credits, hdr->credits);
OPAL_THREAD_ADD_FETCH32(&endpoint->qps[qp].u.pp_qp.sd_credits, 1);
} else if (BTL_OPENIB_QP_TYPE_SRQ(qp)) {
mca_btl_openib_module_t *openib_btl = endpoint->endpoint_btl;
OPAL_THREAD_ADD_FETCH32(&openib_btl->qps[qp].u.srq_qp.sd_credits, 1);
}
}
}
END_C_DECLS
#endif


@ -1,222 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2015 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2012 Oracle and/or its affiliates. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "btl_openib_frag.h"
#include "btl_openib_eager_rdma.h"
int mca_btl_openib_frag_init(opal_free_list_item_t* item, void* ctx)
{
mca_btl_openib_frag_init_data_t* init_data = (mca_btl_openib_frag_init_data_t *) ctx;
mca_btl_openib_frag_t *frag = to_base_frag(item);
if(MCA_BTL_OPENIB_FRAG_RECV == frag->type) {
to_recv_frag(frag)->qp_idx = init_data->order;
to_com_frag(frag)->sg_entry.length =
mca_btl_openib_component.qp_infos[init_data->order].size +
sizeof(mca_btl_openib_header_t) +
sizeof(mca_btl_openib_header_coalesced_t) +
sizeof(mca_btl_openib_control_header_t);
}
if(MCA_BTL_OPENIB_FRAG_SEND == frag->type)
to_send_frag(frag)->qp_idx = init_data->order;
frag->list = init_data->list;
return OPAL_SUCCESS;
}
static void base_constructor(mca_btl_openib_frag_t *frag)
{
frag->base.order = MCA_BTL_NO_ORDER;
}
static void com_constructor(mca_btl_openib_com_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
mca_btl_openib_reg_t* reg =
(mca_btl_openib_reg_t*)base_frag->base.super.registration;
frag->registration = reg;
if(reg) {
frag->sg_entry.lkey = reg->mr->lkey;
}
frag->n_wqes_inflight = 0;
}
static void out_constructor(mca_btl_openib_out_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->base.des_segments = &base_frag->segment;
base_frag->base.des_segment_count = 1;
frag->sr_desc.wr_id = (uint64_t)(uintptr_t)frag;
frag->sr_desc.sg_list = &to_com_frag(frag)->sg_entry;
frag->sr_desc.num_sge = 1;
frag->sr_desc.opcode = IBV_WR_SEND;
frag->sr_desc.send_flags = IBV_SEND_SIGNALED;
frag->sr_desc.next = NULL;
}
static void in_constructor(mca_btl_openib_in_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->base.des_segments = &base_frag->segment;
base_frag->base.des_segment_count = 1;
}
static void send_constructor(mca_btl_openib_send_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->type = MCA_BTL_OPENIB_FRAG_SEND;
frag->chdr = (mca_btl_openib_header_t*)base_frag->base.super.ptr;
frag->hdr = (mca_btl_openib_header_t*)
(((unsigned char*)base_frag->base.super.ptr) +
sizeof(mca_btl_openib_header_coalesced_t) +
sizeof(mca_btl_openib_control_header_t));
base_frag->segment.seg_addr.pval = frag->hdr + 1;
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t)frag->hdr;
frag->coalesced_length = 0;
OBJ_CONSTRUCT(&frag->coalesced_frags, opal_list_t);
}
static void recv_constructor(mca_btl_openib_recv_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->type = MCA_BTL_OPENIB_FRAG_RECV;
frag->hdr = (mca_btl_openib_header_t*)base_frag->base.super.ptr;
base_frag->segment.seg_addr.pval =
((unsigned char* )frag->hdr) + sizeof(mca_btl_openib_header_t);
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t)frag->hdr;
frag->rd_desc.wr_id = (uint64_t)(uintptr_t)frag;
frag->rd_desc.sg_list = &to_com_frag(frag)->sg_entry;
frag->rd_desc.num_sge = 1;
frag->rd_desc.next = NULL;
}
static void send_control_constructor(mca_btl_openib_send_control_frag_t *frag)
{
to_base_frag(frag)->type = MCA_BTL_OPENIB_FRAG_CONTROL;
/* adjusting headers because there is no coalesce header in control messages */
frag->hdr = frag->chdr;
to_base_frag(frag)->segment.seg_addr.pval = frag->hdr + 1;
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t)frag->hdr;
}
static void put_constructor(mca_btl_openib_put_frag_t *frag)
{
to_base_frag(frag)->type = MCA_BTL_OPENIB_FRAG_SEND_USER;
to_out_frag(frag)->sr_desc.opcode = IBV_WR_RDMA_WRITE;
frag->cb.func = NULL;
}
static void get_constructor(mca_btl_openib_get_frag_t *frag)
{
to_base_frag(frag)->type = MCA_BTL_OPENIB_FRAG_RECV_USER;
frag->sr_desc.wr_id = (uint64_t)(uintptr_t)frag;
frag->sr_desc.sg_list = &to_com_frag(frag)->sg_entry;
frag->sr_desc.num_sge = 1;
frag->sr_desc.opcode = IBV_WR_RDMA_READ;
frag->sr_desc.send_flags = IBV_SEND_SIGNALED;
frag->sr_desc.next = NULL;
}
static void coalesced_constructor(mca_btl_openib_coalesced_frag_t *frag)
{
mca_btl_openib_frag_t *base_frag = to_base_frag(frag);
base_frag->type = MCA_BTL_OPENIB_FRAG_COALESCED;
base_frag->base.des_segments = &base_frag->segment;
base_frag->base.des_segment_count = 1;
}
OBJ_CLASS_INSTANCE(
mca_btl_openib_frag_t,
mca_btl_base_descriptor_t,
base_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_com_frag_t,
mca_btl_openib_frag_t,
com_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_out_frag_t,
mca_btl_openib_com_frag_t,
out_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_in_frag_t,
mca_btl_openib_com_frag_t,
in_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_send_frag_t,
mca_btl_openib_out_frag_t,
send_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_recv_frag_t,
mca_btl_openib_in_frag_t,
recv_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_send_control_frag_t,
mca_btl_openib_send_frag_t,
send_control_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_put_frag_t,
mca_btl_openib_out_frag_t,
put_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_get_frag_t,
mca_btl_openib_in_frag_t,
get_constructor,
NULL);
OBJ_CLASS_INSTANCE(
mca_btl_openib_coalesced_frag_t,
mca_btl_openib_frag_t,
coalesced_constructor,
NULL);


@ -1,422 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2013 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2009 IBM Corporation. All rights reserved.
* Copyright (c) 2006-2015 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2010-2012 Oracle and/or its affiliates. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_IB_FRAG_H
#define MCA_BTL_IB_FRAG_H
#include "opal_config.h"
#include "opal/align.h"
#include "opal/mca/btl/btl.h"
#include <infiniband/verbs.h>
BEGIN_C_DECLS
struct mca_btl_openib_reg_t;
struct mca_btl_openib_header_t {
mca_btl_base_tag_t tag;
uint8_t cm_seen;
uint16_t credits;
#if OPAL_OPENIB_PAD_HDR
uint8_t padding[4];
#endif
};
typedef struct mca_btl_openib_header_t mca_btl_openib_header_t;
#define BTL_OPENIB_RDMA_CREDITS_FLAG (1<<15)
#define BTL_OPENIB_IS_RDMA_CREDITS(I) ((I)&BTL_OPENIB_RDMA_CREDITS_FLAG)
#define BTL_OPENIB_CREDITS(I) ((I)&~BTL_OPENIB_RDMA_CREDITS_FLAG)
#define BTL_OPENIB_HEADER_HTON(h) \
do { \
(h).credits = htons((h).credits); \
} while (0)
#define BTL_OPENIB_HEADER_NTOH(h) \
do { \
(h).credits = ntohs((h).credits); \
} while (0)
typedef struct mca_btl_openib_header_coalesced_t {
mca_btl_base_tag_t tag;
uint32_t size;
uint32_t alloc_size;
#if OPAL_OPENIB_PAD_HDR
uint8_t padding[4];
#endif
} mca_btl_openib_header_coalesced_t;
#define BTL_OPENIB_HEADER_COALESCED_NTOH(h) \
do { \
(h).size = ntohl((h).size); \
(h).alloc_size = ntohl((h).alloc_size); \
} while(0)
#define BTL_OPENIB_HEADER_COALESCED_HTON(h) \
do { \
(h).size = htonl((h).size); \
(h).alloc_size = htonl((h).alloc_size); \
} while(0)
#if OPAL_OPENIB_PAD_HDR
/* BTL_OPENIB_FTR_PADDING
* This macro is used to keep the pointer to openib footers aligned for
* systems like SPARC64 that take a big performance hit when addresses
* are not aligned (and by default sigbus instead of coercing the type on
* an unaligned address).
*
* Alignment of a packet's structures is ensured when OPAL_OPENIB_PAD_HDR
* is set to 1. In that case several structures are padded, and the
* BTL_OPENIB_FTR_PADDING macro shifts the pointer to the
* mca_btl_openib_footer_t structure so that the footer is properly
* aligned after the PML header and data.
* For example, when sending a 1-byte data packet, the memory layout without
* footer alignment would look something like the following:
*
* 0x00 : mca_btl_openib_header_coalesced_t (12 bytes + 4 byte pad)
* 0x10 : mca_btl_openib_control_header_t (1 byte + 7 byte pad)
* 0x18 : mca_btl_openib_header_t (4 bytes + 4 byte pad)
* 0x20 : PML Header and data (16 bytes PML + 1 byte data)
* 0x29 : mca_btl_openib_footer_t (4 bytes + 4 byte pad)
* 0x31 : end of packet
*
* By applying the BTL_OPENIB_FTR_PADDING() in the progress_one_device
* and post_send routines we adjust the pointer to mca_btl_openib_footer_t
* from 0x29 to 0x2C thus correctly aligning the start of the
* footer pointer. This adjustment will cause the padding field of
* mca_btl_openib_footer_t to overlap with the neighboring memory but since
* we never use the padding we do not end up inadvertently overwriting
* memory that does not belong to the fragment.
*/
#define BTL_OPENIB_FTR_PADDING(size) \
OPAL_ALIGN_PAD_AMOUNT(size, sizeof(uint64_t))
/* BTL_OPENIB_ALIGN_COALESCE_HDR
* This macro is used in btl_openib.c, while creating a coalesce fragment,
* to align the coalesce headers.
*/
#define BTL_OPENIB_ALIGN_COALESCE_HDR(ptr) \
OPAL_ALIGN_PTR(ptr, sizeof(uint32_t), unsigned char*)
/* BTL_OPENIB_COALESCE_HDR_PADDING
* This macro is used in btl_openib_component.c, while parsing an incoming
* coalesce fragment, to determine the padding amount used to align the
* mca_btl_openib_header_coalesced_t.
*/
#define BTL_OPENIB_COALESCE_HDR_PADDING(ptr) \
OPAL_ALIGN_PAD_AMOUNT(ptr, sizeof(uint32_t))
#else
#define BTL_OPENIB_FTR_PADDING(size) 0
#define BTL_OPENIB_ALIGN_COALESCE_HDR(ptr) ptr
#define BTL_OPENIB_COALESCE_HDR_PADDING(ptr) 0
#endif
struct mca_btl_openib_footer_t {
#if OPAL_ENABLE_DEBUG
uint32_t seq;
#endif
union {
uint32_t size;
uint8_t buf[4];
} u;
#if OPAL_OPENIB_PAD_HDR
#if OPAL_ENABLE_DEBUG
/* This footer's size must be a multiple of 8 bytes. Adding the seq
* field breaks that, and the padding cannot simply be removed: it is
* needed to adjust the alignment without overwriting other packets.
*/
uint8_t padding[12];
#else
uint8_t padding[8];
#endif
#endif
};
typedef struct mca_btl_openib_footer_t mca_btl_openib_footer_t;
#ifdef WORDS_BIGENDIAN
#define MCA_BTL_OPENIB_FTR_SIZE_REVERSE(ftr)
#else
#define MCA_BTL_OPENIB_FTR_SIZE_REVERSE(ftr) \
do { \
uint8_t tmp = (ftr).u.buf[0]; \
(ftr).u.buf[0]=(ftr).u.buf[2]; \
(ftr).u.buf[2]=tmp; \
} while (0)
#endif
#if OPAL_ENABLE_DEBUG
#define BTL_OPENIB_FOOTER_SEQ_HTON(h) ((h).seq = htonl((h).seq))
#define BTL_OPENIB_FOOTER_SEQ_NTOH(h) ((h).seq = ntohl((h).seq))
#else
#define BTL_OPENIB_FOOTER_SEQ_HTON(h)
#define BTL_OPENIB_FOOTER_SEQ_NTOH(h)
#endif
#define BTL_OPENIB_FOOTER_HTON(h) \
do { \
BTL_OPENIB_FOOTER_SEQ_HTON(h); \
MCA_BTL_OPENIB_FTR_SIZE_REVERSE(h); \
} while (0)
#define BTL_OPENIB_FOOTER_NTOH(h) \
do { \
BTL_OPENIB_FOOTER_SEQ_NTOH(h); \
MCA_BTL_OPENIB_FTR_SIZE_REVERSE(h); \
} while (0)
#define MCA_BTL_OPENIB_CONTROL_CREDITS 0
#define MCA_BTL_OPENIB_CONTROL_RDMA 1
#define MCA_BTL_OPENIB_CONTROL_COALESCED 2
#define MCA_BTL_OPENIB_CONTROL_CTS 3
struct mca_btl_openib_control_header_t {
uint8_t type;
#if OPAL_OPENIB_PAD_HDR
uint8_t padding[7];
#endif
};
typedef struct mca_btl_openib_control_header_t mca_btl_openib_control_header_t;
struct mca_btl_openib_eager_rdma_header_t {
mca_btl_openib_control_header_t control;
uint32_t rkey;
opal_ptr_t rdma_start;
};
typedef struct mca_btl_openib_eager_rdma_header_t mca_btl_openib_eager_rdma_header_t;
#define BTL_OPENIB_EAGER_RDMA_CONTROL_HEADER_HTON(h) \
do { \
(h).rkey = htonl((h).rkey); \
(h).rdma_start.lval = hton64((h).rdma_start.lval); \
} while (0)
#define BTL_OPENIB_EAGER_RDMA_CONTROL_HEADER_NTOH(h) \
do { \
(h).rkey = ntohl((h).rkey); \
(h).rdma_start.lval = ntoh64((h).rdma_start.lval); \
} while (0)
struct mca_btl_openib_rdma_credits_header_t {
mca_btl_openib_control_header_t control;
#if OPAL_OPENIB_PAD_HDR
uint8_t padding[1];
#endif
uint8_t qpn;
uint16_t rdma_credits;
};
typedef struct mca_btl_openib_rdma_credits_header_t mca_btl_openib_rdma_credits_header_t;
#define BTL_OPENIB_RDMA_CREDITS_HEADER_HTON(h) \
do { \
(h).rdma_credits = htons((h).rdma_credits); \
} while (0)
#define BTL_OPENIB_RDMA_CREDITS_HEADER_NTOH(h) \
do { \
(h).rdma_credits = ntohs((h).rdma_credits); \
} while (0)
enum mca_btl_openib_frag_type_t {
MCA_BTL_OPENIB_FRAG_RECV,
MCA_BTL_OPENIB_FRAG_RECV_USER,
MCA_BTL_OPENIB_FRAG_SEND,
MCA_BTL_OPENIB_FRAG_SEND_USER,
MCA_BTL_OPENIB_FRAG_EAGER_RDMA,
MCA_BTL_OPENIB_FRAG_CONTROL,
MCA_BTL_OPENIB_FRAG_COALESCED
};
typedef enum mca_btl_openib_frag_type_t mca_btl_openib_frag_type_t;
#define openib_frag_type(f) (to_base_frag(f)->type)
/**
* IB fragment derived type.
*/
/* base openib frag */
typedef struct mca_btl_openib_frag_t {
mca_btl_base_descriptor_t base;
mca_btl_base_segment_t segment;
mca_btl_openib_frag_type_t type;
opal_free_list_t* list;
} mca_btl_openib_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_frag_t);
#define to_base_frag(f) ((mca_btl_openib_frag_t*)(f))
/* frag used for communication */
typedef struct mca_btl_openib_com_frag_t {
mca_btl_openib_frag_t super;
struct ibv_sge sg_entry;
struct mca_btl_openib_reg_t *registration;
struct mca_btl_base_endpoint_t *endpoint;
/* number of unsignaled frags sent before this frag. */
uint32_t n_wqes_inflight;
} mca_btl_openib_com_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_com_frag_t);
#define to_com_frag(f) ((mca_btl_openib_com_frag_t*)(f))
typedef struct mca_btl_openib_out_frag_t {
mca_btl_openib_com_frag_t super;
struct ibv_send_wr sr_desc;
} mca_btl_openib_out_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_out_frag_t);
#define to_out_frag(f) ((mca_btl_openib_out_frag_t*)(f))
typedef struct mca_btl_openib_com_frag_t mca_btl_openib_in_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_in_frag_t);
#define to_in_frag(f) ((mca_btl_openib_in_frag_t*)(f))
typedef struct mca_btl_openib_send_frag_t {
mca_btl_openib_out_frag_t super;
mca_btl_openib_header_t *hdr, *chdr;
mca_btl_openib_footer_t *ftr;
uint8_t qp_idx;
uint32_t coalesced_length;
opal_list_t coalesced_frags;
} mca_btl_openib_send_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_send_frag_t);
#define to_send_frag(f) ((mca_btl_openib_send_frag_t*)(f))
typedef struct mca_btl_openib_recv_frag_t {
mca_btl_openib_in_frag_t super;
mca_btl_openib_header_t *hdr;
mca_btl_openib_footer_t *ftr;
struct ibv_recv_wr rd_desc;
uint8_t qp_idx;
} mca_btl_openib_recv_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_recv_frag_t);
#define to_recv_frag(f) ((mca_btl_openib_recv_frag_t*)(f))
typedef struct mca_btl_openib_put_frag_t {
mca_btl_openib_out_frag_t super;
struct {
mca_btl_base_rdma_completion_fn_t func;
mca_btl_base_registration_handle_t *local_handle;
void *context;
void *data;
} cb;
} mca_btl_openib_put_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_put_frag_t);
#define to_put_frag(f) ((mca_btl_openib_put_frag_t*)(f))
typedef struct mca_btl_openib_get_frag_t {
mca_btl_openib_in_frag_t super;
struct ibv_send_wr sr_desc;
struct {
mca_btl_base_rdma_completion_fn_t func;
mca_btl_base_registration_handle_t *local_handle;
void *context;
void *data;
} cb;
} mca_btl_openib_get_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_get_frag_t);
#define to_get_frag(f) ((mca_btl_openib_get_frag_t*)(f))
typedef struct mca_btl_openib_send_frag_t mca_btl_openib_send_control_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_send_control_frag_t);
#define to_send_control_frag(f) ((mca_btl_openib_send_control_frag_t*)(f))
typedef struct mca_btl_openib_coalesced_frag_t {
mca_btl_openib_frag_t super;
mca_btl_openib_send_frag_t *send_frag;
mca_btl_openib_header_coalesced_t *hdr;
bool sent;
} mca_btl_openib_coalesced_frag_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_coalesced_frag_t);
#define to_coalesced_frag(f) ((mca_btl_openib_coalesced_frag_t*)(f))
/*
* Allocate an IB send descriptor
*
*/
static inline mca_btl_openib_send_control_frag_t *
alloc_control_frag(mca_btl_openib_module_t *btl)
{
return to_send_control_frag(opal_free_list_wait (&btl->device->send_free_control));
}
static inline uint8_t frag_size_to_order(mca_btl_openib_module_t* btl,
size_t size)
{
int qp;
for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++)
if(mca_btl_openib_component.qp_infos[qp].size >= size)
return qp;
return MCA_BTL_NO_ORDER;
}
static inline mca_btl_openib_com_frag_t *alloc_send_user_frag(void)
{
return to_com_frag(opal_free_list_get (&mca_btl_openib_component.send_user_free));
}
static inline mca_btl_openib_com_frag_t *alloc_recv_user_frag(void)
{
return to_com_frag(opal_free_list_get (&mca_btl_openib_component.recv_user_free));
}
static inline mca_btl_openib_coalesced_frag_t *alloc_coalesced_frag(void)
{
return to_coalesced_frag(opal_free_list_get (&mca_btl_openib_component.send_free_coalesced));
}
#define MCA_BTL_IB_FRAG_RETURN(frag) \
do { \
opal_free_list_return (to_base_frag(frag)->list, \
(opal_free_list_item_t*)(frag)); \
} while(0)
#define MCA_BTL_OPENIB_CLEAN_PENDING_FRAGS(list) \
do { \
opal_list_item_t *_frag_item; \
while (NULL != (_frag_item = opal_list_remove_first(list))) { \
MCA_BTL_IB_FRAG_RETURN(_frag_item); \
} \
} while (0)
struct mca_btl_openib_module_t;
struct mca_btl_openib_frag_init_data_t {
uint8_t order;
opal_free_list_t* list;
};
typedef struct mca_btl_openib_frag_init_data_t mca_btl_openib_frag_init_data_t;
int mca_btl_openib_frag_init(opal_free_list_item_t* item, void* ctx);
END_C_DECLS
#endif
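
The padding macros above delegate to OPAL alignment helpers that are not shown in this diff. As a rough sketch of the arithmetic involved, the following computes how many bytes are needed to round an offset up to a power-of-two boundary. The names `pad_amount` and `align_up` are hypothetical and are not part of Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the OPAL macro itself: number of bytes needed
 * to round `offset` up to the next multiple of `align`.
 * `align` must be a power of two. */
static inline size_t pad_amount(uintptr_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* (-offset) mod align, computed with a mask */
    return (size_t)((~offset + 1u) & (uintptr_t)(align - 1));
}

/* Round an offset up to the next aligned boundary. */
static inline uintptr_t align_up(uintptr_t offset, size_t align)
{
    return offset + pad_amount(offset, align);
}
```

An offset that is already aligned yields a pad of zero, so applying the adjustment unconditionally is harmless.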


@ -1,167 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2013 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2008-2012 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2009 IBM Corporation. All rights reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved
* Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "btl_openib_frag.h"
#include "btl_openib_endpoint.h"
#include "btl_openib_proc.h"
#include "btl_openib_xrc.h"
/*
* RDMA READ remote buffer to local buffer address.
*/
int mca_btl_openib_get (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep, void *local_address,
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata)
{
mca_btl_openib_get_frag_t* frag = NULL;
int qp = order;
int rc;
if (OPAL_UNLIKELY(size > btl->btl_get_limit)) {
return OPAL_ERR_BAD_PARAM;
}
frag = to_get_frag(alloc_recv_user_frag());
if (OPAL_UNLIKELY(NULL == frag)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
if (MCA_BTL_NO_ORDER == qp) {
qp = mca_btl_openib_component.rdma_qp;
}
/* set base descriptor flags */
to_base_frag(frag)->base.order = qp;
/* free this descriptor when the operation is complete */
to_base_frag(frag)->base.des_flags = MCA_BTL_DES_FLAGS_BTL_OWNERSHIP;
/* set up scatter-gather entry */
to_com_frag(frag)->sg_entry.length = size;
to_com_frag(frag)->sg_entry.lkey = local_handle->lkey;
to_com_frag(frag)->sg_entry.addr = (uint64_t)(uintptr_t) local_address;
to_com_frag(frag)->endpoint = ep;
/* set up rdma callback */
frag->cb.func = cbfunc;
frag->cb.context = cbcontext;
frag->cb.data = cbdata;
frag->cb.local_handle = local_handle;
/* set up descriptor */
frag->sr_desc.wr.rdma.remote_addr = remote_address;
/* the opcode may have been changed by an atomic operation */
frag->sr_desc.opcode = IBV_WR_RDMA_READ;
#if OPAL_ENABLE_HETEROGENEOUS_SUPPORT
if((ep->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN)
!= (opal_proc_local_get()->proc_arch & OPAL_ARCH_ISBIGENDIAN)) {
frag->sr_desc.wr.rdma.rkey = opal_swap_bytes4 (remote_handle->rkey);
} else
#endif
{
frag->sr_desc.wr.rdma.rkey = remote_handle->rkey;
}
if (ep->endpoint_state != MCA_BTL_IB_CONNECTED) {
OPAL_THREAD_LOCK(&ep->endpoint_lock);
rc = check_endpoint_state(ep, &to_base_frag(frag)->base, &ep->pending_get_frags);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
if (OPAL_ERR_RESOURCE_BUSY == rc) {
return OPAL_SUCCESS;
}
if (OPAL_SUCCESS != rc) {
MCA_BTL_IB_FRAG_RETURN (frag);
return rc;
}
}
rc = mca_btl_openib_get_internal (btl, ep, frag);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
if (OPAL_LIKELY(OPAL_ERR_OUT_OF_RESOURCE == rc)) {
rc = OPAL_SUCCESS;
OPAL_THREAD_LOCK(&ep->endpoint_lock);
opal_list_append(&ep->pending_get_frags, (opal_list_item_t*)frag);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
} else {
MCA_BTL_IB_FRAG_RETURN (frag);
}
}
return rc;
}
int mca_btl_openib_get_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
mca_btl_openib_get_frag_t *frag)
{
int qp = to_base_frag(frag)->base.order;
struct ibv_send_wr *bad_wr;
#if HAVE_XRC
if (MCA_BTL_XRC_ENABLED && BTL_OPENIB_QP_TYPE_XRC(qp)) {
/* NTH: the remote SRQ number is only available once the endpoint is connected. By
* setting the value here instead of mca_btl_openib_get we guarantee the rem_srqs
* array is initialized. */
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
frag->sr_desc.qp_type.xrc.remote_srqn = ep->rem_info.rem_srqs[qp].rem_srq_num;
#else
frag->sr_desc.xrc_remote_srq_num = ep->rem_info.rem_srqs[qp].rem_srq_num;
#endif
}
#endif
/* check for a send wqe */
if (qp_get_wqe(ep, qp) < 0) {
qp_put_wqe(ep, qp);
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* check for a get token */
if (OPAL_THREAD_ADD_FETCH32(&ep->get_tokens,-1) < 0) {
qp_put_wqe(ep, qp);
OPAL_THREAD_ADD_FETCH32(&ep->get_tokens,1);
return OPAL_ERR_OUT_OF_RESOURCE;
}
qp_inflight_wqe_to_frag(ep, qp, to_com_frag(frag));
qp_reset_signal_count(ep, qp);
if (ibv_post_send(ep->qps[qp].qp->lcl_qp, &frag->sr_desc, &bad_wr)) {
qp_put_wqe(ep, qp);
OPAL_THREAD_ADD_FETCH32(&ep->get_tokens,1);
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}
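
The get path above reserves a work queue entry and a get token by decrementing a counter and undoing the decrement when the result is negative. A minimal sketch of that reserve-or-roll-back pattern, written with C11 atomics instead of the OPAL atomic wrappers, might look as follows. The `token_pool_t` type and its functions are hypothetical stand-ins, not Open MPI APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-endpoint credit counter. */
typedef struct {
    atomic_int_least32_t tokens;
} token_pool_t;

static void token_pool_init(token_pool_t *pool, int32_t count)
{
    assert(count >= 0);
    atomic_init(&pool->tokens, count);
}

/* Try to take one token. Decrement first, then undo the decrement if the
 * counter went negative. */
static bool token_try_acquire(token_pool_t *pool)
{
    /* atomic_fetch_sub returns the value held before the subtraction */
    if (atomic_fetch_sub(&pool->tokens, 1) - 1 < 0) {
        atomic_fetch_add(&pool->tokens, 1);
        return false;
    }
    return true;
}

/* Give a token back, for example after a failed post or a completion. */
static void token_release(token_pool_t *pool)
{
    atomic_fetch_add(&pool->tokens, 1);
}
```

Decrementing before checking keeps the fast path to a single atomic operation. The counter can dip below zero briefly under contention, which is why the failure path restores it.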


@ -1,664 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2008 Mellanox Technologies. All rights reserved.
* Copyright (c) 2012-2017 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved
* Copyright (c) 2014-2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <string.h>
#include <ctype.h>
#include <stdlib.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#include "opal/util/show_help.h"
#include "opal/util/string_copy.h"
#include "btl_openib.h"
#include "btl_openib_lex.h"
#include "btl_openib_ini.h"
static const char *ini_filename = NULL;
static bool initialized = false;
static opal_list_t devices;
static char *key_buffer = NULL;
static size_t key_buffer_len = 0;
/*
* Struct to hold the section name, vendor ID, and list of vendor part
* IDs and a corresponding set of values (parsed from an INI file).
*/
typedef struct parsed_section_values_t {
char *name;
uint32_t *vendor_ids;
int vendor_ids_len;
uint32_t *vendor_part_ids;
int vendor_part_ids_len;
opal_btl_openib_ini_values_t values;
} parsed_section_values_t;
/*
* Struct to hold the final values. Different from above in a few ways:
*
* - The vendor and part IDs will always be set properly
* - There will only be one part ID (i.e., the above struct is
* exploded into multiple of these for ease of searching)
* - There is a super of opal_list_item_t so that we can have a list
* of these
*/
typedef struct device_values_t {
opal_list_item_t super;
char *section_name;
uint32_t vendor_id;
uint32_t vendor_part_id;
opal_btl_openib_ini_values_t values;
} device_values_t;
static void device_values_constructor(device_values_t *s);
static void device_values_destructor(device_values_t *s);
OBJ_CLASS_INSTANCE(device_values_t,
opal_list_item_t,
device_values_constructor,
device_values_destructor);
/*
* Local functions
*/
static int parse_file(char *filename);
static int parse_line(parsed_section_values_t *item);
static void reset_section(bool had_previous_value, parsed_section_values_t *s);
static void reset_values(opal_btl_openib_ini_values_t *v);
static int save_section(parsed_section_values_t *s);
/*
* Read the INI files for device-specific values and save them in
* internal data structures for later lookup.
*/
int opal_btl_openib_ini_init(void)
{
int ret = OPAL_ERR_NOT_FOUND;
char *colon;
char separator = ':';
OBJ_CONSTRUCT(&devices, opal_list_t);
colon = strchr(mca_btl_openib_component.device_params_file_names, separator);
if (NULL == colon) {
/* If we've only got 1 file (i.e., no colons found), parse it
and be done */
ret = parse_file(mca_btl_openib_component.device_params_file_names);
} else {
/* Otherwise, loop over all the files and parse them */
char *orig = strdup(mca_btl_openib_component.device_params_file_names);
char *str = orig;
while (NULL != (colon = strchr(str, ':'))) {
*colon = '\0';
ret = parse_file(str);
/* Note that NOT_FOUND and SUCCESS are not fatal errors
and we keep going. Other errors are treated as
fatal */
if (OPAL_ERR_NOT_FOUND != ret && OPAL_SUCCESS != ret) {
break;
}
str = colon + 1;
}
/* Parse the last file if we didn't have a fatal error above */
if (OPAL_ERR_NOT_FOUND == ret || OPAL_SUCCESS == ret) {
ret = parse_file(str);
}
/* All done */
free(orig);
}
/* Return SUCCESS unless we got a fatal error */
initialized = true;
return (OPAL_SUCCESS == ret || OPAL_ERR_NOT_FOUND == ret) ?
OPAL_SUCCESS : ret;
}
/*
* The component found a device and is querying to see if an INI file
* specified any parameters for it.
*/
int opal_btl_openib_ini_query(uint32_t vendor_id, uint32_t vendor_part_id,
opal_btl_openib_ini_values_t *values)
{
int ret;
device_values_t *h;
if (!initialized) {
if (OPAL_SUCCESS != (ret = opal_btl_openib_ini_init())) {
return ret;
}
}
if (mca_btl_openib_component.verbose) {
BTL_OUTPUT(("Querying INI files for vendor 0x%04x, part ID %d",
vendor_id, vendor_part_id));
}
reset_values(values);
/* Iterate over all the saved devices */
OPAL_LIST_FOREACH(h, &devices, device_values_t) {
if (vendor_id == h->vendor_id &&
vendor_part_id == h->vendor_part_id) {
/* Found it! */
/* NOTE: There is a bug in the PGI 6.2 series that causes
the compiler to choke when copying structs containing
bool members by value. So do a memcpy here instead. */
memcpy(values, &h->values, sizeof(h->values));
if (mca_btl_openib_component.verbose) {
BTL_OUTPUT(("Found corresponding INI values: %s",
h->section_name));
}
return OPAL_SUCCESS;
}
}
/* If we fall through to here, we didn't find it */
if (mca_btl_openib_component.verbose) {
BTL_OUTPUT(("Did not find corresponding INI values"));
}
return OPAL_ERR_NOT_FOUND;
}
/*
* The component is shutting down; release all internal state
*/
int opal_btl_openib_ini_finalize(void)
{
if (initialized) {
OPAL_LIST_DESTRUCT(&devices);
initialized = false;
}
return OPAL_SUCCESS;
}
/**************************************************************************/
/*
* Parse a single file
*/
static int parse_file(char *filename)
{
int val;
int ret = OPAL_SUCCESS;
bool showed_no_section_warning = false;
bool showed_unexpected_tokens_warning = false;
parsed_section_values_t section;
reset_section(false, &section);
/* Open the file */
ini_filename = filename;
btl_openib_ini_yyin = fopen(filename, "r");
if (NULL == btl_openib_ini_yyin) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:file not found",
true, filename);
ret = OPAL_ERR_NOT_FOUND;
goto cleanup;
}
/* Do the parsing */
btl_openib_ini_parse_done = false;
btl_openib_ini_yynewlines = 1;
btl_openib_ini_init_buffer(btl_openib_ini_yyin);
while (!btl_openib_ini_parse_done) {
val = btl_openib_ini_yylex();
switch (val) {
case BTL_OPENIB_INI_PARSE_DONE:
/* This will also set btl_openib_ini_parse_done to true, so just
break here */
break;
case BTL_OPENIB_INI_PARSE_NEWLINE:
/* blank line! ignore it */
break;
case BTL_OPENIB_INI_PARSE_SECTION:
/* We're starting a new section; if we have previously
parsed a section, go see if we can use its values. */
save_section(&section);
reset_section(true, &section);
section.name = strdup(btl_openib_ini_yytext);
break;
case BTL_OPENIB_INI_PARSE_SINGLE_WORD:
if (NULL == section.name) {
/* Warn that there is no current section, and ignore
this parameter */
if (!showed_no_section_warning) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:not in a section", true);
showed_no_section_warning = true;
}
/* Parse it and then dump it */
parse_line(&section);
reset_section(true, &section);
} else {
parse_line(&section);
}
break;
default:
/* anything else is an error */
if (!showed_unexpected_tokens_warning) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:unexpected token", true);
showed_unexpected_tokens_warning = true;
}
break;
}
}
save_section(&section);
fclose(btl_openib_ini_yyin);
btl_openib_ini_yylex_destroy ();
cleanup:
reset_section(true, &section);
if (NULL != key_buffer) {
free(key_buffer);
key_buffer = NULL;
key_buffer_len = 0;
}
return ret;
}
/*
* Parse a single line in the INI file
*/
static int parse_line(parsed_section_values_t *sv)
{
int val, ret = OPAL_SUCCESS;
char *value = NULL;
bool showed_unknown_field_warning = false;
/* Save the key name */
if (key_buffer_len < strlen(btl_openib_ini_yytext) + 1) {
char *tmp;
key_buffer_len = strlen(btl_openib_ini_yytext) + 1;
tmp = (char *) realloc(key_buffer, key_buffer_len);
if (NULL == tmp) {
free(key_buffer);
key_buffer_len = 0;
key_buffer = NULL;
return OPAL_ERR_TEMP_OUT_OF_RESOURCE;
}
key_buffer = tmp;
}
opal_string_copy(key_buffer, btl_openib_ini_yytext, key_buffer_len);
/* The first thing we have to see is an "=" */
val = btl_openib_ini_yylex();
if (btl_openib_ini_parse_done || BTL_OPENIB_INI_PARSE_EQUAL != val) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:expected equals", true);
return OPAL_ERROR;
}
/* Next we get the value */
val = btl_openib_ini_yylex();
if (BTL_OPENIB_INI_PARSE_SINGLE_WORD != val && BTL_OPENIB_INI_PARSE_VALUE != val) {
return OPAL_ERROR;
}
value = strdup(btl_openib_ini_yytext);
/* Now we need to see the newline */
val = btl_openib_ini_yylex();
/* If we did not get EOL or EOF, something is wrong */
if (BTL_OPENIB_INI_PARSE_NEWLINE != val && BTL_OPENIB_INI_PARSE_DONE != val) {
opal_show_help("help-mpi-btl-openib.txt", "ini file:expected newline", true);
free(value);
return OPAL_ERROR;
}
/* Ok, we got a good parse. Now figure out what it is and save
the value. Note that flex already took care of trimming
all whitespace at the beginning and end of the value. */
if (0 == strcasecmp(key_buffer, "vendor_id")) {
if (OPAL_SUCCESS != (ret = opal_btl_openib_ini_intify_list(value, &sv->vendor_ids,
&sv->vendor_ids_len))) {
/* Do not leak the strdup'ed value on the error path */
free(value);
return ret;
}
}
else if (0 == strcasecmp(key_buffer, "vendor_part_id")) {
if (OPAL_SUCCESS != (ret = opal_btl_openib_ini_intify_list(value, &sv->vendor_part_ids,
&sv->vendor_part_ids_len))) {
/* Do not leak the strdup'ed value on the error path */
free(value);
return ret;
}
}
else if (0 == strcasecmp(key_buffer, "mtu")) {
/* Single value */
sv->values.mtu = (uint32_t) opal_btl_openib_ini_intify(value);
sv->values.mtu_set = true;
}
else if (0 == strcasecmp(key_buffer, "use_eager_rdma")) {
/* Single value */
sv->values.use_eager_rdma = (uint32_t) opal_btl_openib_ini_intify(value);
sv->values.use_eager_rdma_set = true;
}
else if (0 == strcasecmp(key_buffer, "receive_queues")) {
/* Single value (already strdup'ed) */
sv->values.receive_queues = value;
value = NULL;
}
else if (0 == strcasecmp(key_buffer, "max_inline_data")) {
/* Single value */
sv->values.max_inline_data = (int32_t) opal_btl_openib_ini_intify(value);
sv->values.max_inline_data_set = true;
}
else if (0 == strcasecmp(key_buffer, "rdmacm_reject_causes_connect_error")) {
/* Single value */
sv->values.rdmacm_reject_causes_connect_error =
(bool) opal_btl_openib_ini_intify(value);
sv->values.rdmacm_reject_causes_connect_error_set = true;
}
else if (0 == strcasecmp(key_buffer, "ignore_device")) {
/* Single value */
sv->values.ignore_device = (bool) opal_btl_openib_ini_intify(value);
sv->values.ignore_device_set = true;
}
else {
/* Have no idea what this parameter is. Not an error -- just
ignore it */
if (!showed_unknown_field_warning) {
opal_show_help("help-mpi-btl-openib.txt",
"ini file:unknown field", true,
ini_filename, btl_openib_ini_yynewlines,
key_buffer);
showed_unknown_field_warning = true;
}
}
/* All done */
if (NULL != value) {
free(value);
}
return ret;
}
/*
* Construct a device_values_t and set all of its values to known states
*/
static void device_values_constructor(device_values_t *s)
{
s->section_name = NULL;
s->vendor_id = 0;
s->vendor_part_id = 0;
reset_values(&s->values);
}
/*
* Destruct a device_values_t and free any memory that it has
*/
static void device_values_destructor(device_values_t *s)
{
if (NULL != s->section_name) {
free(s->section_name);
}
if (NULL != s->values.receive_queues) {
free(s->values.receive_queues);
}
}
/*
* Reset a parsed section; free any memory that it may have had
*/
static void reset_section(bool had_previous_value, parsed_section_values_t *s)
{
if (had_previous_value) {
if (NULL != s->name) {
free(s->name);
}
if (NULL != s->vendor_ids) {
free(s->vendor_ids);
}
if (NULL != s->vendor_part_ids) {
free(s->vendor_part_ids);
}
}
s->name = NULL;
s->vendor_ids = NULL;
s->vendor_ids_len = 0;
s->vendor_part_ids = NULL;
s->vendor_part_ids_len = 0;
reset_values(&s->values);
}
/*
* Reset the values to known states
*/
static void reset_values(opal_btl_openib_ini_values_t *v)
{
v->mtu = 0;
v->mtu_set = false;
v->use_eager_rdma = 0;
v->use_eager_rdma_set = false;
v->receive_queues = NULL;
v->max_inline_data = 0;
v->max_inline_data_set = false;
v->rdmacm_reject_causes_connect_error = false;
v->rdmacm_reject_causes_connect_error_set = false;
v->ignore_device = false;
v->ignore_device_set = false;
}
/*
* If we have a valid section, see if we have a matching section
* somewhere (i.e., same vendor ID and vendor part ID). If we do,
* update the values. If not, save the values in a new instance and
* add it to the list.
*/
static int save_section(parsed_section_values_t *s)
{
int i, j;
device_values_t *h;
bool found;
/* Is the parsed section valid? */
if (NULL == s->name || 0 == s->vendor_ids_len ||
0 == s->vendor_part_ids_len) {
return OPAL_ERR_BAD_PARAM;
}
/* Iterate over each of the vendor/part IDs in the parsed
values */
for (i = 0; i < s->vendor_ids_len; ++i) {
for (j = 0; j < s->vendor_part_ids_len; ++j) {
found = false;
/* Iterate over all the saved devices */
OPAL_LIST_FOREACH(h, &devices, device_values_t) {
if (s->vendor_ids[i] == h->vendor_id &&
s->vendor_part_ids[j] == h->vendor_part_id) {
/* Found a match. Update any newly-set values. */
if (s->values.mtu_set) {
h->values.mtu = s->values.mtu;
h->values.mtu_set = true;
}
if (s->values.use_eager_rdma_set) {
h->values.use_eager_rdma = s->values.use_eager_rdma;
h->values.use_eager_rdma_set = true;
}
if (NULL != s->values.receive_queues) {
/* Release any previously saved string before replacing it */
free(h->values.receive_queues);
h->values.receive_queues =
strdup(s->values.receive_queues);
}
if (s->values.max_inline_data_set) {
h->values.max_inline_data = s->values.max_inline_data;
h->values.max_inline_data_set = true;
}
if (s->values.rdmacm_reject_causes_connect_error_set) {
h->values.rdmacm_reject_causes_connect_error =
s->values.rdmacm_reject_causes_connect_error;
h->values.rdmacm_reject_causes_connect_error_set =
true;
}
if (s->values.ignore_device_set) {
h->values.ignore_device = s->values.ignore_device;
h->values.ignore_device_set = true;
}
found = true;
break;
}
}
/* Did we find/update it in the existing list? If not,
create a new one. */
if (!found) {
h = OBJ_NEW(device_values_t);
h->section_name = strdup(s->name);
h->vendor_id = s->vendor_ids[i];
h->vendor_part_id = s->vendor_part_ids[j];
/* NOTE: There is a bug in the PGI 6.2 series that
causes the compiler to choke when copying structs
containing bool members by value. So do a memcpy
here instead. */
memcpy(&h->values, &s->values, sizeof(s->values));
/* Need to strdup the string, though */
if (NULL != s->values.receive_queues) {
h->values.receive_queues = strdup(s->values.receive_queues);
}
opal_list_append(&devices, &h->super);
}
}
}
/* All done */
return OPAL_SUCCESS;
}
/*
* Do string-to-integer conversion, for both hex and decimal numbers
*/
int opal_btl_openib_ini_intify(char *str)
{
return strtol (str, NULL, 0);
}
/*
* Take a comma-delimited list and intify each entry
*/
int opal_btl_openib_ini_intify_list(char *value, uint32_t **values, int *len)
{
char *comma;
char *str = value;
*len = 0;
/* Comma-delimited list of values */
comma = strchr(str, ',');
if (NULL == comma) {
/* If we only got one value (i.e., no comma found), then
just make an array of one value and save it */
*values = (uint32_t *) malloc(sizeof(uint32_t));
if (NULL == *values) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
(*values)[0] = (uint32_t) opal_btl_openib_ini_intify(str);
*len = 1;
} else {
int newsize = 1;
/* Count how many values there are and allocate enough space
for them */
while (NULL != comma) {
++newsize;
str = comma + 1;
comma = strchr(str, ',');
}
*values = (uint32_t *) malloc(sizeof(uint32_t) * newsize);
if (NULL == *values) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* Iterate over the values and save them */
str = value;
comma = strchr(str, ',');
while (NULL != comma) {
*comma = '\0';
(*values)[*len] = (uint32_t) opal_btl_openib_ini_intify(str);
++(*len);
str = comma + 1;
comma = strchr(str, ',');
}
/* Get the last value (i.e., the value after the last
comma, because it won't have been snarfed in the
loop) */
(*values)[*len] = (uint32_t) opal_btl_openib_ini_intify(str);
++(*len);
}
return OPAL_SUCCESS;
}
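
The list conversion above counts commas, allocates an array, and then converts each entry with base autodetection. A compact sketch of the same idea that leaves the input string unmodified could look like this. The function name `parse_id_list` and its return convention are assumptions made for illustration and do not appear in Open MPI.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse "0x2c9,0x5ad,713" into a malloc'ed array.
 * Returns 0 on success and -1 on allocation failure; the caller frees *out. */
static int parse_id_list(const char *text, uint32_t **out, int *len)
{
    assert(text != NULL && out != NULL && len != NULL);

    /* One value, plus one more for every comma. */
    int count = 1;
    for (const char *p = text; (p = strchr(p, ',')) != NULL; ++p) {
        ++count;
    }

    uint32_t *vals = malloc(sizeof(*vals) * (size_t)count);
    if (vals == NULL) {
        return -1;
    }

    const char *cursor = text;
    for (int i = 0; i < count; ++i) {
        char *end;
        /* Base 0 lets strtoul accept decimal, 0x-prefixed hex, and octal. */
        vals[i] = (uint32_t)strtoul(cursor, &end, 0);
        /* Skip to just past the next comma, if there is one. */
        const char *comma = strchr(end, ',');
        cursor = (comma != NULL) ? comma + 1 : end;
    }

    *out = vals;
    *len = count;
    return 0;
}
```

An empty entry such as the middle one in "1,,2" converts to 0 here, which is a choice of this sketch and not a documented behavior of the original parser.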


@ -1,70 +0,0 @@
/*
* Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2008 Mellanox Technologies. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_PTL_IB_PARAMS_H
#define MCA_PTL_IB_PARAMS_H
#include "btl_openib.h"
/*
* Struct to hold the settable values that may be specified in the INI
* file
*/
typedef struct opal_btl_openib_ini_values_t {
uint32_t mtu;
bool mtu_set;
uint32_t use_eager_rdma;
bool use_eager_rdma_set;
char *receive_queues;
int32_t max_inline_data;
bool max_inline_data_set;
bool rdmacm_reject_causes_connect_error;
bool rdmacm_reject_causes_connect_error_set;
bool ignore_device;
bool ignore_device_set;
} opal_btl_openib_ini_values_t;
BEGIN_C_DECLS
/**
* Read in the INI files containing device params
*/
int opal_btl_openib_ini_init(void);
/**
* Query the read-in params for a given device
*/
int opal_btl_openib_ini_query(uint32_t vendor_id,
uint32_t vendor_part_id,
opal_btl_openib_ini_values_t *values);
/**
* Shut down / release all internal state
*/
int opal_btl_openib_ini_finalize(void);
/**
* String-to-int converters with decimal/hex autodetection
*/
int opal_btl_openib_ini_intify(char *string);
int opal_btl_openib_ini_intify_list(char *str, uint32_t **values, int *len);
END_C_DECLS
#endif


@ -1,433 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2008 Chelsio, Inc. All rights reserved.
* Copyright (c) 2008-2010 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
* Copyright (c) 2017 Los Alamos National Security, LLC. All rights
* reserved.
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#include "opal_config.h"
#include <infiniband/verbs.h>
#if OPAL_HAVE_RDMACM
#include <rdma/rdma_cma.h>
#include <stdlib.h>
#include <stdio.h>
#include "opal/util/argv.h"
#include "opal/util/if.h"
#include "opal/util/proc.h"
#include "opal/util/show_help.h"
#include "connect/connect.h"
#endif
/* Always want to include this file */
#include "btl_openib_endpoint.h"
#include "btl_openib_ip.h"
#if OPAL_HAVE_RDMACM
/*
* The code below maintains the linked list of RDMA IPv4 addresses and their
* associated RDMA device names and device port numbers.
*/
struct rdma_addr_list {
opal_list_item_t super;
uint32_t addr;
uint32_t subnet;
char addr_str[16];
char dev_name[IBV_SYSFS_NAME_MAX];
uint8_t dev_port;
};
typedef struct rdma_addr_list rdma_addr_list_t;
static OBJ_CLASS_INSTANCE(rdma_addr_list_t, opal_list_item_t,
NULL, NULL);
static opal_list_t *myaddrs = NULL;
#if OPAL_ENABLE_DEBUG
static char *stringify(uint32_t addr)
{
static char line[64];
memset(line, 0, sizeof(line));
snprintf(line, sizeof(line) - 1, "%d.%d.%d.%d (0x%x)",
#if defined(WORDS_BIGENDIAN)
(addr >> 24),
(addr >> 16) & 0xff,
(addr >> 8) & 0xff,
addr & 0xff,
#else
addr & 0xff,
(addr >> 8) & 0xff,
(addr >> 16) & 0xff,
(addr >> 24),
#endif
addr);
return line;
}
#endif
/* Note that each device port can have multiple IP addresses
* associated with it (aka IP aliasing). However, the openib module
* only knows about (device,port) tuples -- not IP addresses (only the
* RDMA CM CPC knows which IP addresses are associated with each
* (device,port) tuple). Thus, any search of the device list for
* IP addresses or subnets may not work as one might expect. The
* current behavior is to return the IP address (or subnet) of the
* *first* instance of the device on the list. This behavior is
* uniform for subnet and IP addresses and thus should not cause any
* mismatches. If this behavior is not preferred by the user, the MCA
* parameters to include/exclude specific IP addresses can be used to
* precisely specify which addresses are used (e.g., to effect
* specific subnet routing).
*/
uint64_t mca_btl_openib_get_ip_subnet_id(struct ibv_device *ib_dev,
uint8_t port)
{
struct rdma_addr_list *addr;
/* In the off chance that the user forces a non-RDMACM CPC and an
* IP-based mechanism, the list will be uninitialized. Return 0
* to prevent crashes, and the lack of it actually working will be
* caught at a later stage.
*/
if (NULL == myaddrs) {
return 0;
}
OPAL_LIST_FOREACH(addr, myaddrs, struct rdma_addr_list) {
if (!strcmp(addr->dev_name, ib_dev->name) &&
port == addr->dev_port) {
return addr->subnet;
}
}
return 0;
}
/* This function should not be necessary, as rdma_get_local_addr would
* be more correct in returning the IP address given the cm_id (and
* not necessitate having to do a list look up). Unfortunately, the
* subnet and IP address look up needs to match or there could be a
* mismatch if IP Aliases are being used. For more information on
* this, please read comment above mca_btl_openib_get_ip_subnet_id.
*/
uint32_t mca_btl_openib_rdma_get_ipv4addr(struct ibv_context *verbs,
uint8_t port)
{
struct rdma_addr_list *addr;
/* Sanity check */
if (NULL == myaddrs) {
return 0;
}
BTL_VERBOSE(("Looking for %s:%d in IP address list",
ibv_get_device_name(verbs->device), port));
OPAL_LIST_FOREACH(addr, myaddrs, struct rdma_addr_list) {
if (!strcmp(addr->dev_name, verbs->device->name) &&
port == addr->dev_port) {
BTL_VERBOSE(("FOUND: %s:%d is %s",
ibv_get_device_name(verbs->device), port,
stringify(addr->addr)));
return addr->addr;
}
}
return 0;
}
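/* Returns 0 if the (device, port) pair may be used, or 1 if it is
 * filtered out by the if_include / if_exclude MCA parameters. */
static int dev_specified(char *name, int port)
{
char **list;
int i, rc;
if (NULL != mca_btl_openib_component.if_include) {
/* With an include list, anything not listed is filtered out */
rc = 1;
list = opal_argv_split(mca_btl_openib_component.if_include, ',');
for (i = 0; NULL != list && NULL != list[i] && 1 == rc; i++) {
char **temp = opal_argv_split(list[i], ':');
if (NULL != temp && NULL != temp[0] &&
0 == strcmp(name, temp[0]) &&
(NULL == temp[1] || port == atoi(temp[1]))) {
rc = 0;
}
if (NULL != temp) {
opal_argv_free(temp);
}
}
if (NULL != list) {
opal_argv_free(list);
}
return rc;
}
if (NULL != mca_btl_openib_component.if_exclude) {
/* With an exclude list, only listed entries are filtered out */
rc = 0;
list = opal_argv_split(mca_btl_openib_component.if_exclude, ',');
for (i = 0; NULL != list && NULL != list[i] && 0 == rc; i++) {
char **temp = opal_argv_split(list[i], ':');
if (NULL != temp && NULL != temp[0] &&
0 == strcmp(name, temp[0]) &&
(NULL == temp[1] || port == atoi(temp[1]))) {
rc = 1;
}
if (NULL != temp) {
opal_argv_free(temp);
}
}
if (NULL != list) {
opal_argv_free(list);
}
return rc;
}
return 0;
}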
static int ipaddr_specified(struct sockaddr_in *ipaddr, uint32_t netmask)
{
uint32_t all = ~((uint32_t) 0);
if (NULL != mca_btl_openib_component.ipaddr_include) {
char **list;
int i;
list = opal_argv_split(mca_btl_openib_component.ipaddr_include, ',');
for (i = 0; NULL != list[i]; i++) {
uint32_t subnet, list_subnet;
struct in_addr ipae;
char **temp = opal_argv_split(list[i], '/');
if (NULL == temp || NULL == temp[0] || NULL == temp[1] ||
NULL != temp[2]) {
opal_show_help("help-mpi-btl-openib.txt",
"invalid ipaddr_inexclude", true, "include",
opal_process_info.nodename, list[i],
"Invalid specification (missing \"/\")");
if (NULL != temp) {
opal_argv_free(temp);
}
continue;
}
if (1 != inet_pton(ipaddr->sin_family, temp[0], &ipae)) {
opal_show_help("help-mpi-btl-openib.txt",
"invalid ipaddr_inexclude", true, "include",
opal_process_info.nodename, list[i],
"Invalid specification (inet_pton() failed)");
opal_argv_free(temp);
continue;
}
list_subnet = ntohl(ipae.s_addr) & ~(all >> atoi(temp[1]));
subnet = ntohl(ipaddr->sin_addr.s_addr) & ~(all >> netmask);
opal_argv_free(temp);
if (subnet == list_subnet) {
return 0;
}
}
return 1;
}
if (NULL != mca_btl_openib_component.ipaddr_exclude) {
char **list;
int i;
list = opal_argv_split(mca_btl_openib_component.ipaddr_exclude, ',');
for (i = 0; NULL != list[i]; i++) {
uint32_t subnet, list_subnet;
struct in_addr ipae;
char **temp = opal_argv_split(list[i], '/');
if (NULL == temp || NULL == temp[0] || NULL == temp[1] ||
NULL != temp[2]) {
opal_show_help("help-mpi-btl-openib.txt",
"invalid ipaddr_inexclude", true, "exclude",
opal_process_info.nodename, list[i],
"Invalid specification (missing \"/\")");
if (NULL != temp) {
opal_argv_free(temp);
}
continue;
}
if (1 != inet_pton(ipaddr->sin_family, temp[0], &ipae)) {
opal_show_help("help-mpi-btl-openib.txt",
"invalid ipaddr_inexclude", true, "exclude",
opal_process_info.nodename, list[i],
"Invalid specification (inet_pton() failed)");
opal_argv_free(temp);
continue;
}
list_subnet = ntohl(ipae.s_addr) & ~(all >> atoi(temp[1]));
subnet = ntohl(ipaddr->sin_addr.s_addr) & ~(all >> netmask);
opal_argv_free(temp);
if (subnet == list_subnet) {
return 1;
}
}
}
return 0;
}
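The include/exclude checks above reduce to one operation: mask two IPv4 addresses down to their network part and compare. The following is a minimal standalone sketch of that comparison, not the BTL's code. The helper names `ipv4_network` and `ipv4_in_cidr` are hypothetical, and the sketch applies one prefix length to both addresses, whereas the original masks the interface address with the interface's own netmask and the list entry with the prefix given after the "/". It also special-cases a prefix of 32, because shifting a 32-bit value by 32 bits is undefined in C.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical helper: return the network part of a host-byte-order
 * IPv4 address, where prefix_len is the number of leading network bits. */
static uint32_t ipv4_network(uint32_t addr_host, unsigned prefix_len)
{
    assert(prefix_len <= 32);
    /* A shift by 32 would be undefined, so /32 gets an explicit full mask */
    uint32_t mask = (32 == prefix_len)
        ? UINT32_MAX
        : (uint32_t) ~(UINT32_MAX >> prefix_len);
    return addr_host & mask;
}

/* Hypothetical helper: does dotted-quad "ip" fall inside "net"/prefix_len?
 * Returns false if either string fails to parse. */
static bool ipv4_in_cidr(const char *ip, const char *net, unsigned prefix_len)
{
    struct in_addr a, n;
    if (1 != inet_pton(AF_INET, ip, &a) ||
        1 != inet_pton(AF_INET, net, &n)) {
        return false;
    }
    /* inet_pton() yields network byte order; convert before masking */
    return ipv4_network(ntohl(a.s_addr), prefix_len) ==
           ipv4_network(ntohl(n.s_addr), prefix_len);
}
```

Under these assumptions, an entry such as 192.168.1.0/24 matches 192.168.1.77 but not 192.168.2.77.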
static int add_rdma_addr(struct sockaddr *ipaddr, uint32_t netmask)
{
struct sockaddr_in *sinp;
struct rdma_cm_id *cm_id;
struct rdma_event_channel *ch;
int rc = OPAL_SUCCESS;
struct rdma_addr_list *myaddr;
uint32_t all = ~((uint32_t) 0);
/* Ensure that this IP address is not in 127.0.0.1/8. If it is,
skip it because we never want loopback addresses to be
considered RDMA devices that remote peers can use to connect
to.
This check is necessary because of a change that almost went
into RDMA CM in OFED 1.5.1. We asked for a delay so that we
could get a release of Open MPI out that includes the
127-ignoring logic; hence, this change will likely be in a
future version of OFED (perhaps OFED 1.6?).
OMPI uses rdma_bind_addr() to determine if a local IP address
is an RDMA device or not. If it succeeds and we get a non-NULL
verbs pointer back in the return, we say that it's a valid RDMA
device. Up through OFED 1.5, rdma_bind_addr(127.0.0.1), would
succeed, but the verbs pointer returned would be NULL. Hence,
we knew it was loopback, and therefore we skipped it.
The proposed RDMA CM change would return a non-NULL/valid verbs
pointer when binding to 127.0.0.1/8. This, of course, screws
up OMPI because we then advertise 127.0.0.1 in the modex as an
address that remote peers can use to contact this process via
RDMA. Hence, we have to specifically exclude 127.0.0.1/8 --
don't even bother trying to rdma_bind_addr() to it because we
know we don't want loopback addresses at all. */
sinp = (struct sockaddr_in *)ipaddr;
if ((sinp->sin_addr.s_addr & htonl(0xff000000)) == htonl(0x7f000000)) {
rc = OPAL_SUCCESS;
goto out1;
}
ch = rdma_create_event_channel();
if (NULL == ch) {
BTL_VERBOSE(("failed creating RDMA CM event channel"));
rc = OPAL_ERROR;
goto out1;
}
rc = rdma_create_id(ch, &cm_id, NULL, RDMA_PS_TCP);
if (rc) {
BTL_VERBOSE(("rdma_create_id returned %d", rc));
rc = OPAL_ERROR;
goto out2;
}
/* Bind the newly created cm_id to the IP address. This will,
amongst other things, verify that the device is verbs
capable */
rc = rdma_bind_addr(cm_id, ipaddr);
if (rc || !cm_id->verbs) {
rc = OPAL_SUCCESS;
goto out3;
}
/* Verify that the device has not been excluded */
rc = dev_specified(cm_id->verbs->device->name, cm_id->port_num);
if (rc) {
rc = OPAL_SUCCESS;
goto out3;
}
/* Verify that the device has a valid IP address */
if (0 == ((struct sockaddr_in *)ipaddr)->sin_addr.s_addr ||
ipaddr_specified((struct sockaddr_in *)ipaddr, netmask)) {
rc = OPAL_SUCCESS;
goto out3;
}
myaddr = OBJ_NEW(rdma_addr_list_t);
if (NULL == myaddr) {
BTL_ERROR(("malloc failed!"));
rc = OPAL_ERROR;
goto out3;
}
myaddr->addr = sinp->sin_addr.s_addr;
myaddr->subnet = ntohl(myaddr->addr) & ~(all >> netmask);
inet_ntop(sinp->sin_family, &sinp->sin_addr,
myaddr->addr_str, sizeof(myaddr->addr_str));
memcpy(myaddr->dev_name, cm_id->verbs->device->name, IBV_SYSFS_NAME_MAX);
myaddr->dev_port = cm_id->port_num;
BTL_VERBOSE(("Adding addr %s (0x%x) subnet 0x%x as %s:%d",
myaddr->addr_str, myaddr->addr, myaddr->subnet,
myaddr->dev_name, myaddr->dev_port));
opal_list_append(myaddrs, &(myaddr->super));
out3:
rdma_destroy_id(cm_id);
out2:
rdma_destroy_event_channel(ch);
out1:
return rc;
}
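The loopback test at the top of add_rdma_addr() compares the first octet of the address against 127 without leaving network byte order. A minimal standalone sketch of the same check follows. The helper names `ipv4_is_loopback` and `ipv4_str_is_loopback` are hypothetical and exist only for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: true if addr lies in 127.0.0.0/8.  Both the mask
 * and the expected value are converted with htonl(), so the address
 * itself never has to be byte-swapped. */
static bool ipv4_is_loopback(struct in_addr addr)
{
    return (addr.s_addr & htonl(0xff000000u)) == htonl(0x7f000000u);
}

/* Hypothetical convenience wrapper for dotted-quad strings:
 * returns 1 for loopback, 0 for any other address, -1 on parse failure. */
static int ipv4_str_is_loopback(const char *text)
{
    struct in_addr addr;
    assert(NULL != text);
    if (1 != inet_pton(AF_INET, text, &addr)) {
        return -1;
    }
    return ipv4_is_loopback(addr) ? 1 : 0;
}
```

Masking with 0xff000000 covers the whole /8, so 127.255.255.254 is rejected just like 127.0.0.1.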
int mca_btl_openib_build_rdma_addr_list(void)
{
int rc = OPAL_SUCCESS, i;
myaddrs = OBJ_NEW(opal_list_t);
if (NULL == myaddrs) {
BTL_ERROR(("malloc failed!"));
return OPAL_ERROR;
}
for (i = opal_ifbegin(); i >= 0; i = opal_ifnext(i)) {
struct sockaddr ipaddr;
uint32_t netmask;
opal_ifindextoaddr(i, &ipaddr, sizeof(struct sockaddr));
opal_ifindextomask(i, &netmask, sizeof(uint32_t));
if (ipaddr.sa_family == AF_INET) {
rc = add_rdma_addr(&ipaddr, netmask);
if (OPAL_SUCCESS != rc) {
break;
}
}
}
return rc;
}
void mca_btl_openib_free_rdma_addr_list(void)
{
if (NULL != myaddrs) {
OPAL_LIST_RELEASE(myaddrs);
myaddrs = NULL;
}
}
#else
/* !OPAL_HAVE_RDMACM case */
uint64_t mca_btl_openib_get_ip_subnet_id(struct ibv_device *ib_dev,
uint8_t port)
{
return 0;
}
uint32_t mca_btl_openib_rdma_get_ipv4addr(struct ibv_context *verbs,
uint8_t port)
{
return 0;
}
int mca_btl_openib_build_rdma_addr_list(void)
{
return OPAL_SUCCESS;
}
void mca_btl_openib_free_rdma_addr_list(void)
{
}
#endif


@ -1,55 +0,0 @@
/*
* Copyright (c) 2008 Chelsio, Inc. All rights reserved.
* Copyright (c) 2008-2010 Cisco Systems, Inc. All rights reserved.
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_BTL_OPENIB_IP_H
#define MCA_BTL_OPENIB_IP_H
#include "opal_config.h"
BEGIN_C_DECLS
/**
* Get an IP equivalent of a subnet ID.
*
* @param ib_dev (IN) IBV device
* @param port (IN) physical port of the IBV device
* @return Value of the IPv4 Address bitwise-and'ed with the Netmask
*/
extern uint64_t mca_btl_openib_get_ip_subnet_id(struct ibv_device *ib_dev,
uint8_t port);
/**
* Get the IPv4 address of the specified HCA/RNIC device and physical port.
*
* @param verbs (IN) cm_id verbs of the IBV device
* @param port (IN) physical port of the IBV device
* @return IPv4 Address
*/
extern uint32_t mca_btl_openib_rdma_get_ipv4addr(struct ibv_context *verbs,
uint8_t port);
/**
* Create a list of all available IBV devices and each device's
* relevant information. This is necessary for
* mca_btl_openib_rdma_get_ipv4addr to work.
*
* @return OPAL_SUCCESS or failure status
*/
extern int mca_btl_openib_build_rdma_addr_list(void);
/**
* Free the list of all available IBV devices created by
* mca_btl_openib_build_rdma_addr_list.
*/
extern void mca_btl_openib_free_rdma_addr_list(void);
END_C_DECLS
#endif


@ -1,74 +0,0 @@
/* -*- C -*-
*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_INI_LEX_H_
#define BTL_OPENIB_INI_LEX_H_
#include "opal_config.h"
#ifdef malloc
#undef malloc
#endif
#ifdef realloc
#undef realloc
#endif
#ifdef free
#undef free
#endif
#include <stdio.h>
BEGIN_C_DECLS
int btl_openib_ini_yylex(void);
int btl_openib_ini_init_buffer(FILE *file);
int btl_openib_ini_yylex_destroy(void);
extern FILE *btl_openib_ini_yyin;
extern bool btl_openib_ini_parse_done;
extern char *btl_openib_ini_yytext;
extern int btl_openib_ini_yynewlines;
/*
* Make lex-generated files not issue compiler warnings
*/
#define YY_STACK_USED 0
#define YY_ALWAYS_INTERACTIVE 0
#define YY_NEVER_INTERACTIVE 0
#define YY_MAIN 0
#define YY_NO_UNPUT 1
#define YY_SKIP_YYWRAP 1
enum {
BTL_OPENIB_INI_PARSE_DONE,
BTL_OPENIB_INI_PARSE_ERROR,
BTL_OPENIB_INI_PARSE_NEWLINE,
BTL_OPENIB_INI_PARSE_SECTION,
BTL_OPENIB_INI_PARSE_EQUAL,
BTL_OPENIB_INI_PARSE_SINGLE_WORD,
BTL_OPENIB_INI_PARSE_VALUE,
BTL_OPENIB_INI_PARSE_MAX
};
END_C_DECLS
#endif


@ -1,148 +0,0 @@
%option nounput
%option noinput
%{ /* -*- C -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2012 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <stdio.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#include "btl_openib_lex.h"
BEGIN_C_DECLS
/*
* local functions
*/
static int btl_openib_ini_yywrap(void);
END_C_DECLS
/*
* global variables
*/
int btl_openib_ini_yynewlines = 1;
bool btl_openib_ini_parse_done = false;
char *btl_openib_ini_string = NULL;
%}
WHITE [\f\t\v ]
CHAR [A-Za-z0-9_\-\.]
NAME_CHAR [A-Za-z0-9_\-\.\\\/]
%x comment
%x section_name
%x section_end
%x value
%%
{WHITE}*\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
#.*\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
"//".*\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
"/*" { BEGIN(comment);
return BTL_OPENIB_INI_PARSE_NEWLINE; }
<comment>[^*\n]* ; /* Eat up non '*'s */
<comment>"*"+[^*/\n]* ; /* Eat '*'s not followed by a '/' */
<comment>\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
<comment>"*"+"/" { BEGIN(INITIAL); /* Done with block comment */
return BTL_OPENIB_INI_PARSE_NEWLINE; }
{WHITE}*\[{WHITE}* { BEGIN(section_name); }
<section_name>({NAME_CHAR}|{WHITE})*{NAME_CHAR}/{WHITE}*\] {
BEGIN(section_end);
return BTL_OPENIB_INI_PARSE_SECTION; }
<section_name>\n { ++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_ERROR; }
<section_name>. { return BTL_OPENIB_INI_PARSE_ERROR; }
<section_end>{WHITE}*\]{WHITE}*\n { BEGIN(INITIAL);
++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
{WHITE}*"="{WHITE}* { BEGIN(value);
return BTL_OPENIB_INI_PARSE_EQUAL; }
{WHITE}+ ; /* whitespace */
{CHAR}+ { return BTL_OPENIB_INI_PARSE_SINGLE_WORD; }
<value>{WHITE}*\n { BEGIN(INITIAL);
++btl_openib_ini_yynewlines;
return BTL_OPENIB_INI_PARSE_NEWLINE; }
<value>[^\n]*[^\t \n]/[\t ]* {
return BTL_OPENIB_INI_PARSE_VALUE; }
. { return BTL_OPENIB_INI_PARSE_ERROR; }
%%
/* Old flex (2.5.4a? and older) does not define a destroy function */
#if !defined(YY_FLEX_SUBMINOR_VERSION)
#define YY_FLEX_SUBMINOR_VERSION 0
#endif
#if (YY_FLEX_MAJOR_VERSION < 2) || (YY_FLEX_MAJOR_VERSION == 2 && (YY_FLEX_MINOR_VERSION < 5 || (YY_FLEX_MINOR_VERSION == 5 && YY_FLEX_SUBMINOR_VERSION < 5)))
int btl_openib_ini_yylex_destroy(void)
{
if (NULL != YY_CURRENT_BUFFER) {
yy_delete_buffer(YY_CURRENT_BUFFER);
#if defined(YY_CURRENT_BUFFER_LVALUE)
YY_CURRENT_BUFFER_LVALUE = NULL;
#else
YY_CURRENT_BUFFER = NULL;
#endif /* YY_CURRENT_BUFFER_LVALUE */
}
return YY_NULL;
}
#endif
static int btl_openib_ini_yywrap(void)
{
btl_openib_ini_parse_done = true;
return 1;
}
/*
* Ensure that we have a valid yybuffer to use. Specifically, if this
* scanner is invoked a second time, btl_openib_ini_yylex_destroy()
* (above) will have been executed, and the current buffer will have
* been freed. Flex doesn't recognize this fact because as far as it's
* concerned, its internal state was already initialized, so it thinks
* it should have a valid buffer. Hence, here we make sure to give it a
* valid buffer.
*/
int btl_openib_ini_init_buffer(FILE *file)
{
YY_BUFFER_STATE buf = yy_create_buffer(file, YY_BUF_SIZE);
yy_switch_to_buffer(buf);
return 0;
}
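The scanner rules above recognize four kinds of input: blank lines, comments ("#" and "//" line comments, plus block comments), bracketed section names, and "key = value" pairs. The following hand-written classifier is a simplified line-level approximation of those rules, not a replacement for the flex scanner. The names `ini_line_kind` and `classify_ini_line` are hypothetical, block comments are not handled, and a bare word without "=" is reported as an error here even though the scanner itself returns it as a SINGLE_WORD token.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result type for this sketch */
enum ini_line_kind {
    INI_BLANK,
    INI_COMMENT,
    INI_SECTION,
    INI_KEY_VALUE,
    INI_ERROR
};

/* Hypothetical helper: classify one line of an INI-style file */
static enum ini_line_kind classify_ini_line(const char *line)
{
    assert(NULL != line);
    /* Skip leading blanks, as the {WHITE}+ rule does */
    while (' ' == *line || '\t' == *line) {
        ++line;
    }
    if ('\0' == *line || '\n' == *line) {
        return INI_BLANK;
    }
    if ('#' == *line || 0 == strncmp(line, "//", 2)) {
        return INI_COMMENT;
    }
    if ('[' == *line) {
        const char *close = strchr(line, ']');
        /* Require at least one character between the brackets */
        return (NULL != close && close > line + 1) ? INI_SECTION : INI_ERROR;
    }
    /* Anything else must look like "key = value" with a non-empty key */
    const char *eq = strchr(line, '=');
    return (NULL != eq && eq > line) ? INI_KEY_VALUE : INI_ERROR;
}
```

A section line such as "[My Device]" (an invented name) classifies as INI_SECTION, and "mtu = 2048" as INI_KEY_VALUE.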


@ -1,799 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2015 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2013-2015 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2016 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <string.h>
#include "opal/util/bit_ops.h"
#include "opal/util/printf.h"
#include "opal/mca/common/verbs/common_verbs.h"
#include "opal/mca/installdirs/installdirs.h"
#include "opal/util/os_dirpath.h"
#include "opal/util/output.h"
#include "opal/util/show_help.h"
#include "opal/util/proc.h"
#include "btl_openib.h"
#include "btl_openib_mca.h"
#include "btl_openib_ini.h"
#include "connect/base.h"
#ifdef HAVE_IBV_FORK_INIT
#define OPAL_HAVE_IBV_FORK_INIT 1
#else
#define OPAL_HAVE_IBV_FORK_INIT 0
#endif
/*
* Local flags
*/
enum {
REGINT_NEG_ONE_OK = 0x01,
REGINT_GE_ZERO = 0x02,
REGINT_GE_ONE = 0x04,
REGINT_NONZERO = 0x08,
REGINT_MAX = 0x88
};
enum {
REGSTR_EMPTY_OK = 0x01,
REGSTR_MAX = 0x88
};
static mca_base_var_enum_value_t ib_mtu_values[] = {
{IBV_MTU_256, "256B"},
{IBV_MTU_512, "512B"},
{IBV_MTU_1024, "1k"},
{IBV_MTU_2048, "2k"},
{IBV_MTU_4096, "4k"},
{0, NULL}
};
static mca_base_var_enum_value_t device_type_values[] = {
{BTL_OPENIB_DT_IB, "infiniband"},
{BTL_OPENIB_DT_IB, "ib"},
{BTL_OPENIB_DT_IWARP, "iwarp"},
{BTL_OPENIB_DT_IWARP, "iw"},
{BTL_OPENIB_DT_ALL, "all"},
{0, NULL}
};
static int btl_openib_cq_size;
static bool btl_openib_have_fork_support = OPAL_HAVE_IBV_FORK_INIT;
/*
* utility routine for string parameter registration
*/
static int reg_string(const char* param_name,
const char* deprecated_param_name,
const char* param_desc,
const char* default_value, char **storage,
int flags)
{
int index;
assert (NULL != storage);
/* The MCA variable system will not change this pointer */
*storage = (char *) default_value;
index = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
param_name, param_desc, MCA_BASE_VAR_TYPE_STRING,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY, storage);
if (NULL != deprecated_param_name) {
(void) mca_base_var_register_synonym(index, "ompi", "btl", "openib",
deprecated_param_name,
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
if (0 != (flags & REGSTR_EMPTY_OK) && (NULL == *storage || 0 == strlen(*storage))) {
opal_output(0, "Bad parameter value for parameter \"%s\"",
param_name);
return OPAL_ERR_BAD_PARAM;
}
return OPAL_SUCCESS;
}
/*
* utility routine for integer parameter registration
*/
static int reg_int(const char* param_name,
const char* deprecated_param_name,
const char* param_desc,
int default_value, int *storage, int flags)
{
int index;
*storage = default_value;
index = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
param_name, param_desc, MCA_BASE_VAR_TYPE_INT,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY, storage);
if (NULL != deprecated_param_name) {
(void) mca_base_var_register_synonym(index, "ompi", "btl", "openib",
deprecated_param_name,
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
if (0 != (flags & REGINT_NEG_ONE_OK) && -1 == *storage) {
return OPAL_SUCCESS;
}
if ((0 != (flags & REGINT_GE_ZERO) && *storage < 0) ||
(0 != (flags & REGINT_GE_ONE) && *storage < 1) ||
(0 != (flags & REGINT_NONZERO) && 0 == *storage)) {
opal_output(0, "Bad parameter value for parameter \"%s\"",
param_name);
return OPAL_ERR_BAD_PARAM;
}
return OPAL_SUCCESS;
}
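The range check at the end of reg_int() can be read in isolation from the MCA registration call. The sketch below reproduces only that flag logic in a standalone function. The name `regint_value_ok` and the `REGINT_ALL` mask are hypothetical additions, while the four flag values are copied from the enum above.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    REGINT_NEG_ONE_OK = 0x01,
    REGINT_GE_ZERO    = 0x02,
    REGINT_GE_ONE     = 0x04,
    REGINT_NONZERO    = 0x08,
    REGINT_ALL        = 0x0f   /* hypothetical: union of the flags above */
};

/* Hypothetical helper: true if value satisfies every constraint in flags */
static bool regint_value_ok(int value, int flags)
{
    assert(0 == (flags & ~REGINT_ALL));
    /* -1 is accepted first, so it bypasses all remaining range checks */
    if (0 != (flags & REGINT_NEG_ONE_OK) && -1 == value) {
        return true;
    }
    if (0 != (flags & REGINT_GE_ZERO) && value < 0) {
        return false;
    }
    if (0 != (flags & REGINT_GE_ONE) && value < 1) {
        return false;
    }
    if (0 != (flags & REGINT_NONZERO) && 0 == value) {
        return false;
    }
    return true;
}
```

With REGINT_NEG_ONE_OK | REGINT_GE_ONE, as used for max_btls, the accepted values are -1 and anything from 1 upward, and 0 is rejected.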
/*
* utility routine for unsigned integer parameter registration
*/
static int reg_uint(const char* param_name,
const char* deprecated_param_name,
const char* param_desc,
unsigned int default_value, unsigned int *storage,
int flags)
{
int index;
*storage = default_value;
index = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
param_name, param_desc, MCA_BASE_VAR_TYPE_UNSIGNED_INT,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY, storage);
if (NULL != deprecated_param_name) {
(void) mca_base_var_register_synonym(index, "ompi", "btl", "openib",
deprecated_param_name,
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
if ((0 != (flags & REGINT_GE_ONE) && *storage < 1) ||
(0 != (flags & REGINT_NONZERO) && 0 == *storage)) {
opal_output(0, "Bad parameter value for parameter \"%s\"",
param_name);
return OPAL_ERR_BAD_PARAM;
}
return OPAL_SUCCESS;
}
/*
* utility routine for boolean parameter registration
*/
static int reg_bool(const char* param_name,
const char* deprecated_param_name,
const char* param_desc,
bool default_value, bool *storage)
{
int index;
*storage = default_value;
index = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
param_name, param_desc, MCA_BASE_VAR_TYPE_BOOL,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY, storage);
if (NULL != deprecated_param_name) {
(void) mca_base_var_register_synonym(index, "ompi", "btl", "openib",
deprecated_param_name,
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
}
return OPAL_SUCCESS;
}
/*
* Register and check all MCA parameters
*/
int btl_openib_register_mca_params(void)
{
mca_base_var_enum_t *new_enum;
char *default_qps;
uint32_t mid_qp_size;
char *msg, *str;
int ret, tmp;
ret = OPAL_SUCCESS;
#define CHECK(expr) do {\
tmp = (expr); \
if (OPAL_SUCCESS != tmp) ret = tmp; \
} while (0)
/* register openib component parameters */
CHECK(reg_bool("verbose", NULL,
"Output some verbose OpenIB BTL information "
"(0 = no output, nonzero = output)", false,
&mca_btl_openib_component.verbose));
CHECK(reg_bool("warn_no_device_params_found",
"warn_no_hca_params_found",
"Warn when no device-specific parameters are found in the INI file specified by the btl_openib_device_param_files MCA parameter "
"(0 = do not warn; any other value = warn)",
true, &mca_btl_openib_component.warn_no_device_params_found));
CHECK(reg_bool("warn_default_gid_prefix", NULL,
"Warn when there is more than one active port and at least one of them is connected to the network with only the default GID prefix configured "
"(0 = do not warn; any other value = warn)",
true, &mca_btl_openib_component.warn_default_gid_prefix));
CHECK(reg_bool("warn_nonexistent_if", NULL,
"Warn if non-existent devices and/or ports are specified in the btl_openib_if_[in|ex]clude MCA parameters "
"(0 = do not warn; any other value = warn)",
true, &mca_btl_openib_component.warn_nonexistent_if));
/* If we print a warning about not having enough registered memory
available, do we want to abort? */
CHECK(reg_bool("abort_not_enough_reg_mem", NULL,
"If there is not enough registered memory available on the system for Open MPI to function properly, Open MPI will issue a warning. If this MCA parameter is set to true, then Open MPI will also abort all MPI jobs "
"(0 = warn, but do not abort; any other value = warn and abort)",
false, &mca_btl_openib_component.abort_not_enough_reg_mem));
CHECK(reg_uint("poll_cq_batch", NULL,
"Retrieve up to poll_cq_batch completions from CQ",
MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT, &mca_btl_openib_component.cq_poll_batch,
REGINT_GE_ONE));
opal_asprintf(&str, "%s/mca-btl-openib-device-params.ini",
opal_install_dirs.opaldatadir);
if (NULL == str) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
CHECK(reg_string("device_param_files", "hca_param_files",
"Colon-delimited list of INI-style files that contain device vendor/part-specific parameters (use semicolon for Windows)",
str, &mca_btl_openib_component.device_params_file_names,
0));
free(str);
(void)mca_base_var_enum_create("btl_openib_device_types", device_type_values, &new_enum);
mca_btl_openib_component.device_type = BTL_OPENIB_DT_ALL;
tmp = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"device_type", "Specify to only use IB or iWARP "
"network adapters (infiniband = only use InfiniBand "
"HCAs; iwarp = only use iWARP NICs; all = use any "
"available adapters)", MCA_BASE_VAR_TYPE_INT, new_enum,
0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY,
&mca_btl_openib_component.device_type);
if (0 > tmp) ret = tmp;
OBJ_RELEASE(new_enum);
/*
* Provide a way for the user to override the policy of ignoring IB HCAs
*/
mca_btl_openib_component.allow_ib = false;
tmp = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"allow_ib",
"Override policy since Open MPI 4.0 of ignoring IB HCAs for openib BTL",
MCA_BASE_VAR_TYPE_BOOL, NULL,
0, 0, OPAL_INFO_LVL_5,
MCA_BASE_VAR_SCOPE_READONLY,
&mca_btl_openib_component.allow_ib);
CHECK(reg_int("max_btls", NULL,
"Maximum number of device ports to use "
"(-1 = use all available, otherwise must be >= 1)",
-1, &mca_btl_openib_component.ib_max_btls,
REGINT_NEG_ONE_OK | REGINT_GE_ONE));
CHECK(reg_int("free_list_num", NULL,
"Initial size of free lists "
"(must be >= 1)",
8, &mca_btl_openib_component.ib_free_list_num,
REGINT_GE_ONE));
CHECK(reg_int("free_list_max", NULL,
"Maximum size of free lists "
"(-1 = infinite, otherwise must be >= 1)",
-1, &mca_btl_openib_component.ib_free_list_max,
REGINT_NEG_ONE_OK | REGINT_GE_ONE));
CHECK(reg_int("free_list_inc", NULL,
"Increment size of free lists "
"(must be >= 1)",
32, &mca_btl_openib_component.ib_free_list_inc,
REGINT_GE_ONE));
CHECK(reg_string("mpool_hints", NULL, "hints for selecting a memory pool (default: none)",
NULL, &mca_btl_openib_component.ib_mpool_hints,
0));
CHECK(reg_string("rcache", NULL,
"Name of the registration cache to be used (it is unlikely that you will ever want to change this)",
"grdma", &mca_btl_openib_component.ib_rcache_name,
0));
CHECK(reg_int("reg_mru_len", NULL,
"Length of the registration cache most recently used list "
"(must be >= 1)",
16, (int*) &mca_btl_openib_component.reg_mru_len,
REGINT_GE_ONE));
CHECK(reg_int("cq_size", "ib_cq_size",
"Minimum size of the OpenFabrics completion queue "
"(CQs are automatically sized based on the number "
"of peer MPI processes; this value determines the "
"*minimum* size of all CQs)",
8192, &btl_openib_cq_size, REGINT_GE_ONE));
mca_btl_openib_component.ib_cq_size[BTL_OPENIB_LP_CQ] =
mca_btl_openib_component.ib_cq_size[BTL_OPENIB_HP_CQ] = (uint32_t) btl_openib_cq_size;
CHECK(reg_int("max_inline_data", "ib_max_inline_data",
"Maximum size of inline data segment "
"(-1 = run-time probe to discover max value, otherwise must be >= 0). "
"If not explicitly set, use max_inline_data from "
"the INI file containing device-specific parameters",
-1, &mca_btl_openib_component.ib_max_inline_data,
REGINT_NEG_ONE_OK | REGINT_GE_ZERO));
CHECK(reg_uint("pkey", "ib_pkey_val",
"OpenFabrics partition key (pkey) value. "
"Unsigned integer decimal or hex values are allowed (e.g., \"3\" or \"0x3f\") and will be masked against the maximum allowable IB partition key value (0x7fff)",
0, &mca_btl_openib_component.ib_pkey_val, 0));
CHECK(reg_uint("psn", "ib_psn",
"OpenFabrics packet sequence starting number "
"(must be >= 0)",
0, &mca_btl_openib_component.ib_psn, 0));
CHECK(reg_uint("ib_qp_ous_rd_atom", NULL,
"InfiniBand outstanding atomic reads "
"(must be >= 0)",
4, &mca_btl_openib_component.ib_qp_ous_rd_atom, 0));
opal_asprintf(&msg, "OpenFabrics MTU, in bytes (if not specified in INI files). Valid values are: %d=256 bytes, %d=512 bytes, %d=1024 bytes, %d=2048 bytes, %d=4096 bytes",
IBV_MTU_256,
IBV_MTU_512,
IBV_MTU_1024,
IBV_MTU_2048,
IBV_MTU_4096);
if (NULL == msg) {
/* Don't try to recover from this */
return OPAL_ERR_OUT_OF_RESOURCE;
}
mca_btl_openib_component.ib_mtu = 0;
(void) mca_base_var_enum_create("btl_openib_mtus", ib_mtu_values, &new_enum);
tmp = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"mtu", msg, MCA_BASE_VAR_TYPE_INT, new_enum,
0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY,
&mca_btl_openib_component.ib_mtu);
if (0 <= tmp) {
(void) mca_base_var_register_synonym(tmp, "ompi", "btl", "openib", "ib_mtu",
MCA_BASE_VAR_SYN_FLAG_DEPRECATED);
} else {
ret = tmp;
}
OBJ_RELEASE(new_enum);
free(msg);
CHECK(reg_uint("ib_min_rnr_timer", NULL, "InfiniBand minimum "
"\"receiver not ready\" timer, in seconds "
"(must be >= 0 and <= 31)",
25, &mca_btl_openib_component.ib_min_rnr_timer, 0));
CHECK(reg_uint("ib_timeout", NULL,
"InfiniBand transmit timeout, plugged into formula: 4.096 microseconds * (2^btl_openib_ib_timeout) "
"(must be >= 0 and <= 31)",
20, &mca_btl_openib_component.ib_timeout, 0));
CHECK(reg_uint("ib_retry_count", NULL,
"InfiniBand transmit retry count "
"(must be >= 0 and <= 7)",
7, &mca_btl_openib_component.ib_retry_count, 0));
CHECK(reg_uint("ib_rnr_retry", NULL,
"InfiniBand \"receiver not ready\" "
"retry count; applies *only* to SRQ/XRC queues. PP queues "
"use RNR retry values of 0 because Open MPI performs "
"software flow control to guarantee that RNRs never occur "
"(must be >= 0 and <= 7; 7 = \"infinite\")",
7, &mca_btl_openib_component.ib_rnr_retry, 0));
CHECK(reg_uint("ib_max_rdma_dst_ops", NULL, "InfiniBand maximum pending RDMA "
"destination operations "
"(must be >= 0)",
4, &mca_btl_openib_component.ib_max_rdma_dst_ops, 0));
CHECK(reg_uint("ib_service_level", NULL, "InfiniBand service level "
"(must be >= 0 and <= 15)",
0, &mca_btl_openib_component.ib_service_level, 0));
#if (ENABLE_DYNAMIC_SL)
CHECK(reg_uint("ib_path_record_service_level", NULL,
"Enable getting InfiniBand service level from PathRecord "
"(must be >= 0, 0 = disabled, positive = try to get the "
"service level from PathRecord)",
0, &mca_btl_openib_component.ib_path_record_service_level, 0));
#endif
CHECK(reg_int("use_eager_rdma", NULL, "Use RDMA for eager messages "
"(-1 = use device default, 0 = do not use eager RDMA, "
"1 = use eager RDMA)",
-1, &mca_btl_openib_component.use_eager_rdma, 0));
CHECK(reg_int("eager_rdma_threshold", NULL,
"Use RDMA for short messages after this number of "
"messages are received from a given peer "
"(must be >= 1)",
16, &mca_btl_openib_component.eager_rdma_threshold, REGINT_GE_ONE));
CHECK(reg_int("max_eager_rdma", NULL, "Maximum number of peers allowed to use "
"RDMA for short messages (RDMA is used for all long "
"messages, except if explicitly disabled, such as "
"with the \"dr\" pml) "
"(must be >= 0)",
16, &mca_btl_openib_component.max_eager_rdma, REGINT_GE_ZERO));
CHECK(reg_int("eager_rdma_num", NULL, "Number of RDMA buffers to allocate "
"for small messages "
"(must be >= 1)",
16, &mca_btl_openib_component.eager_rdma_num, REGINT_GE_ONE));
mca_btl_openib_component.eager_rdma_num++;
CHECK(reg_uint("btls_per_lid", NULL, "Number of BTLs to create for each "
"InfiniBand LID "
"(must be >= 1)",
1, &mca_btl_openib_component.btls_per_lid, REGINT_GE_ONE));
CHECK(reg_uint("max_lmc", NULL, "Maximum number of LIDs to use for each device port "
"(must be >= 0, where 0 = use all available)",
1, &mca_btl_openib_component.max_lmc, 0));
CHECK(reg_int("enable_apm_over_lmc", NULL, "Maximum number of alternative paths for each device port "
"(must be >= -1, where 0 = disable APM, -1 = all available alternative paths)",
0, &mca_btl_openib_component.apm_lmc, REGINT_NEG_ONE_OK|REGINT_GE_ZERO));
CHECK(reg_int("enable_apm_over_ports", NULL, "Enable alternative path migration (APM) over different ports of the same device "
"(must be >= 0, where 0 = disable APM over ports, 1 = enable APM over ports of the same device)",
0, &mca_btl_openib_component.apm_ports, REGINT_GE_ZERO));
CHECK(reg_bool("use_async_event_thread", NULL,
"If nonzero, use the thread that will handle InfiniBand asynchronous events",
true, &mca_btl_openib_component.use_async_event_thread));
CHECK(reg_bool("enable_srq_resize", NULL,
"Enable/Disable on demand SRQ resize. "
"(0 = without resizing, nonzero = with resizing)", 1,
&mca_btl_openib_component.enable_srq_resize));
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
CHECK(reg_bool("rroce_enable", NULL,
"Enable/Disable routing between different subnets "
"(0 = disable, nonzero = enable)", false,
&mca_btl_openib_component.rroce_enable));
#endif
CHECK(reg_uint("buffer_alignment", NULL,
"Preferred communication buffer alignment, in bytes "
"(must be > 0 and power of two)",
64, &mca_btl_openib_component.buffer_alignment, 0));
CHECK(reg_bool("use_message_coalescing", NULL,
"If nonzero, use message coalescing", false,
&mca_btl_openib_component.use_message_coalescing));
CHECK(reg_uint("cq_poll_ratio", NULL,
"How often to poll high priority CQ versus low priority CQ",
100, &mca_btl_openib_component.cq_poll_ratio, REGINT_GE_ONE));
CHECK(reg_uint("eager_rdma_poll_ratio", NULL,
"How often to poll eager RDMA channel versus CQ",
100, &mca_btl_openib_component.eager_rdma_poll_ratio, REGINT_GE_ONE));
CHECK(reg_uint("hp_cq_poll_per_progress", NULL,
"Max number of completion events to process for each call "
"of BTL progress engine",
10, &mca_btl_openib_component.cq_poll_progress, REGINT_GE_ONE));
CHECK(reg_uint("max_hw_msg_size", NULL,
"Maximum size (in bytes) of a single fragment of a long message when using the RDMA protocols (must be > 0 and <= hw capabilities).",
0, &mca_btl_openib_component.max_hw_msg_size, 0));
CHECK(reg_bool("allow_max_memory_registration", NULL,
"Allow maximum possible memory to register with HCA",
1, &mca_btl_openib_component.allow_max_memory_registration));
/* Help debug memory registration issues */
CHECK(reg_int("memory_registration_verbose", NULL,
"Output some verbose memory registration information "
"(0 = no output, nonzero = output)", 0,
&mca_btl_openib_component.memory_registration_verbose_level, 0));
CHECK(reg_int("ignore_locality", NULL,
"Ignore any locality information and use all devices "
"(0 = use locality information and use only close devices, nonzero = ignore locality information)", 0,
&mca_btl_openib_component.ignore_locality, REGINT_GE_ZERO));
/* Info only */
tmp = mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"have_fork_support",
"Whether the OpenFabrics stack supports applications that invoke the \"fork()\" system call or not (0 = no, 1 = yes). "
"Note that this value does NOT indicate whether the system being run on supports \"fork()\" with OpenFabrics applications or not.",
MCA_BASE_VAR_TYPE_BOOL, NULL, 0,
MCA_BASE_VAR_FLAG_DEFAULT_ONLY,
OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_CONSTANT,
&btl_openib_have_fork_support);
mca_btl_openib_module.super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_DEFAULT;
mca_btl_openib_module.super.btl_eager_limit = 12 * 1024;
mca_btl_openib_module.super.btl_rndv_eager_limit = 12 * 1024;
mca_btl_openib_module.super.btl_max_send_size = 64 * 1024;
mca_btl_openib_module.super.btl_rdma_pipeline_send_length = 1024 * 1024;
mca_btl_openib_module.super.btl_rdma_pipeline_frag_size = 1024 * 1024;
mca_btl_openib_module.super.btl_min_rdma_pipeline_size = 256 * 1024;
mca_btl_openib_module.super.btl_flags = MCA_BTL_FLAGS_RDMA |
MCA_BTL_FLAGS_NEED_ACK | MCA_BTL_FLAGS_NEED_CSUM | MCA_BTL_FLAGS_HETEROGENEOUS_RDMA |
MCA_BTL_FLAGS_SEND;
#if HAVE_DECL_IBV_ATOMIC_HCA
mca_btl_openib_module.super.btl_flags |= MCA_BTL_FLAGS_ATOMIC_FOPS;
mca_btl_openib_module.super.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_ADD | MCA_BTL_ATOMIC_SUPPORTS_CSWAP;
#endif
/* Default to bandwidth auto-detection */
mca_btl_openib_module.super.btl_bandwidth = 0;
mca_btl_openib_module.super.btl_latency = 4;
#if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
/* Default is enabling CUDA asynchronous send copies */
CHECK(reg_bool("cuda_async_send", NULL,
"Enable or disable CUDA async send copies "
"(true = async; false = sync)",
true, &mca_btl_openib_component.cuda_async_send));
/* Default is disabling CUDA asynchronous receive copies */
CHECK(reg_bool("cuda_async_recv", NULL,
"Enable or disable CUDA async recv copies "
"(true = async; false = sync)",
false, &mca_btl_openib_component.cuda_async_recv));
/* Also make the max send size larger for better GPU buffer performance */
mca_btl_openib_module.super.btl_max_send_size = 128 * 1024;
/* Turn off message coalescing - not sure if it works with GPU buffers */
mca_btl_openib_component.use_message_coalescing = 0;
/* Indicates if library was built with GPU Direct RDMA support. Not changeable. */
mca_btl_openib_component.cuda_have_gdr = OPAL_INT_TO_BOOL(OPAL_CUDA_GDR_SUPPORT);
(void) mca_base_component_var_register(&mca_btl_openib_component.super.btl_version, "have_cuda_gdr",
"Whether CUDA GPU Direct RDMA support is built into library or not",
MCA_BASE_VAR_TYPE_BOOL, NULL, 0,
MCA_BASE_VAR_FLAG_DEFAULT_ONLY,
OPAL_INFO_LVL_5,
MCA_BASE_VAR_SCOPE_CONSTANT,
&mca_btl_openib_component.cuda_have_gdr);
/* Indicates if driver has GPU Direct RDMA support. Not changeable. */
if (OPAL_SUCCESS == opal_os_dirpath_access("/sys/kernel/mm/memory_peers/nv_mem/version", S_IRUSR)) {
mca_btl_openib_component.driver_have_gdr = 1;
} else {
mca_btl_openib_component.driver_have_gdr = 0;
}
(void) mca_base_component_var_register(&mca_btl_openib_component.super.btl_version, "have_driver_gdr",
"Whether InfiniBand driver has GPU Direct RDMA support",
MCA_BASE_VAR_TYPE_BOOL, NULL, 0,
MCA_BASE_VAR_FLAG_DEFAULT_ONLY,
OPAL_INFO_LVL_5,
MCA_BASE_VAR_SCOPE_CONSTANT,
&mca_btl_openib_component.driver_have_gdr);
/* Default for GPU Direct RDMA is off for now */
CHECK(reg_bool("want_cuda_gdr", NULL,
"Enable or disable CUDA GPU Direct RDMA support "
"(true = enabled; false = disabled)",
false, &mca_btl_openib_component.cuda_want_gdr));
if (mca_btl_openib_component.cuda_want_gdr && !mca_btl_openib_component.cuda_have_gdr) {
opal_show_help("help-mpi-btl-openib.txt",
"CUDA_no_gdr_support", true,
opal_process_info.nodename);
return OPAL_ERROR;
}
if (mca_btl_openib_component.cuda_want_gdr && !mca_btl_openib_component.driver_have_gdr) {
opal_show_help("help-mpi-btl-openib.txt",
"driver_no_gdr_support", true,
opal_process_info.nodename);
return OPAL_ERROR;
}
#if OPAL_CUDA_GDR_SUPPORT
if (mca_btl_openib_component.cuda_want_gdr) {
mca_btl_openib_module.super.btl_flags |= MCA_BTL_FLAGS_CUDA_GET;
mca_btl_openib_module.super.btl_cuda_eager_limit = SIZE_MAX; /* magic number - indicates set it to minimum */
mca_btl_openib_module.super.btl_cuda_rdma_limit = 30000; /* default switchover is 30,000 to pipeline */
} else {
mca_btl_openib_module.super.btl_cuda_eager_limit = 0; /* Turns off any of the GPU Direct RDMA code */
mca_btl_openib_module.super.btl_cuda_rdma_limit = 0; /* Unused */
}
#endif /* OPAL_CUDA_GDR_SUPPORT */
#endif /* OPAL_CUDA_SUPPORT */
CHECK(mca_btl_base_param_register(
&mca_btl_openib_component.super.btl_version,
&mca_btl_openib_module.super));
/* setup all the qp stuff */
/* round mid_qp_size down to a power of two */
mid_qp_size = opal_next_poweroftwo (mca_btl_openib_module.super.btl_eager_limit / 4) >> 1;
/* mid_qp_size = MAX (mid_qp_size, 1024); ?! */
if(mid_qp_size <= 128) {
mid_qp_size = 1024;
}
opal_asprintf(&default_qps,
"S,128,256,192,128:S,%u,1024,1008,64:S,%u,1024,1008,64:S,%u,1024,1008,64",
mid_qp_size,
(uint32_t)mca_btl_openib_module.super.btl_eager_limit,
(uint32_t)mca_btl_openib_module.super.btl_max_send_size);
if (NULL == default_qps) {
/* Don't try to recover from this */
return OPAL_ERR_OUT_OF_RESOURCE;
}
if (NULL != mca_btl_openib_component.default_recv_qps) {
free(mca_btl_openib_component.default_recv_qps);
}
mca_btl_openib_component.default_recv_qps = default_qps;
CHECK(reg_string("receive_queues", NULL,
"Colon-delimited, comma-delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4",
default_qps, &mca_btl_openib_component.receive_queues,
0
));
CHECK(reg_string("if_include", NULL,
"Comma-delimited list of devices/ports to be used (e.g. \"mthca0,mthca1:2\"; empty value means to use all ports found). Mutually exclusive with btl_openib_if_exclude.",
NULL, &mca_btl_openib_component.if_include,
0));
CHECK(reg_string("if_exclude", NULL,
"Comma-delimited list of device/ports to be excluded (empty value means to not exclude any ports). Mutually exclusive with btl_openib_if_include.",
NULL, &mca_btl_openib_component.if_exclude,
0));
CHECK(reg_string("ipaddr_include", NULL,
"Comma-delimited list of IP Addresses to be used (e.g. \"192.168.1.0/24\"). Mutually exclusive with btl_openib_ipaddr_exclude.",
NULL, &mca_btl_openib_component.ipaddr_include,
0));
CHECK(reg_string("ipaddr_exclude", NULL,
"Comma-delimited list of IP Addresses to be excluded (e.g. \"192.168.1.0/24\"). Mutually exclusive with btl_openib_ipaddr_include.",
NULL, &mca_btl_openib_component.ipaddr_exclude,
0));
CHECK(reg_int("gid_index", NULL,
"GID index to use on verbs device ports",
0, &mca_btl_openib_component.gid_index,
REGINT_GE_ZERO));
CHECK(reg_bool("allow_different_subnets", NULL,
"Allow connecting processes from different IB subnets. "
"(0 = do not allow; 1 = allow)",
false, &mca_btl_openib_component.allow_different_subnets));
/* Register any MCA params for the connect pseudo-components */
if (OPAL_SUCCESS == ret) {
ret = opal_btl_openib_connect_base_register();
}
return btl_openib_verify_mca_params();
}
int btl_openib_verify_mca_params (void)
{
if (mca_btl_openib_component.cq_poll_batch > MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT) {
mca_btl_openib_component.cq_poll_batch = MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT;
}
#if !HAVE_IBV_FORK_INIT
if (1 == mca_btl_openib_component.want_fork_support) {
opal_show_help("help-mpi-btl-openib.txt",
"ibv_fork requested but not supported", true,
opal_process_info.nodename);
return OPAL_ERR_BAD_PARAM;
}
#endif
mca_btl_openib_component.ib_pkey_val &= MCA_BTL_IB_PKEY_MASK;
if (mca_btl_openib_component.ib_min_rnr_timer > 31) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_min_rnr_timer > 31",
"btl_openib_ib_min_rnr_timer reset to 31");
mca_btl_openib_component.ib_min_rnr_timer = 31;
}
if (mca_btl_openib_component.ib_timeout > 31) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_timeout > 31",
"btl_openib_ib_timeout reset to 31");
mca_btl_openib_component.ib_timeout = 31;
}
if (mca_btl_openib_component.ib_retry_count > 7) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_retry_count > 7",
"btl_openib_ib_retry_count reset to 7");
mca_btl_openib_component.ib_retry_count = 7;
}
if (mca_btl_openib_component.ib_rnr_retry > 7) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_rnr_retry > 7",
"btl_openib_ib_rnr_retry reset to 7");
mca_btl_openib_component.ib_rnr_retry = 7;
}
if (mca_btl_openib_component.ib_service_level > 15) {
opal_show_help("help-mpi-btl-openib.txt", "invalid mca param value",
true, "btl_openib_ib_service_level > 15",
"btl_openib_ib_service_level reset to 15");
mca_btl_openib_component.ib_service_level = 15;
}
if(mca_btl_openib_component.buffer_alignment <= 1 ||
(mca_btl_openib_component.buffer_alignment & (mca_btl_openib_component.buffer_alignment - 1))) {
opal_show_help("help-mpi-btl-openib.txt", "wrong buffer alignment",
true, mca_btl_openib_component.buffer_alignment, opal_process_info.nodename, 64);
mca_btl_openib_component.buffer_alignment = 64;
}
#if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
if (mca_btl_openib_component.cuda_async_send) {
mca_btl_openib_module.super.btl_flags |= MCA_BTL_FLAGS_CUDA_COPY_ASYNC_SEND;
} else {
mca_btl_openib_module.super.btl_flags &= ~MCA_BTL_FLAGS_CUDA_COPY_ASYNC_SEND;
}
if (mca_btl_openib_component.cuda_async_recv) {
mca_btl_openib_module.super.btl_flags |= MCA_BTL_FLAGS_CUDA_COPY_ASYNC_RECV;
} else {
mca_btl_openib_module.super.btl_flags &= ~MCA_BTL_FLAGS_CUDA_COPY_ASYNC_RECV;
}
#if 0 /* Disable this check for now while fork support code is worked out. */
/* Cannot have fork support and GDR on at the same time. If the user asks for both,
* then print a message and return error. If the user does not explicitly ask for
* fork support, then turn it off in the presence of GDR. */
if (mca_btl_openib_component.cuda_want_gdr && mca_btl_openib_component.cuda_have_gdr &&
mca_btl_openib_component.driver_have_gdr) {
if (1 == opal_common_verbs_want_fork_support) {
opal_show_help("help-mpi-btl-openib.txt", "no_fork_with_gdr",
true, opal_process_info.nodename);
return OPAL_ERR_BAD_PARAM;
}
}
#endif /* Workaround */
if (0 != mca_btl_openib_module.super.btl_cuda_max_send_size) {
opal_show_help("help-mpi-btl-openib.txt", "do_not_set_openib_value",
true, opal_process_info.nodename);
mca_btl_openib_module.super.btl_cuda_max_send_size = 0;
}
#endif
return OPAL_SUCCESS;
}
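
The verification routine above clamps out-of-range parameters to their maximum and falls back to a default when the buffer alignment is not a power of two greater than 1. A minimal standalone sketch of those two checks follows; the helper names `clamp_u32`, `is_valid_alignment`, and `sanitize_alignment` are hypothetical and are not part of the removed component.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: clamp *value to max and report whether it changed,
 * mirroring checks such as "ib_timeout > 31 -> reset to 31". */
static bool clamp_u32(uint32_t *value, uint32_t max)
{
    assert(NULL != value);
    if (*value > max) {
        *value = max;
        return true;
    }
    return false;
}

/* True when x is a power of two greater than 1. This is the condition
 * the buffer_alignment check accepts: it rejects x <= 1 and any x with
 * more than one bit set. */
static bool is_valid_alignment(uint32_t x)
{
    return x > 1 && 0 == (x & (x - 1));
}

/* Hypothetical helper: keep a valid alignment, otherwise use fallback. */
static uint32_t sanitize_alignment(uint32_t requested, uint32_t fallback)
{
    return is_valid_alignment(requested) ? requested : fallback;
}
```

With these helpers, a requested alignment of 48 would be replaced by the fallback (64 in the code above), while 128 would be kept as-is.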

View file

@ -1,22 +0,0 @@
/*
* Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_IB_MCA_H
#define MCA_BTL_IB_MCA_H
BEGIN_C_DECLS
/**
* Function to register MCA params and check for sane values
*/
int btl_openib_register_mca_params(void);
int btl_openib_verify_mca_params (void);
END_C_DECLS
#endif

View file

@ -1,405 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007-2015 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2014-2017 Intel, Inc. All rights reserved.
* Copyright (c) 2015-2018 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2015 Mellanox Technologies. All rights reserved.
* Copyright (c) 2016-2017 Los Alamos National Security, LLC. All rights
* reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "opal/util/arch.h"
#include "opal/mca/pmix/pmix.h"
#include "btl_openib.h"
#include "btl_openib_proc.h"
#include "connect/base.h"
#include "connect/connect.h"
static void mca_btl_openib_proc_btl_construct(mca_btl_openib_proc_btlptr_t* elem);
static void mca_btl_openib_proc_btl_destruct(mca_btl_openib_proc_btlptr_t* elem);
OBJ_CLASS_INSTANCE(mca_btl_openib_proc_btlptr_t,
opal_list_item_t, mca_btl_openib_proc_btl_construct,
mca_btl_openib_proc_btl_destruct);
static void mca_btl_openib_proc_btl_construct(mca_btl_openib_proc_btlptr_t* elem)
{
elem->openib_btl = NULL;
}
static void mca_btl_openib_proc_btl_destruct(mca_btl_openib_proc_btlptr_t* elem)
{
elem->openib_btl = NULL;
}
static void mca_btl_openib_proc_construct(mca_btl_openib_proc_t* proc);
static void mca_btl_openib_proc_destruct(mca_btl_openib_proc_t* proc);
OBJ_CLASS_INSTANCE(mca_btl_openib_proc_t,
opal_list_item_t, mca_btl_openib_proc_construct,
mca_btl_openib_proc_destruct);
void mca_btl_openib_proc_construct(mca_btl_openib_proc_t* ib_proc)
{
ib_proc->proc_opal = 0;
ib_proc->proc_ports = NULL;
ib_proc->proc_port_count = 0;
ib_proc->proc_endpoints = 0;
ib_proc->proc_endpoint_count = 0;
OBJ_CONSTRUCT(&ib_proc->proc_lock, opal_mutex_t);
OBJ_CONSTRUCT(&ib_proc->openib_btls, opal_list_t);
}
/*
* Cleanup ib proc instance
*/
void mca_btl_openib_proc_destruct(mca_btl_openib_proc_t* ib_proc)
{
/* release resources */
if(NULL != ib_proc->proc_endpoints) {
free(ib_proc->proc_endpoints);
}
if (NULL != ib_proc->proc_ports) {
int i, j;
for (i = 0; i < ib_proc->proc_port_count; ++i) {
for (j = 0; j < ib_proc->proc_ports[i].pm_cpc_data_count; ++j) {
if (NULL != ib_proc->proc_ports[i].pm_cpc_data[j].cbm_modex_message) {
free(ib_proc->proc_ports[i].pm_cpc_data[j].cbm_modex_message);
}
}
}
free(ib_proc->proc_ports);
}
OBJ_DESTRUCT(&ib_proc->proc_lock);
OPAL_LIST_DESTRUCT(&ib_proc->openib_btls);
}
/*
* Look for an existing IB process instance based on the associated
* opal_proc_t instance.
*/
static mca_btl_openib_proc_t* ibproc_lookup_no_lock(opal_proc_t* proc)
{
mca_btl_openib_proc_t* ib_proc;
OPAL_LIST_FOREACH(ib_proc, &mca_btl_openib_component.ib_procs, mca_btl_openib_proc_t) {
if(ib_proc->proc_opal == proc) {
return ib_proc;
}
}
return NULL;
}
static mca_btl_openib_proc_t* ibproc_lookup_and_lock(opal_proc_t* proc)
{
mca_btl_openib_proc_t* ib_proc;
/* get the process from the list */
opal_mutex_lock(&mca_btl_openib_component.ib_lock);
ib_proc = ibproc_lookup_no_lock(proc);
opal_mutex_unlock(&mca_btl_openib_component.ib_lock);
if( NULL != ib_proc ){
/* if we were able to find it - lock it.
* NOTE: we want to lock it outside of list locked region */
opal_mutex_lock(&ib_proc->proc_lock);
}
return ib_proc;
}
static inline void unpack8(char **src, uint8_t *value)
{
/* Copy one character */
*value = (uint8_t) **src;
/* Move the src ahead one */
++*src;
}
/*
* Create an IB process structure. There is a one-to-one correspondence
* between an opal_proc_t and a mca_btl_openib_proc_t instance. We
* cache additional data (specifically the list of
* mca_btl_openib_endpoint_t instances, and published addresses)
* associated with a given destination on this data structure.
*/
mca_btl_openib_proc_t* mca_btl_openib_proc_get_locked(opal_proc_t* proc)
{
mca_btl_openib_proc_t *ib_proc = NULL, *ib_proc_ret = NULL;
size_t msg_size;
uint32_t size;
int rc, i, j;
void *message;
char *offset;
int modex_message_size;
mca_btl_openib_modex_message_t dummy;
bool is_new = false;
/* Check if we have already created an IB proc
* structure for this ompi process */
ib_proc = ibproc_lookup_and_lock(proc);
if (NULL != ib_proc) {
/* Gotcha! */
return ib_proc;
}
/* All initialization has to be an atomic operation. We proceed as follows:
* - we let all concurrent threads try to do the initialization;
* - when one has finished, it locks ib_lock and checks whether the
* corresponding process is still missing;
* - if so, the new proc is added; otherwise the initialized proc struct is released.
*/
/* First time, gotta create a new IB proc
* out of the opal_proc ... */
ib_proc = OBJ_NEW(mca_btl_openib_proc_t);
if (NULL == ib_proc) {
return NULL;
}
/* Initialize number of peer endpoints */
ib_proc->proc_endpoint_count = 0;
ib_proc->proc_opal = proc;
/* query for the peer address info */
OPAL_MODEX_RECV(rc, &mca_btl_openib_component.super.btl_version,
&proc->proc_name, &message, &msg_size);
if (OPAL_SUCCESS != rc) {
BTL_VERBOSE(("[%s:%d] opal_modex_recv failed for peer %s",
__FILE__, __LINE__,
OPAL_NAME_PRINT(proc->proc_name)));
goto no_err_exit;
}
if (0 == msg_size) {
goto no_err_exit;
}
/* Message was packed in btl_openib_component.c; the format is
listed in a comment in that file */
modex_message_size = ((char *) &(dummy.end)) - ((char*) &dummy);
/* Unpack the number of modules in the message */
offset = (char *) message;
unpack8(&offset, &(ib_proc->proc_port_count));
BTL_VERBOSE(("unpack: %d btls", ib_proc->proc_port_count));
if (ib_proc->proc_port_count > 0) {
ib_proc->proc_ports = (mca_btl_openib_proc_modex_t *)
malloc(sizeof(mca_btl_openib_proc_modex_t) *
ib_proc->proc_port_count);
if (NULL == ib_proc->proc_ports) {
/* Do not unpack into a failed allocation */
goto err_exit;
}
} else {
ib_proc->proc_ports = NULL;
}
/* Loop over unpacking all the ports */
for (i = 0; i < ib_proc->proc_port_count; i++) {
/* Unpack the modex comment message struct */
size = modex_message_size;
memcpy(&(ib_proc->proc_ports[i].pm_port_info), offset, size);
#if !defined(WORDS_BIGENDIAN) && OPAL_ENABLE_HETEROGENEOUS_SUPPORT
MCA_BTL_OPENIB_MODEX_MSG_NTOH(ib_proc->proc_ports[i].pm_port_info);
#endif
offset += size;
BTL_VERBOSE(("unpacked btl %d: modex message, offset now %d",
i, (int)(offset-((char*)message))));
/* Unpack the number of CPCs that follow */
unpack8(&offset, &(ib_proc->proc_ports[i].pm_cpc_data_count));
BTL_VERBOSE(("unpacked btl %d: number of cpcs to follow %d (offset now %d)",
i, ib_proc->proc_ports[i].pm_cpc_data_count,
(int)(offset-((char*)message))));
ib_proc->proc_ports[i].pm_cpc_data = (opal_btl_openib_connect_base_module_data_t *)
calloc(ib_proc->proc_ports[i].pm_cpc_data_count,
sizeof(opal_btl_openib_connect_base_module_data_t));
if (NULL == ib_proc->proc_ports[i].pm_cpc_data) {
goto err_exit;
}
/* Unpack the CPCs */
for (j = 0; j < ib_proc->proc_ports[i].pm_cpc_data_count; ++j) {
uint8_t u8;
opal_btl_openib_connect_base_module_data_t *cpcd;
cpcd = ib_proc->proc_ports[i].pm_cpc_data + j;
unpack8(&offset, &u8);
BTL_VERBOSE(("unpacked btl %d: cpc %d: index %d (offset now %d)",
i, j, u8, (int)(offset-(char*)message)));
cpcd->cbm_component =
opal_btl_openib_connect_base_get_cpc_byindex(u8);
BTL_VERBOSE(("unpacked btl %d: cpc %d: component %s",
i, j, cpcd->cbm_component->cbc_name));
unpack8(&offset, &cpcd->cbm_priority);
unpack8(&offset, &cpcd->cbm_modex_message_len);
BTL_VERBOSE(("unpacked btl %d: cpc %d: priority %d, msg len %d (offset now %d)",
i, j, cpcd->cbm_priority,
cpcd->cbm_modex_message_len,
(int)(offset-(char*)message)));
if (cpcd->cbm_modex_message_len > 0) {
cpcd->cbm_modex_message = malloc(cpcd->cbm_modex_message_len);
if (NULL == cpcd->cbm_modex_message) {
BTL_ERROR(("Failed to malloc"));
goto err_exit;
}
memcpy(cpcd->cbm_modex_message, offset,
cpcd->cbm_modex_message_len);
offset += cpcd->cbm_modex_message_len;
BTL_VERBOSE(("unpacked btl %d: cpc %d: blob unpacked %d %x (offset now %d)",
i, j,
((uint32_t*)cpcd->cbm_modex_message)[0],
((uint32_t*)cpcd->cbm_modex_message)[1],
(int)(offset-((char*)message))));
}
}
}
if (0 == ib_proc->proc_port_count) {
ib_proc->proc_endpoints = NULL;
goto no_err_exit;
} else {
ib_proc->proc_endpoints = (volatile mca_btl_base_endpoint_t**)
malloc(ib_proc->proc_port_count *
sizeof(mca_btl_base_endpoint_t*));
}
if (NULL == ib_proc->proc_endpoints) {
goto err_exit;
}
BTL_VERBOSE(("unpacking done!"));
/* Finally add this process to the initialized procs list */
opal_mutex_lock(&mca_btl_openib_component.ib_lock);
ib_proc_ret = ibproc_lookup_no_lock(proc);
if (NULL == ib_proc_ret) {
/* if process can't be found in this list - insert it locked
* it is safe to lock ib_proc here because this thread is
* the only one who knows about it so far */
opal_mutex_lock(&ib_proc->proc_lock);
opal_list_append(&mca_btl_openib_component.ib_procs, &ib_proc->super);
ib_proc_ret = ib_proc;
is_new = true;
} else {
/* otherwise - release module_proc */
OBJ_RELEASE(ib_proc);
}
opal_mutex_unlock(&mca_btl_openib_component.ib_lock);
/* if we haven't inserted the process - lock it here so we
* don't hold mca_btl_openib_component.ib_lock while waiting */
if( !is_new ){
opal_mutex_lock(&ib_proc_ret->proc_lock);
}
return ib_proc_ret;
err_exit:
BTL_ERROR(("%d: error exit from mca_btl_openib_proc_get_locked", OPAL_PROC_MY_NAME.vpid));
no_err_exit:
OBJ_RELEASE(ib_proc);
return NULL;
}
int mca_btl_openib_proc_remove(opal_proc_t *proc,
mca_btl_base_endpoint_t *endpoint)
{
size_t i;
mca_btl_openib_proc_t* ib_proc = NULL;
/* Remove endpoint from the openib BTL version of the proc as
well */
ib_proc = ibproc_lookup_and_lock(proc);
if (NULL != ib_proc) {
for (i = 0; i < ib_proc->proc_endpoint_count; ++i) {
if (ib_proc->proc_endpoints[i] == endpoint) {
ib_proc->proc_endpoints[i] = NULL;
if (i == ib_proc->proc_endpoint_count - 1) {
--ib_proc->proc_endpoint_count;
}
opal_mutex_unlock(&ib_proc->proc_lock);
return OPAL_SUCCESS;
}
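}
/* endpoint not found: release the lock taken by ibproc_lookup_and_lock() */
opal_mutex_unlock(&ib_proc->proc_lock);
}
return OPAL_ERR_NOT_FOUND;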
}
/*
* Note that this routine must be called with the lock on the process
* already held. Insert a btl instance into the proc array and assign
* it an address.
*/
int mca_btl_openib_proc_insert(mca_btl_openib_proc_t* module_proc,
mca_btl_base_endpoint_t* module_endpoint)
{
/* insert into endpoint array */
#ifndef WORDS_BIGENDIAN
/* if we are little endian and our peer is not so lucky, then we
need to put all information sent to him in big endian (aka
Network Byte Order) and expect all information received to
be in NBO. Since big endian machines always send and receive
in NBO, we don't care so much about that case. */
if (module_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN) {
module_endpoint->nbo = true;
}
#endif
/* only allow eager rdma if the peers agree on the size of a long */
if((module_proc->proc_opal->proc_arch & OPAL_ARCH_LONGISxx) !=
(opal_proc_local_get()->proc_arch & OPAL_ARCH_LONGISxx)) {
module_endpoint->use_eager_rdma = false;
}
module_endpoint->endpoint_proc = module_proc;
module_proc->proc_endpoints[module_proc->proc_endpoint_count++] = module_endpoint;
return OPAL_SUCCESS;
}
int mca_btl_openib_proc_reg_btl(mca_btl_openib_proc_t* ib_proc,
mca_btl_openib_module_t* openib_btl)
{
mca_btl_openib_proc_btlptr_t* elem;
OPAL_LIST_FOREACH(elem, &ib_proc->openib_btls, mca_btl_openib_proc_btlptr_t) {
if(elem->openib_btl == openib_btl) {
/* this is normal return meaning that this BTL has already touched this ib_proc */
return OPAL_ERR_RESOURCE_BUSY;
}
}
elem = OBJ_NEW(mca_btl_openib_proc_btlptr_t);
if( NULL == elem ){
return OPAL_ERR_OUT_OF_RESOURCE;
}
elem->openib_btl = openib_btl;
opal_list_append(&ib_proc->openib_btls, &elem->super);
return OPAL_SUCCESS;
}
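
The lookup in mca_btl_openib_proc_get_locked() follows a "build outside the list lock, re-check under it" pattern: a thread that misses the first lookup constructs a candidate without holding the list lock, then re-checks the list before inserting and discards its candidate if another thread won the race. The following is a minimal sketch of that pattern using POSIX mutexes rather than the OPAL primitives; `peer_entry_t`, `registry_head`, and `peer_get_locked` are hypothetical stand-ins, not names from the removed code.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for mca_btl_openib_proc_t */
typedef struct peer_entry {
    struct peer_entry *next;
    int key;
    pthread_mutex_t lock;
} peer_entry_t;

static peer_entry_t *registry_head = NULL;
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold registry_lock. */
static peer_entry_t *lookup_no_lock(int key)
{
    peer_entry_t *e;
    for (e = registry_head; NULL != e; e = e->next) {
        if (e->key == key) {
            return e;
        }
    }
    return NULL;
}

/* Returns the entry for key with entry->lock held, or NULL if the
 * allocation fails. */
static peer_entry_t *peer_get_locked(int key)
{
    peer_entry_t *found, *fresh;

    pthread_mutex_lock(&registry_lock);
    found = lookup_no_lock(key);
    pthread_mutex_unlock(&registry_lock);
    if (NULL != found) {
        /* Lock the entry outside the registry-locked region */
        pthread_mutex_lock(&found->lock);
        return found;
    }

    /* Build a candidate without holding the registry lock */
    fresh = calloc(1, sizeof(*fresh));
    if (NULL == fresh) {
        return NULL;
    }
    fresh->key = key;
    pthread_mutex_init(&fresh->lock, NULL);

    pthread_mutex_lock(&registry_lock);
    found = lookup_no_lock(key);
    if (NULL == found) {
        /* Nobody else can see the candidate yet, so locking it here
         * cannot block */
        pthread_mutex_lock(&fresh->lock);
        fresh->next = registry_head;
        registry_head = fresh;
        found = fresh;
        fresh = NULL;
    }
    pthread_mutex_unlock(&registry_lock);

    if (NULL != fresh) {
        /* Another thread inserted first: drop our candidate and lock
         * the winner after releasing the registry lock */
        pthread_mutex_destroy(&fresh->lock);
        free(fresh);
        pthread_mutex_lock(&found->lock);
    }
    return found;
}
```

The caller is expected to unlock `entry->lock` when done, just as callers of the original function release `proc_lock`.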

View file

@ -1,115 +0,0 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2008 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2015 Mellanox Technologies. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_IB_PROC_H
#define MCA_BTL_IB_PROC_H
#include "opal/class/opal_object.h"
#include "opal/util/proc.h"
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
BEGIN_C_DECLS
/* Must forward reference this to avoid include file loop */
struct opal_btl_openib_connect_base_module_data_t;
/**
* Data received from the modex. For each openib BTL module/port in
* the peer, we'll receive two things:
*
* 1. Data about the peer's port
* 2. An array of CPCs that the peer has available on that port, each
* of which has its own meta data
*
* Hence, these two items need to be bundled together.
*/
typedef struct mca_btl_openib_proc_modex_t {
/** Information about the peer's port */
mca_btl_openib_modex_message_t pm_port_info;
/** Array of the peer's CPCs available on this port */
opal_btl_openib_connect_base_module_data_t *pm_cpc_data;
/** Length of the pm_cpc_data array */
uint8_t pm_cpc_data_count;
} mca_btl_openib_proc_modex_t;
/**
* The list element to hold pointers to openib_btls that are using this
* ib_proc.
*/
struct mca_btl_openib_proc_btlptr_t {
opal_list_item_t super;
mca_btl_openib_module_t* openib_btl;
};
typedef struct mca_btl_openib_proc_btlptr_t mca_btl_openib_proc_btlptr_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_proc_btlptr_t);
/**
* Represents the state of a remote process and the set of addresses
* that it exports. Also cache an instance of mca_btl_base_endpoint_t for
* each BTL instance that attempts to open a connection to the process.
*/
struct mca_btl_openib_proc_t {
/** allow proc to be placed on a list */
opal_list_item_t super;
/** pointer to corresponding opal_proc_t */
const opal_proc_t *proc_opal;
/** modex messages from this proc; one for each port in the peer */
mca_btl_openib_proc_modex_t *proc_ports;
/** length of proc_ports array */
uint8_t proc_port_count;
/** list of openib_btl's that touched this proc **/
opal_list_t openib_btls;
/** array of endpoints that have been created to access this proc */
volatile struct mca_btl_base_endpoint_t **proc_endpoints;
/** number of endpoints (length of proc_endpoints array) */
volatile size_t proc_endpoint_count;
/** lock to protect against concurrent access to proc state */
opal_mutex_t proc_lock;
};
typedef struct mca_btl_openib_proc_t mca_btl_openib_proc_t;
OBJ_CLASS_DECLARATION(mca_btl_openib_proc_t);
mca_btl_openib_proc_t* mca_btl_openib_proc_get_locked(opal_proc_t* proc);
int mca_btl_openib_proc_insert(mca_btl_openib_proc_t*, mca_btl_base_endpoint_t*);
int mca_btl_openib_proc_remove(opal_proc_t* proc,
mca_btl_base_endpoint_t* module_endpoint);
int mca_btl_openib_proc_reg_btl(mca_btl_openib_proc_t* ib_proc,
mca_btl_openib_module_t* openib_btl);
END_C_DECLS
#endif
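
The modex data described in this header arrives as a packed byte stream: one-byte counts followed by length-prefixed blobs, one per CPC. The sketch below shows one way to walk such a stream. Note that the bounds checks are an addition for illustration (the removed unpack8() advanced the pointer without checking), and `cursor_t` and `unpack_blob` are hypothetical names, not part of the removed code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cursor over a received message buffer */
typedef struct {
    const uint8_t *pos;  /* next unread byte */
    const uint8_t *end;  /* one past the last valid byte */
} cursor_t;

/* Read one byte and advance; fails when the buffer is exhausted. */
static bool unpack8(cursor_t *c, uint8_t *value)
{
    if (c->pos >= c->end) {
        return false;
    }
    *value = *c->pos++;
    return true;
}

/* Read a one-byte length, then copy that many bytes into out.
 * Fails if the message is truncated or out (capacity cap) is too small. */
static bool unpack_blob(cursor_t *c, uint8_t *out, size_t cap, uint8_t *len)
{
    if (!unpack8(c, len)) {
        return false;
    }
    if ((size_t)(c->end - c->pos) < *len || cap < *len) {
        return false;
    }
    memcpy(out, c->pos, *len);
    c->pos += *len;
    return true;
}
```

A caller would first read the port count with unpack8(), then loop over the ports, calling unpack_blob() once per CPC entry and stopping at the first failure.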

View file

@ -1,175 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2013 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
* Copyright (c) 2006-2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2006-2007 Voltaire All rights reserved.
* Copyright (c) 2008-2012 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2009 IBM Corporation. All rights reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved
* Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "btl_openib_frag.h"
#include "btl_openib_endpoint.h"
#include "btl_openib_proc.h"
#include "btl_openib_xrc.h"
/*
* RDMA WRITE local buffer to remote buffer address.
*/
int mca_btl_openib_put (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep, void *local_address,
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata)
{
mca_btl_openib_put_frag_t *frag = NULL;
int rc, qp = order;
if (MCA_BTL_NO_ORDER == qp) {
qp = mca_btl_openib_component.rdma_qp;
}
if (OPAL_UNLIKELY((btl->btl_put_local_registration_threshold < size && !local_handle) || !remote_handle ||
size > btl->btl_put_limit)) {
return OPAL_ERR_BAD_PARAM;
}
frag = to_put_frag(alloc_send_user_frag ());
if (OPAL_UNLIKELY(NULL == frag)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* set base descriptor flags */
to_base_frag(frag)->base.order = qp;
/* free this descriptor when the operation is complete */
to_base_frag(frag)->base.des_flags = MCA_BTL_DES_FLAGS_BTL_OWNERSHIP;
/* set up scatter-gather entry */
to_com_frag(frag)->sg_entry.length = size;
if (local_handle) {
to_com_frag(frag)->sg_entry.lkey = local_handle->lkey;
} else {
/* lkey is not required for inline RDMA write */
to_com_frag(frag)->sg_entry.lkey = 0;
}
to_com_frag(frag)->sg_entry.addr = (uint64_t)(intptr_t) local_address;
to_com_frag(frag)->endpoint = ep;
/* set up rdma callback */
frag->cb.func = cbfunc;
frag->cb.context = cbcontext;
frag->cb.data = cbdata;
frag->cb.local_handle = local_handle;
/* post descriptor */
to_out_frag(frag)->sr_desc.opcode = IBV_WR_RDMA_WRITE;
to_out_frag(frag)->sr_desc.wr.rdma.remote_addr = remote_address;
qp_inflight_wqe_to_frag(ep, qp, to_com_frag(frag));
qp_reset_signal_count(ep, qp);
#if OPAL_ENABLE_HETEROGENEOUS_SUPPORT
if ((ep->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN)
!= (opal_proc_local_get()->proc_arch & OPAL_ARCH_ISBIGENDIAN)) {
to_out_frag(frag)->sr_desc.wr.rdma.rkey = opal_swap_bytes4(remote_handle->rkey);
} else
#endif
{
to_out_frag(frag)->sr_desc.wr.rdma.rkey = remote_handle->rkey;
}
if (ep->endpoint_state != MCA_BTL_IB_CONNECTED) {
OPAL_THREAD_LOCK(&ep->endpoint_lock);
rc = check_endpoint_state(ep, &to_base_frag(frag)->base, &ep->pending_put_frags);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
if (OPAL_ERR_RESOURCE_BUSY == rc) {
/* descriptor was queued pending connection */
return OPAL_SUCCESS;
}
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
MCA_BTL_IB_FRAG_RETURN (frag);
return rc;
}
}
rc = mca_btl_openib_put_internal (btl, ep, frag);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
if (OPAL_LIKELY(OPAL_ERR_OUT_OF_RESOURCE == rc)) {
rc = OPAL_SUCCESS;
/* queue the fragment for when resources are available */
OPAL_THREAD_LOCK(&ep->endpoint_lock);
opal_list_append(&ep->pending_put_frags, (opal_list_item_t*)frag);
OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
} else {
MCA_BTL_IB_FRAG_RETURN (frag);
}
}
return rc;
}
int mca_btl_openib_put_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
mca_btl_openib_put_frag_t *frag)
{
int qp = to_base_frag(frag)->base.order;
struct ibv_send_wr *bad_wr;
int rc;
/* NTH: the inline send size and remote SRQ number are only available once the endpoint is
* connected. By setting these values here instead of mca_btl_openib_put we guarantee
* both fields are initialized */
to_out_frag(frag)->sr_desc.send_flags = ib_send_flags (to_com_frag(frag)->sg_entry.length,
&(ep->qps[qp]), 1);
#if HAVE_XRC
if (MCA_BTL_XRC_ENABLED && BTL_OPENIB_QP_TYPE_XRC(qp)) {
#if OPAL_HAVE_CONNECTX_XRC
to_out_frag(frag)->sr_desc.xrc_remote_srq_num = ep->rem_info.rem_srqs[qp].rem_srq_num;
#elif OPAL_HAVE_CONNECTX_XRC_DOMAINS
to_out_frag(frag)->sr_desc.qp_type.xrc.remote_srqn = ep->rem_info.rem_srqs[qp].rem_srq_num;
#else
#error "that should never happen"
#endif
}
#endif
/* check for a send wqe */
if (qp_get_wqe(ep, qp) < 0) {
qp_put_wqe(ep, qp);
return OPAL_ERR_OUT_OF_RESOURCE;
}
qp_inflight_wqe_to_frag(ep, qp, to_com_frag(frag));
qp_reset_signal_count(ep, qp);
if (0 != (rc = ibv_post_send(ep->qps[qp].qp->lcl_qp, &to_out_frag(frag)->sr_desc, &bad_wr))) {
qp_put_wqe(ep, qp);
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}

View file

@ -1,211 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2007-2008 Mellanox Technologies. All rights reserved.
* Copyright (c) 2009 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2014 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2016 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include <infiniband/verbs.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#include <dlfcn.h>
#include "opal/mca/btl/base/base.h"
#include "opal/util/printf.h"
#include "btl_openib_xrc.h"
#include "btl_openib.h"
#if HAVE_XRC
#define SIZE_OF3(A, B, C) (sizeof(A) + sizeof(B) + sizeof(C))
static void ib_address_constructor(ib_address_t *ib_addr);
static void ib_address_destructor(ib_address_t *ib_addr);
OBJ_CLASS_INSTANCE(ib_address_t,
opal_list_item_t,
ib_address_constructor,
ib_address_destructor);
/* This func. opens XRC domain */
int mca_btl_openib_open_xrc_domain(struct mca_btl_openib_device_t *device)
{
int len;
char *xrc_file_name;
const char *dev_name;
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
struct ibv_xrcd_init_attr xrcd_attr;
#endif
dev_name = ibv_get_device_name(device->ib_dev);
len = opal_asprintf(&xrc_file_name,
"%s"OPAL_PATH_SEP"openib_xrc_domain_%s",
opal_process_info.job_session_dir, dev_name);
if (0 > len) {
BTL_ERROR(("Failed to allocate memory for XRC file name: %s\n",
strerror(errno)));
return OPAL_ERROR;
}
device->xrc_fd = open(xrc_file_name, O_CREAT, S_IWUSR|S_IRUSR);
if (0 > device->xrc_fd) {
BTL_ERROR(("Failed to open XRC domain file %s, errno says %s\n",
xrc_file_name,strerror(errno)));
free(xrc_file_name);
return OPAL_ERROR;
}
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
memset(&xrcd_attr, 0, sizeof xrcd_attr);
xrcd_attr.comp_mask = IBV_XRCD_INIT_ATTR_FD | IBV_XRCD_INIT_ATTR_OFLAGS;
xrcd_attr.fd = device->xrc_fd;
xrcd_attr.oflags = O_CREAT;
device->xrcd = ibv_open_xrcd(device->ib_dev_context, &xrcd_attr);
if (NULL == device->xrcd) {
#else
device->xrc_domain = ibv_open_xrc_domain(device->ib_dev_context, device->xrc_fd, O_CREAT);
if (NULL == device->xrc_domain) {
#endif
BTL_ERROR(("Failed to open XRC domain\n"));
close(device->xrc_fd);
free(xrc_file_name);
return OPAL_ERROR;
}
/* the file name is only needed to open the domain file */
free(xrc_file_name);
return OPAL_SUCCESS;
}
/* This func. closes XRC domain */
int mca_btl_openib_close_xrc_domain(struct mca_btl_openib_device_t *device)
{
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
if (NULL == device->xrcd) {
#else
if (NULL == device->xrc_domain) {
#endif
/* No XRC domain, just exit */
return OPAL_SUCCESS;
}
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
if (ibv_close_xrcd(device->xrcd)) {
#else
if (ibv_close_xrc_domain(device->xrc_domain)) {
#endif
BTL_ERROR(("Failed to close XRC domain, errno %d says %s\n",
errno, strerror(errno)));
return OPAL_ERROR;
}
/* do we need to check exit status */
if (close(device->xrc_fd)) {
BTL_ERROR(("Failed to close XRC file descriptor, errno %d says %s\n",
errno, strerror(errno)));
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}
static void ib_address_constructor(ib_address_t *ib_addr)
{
ib_addr->key = NULL;
ib_addr->subnet_id = 0;
ib_addr->lid = 0;
ib_addr->status = MCA_BTL_IB_ADDR_CLOSED;
ib_addr->qp = NULL;
ib_addr->max_wqe = 0;
/* NTH: make the addr_lock recursive because mca_btl_openib_endpoint_connected can call
* into the CPC with the lock held. The alternative would be to drop the lock but the
* lock is never obtained in a critical path. */
OBJ_CONSTRUCT(&ib_addr->addr_lock, opal_recursive_mutex_t);
OBJ_CONSTRUCT(&ib_addr->pending_ep, opal_list_t);
}
static void ib_address_destructor(ib_address_t *ib_addr)
{
if (NULL != ib_addr->key) {
free(ib_addr->key);
}
OBJ_DESTRUCT(&ib_addr->addr_lock);
OBJ_DESTRUCT(&ib_addr->pending_ep);
}
static int ib_address_init(ib_address_t *ib_addr, uint16_t lid, uint64_t s_id, opal_jobid_t ep_jobid)
{
ib_addr->key = malloc(SIZE_OF3(s_id, lid, ep_jobid));
if (NULL == ib_addr->key) {
BTL_ERROR(("Failed to allocate memory for key\n"));
return OPAL_ERROR;
}
memset(ib_addr->key, 0, SIZE_OF3(s_id, lid, ep_jobid));
/* creating the key = lid + s_id + ep_jobid */
memcpy(ib_addr->key, &lid, sizeof(lid));
memcpy((void*)((char*)ib_addr->key + sizeof(lid)), &s_id, sizeof(s_id));
memcpy((void*)((char*)ib_addr->key + sizeof(lid) + sizeof(s_id)),
&ep_jobid, sizeof(ep_jobid));
/* caching lid and subnet id */
ib_addr->subnet_id = s_id;
ib_addr->lid = lid;
return OPAL_SUCCESS;
}
/* Create new entry in hash table for subnet_id and lid,
* update the endpoint pointer.
* Before call to this function you need to protect with
*/
int mca_btl_openib_ib_address_add_new (uint16_t lid, uint64_t s_id,
opal_jobid_t ep_jobid, mca_btl_openib_endpoint_t *ep)
{
void *tmp;
int ret = OPAL_SUCCESS;
struct ib_address_t *ib_addr = OBJ_NEW(ib_address_t);
ret = ib_address_init(ib_addr, lid, s_id, ep_jobid);
if (OPAL_SUCCESS != ret ) {
BTL_ERROR(("XRC Internal error. Failed to init ib_addr\n"));
/* ib_addr came from OBJ_NEW, so release it to free the storage too */
OBJ_RELEASE(ib_addr);
return ret;
}
/* is it already in the table ?*/
OPAL_THREAD_LOCK(&mca_btl_openib_component.ib_lock);
if (OPAL_SUCCESS != opal_hash_table_get_value_ptr(&mca_btl_openib_component.ib_addr_table,
ib_addr->key,
SIZE_OF3(s_id, lid, ep_jobid), &tmp)) {
/* It is new one, lets put it on the table */
ret = opal_hash_table_set_value_ptr(&mca_btl_openib_component.ib_addr_table,
ib_addr->key, SIZE_OF3(s_id, lid, ep_jobid), (void*)ib_addr);
if (OPAL_SUCCESS != ret) {
BTL_ERROR(("XRC Internal error."
" Failed to add element to mca_btl_openib_component.ib_addr_table\n"));
OPAL_THREAD_UNLOCK(&mca_btl_openib_component.ib_lock);
/* ib_addr came from OBJ_NEW, so release it to free the storage too */
OBJ_RELEASE(ib_addr);
return ret;
}
/* update the endpoint with pointer to ib address */
ep->ib_addr = ib_addr;
} else {
/* so we have this one in the table, just add the pointer to the endpoint */
ep->ib_addr = (ib_address_t *)tmp;
assert(lid == ep->ib_addr->lid && s_id == ep->ib_addr->subnet_id);
/* the temporary object is no longer needed; release it to free the storage too */
OBJ_RELEASE(ib_addr);
}
OPAL_THREAD_UNLOCK(&mca_btl_openib_component.ib_lock);
return ret;
}
#endif
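
The key built in ib_address_init above is a packed byte string (LID, then subnet ID, then job ID) that is later compared byte for byte by the hash table. The following is a minimal standalone sketch of that packing, not the removed implementation: sketch_jobid_t, SKETCH_KEY_SIZE, sketch_make_key and sketch_key_equal are hypothetical names, and the job ID is assumed to be a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for opal_jobid_t (assumed to be 32 bits wide). */
typedef uint32_t sketch_jobid_t;

/* Size of the packed key: LID, then subnet ID, then job ID, no padding. */
#define SKETCH_KEY_SIZE \
    (sizeof(uint16_t) + sizeof(uint64_t) + sizeof(sketch_jobid_t))

/* Build a heap-allocated key; the caller frees it. Returns NULL on failure. */
static unsigned char *sketch_make_key(uint16_t lid, uint64_t subnet_id,
                                      sketch_jobid_t jobid)
{
    /* calloc zeroes the buffer so every byte of the key is defined */
    unsigned char *key = calloc(1, SKETCH_KEY_SIZE);
    if (NULL == key) {
        return NULL;
    }
    memcpy(key, &lid, sizeof(lid));
    memcpy(key + sizeof(lid), &subnet_id, sizeof(subnet_id));
    memcpy(key + sizeof(lid) + sizeof(subnet_id), &jobid, sizeof(jobid));
    return key;
}

/* Two keys match only if all three packed fields match. */
static int sketch_key_equal(const unsigned char *a, const unsigned char *b)
{
    assert(NULL != a && NULL != b);
    return 0 == memcmp(a, b, SKETCH_KEY_SIZE);
}
```

Copying each field with memcpy, rather than hashing a struct, keeps compiler-inserted padding out of the key, so two keys built from equal fields always compare equal.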

View file

@ -1,58 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2007-2008 Mellanox Technologies. All rights reserved.
* Copyright (c) 2014 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 Bull SAS. All rights reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2016 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*
* @file
*/
#ifndef MCA_BTL_OPENIB_XRC_H
#define MCA_BTL_OPENIB_XRC_H
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
#if HAVE_XRC
#define MCA_BTL_XRC_ENABLED (mca_btl_openib_component.num_xrc_qps)
#else
#define MCA_BTL_XRC_ENABLED 0
#endif
typedef enum {
MCA_BTL_IB_ADDR_CONNECTING = 100,
MCA_BTL_IB_ADDR_CONNECTED,
MCA_BTL_IB_ADDR_CLOSED
} mca_btl_openib_ib_addr_state_t;
struct ib_address_t {
opal_list_item_t super;
void *key; /* lookup key: LID (16 bit) + subnet ID (64 bit) + job ID */
uint64_t subnet_id; /* caching subnet_id */
uint16_t lid; /* caching lid */
opal_list_t pending_ep; /* list of endpoints that use this ib_address */
mca_btl_openib_qp_t *qp; /* pointer to qp that will be used
for communication with the
destination */
uint32_t remote_xrc_rcv_qp_num; /* remote xrc qp number */
opal_mutex_t addr_lock; /* protection */
mca_btl_openib_ib_addr_state_t status; /* ib port status */
int32_t max_wqe;
};
typedef struct ib_address_t ib_address_t;
int mca_btl_openib_open_xrc_domain(struct mca_btl_openib_device_t *device);
int mca_btl_openib_close_xrc_domain(struct mca_btl_openib_device_t *device);
int mca_btl_openib_ib_address_add_new (uint16_t lid, uint64_t s_id,
opal_jobid_t ep_jobid, mca_btl_openib_endpoint_t *ep);
#endif

View file

@ -1,4 +0,0 @@
# Ignore symbols in this component that are auto-generated and we
# can't do anything about them (e.g., flex/bison symbols).
btl_openib_ini_yyleng
btl_openib_ini_yytext

View file

@ -1,117 +0,0 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2008-2011 Mellanox Technologies. All rights reserved.
# Copyright (c) 2011 Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
# Copyright (c) 2015 Research Organization for Information Science
# and Technology (RIST). All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# MCA_btl_openib_POST_CONFIG([should_build])
# ------------------------------------------
AC_DEFUN([MCA_opal_btl_openib_POST_CONFIG], [
AM_CONDITIONAL([MCA_btl_openib_have_xrc], [test $1 -eq 1 && test "x$btl_openib_have_xrc" = "x1"])
AM_CONDITIONAL([MCA_btl_openib_have_rdmacm], [test $1 -eq 1 && test "x$btl_openib_have_rdmacm" = "x1"])
AM_CONDITIONAL([MCA_btl_openib_have_dynamic_sl], [test $1 -eq 1 && test "x$btl_openib_have_opensm_devel" = "x1"])
AM_CONDITIONAL([MCA_btl_openib_have_udcm], [test $1 -eq 1 && test "x$btl_openib_have_udcm" = "x1"])
])
# MCA_btl_openib_CONFIG([action-if-can-compile],
# [action-if-cant-compile])
# ------------------------------------------------
AC_DEFUN([MCA_opal_btl_openib_CONFIG],[
AC_CONFIG_FILES([opal/mca/btl/openib/Makefile])
OPAL_VAR_SCOPE_PUSH([cpcs btl_openib_LDFLAGS_save btl_openib_LIBS_save])
cpcs="oob"
OPAL_CHECK_OPENFABRICS([btl_openib],
[btl_openib_happy="yes"
OPAL_CHECK_OPENFABRICS_CM([btl_openib])],
[btl_openib_happy="no"])
OPAL_CHECK_EXP_VERBS([btl_openib], [], [])
AS_IF([test "$btl_openib_happy" = "yes"],
[# With the new openib flags, look for ibv_fork_init
btl_openib_LDFLAGS_save="$LDFLAGS"
btl_openib_LIBS_save="$LIBS"
LDFLAGS="$LDFLAGS $btl_openib_LDFLAGS"
LIBS="$LIBS $btl_openib_LIBS"
AC_CHECK_FUNCS([ibv_fork_init])
LDFLAGS="$btl_openib_LDFLAGS_save"
LIBS="$btl_openib_LIBS_save"
$1],
[$2])
AS_IF([test "$btl_openib_happy" = "yes"],
[if test "x$btl_openib_have_xrc" = "x1"; then
cpcs="$cpcs xoob"
fi
if test "x$btl_openib_have_rdmacm" = "x1"; then
cpcs="$cpcs rdmacm"
if test "$enable_openib_rdmacm_ibaddr" = "yes"; then
AC_MSG_CHECKING([IB addressing])
AC_EGREP_CPP(
yes,
[
#include <infiniband/ib.h>
#ifdef AF_IB
yes
#endif
],
[
AC_CHECK_HEADERS(
[rdma/rsocket.h],
[
AC_MSG_RESULT([yes])
AC_DEFINE(BTL_OPENIB_RDMACM_IB_ADDR, 1, rdmacm IB_AF addressing support)
],
[
AC_MSG_RESULT([no])
AC_DEFINE(BTL_OPENIB_RDMACM_IB_ADDR, 0, rdmacm without IB_AF addressing support)
AC_MSG_WARN([There is no IB_AF addressing support by lib rdmacm.])
]
)],
[
AC_MSG_RESULT([no])
AC_DEFINE(BTL_OPENIB_RDMACM_IB_ADDR, 0, rdmacm without IB_AF addressing support)
AC_MSG_WARN([There is no IB_AF addressing support by lib rdmacm.])
])
else
AC_DEFINE(BTL_OPENIB_RDMACM_IB_ADDR, 0, rdmacm without IB_AF addressing support)
fi
fi
if test "x$btl_openib_have_udcm" = "x1"; then
cpcs="$cpcs udcm"
fi
AC_MSG_CHECKING([which openib btl cpcs will be built])
AC_MSG_RESULT([$cpcs])])
# make sure that CUDA-aware checks have been done
AC_REQUIRE([OPAL_CHECK_CUDA])
# substitute in the things needed to build openib
AC_SUBST([btl_openib_CFLAGS])
AC_SUBST([btl_openib_CPPFLAGS])
AC_SUBST([btl_openib_LDFLAGS])
AC_SUBST([btl_openib_LIBS])
OPAL_VAR_SCOPE_POP
])dnl

View file

@ -1,105 +0,0 @@
/*
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2013 Mellanox Technologies, Inc.
* All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_BASE_H
#define BTL_OPENIB_CONNECT_BASE_H
#include "opal/mca/btl/openib/connect/connect.h"
#ifdef OPAL_HAVE_RDMAOE
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl) \
(((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) || \
(IBV_LINK_LAYER_ETHERNET == ((btl)->ib_port_attr.link_layer))) ? \
true : false)
#else
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl) \
((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) ? \
true : false)
#endif
BEGIN_C_DECLS
/*
* Forward declaration to resolve circular dependency
*/
struct mca_btl_base_endpoint_t;
/*
* Open function
*/
int opal_btl_openib_connect_base_register(void);
/*
* Component-wide CPC init
*/
int opal_btl_openib_connect_base_init(void);
/*
* Query CPCs to see if they want to run on a specific module
*/
int opal_btl_openib_connect_base_select_for_local_port
(mca_btl_openib_module_t *btl);
/*
* Forward reference to avoid an include file loop
*/
struct mca_btl_openib_proc_modex_t;
/*
* Select function
*/
int opal_btl_openib_connect_base_find_match
(mca_btl_openib_module_t *btl,
struct mca_btl_openib_proc_modex_t *peer_port,
opal_btl_openib_connect_base_module_t **local_cpc,
opal_btl_openib_connect_base_module_data_t **remote_cpc_data);
/*
* Find a CPC's index so that we can send it in the modex
*/
int opal_btl_openib_connect_base_get_cpc_index
(opal_btl_openib_connect_base_component_t *cpc);
/*
* Lookup a CPC by its index (received from the modex)
*/
opal_btl_openib_connect_base_component_t *
opal_btl_openib_connect_base_get_cpc_byindex(uint8_t index);
/*
* Allocate a CTS frag
*/
int opal_btl_openib_connect_base_alloc_cts(
struct mca_btl_base_endpoint_t *endpoint);
/*
* Free a CTS frag
*/
int opal_btl_openib_connect_base_free_cts(
struct mca_btl_base_endpoint_t *endpoint);
/*
* Start a new connection to an endpoint
*/
int opal_btl_openib_connect_base_start(
opal_btl_openib_connect_base_module_t *cpc,
struct mca_btl_base_endpoint_t *endpoint);
/*
* Component-wide CPC finalize
*/
void opal_btl_openib_connect_base_finalize(void);
END_C_DECLS
#endif

View file

@ -1,541 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2007-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2007 Mellanox Technologies, Inc. All rights reserved.
* Copyright (c) 2012-2015 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
*
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "btl_openib.h"
#include "btl_openib_proc.h"
#include "connect/base.h"
#include "connect/btl_openib_connect_empty.h"
#if OPAL_HAVE_RDMACM
#include "connect/btl_openib_connect_rdmacm.h"
#endif
#if OPAL_HAVE_UDCM
#include "connect/btl_openib_connect_udcm.h"
#endif
#include "opal/util/argv.h"
#include "opal/util/output.h"
#include "opal/util/proc.h"
#include "opal/util/show_help.h"
#include "opal/util/printf.h"
#include "opal/util/sys_limits.h"
#include "opal/align.h"
/*
* Array of all possible connection functions
*/
static opal_btl_openib_connect_base_component_t *all[] = {
/* Always have an entry here so that the CP indexes will always be
the same: OOB has been removed, so use the "empty" CPC */
&opal_btl_openib_connect_empty,
/* Always have an entry here so that the CP indexes will always be
the same: XOOB has been removed, so use the "empty" CPC */
&opal_btl_openib_connect_empty,
/* Always have an entry here so that the CP indexes will always be
the same: if RDMA CM is not available, use the "empty" CPC */
#if OPAL_HAVE_RDMACM
&opal_btl_openib_connect_rdmacm,
#else
&opal_btl_openib_connect_empty,
#endif
/* Always have an entry here so that the CP indexes will always be
the same: if UD CM is not enabled, use the "empty" CPC */
#if OPAL_HAVE_UDCM
&opal_btl_openib_connect_udcm,
#else
&opal_btl_openib_connect_empty,
#endif
NULL
};
/* increase this count if any more cpcs are added */
static opal_btl_openib_connect_base_component_t *available[5];
static int num_available = 0;
static char *btl_openib_cpc_include;
static char *btl_openib_cpc_exclude;
/*
* Register MCA parameters
*/
int opal_btl_openib_connect_base_register(void)
{
int i, j, save;
char **temp = NULL, *string = NULL, *all_cpc_names = NULL;
/* Make an MCA parameter to select which connect module to use */
for (i = 0; NULL != all[i]; ++i) {
/* The CPC name "empty" is reserved for "fake" CPC modules */
if (0 != strcmp(all[i]->cbc_name, "empty")) {
opal_argv_append_nosize(&temp, all[i]->cbc_name);
}
}
all_cpc_names = opal_argv_join(temp, ',');
opal_argv_free(temp);
opal_asprintf(&string,
"Method used to select OpenFabrics connections (valid values: %s)",
all_cpc_names);
btl_openib_cpc_include = NULL;
(void) mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"cpc_include", string, MCA_BASE_VAR_TYPE_STRING,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY,
&btl_openib_cpc_include);
free(string);
opal_asprintf(&string,
"Method used to exclude OpenFabrics connections (valid values: %s)",
all_cpc_names);
btl_openib_cpc_exclude = NULL;
(void) mca_base_component_var_register(&mca_btl_openib_component.super.btl_version,
"cpc_exclude", string, MCA_BASE_VAR_TYPE_STRING,
NULL, 0, 0, OPAL_INFO_LVL_9,
MCA_BASE_VAR_SCOPE_READONLY,
&btl_openib_cpc_exclude);
free(string);
/* Parse the cpc_[in|ex]clude parameters to come up with a list of
CPCs that are available */
/* If we have an "include" list, then find all those CPCs and put
them in available[] */
if (NULL != btl_openib_cpc_include) {
mca_btl_openib_component.cpc_explicitly_defined = true;
temp = opal_argv_split(btl_openib_cpc_include, ',');
for (save = j = 0; NULL != temp[j]; ++j) {
for (i = 0; NULL != all[i]; ++i) {
if (0 == strcmp(temp[j], all[i]->cbc_name)) {
opal_output(-1, "include: saving %s", all[i]->cbc_name);
available[save++] = all[i];
++num_available;
break;
}
}
if (NULL == all[i]) {
opal_show_help("help-mpi-btl-openib-cpc-base.txt",
"cpc name not found", true,
"include", opal_process_info.nodename,
"include", btl_openib_cpc_include, temp[j],
all_cpc_names);
opal_argv_free(temp);
free(all_cpc_names);
return OPAL_ERR_NOT_FOUND;
}
}
opal_argv_free(temp);
}
/* Otherwise, if we have an "exclude" list, take all the CPCs that
are not in that list and put them in available[] */
else if (NULL != btl_openib_cpc_exclude) {
mca_btl_openib_component.cpc_explicitly_defined = true;
temp = opal_argv_split(btl_openib_cpc_exclude, ',');
/* First: error check -- ensure that all the names are valid */
for (j = 0; NULL != temp[j]; ++j) {
for (i = 0; NULL != all[i]; ++i) {
if (0 == strcmp(temp[j], all[i]->cbc_name)) {
break;
}
}
if (NULL == all[i]) {
opal_show_help("help-mpi-btl-openib-cpc-base.txt",
"cpc name not found", true,
"exclude", opal_process_info.nodename,
"exclude", btl_openib_cpc_exclude, temp[j],
all_cpc_names);
opal_argv_free(temp);
free(all_cpc_names);
return OPAL_ERR_NOT_FOUND;
}
}
/* Now do the exclude */
for (save = i = 0; NULL != all[i]; ++i) {
for (j = 0; NULL != temp[j]; ++j) {
if (0 == strcmp(temp[j], all[i]->cbc_name)) {
break;
}
}
if (NULL == temp[j]) {
opal_output(-1, "exclude: saving %s", all[i]->cbc_name);
available[save++] = all[i];
++num_available;
}
}
opal_argv_free(temp);
}
/* If there's no include/exclude list, copy all[] into available[] */
else {
opal_output(-1, "no include or exclude: saving all");
memcpy(available, all, sizeof(all));
num_available = (sizeof(all) /
sizeof(opal_btl_openib_connect_base_module_t *)) - 1;
}
/* Call the register function on all the CPCs so that they may
setup any MCA params specific to the connection type */
for (i = 0; NULL != available[i]; ++i) {
if (NULL != available[i]->cbc_register) {
available[i]->cbc_register();
}
}
free (all_cpc_names);
return OPAL_SUCCESS;
}
/*
* Called once during openib BTL component initialization to allow CPC
* components to initialize.
*/
int opal_btl_openib_connect_base_init(void)
{
int i, rc;
/* Call each available CPC component's open function, if it has
one. If the CPC component open function returns OPAL_SUCCESS,
keep it. If it returns ERR_NOT_SUPPORTED, remove it from the
available[] array. If it returns something else, return that
error upward. */
for (i = num_available = 0; NULL != available[i]; ++i) {
if (NULL == available[i]->cbc_init) {
available[num_available++] = available[i];
opal_output(-1, "found available cpc (NULL init): %s",
available[i]->cbc_name);
continue;
}
rc = available[i]->cbc_init();
if (OPAL_SUCCESS == rc) {
available[num_available++] = available[i];
opal_output(-1, "found available cpc (SUCCESS init): %s",
available[i]->cbc_name);
continue;
} else if (OPAL_ERR_NOT_SUPPORTED == rc) {
continue;
} else {
return rc;
}
}
available[num_available] = NULL;
return (num_available > 0) ? OPAL_SUCCESS : OPAL_ERR_NOT_AVAILABLE;
}
/*
* Find all the CPCs that are eligible for a single local port (i.e.,
* openib module).
*/
int opal_btl_openib_connect_base_select_for_local_port(mca_btl_openib_module_t *btl)
{
char *msg = NULL;
int i, rc, cpc_index, len;
opal_btl_openib_connect_base_module_t **cpcs;
cpcs = (opal_btl_openib_connect_base_module_t **) calloc(num_available,
sizeof(opal_btl_openib_connect_base_module_t *));
if (NULL == cpcs) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* Go through all available CPCs and query them to see if they
want to run on this module. If they do, save them to a running
array. */
for (len = 1, i = 0; NULL != available[i]; ++i) {
len += strlen(available[i]->cbc_name) + 2;
}
msg = (char *) malloc(len);
if (NULL == msg) {
free(cpcs);
return OPAL_ERR_OUT_OF_RESOURCE;
}
msg[0] = '\0';
for (cpc_index = i = 0; NULL != available[i]; ++i) {
if (i > 0) {
strcat(msg, ", ");
}
strcat(msg, available[i]->cbc_name);
rc = available[i]->cbc_query(btl, &cpcs[cpc_index]);
if (OPAL_ERR_NOT_SUPPORTED == rc || OPAL_ERR_UNREACH == rc) {
continue;
} else if (OPAL_SUCCESS != rc) {
free(cpcs);
free(msg);
return rc;
}
opal_output(-1, "match cpc for local port: %s",
available[i]->cbc_name);
/* If the CPC wants to use the CTS protocol, check to ensure
that QP 0 is PP; if it's not, we can't use this CPC (or the
CTS protocol) */
if (cpcs[cpc_index]->cbm_uses_cts &&
!BTL_OPENIB_QP_TYPE_PP(0)) {
BTL_VERBOSE(("this CPC is only supported when the first btl_openib_receive_queues QP is a PP QP"));
continue;
}
/* This CPC has indicated that it wants to run on this openib
BTL module. Woo hoo! */
++cpc_index;
}
/* If we got an empty array, then no CPCs were eligible. Doh! */
if (0 == cpc_index) {
opal_show_help("help-mpi-btl-openib-cpc-base.txt",
"no cpcs for port", true,
opal_process_info.nodename,
ibv_get_device_name(btl->device->ib_dev),
btl->port_num, msg);
free(cpcs);
free(msg);
return OPAL_ERR_NOT_SUPPORTED;
}
free(msg);
/* We got at least one eligible CPC; save the array into the
module's port_info */
btl->cpcs = cpcs;
btl->num_cpcs = cpc_index;
return OPAL_SUCCESS;
}
/*
* This function is invoked when determining whether we have a CPC in
* common with a specific remote port. We already know that the
* subnet ID is the same between a specific local port and the target
* remote port; now we need to know if we can find a CPC in common
* between the two.
*
* If yes, be sure to find the *same* CPC on both sides. We know
* which CPCs are available on each side, and we know the priorities
* that were assigned on both sides. So find a CPC that is common to
* both sides and has the highest overall priority (between both
* sides).
*
* Return OPAL_SUCCESS and pass back the matching CPC through the output
* arguments, or return OPAL_ERR_NOT_FOUND if there is no match.
*/
int
opal_btl_openib_connect_base_find_match(mca_btl_openib_module_t *btl,
mca_btl_openib_proc_modex_t *peer_port,
opal_btl_openib_connect_base_module_t **ret_local_cpc,
opal_btl_openib_connect_base_module_data_t **ret_remote_cpc_data)
{
int i, j, max = -1;
opal_btl_openib_connect_base_module_t *local_cpc, *local_selected = NULL;
opal_btl_openib_connect_base_module_data_t *local_cpcd, *remote_cpcd,
*remote_selected = NULL;
/* Iterate over all the CPCs on the local module */
for (i = 0; i < btl->num_cpcs; ++i) {
local_cpc = btl->cpcs[i];
local_cpcd = &(local_cpc->data);
/* Iterate over all the CPCs on the remote port */
for (j = 0; j < peer_port->pm_cpc_data_count; ++j) {
remote_cpcd = &(peer_port->pm_cpc_data[j]);
/* Are the components the same? */
if (local_cpcd->cbm_component == remote_cpcd->cbm_component) {
/* If so, update the max priority found so far */
if (max < local_cpcd->cbm_priority) {
max = local_cpcd->cbm_priority;
local_selected = local_cpc;
remote_selected = remote_cpcd;
}
if (max < remote_cpcd->cbm_priority) {
max = remote_cpcd->cbm_priority;
local_selected = local_cpc;
remote_selected = remote_cpcd;
}
}
}
}
/* All done! */
if (NULL != local_selected) {
*ret_local_cpc = local_selected;
*ret_remote_cpc_data = remote_selected;
opal_output(-1, "find_match: found match!");
return OPAL_SUCCESS;
} else {
opal_output(-1, "find_match: did NOT find match!");
return OPAL_ERR_NOT_FOUND;
}
}
/*
* Look up a CPC component's index in the all[] array so that we can
* send it in the modex
*/
int opal_btl_openib_connect_base_get_cpc_index(opal_btl_openib_connect_base_component_t *cpc)
{
int i;
for (i = 0; NULL != all[i]; ++i) {
if (all[i] == cpc) {
return i;
}
}
/* Not found */
return -1;
}
/*
* Lookup a CPC by its index (received from the modex)
*/
opal_btl_openib_connect_base_component_t *
opal_btl_openib_connect_base_get_cpc_byindex(uint8_t index)
{
return (index >= (sizeof(all) /
sizeof(opal_btl_openib_connect_base_module_t *))) ?
NULL : all[index];
}
int opal_btl_openib_connect_base_alloc_cts(mca_btl_base_endpoint_t *endpoint)
{
opal_free_list_item_t *fli;
int length = sizeof(mca_btl_openib_header_t) +
sizeof(mca_btl_openib_header_coalesced_t) +
sizeof(mca_btl_openib_control_header_t) +
sizeof(mca_btl_openib_footer_t) +
mca_btl_openib_component.qp_infos[mca_btl_openib_component.credits_qp].size;
int align_it = 0;
int page_size;
page_size = opal_getpagesize();
if (length >= page_size / 2) { align_it = 1; }
if (align_it) {
// I think this is only active for ~64k+ buffers anyway, but I'm not
// positive, so I'm only increasing the buffer size and alignment if
// it's not too small. That way we'd avoid wasting excessive memory
// in case this code was active for tiny buffers.
length = OPAL_ALIGN(length, page_size, int);
}
/* Explicitly don't use the mpool registration */
fli = &(endpoint->endpoint_cts_frag.super.super.base.super);
fli->registration = NULL;
if (!align_it) {
fli->ptr = malloc(length);
} else if (0 != posix_memalign((void**)&(fli->ptr), page_size, length)) {
/* posix_memalign leaves the pointer unspecified on failure */
fli->ptr = NULL;
}
if (NULL == fli->ptr) {
BTL_ERROR(("malloc failed"));
return OPAL_ERR_OUT_OF_RESOURCE;
}
endpoint->endpoint_cts_mr =
ibv_reg_mr(endpoint->endpoint_btl->device->ib_pd,
fli->ptr, length,
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ);
OPAL_OUTPUT((-1, "registered memory %p, length %d", fli->ptr, length));
if (NULL == endpoint->endpoint_cts_mr) {
free(fli->ptr);
BTL_ERROR(("Failed to reg mr!"));
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* NOTE: We do not need to register this memory with the
opal_memory subsystem, because this is OMPI-controlled memory
-- we do not need to worry about this memory being freed out
from underneath us. */
/* Copy the lkey where it needs to go */
endpoint->endpoint_cts_frag.super.sg_entry.lkey =
endpoint->endpoint_cts_mr->lkey;
endpoint->endpoint_cts_frag.super.sg_entry.length = length;
/* Construct the rest of the recv_frag_t */
OBJ_CONSTRUCT(&(endpoint->endpoint_cts_frag), mca_btl_openib_recv_frag_t);
endpoint->endpoint_cts_frag.super.super.base.order =
mca_btl_openib_component.credits_qp;
endpoint->endpoint_cts_frag.super.endpoint = endpoint;
OPAL_OUTPUT((-1, "Got a CTS frag for peer %s, addr %p, length %d, lkey %d",
opal_get_proc_hostname(endpoint->endpoint_proc->proc_opal),
(void*) endpoint->endpoint_cts_frag.super.sg_entry.addr,
endpoint->endpoint_cts_frag.super.sg_entry.length,
endpoint->endpoint_cts_frag.super.sg_entry.lkey));
return OPAL_SUCCESS;
}
int opal_btl_openib_connect_base_free_cts(mca_btl_base_endpoint_t *endpoint)
{
/* NOTE: We don't need to deregister this memory with opal_memory
because it was not registered there in the first place (see
comment above, near call to ibv_reg_mr). */
if (NULL != endpoint->endpoint_cts_mr) {
ibv_dereg_mr(endpoint->endpoint_cts_mr);
endpoint->endpoint_cts_mr = NULL;
}
if (NULL != endpoint->endpoint_cts_frag.super.super.base.super.ptr) {
free(endpoint->endpoint_cts_frag.super.super.base.super.ptr);
endpoint->endpoint_cts_frag.super.super.base.super.ptr = NULL;
OPAL_OUTPUT((-1, "Freeing CTS frag"));
}
return OPAL_SUCCESS;
}
/*
* Called to start a connection
*/
int opal_btl_openib_connect_base_start(
opal_btl_openib_connect_base_module_t *cpc,
mca_btl_base_endpoint_t *endpoint)
{
/* If the CPC uses the CTS protocol, provide a frag buffer for the
CPC to post. Must allocate these frags up here in the main
thread because the FREE_LIST_WAIT is not thread safe. */
if (cpc->cbm_uses_cts) {
int rc;
rc = opal_btl_openib_connect_base_alloc_cts(endpoint);
if (OPAL_SUCCESS != rc) {
return rc;
}
}
return cpc->cbm_start_connect(cpc, endpoint);
}
/*
* Called during openib btl component close
*/
void opal_btl_openib_connect_base_finalize(void)
{
int i;
for (i = 0 ; i < num_available ; ++i) {
if (NULL != available[i]->cbc_finalize) {
available[i]->cbc_finalize();
}
}
}
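
The start wrapper above allocates the CTS buffer before handing control to the CPC, and the free routine resets its pointers so it can be called more than once. The following is a minimal sketch of that pattern, not the removed implementation: every `cts_*` name, the status codes, and the 64-byte buffer size are hypothetical stand-ins, and the real code additionally registers the buffer with ibv_reg_mr.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical status codes; the real code uses OPAL_SUCCESS and friends. */
enum { CTS_OK = 0, CTS_ERR_NO_MEM = -1 };

/* Hypothetical, heavily reduced endpoint: only the CTS buffer matters here. */
typedef struct cts_endpoint {
    void *cts_buf;      /* stands in for the endpoint_cts_frag buffer */
    int   connecting;   /* set by the connect callback */
} cts_endpoint_t;

/* Hypothetical, reduced CPC module: a flag plus one function pointer. */
typedef struct cts_module {
    int uses_cts;
    int (*start_connect)(struct cts_module *cpc, cts_endpoint_t *ep);
} cts_module_t;

/* Allocate the CTS buffer (assumed size; no memory registration here). */
int cts_alloc(cts_endpoint_t *ep)
{
    assert(NULL != ep);
    ep->cts_buf = calloc(1, 64);
    return (NULL != ep->cts_buf) ? CTS_OK : CTS_ERR_NO_MEM;
}

/* Release the CTS buffer; resetting the pointer makes a second call harmless. */
void cts_free(cts_endpoint_t *ep)
{
    assert(NULL != ep);
    free(ep->cts_buf);
    ep->cts_buf = NULL;
}

/* Trivial connect callback, used for illustration only. */
int cts_connect_ok(cts_module_t *cpc, cts_endpoint_t *ep)
{
    (void) cpc;
    ep->connecting = 1;
    return CTS_OK;
}

/* Allocate first (only if the module needs it), then delegate to the CPC. */
int cts_start(cts_module_t *cpc, cts_endpoint_t *ep)
{
    if (cpc->uses_cts) {
        int rc = cts_alloc(ep);
        if (CTS_OK != rc) {
            return rc;   /* never call into the CPC without its buffer */
        }
    }
    return cpc->start_connect(cpc, ep);
}
```

A caller would build a `cts_module_t` with `uses_cts` set, call `cts_start()`, and later call `cts_free()` during endpoint teardown. Doing the allocation in the wrapper keeps it on the calling thread, which is the reason the original comment gives for allocating "up here".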


@ -1,46 +0,0 @@
/*
* Copyright (c) 2008 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "opal_config.h"
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
#include "connect/connect.h"
static void empty_component_register(void);
static int empty_component_init(void);
static int empty_component_query(mca_btl_openib_module_t *btl,
opal_btl_openib_connect_base_module_t **cpc);
opal_btl_openib_connect_base_component_t opal_btl_openib_connect_empty = {
"empty",
empty_component_register,
empty_component_init,
empty_component_query,
NULL
};
static void empty_component_register(void)
{
/* Nothing to do */
}
static int empty_component_init(void)
{
/* Never let this CPC run */
return OPAL_ERR_NOT_SUPPORTED;
}
static int empty_component_query(mca_btl_openib_module_t *btl,
opal_btl_openib_connect_base_module_t **cpc)
{
/* Never let this CPC run */
return OPAL_ERR_NOT_SUPPORTED;
}
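
The "empty" component above always answers OPAL_ERR_NOT_SUPPORTED, so it is a useful test case for the component init pass: a component that declines is dropped silently, while any other error is passed up. A minimal sketch of such a filter follows. The `init_*` names and status values are hypothetical, and treating a NULL init function as "available" is an assumption based only on the interface note that the init pointer may be NULL.

```c
#include <stddef.h>

/* Hypothetical status codes standing in for the OPAL_* values. */
enum { INIT_OK = 0, INIT_ERR_FATAL = -1, INIT_ERR_NOT_SUPPORTED = -2 };

/* Hypothetical, reduced component: a name and an optional init function. */
typedef struct init_component {
    const char *name;
    int (*init)(void);   /* may be NULL (assumed to mean "nothing to do") */
} init_component_t;

/* Behaves like the "empty" CPC: never lets itself be used. */
int init_never(void) { return INIT_ERR_NOT_SUPPORTED; }

/* A component that is always willing to run. */
int init_always(void) { return INIT_OK; }

/*
 * Copy every usable component from `all` into `available`.
 * Returns the number of usable components, or a negative status if a
 * component reported an error other than "not supported".
 */
int init_filter(init_component_t **all, size_t n,
                init_component_t **available)
{
    int count = 0;
    for (size_t i = 0; i < n; ++i) {
        int rc = (NULL == all[i]->init) ? INIT_OK : all[i]->init();
        if (INIT_ERR_NOT_SUPPORTED == rc) {
            continue;            /* gracefully skipped, not an error */
        }
        if (INIT_OK != rc) {
            return rc;           /* propagate real failures upwards */
        }
        available[count++] = all[i];
    }
    return count;
}
```

The caller must size `available` for at least `n` entries. The resulting array plays the role of the `available[]` and `num_available` pair that the finalize loop walks.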


@ -1,20 +0,0 @@
/*
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_EMPTY_H
#define BTL_OPENIB_CONNECT_EMPTY_H
#include "opal_config.h"
#include "connect/connect.h"
extern opal_btl_openib_connect_base_component_t opal_btl_openib_connect_empty;
#endif

File diff suppressed because it is too large.


@ -1,20 +0,0 @@
/*
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_RDMACM_H
#define BTL_OPENIB_CONNECT_RDMACM_H
#include "opal_config.h"
#include "connect/connect.h"
extern opal_btl_openib_connect_base_component_t opal_btl_openib_connect_rdmacm;
#endif


@ -1,469 +0,0 @@
/*
* Copyright (c) 2011 Mellanox Technologies. All rights reserved.
*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2014 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "btl_openib.h"
#include "opal/util/show_help.h"
#include "opal/util/sys_limits.h"
#include "opal/util/proc.h"
#include "connect/btl_openib_connect_sl.h"
#include <infiniband/iba/ib_types.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#define SL_NOT_PRESENT 0xFF
#define MAX_GET_SL_REC_RETRIES 20
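/* NOTE: despite the _MS suffix, this value is compared against a
   microsecond delta in get_pathrecord_info(): 2000000 us == 2 s */
#define GET_SL_REC_RETRIES_TIMEOUT_MS 2000000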
static struct mca_btl_openib_sa_qp_cache {
/* There will be a MR with the one send and receive buffer together */
/* The send buffer is first, the receive buffer is second */
/* The receive buffer in a UD queue pair needs room for the 40 byte GRH */
/* The buffers are first in the structure for page alignment */
char send_recv_buffer[MAD_BLOCK_SIZE * 2 + 40];
struct mca_btl_openib_sa_qp_cache *next;
struct ibv_context *context;
char *device_name;
uint32_t port_num;
struct ibv_qp *qp;
struct ibv_ah *ah;
struct ibv_cq *cq;
struct ibv_mr *mr;
struct ibv_pd *pd;
struct ibv_recv_wr rwr;
struct ibv_sge rsge;
uint8_t sl_values[65536]; /* 64K */
} *sa_qp_cache = 0;
static int init_ud_qp(
struct ibv_context *context_arg,
struct mca_btl_openib_sa_qp_cache *cache);
static void init_sa_mad(
struct mca_btl_openib_sa_qp_cache *cache,
ib_sa_mad_t *sa_mad,
struct ibv_send_wr *swr,
struct ibv_sge *ssge,
uint16_t lid,
uint16_t rem_lid);
static int get_pathrecord_info(
struct mca_btl_openib_sa_qp_cache *cache,
ib_sa_mad_t *sa_mad,
ib_sa_mad_t *sar,
struct ibv_send_wr *swr,
uint16_t lid,
uint16_t rem_lid);
static int init_device(
struct ibv_context *context_arg,
struct mca_btl_openib_sa_qp_cache *cache,
uint32_t port_num);
/*=================================================================*/
static void free_sa_qp_cache(void)
{
struct mca_btl_openib_sa_qp_cache *cache, *tmp;
cache = sa_qp_cache;
while (NULL != cache) {
/* free cache data */
if (cache->device_name)
free(cache->device_name);
if (NULL != cache->qp)
ibv_destroy_qp(cache->qp);
if (NULL != cache->ah)
ibv_destroy_ah(cache->ah);
if (NULL != cache->cq)
ibv_destroy_cq(cache->cq);
if (NULL != cache->mr)
ibv_dereg_mr(cache->mr);
if (NULL != cache->pd)
ibv_dealloc_pd(cache->pd);
tmp = cache->next;
free(cache);
cache = tmp;
}
sa_qp_cache = NULL;
}
/*=================================================================*/
static int init_ud_qp(struct ibv_context *context_arg,
struct mca_btl_openib_sa_qp_cache *cache)
{
struct ibv_qp_init_attr iattr;
struct ibv_qp_attr mattr;
int rc;
/* create cq */
cache->cq = ibv_create_cq(cache->context, 4, NULL, NULL, 0);
if (NULL == cache->cq) {
BTL_ERROR(("error creating cq, errno says %s", strerror(errno)));
opal_show_help("help-mpi-btl-openib.txt", "init-fail-create-q",
true, opal_process_info.nodename,
__FILE__, __LINE__, "ibv_create_cq",
strerror(errno), errno,
ibv_get_device_name(context_arg->device));
return OPAL_ERROR;
}
/* create qp */
memset(&iattr, 0, sizeof(iattr));
iattr.send_cq = cache->cq;
iattr.recv_cq = cache->cq;
iattr.cap.max_send_wr = 1;
iattr.cap.max_recv_wr = 1;
iattr.cap.max_send_sge = 1;
iattr.cap.max_recv_sge = 1;
iattr.qp_type = IBV_QPT_UD;
cache->qp = ibv_create_qp(cache->pd, &iattr);
if (NULL == cache->qp) {
BTL_ERROR(("error creating qp %s (%d)", strerror(errno), errno));
return OPAL_ERROR;
}
/* modify qp to IBV_QPS_INIT */
memset(&mattr, 0, sizeof(mattr));
mattr.qp_state = IBV_QPS_INIT;
mattr.port_num = cache->port_num;
mattr.qkey = ntohl(IB_QP1_WELL_KNOWN_Q_KEY);
rc = ibv_modify_qp(cache->qp, &mattr,
IBV_QP_STATE |
IBV_QP_PKEY_INDEX |
IBV_QP_PORT |
IBV_QP_QKEY);
if (rc) {
BTL_ERROR(("Error modifying QP[%x] to IBV_QPS_INIT errno says: %s [%d]",
cache->qp->qp_num, strerror(errno), errno));
return OPAL_ERROR;
}
/* modify qp to IBV_QPS_RTR */
memset(&mattr, 0, sizeof(mattr));
mattr.qp_state = IBV_QPS_RTR;
rc = ibv_modify_qp(cache->qp, &mattr, IBV_QP_STATE);
if (rc) {
BTL_ERROR(("Error modifying QP[%x] to IBV_QPS_RTR errno says: %s [%d]",
cache->qp->qp_num, strerror(errno), errno));
return OPAL_ERROR;
}
/* modify qp to IBV_QPS_RTS */
mattr.qp_state = IBV_QPS_RTS;
rc = ibv_modify_qp(cache->qp, &mattr, IBV_QP_STATE | IBV_QP_SQ_PSN);
if (rc) {
BTL_ERROR(("Error modifying QP[%x] to IBV_QPS_RTS errno says: %s [%d]",
cache->qp->qp_num, strerror(errno), errno));
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}
/*=================================================================*/
static void init_sa_mad(struct mca_btl_openib_sa_qp_cache *cache,
ib_sa_mad_t *sa_mad,
struct ibv_send_wr *swr,
struct ibv_sge *ssge,
uint16_t lid,
uint16_t rem_lid)
{
ib_path_rec_t *path_record = (ib_path_rec_t*)sa_mad->data;
memset(swr, 0, sizeof(*swr));
memset(ssge, 0, sizeof(*ssge));
/* Initialize the standard MAD header. */
memset(sa_mad, 0, MAD_BLOCK_SIZE);
ib_mad_init_new((ib_mad_t *)sa_mad, /* mad header pointer */
IB_MCLASS_SUBN_ADM, /* management class */
(uint8_t) 2, /* version */
IB_MAD_METHOD_GET, /* method */
hton64((uint64_t)lid << 48 | /* transaction ID */
(uint64_t)rem_lid << 32 |
(uint64_t)cache->qp->qp_num << 8),
IB_MAD_ATTR_PATH_RECORD, /* attribute ID */
0); /* attribute modifier */
sa_mad->comp_mask = IB_PR_COMPMASK_DLID | IB_PR_COMPMASK_SLID;
path_record->dlid = htons(rem_lid);
path_record->slid = htons(lid);
swr->sg_list = ssge;
swr->num_sge = 1;
swr->opcode = IBV_WR_SEND;
swr->wr.ud.ah = cache->ah;
swr->wr.ud.remote_qpn = ntohl(IB_QP1);
swr->wr.ud.remote_qkey = ntohl(IB_QP1_WELL_KNOWN_Q_KEY);
swr->send_flags = IBV_SEND_SIGNALED | IBV_SEND_SOLICITED;
ssge->addr = (uint64_t)(void *)sa_mad;
ssge->length = MAD_BLOCK_SIZE;
ssge->lkey = cache->mr->lkey;
}
/*=================================================================*/
static int get_pathrecord_info(struct mca_btl_openib_sa_qp_cache *cache,
ib_sa_mad_t *req_mad,
ib_sa_mad_t *resp_mad,
struct ibv_send_wr *swr,
uint16_t lid,
uint16_t rem_lid)
{
struct ibv_send_wr *bswr;
struct ibv_wc wc;
struct timeval get_sl_rec_last_sent, get_sl_rec_last_poll;
struct ibv_recv_wr *brwr;
int got_sl_value, get_sl_rec_retries, rc, ne, i;
ib_path_rec_t *req_path_record = ib_sa_mad_get_payload_ptr(req_mad);
ib_path_rec_t *resp_path_record = ib_sa_mad_get_payload_ptr(resp_mad);
got_sl_value = 0;
get_sl_rec_retries = 0;
rc = ibv_post_recv(cache->qp, &(cache->rwr), &brwr);
if (0 != rc) {
BTL_ERROR(("error posting receive on QP [0x%x] rc says: %s [%d]",
cache->qp->qp_num, strerror(rc), rc));
return OPAL_ERROR;
}
while (0 == got_sl_value) {
rc = ibv_post_send(cache->qp, swr, &bswr);
if (0 != rc) {
BTL_ERROR(("error posting send on QP [0x%x] rc says: %s [%d]",
cache->qp->qp_num, strerror(rc), rc));
return OPAL_ERROR;
}
gettimeofday(&get_sl_rec_last_sent, NULL);
while (0 == got_sl_value) {
ne = ibv_poll_cq(cache->cq, 1, &wc);
if (ne > 0 && IBV_WC_RECV == wc.opcode) {
/* We only care about the status of receive work requests. */
/* If the status of the send work request was anything other */
/* than success, we'll eventually retransmit, so ignore them. */
if (0 == resp_mad->status &&
req_path_record->slid == htons(lid) &&
req_path_record->dlid == htons(rem_lid) &&
IBV_WC_SUCCESS == wc.status &&
wc.byte_len >= MAD_BLOCK_SIZE &&
resp_mad->trans_id == req_mad->trans_id) {
/* Everything matches, so we have the desired SL */
cache->sl_values[rem_lid] = ib_path_rec_sl(resp_path_record);
got_sl_value = 1;
break;
}
/* Probably bad status, unlikely bad lid match. We will */
/* ignore response and let it time out so that we do a */
/* retry, but after a delay. Need to repost receive WR. */
rc = ibv_post_recv(cache->qp, &(cache->rwr), &brwr);
if (0 != rc) {
BTL_ERROR(("error posting receive on QP[%x] rc says: %s [%d]",
cache->qp->qp_num, strerror(rc), rc));
return OPAL_ERROR;
}
} else if (0 == ne) { /* poll did not find anything */
gettimeofday(&get_sl_rec_last_poll, NULL);
i = get_sl_rec_last_poll.tv_sec - get_sl_rec_last_sent.tv_sec;
i = (i * 1000000) +
get_sl_rec_last_poll.tv_usec - get_sl_rec_last_sent.tv_usec;
if (i > GET_SL_REC_RETRIES_TIMEOUT_MS) {
get_sl_rec_retries++;
BTL_VERBOSE(("[%d/%d] retries to get PathRecord",
get_sl_rec_retries, MAX_GET_SL_REC_RETRIES));
if (get_sl_rec_retries > MAX_GET_SL_REC_RETRIES) {
BTL_ERROR(("No response from SA after %d retries",
MAX_GET_SL_REC_RETRIES));
return OPAL_ERROR;
}
/* Need to retransmit request. We must make a new TID */
/* so the SM doesn't see it as the same request. */
req_mad->trans_id += hton64(1);
break;
}
usleep(100); /* otherwise pause before polling again */
} else if (ne < 0) {
BTL_ERROR(("error polling CQ returned %d\n", ne));
return OPAL_ERROR;
}
}
}
return 0;
}
/*=================================================================*/
static int init_device(struct ibv_context *context_arg,
struct mca_btl_openib_sa_qp_cache *cache,
uint32_t port_num)
{
struct ibv_ah_attr aattr;
struct ibv_port_attr pattr;
int rc;
cache->context = ibv_open_device(context_arg->device);
if (NULL == cache->context) {
BTL_ERROR(("error obtaining device context for %s errno says %s",
ibv_get_device_name(context_arg->device), strerror(errno)));
return OPAL_ERROR;
}
cache->device_name = strdup(ibv_get_device_name(cache->context->device));
cache->port_num = port_num;
/* init all sl_values to be SL_NOT_PRESENT */
memset(&cache->sl_values, SL_NOT_PRESENT, sizeof(cache->sl_values));
cache->next = sa_qp_cache;
sa_qp_cache = cache;
/* allocate the protection domain for the device */
cache->pd = ibv_alloc_pd(cache->context);
if (NULL == cache->pd) {
BTL_ERROR(("error allocating protection domain for %s errno says %s",
ibv_get_device_name(context_arg->device), strerror(errno)));
return OPAL_ERROR;
}
/* register memory region */
cache->mr = ibv_reg_mr(cache->pd, cache->send_recv_buffer,
sizeof(cache->send_recv_buffer),
IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_LOCAL_WRITE);
if (NULL == cache->mr) {
BTL_ERROR(("error registering memory region, errno says %s", strerror(errno)));
return OPAL_ERROR;
}
/* init the ud qp */
rc = init_ud_qp(context_arg, cache);
if (OPAL_ERROR == rc) {
return OPAL_ERROR;
}
rc = ibv_query_port(cache->context, cache->port_num, &pattr);
if (rc) {
BTL_ERROR(("error getting port attributes for device %s "
"port number %d errno says %s",
ibv_get_device_name(context_arg->device),
cache->port_num, strerror(errno)));
return OPAL_ERROR;
}
/* create address handle */
memset(&aattr, 0, sizeof(aattr));
aattr.dlid = pattr.sm_lid;
aattr.sl = pattr.sm_sl;
aattr.port_num = cache->port_num;
cache->ah = ibv_create_ah(cache->pd, &aattr);
if (NULL == cache->ah) {
BTL_ERROR(("error creating address handle: %s", strerror(errno)));
return OPAL_ERROR;
}
memset(&(cache->rwr), 0, sizeof(cache->rwr));
cache->rwr.num_sge = 1;
cache->rwr.sg_list = &(cache->rsge);
memset(&(cache->rsge), 0, sizeof(cache->rsge));
cache->rsge.addr = (uint64_t)(void *)
(cache->send_recv_buffer + MAD_BLOCK_SIZE);
cache->rsge.length = MAD_BLOCK_SIZE + 40;
cache->rsge.lkey = cache->mr->lkey;
return 0;
}
/*=================================================================*/
static int get_pathrecord_sl(struct ibv_context *context_arg,
uint32_t port_num,
uint16_t lid,
uint16_t rem_lid)
{
struct ibv_send_wr swr;
ib_sa_mad_t *req_mad, *resp_mad;
struct ibv_sge ssge;
struct mca_btl_openib_sa_qp_cache *cache;
size_t page_size = (size_t)opal_getpagesize();
int rc;
/* search for a cached item */
for (cache = sa_qp_cache; cache; cache = cache->next) {
if (0 == strcmp(cache->device_name,
ibv_get_device_name(context_arg->device))
&& cache->port_num == port_num) {
break;
}
}
if (NULL == cache) {
/* init new cache */
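if (posix_memalign((void **)(&cache), page_size,
sizeof(struct mca_btl_openib_sa_qp_cache))) {
BTL_ERROR(("error in posix_memalign SA cache"));
return OPAL_ERROR;
}
/* posix_memalign() does not zero memory; clear the entry so that
   free_sa_qp_cache() sees NULL for resources that were never created */
memset(cache, 0, sizeof(*cache));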
/* one time setup for each device/port combination */
rc = init_device(context_arg, cache, port_num);
if (0 != rc) {
return rc;
}
}
/* if the destination lid SL value is not in the cache, go get it */
if (SL_NOT_PRESENT == cache->sl_values[rem_lid]) {
/* req_mad is the first buffer, where we build the SA Get request to send */
req_mad = (ib_sa_mad_t *)(cache->send_recv_buffer);
init_sa_mad(cache, req_mad, &swr, &ssge, lid, rem_lid);
/* resp_mad is the receive buffer (40 byte offset is for GRH) */
resp_mad = (ib_sa_mad_t *)(cache->send_recv_buffer + MAD_BLOCK_SIZE + 40);
rc = get_pathrecord_info(cache, req_mad, resp_mad, &swr, lid, rem_lid);
if (0 != rc) {
return rc;
}
}
/* now all we do is return the cached value */
return cache->sl_values[rem_lid];
}
/*=================================================================*/
int btl_openib_connect_get_pathrecord_sl(struct ibv_context *context_arg,
uint32_t port_num,
uint16_t lid,
uint16_t rem_lid)
{
int rc = get_pathrecord_sl(context_arg, port_num, lid, rem_lid);
if (OPAL_ERROR == rc) {
free_sa_qp_cache();
}
return rc;
}
/*=================================================================*/
void btl_openib_connect_sl_finalize(void)
{
free_sa_qp_cache();
}
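
Two small calculations carry most of the retry logic in the SL lookup above: the 64-bit transaction ID built from both LIDs and the QP number, and the elapsed time between the last send and the last poll. The sketch below reproduces both in isolation. The `sl_*` names are hypothetical, and masking the QP number to 24 bits is an added safeguard (InfiniBand QP numbers are 24 bits wide) that the original code does not perform. The result is in host byte order; the original converts it with hton64() before placing it in the MAD header.

```c
#include <stdint.h>
#include <sys/time.h>

/*
 * Pack the local LID, remote LID and QP number into one 64-bit value,
 * using the same shifts as init_sa_mad(). The low 8 bits stay zero, which
 * leaves room for the per-retry increment of the transaction ID.
 */
uint64_t sl_make_tid(uint16_t lid, uint16_t rem_lid, uint32_t qp_num)
{
    return ((uint64_t) lid << 48) |
           ((uint64_t) rem_lid << 32) |
           ((uint64_t) (qp_num & 0xFFFFFFu) << 8);
}

/*
 * Microseconds elapsed between two gettimeofday() samples. A 64-bit
 * result avoids the overflow that an int would hit for long intervals.
 */
int64_t sl_elapsed_usec(const struct timeval *start,
                        const struct timeval *end)
{
    return (int64_t) (end->tv_sec - start->tv_sec) * 1000000 +
           (int64_t) (end->tv_usec - start->tv_usec);
}
```

With these helpers, the retransmit condition reads as `sl_elapsed_usec(&sent, &polled) > 2000000`, which makes it visible that the timeout constant is a microsecond count equal to two seconds.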


@ -1,26 +0,0 @@
/*
* Copyright (c) 2011 Mellanox Technologies. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_SL_H
#define BTL_OPENIB_CONNECT_SL_H
BEGIN_C_DECLS
int btl_openib_connect_get_pathrecord_sl(
struct ibv_context *context_arg,
uint32_t port_num,
uint16_t lid,
uint16_t rem_lid);
void btl_openib_connect_sl_finalize(void);
END_C_DECLS
#endif /* BTL_OPENIB_CONNECT_SL_H */

File diff suppressed because it is too large.


@ -1,22 +0,0 @@
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
* Copyright (c) 2011 Los Alamos National Security, LLC. All
* right reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_OPENIB_CONNECT_UD_H
#define BTL_OPENIB_CONNECT_UD_H
#include "opal_config.h"
#include "connect/connect.h"
extern opal_btl_openib_connect_base_component_t opal_btl_openib_connect_udcm;
#endif


@ -1,355 +0,0 @@
/*
* Copyright (c) 2007-2008 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
/**
* @file
*
* This interface is designed to hide the back-end details of how IB
* RC connections are made from the rest of the openib BTL. There are
* module-like instances of the implemented functionality (dlopen and
* friends are not used, but all the functionality is accessed through
* struct's of function pointers, so you can swap between multiple
* different implementations at run time, just like real components).
* Hence, these entities are referred to as "Connect
* Pseudo-Components" (CPCs).
*
* The CPCs are referenced by their names (e.g., "oob", "rdma_cm").
*
* CPCs are split into components and modules, similar to all other
* MCA frameworks in this code base.
*
* Before diving into the CPC interface, let's discuss some
* terminology and mappings of data structures:
*
* - a BTL module represents a network port (in the case of the openib
* BTL, a LID)
* - a CPC module represents one way to make connections to a BTL module
* - hence, a BTL module has potentially multiple CPC modules
* associated with it
* - an endpoint represents a connection between a local BTL module and
* a remote BTL module (in the openib BTL, because of BSRQ, an
* endpoint can contain multiple QPs)
* - when an endpoint is created, one of the CPC modules associated
* with the local BTL is selected and associated with the endpoint
* (obviously, it is a CPC module that is common between the local
* and remote BTL modules)
* - endpoints may be created and destroyed during the MPI job
* - endpoints are created lazily, during the first communication
* between two peers
* - endpoints are destroyed when two MPI processes become
* disconnected (e.g., MPI-2 dynamics or MPI_FINALIZE)
* - hence, BTL modules and CPC modules outlive endpoints.
* Specifically, BTL modules and CPC modules live from MPI_INIT to
* MPI_FINALIZE. endpoints come and go as MPI semantics demand it.
* - therefore, CPC modules need to cache information on endpoints that
* are specific to that connection.
*
* Component interface:
*
* - component_register(): The openib BTL's component_open() function
* calls the connect_base_register() function, which scans all
* compiled-in CPC's. If they have component_register() functions,
* they are called (component_register() functions are only allowed to
* register MCA parameters).
*
* NOTE: The connect_base_register() function will process the
* btl_openib_cpc_include and btl_openib_cpc_exclude MCA parameters
* and automatically include/exclude CPCs as relevant. If a CPC is
* excluded, none of its other interface functions will be invoked for
* the duration of the process.
*
* - component_init(): The openib BTL's component_init() function
* calls connect_base_init(), which will invoke this init function on
* each CPC to see if it wants to run at all. CPCs can gracefully
* remove themselves from consideration in this process by returning
* OPAL_ERR_NOT_SUPPORTED.
*
* - component_query(): The openib BTL's init_one_port() calls the
* connect_base_select_for_local_port() function, which, for each LID
* on that port, calls the component_query() function on every
* available CPC on that LID. This function is intended to see if a
* CPC can run on a specific openib BTL module (i.e., LID). If it
* can, the CPC is supposed to create a CPC module that is specific to
* that BTL/LID and return it. If it cannot, it should return
* OPAL_ERR_NOT_SUPPORTED and be gracefully skipped for this
* OpenFabrics port.
*
* - component_finalize(): The openib BTL's component_close() function
* calls connect_base_finalize(), which, in turn, calls the
* component_finalize() function on all available CPCs. Note that all
* CPC modules will have been finalized by this point; the CPC
* component_finalize() function is a chance for the CPC to clean up
* any component-specific resources.
*
* Module interface:
*
* cbm_component member: A pointer pointing to the single, global
* instance of the CPC component. This member is used for creating a
* unique index representing the modules' component so that it can be
* shared with remote peer processes.
*
* cbm_priority member: An integer between 0 and 100, inclusive,
* representing the priority of this CPC.
*
* cbm_modex_message member: A pointer to a blob buffer that will be
* included in the modex message for this port for this CPC (it is
* assumed that this blob is a) only understandable by the
* corresponding CPC in the peer process, and b) contains specific
* addressing/contact information for *this* port's CPC module).
*
* cbm_modex_message_len member: The length of the cbm_modex_message
* blob, in bytes.
*
* cbm_endpoint_init(): Called during endpoint creation, allowing a
* CPC module to cache information on the endpoint. A pointer to the
* endpoint's CPC module is already cached on the endpoint.
*
* cbm_start_connect(): initiate a connection to a remote peer. The
* CPC is responsible for setting itself up for asynchronous operation
* for progressing the outgoing connection request.
*
* cbm_endpoint_finalize(): Called during endpoint destruction,
* allowing the CPC module to destroy anything that it cached on the
* endpoint.
*
* cbm_finalize(): shut down all asynchronous handling and clean up
* any state that was set up for this CPC module/BTL. Some CPCs set up
* asynchronous support on a per-HCA/NIC basis (vs. per-port/LID). It
* is the responsibility of the CPC to figure out such issues (e.g.,
* via reference counting) -- there is no notification from the
* upper-level BTL about when an entire HCA/NIC is no longer being
* used. There is only this function, which tells when a specific
* CPC/BTL module is no longer being used.
*
* cbm_uses_cts: a bool that indicates whether the CPC will use the
* CTS protocol or not.
* - if true: the CPC will post the fragment on
* endpoint->endpoint_cts_frag as a receive buffer and will *not*
* call opal_btl_openib_post_recvs().
* - if false: the CPC will call opal_btl_openib_post_recvs() before
* calling opal_btl_openib_cpc_complete().
*
* There are two functions in the main openib BTL that the CPC may
* call:
*
* - opal_btl_openib_post_recvs(endpoint): once a QP is locally
* connected to the remote side (but we don't know if the remote side
* is connected to us yet), this function is invoked to post buffers
* on the QP, setup credits for the endpoint, etc. This function is
* *only* invoked if the CPC's cbm_uses_cts is false.
*
* - opal_btl_openib_cpc_complete(endpoint): once a CPC knows
* that a QP is connected on *both* sides, this function is invoked to
* tell the main openib BTL "ok, you can use this connection now."
* (e.g., the main openib BTL will either invoke the CTS protocol or
* start sending out fragments that were queued while the connection
* was establishing, etc.).
*/
#ifndef BTL_OPENIB_CONNECT_H
#define BTL_OPENIB_CONNECT_H
BEGIN_C_DECLS
#define BCF_MAX_NAME 64
/**
* Must forward declare these structs to avoid include file loops.
*/
struct mca_btl_openib_hca_t;
struct mca_btl_openib_module_t;
struct mca_btl_base_endpoint_t;
/**
* This struct is defined below
*/
struct opal_btl_openib_connect_base_module_t;
/************************************************************************/
/**
* Function to register MCA params in the connect functions. It
* returns no value, so it cannot fail.
*/
typedef void (*opal_btl_openib_connect_base_component_register_fn_t)(void);
/**
* This function is invoked once by the openib BTL component during
* startup. It is intended to have CPC component-wide startup.
*
* Return value:
*
* - OPAL_SUCCESS: this CPC component will be used in selection during
* this process.
*
* - OPAL_ERR_NOT_SUPPORTED: this CPC component will be silently
* ignored in this process.
*
* - Other OPAL_ERR_* values: the error will be propagated upwards,
* likely causing a fatal error (and/or the openib BTL component
* being ignored).
*/
typedef int (*opal_btl_openib_connect_base_component_init_fn_t)(void);
/**
* Query the CPC to see if it wants to run on a specific port (i.e., a
* specific BTL module). If the component init function previously
* returned OPAL_SUCCESS, this function is invoked once per BTL module
* creation (i.e., for each port found by an MPI process). If this
* CPC wants to be used on this BTL module, it returns a CPC module
* that is specific to this BTL module.
*
* The BTL module in question is passed to the function; all of its
* attributes can be used to query to see if it's eligible for this
* CPC.
*
* If it is eligible, the CPC is responsible for creating a
* corresponding CPC module, filling in all the relevant fields on the
* modules, and for setting itself up to run (per above) and returning
* a CPC module (this is effectively the "module_init" function).
* Note that the module priority must be between 0 and 100
* (inclusive). When multiple CPCs are eligible for a single module,
* the CPC with the highest priority will be used.
*
* Return value:
*
* - OPAL_SUCCESS if this CPC is eligible for and was able to be setup
* for this BTL module. It is assumed that the CPC is now completely
* setup to run on this openib module (per description above).
*
* - OPAL_ERR_NOT_SUPPORTED if this CPC cannot support this BTL
* module. This is not an error; it's just the CPC saying "sorry, I
* cannot support this BTL module."
*
* - Other OPAL_ERR_* code: an error occurred.
*/
typedef int (*opal_btl_openib_connect_base_func_component_query_t)
(struct mca_btl_openib_module_t *btl,
struct opal_btl_openib_connect_base_module_t **cpc);
/**
* This function is invoked once by the openib BTL component during
* shutdown. It is intended to have CPC component-wide shutdown.
*/
typedef int (*opal_btl_openib_connect_base_component_finalize_fn_t)(void);
/**
* CPC component struct
*/
struct opal_btl_openib_connect_base_component_t {
/** Name of this set of connection functions */
char cbc_name[BCF_MAX_NAME];
/** Register function. Can be NULL. */
opal_btl_openib_connect_base_component_register_fn_t cbc_register;
/** CPC component init function. Can be NULL. */
opal_btl_openib_connect_base_component_init_fn_t cbc_init;
/** Query the CPC component to get a CPC module corresponding to
an openib BTL module. Cannot be NULL. */
opal_btl_openib_connect_base_func_component_query_t cbc_query;
/** CPC component finalize function. Can be NULL. */
opal_btl_openib_connect_base_component_finalize_fn_t cbc_finalize;
};
/**
* Convenience typedef
*/
typedef struct opal_btl_openib_connect_base_component_t opal_btl_openib_connect_base_component_t;
/************************************************************************/
/**
* Function called when an endpoint has been created and has been
* associated with a CPC.
*/
typedef int (*opal_btl_openib_connect_base_module_endpoint_init_fn_t)
(struct mca_btl_base_endpoint_t *endpoint);
/**
* Function to initiate a connection to a remote process.
*/
typedef int (*opal_btl_openib_connect_base_module_start_connect_fn_t)
(struct opal_btl_openib_connect_base_module_t *cpc,
struct mca_btl_base_endpoint_t *endpoint);
/**
* Function called when an endpoint is being destroyed.
*/
typedef int (*opal_btl_openib_connect_base_module_endpoint_finalize_fn_t)
(struct mca_btl_base_endpoint_t *endpoint);
/**
* Function to finalize the CPC module. It is called once when the
* CPC module's corresponding openib BTL module is being finalized.
*/
typedef int (*opal_btl_openib_connect_base_module_finalize_fn_t)
(struct mca_btl_openib_module_t *btl,
struct opal_btl_openib_connect_base_module_t *cpc);
/**
* Meta data about a CPC module. This is in a standalone struct
* because it is used in both the CPC module struct and the
* openib_btl_proc_t struct to hold information received from the
* modex.
*/
typedef struct opal_btl_openib_connect_base_module_data_t {
/** Pointer back to the component. Used by the base and openib
btl to calculate this module's index for the modex. */
opal_btl_openib_connect_base_component_t *cbm_component;
/** Priority of the CPC module (must be >=0 and <=100) */
uint8_t cbm_priority;
/** Blob that the CPC wants to include in the openib modex message
for a specific port, or NULL if the CPC does not want to
include a message in the modex. */
void *cbm_modex_message;
/** Length of the cbm_modex_message blob (0 if
cbm_modex_message==NULL). The message is intended to be short
(because the size of the modex broadcast is a function of
sum(cbm_modex_message_len[i]) for
i=(0...total_num_ports_in_MPI_job) -- e.g., IBCM imposes its
own [very short] limits (per IBTA volume 1, chapter 12)). */
uint8_t cbm_modex_message_len;
} opal_btl_openib_connect_base_module_data_t;
/**
* Struct for holding CPC module and associated meta data
*/
typedef struct opal_btl_openib_connect_base_module_t {
/** Meta data about the module */
opal_btl_openib_connect_base_module_data_t data;
/** Endpoint initialization function */
opal_btl_openib_connect_base_module_endpoint_init_fn_t cbm_endpoint_init;
/** Connect function */
opal_btl_openib_connect_base_module_start_connect_fn_t cbm_start_connect;
/** Endpoint finalization function */
opal_btl_openib_connect_base_module_endpoint_finalize_fn_t cbm_endpoint_finalize;
/** Finalize the cpc module */
opal_btl_openib_connect_base_module_finalize_fn_t cbm_finalize;
/** Whether this module will use the CTS protocol or not. This
directly states whether this module will call
mca_btl_openib_endpoint_post_recvs() or not: true = this
module will *not* call _post_recvs() and instead will post the
receive buffer provided at endpoint->endpoint_cts_frag on qp
0. */
bool cbm_uses_cts;
} opal_btl_openib_connect_base_module_t;
END_C_DECLS
#endif
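
The header above states that each CPC is queried per BTL module, that OPAL_ERR_NOT_SUPPORTED means "skip me", and that the eligible CPC with the highest priority (0 to 100) is used. A minimal sketch of that local selection step follows. All `sel_*` names and status codes are hypothetical, the "port" is reduced to an integer, rejecting priorities above 100 and letting the first of two equal priorities win are assumptions, and the negotiation with the remote peer over a common CPC is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes standing in for the OPAL_* values. */
enum { SEL_OK = 0, SEL_ERR_NOT_SUPPORTED = -1, SEL_ERR_BAD_PRIORITY = -2 };

/* Query callback: reports whether the CPC can serve `port` and at what priority. */
typedef int (*sel_query_fn)(int port, uint8_t *priority_out);

/* Hypothetical, reduced CPC component. */
typedef struct sel_component {
    const char  *name;
    sel_query_fn query;   /* must not be NULL, as with cbc_query */
} sel_component_t;

/* Example CPC that works everywhere with a low priority. */
int sel_query_low(int port, uint8_t *priority_out)
{
    (void) port;
    *priority_out = 30;
    return SEL_OK;
}

/* Example CPC with a higher priority that cannot serve port 2. */
int sel_query_picky(int port, uint8_t *priority_out)
{
    if (2 == port) {
        return SEL_ERR_NOT_SUPPORTED;
    }
    *priority_out = 60;
    return SEL_OK;
}

/* Pick the eligible component with the highest priority for `port`. */
int sel_pick(const sel_component_t *cpcs, size_t n, int port,
             const sel_component_t **winner)
{
    int best = -1;
    *winner = NULL;
    for (size_t i = 0; i < n; ++i) {
        uint8_t prio = 0;
        int rc = cpcs[i].query(port, &prio);
        if (SEL_ERR_NOT_SUPPORTED == rc) {
            continue;                     /* not an error, just skipped */
        }
        if (SEL_OK != rc) {
            return rc;                    /* real error: propagate */
        }
        if (prio > 100) {
            return SEL_ERR_BAD_PRIORITY;  /* outside the documented range */
        }
        if ((int) prio > best) {
            best = prio;
            *winner = &cpcs[i];
        }
    }
    return (NULL != *winner) ? SEL_OK : SEL_ERR_NOT_SUPPORTED;
}
```

When no component is eligible the function reports "not supported", which corresponds to the situation the "no cpcs for port" help message describes.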


@ -1,57 +0,0 @@
# -*- text -*-
#
# Copyright (c) 2008-2009 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for Open MPI's OpenFabrics IB CPC
# support.
#
[no cpcs for port]
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: %s
Local device: %s
Local port: %d
CPCs attempted: %s
#
[cpc name not found]
An invalid CPC name was specified via the btl_openib_cpc_%s MCA
parameter.
Local host: %s
btl_openib_cpc_%s value: %s
Invalid name: %s
All possible valid names: %s
#
[inline truncated]
WARNING: The btl_openib_max_inline_data MCA parameter was used to
specify how much inline data should be used, but a device reduced this
value. This is not an error; it simply means that your run will use
a smaller inline data value than was requested.
Local host: %s
Local device: %s
Local port: %d
Requested value: %d
Value used by device: %d
#
[ibv_create_qp failed]
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.
For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: %s
Local device: %s
Queue pair type: %s


@ -1,67 +0,0 @@
# -*- text -*-
#
# Copyright (c) 2008 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for Open MPI's OpenFabrics RDMA CM
# support (the openib BTL).
#
[could not find matching endpoint]
The OpenFabrics device in an MPI process received an RDMA CM connect
request for a peer that it could not identify as part of this MPI job.
This should not happen. Your process is likely to abort; sorry.
Local host: %s
Local device: %s
Remote address: %s
Remote TCP port: %d
#
[illegal tcp port]
The btl_openib_connect_rdmacm_port MCA parameter was used to specify
an illegal TCP port value. TCP ports must be between 0 and 65536
(ports below 1024 can only be used by root).
TCP port: %d
This value was ignored.
#
[illegal retry count]
The btl_openib_connect_rdmacm_retry_count MCA parameter was used to specify
an illegal retry count.
Retry count: %d
#
[illegal timeout]
The btl_openib_connect_rdmacm_resolve_timeout parameter was used to
specify an illegal timeout value. Timeout values are specified in
milliseconds and must be greater than 0.
Timeout value: %d
This value was ignored.
#
[rdma cm device removal]
The RDMA CM returned that the device Open MPI was trying to use has
been removed.
Local host: %s
Local device: %s
Your MPI job will now abort, sorry.
#
[rdma cm event error]
The RDMA CM returned an event error while attempting to make a
connection. This type of error usually indicates a network
configuration error.
Local host: %s
Local device: %s
Error name: %s
Peer: %s
Your MPI job will now abort, sorry.

View file

@ -1,725 +0,0 @@
# -*- text -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2006 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2006-2011 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2007-2009 Mellanox Technologies. All rights reserved.
# Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
# Copyright (c) 2013-2014 NVIDIA Corporation. All rights reserved.
# Copyright (c) 2018 Los Alamos National Security, LLC. All rights
# reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for Open MPI's OpenFabrics support
# (the openib BTL).
#
[ini file:file not found]
The Open MPI OpenFabrics (openib) BTL component was unable to find or
read an INI file that was requested via the
btl_openib_device_param_files MCA parameter. Please check this file
and/or modify the btl_openib_device_param_files MCA parameter:
%s
#
[ini file:not in a section]
In parsing the OpenFabrics (openib) BTL parameter file, values were
found that were not in a valid INI section. These values will be
ignored. Please re-check this file:
%s
At line %d, near the following text:
%s
#
[ini file:unexpected token]
In parsing the OpenFabrics (openib) BTL parameter file, unexpected
tokens were found (this may cause significant portions of the INI file
to be ignored). Please re-check this file:
%s
At line %d, near the following text:
%s
#
[ini file:expected equals]
In parsing the OpenFabrics (openib) BTL parameter file, unexpected
tokens were found (this may cause significant portions of the INI file
to be ignored). An equals sign ("=") was expected but was not found.
Please re-check this file:
%s
At line %d, near the following text:
%s
#
[ini file:expected newline]
In parsing the OpenFabrics (openib) BTL parameter file, unexpected
tokens were found (this may cause significant portions of the INI file
to be ignored). A newline was expected but was not found. Please
re-check this file:
%s
At line %d, near the following text:
%s
#
[ini file:unknown field]
In parsing the OpenFabrics (openib) BTL parameter file, an
unrecognized field name was found. Please re-check this file:
%s
At line %d, the field named:
%s
This field, and any other unrecognized fields, will be skipped.
#
[no device params found]
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: %s
Device name: %s
Device vendor ID: 0x%04x
Device vendor part ID: %d
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
#
[init-fail-no-mem]
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occurred
here:
Local host: %s
OMPI source: %s:%d
Function: %s()
Device: %s
Memlock limit: %s
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
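The memlock limit mentioned in the message above can be inspected from user space. The following is a minimal sketch, assuming a Linux/Unix host where Python's standard `resource` module is available; the helper names `memlock_limits` and `describe` are hypothetical and are not part of Open MPI.

```python
import resource


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values in bytes.

    A value of None stands for "unlimited".
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def normalize(value):
        return None if value == resource.RLIM_INFINITY else value

    return normalize(soft), normalize(hard)


def describe(limit):
    """Render a limit the way an administrator would expect to read it."""
    return "unlimited" if limit is None else f"{limit} bytes"


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"memlock soft limit: {describe(soft)}")
    print(f"memlock hard limit: {describe(hard)}")
```

If either value is not "unlimited", the limits configured for the launching shell or resource manager are a reasonable first thing to check with the system administrator.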
#
[init-fail-create-q]
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue. This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(e.g., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?). The failure occurred here:
Local host: %s
OMPI source: %s:%d
Function: %s()
Error: %s (errno=%d)
Device: %s
You may need to consult with your system administrator to get this
problem fixed.
#
[pp rnr retry exceeded]
The OpenFabrics "receiver not ready" retry count on a per-peer
connection between two MPI processes has been exceeded. In general,
this should not happen because Open MPI uses flow control on per-peer
connections to ensure that receivers are always ready when data is
sent.
This error usually means one of two things:
1. There is something awry within the network fabric itself.
2. A bug in Open MPI has caused flow control to malfunction.
#1 is usually more likely. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: %s
Local device: %s
Peer host: %s
You may need to consult with your system administrator to get this
problem fixed.
#
[srq rnr retry exceeded]
The OpenFabrics "receiver not ready" retry count on a shared receive
queue or XRC receive queue has been exceeded. This error can occur if
the mca_btl_openib_ib_rnr_retry is set to a value less than 7 (where 7
is the default value and effectively means "infinite retry"). If your
rnr_retry value is 7, there might be something awry within the network
fabric itself. In this case, you should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: %s
Local device: %s
Peer host: %s
You may need to consult with your system administrator to get this
problem fixed.
#
[pp retry exceeded]
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 20). The actual timeout value used is calculated as:
4.096 microseconds * (2^btl_openib_ib_timeout)
See the InfiniBand spec 1.2 (section 12.7.34) for more details.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: %s
Local device: %s
Peer host: %s
You may need to consult with your system administrator to get this
problem fixed.
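The timeout formula quoted in the message above can be worked through numerically. This is a small sketch of that arithmetic only; the function name `local_ack_timeout_seconds` is hypothetical and is not an Open MPI or verbs API.

```python
def local_ack_timeout_seconds(ib_timeout: int) -> float:
    """Evaluate 4.096 microseconds * (2 ** ib_timeout), in seconds."""
    if ib_timeout < 0:
        raise ValueError("ib_timeout must be non-negative")
    return 4.096e-6 * (2 ** ib_timeout)


if __name__ == "__main__":
    # 20 is the default value quoted for btl_openib_ib_timeout.
    for exponent in (14, 18, 20):
        seconds = local_ack_timeout_seconds(exponent)
        print(f"btl_openib_ib_timeout={exponent}: {seconds:.6f} s")
```

With the quoted default of 20, the formula yields roughly 4.29 seconds per timeout; each increment of the parameter doubles that value.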
#
[no active ports found]
WARNING: There is at least one non-excluded OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: %s
#
[error in device init]
WARNING: There was an error initializing an OpenFabrics device.
Local host: %s
Local device: %s
#
[no devices right type]
WARNING: No OpenFabrics devices of the right type were found within
the requested bus distance. The OpenFabrics BTL will be ignored for
this run.
Local host: %s
Requested type: %s
If the "requested type" is "<any>", this usually means that *no*
OpenFabrics devices were found within the requested bus distance.
Note that starting with Open MPI 4.0, only iWARP and RoCE devices are
considered for selection by default. Set the btl_openib_allow_ib MCA
parameter to "true" to allow use of InfiniBand devices.
#
[default subnet prefix]
WARNING: There is more than one active port on host '%s', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
#
[ibv_fork requested but not supported]
WARNING: fork() support was requested for the OpenFabrics (openib)
BTL, but it is not supported on the host %s. Deactivating the
OpenFabrics BTL.
#
[ibv_fork_init fail]
WARNING: fork() support was requested for the OpenFabrics (openib)
BTL, but the library call ibv_fork_init() failed on the host %s.
Deactivating the OpenFabrics BTL.
#
[wrong buffer alignment]
Wrong buffer alignment %d configured on host '%s'. It should be greater
than zero and a power of two. Using the default value %d instead.
#
[of error event]
The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.
Local host: %s
MPI process PID: %d
Error number: %d (%s)
This error may indicate connectivity problems within the fabric;
please contact your system administrator.
#
[of unknown event]
The OpenFabrics stack has reported an unknown network error event.
Open MPI will try to continue, but the job may end up failing.
Local host: %s
MPI process PID: %d
Error number: %d
This error may indicate that you are using an OpenFabrics library
version that is not currently supported by Open MPI. You might try
recompiling Open MPI against your OpenFabrics library installation to
get more information.
#
[specified include and exclude]
ERROR: You have specified more than one of the btl_openib_if_include,
btl_openib_if_exclude, btl_openib_ipaddr_include, or btl_openib_ipaddr_exclude
MCA parameters. These four parameters are mutually exclusive; you can only
specify one.
For reference, the values that you specified are:
btl_openib_if_include: %s
btl_openib_if_exclude: %s
btl_openib_ipaddr_include: %s
btl_openib_ipaddr_exclude: %s
#
[nonexistent port]
WARNING: One or more nonexistent OpenFabrics devices/ports were
specified:
Host: %s
MCA parameter: mca_btl_if_%sclude
Nonexistent entities: %s
These entities will be ignored. You can disable this warning by
setting the btl_openib_warn_nonexistent_if MCA parameter to 0.
#
[invalid mca param value]
WARNING: An invalid MCA parameter value was found for the OpenFabrics
(openib) BTL.
Problem: %s
Resolution: %s
#
[no qps in receive_queues]
WARNING: No queue pairs were defined in the btl_openib_receive_queues
MCA parameter. At least one queue pair must be defined. The
OpenFabrics (openib) BTL will therefore be deactivated for this run.
Local host: %s
#
[invalid qp type in receive_queues]
WARNING: An invalid queue pair type was specified in the
btl_openib_receive_queues MCA parameter. The OpenFabrics (openib) BTL
will be deactivated for this run.
Valid queue pair types are "P" for per-peer and "S" for shared receive
queue.
Local host: %s
btl_openib_receive_queues: %s
Bad specification: %s
#
[invalid pp qp specification]
WARNING: An invalid per-peer receive queue specification was detected
as part of the btl_openib_receive_queues MCA parameter. The
OpenFabrics (openib) BTL will therefore be deactivated for this run.
Per-peer receive queues require between 2 and 5 parameters:
1. Buffer size in bytes (mandatory)
2. Number of buffers (mandatory)
3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
4. Credit window size (optional; defaults to (low_watermark / 2),
must be > 0)
5. Number of buffers reserved for credit messages (optional;
defaults to (num_buffers*2-1)/credit_window)
Example: P,128,256,128,16
- 128 byte buffers
- 256 buffers to receive incoming MPI messages
- When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
- If the number of available credits reaches 16, send an explicit
credit message to the sender
- Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
reserved for explicit credit messages
Local host: %s
Bad queue specification: %s
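The per-peer defaults listed in the message above can be sketched as a small parser. This is an illustration under stated assumptions, not the openib BTL's actual parsing code: integer (floor) division is assumed for the defaults, `rd_num`, `rd_low`, and `rd_win` follow names used elsewhere in this help file, and `rd_rsv` and `parse_pp_spec` are hypothetical names.

```python
def parse_pp_spec(spec: str) -> dict:
    """Parse a per-peer queue spec such as "P,128,256,128,16"."""
    fields = spec.split(",")
    if fields[0] != "P":
        raise ValueError("not a per-peer (P) specification")
    values = [int(field) for field in fields[1:]]
    if not 2 <= len(values) <= 5:
        raise ValueError("per-peer queues take between 2 and 5 parameters")

    size, rd_num = values[0], values[1]
    # Optional parameters fall back to the defaults stated in the message.
    rd_low = values[2] if len(values) > 2 else rd_num // 2
    rd_win = values[3] if len(values) > 3 else rd_low // 2
    if rd_win <= 0:
        raise ValueError("credit window size must be > 0")
    rd_rsv = values[4] if len(values) > 4 else (rd_num * 2 - 1) // rd_win

    return {"size": size, "rd_num": rd_num, "rd_low": rd_low,
            "rd_win": rd_win, "rd_rsv": rd_rsv}


if __name__ == "__main__":
    print(parse_pp_spec("P,128,256,128,16"))
    print(parse_pp_spec("P,128,256"))
```

For the example spec in the message, the reserved-buffer default works out to ((256 * 2) - 1) / 16 = 31 under integer division, matching the value the message quotes.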
#
[invalid srq specification]
WARNING: An invalid shared receive queue specification was detected as
part of the btl_openib_receive_queues MCA parameter. The OpenFabrics
(openib) BTL will therefore be deactivated for this run.
Shared receive queues can take between 2 and 6 parameters:
1. Buffer size in bytes (mandatory)
2. Number of buffers (mandatory)
3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
4. Maximum number of outstanding sends a sender can have (optional;
defaults to (low_watermark / 4))
5. Start value of number of receive buffers that will be pre-posted (optional; defaults to (num_buffers / 4))
6. Event limit buffer count watermark (optional; defaults to (3/16 of start value of buffers number))
Example: S,1024,256,128,32,32,8
- 1024 byte buffers
- 256 buffers to receive incoming MPI messages
- When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
- A sender will not send to a peer unless it has fewer than 32
outstanding sends to that peer.
- 32 receive buffers will be preposted.
- When the number of unused shared receive buffers reaches 8, more
buffers (32 in this case) will be posted.
Local host: %s
Bad queue specification: %s
#
[rd_num must be > rd_low]
WARNING: The number of buffers for a queue pair specified via the
btl_openib_receive_queues MCA parameter must be greater than the low
buffer count watermark. The OpenFabrics (openib) BTL will therefore
be deactivated for this run.
Local host: %s
Bad queue specification: %s
#
[rd_num must be >= rd_init]
WARNING: The number of buffers for a queue pair specified via the
btl_openib_receive_queues MCA parameter (parameter #2) must be
greater than or equal to the initial SRQ size (parameter #5).
The OpenFabrics (openib) BTL will therefore be deactivated for this run.
Local host: %s
Bad queue specification: %s
#
[srq_limit must be > rd_num]
WARNING: The number of buffers for a queue pair specified via the
btl_openib_receive_queues MCA parameter (parameter #2) must be greater than the limit
buffer count (parameter #6). The OpenFabrics (openib) BTL will therefore
be deactivated for this run.
Local host: %s
Bad queue specification: %s
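The shared receive queue defaults and the three consistency rules stated in the messages above can be combined into one sketch. It assumes integer (floor) division for the defaults; `sd_max`, `rd_init`, `srq_limit`, and `parse_srq_spec` are names chosen for illustration, and the code is not the openib BTL's implementation.

```python
def parse_srq_spec(spec: str) -> dict:
    """Parse and validate an SRQ spec such as "S,1024,256,128,32,32,8"."""
    fields = spec.split(",")
    if fields[0] != "S":
        raise ValueError("not a shared receive queue (S) specification")
    values = [int(field) for field in fields[1:]]
    if not 2 <= len(values) <= 6:
        raise ValueError("shared receive queues take between 2 and 6 parameters")

    size, rd_num = values[0], values[1]
    # Optional parameters fall back to the defaults stated in the message.
    rd_low = values[2] if len(values) > 2 else rd_num // 2
    sd_max = values[3] if len(values) > 3 else rd_low // 4
    rd_init = values[4] if len(values) > 4 else rd_num // 4
    srq_limit = values[5] if len(values) > 5 else (3 * rd_init) // 16

    # Consistency rules taken from the three warnings above.
    if rd_num <= rd_low:
        raise ValueError("rd_num must be > rd_low")
    if rd_num < rd_init:
        raise ValueError("rd_num must be >= rd_init")
    if rd_num <= srq_limit:
        raise ValueError("rd_num must be > srq_limit")

    return {"size": size, "rd_num": rd_num, "rd_low": rd_low,
            "sd_max": sd_max, "rd_init": rd_init, "srq_limit": srq_limit}


if __name__ == "__main__":
    print(parse_srq_spec("S,1024,256,128,32,32,8"))
    print(parse_srq_spec("S,1024,256"))
```

A specification such as "S,1024,128,128" is rejected by the first rule, because the number of buffers is not greater than the low watermark.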
#
[biggest qp size is too small]
WARNING: The largest queue pair buffer size specified in the
btl_openib_receive_queues MCA parameter is smaller than the maximum
send size (i.e., the btl_openib_max_send_size MCA parameter), meaning
that no queue is large enough to receive the largest possible incoming
message fragment. The OpenFabrics (openib) BTL will therefore be
deactivated for this run.
Local host: %s
Largest buffer size: %d
Maximum send fragment size: %d
#
[biggest qp size is too big]
WARNING: The largest queue pair buffer size specified in the
btl_openib_receive_queues MCA parameter is larger than the maximum
send size (i.e., the btl_openib_max_send_size MCA parameter). This
means that memory will be wasted because the largest possible incoming
message fragment will not fill a buffer allocated for incoming
fragments.
Local host: %s
Largest buffer size: %d
Maximum send fragment size: %d
#
[freelist too small]
WARNING: The maximum freelist size that was specified was too small
for the requested receive queue sizes. The maximum freelist size must
be at least equal to the sum of the largest number of buffers posted
to a single queue plus the corresponding number of reserved/credit
buffers for that queue. It is suggested that the maximum be quite a
bit larger than this for performance reasons.
Local host: %s
Specified freelist size: %d
Minimum required freelist size: %d
#
[XRC with PP or SRQ]
WARNING: An invalid queue pair type was specified in the
btl_openib_receive_queues MCA parameter. The OpenFabrics (openib) BTL
will be deactivated for this run.
Note that XRC ("X") queue pairs cannot be used with per-peer ("P") and
SRQ ("S") queue pairs. This restriction may be removed in future
versions of Open MPI.
Local host: %s
btl_openib_receive_queues: %s
#
[XRC with BTLs per LID]
WARNING: An invalid queue pair type was specified in the
btl_openib_receive_queues MCA parameter. The OpenFabrics (openib) BTL
will be deactivated for this run.
XRC ("X") queue pairs can not be used when (btls_per_lid > 1). This
restriction may be removed in future versions of Open MPI.
Local host: %s
btl_openib_receive_queues: %s
btls_per_lid: %d
#
[XRC on device without XRC support]
WARNING: You configured the OpenFabrics (openib) BTL to run with %d
XRC queues. The device %s does not have XRC capabilities; the
OpenFabrics btl will ignore this device. If no devices are found with
XRC capabilities, the OpenFabrics BTL will be disabled.
Local host: %s
#
[No XRC support]
WARNING: The Open MPI build was compiled without XRC support, but XRC
("X") queues were specified in the btl_openib_receive_queues MCA
parameter. The OpenFabrics (openib) BTL will therefore be deactivated
for this run.
Local host: %s
btl_openib_receive_queues: %s
#
[non optimal rd_win]
WARNING: rd_win specification is non optimal. For maximum performance it is
advisable to configure rd_win bigger than (rd_num - rd_low), but currently
rd_win = %d and (rd_num - rd_low) = %d.
#
[apm without lmc]
WARNING: You can't enable APM support with LMC bit configured to 0.
APM support will be disabled.
#
[apm with wrong lmc]
Can not provide %d alternative paths with LMC bit configured to %d.
#
[apm not enough ports]
WARNING: APM over ports requires at least 2 active ports, but only a
single active port was found. Disabling APM over ports.
#
[locally conflicting receive_queues]
Open MPI detected two devices on a single server that have different
"receive_queues" parameter values (in the openib BTL). Open MPI
currently only supports one OpenFabrics receive_queues value in an MPI
job, even if you have different types of OpenFabrics adapters on the
same host.
Device 2 (in the details shown below) will be ignored for the duration
of this MPI job.
You can fix this issue by one or more of the following:
1. Set the MCA parameter btl_openib_receive_queues to a value that
is usable by all the OpenFabrics devices that you will use.
2. Use the btl_openib_if_include or btl_openib_if_exclude MCA
parameters to select exactly which OpenFabrics devices to use in
your MPI job.
Finally, note that the "receive_queues" values may have been set by
the Open MPI device default settings file. You may want to look in
this file and see if your devices are getting receive_queues values
from this file:
%s/mca-btl-openib-device-params.ini
Here is more detailed information about the receive_queues value
conflict:
Local host: %s
Device 1: %s (vendor 0x%x, part ID %d)
Receive queues: %s
Device 2: %s (vendor 0x%x, part ID %d)
Receive queues: %s
#
[eager RDMA and progress threads]
WARNING: The openib BTL was directed to use "eager RDMA" for short
messages, but the openib BTL was compiled with progress threads
support. Short eager RDMA is not yet supported with progress threads;
its use has been disabled in this job.
This is a warning only; your job will attempt to continue.
#
[ptmalloc2 with no threads]
WARNING: It appears that ptmalloc2 was compiled into this process via
-lopenmpi-malloc, but there is no thread support. This combination is
known to cause memory corruption in the openib BTL. Open MPI is
therefore disabling the use of the openib BTL in this process for this
run.
Local host: %s
#
[cannot raise btl error]
The OpenFabrics driver in Open MPI tried to raise a fatal error, but
failed. Hopefully there was an error message before this one that
gave some more detailed information.
Local host: %s
Source file: %s
Source line: %d
Your job is now going to abort, sorry.
#
[no iwarp support]
Open MPI does not support iWARP devices with this version of OFED.
You need to upgrade to a later version of OFED (1.3 or later) for Open
MPI to support iWARP devices.
(This message is being displayed because you told Open MPI to use
iWARP devices via the btl_openib_device_type MCA parameter)
#
[invalid ipaddr_inexclude]
WARNING: An invalid value was given for btl_openib_ipaddr_%s. This
value will be ignored.
Local host: %s
Value: %s
Message: %s
#
[unsupported queues configuration]
The Open MPI receive queue configurations for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other. This
generally happens when you are using OpenFabrics devices from
different vendors on the same network. You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.
Local host: %s
Local adapter: %s (vendor 0x%x, part ID %d)
Local queues: %s
Remote host: %s
Remote adapter: (vendor 0x%x, part ID %d)
Remote queues: %s
#
[conflicting transport types]
Open MPI detected two different OpenFabrics transport types in the same InfiniBand network.
Such mixed network transport configurations are not supported by Open MPI.
Local host: %s
Local adapter: %s (vendor 0x%x, part ID %d)
Local transport type: %s
Remote host: %s
Remote Adapter: (vendor 0x%x, part ID %d)
Remote transport type: %s
#
[gid index too large]
Open MPI tried to use a GID index that was too large for an
OpenFabrics device (i.e., the GID index does not exist on this
device).
Local host: %s
Local adapter: %s
Local port: %d
Requested GID index: %d (specified by the btl_openib_gid_index MCA param)
Max allowable GID index: %d
Use "ibv_devinfo -v" on the local host to see the GID table of this
device.
[reg mem limit low]
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel module
parameters:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: %s
Registerable memory: %lu MiB
Total memory: %lu MiB
%s
[CUDA_no_gdr_support]
You requested to run with CUDA GPU Direct RDMA support but the Open MPI
library was not built with that support. The Open MPI library must be
configured with CUDA 6.0 or later.
Local host: %s
[driver_no_gdr_support]
You requested to run with CUDA GPU Direct RDMA support but this OFED
installation does not have that support. Contact Mellanox to figure
out how to get an OFED stack with that support.
Local host: %s
[no_fork_with_gdr]
You cannot have fork support and CUDA GPU Direct RDMA support enabled at the
same time. Please disable one of them. Deactivating the openib BTL.
Local host: %s
#
[CUDA_gdr_and_nopinned]
You requested to run with CUDA GPU Direct RDMA support but also with
"leave pinned" turned off. This will result in very poor performance
with CUDA GPU Direct RDMA. Either disable GPU Direct RDMA support or
enable "leave pinned" support. Deactivating the openib BTL.
Local host: %s
#
[do_not_set_openib_value]
Open MPI has detected that you have attempted to set the btl_openib_cuda_max_send_size
value. This is not supported. Setting back to default value of 0.
Local host: %s
[ib port not selected]
For Open MPI 4.0 and later, InfiniBand ports on a device are not used
by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: %s
Local adapter: %s
Local port: %d
#

View file

@ -1,351 +0,0 @@
#
# Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2006-2011 Mellanox Technologies. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# This is the default NIC/HCA parameters file for Open MPI's OpenIB
# BTL. If NIC/HCA vendors wish to add their respective values into
# this file (that is distributed with Open MPI), please contact the
# Open MPI development team. See http://www.open-mpi.org/ for
# details.
# This file is in the "ini" style, meaning that it has sections
# identified by section names enclosed in square brackets (e.g.,
# "[Section name]") followed by "key = value" pairs indicating values
# for a specific NIC/HCA vendor and model. NICs/HCAs are identified
# by their vendor ID and vendor part ID, which can be obtained by
# running the diagnostic utility command "ibv_devinfo". The fields
# "vendor_id" and "vendor_part_id" are the vendor ID and vendor part
# ID, respectively.
# The sections in this file only accept a few fields:
# vendor_id: a comma-delimited list of integers of NIC/HCA vendor IDs,
# expressed either in decimal or hexadecimal (e.g., "13" or "0xd").
# Individual values can be taken directly from the output of
# "ibv_devinfo". NIC/HCA vendor ID's correspond to IEEE OUI's, for
# which you can find the canonical list here:
# http://standards.ieee.org/regauth/oui/. Example:
#
# vendor_id = 0x05ad
#
# Note: Several vendors resell Mellanox hardware and put their own firmware
# on the cards, therefore overriding the default Mellanox vendor ID.
#
# Mellanox 0x02c9
# Cisco 0x05ad
# Silverstorm 0x066a
# Voltaire 0x08f1
# HP 0x1708
# Sun 0x03ba
# Bull 0x119f
# vendor_part_id: a comma-delimited list of integers of different
# NIC/HCA models from a single vendor, expressed in either decimal or
# hexadecimal (e.g., "13" or "0xd"). Individual values can be
# obtained from the output of "ibv_devinfo". Example:
#
# vendor_part_id = 25208,25218
# mtu: an integer indicating the maximum transfer unit (MTU) to be
# used with this NIC/HCA. The effective MTU will be the minimum of an
# NIC's/HCA's MTU value and its peer NIC's/HCA's MTU value. Valid
# values are 256, 512, 1024, 2048, and 4096. Example:
#
# mtu = 1024
# use_eager_rdma: an integer indicating whether RDMA should be used
# for eager messages. 0 values indicate "no" (false); non-zero values
# indicate "yes" (true). This flag should only be enabled for
# NICs/HCAs that can provide guarantees about ordering of data in
# memory -- that the last byte of an incoming RDMA write will always
# be written last. Certain cards cannot provide this guarantee, while
# others can.
# use_eager_rdma = 1
# receive_queues: a list of "bucket shared receive queues" (BSRQ) that
# are opened between MPI process peer pairs for point-to-point
# communications of messages shorter than the total length required
# for RDMA transfer. The use of multiple RQs, each with different
# sized posted receive buffers can allow [much] better registered
# memory utilization -- MPI messages are sent on the QP with the
# smallest buffer size that will fit the message. Note that flow
# control messages are always sent across the QP with the smallest
# buffer size. Also note that the buffers *must* be listed in
# increasing buffer size. This parameter matches the
# mca_btl_openib_receive_queues MCA parameter; see the ompi_info help
# message and FAQ for a description of its values. BSRQ
# specifications are found in this precedence:
# highest: specifying the mca_btl_openib_receive_queues MCA param
# next: finding a value in this file
# lowest: using the default mca_btl_openib_receive_queues MCA param value
# receive_queues = P,128,256,192,128:S,65536,256,192,128
# max_inline_data: an integer specifying the maximum inline data (in
# bytes) supported by the device. -1 means to use a run-time probe to
# figure out the maximum value supported by the device.
# max_inline_data = 1024
# rdmacm_reject_causes_connect_error: a boolean indicating whether
# when an RDMA CM REJECT is issued on the device, instead of getting
# the expected REJECT event back, you might get a CONNECT_ERROR event.
# Open MPI uses RDMA CM REJECT messages in its normal wireup
# procedure; some connections are *expected* to be rejected. However,
# with some older drivers, if process A issues a REJECT, process B
# will receive a CONNECT_ERROR event instead of a REJECT event. So if
# this flag is set to true and we receive a CONNECT_ERROR event on a
# connection where we are expecting a REJECT, then just treat the
# CONNECT_ERROR exactly as we would have treated the REJECT. Setting
# this flag to true allows Open MPI to work around the behavior
# described above. It is [mostly] safe to set this flag to true even
# after a driver has been fixed; the scope of where this flag is used
# is small enough that it *shouldn't* mask real CONNECT_ERROR events.
# rdmacm_reject_causes_connect_error = 1
############################################################################
[default]
# These are the default values, identified by the vendor and part ID
# numbers of 0 and 0. If the queried NIC/HCA does not return vendor and
# part ID numbers that match any of the sections in this file, the
# values in this section are used. Vendor IDs and part IDs can be hex
# or decimal.
vendor_id = 0
vendor_part_id = 0
use_eager_rdma = 0
mtu = 1024
max_inline_data = 128
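The lookup described in the comments above (match on vendor ID and vendor part ID, otherwise fall back to the [default] section) can be sketched with Python's standard `configparser`. This is an illustration only, not the openib BTL's INI reader; the embedded sample is an abbreviated excerpt, and `parse_ids` and `find_device_params` are hypothetical helper names.

```python
import configparser

# Abbreviated excerpt of the parameter file, for illustration only.
SAMPLE = """
[default]
vendor_id = 0
vendor_part_id = 0
use_eager_rdma = 0
mtu = 1024
max_inline_data = 128

[Mellanox ConnectIB]
vendor_id = 0x2c9,0x5ad
vendor_part_id = 4113
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
"""


def parse_ids(value: str) -> set:
    """Parse a comma-delimited list of decimal or 0x-prefixed hex integers."""
    return {int(token.strip(), 0) for token in value.split(",") if token.strip()}


def find_device_params(ini_text: str, vendor_id: int, part_id: int):
    """Return (section name, values) for a device, or the default section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    for name in parser.sections():
        if name == "default":
            continue
        section = parser[name]
        if (vendor_id in parse_ids(section.get("vendor_id", ""))
                and part_id in parse_ids(section.get("vendor_part_id", ""))):
            return name, dict(section)
    return "default", dict(parser["default"])


if __name__ == "__main__":
    print(find_device_params(SAMPLE, 0x2c9, 4113))
    print(find_device_params(SAMPLE, 0x1234, 1))
```

Passing base 0 to `int` lets the same helper accept both notations the file allows, so "13" and "0xd" parse to the same value.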
############################################################################
[Mellanox Tavor Infinihost]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
vendor_part_id = 23108
use_eager_rdma = 1
mtu = 1024
max_inline_data = 128
############################################################################
[Mellanox Arbel InfiniHost III MemFree/Tavor]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
vendor_part_id = 25208,25218
use_eager_rdma = 1
mtu = 1024
max_inline_data = 128
############################################################################
[Mellanox Sinai Infinihost III]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
vendor_part_id = 25204,24204
use_eager_rdma = 1
mtu = 2048
max_inline_data = 128
############################################################################
# A.k.a. ConnectX
[Mellanox Hermon]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 25408,25418,25428,25448,26418,26428,26438,26448,26468,26478,26488,4099,4103,4100
use_eager_rdma = 1
mtu = 2048
max_inline_data = 128
############################################################################
[Mellanox ConnectIB]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4113
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
############################################################################
[Mellanox ConnectX4]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4115,4117
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
############################################################################
[Mellanox ConnectX5]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4119,4121
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
############################################################################
[IBM eHCA 4x and 12x]
vendor_id = 0x5076
vendor_part_id = 0
use_eager_rdma = 1
mtu = 2048
receive_queues = P,128,256,192,128:P,65536,256,192,128
max_inline_data = 0
############################################################################
[IBM eHCA-2 4x and 12x]
vendor_id = 0x5076
vendor_part_id = 1
use_eager_rdma = 1
mtu = 4096
receive_queues = P,128,256,192,128:P,65536,256,192,128
max_inline_data = 0
############################################################################
# See http://lists.openfabrics.org/pipermail/general/2008-June/051920.html
# 0x1fc1 and 0x1077 are PCI IDs; at least one of QL's OUIs is 0x1175
[QLogic InfiniPath 1]
vendor_id = 0x1fc1,0x1077,0x1175
vendor_part_id = 13
use_eager_rdma = 1
mtu = 2048
max_inline_data = 0
[QLogic InfiniPath 2]
vendor_id = 0x1fc1,0x1077,0x1175
vendor_part_id = 16,29216
use_eager_rdma = 1
mtu = 4096
max_inline_data = 0
[QLogic InfiniPath 3]
vendor_id = 0x1fc1,0x1077,0x1175
vendor_part_id = 16,29474
use_eager_rdma = 1
mtu = 4096
max_inline_data = 0
[QLogic FastLinQ QL41000]
vendor_id = 0x1077
vendor_part_id = 32880
receive_queues = P,65536,64
############################################################################
# Chelsio's OUI is 0x0743. 0x1425 is the PCI ID.
[Chelsio T3]
vendor_id = 0x1425
vendor_part_id = 0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0030,0x0031,0x0032,0x0035,0x0036
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,256,192,128
max_inline_data = 64
[Chelsio T4]
vendor_id = 0x1425
vendor_part_id = 0xa000,0x4400,0x4401,0x4402,0x4403,0x4404,0x4405,0x4406,0x4407,0x4408,0x4409,0x440a,0x440b,0x440c,0x440d,0x440e,0x4480,0x4481
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
max_inline_data = 280
[Chelsio T5]
vendor_id = 0x1425
vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
max_inline_data = 280
[Chelsio T6]
vendor_id = 0x1425
vendor_part_id = 0x6400,0x6401,0x6402,0x6403,0x6404,0x6405,0x6406,0x6407,0x6408,0x6409,0x640d,0x6410,0x6411,0x6414,0x6415
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
max_inline_data = 280
############################################################################
# I'm *assuming* that 0x4040 is the PCI ID...
[NetXen]
vendor_id = 0x4040
vendor_part_id = 0x0001,0x0002,0x0003,0x0004,0x0005,0x0024,0x0025,0x0100
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,248,192,128
max_inline_data = 64
############################################################################
# NetEffect's OUI is 0x1255. 0x1678 is the PCI ID. ...but then
# NetEffect was bought by Intel. Intel's OUI is 0x1b21.
[NetEffect/Intel NE020]
vendor_id = 0x1678,0x1255,0x1b21
vendor_part_id = 0x0100,0x0110
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,256,192,128
max_inline_data = 64
[Intel HFI1]
vendor_id = 0x1175
vendor_part_id = 9456,9457
use_eager_rdma = 1
mtu = 4096
max_inline_data = 0
############################################################################
# Intel has several OUIs, including 0x8086. Amusing. :-) Intel has
# advised us (June, 2013) to ignore the Intel Phi OpenFabrics
# device... at least for now.
[Intel Xeon Phi]
vendor_id = 0x8086
vendor_part_id = 0
ignore_device = 1
############################################################################
# IBM Soft iWARP device.
[IBM Soft iWARP]
vendor_id = 0x626d74
vendor_part_id = 0
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
max_inline_data = 72
############################################################################
# Broadcom NetXtreme-E RDMA Ethernet Controller
[Broadcom BCM57XXX]
vendor_id = 0x14e4
vendor_part_id = 0x1605,0x1606,0x1614,0x16c0,0x16c1,0x16ce,0x16cf,0x16d6,0x16d7,0x16d8,0x16d9,0x16df,0x16e2,0x16e3,0x16e5,0x16eb,0x16ed,0x16ef,0x16f0,0x16f1
use_eager_rdma = 1
mtu = 1024
receive_queues = P,65536,256,192,128
max_inline_data = 96
[Broadcom BCM58XXX]
vendor_id = 0x14e4
vendor_part_id = 0xd800,0xd802,0xd804
use_eager_rdma = 1
mtu = 1024
receive_queues = P,65536,256,192,128
max_inline_data = 96
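
The lookup rule described in the [default] section's comments (match a device's vendor ID and part ID against each section, accept hex or decimal IDs, fall back to [default] when nothing matches) can be sketched as follows. This is only an illustration in Python, not the openib BTL's actual C implementation: the function names `parse_ids` and `lookup_device_params` are hypothetical, the embedded sample is a two-section excerpt of the file above, and the "first matching section wins" order is an assumption.

```python
import configparser

# Two sections copied from the INI file above, embedded for a self-contained demo.
SAMPLE_INI = """
[default]
vendor_id = 0
vendor_part_id = 0
use_eager_rdma = 0
mtu = 1024
max_inline_data = 128

[Mellanox ConnectX4]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4115,4117
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
"""


def parse_ids(text):
    """Parse a comma-separated list of IDs given in hex (0x...) or decimal."""
    ids = set()
    for token in text.split(","):
        token = token.strip()
        base = 16 if token.lower().startswith("0x") else 10
        ids.add(int(token, base))
    return ids


def lookup_device_params(ini_text, vendor_id, part_id):
    """Return (section name, params) for a device, or the [default] section.

    Assumption: the first section listing both IDs wins. All returned
    values are left as strings, exactly as written in the file.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    for name in parser.sections():
        if name == "default":
            continue  # only used as the fallback below
        section = parser[name]
        if (vendor_id in parse_ids(section["vendor_id"])
                and part_id in parse_ids(section["vendor_part_id"])):
            return name, dict(section)
    return "default", dict(parser["default"])


if __name__ == "__main__":
    # A ConnectX-4 part ID reported with Mellanox's 0x15b3 vendor ID.
    print(lookup_device_params(SAMPLE_INI, 0x15b3, 4115))
    # An unknown device falls back to the [default] values.
    print(lookup_device_params(SAMPLE_INI, 0x1234, 1))
```

Note that Python's `configparser` treats only the upper-case name `DEFAULT` specially, so the lower-case `[default]` section here is read as an ordinary section and has to be handled explicitly as the fallback.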


@ -1,7 +0,0 @@
#
# owner/status file
# owner: institution that is responsible for this package
# status: e.g. active, maintenance, unmaintained
#
owner:Chelsio
status:maintenance