
First commit of the Cisco usNIC BTL.

This BTL accesses the Cisco usNIC Linux device through the Linux verbs
API, using Unreliable Datagram (UD) queue pairs.  A few noteworthy points:

 * This BTL does most of its own fragmentation; it tells the PML that
   it has a very high max_send_size (much higher than the network
   MTU).
 * Since UD fragments are, by definition, unreliable, the usnic BTL
   handles all of its own reliability via a sliding window approach
   using the opal_hotel construct and many tricks stolen from the
   corpus of knowledge surrounding efficient TCP.
 * There is a fun PML latency-metric based optimization for NUMA
   awareness of short messages.
 * Note that this is ''not'' a generic UD verbs BTL; it is specific to
   the Cisco usNIC device.

This commit was SVN r28879.
This commit is contained in:
Jeff Squyres 2013-07-19 22:13:58 +00:00
parent 3546163c48
commit 194b285447
25 changed files: 8072 additions and 0 deletions

79
ompi/mca/btl/usnic/Makefile.am Normal file

@ -0,0 +1,79 @@
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2006 Sandia National Laboratories. All rights
# reserved.
# Copyright (c) 2010-2013 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
AM_CPPFLAGS = $(btl_usnic_CPPFLAGS)
dist_pkgdata_DATA = \
help-mpi-btl-usnic.txt
sources = \
btl_usnic_module.c \
btl_usnic_module.h \
btl_usnic.h \
btl_usnic_ack.c \
btl_usnic_ack.h \
btl_usnic_component.c \
btl_usnic_endpoint.c \
btl_usnic_endpoint.h \
btl_usnic_frag.c \
btl_usnic_frag.h \
btl_usnic_hwloc.h \
btl_usnic_mca.c \
btl_usnic_proc.c \
btl_usnic_proc.h \
btl_usnic_recv.c \
btl_usnic_recv.h \
btl_usnic_send.c \
btl_usnic_send.h \
btl_usnic_util.c \
btl_usnic_util.h
if OPAL_HAVE_HWLOC
sources += btl_usnic_hwloc.c
endif
# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).
if MCA_BUILD_ompi_btl_usnic_DSO
lib =
lib_sources =
component = mca_btl_usnic.la
component_sources = $(sources)
else
lib = libmca_btl_usnic.la
lib_sources = $(sources)
component =
component_sources =
endif
mcacomponentdir = $(pkglibdir)
mcacomponent_LTLIBRARIES = $(component)
mca_btl_usnic_la_SOURCES = $(component_sources)
mca_btl_usnic_la_LDFLAGS = -module -avoid-version $(btl_usnic_LDFLAGS)
mca_btl_usnic_la_LIBADD = $(btl_usnic_LIBS) \
$(top_ompi_builddir)/ompi/mca/common/verbs/libmca_common_verbs.la
noinst_LTLIBRARIES = $(lib)
libmca_btl_usnic_la_SOURCES = $(lib_sources)
libmca_btl_usnic_la_LDFLAGS= -module -avoid-version $(btl_usnic_LDFLAGS)
libmca_btl_usnic_la_LIBADD = $(btl_usnic_LIBS)

286
ompi/mca/btl/usnic/README.txt Normal file

@ -0,0 +1,286 @@
Design notes on usnic BTL
======================================
nomenclature
fragment - something the PML asks us to send or put, any size
segment - something we can put on the wire in a single packet
chunk - a piece of a fragment that fits into one segment
a segment can contain either an entire fragment or a chunk of a fragment
each segment and fragment has an associated descriptor.
Each segment data structure has a block of registered memory associated with
it which matches MTU for that segment
ACK - acks get special small segments with only enough memory for an ACK
non-ACK segments always have a parent fragment
fragments are either large (> MTU) or small (<= MTU)
a small fragment has a segment descriptor embedded within it since it
always needs exactly one.
a large fragment has no permanently associated segments, but allocates them
as needed.
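A minimal sketch of the fragment/segment relationship above, assuming a
hypothetical helper name and a hypothetical max_payload parameter (the
usable bytes per segment after the BTL header; the notes do not give
this value). It only illustrates the arithmetic, it is not the BTL's
actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the usnic BTL): number of segments
 * needed to carry a fragment of frag_len bytes when one segment can
 * carry at most max_payload bytes of fragment data. */
static size_t sketch_segments_needed(size_t frag_len, size_t max_payload)
{
    assert(max_payload > 0);
    if (frag_len <= max_payload) {
        /* small fragment: always exactly one (embedded) segment */
        return 1;
    }
    /* large fragment: one segment per chunk, rounding up */
    return (frag_len + max_payload - 1) / max_payload;
}
```

For example, with a 1500-byte payload a 4500-byte fragment is carried as
three chunks, while a 1500-byte fragment still fits in a single segment.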
======================================
channels
a channel is a queue pair with an associated completion queue
each channel has its own MTU and r/w queue entry counts
There are 2 channels, command and data
command queue is generally for higher priority fragments
data queue is for standard data traffic
command queue should possibly be called "priority" queue
command queue is shorter and has a smaller MTU than the data queue
this makes the command queue a lot faster than the data queue, so we
hijack it for sending very small fragments (<= tiny_mtu, currently 768 bytes)
command queue is used for ACKs and tiny fragments
data queue is used for everything else
PML fragments marked priority should perhaps use command queue
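The channel choice described above could be sketched as follows. The
enum, the function name and the SKETCH_TINY_MTU constant are
illustrative assumptions (768 is the tiny_mtu value quoted above); the
real BTL may decide this differently:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only */
enum sketch_channel { SKETCH_CHANNEL_COMMAND, SKETCH_CHANNEL_DATA };

#define SKETCH_TINY_MTU 768   /* "currently 768 bytes" per the notes */

/* ACKs and tiny fragments go on the command channel,
 * everything else goes on the data channel. */
static enum sketch_channel sketch_pick_channel(bool is_ack, size_t len)
{
    /* ACK segments are header-only, so they are always tiny */
    assert(!is_ack || len <= SKETCH_TINY_MTU);
    if (is_ack || len <= SKETCH_TINY_MTU) {
        return SKETCH_CHANNEL_COMMAND;
    }
    return SKETCH_CHANNEL_DATA;
}
```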
======================================
sending
Normally, all send requests are simply enqueued and then actually posted
to the NIC by the routine ompi_btl_usnic_module_progress_sends().
"fastpath" tiny sends are the exception.
Each module maintains a queue of endpoints that are ready to send.
An endpoint is ready to send if all of the following are met:
- the endpoint has fragments to send
- the endpoint has send credits
- the endpoint's send window is "open" (not full of un-ACKed segments)
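The three conditions above can be sketched as a single predicate. The
struct and function names here are hypothetical stand-ins; only the
window test mirrors the WINDOW_OPEN() macro in btl_usnic_endpoint.h
(window size 4096):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WINDOW_SIZE 4096

/* Hypothetical, simplified view of an endpoint */
struct sketch_endpoint {
    int      frags_queued;      /* fragments waiting to be sent */
    int      send_credits;      /* remaining send credits */
    uint64_t next_seq_to_send;  /* next sequence number to assign */
    uint64_t ack_seq_rcvd;      /* highest sequence number ACKed */
};

/* Window is open while fewer than SKETCH_WINDOW_SIZE segments are un-ACKed */
static bool sketch_window_open(const struct sketch_endpoint *ep)
{
    return ep->next_seq_to_send < ep->ack_seq_rcvd + SKETCH_WINDOW_SIZE;
}

/* Ready to send only if all three conditions hold */
static bool sketch_ready_to_send(const struct sketch_endpoint *ep)
{
    assert(ep != NULL);
    return ep->frags_queued > 0 &&
           ep->send_credits > 0 &&
           sketch_window_open(ep);
}
```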
Each module also maintains a list of segments that need to be retransmitted.
Note that the list of pending retrans is per-module, not per-endpoint.
send progression first posts any pending retransmissions, always using the
data channel. (reason is that if we start getting heavy congestion and
there are lots of retransmits, it becomes more important than ever to
prioritize ACKs, clogging command channel with retrans data makes things worse,
not better)
Next, progression loops sending segments to the endpoint at the top of
the "endpoints_with_sends" queue. When an endpoint exhausts its send
credits or fills its send window or runs out of segments to send, it removes
itself from the endpoints_with_sends list. Any pending ACKs will be
picked up and piggy-backed on these sends.
Finally, any endpoints that still need ACKs whose timer has expired will
be sent explicit ACK packets.
[double-click fragment sending]
The middle part of the progression loop handles both small (single-segment)
and large (multi-segment) sends.
For small fragments, the verbs descriptor within the embedded segment is
updated with length, BTL header is updated, then we call
ompi_btl_usnic_endpoint_send_segment() to send the segment.
After posting, we make a PML callback if needed.
For large fragments, a little more is needed. Segments from a large
fragment have a slightly larger BTL header which contains a fragment ID,
an offset, and a size. The fragment ID is allocated when the first chunk
of the fragment is sent. A segment gets allocated, the next blob of data is
copied into this segment, and the segment is posted. If the last chunk of
the fragment was sent, perform callback if needed, then remove fragment
from endpoint send queue.
[double-click ompi_btl_usnic_endpoint_send_segment()]
This is common posting code for large or small segments. It assigns a
sequence number to a segment, checks for an ACK to piggy-back,
posts the segment to the NIC, and then starts the retransmit timer
by checking the segment into hotel. Send credits are consumed here.
======================================
send dataflow
PML control messages with no user data are sent via:
desc = usnic_alloc(size)
usnic_send(desc)
user messages less than eager limit and 1st part of larger
messages are sent via:
desc = usnic_prepare_src(convertor, size)
usnic_send(desc)
larger msgs
desc = usnic_prepare_src(convertor, size)
usnic_put(desc)
usnic_alloc() currently asserts the length is "small", allocates and
fills in a small fragment. src pointer will point to start of
associated registered mem + sizeof BTL header, and PML will put its
data there.
usnic_prepare_src() allocates either a large or small fragment based on size.
The fragment descriptor is filled in to have 2 SG entries, 1st pointing to
place where PML should construct its header. If the data convertor says
data is contiguous, 2nd SG entry points to user buffer, else it is null and
sf_convertor is filled in with address of convertor.
usnic_send()
If the fragment being sent is small enough, has contiguous data, and
"very few" command queue send WQEs have been consumed, usnic_send() does
a fastpath send. This means it posts the segment immediately to the NIC
with INLINE flag set.
If all of the conditions for fastpath send are not met, and this is a small
fragment, the user data is copied into the associated registered memory at this
time and the SG list in the descriptor is collapsed to one entry.
After the checks above are done, the fragment is enqueued to be sent
via ompi_btl_usnic_endpoint_enqueue_frag()
usnic_put()
PML will have filled in destination address in descriptor. This is saved
and the fragment is enqueued for processing.
ompi_btl_usnic_endpoint_enqueue_frag()
This appends the fragment to the "to be sent" list of the endpoint and
conditionally adds the endpoint to the list of endpoints with data to send
via ompi_btl_usnic_check_rts()
======================================
receive dataflow
Each BTL packet has one of 3 types in its header: frag, chunk, or ack.
A frag packet is a full PML fragment.
A chunk packet is a piece of a fragment that needs to be reassembled.
An ack packet is header only with a sequence number being ACKed.
Both frag and chunk packets go through some of the same processing.
Both may carry piggy-backed ACKs which may need to be processed.
Both have sequence numbers which must be processed and may result in
dropping the packet and/or queueing an ACK to the sender.
frag packets may be either regular PML fragments or PUT segments.
If the "put_addr" field of the BTL header is set, this is a PUT and
the data is copied directly to the user buffer. If this field is NULL,
the segment is passed up to the PML. The PML is expected to do everything
it needs with this packet in the callback, including copying data out if
needed. Once the callback is complete, the receive buffer is recycled.
chunk packets are parts of a larger fragment. If an active fragment receive
for the matching fragment ID cannot be found, a new fragment info
descriptor is allocated. If this is not a PUT (put_addr == NULL), we
malloc() data to reassemble the fragment into. Each subsequent chunk
is copied either into this reassembly buffer or directly into user memory.
When the last chunk of a fragment arrives, a PML callback is made for non-PUTs,
then the fragment info descriptor is released.
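A rough sketch of the non-PUT chunk reassembly described above. All
names are hypothetical, and it assumes duplicate chunks have already
been dropped by the sequence-number processing, so every accepted chunk
is counted once:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified fragment info descriptor */
struct sketch_rx_frag {
    char  *buf;         /* malloc()ed reassembly buffer */
    size_t total_len;   /* size of the whole fragment */
    size_t bytes_left;  /* bytes still missing */
};

/* First chunk seen for a fragment ID: allocate the reassembly buffer */
static bool sketch_rx_frag_start(struct sketch_rx_frag *fi, size_t total_len)
{
    assert(fi != NULL);
    fi->buf = malloc(total_len);
    fi->total_len = total_len;
    fi->bytes_left = total_len;
    return fi->buf != NULL;
}

/* Copy one chunk into place; returns true once the fragment is complete
 * (the point where the PML callback would be made).  Chunks that do not
 * fit are ignored and reported as "not complete". */
static bool sketch_rx_frag_add_chunk(struct sketch_rx_frag *fi, size_t offset,
                                     const void *data, size_t len)
{
    if (offset > fi->total_len || len > fi->total_len - offset ||
        len > fi->bytes_left) {
        return false;
    }
    memcpy(fi->buf + offset, data, len);
    fi->bytes_left -= len;
    return fi->bytes_left == 0;
}
```

Placing each chunk by its offset means chunks may be copied in any
arrival order; completion is detected purely by the remaining byte count.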
======================================
reliability:
every packet has sequence #
each endpoint has a "send window" , currently 4096 entries.
once a segment is sent, it is saved in window array until ACK is received
ACKs acknowledge all packets <= specified sequence #
rcvr only ACKs a sequence # when all packets up to that sequence have arrived
each pkt has dflt retrans timer of 100ms
packet will be scheduled for retrans if timer expires
Once a segment is sent, it always has its retransmit timer started.
This is accomplished by opal_hotel_checkin()
Any time a segment is posted to the NIC for retransmit, it is checked out
of the hotel (timer stopped).
So, a send segment is always in one of 4 states:
- on free list, unallocated
- on endpoint to-send list in the case of segment associated with small fragment
- posted to NIC and in hotel awaiting ACK
- on module re-send list awaiting retransmission
rcvr:
- if a pkt with seq >= expected seq is received, schedule ack of largest
in-order sequence received if not already scheduled. dflt time is 50us
- if a packet with seq < expected seq arrives, we send an ACK immediately,
as this indicates a lost ACK
sender:
duplicate ACK triggers immediate retrans if one is not pending for that segment
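The sender-side ACK handling above can be sketched as a three-way
classification, loosely following ompi_btl_usnic_handle_ack(). The names
below are illustrative assumptions; the slot macro mirrors
WINDOW_SIZE_MOD() with the 4096-entry window:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WINDOW_SIZE 4096
/* Slot in the sent-segments array; works because the size is a power of two */
#define SKETCH_WINDOW_SLOT(seq) ((seq) & (SKETCH_WINDOW_SIZE - 1))

enum sketch_ack_kind {
    SKETCH_ACK_OLD,        /* older than what was already ACKed: ignore */
    SKETCH_ACK_DUPLICATE,  /* same as last ACK: segment ack_seq+1 was likely lost */
    SKETCH_ACK_NEW         /* ACKs one or more new segments */
};

static enum sketch_ack_kind sketch_classify_ack(uint64_t ack_seq,
                                                uint64_t last_ack_rcvd)
{
    if (ack_seq < last_ack_rcvd) {
        return SKETCH_ACK_OLD;
    } else if (ack_seq == last_ack_rcvd) {
        return SKETCH_ACK_DUPLICATE;
    }
    return SKETCH_ACK_NEW;
}

/* ACKs are cumulative: a new ACK releases every segment in
 * (last_ack_rcvd, ack_seq]. */
static uint64_t sketch_segments_newly_acked(uint64_t ack_seq,
                                            uint64_t last_ack_rcvd)
{
    assert(ack_seq > last_ack_rcvd);
    return ack_seq - last_ack_rcvd;
}
```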
======================================
Reordering induced by two queues and piggy-backing:
ACKs can be reordered-
not an issue at all, old ACKs are simply ignored
Sends can be reordered-
(small send can jump far ahead of large sends)
large send followed by lots of small sends could trigger many retrans
of the large sends. smalls would have to be paced pretty precisely to
keep command queue empty enough and also beat out the large sends.
send credits limit how many larges can be queued on the sender, but there
could be many on the receiver
======================================
optim: round large buffer alloc up to cache line size?
optim: ompi_btl_usnic_endpoint_send_segment() could have stuff
removed, moved into shadow of send. ACK piggy-backing could be
broken in half, and some moved into shadow.
optim: inline ompi_btl_usnic_endpoint_send_segment
todo: move small send callback from progress to usnic_send
todo: PUTs do not need fragment IDs - each chunk can be standalone, completion
is detected on sender by last byte ACKed, not on receiver
todo: check warmup impact
todo: improve sender lookup mechanism
todo: RD WD size weirdness (e.g.:255 good, 256 bad)
todo: do not IBV_SEND_SIGNALED every time
todo: sf_size redundant with ack_bytes_left ?
todo: BW hole in -np 32 Exchange on 32 nodes
dip right at eager limit
exchange wants different higher limit...
todo: odd results with -np 16 Gather -npmin 16 on 16 nodes
something changes at 1024 bytes
todo: test with packet loss/reordering
todo: get QA running IMB with .DCHECK
todo: do our own 256 process .DCHECK run on 32 nodes
todo: registration cache w/o ummunotify
todo: reg cache with ummunotify
todo: thorough review of retransmission policy vs reordering
todo: progression thread?
todo: weird startup delay issue with periodic stats enabled
todo: maintaining verbs SG list and PML SG list in parallel is awkward
probably best to just fill in verbs SG list all at once at last possible
moment instead of piecemeal? or always use verbs SG internally?
or use compile-time wizardry to make OMPI SG list and verbs SG list
be byte compatible?
todo: update proc.c:match_modex() to use same kind of IP comparison as
in TCP BTL (i.e., subroutine-ize the TCP BTL comparison)
todo: implement "wide match" btl_usnic_if_in|exclude (i.e., let a
specified mask of 10.0.0.0/8 match an interface with
10.20.0.0/16).

174
ompi/mca/btl/usnic/btl_usnic.h Normal file

@ -0,0 +1,174 @@
/*
* Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2011-2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
/**
* @file
*/
#ifndef OMPI_BTL_USNIC_H
#define OMPI_BTL_USNIC_H
#include "ompi_config.h"
#include <sys/types.h>
#include <infiniband/verbs.h>
#include "opal/class/opal_hash_table.h"
#include "opal/mca/event/event.h"
#include "ompi/class/ompi_free_list.h"
#include "ompi/mca/btl/btl.h"
#include "ompi/mca/btl/base/btl_base_error.h"
#include "ompi/mca/btl/base/base.h"
#include "ompi/mca/mpool/grdma/mpool_grdma.h"
BEGIN_C_DECLS
/*
* We're simulating a clock as best we can without resorting to the
* system. The clock is used to defer ACKs, and ticks will be incremented
* when progression gets called. It could be incremented by different amounts
* at other times as needed or as tuning dictates.
*/
extern uint64_t usnic_ticks;
static inline uint64_t
get_nsec(void)
{
return usnic_ticks;
}
#define container_of(ptr, type, member) ( \
(type *)( ((char *)(ptr)) - offsetof(type,member) ))
/* Isolate code differences between v1.6, v1.7, etc. */
#define USNIC_OUT ompi_btl_base_framework.framework_output
/* Set to 1 to turn on a LOT of debugging output */
#define MSGDEBUG2 (MSGDEBUG1||0) /* temp */
#define MSGDEBUG1 (MSGDEBUG||0) /* temp */
#define MSGDEBUG 0
/* Set to 1 to get frag history */
#define HISTORY 0
/* Set to >0 to randomly drop received frags. The higher the number,
the more frequent the drops. */
#define WANT_RECV_FRAG_DROPS 0
/* Set to >0 to randomly fail to send an ACK, mimicking a lost ACK.
The higher the number, the more frequent the failed-to-send-ACK. */
#define WANT_FAIL_TO_SEND_ACK 0
/* Set to >0 to randomly fail to resend a frag (causing it to be
requeued to be sent later). The higher the number, the more frequent
the failed-to-resend-frag. */
#define WANT_FAIL_TO_RESEND_FRAG 0
#if WANT_RECV_FRAG_DROPS > 0
#define FAKE_RECV_FRAG_DROP (rand() < WANT_RECV_FRAG_DROPS)
#else
#define FAKE_RECV_FRAG_DROP 0
#endif
#if WANT_FAIL_TO_SEND_ACK > 0
#define FAKE_FAIL_TO_SEND_ACK (rand() < WANT_FAIL_TO_SEND_ACK)
#else
#define FAKE_FAIL_TO_SEND_ACK 0
#endif
#if WANT_FAIL_TO_RESEND_FRAG > 0
#define FAKE_FAIL_TO_RESEND_FRAG (rand() < WANT_FAIL_TO_RESEND_FRAG)
#else
#define FAKE_FAIL_TO_RESEND_FRAG 0
#endif
/**
* Cisco usNIC BTL component.
*/
typedef struct ompi_btl_usnic_component_t {
/** base BTL component */
mca_btl_base_component_2_0_0_t super;
/** Maximum number of BTL modules */
uint32_t max_modules;
/** Number of available/initialized BTL modules */
uint32_t num_modules;
char *if_include;
char *if_exclude;
uint32_t *vendor_part_ids;
/* Cached hashed version of my ORTE proc name (to stuff in
protocol headers) */
uint64_t my_hashed_orte_name;
/** array of available BTLs */
struct ompi_btl_usnic_module_t* usnic_modules;
/** list of usnic proc structures */
opal_list_t usnic_procs;
/** name of memory pool */
char* usnic_mpool_name;
/** Want stats? */
bool stats_enabled;
bool stats_relative;
int stats_frequency;
/** GID index to use */
int gid_index;
/** Whether we want to use NUMA distances to choose which usNIC
devices to use for short messages */
bool want_numa_device_assignment;
/** max send descriptors to post per module */
int32_t sd_num;
/** max receive descriptors per module */
int32_t rd_num;
/** max send/receive descriptors for priority channel */
int32_t prio_sd_num;
int32_t prio_rd_num;
/** max completion queue entries per module */
int32_t cq_num;
/** retrans characteristics */
int retrans_timeout;
} ompi_btl_usnic_component_t;
OMPI_MODULE_DECLSPEC extern ompi_btl_usnic_component_t mca_btl_usnic_component;
typedef mca_btl_base_recv_reg_t ompi_btl_usnic_recv_reg_t;
/**
* Size for sequence numbers (just to ensure we use the same size
* everywhere)
*/
typedef uint64_t ompi_btl_usnic_seq_t;
#define UDSEQ "lu"
/**
* Register the usnic BTL MCA params
*/
int ompi_btl_usnic_component_register(void);
END_C_DECLS
#endif

271
ompi/mca/btl/usnic/btl_usnic_ack.c Normal file

@ -0,0 +1,271 @@
/*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "ompi_config.h"
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include "opal/util/output.h"
#include "opal/class/opal_hotel.h"
#include "btl_usnic.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_endpoint.h"
#include "btl_usnic_module.h"
#include "btl_usnic_ack.h"
#include "btl_usnic_util.h"
#include "btl_usnic_send.h"
/*
* Force a retrans of a segment
*/
static void
ompi_btl_usnic_force_retrans(
ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_seq_t ack_seq)
{
ompi_btl_usnic_send_segment_t *sseg;
int is;
is = WINDOW_SIZE_MOD(ack_seq+1);
sseg = endpoint->endpoint_sent_segs[is];
if (sseg == NULL || sseg->ss_hotel_room == -1) {
return;
}
/* cancel retrans timer */
opal_hotel_checkout(&endpoint->endpoint_hotel, sseg->ss_hotel_room);
sseg->ss_hotel_room = -1;
/* Queue up this frag to be resent */
opal_list_append(&(endpoint->endpoint_module->pending_resend_segs),
&(sseg->ss_base.us_list.super));
++endpoint->endpoint_module->num_fast_retrans;
}
/*
* We have received an ACK for a given sequence number (either standalone
* or via piggy-back on a regular send)
*/
void
ompi_btl_usnic_handle_ack(
ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_seq_t ack_seq)
{
ompi_btl_usnic_seq_t is;
ompi_btl_usnic_send_segment_t *sseg;
ompi_btl_usnic_send_frag_t *frag;
ompi_btl_usnic_module_t *module;
uint32_t bytes_acked;
module = endpoint->endpoint_module;
/* ignore if this is an old ACK */
if (ack_seq < endpoint->endpoint_ack_seq_rcvd) {
#if MSGDEBUG
opal_output(0, "Got OLD DUP ACK seq %" UDSEQ " < %" UDSEQ "\n",
ack_seq, endpoint->endpoint_ack_seq_rcvd);
#endif
++module->num_old_dup_acks;
return;
/* A duplicate ACK means next seg was lost */
} else if (ack_seq == endpoint->endpoint_ack_seq_rcvd) {
++module->num_dup_acks;
ompi_btl_usnic_force_retrans(endpoint, ack_seq);
return;
}
/* Does this ACK have a new sequence number that we haven't
seen before? */
for (is = endpoint->endpoint_ack_seq_rcvd + 1; is <= ack_seq; ++is) {
sseg = endpoint->endpoint_sent_segs[WINDOW_SIZE_MOD(is)];
#if MSGDEBUG1
opal_output(0, " Checking ACK/sent_segs window %p, index %lu, seq %lu, occupied=%p, seg_room=%d",
(void*) endpoint->endpoint_sent_segs,
WINDOW_SIZE_MOD(is), is, sseg, (sseg?sseg->ss_hotel_room:-2));
#endif
assert(sseg != NULL);
assert(sseg->ss_base.us_btl_header->seq == is);
#if MSGDEBUG
if (sseg->ss_hotel_room == -1) {
opal_output(0, "=== ACKed frag in sent_frags array is not in hotel/enqueued, module %p, endpoint %p, seg %p, seq %" UDSEQ ", slot %lu",
(void*) module, (void*) endpoint,
(void*) sseg, is, WINDOW_SIZE_MOD(is));
}
#endif
/* Check the sending segment out from the hotel. NOTE: The
segment might not actually be in a hotel room if it has
already been evicted and queued for resend.
If it's not in the hotel, don't check it out! */
if (OPAL_LIKELY(sseg->ss_hotel_room != -1)) {
opal_hotel_checkout(&endpoint->endpoint_hotel, sseg->ss_hotel_room);
sseg->ss_hotel_room = -1;
/* hotel_room == -1 means queued for resend, remove it */
} else {
opal_list_remove_item((&module->pending_resend_segs),
&sseg->ss_base.us_list.super);
}
/* update the owning fragment */
bytes_acked = sseg->ss_base.us_btl_header->payload_len;
frag = sseg->ss_parent_frag;
/* when no bytes left to ACK, fragment send is truly done */
frag->sf_ack_bytes_left -= bytes_acked;
#if MSGDEBUG1
opal_output(0, " ACKED seg %p, frag %p, ack_bytes=%d, left=%d\n",
sseg, frag, bytes_acked, frag->sf_ack_bytes_left);
#endif
/* perform completion callback for PUT here */
if (frag->sf_ack_bytes_left == 0 &&
frag->sf_base.uf_dst_seg[0].seg_addr.pval != NULL) {
#if MSGDEBUG1
opal_output(0, "Calling back %p for PUT completion, frag=%p\n",
frag->sf_base.uf_base.des_cbfunc, frag);
#endif
frag->sf_base.uf_base.des_cbfunc(&module->super, frag->sf_endpoint,
&frag->sf_base.uf_base, OMPI_SUCCESS);
}
/* OK to return this fragment? */
ompi_btl_usnic_send_frag_return_cond(module, frag);
/* free this segment */
sseg->ss_ack_pending = false;
if (sseg->ss_base.us_type == OMPI_BTL_USNIC_SEG_CHUNK &&
sseg->ss_send_posted == 0) {
ompi_btl_usnic_chunk_segment_return(module, sseg);
}
/* indicate this segment has been ACKed */
endpoint->endpoint_sent_segs[WINDOW_SIZE_MOD(is)] = NULL;
}
/* update ACK received */
endpoint->endpoint_ack_seq_rcvd = ack_seq;
/* send window may have opened, possibly make endpoint ready-to-send */
ompi_btl_usnic_check_rts(endpoint);
}
/*
* Send an ACK
*/
void
ompi_btl_usnic_ack_send(
ompi_btl_usnic_module_t *module,
ompi_btl_usnic_endpoint_t *endpoint)
{
ompi_btl_usnic_ack_segment_t *ack;
#if MSGDEBUG1
uint8_t mac[6];
char src_mac[32];
char dest_mac[32];
#endif
/* Get an ACK frag. Failing to get one is currently treated as fatal. */
ack = ompi_btl_usnic_ack_segment_alloc(module);
if (OPAL_UNLIKELY(NULL == ack)) {
opal_output(0, "====================== No frag for sending the ACK -- skipped");
abort();
}
/* ACK the highest sequence number for which all preceding
segments have been received */
ack->ss_base.us_btl_header->ack_seq =
endpoint->endpoint_next_contig_seq_to_recv - 1;
ack->ss_base.us_sg_entry[0].length =
sizeof(ompi_btl_usnic_btl_header_t);
#if MSGDEBUG1
memset(src_mac, 0, sizeof(src_mac));
memset(dest_mac, 0, sizeof(dest_mac));
ompi_btl_usnic_sprintf_mac(src_mac, module->if_mac);
ompi_btl_usnic_gid_to_mac(&endpoint->endpoint_remote_addr.gid, mac);
ompi_btl_usnic_sprintf_mac(dest_mac, mac);
opal_output(0, "--> Sending ACK, sg_entry length %d, seq %" UDSEQ " to %s, qp %u",
ack->ss_base.us_sg_entry[0].length,
ack->ss_base.us_btl_header->ack_seq, dest_mac,
endpoint->endpoint_remote_addr.qp_num[ack->ss_channel]);
#endif
/* send the ACK */
ompi_btl_usnic_post_segment(module, endpoint, ack);
/* Stats */
++module->num_ack_sends;
return;
}
/*
* Sending an ACK has completed, return the segment to the free list
*/
void
ompi_btl_usnic_ack_complete(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_ack_segment_t *ack)
{
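/* Credit the send WQE back to the channel before returning the
segment, so the segment is not read after it is on the free list */
++module->mod_channels[ack->ss_channel].sd_wqe;
ompi_btl_usnic_ack_segment_return(module, ack);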
}
/*****************************************************************************/
/*
* Callback for when a send times out without receiving a
* corresponding ACK.
*/
void
ompi_btl_usnic_ack_timeout(
opal_hotel_t *hotel,
int room_num,
void *occupant)
{
ompi_btl_usnic_send_segment_t *seg;
ompi_btl_usnic_endpoint_t *endpoint;
ompi_btl_usnic_module_t *module;
seg = (ompi_btl_usnic_send_segment_t*) occupant;
endpoint = seg->ss_parent_frag->sf_endpoint;
module = endpoint->endpoint_module;
#if MSGDEBUG1
{
static int num_timeouts = 0;
opal_output(0, "Send timeout! seg %p, room %d, seq %" UDSEQ "\n",
seg, seg->ss_hotel_room,
seg->ss_base.us_btl_header->seq);
}
#endif
/* timeout checks us out, note this */
seg->ss_hotel_room = -1;
/* Queue up this frag to be resent */
opal_list_append(&(module->pending_resend_segs),
&(seg->ss_base.us_list.super));
/* Stats */
++module->num_timeout_retrans;
}

67
ompi/mca/btl/usnic/btl_usnic_ack.h Normal file

@ -0,0 +1,67 @@
/*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_USNIC_ACK_H
#define BTL_USNIC_ACK_H
#include "ompi_config.h"
#include "opal/class/opal_hotel.h"
#include "btl_usnic.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_endpoint.h"
/*
* Reap an ACK send that is complete
*/
void ompi_btl_usnic_ack_complete(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_ack_segment_t *ack);
/*
* Send an ACK
*/
void ompi_btl_usnic_ack_send(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_endpoint_t *endpoint);
/*
* Callback for when a send times out without receiving a
* corresponding ACK
*/
void ompi_btl_usnic_ack_timeout(opal_hotel_t *hotel, int room_num,
void *occupant);
/*
* Handle an incoming ACK
*/
void ompi_btl_usnic_handle_ack(ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_seq_t ack_seq);
static inline void
ompi_btl_usnic_piggyback_ack(
ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_send_segment_t *sseg)
{
/* If ACK is needed, piggy-back it here and send it on */
if (endpoint->endpoint_ack_needed) {
ompi_btl_usnic_remove_from_endpoints_needing_ack(endpoint);
sseg->ss_base.us_btl_header->ack_seq =
endpoint->endpoint_next_contig_seq_to_recv - 1;
#if MSGDEBUG1
opal_output(0, "Piggy-backing ACK for sequence %" UDSEQ "\n",
sseg->ss_base.us_btl_header->ack_seq);
#endif
} else {
sseg->ss_base.us_btl_header->ack_seq = 0;
}
}
#endif /* BTL_USNIC_ACK_H */

1218
ompi/mca/btl/usnic/btl_usnic_component.c Normal file

File diff suppressed because it is too large.

172
ompi/mca/btl/usnic/btl_usnic_endpoint.c Normal file

@ -0,0 +1,172 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2007 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "ompi_config.h"
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include "opal/prefetch.h"
#include "orte/util/show_help.h"
#include "ompi/types.h"
#include "btl_usnic.h"
#include "btl_usnic_endpoint.h"
#include "btl_usnic_module.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_proc.h"
#include "btl_usnic_util.h"
#include "btl_usnic_ack.h"
#include "btl_usnic_send.h"
/*
* Construct/destruct an endpoint structure.
*/
static void endpoint_construct(mca_btl_base_endpoint_t* endpoint)
{
int i;
endpoint->endpoint_module = NULL;
endpoint->endpoint_proc = NULL;
endpoint->endpoint_proc_index = -1;
endpoint->endpoint_exiting = false;
for (i=0; i<USNIC_NUM_CHANNELS; ++i) {
endpoint->endpoint_remote_addr.qp_num[i] = 0;
}
endpoint->endpoint_remote_addr.gid.global.subnet_prefix = 0;
endpoint->endpoint_remote_addr.gid.global.interface_id = 0;
endpoint->endpoint_remote_ah = NULL;
endpoint->endpoint_send_credits = 8;
/* list of fragments queued to be sent */
OBJ_CONSTRUCT(&endpoint->endpoint_frag_send_queue, opal_list_t);
endpoint->endpoint_next_seq_to_send = 10;
endpoint->endpoint_ack_seq_rcvd = 9;
endpoint->endpoint_next_contig_seq_to_recv = 10;
endpoint->endpoint_highest_seq_rcvd = 9;
endpoint->endpoint_next_frag_id = 1;
endpoint->endpoint_acktime = 0;
endpoint->endpoint_rfstart = endpoint->endpoint_next_contig_seq_to_recv;
/* endpoint starts not-ready-to-send */
endpoint->endpoint_ready_to_send = 0;
endpoint->endpoint_ack_needed = false;
/* clear sent/received sequence number array */
memset(endpoint->endpoint_sent_segs, 0,
sizeof(endpoint->endpoint_sent_segs));
memset(endpoint->endpoint_rcvd_segs, 0,
sizeof(endpoint->endpoint_rcvd_segs));
/*
* Make a new OPAL hotel for this endpoint.
* The "hotel" is a construct used for triggering segment retransmission
* due to timeout.
*/
OBJ_CONSTRUCT(&endpoint->endpoint_hotel, opal_hotel_t);
opal_hotel_init(&endpoint->endpoint_hotel,
WINDOW_SIZE,
mca_btl_usnic_component.retrans_timeout,
0,
ompi_btl_usnic_ack_timeout);
/* Setup this endpoint's list links */
OBJ_CONSTRUCT(&(endpoint->endpoint_ack_li), opal_list_item_t);
OBJ_CONSTRUCT(&(endpoint->endpoint_endpoint_li), opal_list_item_t);
endpoint->endpoint_ack_needed = false;
/* fragment reassembly info */
endpoint->endpoint_rx_frag_info =
calloc(MAX_ACTIVE_FRAGS, sizeof(struct ompi_btl_usnic_rx_frag_info_t));
assert(NULL != endpoint->endpoint_rx_frag_info);
if (OPAL_UNLIKELY(endpoint->endpoint_rx_frag_info == NULL)) {
BTL_ERROR(("calloc returned NULL -- this should not happen!"));
ompi_btl_usnic_exit();
/* Does not return */
}
}
static void endpoint_destruct(mca_btl_base_endpoint_t* endpoint)
{
ompi_btl_usnic_proc_t *proc;
OBJ_DESTRUCT(&(endpoint->endpoint_ack_li));
OBJ_DESTRUCT(&(endpoint->endpoint_endpoint_li));
if (endpoint->endpoint_hotel.rooms != NULL) {
OBJ_DESTRUCT(&(endpoint->endpoint_hotel));
}
OBJ_DESTRUCT(&endpoint->endpoint_frag_send_queue);
/* release owning proc (NULL if the endpoint was never attached to one) */
proc = endpoint->endpoint_proc;
if (NULL != proc) {
proc->proc_endpoints[endpoint->endpoint_proc_index] = NULL;
OBJ_RELEASE(proc);
}
free(endpoint->endpoint_rx_frag_info);
if (NULL != endpoint->endpoint_remote_ah) {
ibv_destroy_ah(endpoint->endpoint_remote_ah);
}
}
OBJ_CLASS_INSTANCE(ompi_btl_usnic_endpoint_t,
opal_list_item_t,
endpoint_construct,
endpoint_destruct);
/*
* Forcibly drain all pending output on an endpoint, without waiting for
* actual completion.
*/
void
ompi_btl_usnic_flush_endpoint(
ompi_btl_usnic_endpoint_t *endpoint)
{
ompi_btl_usnic_send_frag_t *frag;
/* First, free all pending fragments */
while (!opal_list_is_empty(&endpoint->endpoint_frag_send_queue)) {
frag = (ompi_btl_usnic_send_frag_t *)opal_list_remove_first(
&endpoint->endpoint_frag_send_queue);
/* _cond still needs to check ownership, but make sure the
* fragment is marked as done.
*/
frag->sf_ack_bytes_left = 0;
frag->sf_seg_post_cnt = 0;
ompi_btl_usnic_send_frag_return_cond(endpoint->endpoint_module, frag);
}
/* Now, ACK everything that is pending */
ompi_btl_usnic_handle_ack(endpoint, endpoint->endpoint_next_seq_to_send-1);
}
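
The endpoint constructor above creates an opal_hotel so that unacknowledged segments are retransmitted after a timeout. The idea can be sketched with a toy model. This is not the opal_hotel API: every `demo_` name below is hypothetical, time is a plain integer supplied by the caller, and the real code may differ in how it schedules evictions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ROOMS 4   /* hypothetical; the BTL sizes its hotel by WINDOW_SIZE */

/* Called for each occupant whose stay timed out (i.e. no ACK arrived) */
typedef void (*demo_evict_fn)(void *occupant);

typedef struct demo_hotel {
    void *occupant[DEMO_ROOMS];    /* NULL means the room is free */
    uint64_t deadline[DEMO_ROOMS]; /* eviction time for each room */
    uint64_t timeout;              /* how long an occupant may stay */
    demo_evict_fn on_evict;
} demo_hotel_t;

static void demo_hotel_init(demo_hotel_t *h, uint64_t timeout,
                            demo_evict_fn on_evict)
{
    for (int i = 0; i < DEMO_ROOMS; ++i) {
        h->occupant[i] = NULL;
        h->deadline[i] = 0;
    }
    h->timeout = timeout;
    h->on_evict = on_evict;
}

/* Check a sent segment in; returns the room number, or -1 if full */
static int demo_hotel_checkin(demo_hotel_t *h, void *occupant, uint64_t now)
{
    for (int i = 0; i < DEMO_ROOMS; ++i) {
        if (NULL == h->occupant[i]) {
            h->occupant[i] = occupant;
            h->deadline[i] = now + h->timeout;
            return i;
        }
    }
    return -1;
}

/* The ACK arrived in time: leave without triggering the callback */
static void demo_hotel_checkout(demo_hotel_t *h, int room)
{
    h->occupant[room] = NULL;
}

/* Evict every occupant whose deadline has passed; returns the count */
static int demo_hotel_tick(demo_hotel_t *h, uint64_t now)
{
    int evicted = 0;
    for (int i = 0; i < DEMO_ROOMS; ++i) {
        if (NULL != h->occupant[i] && now >= h->deadline[i]) {
            void *occupant = h->occupant[i];
            h->occupant[i] = NULL;
            h->on_evict(occupant);   /* e.g. queue a retransmission */
            ++evicted;
        }
    }
    return evicted;
}

/* Example callback: the occupant is an int counting retransmissions */
static void demo_count_retransmit(void *occupant)
{
    ++*(int *)occupant;
}
```

In this sketch a segment that is checked out before its deadline never reaches the callback, while one that stays too long is evicted and handed to the callback, which is where a retransmission would be scheduled.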

182
ompi/mca/btl/usnic/btl_usnic_endpoint.h Normal file

@ -0,0 +1,182 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2006 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef OMPI_BTL_USNIC_ENDPOINT_H
#define OMPI_BTL_USNIC_ENDPOINT_H
#include <infiniband/verbs.h>
#include "opal/class/opal_list.h"
#include "opal/class/opal_hotel.h"
#include "opal/mca/event/event.h"
#include "btl_usnic.h"
BEGIN_C_DECLS
/*
* Forward declarations to avoid include loops
*/
struct ompi_btl_usnic_module_t;
struct ompi_btl_usnic_send_segment_t;
/*
* Have the window size as a compile-time constant that is a power of
* two so that we can take advantage of fast bit operations.
*/
#define WINDOW_SIZE 4096
#define WINDOW_SIZE_MOD(a) (((a) & (WINDOW_SIZE - 1)))
#define WINDOW_OPEN(E) ((E)->endpoint_next_seq_to_send < \
((E)->endpoint_ack_seq_rcvd + WINDOW_SIZE))
#define WINDOW_EMPTY(E) ((E)->endpoint_ack_seq_rcvd == \
((E)->endpoint_next_seq_to_send-1))
/*
* Returns true when an endpoint has nothing left to send
*/
#define ENDPOINT_DRAINED(E) (WINDOW_EMPTY(E) && \
opal_list_is_empty(&(E)->endpoint_frag_send_queue))
/*
* Channel IDs
*/
typedef enum ompi_btl_usnic_channel_id_t {
USNIC_PRIORITY_CHANNEL,
USNIC_DATA_CHANNEL,
USNIC_NUM_CHANNELS
} ompi_btl_usnic_channel_id_t;
typedef struct ompi_btl_usnic_addr_t {
uint32_t qp_num[USNIC_NUM_CHANNELS];
union ibv_gid gid;
uint32_t ipv4_addr;
uint32_t cidrmask;
uint8_t mac[6];
int mtu;
} ompi_btl_usnic_addr_t;
struct ompi_btl_usnic_send_segment_t;
struct ompi_btl_usnic_proc_t;
/*
* This is a descriptor for an incoming fragment that is broken
* into chunks. When the first reference to this frag_id is seen,
* memory is allocated for it. When the last byte arrives, the assembled
* fragment is passed to the PML.
*
* The endpoint structure has space for WINDOW_SIZE/2 simultaneous fragments.
* This is the largest number of fragments that can possibly be in-flight
* to us from a particular endpoint because each chunked fragment will occupy
* at least two segments, and only WINDOW_SIZE segments can be in flight.
* OK, so there is an extremely pathological case where we could see
* (WINDOW_SIZE/2)+1 "in flight" at once, but just dropping that last one
* and waiting for retrans is just fine in this hypothetical hyper-pathological
* case, which is what we'll do.
*/
#define MAX_ACTIVE_FRAGS (WINDOW_SIZE/2)
typedef struct ompi_btl_usnic_rx_frag_info_t {
uint32_t rfi_frag_id; /* ID for this fragment */
uint32_t rfi_frag_size; /* bytes in this fragment */
uint32_t rfi_bytes_left; /* bytes remaining to RX in fragment */
char *rfi_data; /* pointer to assembly area */
int rfi_data_pool; /* if 0, data malloced, else rx buf pool */
} ompi_btl_usnic_rx_frag_info_t;
/**
* An abstraction that represents a connection to a remote process.
* An instance of mca_btl_base_endpoint_t is associated with each
* (btl_usnic_proc_t, btl_usnic_module_t) tuple and address
* information is exchanged at startup. The usnic BTL is
* connectionless, so no connection is ever established.
*/
typedef struct mca_btl_base_endpoint_t {
opal_list_item_t super;
/** BTL module that created this connection */
struct ompi_btl_usnic_module_t *endpoint_module;
/** proc that owns this endpoint */
struct ompi_btl_usnic_proc_t *endpoint_proc;
int endpoint_proc_index; /* index in owning proc's endpoint array */
/** True when proc has been deleted, but we still have sends that need ACKs */
bool endpoint_exiting;
/** List item for linking into module "all_endpoints" */
opal_list_item_t endpoint_endpoint_li;
/** List item for linking into "need ack" */
opal_list_item_t endpoint_ack_li;
/** Remote address information */
ompi_btl_usnic_addr_t endpoint_remote_addr;
/** Remote address handle */
struct ibv_ah* endpoint_remote_ah;
/** Send-related data */
bool endpoint_ready_to_send;
opal_list_t endpoint_frag_send_queue;
int32_t endpoint_send_credits;
uint32_t endpoint_next_frag_id;
/** Receive-related data */
struct ompi_btl_usnic_rx_frag_info_t *endpoint_rx_frag_info;
/** OPAL hotel to track outstanding sends */
opal_hotel_t endpoint_hotel;
/** Sliding window parameters for this peer */
/* Values for the current proc to send to this endpoint on the
peer proc */
ompi_btl_usnic_seq_t endpoint_next_seq_to_send; /* n_t */
ompi_btl_usnic_seq_t endpoint_ack_seq_rcvd; /* n_a */
struct ompi_btl_usnic_send_segment_t *endpoint_sent_segs[WINDOW_SIZE];
/* Values for the current proc to receive from this endpoint on
the peer proc */
bool endpoint_ack_needed;
/* When we receive a packet that needs an ACK, set this
* to delay the ACK to allow for piggybacking
*/
uint64_t endpoint_acktime;
ompi_btl_usnic_seq_t endpoint_next_contig_seq_to_recv; /* n_r */
ompi_btl_usnic_seq_t endpoint_highest_seq_rcvd; /* n_s */
bool endpoint_rcvd_segs[WINDOW_SIZE];
uint32_t endpoint_rfstart;
} mca_btl_base_endpoint_t;
typedef mca_btl_base_endpoint_t ompi_btl_usnic_endpoint_t;
OBJ_CLASS_DECLARATION(ompi_btl_usnic_endpoint_t);
/*
* Flush all pending sends and resends from an endpoint
*/
void
ompi_btl_usnic_flush_endpoint(
ompi_btl_usnic_endpoint_t *endpoint);
END_C_DECLS
#endif
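
The window macros in this header rely on WINDOW_SIZE being a power of two, so that a bit mask can replace the modulo. A minimal sketch of the same arithmetic follows; the struct and the `demo_` names are hypothetical stand-ins for the endpoint fields, and 64-bit sequence numbers are assumed, as the BTL header comment states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_WINDOW_SIZE 4096   /* must be a power of two */

/* Hypothetical stand-in for the sender-side endpoint fields */
typedef struct demo_endpoint {
    uint64_t next_seq_to_send;  /* n_t: next sequence number to assign */
    uint64_t ack_seq_rcvd;      /* n_a: last sequence number ACKed */
} demo_endpoint_t;

/* Slot in a DEMO_WINDOW_SIZE-entry array; the mask replaces "% size" */
static inline uint32_t demo_window_slot(uint64_t seq)
{
    return (uint32_t)(seq & (DEMO_WINDOW_SIZE - 1));
}

/* True while next_seq_to_send is less than one window ahead of the
 * last ACKed sequence number, i.e. another segment may be sent */
static inline bool demo_window_open(const demo_endpoint_t *e)
{
    return e->next_seq_to_send < e->ack_seq_rcvd + DEMO_WINDOW_SIZE;
}

/* True when every sequence number handed out so far has been ACKed */
static inline bool demo_window_empty(const demo_endpoint_t *e)
{
    return e->ack_seq_rcvd == e->next_seq_to_send - 1;
}
```

With `next_seq_to_send = 10` and `ack_seq_rcvd = 9` the window is both empty and open. Once `next_seq_to_send` reaches `ack_seq_rcvd + DEMO_WINDOW_SIZE` the window is closed until another ACK arrives.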

367
ompi/mca/btl/usnic/btl_usnic_frag.c Normal file

@ -0,0 +1,367 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "ompi_config.h"
#include <string.h>
#include "ompi/proc/proc.h"
#include "btl_usnic.h"
#include "btl_usnic_endpoint.h"
#include "btl_usnic_module.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_ack.h"
static void
common_send_seg_helper(
ompi_btl_usnic_send_segment_t *seg)
{
ompi_btl_usnic_segment_t *bseg;
bseg = &seg->ss_base;
bseg->us_btl_header = (ompi_btl_usnic_btl_header_t *)bseg->us_list.ptr;
bseg->us_btl_header->sender = mca_btl_usnic_component.my_hashed_orte_name;
/* build verbs work request descriptor */
seg->ss_send_desc.wr_id = (unsigned long) seg;
seg->ss_send_desc.sg_list = bseg->us_sg_entry;
seg->ss_send_desc.num_sge = 1;
seg->ss_send_desc.opcode = IBV_WR_SEND;
seg->ss_send_desc.next = NULL;
seg->ss_send_desc.send_flags = IBV_SEND_SIGNALED;
seg->ss_send_posted = 0;
seg->ss_ack_pending = false;
/* verbs SG entry, len will be filled in just before send */
bseg->us_sg_entry[0].addr = (unsigned long) bseg->us_btl_header;
}
static void
chunk_seg_constructor(
ompi_btl_usnic_send_segment_t *seg)
{
ompi_btl_usnic_segment_t *bseg;
bseg = &seg->ss_base;
bseg->us_type = OMPI_BTL_USNIC_SEG_CHUNK;
/* some more common initialization */
common_send_seg_helper(seg);
seg->ss_flags = 0;
/* payload starts next byte beyond BTL chunk header */
bseg->us_payload.raw = (uint8_t *)(bseg->us_btl_chunk_header + 1);
bseg->us_btl_header->payload_type = OMPI_BTL_USNIC_PAYLOAD_TYPE_CHUNK;
}
static void
frag_seg_constructor(
ompi_btl_usnic_send_segment_t *seg)
{
ompi_btl_usnic_segment_t *bseg;
bseg = &seg->ss_base;
bseg->us_type = OMPI_BTL_USNIC_SEG_FRAG;
/* some more common initialization */
common_send_seg_helper(seg);
seg->ss_flags = 0;
/* payload starts next byte beyond BTL header */
bseg->us_payload.raw = (uint8_t *)(bseg->us_btl_header + 1);
bseg->us_btl_header->payload_type = OMPI_BTL_USNIC_PAYLOAD_TYPE_FRAG;
}
static void
ack_seg_constructor(
ompi_btl_usnic_send_segment_t *ack)
{
ompi_btl_usnic_segment_t *bseg;
bseg = &ack->ss_base;
bseg->us_type = OMPI_BTL_USNIC_SEG_ACK;
/* some more common initialization */
common_send_seg_helper(ack);
/* ACK value embedded in BTL header */
bseg->us_btl_header->payload_type = OMPI_BTL_USNIC_PAYLOAD_TYPE_ACK;
bseg->us_btl_header->payload_len = 0;
/* length of the header itself, not of the pointer to it */
bseg->us_sg_entry[0].length = sizeof(*bseg->us_btl_header);
}
static void
recv_seg_constructor(
ompi_btl_usnic_recv_segment_t *seg)
{
ompi_btl_usnic_segment_t *bseg;
bseg = &seg->rs_base;
bseg->us_type = OMPI_BTL_USNIC_SEG_RECV;
/* on receive, BTL header starts after protocol header */
seg->rs_protocol_header = bseg->us_list.ptr;
bseg->us_btl_header =
(ompi_btl_usnic_btl_header_t *) (seg->rs_protocol_header + 1);
/* initialize verbs work request */
seg->rs_recv_desc.wr_id = (unsigned long) seg;
seg->rs_recv_desc.sg_list = bseg->us_sg_entry;
seg->rs_recv_desc.num_sge = 1;
/* verbs SG entry, len filled in by caller b/c we don't have value */
bseg->us_sg_entry[0].addr = (unsigned long) seg->rs_protocol_header;
/* initialize mca descriptor */
seg->rs_desc.des_dst = &seg->rs_segment;
seg->rs_desc.des_dst_cnt = 1;
seg->rs_desc.des_src = NULL;
seg->rs_desc.des_src_cnt = 0;
/* The PML wants to see its header.
* This pointer is only correct for incoming segments of type
* OMPI_BTL_USNIC_PAYLOAD_TYPE_FRAG, but that's the only time
* we ever give a segment directly to the PML, so it's OK.
*/
bseg->us_payload.pml_header = (mca_btl_base_header_t *)
(bseg->us_btl_header+1);
seg->rs_segment.seg_addr.pval = bseg->us_payload.pml_header;
}
static void
send_frag_constructor(ompi_btl_usnic_send_frag_t *frag)
{
mca_btl_base_descriptor_t *desc;
/* Fill in source descriptor */
desc = &frag->sf_base.uf_base;
desc->des_src = frag->sf_base.uf_src_seg;
desc->des_src_cnt = 2;
desc->des_dst = frag->sf_base.uf_dst_seg;
desc->des_dst_cnt = 0;
desc->order = MCA_BTL_NO_ORDER;
desc->des_flags = 0;
frag->sf_seg_post_cnt = 0;
}
static void
small_send_frag_constructor(ompi_btl_usnic_small_send_frag_t *frag)
{
ompi_btl_usnic_frag_segment_t *fseg;
send_frag_constructor(&frag->ssf_base);
/* construct the embedded segment */
fseg = &frag->ssf_segment;
fseg->ss_base.us_list.ptr = frag->ssf_base.sf_base.uf_base.super.ptr;
frag_seg_constructor(fseg);
/* set us as parent in dedicated frag */
fseg->ss_parent_frag = (struct ompi_btl_usnic_send_frag_t *)frag;
frag->ssf_base.sf_base.uf_type = OMPI_BTL_USNIC_FRAG_SMALL_SEND;
/* save data pointer for PML */
frag->ssf_base.sf_base.uf_src_seg[0].seg_addr.pval =
fseg->ss_base.us_payload.raw;
}
static void
large_send_frag_constructor(ompi_btl_usnic_large_send_frag_t *lfrag)
{
send_frag_constructor(&lfrag->lsf_base);
lfrag->lsf_base.sf_base.uf_type = OMPI_BTL_USNIC_FRAG_LARGE_SEND;
/* save data pointer for PML */
lfrag->lsf_base.sf_base.uf_src_seg[0].seg_addr.pval =
&lfrag->lsf_pml_header;
}
static void
put_dest_frag_constructor(ompi_btl_usnic_put_dest_frag_t *pfrag)
{
pfrag->uf_type = OMPI_BTL_USNIC_FRAG_PUT_DEST;
/* point dest to our utility segment */
pfrag->uf_base.des_dst = pfrag->uf_dst_seg;
pfrag->uf_base.des_dst_cnt = 1;
}
OBJ_CLASS_INSTANCE(ompi_btl_usnic_segment_t,
ompi_free_list_item_t,
NULL,
NULL);
OBJ_CLASS_INSTANCE(ompi_btl_usnic_frag_segment_t,
ompi_btl_usnic_segment_t,
frag_seg_constructor,
NULL);
OBJ_CLASS_INSTANCE(ompi_btl_usnic_chunk_segment_t,
ompi_btl_usnic_segment_t,
chunk_seg_constructor,
NULL);
OBJ_CLASS_INSTANCE(ompi_btl_usnic_recv_segment_t,
ompi_btl_usnic_segment_t,
recv_seg_constructor,
NULL);
OBJ_CLASS_INSTANCE(ompi_btl_usnic_ack_segment_t,
ompi_btl_usnic_segment_t,
ack_seg_constructor,
NULL);
/*
* Fragments
*/
OBJ_CLASS_INSTANCE(ompi_btl_usnic_frag_t,
mca_btl_base_descriptor_t,
NULL,
NULL);
OBJ_CLASS_INSTANCE(ompi_btl_usnic_send_frag_t,
ompi_btl_usnic_frag_t,
NULL,
NULL);
OBJ_CLASS_INSTANCE(ompi_btl_usnic_large_send_frag_t,
ompi_btl_usnic_send_frag_t,
large_send_frag_constructor,
NULL);
OBJ_CLASS_INSTANCE(ompi_btl_usnic_small_send_frag_t,
ompi_btl_usnic_send_frag_t,
small_send_frag_constructor,
NULL);
OBJ_CLASS_INSTANCE(ompi_btl_usnic_put_dest_frag_t,
ompi_btl_usnic_frag_t,
put_dest_frag_constructor,
NULL);
/*******************************************************************************/
#if MSGDEBUG
static void dump_ack_frag(ompi_btl_usnic_frag_t* frag)
{
char out[256];
memset(out, 0, sizeof(out));
snprintf(out, sizeof(out),
"=== ACK frag %p (MCW %d): alloced %d",
(void*) frag,
ompi_proc_local()->proc_name.vpid,
FRAG_STATE_ISSET(frag, FRAG_ALLOCED));
opal_output(0, out);
}
static void dump_send_frag(ompi_btl_usnic_frag_t* frag)
{
char out[256];
memset(out, 0, sizeof(out));
snprintf(out, sizeof(out),
"=== SEND frag %p (MCW %d): alloced %d send_wr %d acked %d enqueued %d pml_callback %d hotel %d || seq %lu",
(void*) frag,
ompi_proc_local()->proc_name.vpid,
FRAG_STATE_ISSET(frag, FRAG_ALLOCED),
frag->send_wr_posted,
FRAG_STATE_ISSET(frag, FRAG_SEND_ACKED),
FRAG_STATE_ISSET(frag, FRAG_SEND_ENQUEUED),
FRAG_STATE_ISSET(frag, FRAG_PML_CALLED_BACK),
FRAG_STATE_ISSET(frag, FRAG_IN_HOTEL),
FRAG_STATE_ISSET(frag, FRAG_ALLOCED) ?
frag->btl_header->seq : (ompi_btl_usnic_seq_t) ~0
);
opal_output(0, out);
}
static void dump_recv_frag(ompi_btl_usnic_frag_t* frag)
{
char out[256];
memset(out, 0, sizeof(out));
snprintf(out, sizeof(out),
"=== RECV frag %p (MCW %d): alloced %d posted %d",
(void*) frag,
ompi_proc_local()->proc_name.vpid,
FRAG_STATE_ISSET(frag, FRAG_ALLOCED),
FRAG_STATE_ISSET(frag, FRAG_RECV_WR_POSTED));
opal_output(0, out);
}
void ompi_btl_usnic_frag_dump(ompi_btl_usnic_frag_t *frag)
{
switch(frag->type) {
case OMPI_BTL_USNIC_FRAG_ACK:
dump_ack_frag(frag);
break;
case OMPI_BTL_USNIC_FRAG_SEND:
dump_send_frag(frag);
break;
case OMPI_BTL_USNIC_FRAG_RECV:
dump_recv_frag(frag);
break;
default:
opal_output(0, "=== UNKNOWN type frag %p: (!)", (void*) frag);
break;
}
}
#endif
/*******************************************************************************/
#if HISTORY
void ompi_btl_usnic_frag_history(ompi_btl_usnic_frag_t *frag,
char *file, int line,
const char *message)
{
int i = frag->history_next;
ompi_btl_usnic_frag_history_t *h = &(frag->history[i]);
memset(h, 0, sizeof(*h));
/* leave room for the terminating NUL provided by the memset above */
strncpy(h->file, file, sizeof(h->file) - 1);
h->line = line;
strncpy(h->message, message, sizeof(h->message) - 1);
frag->history_next = (frag->history_next + 1) % NUM_FRAG_HISTORY;
if (frag->history_start == frag->history_next) {
frag->history_start = (frag->history_start + 1) % NUM_FRAG_HISTORY;
}
}
#endif
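
The segment constructors in this file locate the payload with `header + 1`, meaning the first byte past the header structure. A small sketch of that layout is given here. The header type is hypothetical and much simpler than the real usnic BTL header, and the buffer is assumed to be suitably aligned for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified header; not the real usnic wire format */
typedef struct demo_header {
    uint64_t sender;
    uint16_t payload_len;
    uint8_t  payload_type;
    uint8_t  padding;
} demo_header_t;

/* The payload begins at the first byte past the header:
 * pointer arithmetic on demo_header_t* advances by sizeof(demo_header_t) */
static inline uint8_t *demo_payload(demo_header_t *hdr)
{
    return (uint8_t *)(hdr + 1);
}

/* Write header followed by payload into buf.
 * Returns the number of bytes to put on the wire, or 0 if buf is too small. */
static size_t demo_build_packet(void *buf, size_t buf_len,
                                const void *data, uint16_t data_len)
{
    demo_header_t *hdr = (demo_header_t *)buf;

    if (buf_len < sizeof(*hdr) + data_len) {
        return 0;
    }
    memset(hdr, 0, sizeof(*hdr));
    hdr->payload_len = data_len;
    memcpy(demo_payload(hdr), data, data_len);

    /* sizeof(*hdr) is the header size; sizeof(hdr) would only be the
     * size of a pointer */
    return sizeof(*hdr) + data_len;
}
```

The returned length is what a scatter/gather entry covering both header and payload would need to carry in this simplified model.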

513
ompi/mca/btl/usnic/btl_usnic_frag.h Normal file

@ -0,0 +1,513 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2006 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef OMPI_BTL_USNIC_FRAG_H
#define OMPI_BTL_USNIC_FRAG_H
#define OMPI_BTL_USNIC_FRAG_ALIGN (8)
#include <infiniband/verbs.h>
#include "btl_usnic.h"
#include "btl_usnic_module.h"
BEGIN_C_DECLS
/*
* Forward declarations to avoid include loops
*/
struct ompi_btl_usnic_module_t;
/*
* Some definitions:
* frag - what the PML later hands us to send, may be large or small
* segment - one packet on the wire
* chunk - when a fragment is too big to fit into one segment, it is
* broken into chunks, each chunk fitting in one segment
*/
/**
* Fragment types
* The PML may give us very large "fragments" to send, larger than
* an MTU. We break fragments into segments for sending, a segment being
* defined to fit within an MTU.
*/
typedef enum {
OMPI_BTL_USNIC_FRAG_LARGE_SEND,
OMPI_BTL_USNIC_FRAG_SMALL_SEND,
OMPI_BTL_USNIC_FRAG_PUT_DEST
} ompi_btl_usnic_frag_type_t;
typedef enum {
OMPI_BTL_USNIC_SEG_ACK,
OMPI_BTL_USNIC_SEG_FRAG,
OMPI_BTL_USNIC_SEG_CHUNK,
OMPI_BTL_USNIC_SEG_RECV
} ompi_btl_usnic_seg_type_t;
typedef struct ompi_btl_usnic_reg_t {
mca_mpool_base_registration_t base;
struct ibv_mr* mr;
} ompi_btl_usnic_reg_t;
/*
* Header that is the beginning of every usnic packet buffer.
*/
typedef struct {
/* Verbs UD Global Routing Header (GRH), which appears on the
receiving side only. */
struct ibv_grh grh;
} ompi_btl_usnic_protocol_header_t;
/**
* usnic header type
*/
typedef enum {
OMPI_BTL_USNIC_PAYLOAD_TYPE_ACK = 1,
OMPI_BTL_USNIC_PAYLOAD_TYPE_FRAG = 2, /* an entire PML fragment */
OMPI_BTL_USNIC_PAYLOAD_TYPE_CHUNK = 3 /* one chunk of PML frag */
} ompi_btl_usnic_payload_type_t;
/**
* BTL header that goes after the protocol header. Since this is not
* a stream, we can put the fields in whatever order leaves the fewest
* padding holes.
*/
typedef struct {
/* Hashed ORTE process name of the sender */
uint64_t sender;
/* Sliding window sequence number (echoed back in an ACK). This
is 64 bits. */
ompi_btl_usnic_seq_t seq;
ompi_btl_usnic_seq_t ack_seq; /* for piggy-backing ACKs */
/* payload length (in bytes). We unfortunately have to include
this in our header because the L2 layer may artificially inflate
the length of the packet to meet a minimum size */
uint16_t payload_len;
/* If this is an emulated PUT, store at this address on receiver */
char *put_addr;
/* Type of BTL header (see enum, above) */
uint8_t payload_type;
/* Yuck */
uint8_t padding;
} ompi_btl_usnic_btl_header_t;
/**
* BTL header for a chunk of a fragment
*/
typedef struct {
ompi_btl_usnic_btl_header_t ch_hdr;
uint32_t ch_frag_id; /* ID for collecting segments of same frag */
uint32_t ch_frag_size; /* total frag len */
uint32_t ch_frag_offset; /* where in fragment this goes */
} ompi_btl_usnic_btl_chunk_header_t;
/*
* Enums for the states of frags
*/
typedef enum {
/* Frag states: all frags */
FRAG_ALLOCED = 0x01,
/* Frag states: send frags */
FRAG_SEND_ACKED = 0x02,
FRAG_SEND_ENQUEUED = 0x04,
FRAG_PML_CALLED_BACK = 0x08,
FRAG_IN_HOTEL = 0x10,
/* Frag states: receive frags */
FRAG_RECV_WR_POSTED = 0x40,
FRAG_MAX = 0xff
} ompi_btl_usnic_frag_state_flags_t;
/*
* Convenience macros for states
*/
#define FRAG_STATE_SET(frag, state) ((frag)->state_flags |= (state))
#define FRAG_STATE_CLR(frag, state) ((frag)->state_flags &= ~(state))
#define FRAG_STATE_GET(frag, state) ((frag)->state_flags & (state))
#define FRAG_STATE_ISSET(frag, state) (((frag)->state_flags & (state)) != 0)
/**
* Descriptor for a common segment. This is exactly one packet and may
* be send or receive
*/
typedef struct ompi_btl_usnic_segment_t {
ompi_free_list_item_t us_list;
ompi_btl_usnic_seg_type_t us_type;
/* allow for 2 SG entries */
struct ibv_sge us_sg_entry[2];
/* header for chunked frag is different */
union {
ompi_btl_usnic_btl_header_t *uus_btl_header;
ompi_btl_usnic_btl_chunk_header_t *uus_btl_chunk_header;
} us_hdr;
#define us_btl_header us_hdr.uus_btl_header
#define us_btl_chunk_header us_hdr.uus_btl_chunk_header
union {
uint8_t *raw;
mca_btl_base_header_t *pml_header;
} us_payload;
} ompi_btl_usnic_segment_t;
struct ompi_btl_usnic_endpoint_t;
/**
* Descriptor for a recv segment. This is exactly one packet and may
* be part of a large or small send or may be an ACK
*/
typedef struct ompi_btl_usnic_recv_segment_t {
ompi_btl_usnic_segment_t rs_base;
mca_btl_base_descriptor_t rs_desc;
mca_btl_base_segment_t rs_segment;
/* receive segments have protocol header prepended */
ompi_btl_usnic_protocol_header_t *rs_protocol_header;
ompi_btl_usnic_endpoint_t *rs_endpoint;
/* verbs recv desc */
struct ibv_recv_wr rs_recv_desc;
} ompi_btl_usnic_recv_segment_t;
/**
* Descriptor for a send segment. This is exactly one packet and may
* be part of a large or small send or may be an ACK
*/
typedef struct ompi_btl_usnic_send_segment_t {
ompi_btl_usnic_segment_t ss_base;
/* verbs send desc */
struct ibv_send_wr ss_send_desc;
/* channel upon which send was posted */
ompi_btl_usnic_channel_id_t ss_channel;
uint32_t ss_flags;
struct ompi_btl_usnic_send_frag_t *ss_parent_frag;
int ss_hotel_room; /* current retrans room, or -1 if none */
/* How many times is this segment on a hardware queue? */
uint32_t ss_send_posted;
bool ss_ack_pending; /* true until this segment is ACKed */
} ompi_btl_usnic_send_segment_t;
/**
* Common part of usNIC fragment descriptor
*/
typedef struct ompi_btl_usnic_frag_t {
mca_btl_base_descriptor_t uf_base;
/* fragment descriptor type */
ompi_btl_usnic_frag_type_t uf_type;
/* utility segments */
mca_btl_base_segment_t uf_src_seg[2];
mca_btl_base_segment_t uf_dst_seg[1];
/* freelist this came from */
ompi_free_list_t *uf_freelist;
} ompi_btl_usnic_frag_t;
/**
* Common part of usNIC send fragment descriptor
*/
typedef struct ompi_btl_usnic_send_frag_t {
ompi_btl_usnic_frag_t sf_base;
struct mca_btl_base_endpoint_t *sf_endpoint;
size_t sf_size; /* total fragment size (PML + user payload) */
/* original message data if convertor required */
struct opal_convertor_t* sf_convertor;
uint32_t sf_seg_post_cnt; /* total segs currently posted for this frag */
size_t sf_ack_bytes_left; /* bytes remaining to be ACKed */
struct ompi_btl_usnic_send_frag_t *sf_next;
} ompi_btl_usnic_send_frag_t;
/**
* Descriptor for a large fragment
* Large fragment uses two SG entries - one points to PML header,
* other points to data.
*/
typedef struct ompi_btl_usnic_large_send_frag_t {
ompi_btl_usnic_send_frag_t lsf_base;
char lsf_pml_header[64]; /* space for PML header */
uint32_t lsf_frag_id; /* fragment ID for reassembly */
size_t lsf_cur_offset; /* current offset into message */
size_t lsf_bytes_left; /* bytes remaining to send */
} ompi_btl_usnic_large_send_frag_t;
/**
* small send fragment
* Small send will optimistically use 2 SG entries in hopes of performing
* an inline send, but will convert to a single SG entry if inline cannot
* be done and data must be copied.
* First segment will point to registered memory of associated segment to
* hold BTL and PML headers.
* Second segment will point directly to user data. If inlining fails, we
* will copy user data into the registered memory after the PML header and
* convert to a single segment.
*/
typedef struct ompi_btl_usnic_small_send_frag_t {
ompi_btl_usnic_send_frag_t ssf_base;
/* small fragments have embedded segs */
ompi_btl_usnic_send_segment_t ssf_segment;
} ompi_btl_usnic_small_send_frag_t;
/**
* descriptor for a put destination
*/
typedef ompi_btl_usnic_frag_t ompi_btl_usnic_put_dest_frag_t;
OBJ_CLASS_DECLARATION(ompi_btl_usnic_send_frag_t);
OBJ_CLASS_DECLARATION(ompi_btl_usnic_small_send_frag_t);
OBJ_CLASS_DECLARATION(ompi_btl_usnic_large_send_frag_t);
OBJ_CLASS_DECLARATION(ompi_btl_usnic_put_dest_frag_t);
typedef ompi_btl_usnic_send_segment_t ompi_btl_usnic_frag_segment_t;
typedef ompi_btl_usnic_send_segment_t ompi_btl_usnic_chunk_segment_t;
OBJ_CLASS_DECLARATION(ompi_btl_usnic_segment_t);
OBJ_CLASS_DECLARATION(ompi_btl_usnic_frag_segment_t);
OBJ_CLASS_DECLARATION(ompi_btl_usnic_chunk_segment_t);
OBJ_CLASS_DECLARATION(ompi_btl_usnic_recv_segment_t);
typedef ompi_btl_usnic_send_segment_t ompi_btl_usnic_ack_segment_t;
OBJ_CLASS_DECLARATION(ompi_btl_usnic_ack_segment_t);
/*
* Alloc a send frag from the send pool
*/
static inline ompi_btl_usnic_small_send_frag_t *
ompi_btl_usnic_small_send_frag_alloc(ompi_btl_usnic_module_t *module)
{
ompi_free_list_item_t *item;
ompi_btl_usnic_small_send_frag_t *frag;
OMPI_FREE_LIST_GET_MT(&(module->small_send_frags), item);
if (OPAL_UNLIKELY(NULL == item)) {
return NULL;
}
frag = (ompi_btl_usnic_small_send_frag_t*) item;
/* this belongs in constructor... */
frag->ssf_base.sf_base.uf_freelist = &(module->small_send_frags);
/* always reset the send flags to their default */
frag->ssf_segment.ss_send_desc.send_flags = IBV_SEND_SIGNALED;
assert(frag);
assert(OMPI_BTL_USNIC_FRAG_SMALL_SEND == frag->ssf_base.sf_base.uf_type);
return frag;
}
static inline ompi_btl_usnic_large_send_frag_t *
ompi_btl_usnic_large_send_frag_alloc(ompi_btl_usnic_module_t *module)
{
ompi_free_list_item_t *item;
ompi_btl_usnic_large_send_frag_t *frag;
OMPI_FREE_LIST_GET_MT(&(module->large_send_frags), item);
if (OPAL_UNLIKELY(NULL == item)) {
return NULL;
}
frag = (ompi_btl_usnic_large_send_frag_t*) item;
/* this belongs in constructor... */
frag->lsf_base.sf_base.uf_freelist = &(module->large_send_frags);
assert(frag);
assert(OMPI_BTL_USNIC_FRAG_LARGE_SEND == frag->lsf_base.sf_base.uf_type);
return frag;
}
static inline ompi_btl_usnic_put_dest_frag_t *
ompi_btl_usnic_put_dest_frag_alloc(
struct ompi_btl_usnic_module_t *module)
{
ompi_free_list_item_t *item;
ompi_btl_usnic_put_dest_frag_t *frag;
OMPI_FREE_LIST_GET_MT(&(module->put_dest_frags), item);
if (OPAL_UNLIKELY(NULL == item)) {
return NULL;
}
frag = (ompi_btl_usnic_put_dest_frag_t*) item;
/* this belongs in constructor... */
frag->uf_freelist = &(module->put_dest_frags);
assert(frag);
assert(OMPI_BTL_USNIC_FRAG_PUT_DEST == frag->uf_type);
return frag;
}
/*
* A send frag can be returned to the freelist when either of the
* following is true:
*
* 1. PML is freeing it (via module.free())
* 2. Or all of these:
* a) it finishes sending all its segments
* b) all of its segments have been ACKed
* c) it is owned by the BTL
*/
static inline bool
ompi_btl_usnic_send_frag_ok_to_return(
ompi_btl_usnic_module_t *module,
ompi_btl_usnic_send_frag_t *frag)
{
assert(frag);
if (OPAL_LIKELY(frag->sf_base.uf_base.des_flags &
MCA_BTL_DES_FLAGS_BTL_OWNERSHIP) &&
0 == frag->sf_ack_bytes_left &&
0 == frag->sf_seg_post_cnt) {
return true;
}
return false;
}
static inline void
ompi_btl_usnic_frag_return(
struct ompi_btl_usnic_module_t *module,
ompi_btl_usnic_frag_t *frag)
{
OMPI_FREE_LIST_RETURN_MT(frag->uf_freelist, &(frag->uf_base.super));
}
/*
* Return a send frag if it's all done and owned by BTL
*/
static inline void
ompi_btl_usnic_send_frag_return_cond(
struct ompi_btl_usnic_module_t *module,
ompi_btl_usnic_send_frag_t *frag)
{
if (ompi_btl_usnic_send_frag_ok_to_return(module, frag)) {
ompi_btl_usnic_frag_return(module, &frag->sf_base);
}
}
static inline ompi_btl_usnic_chunk_segment_t *
ompi_btl_usnic_chunk_segment_alloc(
ompi_btl_usnic_module_t *module)
{
ompi_free_list_item_t *item;
ompi_btl_usnic_send_segment_t *seg;
OMPI_FREE_LIST_GET_MT(&(module->chunk_segs), item);
if (OPAL_UNLIKELY(NULL == item)) {
return NULL;
}
seg = (ompi_btl_usnic_send_segment_t*) item;
seg->ss_channel = USNIC_DATA_CHANNEL;
assert(seg);
assert(OMPI_BTL_USNIC_SEG_CHUNK == seg->ss_base.us_type);
return seg;
}
static inline void
ompi_btl_usnic_chunk_segment_return(
ompi_btl_usnic_module_t *module,
ompi_btl_usnic_chunk_segment_t *seg)
{
assert(seg);
assert(OMPI_BTL_USNIC_SEG_CHUNK == seg->ss_base.us_type);
OMPI_FREE_LIST_RETURN_MT(&(module->chunk_segs), &(seg->ss_base.us_list));
}
/*
* Alloc an ACK segment
*/
static inline ompi_btl_usnic_ack_segment_t *
ompi_btl_usnic_ack_segment_alloc(ompi_btl_usnic_module_t *module)
{
ompi_free_list_item_t *item;
ompi_btl_usnic_send_segment_t *ack;
OMPI_FREE_LIST_GET_MT(&(module->ack_segs), item);
if (OPAL_UNLIKELY(NULL == item)) {
return NULL;
}
ack = (ompi_btl_usnic_ack_segment_t*) item;
ack->ss_channel = USNIC_PRIORITY_CHANNEL;
assert(ack);
assert(OMPI_BTL_USNIC_SEG_ACK == ack->ss_base.us_type);
return ack;
}
/*
* Return an ACK segment
*/
static inline void
ompi_btl_usnic_ack_segment_return(
ompi_btl_usnic_module_t *module,
ompi_btl_usnic_ack_segment_t *ack)
{
assert(ack);
assert(OMPI_BTL_USNIC_SEG_ACK == ack->ss_base.us_type);
OMPI_FREE_LIST_RETURN_MT(&(module->ack_segs), &(ack->ss_base.us_list));
}
END_C_DECLS
#endif
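
The chunk header defined in this file carries a fragment ID, the total fragment size, and an offset, which is what the receiver needs to reassemble a chunked fragment. The following is a sketch of that bookkeeping only. The slot type and function are hypothetical, the buffer is always malloc'ed (the real code may also use a receive buffer pool), and duplicate chunks are assumed to have been filtered out earlier by the sliding window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down reassembly slot */
typedef struct demo_rx_frag {
    uint32_t frag_id;     /* ID of the fragment being assembled */
    uint32_t frag_size;   /* total bytes in the fragment */
    uint32_t bytes_left;  /* bytes still to be received */
    char *data;           /* assembly area; NULL while the slot is unused */
} demo_rx_frag_t;

/* Copy one chunk into place.  Returns true once the last byte of the
 * fragment has arrived; the caller then owns slot->data. */
static bool demo_rx_chunk(demo_rx_frag_t *slot, uint32_t frag_id,
                          uint32_t frag_size, uint32_t offset,
                          const void *chunk, uint32_t chunk_len)
{
    /* First reference to this fragment: allocate the assembly area */
    if (NULL == slot->data) {
        slot->data = malloc(frag_size);
        if (NULL == slot->data) {
            return false;
        }
        slot->frag_id = frag_id;
        slot->frag_size = frag_size;
        slot->bytes_left = frag_size;
    }

    /* Ignore chunks for another fragment or outside the buffer */
    if (frag_id != slot->frag_id ||
        offset > slot->frag_size ||
        chunk_len > slot->frag_size - offset ||
        chunk_len > slot->bytes_left) {
        return false;
    }

    memcpy(slot->data + offset, chunk, chunk_len);
    slot->bytes_left -= chunk_len;
    return 0 == slot->bytes_left;
}
```

Because each chunk carries its own offset, chunks may arrive in any order; completion is detected purely by the remaining byte count reaching zero.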

217
ompi/mca/btl/usnic/btl_usnic_hwloc.c Normal file

@ -0,0 +1,217 @@
/*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
/*
* This file is only compiled (via AM_CONDITIONAL) if OPAL_HAVE_HWLOC
* is set.
*/
#include "ompi_config.h"
#include <infiniband/verbs.h>
/* Define this before including hwloc.h so that we also get the hwloc
verbs helper header file, too. We have to do this level of
indirection because the hwloc subsystem is a component -- we don't
know its exact path. We have to rely on the framework header files
to find the right hwloc verbs helper file for us. */
#define OPAL_HWLOC_WANT_VERBS_HELPER 1
#include "opal/mca/hwloc/hwloc.h"
#include "ompi/constants.h"
#include "ompi/mca/btl/base/base.h"
#include "ompi/mca/common/verbs/common_verbs.h"
#include "btl_usnic_hwloc.h"
/*
* Local variables
*/
static hwloc_obj_t my_numa_node = NULL;
static int num_numa_nodes = 0;
static const struct hwloc_distances_s *matrix = NULL;
/*
* Get the hwloc distance matrix (if we don't already have it).
*
* Note that the matrix data structure belongs to hwloc; we are not
* responsible for freeing it.
*/
static int get_distance_matrix(void)
{
if (NULL == matrix) {
matrix = hwloc_get_whole_distance_matrix_by_type(opal_hwloc_topology,
HWLOC_OBJ_NODE);
}
return (NULL == matrix) ? OMPI_ERROR : OMPI_SUCCESS;
}
/*
* Find the NUMA node that covers a given cpuset
*/
static hwloc_obj_t find_numa_node(hwloc_bitmap_t cpuset)
{
hwloc_obj_t obj;
obj =
hwloc_get_first_largest_obj_inside_cpuset(opal_hwloc_topology, cpuset);
/* Nothing in the topology fits inside this cpuset */
if (NULL == obj) {
return NULL;
}
/* Go upwards until we hit the NUMA node or run out of parents */
while (obj->type > HWLOC_OBJ_NODE &&
NULL != obj->parent) {
obj = obj->parent;
}
/* Make sure we ended up on the NUMA node */
if (obj->type != HWLOC_OBJ_NODE) {
opal_output_verbose(5, USNIC_OUT,
"btl:usnic:filter_numa: could not find NUMA node where this process is bound; filtering by NUMA distance not possible");
return NULL;
}
/* Finally, make sure that our cpuset doesn't span more than 1
NUMA node */
if (hwloc_get_nbobjs_inside_cpuset_by_type(opal_hwloc_topology,
cpuset, HWLOC_OBJ_NODE) > 1) {
opal_output_verbose(5, USNIC_OUT,
"btl:usnic:filter_numa: this process is bound to more than 1 NUMA node; filtering by NUMA distance not possible");
return NULL;
}
return obj;
}
/*
* Find my NUMA node in the hwloc topology. This is a Cisco
* UCS-specific BTL, so I know that I'll always have a NUMA node
* (i.e., not some unknown server type that may not have or report a
* NUMA node).
*
* Note that the my_numa_node value we find is just a handle; we
* aren't responsible for freeing it.
*/
static int find_my_numa_node(void)
{
hwloc_obj_t obj;
hwloc_bitmap_t cpuset;
if (NULL != my_numa_node) {
return OMPI_SUCCESS;
}
/* Get this process' binding */
cpuset = hwloc_bitmap_alloc();
if (NULL == cpuset) {
return OMPI_ERR_OUT_OF_RESOURCE;
}
if (0 != hwloc_get_cpubind(opal_hwloc_topology, cpuset, 0)) {
hwloc_bitmap_free(cpuset);
return OMPI_ERR_NOT_AVAILABLE;
}
/* Get the largest object type in the cpuset */
obj = find_numa_node(cpuset);
hwloc_bitmap_free(cpuset);
if (NULL == obj) {
return OMPI_ERR_NOT_AVAILABLE;
}
/* Happiness */
my_numa_node = obj;
num_numa_nodes = hwloc_get_nbobjs_by_type(opal_hwloc_topology,
HWLOC_OBJ_NODE);
return OMPI_SUCCESS;
}
/*
* Find a NUMA node covering the device associated with this module
*/
static hwloc_obj_t find_device_numa(ompi_btl_usnic_module_t *module)
{
hwloc_obj_t obj;
hwloc_bitmap_t cpuset;
/* Bozo checks */
assert(NULL != matrix);
assert(NULL != my_numa_node);
/* Find the NUMA node for the device */
cpuset = hwloc_bitmap_alloc();
if (NULL == cpuset) {
return NULL;
}
if (0 != hwloc_ibv_get_device_cpuset(opal_hwloc_topology,
module->device,
cpuset)) {
hwloc_bitmap_free(cpuset);
return NULL;
}
obj = find_numa_node(cpuset);
hwloc_bitmap_free(cpuset);
return obj;
}
/*
* Public entry point: find the hwloc NUMA distance from this process
* to the usnic device in the specified module.
*/
int ompi_btl_usnic_hwloc_distance(ompi_btl_usnic_module_t *module)
{
int ret;
hwloc_obj_t dev_numa;
/* Bozo check */
assert(NULL != module);
/* Is this process bound? */
if (!ompi_rte_proc_is_bound) {
opal_output_verbose(5, USNIC_OUT,
"btl:usnic:filter_numa: not sorting devices by NUMA distance (process not bound)");
return OMPI_SUCCESS;
}
opal_output_verbose(5, USNIC_OUT,
"btl:usnic:filter_numa: filtering devices by NUMA distance");
/* Get the hwloc distance matrix for all NUMA nodes */
if (OMPI_SUCCESS != (ret = get_distance_matrix())) {
return ret;
}
/* Find my NUMA node */
if (OMPI_SUCCESS != (ret = find_my_numa_node())) {
return ret;
}
/* If my_numa_node is still NULL, that means we span more than 1
NUMA node. So... no sorting/pruning for you! */
if (NULL == my_numa_node) {
return OMPI_SUCCESS;
}
/* Find the NUMA node covering this module's device */
dev_numa = find_device_numa(module);
/* Lookup the distance between my NUMA node and the NUMA node of
the device */
if (NULL != dev_numa) {
module->numa_distance =
matrix->latency[dev_numa->logical_index * num_numa_nodes +
my_numa_node->logical_index];
opal_output_verbose(5, USNIC_OUT,
"btl:usnic:filter_numa: %s is distance %d from me",
ibv_get_device_name(module->device),
module->numa_distance);
}
return OMPI_SUCCESS;
}
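
The distance lookup in ompi_btl_usnic_hwloc_distance() treats the hwloc latency matrix as a flattened, row-major N x N array indexed by the logical indexes of the two NUMA nodes. The following is a minimal sketch of that indexing only. The helper name numa_distance_lookup and the float element type are assumptions made for illustration; neither appears in this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the BTL): return the entry for
 * (dev_index, my_index) from a flattened, row-major n x n matrix,
 * using the same "row * n + column" arithmetic as the code above. */
float numa_distance_lookup(const float *latency, size_t n,
                           size_t dev_index, size_t my_index)
{
    assert(NULL != latency);
    assert(dev_index < n && my_index < n);
    return latency[dev_index * n + my_index];
}
```

With a two-node matrix {1.0, 2.5, 2.5, 1.0}, the entry for device node 1 and process node 0 is element 1 * 2 + 0 = 2, i.e. 2.5.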

22
ompi/mca/btl/usnic/btl_usnic_hwloc.h Normal file

@ -0,0 +1,22 @@
/*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_USNIC_HWLOC_H
#define BTL_USNIC_HWLOC_H
#include "ompi_config.h"
#include "btl_usnic_module.h"
#if OPAL_HAVE_HWLOC
int ompi_btl_usnic_hwloc_distance(ompi_btl_usnic_module_t *module);
#endif
#endif /* BTL_USNIC_HWLOC_H */

255
ompi/mca/btl/usnic/btl_usnic_mca.c Normal file

@ -0,0 +1,255 @@
/*
* Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2008-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2012 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "ompi_config.h"
#ifdef HAVE_STRING_H
#include <string.h>
#endif
#include <errno.h>
#include <infiniband/verbs.h>
#include "opal/mca/base/mca_base_var.h"
#include "opal/util/argv.h"
#include "ompi/constants.h"
#include "ompi/mca/btl/btl.h"
#include "ompi/mca/btl/base/base.h"
#include "ompi/mca/common/verbs/common_verbs.h"
#include "btl_usnic.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_endpoint.h"
#include "btl_usnic_module.h"
/*
* Local flags
*/
enum {
REGINT_NEG_ONE_OK = 0x01,
REGINT_GE_ZERO = 0x02,
REGINT_GE_ONE = 0x04,
REGINT_NONZERO = 0x08,
REGINT_MAX = 0x88
};
enum {
REGSTR_EMPTY_OK = 0x01,
REGSTR_MAX = 0x88
};
/*
* utility routine for string parameter registration
*/
static int reg_string(const char* param_name,
const char* help_string,
const char* default_value, char **storage,
int flags, int level)
{
*storage = (char*) default_value;
mca_base_component_var_register(&mca_btl_usnic_component.super.btl_version,
param_name, help_string,
MCA_BASE_VAR_TYPE_STRING,
NULL,
0,
0,
level,
MCA_BASE_VAR_SCOPE_READONLY,
storage);
if (0 == (flags & REGSTR_EMPTY_OK) &&
(NULL == *storage || 0 == strlen(*storage))) {
opal_output(0, "Bad parameter value for parameter \"%s\"",
param_name);
return OMPI_ERR_BAD_PARAM;
}
return OMPI_SUCCESS;
}
/*
* utility routine for integer parameter registration
*/
static int reg_int(const char* param_name,
const char* help_string,
int default_value, int *storage, int flags, int level)
{
*storage = default_value;
mca_base_component_var_register(&mca_btl_usnic_component.super.btl_version,
param_name, help_string,
MCA_BASE_VAR_TYPE_INT,
NULL,
0,
0,
level,
MCA_BASE_VAR_SCOPE_READONLY,
storage);
if (0 != (flags & REGINT_NEG_ONE_OK) && -1 == *storage) {
return OMPI_SUCCESS;
}
if ((0 != (flags & REGINT_GE_ZERO) && *storage < 0) ||
(0 != (flags & REGINT_GE_ONE) && *storage < 1) ||
(0 != (flags & REGINT_NONZERO) && 0 == *storage)) {
opal_output(0, "Bad parameter value for parameter \"%s\"",
param_name);
return OMPI_ERR_BAD_PARAM;
}
return OMPI_SUCCESS;
}
int ompi_btl_usnic_component_register(void)
{
int i, tmp, ret = 0;
char *str, **parts;
static int max_modules;
static int stats_relative;
static int want_numa_device_assignment;
static int sd_num;
static int rd_num;
static int prio_sd_num;
static int prio_rd_num;
static int cq_num;
static int max_tiny_payload;
static int eager_limit;
static int rndv_eager_limit;
static char *vendor_part_ids;
#define CHECK(expr) do {\
tmp = (expr); \
if (OMPI_SUCCESS != tmp) ret = tmp; \
} while (0)
CHECK(reg_int("max_btls",
"Maximum number of usNICs to use (default: 0 = as many as are available)",
0, &max_modules,
REGINT_GE_ZERO, OPAL_INFO_LVL_2));
mca_btl_usnic_component.max_modules = (size_t) max_modules;
CHECK(reg_string("if_include",
"Comma-delimited list of devices/networks to be used (e.g. \"usnic_0,10.10.0.0/16\"; empty value means to use all available usNICs). Mutually exclusive with btl_usnic_if_exclude.",
NULL, &mca_btl_usnic_component.if_include,
REGSTR_EMPTY_OK, OPAL_INFO_LVL_1));
CHECK(reg_string("if_exclude",
"Comma-delimited list of devices/networks to be excluded (empty value means to not exclude any usNICs). Mutually exclusive with btl_usnic_if_include.",
NULL, &mca_btl_usnic_component.if_exclude,
REGSTR_EMPTY_OK, OPAL_INFO_LVL_1));
/* Cisco Sereno-based VICs are part ID 207 */
vendor_part_ids = NULL;
CHECK(reg_string("vendor_part_ids",
"Comma-delimited list of verbs vendor part IDs to search for/use",
"207", &vendor_part_ids, 0, OPAL_INFO_LVL_5));
parts = opal_argv_split(vendor_part_ids, ',');
mca_btl_usnic_component.vendor_part_ids =
calloc(opal_argv_count(parts) + 1, sizeof(uint32_t));
if (NULL == mca_btl_usnic_component.vendor_part_ids) {
opal_argv_free(parts);
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* parts is NULL if the parameter value was empty; the array is
zero-terminated either way because of the calloc above */
for (i = 0; NULL != parts && NULL != (str = parts[i]); ++i) {
mca_btl_usnic_component.vendor_part_ids[i] = (uint32_t) atoi(str);
}
opal_argv_free(parts);
CHECK(reg_int("stats",
"A non-negative integer specifying the frequency at which each USNIC BTL will output statistics (default: 0 seconds, meaning that statistics are disabled)",
0, &mca_btl_usnic_component.stats_frequency, 0,
OPAL_INFO_LVL_4));
mca_btl_usnic_component.stats_enabled =
(bool) (mca_btl_usnic_component.stats_frequency > 0);
CHECK(reg_int("stats_relative",
"If stats are enabled, output relative stats between the timestamps (vs. cumulative stats since the beginning of the job) (default: 0 -- i.e., absolute)",
0, &stats_relative, 0, OPAL_INFO_LVL_4));
mca_btl_usnic_component.stats_relative = (bool) stats_relative;
CHECK(reg_string("mpool", "Name of the memory pool to be used",
"grdma", &mca_btl_usnic_component.usnic_mpool_name, 0,
OPAL_INFO_LVL_5));
want_numa_device_assignment = OPAL_HAVE_HWLOC ? 1 : -1;
CHECK(reg_int("want_numa_device_assignment",
"If 1, use only Cisco VIC ports that are a minimum NUMA distance from the MPI process for short messages. If 0, use all available Cisco VIC ports for short messages. This parameter is meaningless (and ignored) unless MPI processes are bound to processor cores. Defaults to 1 if NUMA support is included in Open MPI; -1 otherwise.",
want_numa_device_assignment,
&want_numa_device_assignment,
0, OPAL_INFO_LVL_5));
mca_btl_usnic_component.want_numa_device_assignment =
(1 == want_numa_device_assignment) ? true : false;
CHECK(reg_int("sd_num", "Maximum send descriptors to post (-1 = pre-set defaults; depends on number and type of devices available)",
-1, &sd_num, REGINT_NEG_ONE_OK, OPAL_INFO_LVL_5));
mca_btl_usnic_component.sd_num = (int32_t) sd_num;
CHECK(reg_int("rd_num", "Number of pre-posted receive buffers (-1 = pre-set defaults; depends on number and type of devices available)",
-1, &rd_num, REGINT_NEG_ONE_OK, OPAL_INFO_LVL_5));
mca_btl_usnic_component.rd_num = (int32_t) rd_num;
CHECK(reg_int("prio_sd_num", "Maximum priority send descriptors to post (-1 = pre-set defaults; depends on number and type of devices available)",
-1, &prio_sd_num, REGINT_NEG_ONE_OK, OPAL_INFO_LVL_5));
mca_btl_usnic_component.prio_sd_num = (int32_t) prio_sd_num;
CHECK(reg_int("prio_rd_num", "Number of pre-posted priority receive buffers (-1 = pre-set defaults; depends on number and type of devices available)",
-1, &prio_rd_num, REGINT_NEG_ONE_OK, OPAL_INFO_LVL_5));
mca_btl_usnic_component.prio_rd_num = (int32_t) prio_rd_num;
CHECK(reg_int("cq_num", "Number of completion queue entries (-1 = pre-set defaults; depends on number and type of devices available; will error if (sd_num+rd_num)>cq_num)",
-1, &cq_num, REGINT_NEG_ONE_OK, OPAL_INFO_LVL_5));
mca_btl_usnic_component.cq_num = (int32_t) cq_num;
CHECK(reg_int("retrans_timeout", "Number of microseconds before retransmitting a frame",
1000, &mca_btl_usnic_component.retrans_timeout,
REGINT_GE_ONE, OPAL_INFO_LVL_5));
CHECK(reg_int("priority_limit", "Max size of \"priority\" messages (0 = use pre-set defaults; depends on number and type of devices available)",
0, &max_tiny_payload,
REGINT_GE_ZERO, OPAL_INFO_LVL_5));
ompi_btl_usnic_module_template.max_tiny_payload =
(size_t) max_tiny_payload;
CHECK(reg_int("eager_limit", "Eager send limit (0 = use pre-set defaults; depends on number and type of devices available)",
0, &eager_limit, REGINT_GE_ZERO, OPAL_INFO_LVL_5));
ompi_btl_usnic_module_template.super.btl_eager_limit = eager_limit;
CHECK(reg_int("rndv_eager_limit", "Eager rendezvous limit (0 = use pre-set defaults; depends on number and type of devices available)",
0, &rndv_eager_limit, REGINT_GE_ZERO, OPAL_INFO_LVL_5));
ompi_btl_usnic_module_template.super.btl_rndv_eager_limit =
rndv_eager_limit;
/* Default to bandwidth auto-detection */
ompi_btl_usnic_module_template.super.btl_bandwidth = 0;
ompi_btl_usnic_module_template.super.btl_latency = 4;
/* Register some synonyms to the ompi common verbs component */
ompi_common_verbs_mca_register(&mca_btl_usnic_component.super.btl_version);
return ret;
}
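
The REGINT_* flags drive the range checks at the end of reg_int(). The order matters: a value of -1 is accepted first when REGINT_NEG_ONE_OK is set, and only then are the other constraints applied. The sketch below isolates that logic in a hypothetical helper, regint_value_ok (not part of this commit), with the flag values copied from the enum above.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values copied from the enum in btl_usnic_mca.c */
enum {
    REGINT_NEG_ONE_OK = 0x01,
    REGINT_GE_ZERO    = 0x02,
    REGINT_GE_ONE     = 0x04,
    REGINT_NONZERO    = 0x08
};

/* Hypothetical helper: true if value satisfies flags, using the same
 * order of checks as reg_int(). */
bool regint_value_ok(int value, int flags)
{
    /* -1 short-circuits every other check when it is allowed */
    if (0 != (flags & REGINT_NEG_ONE_OK) && -1 == value) {
        return true;
    }
    if ((0 != (flags & REGINT_GE_ZERO) && value < 0) ||
        (0 != (flags & REGINT_GE_ONE) && value < 1) ||
        (0 != (flags & REGINT_NONZERO) && 0 == value)) {
        return false;
    }
    return true;
}
```

One consequence visible in this sketch: a parameter registered with only REGINT_NEG_ONE_OK (as sd_num and rd_num are) is not range-checked for other negative values, so -5 would also pass this validation step.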

1880
ompi/mca/btl/usnic/btl_usnic_module.c Normal file

File diff not shown because it is too large

287
ompi/mca/btl/usnic/btl_usnic_module.h Normal file

@ -0,0 +1,287 @@
/*
* Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2011-2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
/**
* @file
*/
#ifndef OMPI_BTL_USNIC_MODULE_H
#define OMPI_BTL_USNIC_MODULE_H
#include "opal/class/opal_pointer_array.h"
#include "ompi/mca/common/verbs/common_verbs.h"
#include "btl_usnic_endpoint.h"
/*
* Default limits.
*
* These values obtained from empirical testing on Intel E5-2690
* machines with Sereno/Lexington cards through an N3546 switch.
*/
#define USNIC_DFLT_EAGER_LIMIT_1DEVICE (150 * 1024)
#define USNIC_DFLT_EAGER_LIMIT_NDEVICES (25 * 1024)
#define USNIC_DFLT_RNDV_EAGER_LIMIT 500
BEGIN_C_DECLS
/*
* Forward declarations to avoid include loops
*/
struct ompi_btl_usnic_send_segment_t;
/*
* Abstraction of a set of IB queues
*/
typedef struct ompi_btl_usnic_channel_t {
struct ibv_cq *cq;
int chan_mtu;
int chan_rd_num;
int chan_sd_num;
/** available send WQ entries */
int32_t sd_wqe;
/* fastsend enabled if sd_wqe >= fastsend_wqe_thresh */
int fastsend_wqe_thresh;
/** queue pair */
struct ibv_qp* qp;
/** receive segments & buffers */
ompi_free_list_t recv_segs;
/* statistics */
uint32_t num_channel_sends;
} ompi_btl_usnic_channel_t;
/**
* UD verbs BTL interface
*/
typedef struct ompi_btl_usnic_module_t {
mca_btl_base_module_t super;
/* Cache for use during component_init to associate a module with
the ompi_common_verbs_port_item_t that it came from. */
ompi_common_verbs_port_item_t *port;
mca_btl_base_module_error_cb_fn_t pml_error_callback;
/* Information about the usNIC verbs device */
uint8_t port_num;
struct ibv_device *device;
struct ibv_context *device_context;
struct event device_async_event;
bool device_async_event_active;
struct ibv_pd *pd;
int numa_distance; /* hwloc NUMA distance from this process */
/* Information about the IP interface corresponding to this USNIC
interface */
char if_name[64];
uint32_t if_ipv4_addr; /* in network byte order */
uint32_t if_cidrmask; /* X in "/X" CIDR addr fmt, host byte order */
uint8_t if_mac[6];
int if_mtu;
/** desired send, receive, and completion queue entries (from MCA
params; cached here on the component because the MCA param
might == 0, which means "max supported on that device") */
int sd_num;
int rd_num;
int cq_num;
int prio_sd_num;
int prio_rd_num;
/*
* Fragments larger than max_frag_payload will be broken up into
* multiple chunks. The amount that can be held in a single chunk
* segment is slightly less than what can be held in frag segment due
* to fragment reassembly info.
*/
int tiny_mtu;
size_t max_frag_payload; /* most that fits in a frag segment */
size_t max_chunk_payload; /* most that can fit in chunk segment */
size_t max_tiny_payload; /* threshold for using inline send */
/** Hash table to keep track of senders */
opal_hash_table_t senders;
/** local address information */
struct ompi_btl_usnic_addr_t local_addr;
/** list of all endpoints */
opal_list_t all_endpoints;
/** array of procs used by this module (can't use a list because a
proc can be used by multiple modules) */
opal_pointer_array_t all_procs;
/** send fragments & buffers */
ompi_free_list_t small_send_frags;
ompi_free_list_t large_send_frags;
ompi_free_list_t put_dest_frags;
ompi_free_list_t chunk_segs;
/** receive buffer pools */
int first_pool;
int last_pool;
ompi_free_list_t *module_recv_buffers;
/** list of endpoints with data to send */
/* this list uses base endpoint ptr */
opal_list_t endpoints_with_sends;
/** list of send frags that are waiting to be resent (they were
previously deferred because of lack of resources) */
opal_list_t pending_resend_segs;
/** ack segments */
ompi_free_list_t ack_segs;
/** list of endpoints to which we need to send ACKs */
/* this list uses endpoint->endpoint_ack_li */
opal_list_t endpoints_that_need_acks;
/* abstract queue-pairs into channels */
ompi_btl_usnic_channel_t mod_channels[USNIC_NUM_CHANNELS];
uint32_t qp_max_inline;
/* Debugging statistics */
bool final_stats;
uint64_t stats_report_num;
uint64_t num_total_sends;
uint64_t num_resends;
uint64_t num_timeout_retrans;
uint64_t num_fast_retrans;
uint64_t num_chunk_sends;
uint64_t num_frag_sends;
uint64_t num_ack_sends;
uint64_t num_total_recvs;
uint64_t num_unk_recvs;
uint64_t num_dup_recvs;
uint64_t num_oow_low_recvs;
uint64_t num_oow_high_recvs;
uint64_t num_frag_recvs;
uint64_t num_chunk_recvs;
uint64_t num_badfrag_recvs;
uint64_t num_ack_recvs;
uint64_t num_old_dup_acks;
uint64_t num_dup_acks;
uint64_t num_recv_reposts;
uint64_t num_crc_errors;
uint64_t max_sent_window_size;
uint64_t max_rcvd_window_size;
uint64_t pml_module_sends;
uint64_t pml_send_callbacks;
opal_event_t stats_timer_event;
struct timeval stats_timeout;
} ompi_btl_usnic_module_t;
struct ompi_btl_usnic_frag_t;
extern ompi_btl_usnic_module_t ompi_btl_usnic_module_template;
/*
* Manipulate the "endpoints_that_need_acks" list
*/
/* get first endpoint needing ACK */
static inline ompi_btl_usnic_endpoint_t *
ompi_btl_usnic_get_first_endpoint_needing_ack(
ompi_btl_usnic_module_t *module)
{
opal_list_item_t *item;
ompi_btl_usnic_endpoint_t *endpoint;
item = opal_list_get_first(&module->endpoints_that_need_acks);
if (item != opal_list_get_end(&module->endpoints_that_need_acks)) {
endpoint = container_of(item, mca_btl_base_endpoint_t, endpoint_ack_li);
return endpoint;
} else {
return NULL;
}
}
/* get next item in chain */
static inline ompi_btl_usnic_endpoint_t *
ompi_btl_usnic_get_next_endpoint_needing_ack(
ompi_btl_usnic_endpoint_t *endpoint)
{
opal_list_item_t *item;
ompi_btl_usnic_module_t *module;
module = endpoint->endpoint_module;
item = opal_list_get_next(&(endpoint->endpoint_ack_li));
if (item != opal_list_get_end(&module->endpoints_that_need_acks)) {
endpoint = container_of(item, mca_btl_base_endpoint_t, endpoint_ack_li);
return endpoint;
} else {
return NULL;
}
}
static inline void
ompi_btl_usnic_remove_from_endpoints_needing_ack(
ompi_btl_usnic_endpoint_t *endpoint)
{
opal_list_remove_item(
&(endpoint->endpoint_module->endpoints_that_need_acks),
&endpoint->endpoint_ack_li);
endpoint->endpoint_ack_needed = false;
endpoint->endpoint_acktime = 0;
#if MSGDEBUG1
opal_output(0, "clear ack_needed on %p\n", endpoint);
#endif
}
static inline void
ompi_btl_usnic_add_to_endpoints_needing_ack(
ompi_btl_usnic_endpoint_t *endpoint)
{
opal_list_append(&(endpoint->endpoint_module->endpoints_that_need_acks),
&endpoint->endpoint_ack_li);
endpoint->endpoint_ack_needed = true;
#if MSGDEBUG1
opal_output(0, "set ack_needed on %p\n", endpoint);
#endif
}
/*
* Initialize a module
*/
int ompi_btl_usnic_module_init(ompi_btl_usnic_module_t* module);
/*
* Progress pending sends on a module
*/
void ompi_btl_usnic_module_progress_sends(ompi_btl_usnic_module_t *module);
END_C_DECLS
#endif
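
The ACK-list helpers above recover an endpoint from a list item with container_of(), because an endpoint sits on several lists at once and so embeds more than one list link. The following is a minimal, self-contained sketch of that technique. The container_of definition and the two structs are simplified stand-ins written for this illustration; they are not the definitions used in the Open MPI tree.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition for illustration: step back from a member
 * pointer to the start of the enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

/* Hypothetical stand-in for opal_list_item_t */
struct list_item {
    struct list_item *next;
};

/* Hypothetical stand-in for the endpoint: one link per list */
struct endpoint {
    int id;
    struct list_item endpoint_li;      /* "all endpoints" link */
    struct list_item endpoint_ack_li;  /* "needs ACK" link */
};

/* Map an item taken from the "needs ACK" list back to its endpoint */
struct endpoint *endpoint_from_ack_item(struct list_item *item)
{
    return container_of(item, struct endpoint, endpoint_ack_li);
}
```

A plain cast from the list item to the endpoint would only be valid for a link stored as the first member, which is why the member name has to be passed explicitly here.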

421
ompi/mca/btl/usnic/btl_usnic_proc.c Normal file

@ -0,0 +1,421 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "ompi_config.h"
#include <infiniband/verbs.h>
#include "opal_stdint.h"
#include "opal/util/arch.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/util/show_help.h"
#include "ompi/runtime/ompi_module_exchange.h"
#include "ompi/constants.h"
#include "btl_usnic.h"
#include "btl_usnic_proc.h"
#include "btl_usnic_endpoint.h"
#include "btl_usnic_module.h"
#include "btl_usnic_util.h"
static void proc_construct(ompi_btl_usnic_proc_t* proc)
{
proc->proc_ompi = 0;
proc->proc_modex = NULL;
proc->proc_modex_count = 0;
proc->proc_modex_claimed = NULL;
proc->proc_endpoints = NULL;
proc->proc_endpoint_count = 0;
/* add to list of all proc instances */
opal_list_append(&mca_btl_usnic_component.usnic_procs, &proc->super);
}
static void proc_destruct(ompi_btl_usnic_proc_t* proc)
{
/* remove from list of all proc instances */
opal_list_remove_item(&mca_btl_usnic_component.usnic_procs, &proc->super);
/* release resources */
if (NULL != proc->proc_modex) {
free(proc->proc_modex);
proc->proc_modex = NULL;
}
if (NULL != proc->proc_modex_claimed) {
free(proc->proc_modex_claimed);
proc->proc_modex_claimed = NULL;
}
/* Release all endpoints associated with this proc */
if (NULL != proc->proc_endpoints) {
free(proc->proc_endpoints);
proc->proc_endpoints = NULL;
}
}
OBJ_CLASS_INSTANCE(ompi_btl_usnic_proc_t,
opal_list_item_t,
proc_construct,
proc_destruct);
/*
* Look for an existing usnic process instance based on the
* associated ompi_proc_t instance.
*/
ompi_btl_usnic_proc_t *
ompi_btl_usnic_proc_lookup_ompi(ompi_proc_t* ompi_proc)
{
ompi_btl_usnic_proc_t* usnic_proc;
for (usnic_proc = (ompi_btl_usnic_proc_t*)
opal_list_get_first(&mca_btl_usnic_component.usnic_procs);
usnic_proc != (ompi_btl_usnic_proc_t*)
opal_list_get_end(&mca_btl_usnic_component.usnic_procs);
usnic_proc = (ompi_btl_usnic_proc_t*)
opal_list_get_next(usnic_proc)) {
if (usnic_proc->proc_ompi == ompi_proc) {
return usnic_proc;
}
}
return NULL;
}
/*
* Look for an existing usnic proc based on a hashed ORTE process
* name, and return one of its endpoints that the receiving module
* can reach (or NULL if there is none).
*/
ompi_btl_usnic_endpoint_t *
ompi_btl_usnic_proc_lookup_endpoint(ompi_btl_usnic_module_t *receiver,
uint64_t sender_hashed_orte_name)
{
size_t i;
uint32_t mynet, peernet;
ompi_btl_usnic_proc_t *proc;
ompi_btl_usnic_endpoint_t *endpoint;
for (proc = (ompi_btl_usnic_proc_t*)
opal_list_get_first(&mca_btl_usnic_component.usnic_procs);
proc != (ompi_btl_usnic_proc_t*)
opal_list_get_end(&mca_btl_usnic_component.usnic_procs);
proc = (ompi_btl_usnic_proc_t*)
opal_list_get_next(proc)) {
if (orte_util_hash_name(&proc->proc_ompi->proc_name) ==
sender_hashed_orte_name) {
break;
}
}
/* If we didn't find the sending proc (!), return NULL */
if (opal_list_get_end(&mca_btl_usnic_component.usnic_procs) ==
(opal_list_item_t*) proc) {
return NULL;
}
/* Look through all the endpoints on sender's proc and find one
that we can reach. For the moment, do the same test as in
match_modex: check to see if we have compatible IPv4
networks. */
mynet = ompi_btl_usnic_get_ipv4_subnet(receiver->if_ipv4_addr,
receiver->if_cidrmask);
for (i = 0; i < proc->proc_endpoint_count; ++i) {
endpoint = proc->proc_endpoints[i];
peernet = ompi_btl_usnic_get_ipv4_subnet(endpoint->endpoint_remote_addr.ipv4_addr,
endpoint->endpoint_remote_addr.cidrmask);
/* If we match, we're done */
if (mynet == peernet) {
return endpoint;
}
}
/* Didn't find it */
return NULL;
}
/*
* Create an ompi_btl_usnic_proc_t and initialize it with modex info
* and an empty array of endpoints.
*/
static ompi_btl_usnic_proc_t *create_proc(ompi_proc_t *ompi_proc)
{
ompi_btl_usnic_proc_t *proc = NULL;
size_t size;
int rc;
/* Create the proc if it doesn't already exist */
proc = OBJ_NEW(ompi_btl_usnic_proc_t);
if (NULL == proc) {
return NULL;
}
/* Initialize number of peers */
proc->proc_endpoint_count = 0;
proc->proc_ompi = ompi_proc;
/* query for the peer address info */
rc = ompi_modex_recv(&mca_btl_usnic_component.super.btl_version,
ompi_proc, (void*)&proc->proc_modex,
&size);
if (OMPI_SUCCESS != rc) {
orte_show_help("help-mpi-btl-usnic.txt", "internal error during init",
true,
orte_process_info.nodename,
"<none>", 0,
"ompi_modex_recv() failed", __FILE__, __LINE__,
opal_strerror(rc));
OBJ_RELEASE(proc);
return NULL;
}
if ((size % sizeof(ompi_btl_usnic_addr_t)) != 0) {
char msg[1024];
snprintf(msg, sizeof(msg),
"sizeof(modex for peer %s data) == %d, expected multiple of %d",
ORTE_NAME_PRINT(&ompi_proc->proc_name),
(int) size, (int) sizeof(ompi_btl_usnic_addr_t));
orte_show_help("help-mpi-btl-usnic.txt", "internal error during init",
true,
orte_process_info.nodename,
"<none>", 0,
"invalid modex data", __FILE__, __LINE__,
msg);
OBJ_RELEASE(proc);
return NULL;
}
proc->proc_modex_count = size / sizeof(ompi_btl_usnic_addr_t);
if (0 == proc->proc_modex_count) {
proc->proc_endpoints = NULL;
OBJ_RELEASE(proc);
return NULL;
}
proc->proc_modex_claimed = (bool*)
calloc(proc->proc_modex_count, sizeof(bool));
if (NULL == proc->proc_modex_claimed) {
ORTE_ERROR_LOG(OMPI_ERR_OUT_OF_RESOURCE);
OBJ_RELEASE(proc);
return NULL;
}
proc->proc_endpoints = (mca_btl_base_endpoint_t**)
calloc(proc->proc_modex_count, sizeof(mca_btl_base_endpoint_t*));
if (NULL == proc->proc_endpoints) {
ORTE_ERROR_LOG(OMPI_ERR_OUT_OF_RESOURCE);
OBJ_RELEASE(proc);
return NULL;
}
return proc;
}
/*
* For a specific module, see if this proc has matching address/modex
* info. If so, return the index of the matching modex entry;
* otherwise, return -1.
*/
static int match_modex(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_proc_t *proc)
{
size_t i;
/* Each module can claim an address in the proc's modex info that
no other local module is using. See if we can find an unused
address that's on this module's subnet. */
for (i = 0; i < proc->proc_modex_count; ++i) {
if (!proc->proc_modex_claimed[i]) {
char my_ip_string[32], peer_ip_string[32];
uint32_t mynet, peernet;
inet_ntop(AF_INET, &module->if_ipv4_addr,
my_ip_string, sizeof(my_ip_string));
inet_ntop(AF_INET, &proc->proc_modex[i].ipv4_addr,
peer_ip_string, sizeof(peer_ip_string));
opal_output_verbose(5, USNIC_OUT,
"btl:usnic:match_modex: checking my IP address/subnet (%s/%d) vs. peer (%s/%d)",
my_ip_string, module->if_cidrmask,
peer_ip_string, proc->proc_modex[i].cidrmask);
/* JMS For the moment, do an abbreviated comparison. Just
compare the CIDR-masked IP address to see if they're on
the same network. If so, we're good. Need to
eventually replace this with the same type of IP
address matching that is in the TCP BTL (probably want
to move that routine down to opal/util/if.c...?) */
mynet = ompi_btl_usnic_get_ipv4_subnet(module->if_ipv4_addr,
module->if_cidrmask);
peernet = ompi_btl_usnic_get_ipv4_subnet(proc->proc_modex[i].ipv4_addr,
proc->proc_modex[i].cidrmask);
/* If we match, we're done */
if (mynet == peernet) {
opal_output_verbose(5, USNIC_OUT, "btl:usnic:match_modex: IP networks match -- yay!");
break;
}
else {
opal_output_verbose(5, USNIC_OUT, "btl:usnic:match_modex: IP networks DO NOT match -- mynet=%#x peernet=%#x",
mynet, peernet);
}
}
}
/* Return -1 on failure */
if (i >= proc->proc_modex_count) {
return -1;
}
/* If MTU does not match, throw an error */
if (proc->proc_modex[i].mtu != module->if_mtu) {
char *peer_hostname;
if (NULL != proc->proc_ompi->proc_hostname) {
peer_hostname = proc->proc_ompi->proc_hostname;
} else {
peer_hostname =
"<unknown -- please run with mpi_keep_peer_hostnames=1>";
}
orte_show_help("help-mpi-btl-usnic.txt", "MTU mismatch",
true,
orte_process_info.nodename,
ibv_get_device_name(module->device),
module->port_num,
module->if_mtu,
peer_hostname,
proc->proc_modex[i].mtu);
return -1;
}
return i;
}
/*
* Create an endpoint and claim the matched modex slot
*/
int
ompi_btl_usnic_create_endpoint(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_proc_t *proc,
ompi_btl_usnic_endpoint_t **endpoint_o)
{
int modex_index;
struct ibv_ah_attr ah_attr;
ompi_btl_usnic_endpoint_t *endpoint;
/* look for matching modex info */
modex_index = match_modex(module, proc);
if (modex_index < 0) {
opal_output_verbose(5, USNIC_OUT,
"btl:usnic:create_endpoint: did not find usnic modex info for peer %s",
ORTE_NAME_PRINT(&proc->proc_ompi->proc_name));
return OMPI_ERR_NOT_FOUND;
}
endpoint = OBJ_NEW(ompi_btl_usnic_endpoint_t);
if (NULL == endpoint) {
return OMPI_ERR_OUT_OF_RESOURCE;
}
/* Initialize the endpoint */
endpoint->endpoint_module = module;
endpoint->endpoint_remote_addr = proc->proc_modex[modex_index];
/* Create the address handle on this endpoint from the modex info.
memset to both silence valgrind warnings (since the attr struct
ends up getting written down an fd to the kernel) and actually
zero out all the fields that we don't care about / want to be
logically false. */
memset(&ah_attr, 0, sizeof(ah_attr));
ah_attr.is_global = 1;
ah_attr.port_num = 1;
ah_attr.grh.dgid = endpoint->endpoint_remote_addr.gid;
endpoint->endpoint_remote_ah = ibv_create_ah(module->pd, &ah_attr);
if (NULL == endpoint->endpoint_remote_ah) {
orte_show_help("help-mpi-btl-usnic.txt", "ibv API failed",
true,
orte_process_info.nodename,
ibv_get_device_name(module->device),
module->port_num,
"ibv_create_ah()", __FILE__, __LINE__,
"Failed to create an address handle");
OBJ_RELEASE(endpoint);
return OMPI_ERR_OUT_OF_RESOURCE;
}
/* Now claim that modex slot */
proc->proc_modex_claimed[modex_index] = true;
/* Save the endpoint on this proc's array of endpoints */
proc->proc_endpoints[proc->proc_endpoint_count] = endpoint;
endpoint->endpoint_proc_index = proc->proc_endpoint_count;
endpoint->endpoint_proc = proc;
++proc->proc_endpoint_count;
OBJ_RETAIN(proc);
/* also add endpoint to module's list of endpoints */
opal_list_append(&(module->all_endpoints),
&(endpoint->endpoint_endpoint_li));
*endpoint_o = endpoint;
return OMPI_SUCCESS;
}
/*
* If we haven't done so already, receive the modex info for the
* specified ompi_proc. Search that proc's modex info; if we can find
* matching address info, then create an endpoint.
*
* If we don't find a match, it's not an error: just return "not
* found".
*
* There is a one-to-one correspondence between an ompi_proc_t and an
* ompi_btl_usnic_proc_t instance. We cache additional data on the
* ompi_btl_usnic_proc_t: specifically, the list of
* ompi_btl_usnic_endpoint_t instances, and published addresses/modex
* info.
*/
int ompi_btl_usnic_proc_match(ompi_proc_t *ompi_proc,
ompi_btl_usnic_module_t *module,
ompi_btl_usnic_proc_t **proc)
{
/* Check if we have already created a proc structure for this peer
ompi process */
*proc = ompi_btl_usnic_proc_lookup_ompi(ompi_proc);
if (*proc != NULL) {
OBJ_RETAIN(*proc);
} else {
/* If not, go make one */
*proc = create_proc(ompi_proc);
if (NULL == *proc) {
return OMPI_ERR_NOT_FOUND;
}
}
return OMPI_SUCCESS;
}
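
Both match_modex() and ompi_btl_usnic_proc_lookup_endpoint() decide reachability by comparing CIDR-masked IPv4 addresses. The body of ompi_btl_usnic_get_ipv4_subnet() is not shown in this part of the diff, so the following is only a sketch of how such a mask-and-compare can be done, under the assumption that the address is in network byte order and the prefix length is a plain integer (as the module.h comments state). The name ipv4_subnet is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical stand-in for ompi_btl_usnic_get_ipv4_subnet():
 * returns the network part of addr_be (network byte order) as a
 * host-byte-order value, given a CIDR prefix length. */
uint32_t ipv4_subnet(uint32_t addr_be, uint32_t cidr_len)
{
    uint32_t mask;

    if (0 == cidr_len) {
        mask = 0;               /* avoid an undefined 32-bit shift */
    } else if (cidr_len >= 32) {
        mask = UINT32_MAX;
    } else {
        mask = ~((UINT32_C(1) << (32 - cidr_len)) - 1);
    }
    return ntohl(addr_be) & mask;
}
```

Under this sketch, 10.10.1.5/16 and 10.10.200.9/16 both reduce to 10.10.0.0 and are treated as the same network, while 10.11.0.1/16 is not.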

82
ompi/mca/btl/usnic/btl_usnic_proc.h Normal file

@ -0,0 +1,82 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BTL_USNIC_PROC_H
#define MCA_BTL_USNIC_PROC_H
#include "opal/class/opal_object.h"
#include "ompi/proc/proc.h"
#include "btl_usnic.h"
#include "btl_usnic_endpoint.h"
BEGIN_C_DECLS
/**
* Represents the state of a remote process and the set of addresses
* that it exports. A list of these is cached on the usnic
* component (not the usnic module).
*
* Also cache an instance of mca_btl_base_endpoint_t for each BTL
* module that attempts to open a connection to the process.
*/
typedef struct ompi_btl_usnic_proc_t {
/** allow proc to be placed on a list */
opal_list_item_t super;
/** pointer to corresponding ompi_proc_t */
ompi_proc_t *proc_ompi;
/** Addresses received via modex for this remote proc */
ompi_btl_usnic_addr_t* proc_modex;
/** Number of entries in the proc_modex array */
size_t proc_modex_count;
/** Whether the modex entry is "claimed" by a module or not */
bool *proc_modex_claimed;
/** Array of endpoints that have been created to access this proc */
struct mca_btl_base_endpoint_t **proc_endpoints;
/** Number of entries in the proc_endpoints array */
size_t proc_endpoint_count;
} ompi_btl_usnic_proc_t;
OBJ_CLASS_DECLARATION(ompi_btl_usnic_proc_t);
ompi_btl_usnic_proc_t *ompi_btl_usnic_proc_lookup_ompi(ompi_proc_t* ompi_proc);
struct ompi_btl_usnic_module_t;
ompi_btl_usnic_endpoint_t *
ompi_btl_usnic_proc_lookup_endpoint(struct ompi_btl_usnic_module_t *receiver,
uint64_t sender_hashed_orte_name);
int ompi_btl_usnic_proc_match(ompi_proc_t* ompi_proc,
struct ompi_btl_usnic_module_t *module,
ompi_btl_usnic_proc_t **proc);
int
ompi_btl_usnic_create_endpoint(struct ompi_btl_usnic_module_t *module,
ompi_btl_usnic_proc_t *proc,
ompi_btl_usnic_endpoint_t **endpoint_o);
END_C_DECLS
#endif

580
ompi/mca/btl/usnic/btl_usnic_recv.c Normal file

@ -0,0 +1,580 @@
/*
* Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2008-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2012 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "ompi_config.h"
#include <infiniband/verbs.h>
#include <unistd.h>
#include "opal_stdint.h"
#include "opal/mca/memchecker/base/base.h"
#include "ompi/constants.h"
#include "ompi/mca/btl/btl.h"
#include "ompi/mca/btl/base/base.h"
#include "ompi/mca/common/verbs/common_verbs.h"
#include "btl_usnic.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_endpoint.h"
#include "btl_usnic_module.h"
#include "btl_usnic_proc.h"
#include "btl_usnic_ack.h"
#include "btl_usnic_recv.h"
#include "btl_usnic_util.h"
/*
* Given an incoming segment, lookup the endpoint that sent it
*/
static inline ompi_btl_usnic_endpoint_t *
lookup_sender(ompi_btl_usnic_module_t *module, ompi_btl_usnic_segment_t *seg)
{
int ret;
ompi_btl_usnic_endpoint_t *sender;
/* Use the hashed ORTE process name in the BTL header to uniquely
identify the sending process (using the MAC/hardware address
only identifies the sending server -- not the sending ORTE
process). */
/* JMS Cesare suggests using a handshake before sending any data
so that instead of looking up a hash on the btl_header->sender,
echo back the ptr to the sender's ompi_proc */
ret = opal_hash_table_get_value_uint64(&module->senders,
seg->us_btl_header->sender,
(void**) &sender);
if (OPAL_LIKELY(OPAL_SUCCESS == ret)) {
return sender;
}
/* The sender wasn't in the hash table, so do a slow lookup and
put the result in the hash table */
sender = ompi_btl_usnic_proc_lookup_endpoint(module,
seg->us_btl_header->sender);
if (NULL != sender) {
opal_hash_table_set_value_uint64(&module->senders,
seg->us_btl_header->sender, sender);
return sender;
}
/* Whoa -- not found at all! */
return NULL;
}
static inline int
ompi_btl_usnic_check_rx_seq(
ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_recv_segment_t *seg,
uint32_t *window_index)
{
uint32_t i;
ompi_btl_usnic_seq_t seq;
/*
* Handle piggy-backed ACK if present
*/
if (seg->rs_base.us_btl_header->ack_seq != 0) {
#if MSGDEBUG1
opal_output(0, "Handle piggy-backed ACK seq %" UDSEQ "\n", seg->rs_base.us_btl_header->ack_seq);
#endif
ompi_btl_usnic_handle_ack(endpoint,
seg->rs_base.us_btl_header->ack_seq);
}
/* Do we have room in the endpoint's receiver window?
Receiver window:
|-------- WINDOW_SIZE ----------|
+---------------------------------+
| highest_seq_rcvd |
| somewhere in this range |
+^--------------------------------+
|
+-- next_contig_seq_to_recv: the window left edge;
will always be less than highest_seq_rcvd
The good condition is
next_contig_seq_to_recv <= seq < next_contig_seq_to_recv + WINDOW_SIZE
And the bad condition is
seq < next_contig_seq_to_recv
or
seq >= next_contig_seq_to_recv + WINDOW_SIZE
*/
seq = seg->rs_base.us_btl_header->seq;
if (seq < endpoint->endpoint_next_contig_seq_to_recv ||
seq >= endpoint->endpoint_next_contig_seq_to_recv + WINDOW_SIZE) {
#if MSGDEBUG
opal_output(0, "<-- Received FRAG/CHUNK ep %p, seq %" UDSEQ " outside of window (%" UDSEQ " - %" UDSEQ "), %p, module %p -- DROPPED\n",
(void*)endpoint, seg->rs_base.us_btl_header->seq,
endpoint->endpoint_next_contig_seq_to_recv,
(endpoint->endpoint_next_contig_seq_to_recv +
WINDOW_SIZE - 1),
(void*) seg,
(void*) endpoint->endpoint_module);
#endif
/* Stats */
if (seq < endpoint->endpoint_next_contig_seq_to_recv) {
++endpoint->endpoint_module->num_oow_low_recvs;
} else {
++endpoint->endpoint_module->num_oow_high_recvs;
}
goto dup_needs_ack;
}
/* Ok, this segment is within the receiver window. Have we
already received it? It's possible that the sender has
re-sent a segment that we've already received (but not yet
ACKed).
We have saved all un-ACKed segments in an array on the
endpoint that is the same length as the receiver's window
(i.e., WINDOW_SIZE). We can use the incoming segment sequence
number to find its position in the array. It's a little
tricky because the left edge of the receiver window keeps
moving, so we use a starting reference point in the array
that is updated when we send ACKs (and therefore move the
left edge of the receiver's window).
So this segment's index into the endpoint array is:
rel_posn_in_recv_win = seq - next_contig_seq_to_recv
array_posn = (rel_posn_in_recv_win + rfstart) % WINDOW_SIZE
rfstart is then updated when we send ACKs:
rfstart = (rfstart + num_acks_sent) % WINDOW_SIZE
*/
i = seq - endpoint->endpoint_next_contig_seq_to_recv;
i = WINDOW_SIZE_MOD(i + endpoint->endpoint_rfstart);
if (endpoint->endpoint_rcvd_segs[i]) {
#if MSGDEBUG
opal_output(0, "<-- Received FRAG/CHUNK ep %p, seq %" UDSEQ ", seg %p: duplicate -- DROPPED\n",
(void*) endpoint, seg->rs_base.us_btl_header->seq,
(void*) seg);
#endif
/* highest_seq_rcvd is for debug stats only; it's not used
in any window calculations */
assert(seq <= endpoint->endpoint_highest_seq_rcvd);
/* next_contig_seq_to_recv-1 is the ack number we'll
send */
assert (seq > endpoint->endpoint_next_contig_seq_to_recv - 1);
/* Stats */
++endpoint->endpoint_module->num_dup_recvs;
goto dup_needs_ack;
}
/* Stats: is this the highest sequence number we've received? */
if (seq > endpoint->endpoint_highest_seq_rcvd) {
endpoint->endpoint_highest_seq_rcvd = seq;
}
*window_index = i;
return true;
dup_needs_ack:
if (!endpoint->endpoint_ack_needed) {
ompi_btl_usnic_add_to_endpoints_needing_ack(endpoint);
}
return false;
}
/*
* Packet has been fully processed, update the receive window
* to indicate that it and possible following contiguous sequence
* numbers have been received.
*/
static inline void
ompi_btl_usnic_update_window(
ompi_btl_usnic_endpoint_t *endpoint,
uint32_t window_index)
{
uint32_t i;
/* Enable ACK reply if not enabled */
#if MSGDEBUG1
opal_output(0, "ep: %p, ack_needed = %s\n", endpoint, endpoint->endpoint_ack_needed?"true":"false");
#endif
if (!endpoint->endpoint_ack_needed) {
ompi_btl_usnic_add_to_endpoints_needing_ack(endpoint);
}
/* give this process a chance to send something before ACKing */
if (0 == endpoint->endpoint_acktime) {
endpoint->endpoint_acktime = get_nsec() + 50000; /* 50 usec */
}
/* Save this incoming segment in the received segments array on the
endpoint. */
/* JMS Another optimization: make rcvd_segs be a bitmask (i.e.,
more cache friendly) */
endpoint->endpoint_rcvd_segs[window_index] = true;
/* See if the leftmost segment in the receiver window is
occupied. If so, advance the window. Repeat until we hit
an unoccupied position in the window. */
i = endpoint->endpoint_rfstart;
while (endpoint->endpoint_rcvd_segs[i]) {
endpoint->endpoint_rcvd_segs[i] = false;
endpoint->endpoint_next_contig_seq_to_recv++;
i = WINDOW_SIZE_MOD(i + 1);
#if MSGDEBUG
opal_output(0, "Advance window to %u; next seq to recv %" UDSEQ, i,
endpoint->endpoint_next_contig_seq_to_recv);
#endif
}
endpoint->endpoint_rfstart = i;
}
/*
* We have received a segment, take action based on the
* packet type in the BTL header
*/
void ompi_btl_usnic_recv(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_recv_segment_t *seg,
struct ibv_recv_wr **repost_recv_head)
{
ompi_btl_usnic_segment_t *bseg;
mca_btl_active_message_callback_t* reg;
ompi_btl_usnic_endpoint_t *endpoint;
ompi_btl_usnic_btl_chunk_header_t *chunk_hdr;
uint32_t window_index;
#if MSGDEBUG1
char src_mac[32];
char dest_mac[32];
#endif
bseg = &seg->rs_base;
++module->num_total_recvs;
/* Valgrind help */
opal_memchecker_base_mem_defined((void*)(seg->rs_recv_desc.sg_list[0].addr),
seg->rs_recv_desc.sg_list[0].length);
#if MSGDEBUG1
memset(src_mac, 0, sizeof(src_mac));
memset(dest_mac, 0, sizeof(dest_mac));
ompi_btl_usnic_sprintf_gid_mac(src_mac,
&seg->rs_protocol_header->grh.sgid);
ompi_btl_usnic_sprintf_gid_mac(dest_mac,
&seg->rs_protocol_header->grh.dgid);
#if MSGDEBUG
opal_output(0, "Got message from MAC %s", src_mac);
opal_output(0, "Looking for sender: 0x%016lx",
bseg->us_btl_header->sender);
#endif
#endif
/* Find out who sent this segment */
endpoint = lookup_sender(module, bseg);
seg->rs_endpoint = endpoint;
if (FAKE_RECV_FRAG_DROP || OPAL_UNLIKELY(NULL == endpoint)) {
/* No idea who this was from, so drop it */
#if MSGDEBUG1
opal_output(0, "=== Unknown sender; dropped: from MAC %s to MAC %s, seq %" UDSEQ,
src_mac,
dest_mac,
bseg->us_btl_header->seq);
#endif
++module->num_unk_recvs;
goto repost_no_endpoint;
}
/***********************************************************************/
/* Segment is an incoming frag */
if (OMPI_BTL_USNIC_PAYLOAD_TYPE_FRAG == bseg->us_btl_header->payload_type) {
/* Is incoming sequence # ok? */
if (!ompi_btl_usnic_check_rx_seq(endpoint, seg, &window_index)) {
goto repost;
}
#if MSGDEBUG1
opal_output(0, "<-- Received FRAG ep %p, seq %" UDSEQ ", len=%d\n",
(void*) endpoint,
seg->rs_base.us_btl_header->seq,
seg->rs_base.us_btl_header->payload_len);
#if 0
opal_output(0, "<-- Received FRAG ep %p, seq %" UDSEQ " from %s to %s: GOOD! (rel seq %d, lowest seq %" UDSEQ ", highest seq: %" UDSEQ ", rwstart %d) seg %p, module %p\n",
(void*) endpoint,
seg->rs_base.us_btl_header->seq,
src_mac, dest_mac,
window_index,
endpoint->endpoint_next_contig_seq_to_recv,
endpoint->endpoint_highest_seq_rcvd,
endpoint->endpoint_rfstart,
(void*) seg, (void*) module);
if (seg->rs_base.us_btl_header->put_addr != NULL) {
opal_output(0, " put_addr = %p\n",
seg->rs_base.us_btl_header->put_addr);
}
#endif
#endif
/*
* update window before callback because callback might
* generate a send, and we'd like to piggy-back ACK if possible
*/
ompi_btl_usnic_update_window(endpoint, window_index);
/* Stats */
++module->num_frag_recvs;
/* If this is not a PUT, pass this segment up to the PML.
* Be sure to get the payload length from the BTL header because
* the L2 layer may artificially inflate (or otherwise change)
* the frame length to meet minimum sizes, add protocol information,
* etc.
*/
if (seg->rs_base.us_btl_header->put_addr == NULL) {
reg = mca_btl_base_active_message_trigger +
bseg->us_payload.pml_header->tag;
seg->rs_segment.seg_len = bseg->us_btl_header->payload_len;
reg->cbfunc(&module->super, bseg->us_payload.pml_header->tag,
&seg->rs_desc, reg->cbdata);
/*
* If this is a PUT, need to copy it to user buffer
*/
} else {
#if MSGDEBUG1
opal_output(0, "Copy %d PUT bytes to %p\n",
seg->rs_base.us_btl_header->payload_len,
seg->rs_base.us_btl_header->put_addr);
#endif
memcpy(seg->rs_base.us_btl_header->put_addr,
seg->rs_base.us_payload.raw,
seg->rs_base.us_btl_header->payload_len);
}
goto repost;
}
/***********************************************************************/
/* Segment is an incoming chunk */
if (OMPI_BTL_USNIC_PAYLOAD_TYPE_CHUNK == bseg->us_btl_header->payload_type) {
int frag_index;
ompi_btl_usnic_rx_frag_info_t *fip;
/* Is incoming sequence # ok? */
if (!ompi_btl_usnic_check_rx_seq(endpoint, seg, &window_index)) {
goto repost;
}
#if MSGDEBUG1
opal_output(0, "<-- Received CHUNK fid %d ep %p, seq %" UDSEQ " from %s to %s: GOOD! (rel seq %d, lowest seq %" UDSEQ ", highest seq: %" UDSEQ ", rwstart %d) seg %p, module %p\n",
seg->rs_base.us_btl_chunk_header->ch_frag_id,
(void*) endpoint,
seg->rs_base.us_btl_chunk_header->ch_hdr.seq,
src_mac, dest_mac,
window_index,
endpoint->endpoint_next_contig_seq_to_recv,
endpoint->endpoint_highest_seq_rcvd,
endpoint->endpoint_rfstart,
(void*) seg, (void*) module);
#endif
/* Start a new fragment if one is not already in progress
* (allocate memory, etc.). When the last byte arrives, release
* the frag_id and pass the data to the PML.
*/
chunk_hdr = seg->rs_base.us_btl_chunk_header;
frag_index = chunk_hdr->ch_frag_id % MAX_ACTIVE_FRAGS;
fip = &(endpoint->endpoint_rx_frag_info[frag_index]);
/* frag_id == 0 means this slot is empty, grab it! */
if (0 == fip->rfi_frag_id) {
fip->rfi_frag_id = chunk_hdr->ch_frag_id;
fip->rfi_frag_size = chunk_hdr->ch_frag_size;
if (chunk_hdr->ch_hdr.put_addr == NULL) {
int pool;
fip->rfi_data = NULL;
/* See which data pool this should come from,
* or if it should be malloc()ed
*/
pool = fls(chunk_hdr->ch_frag_size-1);
if (pool >= module->first_pool &&
pool <= module->last_pool) {
ompi_free_list_item_t* item;
OMPI_FREE_LIST_GET_MT(&module->module_recv_buffers[pool],
item);
if (OPAL_LIKELY(NULL != item)) {
fip->rfi_data = (char *)item;
fip->rfi_data_pool = pool;
}
}
if (fip->rfi_data == NULL) {
fip->rfi_data = malloc(chunk_hdr->ch_frag_size);
fip->rfi_data_pool = 0;
}
if (fip->rfi_data == NULL) {
abort();
}
#if MSGDEBUG2
opal_output(0, "Start large recv to %p, size=%d\n",
fip->rfi_data, chunk_hdr->ch_frag_size);
#endif
} else {
#if MSGDEBUG2
opal_output(0, "Start PUT to %p\n", chunk_hdr->ch_hdr.put_addr);
#endif
fip->rfi_data = chunk_hdr->ch_hdr.put_addr;
}
fip->rfi_bytes_left = chunk_hdr->ch_frag_size;
fip->rfi_frag_id = chunk_hdr->ch_frag_id;
/* frag_id is not 0 - it must match, drop if not */
} else if (fip->rfi_frag_id != chunk_hdr->ch_frag_id) {
++module->num_badfrag_recvs;
goto repost;
}
#if MSGDEBUG1
opal_output(0, "put_addr=%p, copy_addr=%p, off=%d\n",
chunk_hdr->ch_hdr.put_addr,
fip->rfi_data+chunk_hdr->ch_frag_offset,
chunk_hdr->ch_frag_offset);
#endif
/* Stats */
++module->num_chunk_recvs;
/* validate offset and len to be within fragment */
assert(chunk_hdr->ch_frag_offset + chunk_hdr->ch_hdr.payload_len <=
fip->rfi_frag_size);
assert(fip->rfi_frag_size == chunk_hdr->ch_frag_size);
/* copy the data into place */
memcpy(fip->rfi_data + chunk_hdr->ch_frag_offset, (char *)(chunk_hdr+1),
chunk_hdr->ch_hdr.payload_len);
/* update sliding window */
ompi_btl_usnic_update_window(endpoint, window_index);
fip->rfi_bytes_left -= chunk_hdr->ch_hdr.payload_len;
if (0 == fip->rfi_bytes_left) {
mca_btl_base_header_t *pml_header;
mca_btl_base_descriptor_t desc;
mca_btl_base_segment_t segment;
/* Get access to PML header in assembled fragment so we
* can pull out the tag
*/
pml_header = (mca_btl_base_header_t *)(fip->rfi_data);
segment.seg_addr.pval = pml_header;
segment.seg_len = fip->rfi_frag_size;
desc.des_dst = &segment;
desc.des_dst_cnt = 1;
/* only pass up to the PML if this was not a PUT */
if (chunk_hdr->ch_hdr.put_addr == NULL) {
/* Pass this segment up to the PML */
#if MSGDEBUG2
opal_output(0, " large FRAG complete, pass up %p, %d bytes, tag=%d\n",
desc.des_dst->seg_addr.pval, desc.des_dst->seg_len,
pml_header->tag);
#endif
reg = mca_btl_base_active_message_trigger + pml_header->tag;
/* mca_pml_ob1_recv_frag_callback_frag() */
reg->cbfunc(&module->super, pml_header->tag,
&desc, reg->cbdata);
/* free temp buffer for non-put */
if (0 == fip->rfi_data_pool) {
free(fip->rfi_data);
} else {
OMPI_FREE_LIST_RETURN_MT(
&module->module_recv_buffers[fip->rfi_data_pool],
(ompi_free_list_item_t *)fip->rfi_data);
}
#if MSGDEBUG2
} else {
opal_output(0, "PUT complete, suppressing callback\n");
#endif
}
/* release the fragment ID */
fip->rfi_frag_id = 0;
/* force immediate ACK */
endpoint->endpoint_acktime = 0;
}
goto repost;
}
/***********************************************************************/
/* Frag is an incoming ACK */
else if (OPAL_LIKELY(OMPI_BTL_USNIC_PAYLOAD_TYPE_ACK ==
bseg->us_btl_header->payload_type)) {
ompi_btl_usnic_seq_t ack_seq;
/* sequence being ACKed */
ack_seq = bseg->us_btl_header->ack_seq;
/* Stats */
++module->num_ack_recvs;
#if MSGDEBUG1
opal_output(0, " Received ACK for sequence number %" UDSEQ " from %s to %s\n",
bseg->us_btl_header->ack_seq, src_mac, dest_mac);
#endif
ompi_btl_usnic_handle_ack(endpoint, ack_seq);
goto repost;
}
/***********************************************************************/
/* Have no idea what the frag is; drop it */
else {
++module->num_unk_recvs;
opal_output(0, "==========================unknown 2");
goto repost;
}
/***********************************************************************/
repost:
/* if endpoint exiting, and all ACKs received, release the endpoint */
if (endpoint->endpoint_exiting && ENDPOINT_DRAINED(endpoint)) {
OBJ_RELEASE(endpoint);
}
repost_no_endpoint:
++module->num_recv_reposts;
/* Add recv to linked list for reposting */
seg->rs_recv_desc.next = *repost_recv_head;
*repost_recv_head = &seg->rs_recv_desc;
}
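
The window check and slide performed by ompi_btl_usnic_check_rx_seq() and ompi_btl_usnic_update_window() above can be sketched in isolation. This is only an illustration, not the BTL's code: every `sketch_` name, the struct layout, and the window size of 8 are assumptions made for the example (the real WINDOW_SIZE and endpoint fields are defined elsewhere in the BTL).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical window size for this sketch only. */
#define SKETCH_WINDOW_SIZE 8u

typedef struct {
    uint64_t next_contig_seq_to_recv;   /* left edge of the window */
    uint32_t rfstart;                   /* ring slot holding the left edge */
    bool     rcvd[SKETCH_WINDOW_SIZE];  /* true if that slot was received */
} sketch_rx_window_t;

/* Return true and set *slot when seq is inside the window and has not
   been seen yet; return false for out-of-window or duplicate segments. */
static bool sketch_check_rx_seq(const sketch_rx_window_t *w,
                                uint64_t seq, uint32_t *slot)
{
    uint32_t i;

    if (seq < w->next_contig_seq_to_recv ||
        seq >= w->next_contig_seq_to_recv + SKETCH_WINDOW_SIZE) {
        return false;                   /* outside the window */
    }
    /* position relative to the left edge, rotated by rfstart */
    i = (uint32_t)(seq - w->next_contig_seq_to_recv);
    i = (i + w->rfstart) % SKETCH_WINDOW_SIZE;
    if (w->rcvd[i]) {
        return false;                   /* duplicate */
    }
    *slot = i;
    return true;
}

/* Mark a slot as received, then advance the left edge over every
   contiguous received slot. */
static void sketch_update_window(sketch_rx_window_t *w, uint32_t slot)
{
    uint32_t i;

    assert(slot < SKETCH_WINDOW_SIZE);
    w->rcvd[slot] = true;
    i = w->rfstart;
    while (w->rcvd[i]) {                /* at most SKETCH_WINDOW_SIZE steps */
        w->rcvd[i] = false;
        w->next_contig_seq_to_recv++;
        i = (i + 1) % SKETCH_WINDOW_SIZE;
    }
    w->rfstart = i;
}
```

With the left edge at 10, receiving 11 first leaves a hole, so the edge stays at 10; once 10 arrives the edge jumps to 12 and rfstart moves to slot 2.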

23
ompi/mca/btl/usnic/btl_usnic_recv.h Normal file

@ -0,0 +1,23 @@
/*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_USNIC_RECV_H
#define BTL_USNIC_RECV_H
#include <infiniband/verbs.h>
#include "btl_usnic.h"
#include "btl_usnic_frag.h"
void ompi_btl_usnic_recv(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_recv_segment_t *rseg,
struct ibv_recv_wr **repost_recv_head);
#endif /* BTL_USNIC_RECV_H */
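
The chunk path of ompi_btl_usnic_recv() reassembles a large fragment from chunks that carry a fragment ID, the total fragment size, and an offset. A minimal sketch of that bookkeeping follows, under stated assumptions: the `sketch_` names and the struct are invented for illustration, the buffer always comes from malloc() (no free-list pools, no PUT path), and invalid offsets are reported with an error code rather than an assert. Like the original, it relies on the sliding window to deliver each chunk at most once.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t frag_id;     /* 0 means this slot is free */
    uint32_t frag_size;   /* total size of the fragment being assembled */
    uint32_t bytes_left;  /* bytes still missing */
    char    *data;        /* reassembly buffer */
} sketch_rx_frag_t;

/* Copy one chunk into place.
   Returns 1 when the fragment is complete, 0 when more chunks are
   expected, and -1 when the chunk is dropped (wrong fragment ID,
   bad offset/length, or allocation failure). */
static int sketch_add_chunk(sketch_rx_frag_t *fip, uint32_t frag_id,
                            uint32_t frag_size, uint32_t offset,
                            const void *payload, uint32_t len)
{
    assert(frag_id != 0 && frag_size > 0);

    if (0 == fip->frag_id) {
        /* free slot: start a new fragment */
        fip->data = malloc(frag_size);
        if (NULL == fip->data) {
            return -1;
        }
        fip->frag_id = frag_id;
        fip->frag_size = frag_size;
        fip->bytes_left = frag_size;
    } else if (fip->frag_id != frag_id || fip->frag_size != frag_size) {
        return -1;                      /* slot busy with another fragment */
    }

    /* validate that offset and length stay within the fragment */
    if (offset > fip->frag_size || len > fip->frag_size - offset ||
        len > fip->bytes_left) {
        return -1;
    }
    memcpy(fip->data + offset, payload, len);
    fip->bytes_left -= len;
    return (0 == fip->bytes_left) ? 1 : 0;
}

/* Free the buffer and mark the slot as available again. */
static void sketch_release_frag(sketch_rx_frag_t *fip)
{
    free(fip->data);
    memset(fip, 0, sizeof(*fip));
}
```

Chunks may arrive in any order; the fragment is complete when the byte count reaches zero, at which point the caller would hand the buffer to the upper layer and release the slot.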

199
ompi/mca/btl/usnic/btl_usnic_send.c Normal file

@ -0,0 +1,199 @@
/*
* Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2011 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Sandia National Laboratories. All rights
* reserved.
* Copyright (c) 2008-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2012 Los Alamos National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "ompi_config.h"
#include <infiniband/verbs.h>
#include "opal_stdint.h"
#include "ompi/constants.h"
#include "ompi/mca/btl/btl.h"
#include "ompi/mca/btl/base/base.h"
#include "ompi/mca/common/verbs/common_verbs.h"
#include "btl_usnic.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_util.h"
#include "btl_usnic_send.h"
#include "btl_usnic_ack.h"
/*
* This function is called when a send of a full-fragment segment completes
* Return the WQE and also return the segment if no ACK pending
*/
void
ompi_btl_usnic_frag_send_complete(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_send_segment_t *sseg)
{
ompi_btl_usnic_send_frag_t *frag;
frag = sseg->ss_parent_frag;
/* Reap a frag that was sent */
--sseg->ss_send_posted;
--frag->sf_seg_post_cnt;
/* checks for returnability made inside */
ompi_btl_usnic_send_frag_return_cond(module, frag);
/* do bookkeeping */
++frag->sf_endpoint->endpoint_send_credits;
++module->mod_channels[sseg->ss_channel].sd_wqe;
/* see if this endpoint needs to be made ready-to-send */
ompi_btl_usnic_check_rts(frag->sf_endpoint);
}
/*
* This function is called when a send segment completes
* Return the WQE and also return the segment if no ACK pending
*/
void
ompi_btl_usnic_chunk_send_complete(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_send_segment_t *sseg)
{
ompi_btl_usnic_send_frag_t *frag;
frag = sseg->ss_parent_frag;
/* Reap a frag that was sent */
--sseg->ss_send_posted;
--frag->sf_seg_post_cnt;
if (sseg->ss_send_posted == 0 && !sseg->ss_ack_pending) {
ompi_btl_usnic_chunk_segment_return(module, sseg);
}
/* done with whole fragment? */
/* checks for returnability made inside */
ompi_btl_usnic_send_frag_return_cond(module, frag);
/* do bookkeeping */
++frag->sf_endpoint->endpoint_send_credits;
++module->mod_channels[sseg->ss_channel].sd_wqe;
/* see if this endpoint needs to be made ready-to-send */
ompi_btl_usnic_check_rts(frag->sf_endpoint);
}
/*
* This routine handles the non-fastpath part of usnic_send().
* The reason it is here is to prevent it getting inlined with
* the rest of the function.
*/
int
ompi_btl_usnic_send_slower(
ompi_btl_usnic_module_t *module,
ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_send_frag_t *frag,
mca_btl_base_tag_t tag)
{
int rc;
ompi_btl_usnic_small_send_frag_t *sfrag;
ompi_btl_usnic_send_segment_t *sseg;
/*
* If this is small, need to do the copyin now.
* We don't do this earlier in case we got lucky and were
* able to do an inline send. We did not, so here we are...
*/
if (frag->sf_base.uf_type == OMPI_BTL_USNIC_FRAG_SMALL_SEND) {
sfrag = (ompi_btl_usnic_small_send_frag_t *)frag;
sseg = &sfrag->ssf_segment;
/*
* copy in user data if there is any, collapsing 2 segments into 1
*/
if (frag->sf_base.uf_base.des_src_cnt > 1) {
/* If not convertor, just copy */
if (frag->sf_convertor == NULL) {
memcpy(((char *)frag->sf_base.uf_src_seg[0].seg_addr.lval +
frag->sf_base.uf_src_seg[0].seg_len),
frag->sf_base.uf_src_seg[1].seg_addr.pval,
frag->sf_base.uf_src_seg[1].seg_len);
/* convertor needed, do the pack */
} else {
struct iovec iov;
uint32_t iov_count;
size_t max_data;
/* put user data just after end of 1st seg (PML header) */
iov.iov_len = frag->sf_base.uf_src_seg[1].seg_len;
iov.iov_base = (IOVBASE_TYPE*)
(frag->sf_base.uf_src_seg[0].seg_addr.lval +
frag->sf_base.uf_src_seg[0].seg_len);
iov_count = 1;
max_data = iov.iov_len;
rc = opal_convertor_pack(frag->sf_convertor,
&iov, &iov_count, &max_data);
if (OPAL_UNLIKELY(rc < 0)) {
ompi_btl_usnic_send_frag_return_cond(module, frag);
abort(); /* XXX */
}
}
/* update 1st segment length */
frag->sf_base.uf_base.des_src_cnt = 1;
frag->sf_base.uf_src_seg[0].seg_len +=
frag->sf_base.uf_src_seg[1].seg_len;
}
/* set up VERBS SG list */
sseg->ss_send_desc.num_sge = 1;
sseg->ss_base.us_sg_entry[0].length =
sizeof(ompi_btl_usnic_btl_header_t) + frag->sf_size;
/* use standard channel */
sseg->ss_channel = USNIC_DATA_CHANNEL;
#if MSGDEBUG2
opal_output(0, " small frag %d segs %p(%d) + %p(%d)\n",
(int)frag->sf_base.uf_base.des_src_cnt,
frag->sf_base.uf_src_seg[0].seg_addr.pval,
(int)frag->sf_base.uf_src_seg[0].seg_len,
frag->sf_base.uf_src_seg[1].seg_addr.pval,
(int)frag->sf_base.uf_src_seg[1].seg_len);
opal_output(0, " inline seg %d segs %p(%d) + %p(%d)\n",
sseg->ss_send_desc.num_sge,
(void *)sseg->ss_send_desc.sg_list[0].addr,
sseg->ss_send_desc.sg_list[0].length,
(void *)sseg->ss_send_desc.sg_list[1].addr,
sseg->ss_send_desc.sg_list[1].length);
#endif
}
/* queue this fragment into the send engine */
rc = ompi_btl_usnic_endpoint_enqueue_frag(endpoint, frag);
frag->sf_base.uf_base.des_flags |= MCA_BTL_DES_SEND_ALWAYS_CALLBACK;
/* Stats */
++(((ompi_btl_usnic_module_t*)module)->pml_module_sends);
return rc;
}
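
The copy-in step of ompi_btl_usnic_send_slower() collapses two source segments (PML header, then user data) into the first segment's buffer so that a single scatter/gather entry can be posted. A rough sketch of the non-convertor case is shown here; the `sketch_` names, the struct, and the explicit `capacity` argument are assumptions added for the example and are not part of the BTL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    void  *addr;   /* start of the segment */
    size_t len;    /* bytes used in the segment */
} sketch_seg_t;

/* Append seg[1] directly after seg[0] inside seg[0]'s buffer.
   capacity is the total size of the buffer behind seg[0].
   Returns 0 on success (or if there is nothing to collapse) and
   -1 if the combined data would not fit. */
static int sketch_collapse_segments(sketch_seg_t seg[2], size_t *seg_cnt,
                                    size_t capacity)
{
    assert(seg != NULL && seg_cnt != NULL);

    if (*seg_cnt < 2) {
        return 0;                       /* already a single segment */
    }
    if (seg[0].len > capacity || seg[1].len > capacity - seg[0].len) {
        return -1;                      /* would overflow the buffer */
    }
    /* put user data just after the end of the first segment */
    memcpy((char *)seg[0].addr + seg[0].len, seg[1].addr, seg[1].len);

    /* update the first segment's length and the segment count */
    seg[0].len += seg[1].len;
    *seg_cnt = 1;
    return 0;
}
```

The bounds check is an addition for the sketch; the original code relies on small-send fragments being sized so that header plus payload always fit.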

257
ompi/mca/btl/usnic/btl_usnic_send.h Normal file

@ -0,0 +1,257 @@
/*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_USNIC_SEND_H
#define BTL_USNIC_SEND_H
#include <infiniband/verbs.h>
#include "btl_usnic.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_ack.h"
#if MSGDEBUG1
#include "btl_usnic_util.h"
#endif
/*
* Check if conditions are right, and if so, put endpoint on
* list of endpoints that have sends to be done
*/
static inline void
ompi_btl_usnic_check_rts(
ompi_btl_usnic_endpoint_t *endpoint)
{
/*
* If endpoint not already ready,
* and has packets to send,
* and it has send credits,
* and its retransmission window is open,
* make it ready
*/
if (!endpoint->endpoint_ready_to_send &&
!opal_list_is_empty(&endpoint->endpoint_frag_send_queue) &&
endpoint->endpoint_send_credits > 0 &&
WINDOW_OPEN(endpoint)) {
opal_list_append(&endpoint->endpoint_module->endpoints_with_sends,
&endpoint->super);
endpoint->endpoint_ready_to_send = true;
#if MSGDEBUG1
opal_output(0, "make endpoint %p RTS\n", endpoint);
} else {
opal_output(0, "rts:%d empty:%d cred:%d open:%d\n",
endpoint->endpoint_ready_to_send,
opal_list_is_empty(&endpoint->endpoint_frag_send_queue),
endpoint->endpoint_send_credits,
WINDOW_OPEN(endpoint));
#endif
}
}
/*
* Common point for posting a segment to VERBS
*/
static inline void
ompi_btl_usnic_post_segment(
ompi_btl_usnic_module_t *module,
ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_send_segment_t *sseg)
{
struct ibv_send_wr *bad_wr;
ompi_btl_usnic_channel_t *channel;
struct ibv_send_wr *wr;
int ret;
#if MSGDEBUG1
opal_output(0, "post_send: type=%d, addr=%p, len=%d\n",
sseg->ss_base.us_type,
(void*) sseg->ss_send_desc.sg_list->addr,
sseg->ss_send_desc.sg_list->length);
ompi_btl_usnic_dump_hex((void *)(sseg->ss_send_desc.sg_list->addr + sizeof(ompi_btl_usnic_btl_header_t)), 16);
#endif
/* set target address */
wr = &sseg->ss_send_desc;
wr->wr.ud.ah = endpoint->endpoint_remote_ah;
/* get channel and remote QPN */
channel = &module->mod_channels[sseg->ss_channel];
wr->wr.ud.remote_qpn =
endpoint->endpoint_remote_addr.qp_num[sseg->ss_channel];
/* Post segment to the NIC */
ret = ibv_post_send(channel->qp, &sseg->ss_send_desc, &bad_wr);
if (OPAL_UNLIKELY(0 != ret)) {
ompi_btl_usnic_util_abort("ibv_post_send() failed",
__FILE__, __LINE__, ret);
/* Never returns */
}
/* track the number of times non-ACKs are posted */
if (sseg->ss_base.us_type != OMPI_BTL_USNIC_SEG_ACK) {
++sseg->ss_send_posted;
++sseg->ss_parent_frag->sf_seg_post_cnt;
}
/* consume a WQE */
--channel->sd_wqe;
/* Stats */
++module->num_total_sends;
++channel->num_channel_sends;
}
/*
* Post a send to the verbs work queue
*/
static inline void
ompi_btl_usnic_endpoint_send_segment(
ompi_btl_usnic_module_t *module,
ompi_btl_usnic_send_segment_t *sseg)
{
ompi_btl_usnic_send_frag_t *frag;
ompi_btl_usnic_endpoint_t *endpoint;
uint16_t sfi;
frag = sseg->ss_parent_frag;
endpoint = frag->sf_endpoint;
/* Do we have room in the endpoint's sender window?
Sender window:
|-------- WINDOW_SIZE ----------|
+---------------------------------+
| next_seq_to_send |
| somewhere in this range |
^+---------------------------------+
|
+-- ack_seq_rcvd: one less than the window left edge
Assuming that next_seq_to_send is > ack_seq_rcvd (verified
by assert), then the good condition to send is:
next_seq_to_send <= ack_seq_rcvd + WINDOW_SIZE
And therefore the bad condition is
next_seq_to_send > ack_seq_rcvd + WINDOW_SIZE
*/
assert(endpoint->endpoint_next_seq_to_send >
endpoint->endpoint_ack_seq_rcvd);
assert(WINDOW_OPEN(endpoint));
/* Assign sequence number and increment */
sseg->ss_base.us_btl_header->seq = endpoint->endpoint_next_seq_to_send++;
/* Fill in remote address to indicate PUT or not */
sseg->ss_base.us_btl_header->put_addr =
frag->sf_base.uf_dst_seg[0].seg_addr.pval;
/* piggy-back an ACK if needed */
ompi_btl_usnic_piggyback_ack(endpoint, sseg);
#if MSGDEBUG1
{
uint8_t mac[6];
char mac_str1[128];
char mac_str2[128];
ompi_btl_usnic_sprintf_mac(mac_str1, module->local_addr.mac);
ompi_btl_usnic_gid_to_mac(&endpoint->endpoint_remote_addr.gid, mac);
ompi_btl_usnic_sprintf_mac(mac_str2, mac);
opal_output(0, "--> Sending %s: seq: %" UDSEQ ", sender: 0x%016lx from device %s MAC %s, qp %u, seg %p, room %d, wc len %u, remote MAC %s, qp %u",
(sseg->ss_parent_frag->sf_base.uf_type == OMPI_BTL_USNIC_FRAG_LARGE_SEND)?
"CHUNK" : "FRAG",
sseg->ss_base.us_btl_header->seq,
sseg->ss_base.us_btl_header->sender,
endpoint->endpoint_module->device->name,
mac_str1, module->local_addr.qp_num[sseg->ss_channel],
sseg, sseg->ss_hotel_room,
sseg->ss_base.us_sg_entry[0].length,
mac_str2, endpoint->endpoint_remote_addr.qp_num[sseg->ss_channel]);
}
#endif
/* do the actual send */
ompi_btl_usnic_post_segment(module, endpoint, sseg);
/* Track this header by stashing in an array on the endpoint that
is the same length as the sender's window (i.e., WINDOW_SIZE).
To find a unique slot in this array, use (seq % WINDOW_SIZE).
*/
sfi = WINDOW_SIZE_MOD(sseg->ss_base.us_btl_header->seq);
endpoint->endpoint_sent_segs[sfi] = sseg;
sseg->ss_ack_pending = true;
/* bookkeeping */
--endpoint->endpoint_send_credits;
/* Stats */
if (sseg->ss_parent_frag->sf_base.uf_type
== OMPI_BTL_USNIC_FRAG_LARGE_SEND) {
++module->num_chunk_sends;
} else {
++module->num_frag_sends;
}
/* If we have room in the sender's window, we also have room in
endpoint hotel */
opal_hotel_checkin_with_res(&endpoint->endpoint_hotel, sseg,
&sseg->ss_hotel_room);
}
/*
* This enqueues a fragment send into the system. A send of a fragment
* may result in the sending of multiple segments
*/
static inline int
ompi_btl_usnic_endpoint_enqueue_frag(
ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_send_frag_t *frag)
{
ompi_btl_usnic_module_t *module;
module = endpoint->endpoint_module;
#if MSGDEBUG1
opal_output(0, "enq_frag: frag=%p, endpoint=%p, type=%d, len=%d\n",
frag, endpoint, frag->sf_base.uf_type,
frag->sf_base.uf_base.des_src->seg_len);
if (frag->sf_base.uf_type == OMPI_BTL_USNIC_FRAG_LARGE_SEND) {
ompi_btl_usnic_large_send_frag_t *lfrag;
lfrag = (ompi_btl_usnic_large_send_frag_t *)frag;
opal_output(0, " large size=%d\n", lfrag->lsf_base.sf_size);
}
#endif
/* add to tail of in-progress list */
opal_list_append(&endpoint->endpoint_frag_send_queue,
&frag->sf_base.uf_base.super.super);
/* possibly make this endpoint ready to send again */
ompi_btl_usnic_check_rts(endpoint);
/* post sends now if space available */
ompi_btl_usnic_module_progress_sends(module);
return OMPI_SUCCESS;
}
void ompi_btl_usnic_frag_complete(ompi_btl_usnic_send_frag_t *frag);
void ompi_btl_usnic_frag_send_complete(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_send_segment_t *sseg);
void ompi_btl_usnic_chunk_send_complete(ompi_btl_usnic_module_t *module,
ompi_btl_usnic_send_segment_t *sseg);
int ompi_btl_usnic_send_slower( ompi_btl_usnic_module_t *module,
ompi_btl_usnic_endpoint_t *endpoint,
ompi_btl_usnic_send_frag_t *frag,
mca_btl_base_tag_t tag);
#endif /* BTL_USNIC_SEND_H */
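
The sender-window rule and the (seq % WINDOW_SIZE) slot tracking described in ompi_btl_usnic_endpoint_send_segment() can be illustrated with a small stand-alone sketch. Everything named `sketch_` is hypothetical, the window size of 4 is chosen only for the example, and the "open" test follows the condition written in the comment above (next_seq_to_send <= ack_seq_rcvd + WINDOW_SIZE); the real WINDOW_OPEN macro is defined elsewhere and may differ. ACKs are treated as cumulative, as in the receive path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical window size for this sketch only. */
#define SKETCH_TX_WINDOW 4u

typedef struct {
    uint64_t    next_seq_to_send;        /* next sequence number to assign */
    uint64_t    ack_seq_rcvd;            /* highest cumulative ACK received */
    const void *sent[SKETCH_TX_WINDOW];  /* un-ACKed segments, by seq % size */
} sketch_tx_window_t;

/* True if another segment may be sent without overrunning the window. */
static bool sketch_window_open(const sketch_tx_window_t *w)
{
    return w->next_seq_to_send <= w->ack_seq_rcvd + SKETCH_TX_WINDOW;
}

/* Assign a sequence number to seg and remember it until it is ACKed.
   Returns false (and sends nothing) if the window is closed. */
static bool sketch_send(sketch_tx_window_t *w, const void *seg,
                        uint64_t *seq_out)
{
    uint64_t seq;

    if (!sketch_window_open(w)) {
        return false;
    }
    assert(w->next_seq_to_send > w->ack_seq_rcvd);
    seq = w->next_seq_to_send++;
    /* at most SKETCH_TX_WINDOW segments are in flight, so this slot
       cannot still hold an older un-ACKed segment */
    w->sent[seq % SKETCH_TX_WINDOW] = seg;
    *seq_out = seq;
    return true;
}

/* Cumulative ACK: everything up to and including ack_seq is released. */
static void sketch_handle_ack(sketch_tx_window_t *w, uint64_t ack_seq)
{
    if (ack_seq >= w->next_seq_to_send) {
        return;                          /* ACK for something never sent */
    }
    while (w->ack_seq_rcvd < ack_seq) {
        ++w->ack_seq_rcvd;
        w->sent[w->ack_seq_rcvd % SKETCH_TX_WINDOW] = NULL;
    }
}
```

Starting from sequence 1 with nothing ACKed, four sends fill the window and a fifth is refused; an ACK for sequence 2 frees two slots, and the next send reuses slot 1 for sequence 5.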

196
ompi/mca/btl/usnic/btl_usnic_util.c Normal file

@ -0,0 +1,196 @@
/*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "ompi_config.h"
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>
#include "orte/mca/errmgr/errmgr.h"
#include "orte/util/show_help.h"
#include "ompi/constants.h"
#include "btl_usnic_util.h"
#include "opal/util/if.h"
void ompi_btl_usnic_exit(void)
{
orte_errmgr.abort(1, NULL);
/* If the error manager returns, wait to be killed */
while (1) {
sleep(99999);
}
}
void
ompi_btl_usnic_dump_hex(uint8_t *addr, int len)
{
char buf[128];
size_t bufspace;
int i, ret;
char *p;
uint32_t sum=0;
p = buf;
memset(buf, 0, sizeof(buf));
bufspace = sizeof(buf) - 1;
for (i=0; i<len; ++i) {
ret = snprintf(p, bufspace, "%02x ", addr[i]);
p += ret;
bufspace -= ret;
sum += addr[i];
if ((i&15) == 15) {
opal_output(0, "%4x: %s\n", i&~15, buf);
p = buf;
memset(buf, 0, sizeof(buf));
bufspace = sizeof(buf) - 1;
}
}
if ((i&15) != 0) {
opal_output(0, "%4x: %s\n", i&~15, buf);
}
/*opal_output(0, "buffer sum = %x\n", sum); */
}
void ompi_btl_usnic_sprintf_mac(char *out, const uint8_t mac[6])
{
snprintf(out, 32, "%02x:%02x:%02x:%02x:%02x:%02x",
mac[0],
mac[1],
mac[2],
mac[3],
mac[4],
mac[5]);
}
void ompi_btl_usnic_sprintf_gid_mac(char *out, union ibv_gid *gid)
{
uint8_t mac[6];
ompi_btl_usnic_gid_to_mac(gid, mac);
ompi_btl_usnic_sprintf_mac(out, mac);
}
int ompi_btl_usnic_find_ip(ompi_btl_usnic_module_t *module, uint8_t mac[6])
{
int i;
uint8_t localmac[6];
char addr_string[32], mac_string[32];
struct sockaddr sa;
struct sockaddr_in *sai;
/* Loop through all IP interfaces looking for the one with the
right MAC */
for (i = opal_ifbegin(); i != -1; i = opal_ifnext(i)) {
if (OPAL_SUCCESS == opal_ifindextomac(i, localmac)) {
/* Is this the MAC I'm looking for? */
if (0 != memcmp(mac, localmac, 6)) {
continue;
}
/* Yes, it is! */
if (OPAL_SUCCESS != opal_ifindextoname(i, module->if_name,
sizeof(module->if_name)) ||
OPAL_SUCCESS != opal_ifindextoaddr(i, &sa, sizeof(sa)) ||
OPAL_SUCCESS != opal_ifindextomask(i, &module->if_cidrmask,
sizeof(module->if_cidrmask)) ||
OPAL_SUCCESS != opal_ifindextomac(i, module->if_mac) ||
OPAL_SUCCESS != opal_ifindextomtu(i, &module->if_mtu)) {
continue;
}
sai = (struct sockaddr_in *) &sa;
memcpy(&module->if_ipv4_addr, &sai->sin_addr, 4);
/* Save this information to my local address field on the
module so that it gets sent in the modex */
module->local_addr.ipv4_addr = module->if_ipv4_addr;
module->local_addr.cidrmask = module->if_cidrmask;
/* Since verbs doesn't offer a way to get standard
Ethernet MTUs (as of libibverbs 1.1.5, the MTUs are
enums, and don't include values for 1500 or 9000), look
up the MTU in the corresponding enic interface.
Subtract 40 off the MTU value so that we provide enough
space for the GRH on the remote side. */
module->if_mtu -= 40;
module->local_addr.mtu = module->if_mtu;
inet_ntop(AF_INET, &(module->if_ipv4_addr),
addr_string, sizeof(addr_string));
ompi_btl_usnic_sprintf_mac(mac_string, mac);
opal_output_verbose(5, mca_btl_base_verbose,
"btl:usnic: found usNIC device corresponds to IP device %s, %s/%d, MAC %s",
module->if_name, addr_string, module->if_cidrmask,
mac_string);
return OMPI_SUCCESS;
}
}
return OMPI_ERR_NOT_FOUND;
}
/*
* Reverses the encoding done in usnic_main.c:usnic_mac_to_gid() in
* the usnic.ko kernel code.
*
* Got this scheme from Mellanox RoCE; Emulex did the same thing. So
* we followed convention.
* http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf
*/
void ompi_btl_usnic_gid_to_mac(union ibv_gid *gid, uint8_t mac[6])
{
mac[0] = gid->raw[8] ^ 2;
mac[1] = gid->raw[9];
mac[2] = gid->raw[10];
mac[3] = gid->raw[13];
mac[4] = gid->raw[14];
mac[5] = gid->raw[15];
}
/* takes an IPv4 address in network byte order and a CIDR prefix length (the
* "X" in "a.b.c.d/X") and returns the subnet in network byte order. */
uint32_t ompi_btl_usnic_get_ipv4_subnet(uint32_t addrn, uint32_t cidr_len)
{
uint32_t mask;
assert(cidr_len <= 32);
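/* perform arithmetic in host byte order for shift correctness; a /0
prefix is special-cased because shifting a 32-bit value by 32 bits
is undefined behavior in C */
mask = (0 == cidr_len) ? 0 : (~(uint32_t) 0) << (32 - cidr_len);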
return htonl(ntohl(addrn) & mask);
}
/*
* Simple utility in a .c file, mainly so that inline functions in .h
* files don't need to include ORTE header files.
*/
void ompi_btl_usnic_util_abort(const char *msg, const char *file, int line,
int ret)
{
orte_show_help("help-mpi-btl-usnic.txt", "internal error after init",
true,
orte_process_info.nodename,
msg, file, line, strerror(ret));
orte_errmgr.abort(ret, NULL);
/* Never returns */
}

66
ompi/mca/btl/usnic/btl_usnic_util.h Normal file

@ -0,0 +1,66 @@
/*
* Copyright (c) 2013 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef BTL_USNIC_UTIL_H
#define BTL_USNIC_UTIL_H
#include "btl_usnic.h"
#include "btl_usnic_module.h"
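/* Linux kernel fls(): find last (most significant) set bit. Returns the
1-based position of that bit, or 0 if x is 0. The argument is unsigned
so that the left shifts below are well defined. */
static __always_inline int fls(unsigned int x)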
{
int r = 32;
if (!x)
return 0;
if (!(x & 0xffff0000u)) {
x <<= 16;
r -= 16;
}
if (!(x & 0xff000000u)) {
x <<= 8;
r -= 8;
}
if (!(x & 0xf0000000u)) {
x <<= 4;
r -= 4;
}
if (!(x & 0xc0000000u)) {
x <<= 2;
r -= 2;
}
if (!(x & 0x80000000u)) {
r -= 1;
}
return r;
}
/*
* Safely (but abnormally) exit this process without abort()'ing (and
* leaving a corefile).
*/
void ompi_btl_usnic_exit(void);
void ompi_btl_usnic_sprintf_mac(char *out, const uint8_t mac[6]);
void ompi_btl_usnic_sprintf_gid_mac(char *out, union ibv_gid *gid);
int ompi_btl_usnic_find_ip(ompi_btl_usnic_module_t *module, uint8_t mac[6]);
void ompi_btl_usnic_gid_to_mac(union ibv_gid *gid, uint8_t mac[6]);
void ompi_btl_usnic_dump_hex(uint8_t *addr, int len);
uint32_t ompi_btl_usnic_get_ipv4_subnet(uint32_t addrn, uint32_t cidr_len);
void ompi_btl_usnic_util_abort(const char *msg, const char *file, int line,
int ret);
#endif /* BTL_USNIC_UTIL_H */

78
ompi/mca/btl/usnic/configure.m4 Normal file

@ -0,0 +1,78 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2006 Sandia National Laboratories. All rights
# reserved.
# Copyright (c) 2010-2013 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# MCA_ompi_btl_usnic_CONFIG([action-if-can-compile],
# [action-if-cant-compile])
# ------------------------------------------------
AC_DEFUN([MCA_ompi_btl_usnic_CONFIG],[
AC_CONFIG_FILES([ompi/mca/btl/usnic/Makefile])
OMPI_CHECK_OPENFABRICS([btl_usnic],
[btl_usnic_happy="yes"],
[btl_usnic_happy="no"])
# We only want to build on Linux. We also use the clock_gettime()
# function, which conveniently only happens to exist on Linux. So
# just check for that.
AS_IF([test "$btl_usnic_happy" = "yes"],
[AC_CHECK_FUNC([clock_gettime],
[btl_usnic_happy=yes],
[btl_usnic_happy=no])
])
# Do we have the IBV_TRANSPORT_USNIC / IBV_NODE_USNIC defines?
# (note: if we have one, we have both)
btl_usnic_have_ibv_usnic=0
btl_usnic_have_ibv_event_gid_change=0
AS_IF([test "$btl_usnic_happy" = "yes"],
[AC_CHECK_DECL([IBV_NODE_USNIC],
[btl_usnic_have_ibv_usnic=1],
[],
[ #include <infiniband/verbs.h>
])
AC_CHECK_DECL([IBV_EVENT_GID_CHANGE],
[btl_usnic_have_ibv_event_gid_change=1],
[],
[ #include <infiniband/verbs.h>
])
]
)
AC_DEFINE_UNQUOTED([BTL_USNIC_HAVE_IBV_USNIC],
[$btl_usnic_have_ibv_usnic],
[Whether we have IBV_NODE_USNIC / IBV_TRANSPORT_USNIC or not])
AC_DEFINE_UNQUOTED([BTL_USNIC_HAVE_IBV_EVENT_GID_CHANGE],
[$btl_usnic_have_ibv_event_gid_change],
[Whether we have IBV_EVENT_GID_CHANGE or not])
AS_IF([test "$btl_usnic_happy" = "yes"],
[btl_usnic_WRAPPER_EXTRA_LDFLAGS="$btl_usnic_LDFLAGS"
btl_usnic_WRAPPER_EXTRA_LIBS="$btl_usnic_LIBS"
$1],
[$2])
# Substitute in the things needed to build USNIC
AC_SUBST([btl_usnic_CPPFLAGS])
AC_SUBST([btl_usnic_LDFLAGS])
AC_SUBST([btl_usnic_LIBS])
])dnl

180
ompi/mca/btl/usnic/help-mpi-btl-usnic.txt Normal file

@ -0,0 +1,180 @@
# -*- text -*-
#
# Copyright (c) 2012 Cisco Systems, Inc. All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English help file for the Open MPI usnic BTL.
#
[ibv API failed]
Open MPI failed a basic verbs operation on a Cisco usNIC device. This
is highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
In addition to any suggestions listed below, you might want to check
the Linux "memlock" limits on your system (they should probably be
"unlimited"). See this FAQ entry for details:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Open MPI will skip this device/port in the usnic BTL, which may result
in either lower performance or your job aborting.
Server: %s
Device: %s
Port: %d
Failed function: %s (%s:%d)
Description: %s
#
[not enough usnic resources]
There are not enough usNIC resources on a VIC for all the MPI
processes on this server.
This means that you have either not provisioned enough usNICs on this
VIC, or there are not enough total receive, transmit, or completion
queues on the provisioned usNICs. On each VIC in a given server, you
need to provision at least as many usNICs as MPI processes on that
server. In each usNIC, you need to provision at least two each of the
following: send queues, receive queues, and completion queues.
Open MPI will skip this device in the usnic BTL, which may result in
either lower performance or your job aborting.
Server: %s
Device: %s
Description: %s
#
[create ibv resource failed]
Open MPI failed to allocate a usNIC-related resource on a VIC. This
usually means one of two things:
1. You are running something other than this MPI job on this server
that is consuming usNIC resources.
2. You have run out of locked Linux memory. You should probably set
the Linux "memlock" limits to "unlimited". See this FAQ entry for
details:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
This Open MPI job will skip this device/port in the usnic BTL, which
may result in either lower performance or the job aborting.
Server: %s
Device: %s
Failed function: %s (%s:%d)
Description: %s
#
[async event]
Open MPI detected a fatal error on a usNIC port. Your MPI job will
now abort; sorry.
Server: %s
Device: %s
Port: %d
Async event code: %s (%d)
#
[internal error during init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
Open MPI will skip this device/port in the usnic BTL, which may result
in either lower performance or your job aborting.
Server: %s
Device: %s
Port: %d
Failure: %s (%s:%d)
Description: %s
#
[internal error after init]
An internal error has occurred in the Open MPI usNIC BTL. This is
highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
[ibv API failed after init]
Open MPI failed a basic verbs operation on a Cisco usNIC device. This
is highly unusual and shouldn't happen. It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.
Your MPI job may behave erratically, hang, and/or abort.
Server: %s
Failure: %s (%s:%d)
Description: %s
#
[verbs_port_bw failed]
Open MPI failed to query the supported bandwidth of a port on a Cisco
usNIC device. This is unusual and shouldn't happen. It suggests that
there may be something wrong with the usNIC or OpenFabrics
configuration on this server.
Open MPI will skip this device/port in the usnic BTL, which may result
in either lower performance or your job aborting.
Server: %s
Device: %s
Port: %d
#
[eager_limit too high]
The eager_limit in the usnic BTL is too high for a device that Open
MPI tried to use. The usnic BTL eager_limit value is the largest
message payload that Open MPI will send in a single datagram.
You are seeing this message because the eager_limit was set to a value
larger than the MPI message payload capacity of a single UD datagram.
The max payload size is smaller than the size of individual datagrams
because each datagram also contains MPI control metadata, meaning that
some bytes in the datagram must be reserved for overhead.
Open MPI will skip this device/port in the usnic BTL, which may result
in either lower performance or your job aborting.
Server: %s
Device: %s
Port: %d
Max payload allowed: %d
Specified eager_limit: %d
#
[check_reg_mem_basics fail]
The usNIC BTL failed to initialize while trying to register some
memory. This typically can indicate that the "memlock" limits are set
too low. For most HPC installations, the memlock limits should be set
to "unlimited". The failure occurred here:
Local host: %s
Memlock limit: %s
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
#
[invalid if_inexclude]
WARNING: An invalid value was given for btl_usnic_if_%s. This
value will be ignored.
Local host: %s
Value: %s
Message: %s
#
[MTU mismatch]
The MTU does not match on local and remote hosts. All interfaces on all
hosts participating in an MPI job must be configured with the same MTU.
The device and port listed below will not be used to communicate with this
remote host.
Local host: %s
Device/port: %s/%d
Local MTU: %d
Remote host: %s
Remote MTU: %d