/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2011 The University of Tennessee and The University
 *                         of Tennessee Research Foundation. All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart. All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006      Sandia National Laboratories. All rights
 *                         reserved.
 * Copyright (c) 2011-2015 Cisco Systems, Inc. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/**
 * @file
 */
#ifndef OPAL_BTL_USNIC_MODULE_H
#define OPAL_BTL_USNIC_MODULE_H

#include <rdma/fabric.h>
#include <rdma/fi_cm.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>

#include "opal/class/opal_pointer_array.h"

#include "btl_usnic_endpoint.h"
#include "btl_usnic_stats.h"
|
2014-12-02 13:09:46 -08:00
|
|
|
#include "btl_usnic_util.h"
|
|
|
|
|
2015-02-03 13:33:16 -08:00
|
|
|
/* When using the embedded libfabric, this file will be in
|
|
|
|
opal/mca/common/libfabric/libfabric/prov/usnic/src/fi_ext_usnic.h.
|
|
|
|
When using the external libfabric, this file will be in
|
|
|
|
rdma/fi_ext_usnic.h. */
|
|
|
|
#include OPAL_BTL_USNIC_FI_EXT_USNIC_H
/*
 * Default limits.
 *
 * These values were obtained from empirical testing on Intel E5-2690
 * machines with Sereno/Lexington cards through an N3546 switch.
 */
#define USNIC_DFLT_EAGER_LIMIT_1DEVICE (150 * 1024)
#define USNIC_DFLT_EAGER_LIMIT_NDEVICES (25 * 1024)
#define USNIC_DFLT_RNDV_EAGER_LIMIT 500
#define USNIC_DFLT_PACK_LAZY_THRESHOLD (16 * 1024)
BEGIN_C_DECLS
/*
 * Forward declarations to avoid include loops
 */
struct opal_btl_usnic_send_segment_t;
struct opal_btl_usnic_recv_segment_t;
/*
 * Abstraction of a set of endpoints
 */
typedef struct opal_btl_usnic_channel_t {
    int chan_index;

    struct fid_cq *cq;

    int chan_max_msg_size;
    int chan_rd_num;
    int chan_sd_num;

    int credits; /* RFXXX until libfab credits fixed */

    /* fastsend enabled if num_credits_available >= fastsend_wqe_thresh */
    unsigned fastsend_wqe_thresh;

    /** pointer to receive segment whose bookkeeping has been deferred */
    struct opal_btl_usnic_recv_segment_t *chan_deferred_recv;

    /** queue pair and attributes */
    struct fi_info *info;
    struct fid_ep *ep;

    struct opal_btl_usnic_recv_segment_t *repost_recv_head;

    /** receive segments & buffers */
    opal_free_list_t recv_segs;

    bool chan_error;    /* set when error detected on channel */

    /* statistics */
    uint32_t num_channel_sends;
} opal_btl_usnic_channel_t;
/**
 * usnic BTL module
 */
typedef struct opal_btl_usnic_module_t {
    mca_btl_base_module_t super;

    /* Cache for use during component_init to associate a module with
       the libfabric device that it came from. */
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fi_info *fabric_info;
    struct fi_usnic_ops_fabric *usnic_fabric_ops;
    struct fi_usnic_ops_av *usnic_av_ops;
    struct fi_usnic_info usnic_info;
    struct fid_eq *dom_eq;
    struct fid_eq *av_eq;
    struct fid_av *av;

    mca_btl_base_module_error_cb_fn_t pml_error_callback;

    /* Information about the events */
    struct event device_async_event;
    bool device_async_event_active;
    int numa_distance; /* hwloc NUMA distance from this process */

    /** local address information */
    struct opal_btl_usnic_modex_t local_modex;
    char if_ipv4_addr_str[IPV4STRADDRLEN];

    /** desired send, receive, and completion queue entries (from MCA
        params; cached here on the component because the MCA param
        might == 0, which means "max supported on that device") */
    int sd_num;
    int rd_num;
    int cq_num;
    int prio_sd_num;
    int prio_rd_num;

    /*
     * Fragments larger than max_frag_payload will be broken up into
     * multiple chunks.  The amount that can be held in a single chunk
     * segment is slightly less than what can be held in a frag segment
     * due to fragment reassembly info.
     */
    size_t max_tiny_msg_size;
    size_t max_frag_payload;   /* most that fits in a frag segment */
    size_t max_chunk_payload;  /* most that can fit in chunk segment */
    size_t max_tiny_payload;   /* threshold for using inline send */

    /** Hash table to keep track of senders */
    opal_hash_table_t senders;

    /** list of all endpoints.  Note that the main application thread
        reads and writes to this list, and the connectivity agent
        reads from it.  So all access to the list (but not the items
        in the list) must be protected by a lock.  Also, have a flag
        that indicates that the list has been constructed.  Probably
        overkill, but you can't be too safe with multi-threaded
        programming in non-performance-critical code paths... */
    opal_list_t all_endpoints;
    opal_mutex_t all_endpoints_lock;
    bool all_endpoints_constructed;

    /** array of procs used by this module (can't use a list because a
        proc can be used by multiple modules) */
    opal_pointer_array_t all_procs;

    /** send fragments & buffers */
    opal_free_list_t small_send_frags;
    opal_free_list_t large_send_frags;
    opal_free_list_t put_dest_frags;
    opal_free_list_t chunk_segs;

    /** receive buffer pools */
    int first_pool;
    int last_pool;
    opal_free_list_t *module_recv_buffers;

    /** list of endpoints with data to send */
    /* this list uses base endpoint ptr */
    opal_list_t endpoints_with_sends;

    /** list of send frags that are waiting to be resent (they were
        previously deferred because of lack of resources) */
    opal_list_t pending_resend_segs;

    /** ack segments */
    opal_free_list_t ack_segs;

    /** list of endpoints to which we need to send ACKs */
    /* this list uses endpoint->endpoint_ack_li */
    opal_list_t endpoints_that_need_acks;

    /* abstract queue-pairs into channels */
    opal_btl_usnic_channel_t mod_channels[USNIC_NUM_CHANNELS];
    /* Number of short/erroneous packets we've received on this
       interface */
    uint32_t num_short_packets;
    /* Performance / debugging statistics */
    opal_btl_usnic_module_stats_t stats;
} opal_btl_usnic_module_t;
struct opal_btl_usnic_frag_t;
extern opal_btl_usnic_module_t opal_btl_usnic_module_template;
/*
 * Manipulate the "endpoints_that_need_acks" list
 */

/* get first endpoint needing ACK */
static inline opal_btl_usnic_endpoint_t *
opal_btl_usnic_get_first_endpoint_needing_ack(
    opal_btl_usnic_module_t *module)
{
    opal_list_item_t *item;
    opal_btl_usnic_endpoint_t *endpoint;

    item = opal_list_get_first(&module->endpoints_that_need_acks);
    if (item != opal_list_get_end(&module->endpoints_that_need_acks)) {
        endpoint = container_of(item, mca_btl_base_endpoint_t, endpoint_ack_li);
        return endpoint;
    } else {
        return NULL;
    }
}
/* get next item in chain */
static inline opal_btl_usnic_endpoint_t *
opal_btl_usnic_get_next_endpoint_needing_ack(
    opal_btl_usnic_endpoint_t *endpoint)
{
    opal_list_item_t *item;
    opal_btl_usnic_module_t *module;

    module = endpoint->endpoint_module;

    item = opal_list_get_next(&(endpoint->endpoint_ack_li));
    if (item != opal_list_get_end(&module->endpoints_that_need_acks)) {
        endpoint = container_of(item, mca_btl_base_endpoint_t, endpoint_ack_li);
        return endpoint;
    } else {
        return NULL;
    }
}
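
/*
 * Illustrative usage sketch (an assumption for clarity, not code taken from
 * this BTL): a progress path could walk the list of endpoints needing ACKs
 * with the two accessors above, fetching the next element before acting on
 * the current one in case the ACK path removes it from the list:
 *
 *     opal_btl_usnic_endpoint_t *ep =
 *         opal_btl_usnic_get_first_endpoint_needing_ack(module);
 *     while (NULL != ep) {
 *         opal_btl_usnic_endpoint_t *next =
 *             opal_btl_usnic_get_next_endpoint_needing_ack(ep);
 *         ... send an ACK to ep (which may remove it from the list) ...
 *         ep = next;
 *     }
 */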
static inline void
opal_btl_usnic_remove_from_endpoints_needing_ack(
    opal_btl_usnic_endpoint_t *endpoint)
{
    opal_list_remove_item(
        &(endpoint->endpoint_module->endpoints_that_need_acks),
        &endpoint->endpoint_ack_li);
    endpoint->endpoint_ack_needed = false;
    endpoint->endpoint_acktime = 0;
#if MSGDEBUG1
    opal_output(0, "clear ack_needed on %p\n", (void*)endpoint);
#endif
}
static inline void
opal_btl_usnic_add_to_endpoints_needing_ack(
    opal_btl_usnic_endpoint_t *endpoint)
{
    opal_list_append(&(endpoint->endpoint_module->endpoints_that_need_acks),
        &endpoint->endpoint_ack_li);
    endpoint->endpoint_ack_needed = true;
#if MSGDEBUG1
    opal_output(0, "set ack_needed on %p\n", (void*)endpoint);
#endif
}
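
/*
 * Note (descriptive summary, not upstream documentation): the two helpers
 * above form a pair.  opal_btl_usnic_add_to_endpoints_needing_ack() appends
 * the endpoint to its module's endpoints_that_need_acks list and sets
 * endpoint_ack_needed; opal_btl_usnic_remove_from_endpoints_needing_ack()
 * takes it back off that list and clears endpoint_ack_needed and
 * endpoint_acktime.
 */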
/*
 * Initialize a module
 */
int opal_btl_usnic_module_init(opal_btl_usnic_module_t* module);
/*
 * Progress pending sends on a module
 */
void opal_btl_usnic_module_progress_sends(opal_btl_usnic_module_t *module);
/* opal_output statistics that are useful for debugging */
void opal_btl_usnic_print_stats(
    opal_btl_usnic_module_t *module,
    const char *prefix,
    bool reset_stats);
END_C_DECLS
#endif