/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2011 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006      Sandia National Laboratories. All rights
 *                         reserved.
 * Copyright (c) 2008-2019 Cisco Systems, Inc.  All rights reserved
 * Copyright (c) 2012-2014 Los Alamos National Security, LLC.  All rights
 *                         reserved.
 * Copyright (c) 2014      Intel, Inc. All rights reserved.
 * Copyright (c) 2015      Research Organization for Information Science
 *                         and Technology (RIST). All rights reserved.
 * Copyright (c) 2018      Amazon.com, Inc. or its affiliates.  All Rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/*
 * General notes:
 *
 * - OB1 handles out of order receives
 * - OB1 does NOT handle duplicate receives well (it probably does for
 *   MATCH tags, but for non-MATCH tags, it doesn't have enough info
 *   to know when duplicates are received), so we have to ensure not
 *   to pass duplicates up to the PML.
 */

#include "opal_config.h"

#include <string.h>
#include <ctype.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#include <rdma/fabric.h>

#include "opal_stdint.h"
#include "opal/prefetch.h"
#include "opal/mca/timer/base/base.h"
#include "opal/util/argv.h"
#include "opal/util/net.h"
#include "opal/util/if.h"
#include "opal/util/printf.h"
#include "opal/mca/base/mca_base_var.h"
#include "opal/mca/memchecker/base/base.h"
#include "opal/util/show_help.h"
#include "opal/constants.h"
#include "opal/mca/btl/btl.h"
#include "opal/mca/btl/base/base.h"
#include "opal/util/proc.h"

#include "btl_usnic.h"
#include "btl_usnic_connectivity.h"
#include "btl_usnic_frag.h"
#include "btl_usnic_endpoint.h"
#include "btl_usnic_module.h"
#include "btl_usnic_stats.h"
#include "btl_usnic_util.h"
#include "btl_usnic_ack.h"
#include "btl_usnic_send.h"
#include "btl_usnic_recv.h"
#include "btl_usnic_proc.h"
#include "btl_usnic_test.h"

#define OPAL_BTL_USNIC_NUM_COMPLETIONS 500

/* MPI_THREAD_MULTIPLE_SUPPORT */
opal_recursive_mutex_t btl_usnic_lock = OPAL_RECURSIVE_MUTEX_STATIC_INIT;

/* RNG buffer definition */
opal_rng_buff_t opal_btl_usnic_rand_buff = {{0}};

/* simulated clock */
uint64_t opal_btl_usnic_ticks = 0;

static opal_event_t usnic_clock_timer_event;
static bool usnic_clock_timer_event_set = false;
static struct timeval usnic_clock_timeout;
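
/* Added note: the simulated-clock globals above are driven by a periodic
   opal_event timer (see usnic_clock_callback() below), which bumps
   opal_btl_usnic_ticks so the BTL has a cheap time source; the exact
   consumers of the tick count (e.g., ACK/retransmit bookkeeping) are an
   assumption here, not spelled out at this point in the file. */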

/* set to true in a debugger to enable even more verbose output when calling
 * opal_btl_usnic_component_debug */
static volatile bool dump_bitvectors = false;

static int usnic_component_open(void);
static int usnic_component_close(void);
static mca_btl_base_module_t **
usnic_component_init(int* num_btl_modules, bool want_progress_threads,
                     bool want_mpi_threads);
static int usnic_component_progress(void);

/* Types for filtering interfaces */
typedef struct filter_elt_t {
    bool is_netmask;

    /* valid iff is_netmask==false */
    char *if_name;

    /* valid iff is_netmask==true */
    uint32_t addr_be;   /* in network byte order */
    uint32_t netmask_be;
} filter_elt_t;

typedef struct usnic_if_filter_t {
    int n_elt;
    filter_elt_t *elts;
} usnic_if_filter_t;

static bool filter_module(opal_btl_usnic_module_t *module,
                          usnic_if_filter_t *filter,
                          bool filter_incl);
static usnic_if_filter_t *parse_ifex_str(const char *orig_str,
                                         const char *name);
static void free_filter(usnic_if_filter_t *filter);

opal_btl_usnic_component_t mca_btl_usnic_component = {
    .super = {
        /* First, the mca_base_component_t struct containing meta information
           about the component itself */
        .btl_version = {
            USNIC_BTL_DEFAULT_VERSION("usnic"),
            .mca_open_component = usnic_component_open,
            .mca_close_component = usnic_component_close,
            .mca_register_component_params = opal_btl_usnic_component_register,
        },
        .btl_data = {
            /* The component is not checkpoint ready */
            .param_field = MCA_BASE_METADATA_PARAM_NONE
        },

        .btl_init = usnic_component_init,
        .btl_progress = usnic_component_progress,
    }
};
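
/* Added note: the MCA framework drives the component through the entry
   points registered above -- open/close when the component is loaded and
   unloaded, init to discover devices and return one module per usable
   usNIC interface, and progress from the OPAL progress loop. */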


/*
 *  Called by MCA framework to open the component
 */
static int usnic_component_open(void)
{
    /* initialize state */
    mca_btl_usnic_component.num_modules = 0;
    mca_btl_usnic_component.usnic_all_modules = NULL;
    mca_btl_usnic_component.usnic_active_modules = NULL;
    mca_btl_usnic_component.transport_header_len = -1;
    mca_btl_usnic_component.prefix_send_offset = 0;

    /* initialize objects */
    OBJ_CONSTRUCT(&mca_btl_usnic_component.usnic_procs, opal_list_t);

    /* Sanity check: if_include and if_exclude need to be mutually
       exclusive */
    if (OPAL_SUCCESS !=
        mca_base_var_check_exclusive("opal",
            mca_btl_usnic_component.super.btl_version.mca_type_name,
            mca_btl_usnic_component.super.btl_version.mca_component_name,
            "if_include",
            mca_btl_usnic_component.super.btl_version.mca_type_name,
            mca_btl_usnic_component.super.btl_version.mca_component_name,
            "if_exclude")) {
        /* Return ERR_NOT_AVAILABLE so that a warning message about
           "open" failing is not printed */
        return OPAL_ERR_NOT_AVAILABLE;
    }

    return OPAL_SUCCESS;
}
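
/* Added illustration (hypothetical values): specifying both parameters,
   e.g. "--mca btl_usnic_if_include eth4 --mca btl_usnic_if_exclude eth5",
   trips the exclusivity check above and the component declines to open. */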


/*
 * Component cleanup
 */
static int usnic_component_close(void)
{
    /* Note that this list should already be empty, because:
       - module.finalize() is invoked before component.close()
       - module.finalize() RELEASEs each proc that it was using
       - this should drive down the ref count on procs to 0
       - procs remove themselves from the component.usnic_procs list
         in their destructor */
    OBJ_DESTRUCT(&mca_btl_usnic_component.usnic_procs);

    if (usnic_clock_timer_event_set) {
        opal_event_del(&usnic_clock_timer_event);
        usnic_clock_timer_event_set = false;
    }

    /* Finalize the connectivity client and agent */
    if (mca_btl_usnic_component.connectivity_enabled) {
        opal_btl_usnic_connectivity_client_finalize();
        opal_btl_usnic_connectivity_agent_finalize();
    }
    if (mca_btl_usnic_component.opal_evbase) {
        opal_progress_thread_finalize(NULL);
    }

    free(mca_btl_usnic_component.usnic_all_modules);
    free(mca_btl_usnic_component.usnic_active_modules);

#if OPAL_BTL_USNIC_UNIT_TESTS
    /* clean up the unit test infrastructure */
    opal_btl_usnic_cleanup_tests();
#endif

    OBJ_DESTRUCT(&btl_usnic_lock);

    return OPAL_SUCCESS;
}


/*
 * Register address information.  The modex will make this available
 * to all peers.
 */
static int usnic_modex_send(void)
{
    int rc;
    int i;
    size_t size;
    opal_btl_usnic_modex_t* modexes = NULL;

    if (0 == mca_btl_usnic_component.num_modules) {
        return OPAL_SUCCESS;
    }

    size = mca_btl_usnic_component.num_modules *
        sizeof(opal_btl_usnic_modex_t);
    modexes = (opal_btl_usnic_modex_t*) malloc(size);
    if (NULL == modexes) {
        return OPAL_ERR_OUT_OF_RESOURCE;
    }

    for (i = 0; i < mca_btl_usnic_component.num_modules; i++) {
        opal_btl_usnic_module_t* module =
            mca_btl_usnic_component.usnic_active_modules[i];
        modexes[i] = module->local_modex;
        opal_output_verbose(5, USNIC_OUT,
                            "btl:usnic: "
                            "control port:%d, "
                            "modex_send data port:%d, "
                            "%s",
                            modexes[i].ports[USNIC_PRIORITY_CHANNEL],
                            modexes[i].ports[USNIC_DATA_CHANNEL],
                            module->if_ipv4_addr_str);
    }

    usnic_compat_modex_send(&rc, &mca_btl_usnic_component.super.btl_version,
                            modexes, size);
    free(modexes);

    return rc;
}


/*
 * See if our memlock limit is >64K.  64K is the RHEL default memlock
 * limit; this check is a first-line-of-defense heuristic to see if
 * the user has set the memlock limit to *something*.
 *
 * We have other checks elsewhere (e.g., to ensure that QPs are able
 * to be allocated -- which also require registered memory -- and to
 * ensure that receive buffers can be registered, etc.), but this is a
 * good first check to ensure that a default OS case is satisfied.
 */
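/* Added note: on Linux the effective limit can be inspected with
   "ulimit -l", which reports the RLIMIT_MEMLOCK soft limit in kilobytes;
   the 64*1024 comparison below is that same default expressed in bytes. */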
static int check_reg_mem_basics(void)
{
#if HAVE_DECL_RLIMIT_MEMLOCK
    int ret = OPAL_SUCCESS;
    struct rlimit limit;
    char *str_limit = NULL;

    ret = getrlimit(RLIMIT_MEMLOCK, &limit);
    if (0 == ret) {
        if ((long) limit.rlim_cur > (64 * 1024) ||
            limit.rlim_cur == RLIM_INFINITY) {
            return OPAL_SUCCESS;
        } else {
            opal_asprintf(&str_limit, "%ld", (long)limit.rlim_cur);
        }
    } else {
        opal_asprintf(&str_limit, "Unknown");
    }

    opal_show_help("help-mpi-btl-usnic.txt", "check_reg_mem_basics fail",
                   true,
                   opal_process_info.nodename,
                   str_limit);

    return OPAL_ERR_OUT_OF_RESOURCE;
#else
    /* If we don't have RLIMIT_MEMLOCK, then just bypass this
       safety/heuristic check. */
    return OPAL_SUCCESS;
#endif
}


/*
 * Basic sanity checking for usNIC VFs / resources.
 */
static int check_usnic_config(opal_btl_usnic_module_t *module,
                              int num_local_procs)
{
    char str[128];
    unsigned unlp;
    struct fi_usnic_info *uip;

    uip = &module->usnic_info;

    /* Note: we add one to num_local_procs to account for *this*
       process */
    unlp = (unsigned) num_local_procs + 1;

    /* usNIC allocates QPs as a combination of PCI virtual functions
       (VFs) and resources inside those VFs.  Ensure that:

       1. num_vfs (i.e., "usNICs") >= num_local_procs (to ensure that
          each MPI process will be able to have its own protection
          domain), and
       2. num_qps_per_vf >= NUM_CHANNELS
          (to ensure that each MPI process will be able to get the
          number of QPs it needs -- we know that every VF will have
          the same number of QPs), and
       3. num_cqs_per_vf >= NUM_CHANNELS
          (to ensure that each MPI process will be able to get the
          number of CQs that it needs) */
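    /* Added worked example (hypothetical numbers): with 4 other local MPI
       processes, unlp = 4 + 1 = 5, so the device must expose at least 5 VFs,
       and each VF must provide at least USNIC_NUM_CHANNELS queue pairs and
       completion queues for the checks below to pass. */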
    if (uip->ui.v1.ui_num_vf < unlp) {
        snprintf(str, sizeof(str), "Not enough usNICs (found %d, need %d)",
                 uip->ui.v1.ui_num_vf, unlp);
        goto error;
    }

    if (uip->ui.v1.ui_qp_per_vf < USNIC_NUM_CHANNELS) {
        snprintf(str, sizeof(str), "Not enough transmit/receive queues per usNIC (found %d, need %d)",
                 uip->ui.v1.ui_qp_per_vf,
                 USNIC_NUM_CHANNELS);
        goto error;
    }
    if (uip->ui.v1.ui_cq_per_vf < USNIC_NUM_CHANNELS) {
        snprintf(str, sizeof(str),
                 "Not enough completion queues per usNIC (found %d, need %d)",
                 uip->ui.v1.ui_cq_per_vf,
                 USNIC_NUM_CHANNELS);
        goto error;
    }

    /* All is good! */
    return OPAL_SUCCESS;

 error:
    /* Sad panda */
    opal_show_help("help-mpi-btl-usnic.txt",
                   "not enough usnic resources",
                   true,
                   opal_process_info.nodename,
                   module->linux_device_name,
                   str);
    return OPAL_ERROR;
}


static void usnic_clock_callback(int fd, short flags, void *timeout)
{
    /* 1ms == 1,000,000 ns */
    opal_btl_usnic_ticks += 1000000;

    /* run progress to make sure time change gets noticed */
    usnic_component_progress();

|
|
|
|
opal_event_add(&usnic_clock_timer_event, timeout);
|
|
|
|
}
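/* Note: this callback is armed from usnic_component_init() with a 1000
 * usec (1 ms) timeout (see usnic_clock_timeout below), so each
 * invocation advances the synthetic clock by a matching 1,000,000 ns
 * and then re-arms itself via the opal_event_add() call above. */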
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
|
|
|
|
/* Parse a string which is a comma-separated list containing a mix of
|
|
|
|
* interface names and IPv4 CIDR-format netmasks.
|
|
|
|
*
|
|
|
|
* Gracefully tolerates NULL pointer arguments by returning NULL.
|
|
|
|
*
|
|
|
|
* Returns a usnic_if_filter_t, which contains n_elt and a
|
|
|
|
* corresponding array of found filter elements. Caller is
|
|
|
|
* responsible for freeing the returned usnic_if_filter_t, the array
|
|
|
|
* of filter elements, and any strings in it (can do this via
|
|
|
|
* free_filter()).
|
|
|
|
*/
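/* Example (illustrative values only): given orig_str = "eth4,10.2.0.0/16",
 * the returned filter contains two elements: one with is_netmask = false
 * and if_name = "eth4", and one with is_netmask = true whose addr_be and
 * netmask_be describe the 10.2.0.0/16 network in network byte order. */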
|
|
|
|
static usnic_if_filter_t *parse_ifex_str(const char *orig_str,
|
|
|
|
const char *name)
|
|
|
|
{
|
|
|
|
int i, ret;
|
|
|
|
char **argv, *str, *tmp;
|
|
|
|
struct sockaddr_storage argv_inaddr;
|
|
|
|
uint32_t argv_prefix, addr;
|
|
|
|
usnic_if_filter_t *filter;
|
|
|
|
int n_argv;
|
|
|
|
|
|
|
|
if (NULL == orig_str) {
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Get a wrapper for the filter */
|
|
|
|
filter = calloc(sizeof(*filter), 1);
|
|
|
|
if (NULL == filter) {
|
|
|
|
OPAL_ERROR_LOG(OPAL_ERR_OUT_OF_RESOURCE);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
argv = opal_argv_split(orig_str, ',');
|
|
|
|
if (NULL == argv || 0 == (n_argv = opal_argv_count(argv))) {
|
|
|
|
free(filter);
|
|
|
|
opal_argv_free(argv);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* upper bound: each entry could be a mask */
|
|
|
|
filter->elts = malloc(sizeof(*filter->elts) * n_argv);
|
|
|
|
if (NULL == filter->elts) {
|
|
|
|
OPAL_ERROR_LOG(OPAL_ERR_OUT_OF_RESOURCE);
|
|
|
|
free(filter);
|
|
|
|
opal_argv_free(argv);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Shuffle iface names to the beginning of the argv array. Process each
|
|
|
|
     * netmask as we encounter it and append the resulting entry to the filter-element
|
|
|
|
* array which we will return. */
|
|
|
|
filter->n_elt = 0;
|
|
|
|
for (i = 0; NULL != argv[i]; ++i) {
|
|
|
|
        /* assume that all interface names begin with an alphabetic
|
|
|
|
* character, not a number */
|
|
|
|
if (isalpha(argv[i][0])) {
|
|
|
|
filter->elts[filter->n_elt].is_netmask = false;
|
|
|
|
filter->elts[filter->n_elt].if_name = strdup(argv[i]);
|
|
|
|
opal_output_verbose(20, USNIC_OUT,
|
2015-01-13 21:30:33 +03:00
|
|
|
"btl:usnic:parse_ifex_str: parsed %s device name: %s",
|
2014-12-03 00:09:46 +03:00
|
|
|
name, filter->elts[filter->n_elt].if_name);
|
|
|
|
|
|
|
|
++filter->n_elt;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Found a subnet notation. Convert it to an IP
|
|
|
|
address/netmask. Get the prefix first. */
|
|
|
|
argv_prefix = 0;
|
|
|
|
tmp = strdup(argv[i]);
|
|
|
|
str = strchr(argv[i], '/');
|
|
|
|
if (NULL == str) {
|
|
|
|
opal_show_help("help-mpi-btl-usnic.txt", "invalid if_inexclude",
|
|
|
|
true, name, opal_process_info.nodename,
|
|
|
|
tmp, "Invalid specification (missing \"/\")");
|
|
|
|
free(tmp);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
*str = '\0';
|
|
|
|
argv_prefix = atoi(str + 1);
|
|
|
|
if (argv_prefix < 1 || argv_prefix > 32) {
|
|
|
|
opal_show_help("help-mpi-btl-usnic.txt", "invalid if_inexclude",
|
|
|
|
true, name, opal_process_info.nodename,
|
|
|
|
tmp, "Invalid specification (prefix < 1 or prefix >32)");
|
|
|
|
free(tmp);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now convert the IPv4 address */
|
|
|
|
((struct sockaddr*) &argv_inaddr)->sa_family = AF_INET;
|
|
|
|
ret = inet_pton(AF_INET, argv[i],
|
|
|
|
&((struct sockaddr_in*) &argv_inaddr)->sin_addr);
|
|
|
|
if (1 != ret) {
|
|
|
|
opal_show_help("help-mpi-btl-usnic.txt", "invalid if_inexclude",
|
|
|
|
true, name, opal_process_info.nodename, tmp,
|
|
|
|
"Invalid specification (inet_pton() failed)");
|
|
|
|
free(tmp);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
opal_output_verbose(20, USNIC_OUT,
|
2015-01-13 21:30:33 +03:00
|
|
|
"btl:usnic:parse_ifex_str: parsed %s address+prefix: %s / %u",
|
2014-12-03 00:09:46 +03:00
|
|
|
name,
|
|
|
|
opal_net_get_hostname((struct sockaddr*) &argv_inaddr),
|
|
|
|
argv_prefix);
|
|
|
|
|
|
|
|
memcpy(&addr,
|
|
|
|
&((struct sockaddr_in*) &argv_inaddr)->sin_addr,
|
|
|
|
sizeof(addr));
|
|
|
|
|
|
|
|
/* be helpful: if the user passed A.B.C.D/24 instead of A.B.C.0/24,
|
|
|
|
* also normalize the netmask */
|
|
|
|
filter->elts[filter->n_elt].is_netmask = true;
|
|
|
|
filter->elts[filter->n_elt].if_name = NULL;
|
|
|
|
filter->elts[filter->n_elt].netmask_be =
|
|
|
|
usnic_cidrlen_to_netmask(argv_prefix);
|
|
|
|
filter->elts[filter->n_elt].addr_be = addr &
|
|
|
|
filter->elts[filter->n_elt].netmask_be;
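        /* For example (hypothetical input): "10.1.2.3/24" is stored as the
         * network address 10.1.2.0 with netmask 255.255.255.0, so later
         * comparisons against a module's own network are exact. */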
|
|
|
|
++filter->n_elt;
|
|
|
|
|
|
|
|
free(tmp);
|
|
|
|
}
|
|
|
|
assert(i == n_argv); /* sanity */
|
|
|
|
|
|
|
|
opal_argv_free(argv);
|
|
|
|
|
|
|
|
/* don't return an empty filter */
|
|
|
|
if (filter->n_elt == 0) {
|
|
|
|
free_filter(filter);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return filter;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
 * Check this module to see if it should be kept or not.
|
|
|
|
*/
|
|
|
|
static bool filter_module(opal_btl_usnic_module_t *module,
|
|
|
|
usnic_if_filter_t *filter,
|
|
|
|
bool filter_incl)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
uint32_t module_mask;
|
|
|
|
struct sockaddr_in *src;
|
|
|
|
struct fi_usnic_info *uip;
|
|
|
|
struct fi_info *info;
|
|
|
|
bool match;
|
2016-08-20 05:07:14 +03:00
|
|
|
const char *linux_device_name;
|
2014-12-03 00:09:46 +03:00
|
|
|
|
|
|
|
info = module->fabric_info;
|
|
|
|
uip = &module->usnic_info;
|
|
|
|
src = info->src_addr;
|
2016-08-20 05:07:14 +03:00
|
|
|
linux_device_name = module->linux_device_name;
|
2015-02-03 21:23:44 +03:00
|
|
|
module_mask = src->sin_addr.s_addr & uip->ui.v1.ui_netmask_be;
|
2014-12-03 00:09:46 +03:00
|
|
|
match = false;
|
|
|
|
for (i = 0; i < filter->n_elt; ++i) {
|
|
|
|
if (filter->elts[i].is_netmask) {
|
|
|
|
/* conservative: we also require the netmask to match */
|
2015-02-03 21:23:44 +03:00
|
|
|
if (filter->elts[i].netmask_be == uip->ui.v1.ui_netmask_be &&
|
2014-12-03 00:09:46 +03:00
|
|
|
filter->elts[i].addr_be == module_mask) {
|
|
|
|
match = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else {
|
2016-08-20 05:07:14 +03:00
|
|
|
if (strcmp(filter->elts[i].if_name, linux_device_name) == 0) {
|
2014-12-03 00:09:46 +03:00
|
|
|
match = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Turn the match result into whether we should keep it or not */
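    /* With an include filter (filter_incl == true) a module is kept only
     * if it matched; with an exclude filter it is kept only if it did NOT
     * match:
     *
     *     match   filter_incl   keep == (match ^ !filter_incl)
     *     true    true          true
     *     false   true          false
     *     true    false         false
     *     false   false         true
     */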
|
|
|
|
return match ^ !filter_incl;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* utility routine to safely free a filter element array */
|
|
|
|
static void free_filter(usnic_if_filter_t *filter)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (filter == NULL) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (NULL != filter->elts) {
|
|
|
|
for (i = 0; i < filter->n_elt; ++i) {
|
|
|
|
if (!filter->elts[i].is_netmask) {
|
|
|
|
free(filter->elts[i].if_name);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
free(filter->elts);
|
|
|
|
}
|
|
|
|
free(filter);
|
|
|
|
}
|
|
|
|
|
2013-07-20 02:13:58 +04:00
|
|
|
/*
|
|
|
|
* UD component initialization:
|
|
|
|
* (1) read interface list from kernel and compare against component
|
|
|
|
* parameters then create a BTL instance for selected interfaces
|
|
|
|
* (2) post OOB receive for incoming connection attempts
|
|
|
|
* (3) register BTL parameters with the MCA
|
|
|
|
*/
|
|
|
|
static mca_btl_base_module_t** usnic_component_init(int* num_btl_modules,
|
|
|
|
bool want_progress_threads,
|
|
|
|
bool want_mpi_threads)
|
|
|
|
{
|
|
|
|
mca_btl_base_module_t **btls = NULL;
|
2014-12-03 00:09:46 +03:00
|
|
|
int i, j, num_final_modules;
|
|
|
|
int num_devs;
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_module_t *module;
|
2015-03-10 17:08:06 +03:00
|
|
|
usnic_if_filter_t *filter = NULL;
|
2013-07-20 02:13:58 +04:00
|
|
|
bool keep_module;
|
|
|
|
bool filter_incl = false;
|
2014-12-03 00:09:46 +03:00
|
|
|
int min_distance, num_local_procs;
|
|
|
|
struct fi_info *info_list;
|
|
|
|
struct fi_info *info;
|
|
|
|
struct fid_fabric *fabric;
|
|
|
|
struct fid_domain *domain;
|
|
|
|
int ret;
|
2013-07-20 02:13:58 +04:00
|
|
|
|
|
|
|
*num_btl_modules = 0;
|
|
|
|
|
2016-06-02 21:52:32 +03:00
|
|
|
/* MPI_THREAD_MULTIPLE is only supported in 2.0+ */
|
2014-12-16 19:49:01 +03:00
|
|
|
if (want_mpi_threads && !mca_btl_base_thread_multiple_override) {
|
2016-10-12 02:43:35 +03:00
|
|
|
if (OPAL_MAJOR_VERSION >= 2) {
|
2016-06-02 21:52:32 +03:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: MPI_THREAD_MULTIPLE support is in testing phase.");
|
|
|
|
}
|
|
|
|
else {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: MPI_THREAD_MULTIPLE is not supported in version < 2.");
|
|
|
|
return NULL;
|
|
|
|
}
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
|
2016-06-02 21:52:32 +03:00
|
|
|
OBJ_CONSTRUCT(&btl_usnic_lock, opal_recursive_mutex_t);
|
|
|
|
|
2016-10-03 21:48:41 +03:00
|
|
|
/* There are multiple dimensions to consider when requesting an
|
|
|
|
API version number from libfabric:
|
|
|
|
|
2016-10-03 21:58:08 +03:00
|
|
|
1. This code understands libfabric API versions v1.3 through
|
2016-10-03 21:48:41 +03:00
|
|
|
v1.4.
|
|
|
|
|
|
|
|
2. Open MPI may be *compiled* against one version of libfabric,
|
|
|
|
but may be *running* with another.
|
|
|
|
|
|
|
|
3. There were usnic-specific bugs in Libfabric prior to
|
|
|
|
libfabric v1.3.0 (where "v1.3.0" is the tarball/package
|
|
|
|
version, not the API version; but happily, the API version
|
|
|
|
was also 1.3 in Libfabric v1.3.0):
|
|
|
|
|
|
|
|
- In libfabric v1.0.0 (i.e., API v1.0), the usnic provider
|
|
|
|
did not check the value of the "version" parameter passed
|
|
|
|
into fi_getinfo()
|
|
|
|
- If you pass FI_VERSION(1,0) to libfabric v1.1.0 (i.e., API
|
|
|
|
v1.1), the usnic provider will disable FI_MSG_PREFIX
|
|
|
|
support (on the assumption that the application will not
|
|
|
|
handle FI_MSG_PREFIX properly). This can happen if you
|
|
|
|
compile OMPI against libfabric v1.0.0 (i.e., API v1.0) and
|
|
|
|
run OMPI against libfabric v1.1.0 (i.e., API v1.1).
|
|
|
|
- Some critical AV bug fixes were included in libfabric
|
|
|
|
v1.3.0; prior versions can fail in fi_av_* operations in
|
|
|
|
unexpected ways (libnl: you win again!).
|
|
|
|
|
|
|
|
So always request a minimum API version of v1.3.
|
|
|
|
|
|
|
|
Note that the FI_MAJOR_VERSION and FI_MINOR_VERSION in
|
|
|
|
<rdma/fabric.h> represent the API version, not the Libfabric
|
|
|
|
package (i.e., tarball) version. As of Libfabric v1.3, there
|
|
|
|
is currently no way to know a) what package version of
|
|
|
|
Libfabric you were compiled against, and b) what package
|
|
|
|
version of Libfabric you are running with.
|
|
|
|
|
|
|
|
Also note that the usnic provider changed the strings in the
|
|
|
|
fabric and domain names in API v1.4. With API <= v1.3:
|
2016-08-20 05:07:14 +03:00
|
|
|
|
|
|
|
- fabric name is "usnic_X" (device name)
|
|
|
|
- domain name is NULL
|
|
|
|
|
2016-10-03 21:48:41 +03:00
|
|
|
With libfabric API >= v1.4, all Libfabric IP-based providers
|
|
|
|
(including usnic) follow the same convention:
|
2016-08-20 05:07:14 +03:00
|
|
|
|
|
|
|
- fabric name is "a.b.c.d/e" (CIDR notation of network)
|
|
|
|
- domain name is "usnic_X" (device name)
|
|
|
|
|
2015-08-12 15:21:18 +03:00
|
|
|
NOTE: The configure.m4 in this component will require libfabric
|
2016-10-03 21:48:41 +03:00
|
|
|
>= v1.1.0 (i.e., it won't accept v1.0.0) because it needs
|
|
|
|
access to the usNIC extension header structures that only
|
2016-10-03 21:58:08 +03:00
|
|
|
       became available in v1.1.0. */
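    /* (FI_VERSION(major, minor) packs the two numbers into a single
     * integer, so API versions can be compared numerically -- e.g.,
     * FI_VERSION(1, 3) < FI_VERSION(1, 4) -- which is what the checks
     * below rely on when choosing between the v1.3 and v1.4 behaviors.) */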
|
2016-08-20 05:07:14 +03:00
|
|
|
|
2016-10-03 21:48:41 +03:00
|
|
|
/* First, check to see if the libfabric we are running with is <=
|
|
|
|
libfabric v1.3. If so, don't bother going further. */
|
2015-07-09 21:58:00 +03:00
|
|
|
uint32_t libfabric_api;
|
2016-10-03 21:48:41 +03:00
|
|
|
libfabric_api = fi_version();
|
|
|
|
if (libfabric_api < FI_VERSION(1, 3)) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: disqualifiying myself because Libfabric does not support v1.3 of the API (v1.3 is *required* for correct usNIC functionality).");
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Libfabric API 1.3 is fine. Above that, we know that Open MPI
|
|
|
|
works with libfabric API v1.4, so just use that. */
|
|
|
|
if (libfabric_api > FI_VERSION(1, 3)) {
|
|
|
|
libfabric_api = FI_VERSION(1, 4);
|
2016-08-20 05:07:14 +03:00
|
|
|
}
|
2016-10-03 21:48:41 +03:00
|
|
|
|
|
|
|
struct fi_info hints = {0};
|
|
|
|
struct fi_ep_attr ep_attr = {0};
|
|
|
|
struct fi_fabric_attr fabric_attr = {0};
|
2017-04-21 18:51:15 +03:00
|
|
|
struct fi_rx_attr rx_attr = {0};
|
|
|
|
struct fi_tx_attr tx_attr = {0};
|
2016-10-03 21:48:41 +03:00
|
|
|
|
|
|
|
/* We only want providers named "usnic" that are of type EP_DGRAM */
|
|
|
|
fabric_attr.prov_name = "usnic";
|
|
|
|
ep_attr.type = FI_EP_DGRAM;
|
|
|
|
|
|
|
|
hints.caps = FI_MSG;
|
|
|
|
hints.mode = FI_LOCAL_MR | FI_MSG_PREFIX;
|
|
|
|
hints.addr_format = FI_SOCKADDR;
|
|
|
|
hints.ep_attr = &ep_attr;
|
|
|
|
hints.fabric_attr = &fabric_attr;
|
2017-04-21 18:51:15 +03:00
|
|
|
hints.tx_attr = &tx_attr;
|
|
|
|
hints.rx_attr = &rx_attr;
|
|
|
|
|
|
|
|
tx_attr.iov_limit = 1;
|
|
|
|
rx_attr.iov_limit = 1;
|
2016-10-03 21:48:41 +03:00
|
|
|
|
|
|
|
ret = fi_getinfo(libfabric_api, NULL, 0, 0, &hints, &info_list);
|
2014-12-03 00:09:46 +03:00
|
|
|
if (0 != ret) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2016-10-03 21:48:41 +03:00
|
|
|
"btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: %s (%d)", strerror(-ret), ret);
|
2014-12-03 00:09:46 +03:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
num_devs = 0;
|
|
|
|
for (info = info_list; NULL != info; info = info->next) {
|
|
|
|
++num_devs;
|
|
|
|
}
|
|
|
|
if (0 == num_devs) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: disqualifiying myself due to lack of libfabric providers");
|
2013-07-20 02:13:58 +04:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2015-01-09 05:35:13 +03:00
|
|
|
/* Do quick sanity check to ensure that we can lock memory (which
|
|
|
|
is required for registered memory). */
|
|
|
|
if (OPAL_SUCCESS != check_reg_mem_basics()) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: disqualifiying myself due to lack of lockable memory");
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2013-07-20 02:13:58 +04:00
|
|
|
/************************************************************************
|
|
|
|
* Below this line, we assume that usnic is loaded on all procs,
|
|
|
|
     * and therefore we will guarantee to do the modex send, even if
|
|
|
|
* we fail.
|
|
|
|
************************************************************************/
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2015-01-13 00:21:22 +03:00
|
|
|
"btl:usnic: usNIC fabrics found");
|
2013-07-20 02:13:58 +04:00
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
opal_proc_t *me = opal_proc_local_get();
|
|
|
|
opal_process_name_t *name = &(me->proc_name);
|
|
|
|
mca_btl_usnic_component.my_hashed_rte_name =
|
|
|
|
usnic_compat_rte_hash_name(name);
|
|
|
|
MSGDEBUG1_OUT("%s: my_hashed_rte_name=0x%" PRIx64,
|
|
|
|
__func__, mca_btl_usnic_component.my_hashed_rte_name);
|
|
|
|
|
|
|
|
opal_srand(&opal_btl_usnic_rand_buff, ((uint32_t) getpid()));
|
2014-03-04 01:31:42 +04:00
|
|
|
|
2013-07-20 02:13:58 +04:00
|
|
|
/* Setup an array of pointers to point to each module (which we'll
|
|
|
|
return upstream) */
|
2014-12-03 00:09:46 +03:00
|
|
|
mca_btl_usnic_component.num_modules = num_devs;
|
2013-07-20 02:13:58 +04:00
|
|
|
btls = (struct mca_btl_base_module_t**)
|
2014-07-31 00:56:15 +04:00
|
|
|
malloc(mca_btl_usnic_component.num_modules *
|
2014-07-26 04:47:28 +04:00
|
|
|
sizeof(opal_btl_usnic_module_t*));
|
2013-07-20 02:13:58 +04:00
|
|
|
if (NULL == btls) {
|
2014-07-26 04:47:28 +04:00
|
|
|
OPAL_ERROR_LOG(OPAL_ERR_OUT_OF_RESOURCE);
|
2014-12-03 00:09:46 +03:00
|
|
|
goto send_modex;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Allocate space for btl module instances */
|
2013-08-01 20:56:15 +04:00
|
|
|
mca_btl_usnic_component.usnic_all_modules =
|
2013-07-20 02:13:58 +04:00
|
|
|
calloc(mca_btl_usnic_component.num_modules,
|
2013-08-01 20:56:15 +04:00
|
|
|
sizeof(*mca_btl_usnic_component.usnic_all_modules));
|
|
|
|
mca_btl_usnic_component.usnic_active_modules =
|
2013-09-06 07:21:34 +04:00
|
|
|
calloc(mca_btl_usnic_component.num_modules,
|
2013-08-01 20:56:15 +04:00
|
|
|
sizeof(*mca_btl_usnic_component.usnic_active_modules));
|
|
|
|
if (NULL == mca_btl_usnic_component.usnic_all_modules ||
|
|
|
|
NULL == mca_btl_usnic_component.usnic_active_modules) {
|
2014-07-26 04:47:28 +04:00
|
|
|
OPAL_ERROR_LOG(OPAL_ERR_OUT_OF_RESOURCE);
|
2013-08-01 20:56:15 +04:00
|
|
|
goto error;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
    /* If we have an include or exclude list, parse and set it up now
|
|
|
|
* (higher level guarantees there will not be both include and exclude,
|
|
|
|
* so don't bother checking that here)
|
|
|
|
*/
|
|
|
|
if (NULL != mca_btl_usnic_component.if_include) {
|
|
|
|
opal_output_verbose(20, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_module: if_include=%s",
|
|
|
|
mca_btl_usnic_component.if_include);
|
|
|
|
|
|
|
|
filter_incl = true;
|
|
|
|
filter = parse_ifex_str(mca_btl_usnic_component.if_include, "include");
|
|
|
|
} else if (NULL != mca_btl_usnic_component.if_exclude) {
|
|
|
|
opal_output_verbose(20, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_module: if_exclude=%s",
|
|
|
|
mca_btl_usnic_component.if_exclude);
|
|
|
|
|
|
|
|
filter_incl = false;
|
|
|
|
filter = parse_ifex_str(mca_btl_usnic_component.if_exclude, "exclude");
|
|
|
|
}
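    /* For example (hypothetical values): an include list such as
     * "usnic_1,10.10.0.0/16" keeps only the usnic_1 device plus any device
     * whose network is 10.10.0.0/16; the layer above guarantees that
     * if_include and if_exclude are never both set. */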
|
|
|
|
|
2014-07-27 01:48:23 +04:00
|
|
|
num_local_procs = opal_process_info.num_local_peers;
|
2013-07-20 02:13:58 +04:00
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
/* Go through the list of devices and determine if we want it or
|
|
|
|
not. Create a module for each one that we want. */
|
|
|
|
info = info_list;
|
|
|
|
for (j = i = 0; i < num_devs &&
|
2013-07-20 02:13:58 +04:00
|
|
|
(0 == mca_btl_usnic_component.max_modules ||
|
|
|
|
i < mca_btl_usnic_component.max_modules);
|
2014-12-03 00:09:46 +03:00
|
|
|
++i, info = info->next) {
|
2013-07-20 02:13:58 +04:00
|
|
|
|
2016-08-20 05:07:14 +03:00
|
|
|
// The fabric/domain names changed at libfabric API v1.4 (see above).
|
|
|
|
char *linux_device_name;
|
|
|
|
if (libfabric_api <= FI_VERSION(1, 3)) {
|
|
|
|
linux_device_name = info->fabric_attr->name;
|
|
|
|
} else {
|
|
|
|
linux_device_name = info->domain_attr->name;
|
|
|
|
}
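        /* For example (hypothetical names): with API <= v1.3 the fabric
         * name is already the device name (e.g., "usnic_0"); with API >=
         * v1.4 the fabric name is the network in CIDR form (e.g.,
         * "10.0.0.0/16") and the domain name carries the device name
         * instead (see the long comment above). */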
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
ret = fi_fabric(info->fabric_attr, &fabric, NULL);
|
|
|
|
if (0 != ret) {
|
2015-03-10 02:57:41 +03:00
|
|
|
opal_show_help("help-mpi-btl-usnic.txt",
|
|
|
|
"libfabric API failed",
|
|
|
|
true,
|
|
|
|
opal_process_info.nodename,
|
2016-08-20 05:07:14 +03:00
|
|
|
linux_device_name,
|
2015-03-10 02:57:41 +03:00
|
|
|
"fi_fabric()", __FILE__, __LINE__,
|
|
|
|
ret,
|
|
|
|
strerror(-ret));
|
|
|
|
continue;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
2014-12-03 00:09:46 +03:00
|
|
|
opal_memchecker_base_mem_defined(&fabric, sizeof(fabric));
|
|
|
|
|
|
|
|
ret = fi_domain(fabric, info, &domain, NULL);
|
|
|
|
if (0 != ret) {
|
2015-03-10 02:57:41 +03:00
|
|
|
opal_show_help("help-mpi-btl-usnic.txt",
|
|
|
|
"libfabric API failed",
|
|
|
|
true,
|
|
|
|
opal_process_info.nodename,
|
2016-08-20 05:07:14 +03:00
|
|
|
linux_device_name,
|
2015-03-10 02:57:41 +03:00
|
|
|
"fi_domain()", __FILE__, __LINE__,
|
|
|
|
ret,
|
|
|
|
strerror(-ret));
|
|
|
|
continue;
|
2014-12-03 00:09:46 +03:00
|
|
|
}
|
|
|
|
opal_memchecker_base_mem_defined(&domain, sizeof(domain));
|
|
|
|
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2016-08-20 05:07:14 +03:00
|
|
|
"btl:usnic: found: usNIC device %s",
|
|
|
|
linux_device_name);
|
2014-12-03 00:09:46 +03:00
|
|
|
|
|
|
|
/* Save a little info on the module that we have already
|
|
|
|
gathered. The rest of the module will be filled in
|
|
|
|
later. */
|
|
|
|
module = &(mca_btl_usnic_component.usnic_all_modules[j]);
|
|
|
|
memcpy(module, &opal_btl_usnic_module_template,
|
|
|
|
sizeof(opal_btl_usnic_module_t));
|
|
|
|
module->fabric = fabric;
|
|
|
|
module->domain = domain;
|
|
|
|
module->fabric_info = info;
|
2016-08-20 05:07:14 +03:00
|
|
|
module->libfabric_api = libfabric_api;
|
|
|
|
module->linux_device_name = strdup(linux_device_name);
|
|
|
|
if (NULL == module->linux_device_name) {
|
|
|
|
OPAL_ERROR_LOG(OPAL_ERR_OUT_OF_RESOURCE);
|
|
|
|
goto error;
|
|
|
|
}
|
2013-07-20 02:13:58 +04:00
|
|
|
|
2015-01-13 22:45:50 +03:00
|
|
|
/* Obtain usnic-specific device info (e.g., netmask) that
|
|
|
|
doesn't come in the normal fi_getinfo(). This allows us to
|
|
|
|
do filtering, later. */
|
2014-12-03 00:09:46 +03:00
|
|
|
ret = fi_open_ops(&fabric->fid, FI_USNIC_FABRIC_OPS_1, 0,
|
|
|
|
(void **)&module->usnic_fabric_ops, NULL);
|
|
|
|
if (ret != 0) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: device %s fabric_open_ops failed %d (%s)",
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name, ret, fi_strerror(-ret));
|
2014-12-03 00:09:46 +03:00
|
|
|
fi_close(&domain->fid);
|
|
|
|
fi_close(&fabric->fid);
|
2013-07-20 02:13:58 +04:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2015-02-14 01:44:23 +03:00
|
|
|
ret =
|
2016-03-17 02:03:51 +03:00
|
|
|
module->usnic_fabric_ops->getinfo(1,
|
2015-02-14 01:44:23 +03:00
|
|
|
fabric,
|
|
|
|
&module->usnic_info);
|
2014-12-03 00:09:46 +03:00
|
|
|
if (ret != 0) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2015-01-13 22:45:50 +03:00
|
|
|
"btl:usnic: device %s usnic_getinfo failed %d (%s)",
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name, ret, fi_strerror(-ret));
|
2014-12-03 00:09:46 +03:00
|
|
|
fi_close(&domain->fid);
|
|
|
|
fi_close(&fabric->fid);
|
2013-07-20 02:13:58 +04:00
|
|
|
continue;
|
|
|
|
}
|
2015-01-13 22:45:50 +03:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: device %s usnic_info: link speed=%d, netmask=0x%x, ifname=%s, num_vf=%d, qp/vf=%d, cq/vf=%d",
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name,
|
2015-02-03 21:23:44 +03:00
|
|
|
(unsigned int) module->usnic_info.ui.v1.ui_link_speed,
|
|
|
|
(unsigned int) module->usnic_info.ui.v1.ui_netmask_be,
|
|
|
|
module->usnic_info.ui.v1.ui_ifname,
|
|
|
|
module->usnic_info.ui.v1.ui_num_vf,
|
|
|
|
module->usnic_info.ui.v1.ui_qp_per_vf,
|
|
|
|
module->usnic_info.ui.v1.ui_cq_per_vf);
|
2015-01-13 22:45:50 +03:00
|
|
|
|
|
|
|
/* respect if_include/if_exclude subnets/ifaces from the user */
|
|
|
|
if (filter != NULL) {
|
|
|
|
keep_module = filter_module(module, filter, filter_incl);
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: %s %s due to %s",
|
|
|
|
(keep_module ? "keeping" : "skipping"),
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name,
|
2015-01-13 22:45:50 +03:00
|
|
|
(filter_incl ? "if_include" : "if_exclude"));
|
|
|
|
if (!keep_module) {
|
|
|
|
fi_close(&domain->fid);
|
|
|
|
fi_close(&fabric->fid);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
2013-07-20 02:13:58 +04:00
|
|
|
|
2015-03-10 02:57:41 +03:00
|
|
|
/* The first time through, check some usNIC configuration
|
|
|
|
minimum settings with information we got back from the fi_*
|
|
|
|
probes (these are VIC-wide settings -- they don't change
|
|
|
|
for each module we create, so we only need to check
|
|
|
|
once). */
|
|
|
|
if (0 == j &&
|
|
|
|
check_usnic_config(module, num_local_procs) != OPAL_SUCCESS) {
|
2014-03-04 01:31:42 +04:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2014-12-03 00:09:46 +03:00
|
|
|
"btl:usnic: device %s is not provisioned with enough resources -- skipping",
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name);
|
2014-12-03 00:09:46 +03:00
|
|
|
fi_close(&domain->fid);
|
|
|
|
fi_close(&fabric->fid);
|
2015-03-10 02:57:41 +03:00
|
|
|
|
|
|
|
mca_btl_usnic_component.num_modules = 0;
|
|
|
|
goto error;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
/*************************************************/
|
|
|
|
/* Below this point, we know we want this device */
|
|
|
|
/*************************************************/
|
2013-07-20 02:13:58 +04:00
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: device %s looks good!",
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name);
|
2013-07-20 02:13:58 +04:00
|
|
|
|
|
|
|
/* Let this module advance to the next round! */
|
2014-12-03 00:09:46 +03:00
|
|
|
btls[j++] = &(module->super);
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
2014-12-03 00:09:46 +03:00
|
|
|
mca_btl_usnic_component.num_modules = j;
|
2013-07-20 02:13:58 +04:00
|
|
|
|
|
|
|
/* free filter if created */
|
|
|
|
if (filter != NULL) {
|
|
|
|
free_filter(filter);
|
|
|
|
filter = NULL;
|
|
|
|
}
|
|
|
|
|
2015-03-10 17:41:28 +03:00
|
|
|
    /* If we actually have some modules, set up the connectivity
|
|
|
|
checking agent and client. */
|
|
|
|
if (mca_btl_usnic_component.num_modules > 0 &&
|
|
|
|
mca_btl_usnic_component.connectivity_enabled) {
|
2015-08-10 21:02:09 +03:00
|
|
|
mca_btl_usnic_component.opal_evbase = opal_progress_thread_init(NULL);
|
2015-03-10 17:41:28 +03:00
|
|
|
if (OPAL_SUCCESS != opal_btl_usnic_connectivity_agent_init() ||
|
|
|
|
OPAL_SUCCESS != opal_btl_usnic_connectivity_client_init()) {
|
2015-08-10 21:02:09 +03:00
|
|
|
opal_progress_thread_finalize(NULL);
|
2015-03-10 17:41:28 +03:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
/* Now that we know how many modules there are, let the modules
|
|
|
|
initialize themselves (it's useful to know how many modules
|
|
|
|
there are before doing this). */
|
2013-07-20 02:13:58 +04:00
|
|
|
for (num_final_modules = i = 0;
|
|
|
|
i < mca_btl_usnic_component.num_modules; ++i) {
|
2014-07-26 04:47:28 +04:00
|
|
|
module = (opal_btl_usnic_module_t*) btls[i];
|
2013-07-20 02:13:58 +04:00
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
/* Let the module initialize itself */
|
|
|
|
if (OPAL_SUCCESS != opal_btl_usnic_module_init(module)) {
|
2013-07-20 02:13:58 +04:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2014-12-03 00:09:46 +03:00
|
|
|
"btl:usnic: failed to init module for %s",
|
|
|
|
module->if_ipv4_addr_str);
|
2013-07-20 02:13:58 +04:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
/*************************************************/
|
|
|
|
/* Below this point, we know we want this module */
|
|
|
|
/*************************************************/
|
2013-07-20 02:13:58 +04:00
|
|
|
|
|
|
|
/* If module_init() failed for any prior module, this will be
|
|
|
|
a down shift in the btls[] array. Otherwise, it's an
|
|
|
|
overwrite of the same value. */
|
|
|
|
btls[num_final_modules++] = &(module->super);
|
|
|
|
|
|
|
|
/* Output all of this module's values. */
|
2016-08-20 05:07:14 +03:00
|
|
|
const char *devname = module->linux_device_name;
|
2013-07-20 02:13:58 +04:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2016-01-30 17:58:46 +03:00
|
|
|
"btl:usnic: %s num sqe=%d, num rqe=%d, num cqe=%d, num aveqe=%d",
|
2014-12-03 00:09:46 +03:00
|
|
|
devname,
|
2013-07-20 02:13:58 +04:00
|
|
|
module->sd_num,
|
|
|
|
module->rd_num,
|
2016-01-30 17:58:46 +03:00
|
|
|
module->cq_num,
|
|
|
|
module->av_eq_num);
|
2013-07-20 02:13:58 +04:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2014-12-03 00:09:46 +03:00
|
|
|
"btl:usnic: %s priority MTU = %" PRIsize_t,
|
|
|
|
devname,
|
|
|
|
module->max_tiny_msg_size);
|
2013-07-20 02:13:58 +04:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2014-12-03 00:09:46 +03:00
|
|
|
"btl:usnic: %s priority limit = %" PRIsize_t,
|
|
|
|
devname,
|
2013-07-20 02:13:58 +04:00
|
|
|
module->max_tiny_payload);
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2014-12-03 00:09:46 +03:00
|
|
|
"btl:usnic: %s eager limit = %" PRIsize_t,
|
|
|
|
devname,
|
2013-07-20 02:13:58 +04:00
|
|
|
module->super.btl_eager_limit);
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2014-12-03 00:09:46 +03:00
|
|
|
"btl:usnic: %s eager rndv limit = %" PRIsize_t,
|
|
|
|
devname,
|
2013-07-20 02:13:58 +04:00
|
|
|
module->super.btl_rndv_eager_limit);
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2014-12-03 00:09:46 +03:00
|
|
|
"btl:usnic: %s max send size= %" PRIsize_t
|
2013-07-20 02:13:58 +04:00
|
|
|
" (not overrideable)",
|
2014-12-03 00:09:46 +03:00
|
|
|
devname,
|
2013-07-20 02:13:58 +04:00
|
|
|
module->super.btl_max_send_size);
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
2014-12-03 00:09:46 +03:00
|
|
|
"btl:usnic: %s exclusivity = %d",
|
|
|
|
devname,
|
2013-07-20 02:13:58 +04:00
|
|
|
module->super.btl_exclusivity);
|
|
|
|
}
|
|
|
|
|
2013-08-01 20:56:15 +04:00
|
|
|
/* We may have skipped some modules, so reset
|
|
|
|
component.num_modules */
|
|
|
|
mca_btl_usnic_component.num_modules = num_final_modules;
|
|
|
|
|
2013-07-20 02:13:58 +04:00
|
|
|
/* We've packed all the modules and pointers to those modules in
|
|
|
|
the lower ends of their respective arrays. If not all the
|
|
|
|
modules initialized successfully, we're wasting a little space.
|
|
|
|
We could realloc and re-form the btls[] array, but it doesn't
|
|
|
|
seem worth it. Just waste a little space.
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
That being said, if we ended up with zero acceptable devices,
|
2013-07-20 02:13:58 +04:00
|
|
|
then free everything. */
|
|
|
|
if (0 == num_final_modules) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: returning 0 modules");
|
2013-08-01 20:56:15 +04:00
|
|
|
goto error;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
|
2013-08-01 20:56:15 +04:00
|
|
|
/* we have a nonzero number of modules, so save a copy of the btls array
|
|
|
|
* for later use */
|
|
|
|
memcpy(mca_btl_usnic_component.usnic_active_modules, btls,
|
2014-12-03 00:09:46 +03:00
|
|
|
num_final_modules * sizeof(*btls));
|
2013-07-20 02:13:58 +04:00
|
|
|
|
|
|
|
/* Loop over the modules and find the minimum value for
|
|
|
|
module->numa_distance. For every module that has a
|
|
|
|
numa_distance higher than the minimum value, increase its btl
|
|
|
|
latency rating so that the PML will prefer to send short
|
|
|
|
messages over "near" modules. */
|
|
|
|
min_distance = 9999999;
|
|
|
|
for (i = 0; i < mca_btl_usnic_component.num_modules; ++i) {
|
2014-07-26 04:47:28 +04:00
|
|
|
module = (opal_btl_usnic_module_t*) btls[i];
|
2013-07-20 02:13:58 +04:00
|
|
|
if (module->numa_distance < min_distance) {
|
|
|
|
min_distance = module->numa_distance;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
for (i = 0; i < mca_btl_usnic_component.num_modules; ++i) {
|
2014-07-26 04:47:28 +04:00
|
|
|
module = (opal_btl_usnic_module_t*) btls[i];
|
2013-07-20 02:13:58 +04:00
|
|
|
if (module->numa_distance > min_distance) {
|
|
|
|
++module->super.btl_latency;
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: %s is far from me; increasing latency rating",
|
2014-12-03 00:09:46 +03:00
|
|
|
module->if_ipv4_addr_str);
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
}
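    /* Example (hypothetical distances): with two modules at numa_distance
     * 10 and 21, min_distance becomes 10, so only the second module has
     * its btl_latency bumped and the PML prefers the closer device for
     * short messages. */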
|
|
|
|
|
|
|
|
/* start timer to guarantee synthetic clock advances */
|
2015-07-11 20:08:19 +03:00
|
|
|
opal_event_set(opal_sync_event_base, &usnic_clock_timer_event,
|
2013-07-20 02:13:58 +04:00
|
|
|
-1, 0, usnic_clock_callback,
|
|
|
|
&usnic_clock_timeout);
|
|
|
|
usnic_clock_timer_event_set = true;
|
|
|
|
|
|
|
|
/* 1ms timer */
|
|
|
|
usnic_clock_timeout.tv_sec = 0;
|
|
|
|
usnic_clock_timeout.tv_usec = 1000;
|
|
|
|
opal_event_add(&usnic_clock_timer_event, &usnic_clock_timeout);
|
|
|
|
|
Move all usNIC stats to _stats.c|h and export them as MPI_T pvars.
This commit moves all the module stats into their own struct so that
the stats only need to appear as a single line in the module_t
definition, and then moves all the logic for reporting the stats into
btl_usnic_stats.c|h.
Further, the stats are now exported as MPI_T_BIND_NO_OBJECT entities
(i.e., not bound to any particular MPI handle), and are marked as
READONLY and CONTINUOUS. They currently all default to verbose level
5 ("Application tuner / detailed", according to
https://svn.open-mpi.org/trac/ompi/wiki/MCAParamLevels).
Most of the statistics are counters, but a small number are high
watermark values. Due to how counters are reported via MPI_T, none of
the counters are exported through MPI_T if the MCA param
btl_usnic_stats_relative=1 (i.e., the module resets the stats back to
zero at a given frequency).
When MPI_T_pvar_handle_alloc() is invoked on any of these pvars, it
will return a count that is equal to the number of active usnic BTL
modules. The values returned for any given pvar (e.g.,
num_total_sends) are an array containing one value for each active
usnic BTL module. The ordering of values in the array is both
consistent across all usnic pvars and stable throughout a single job:
array slot 0 corresponds to module X, array slot 1 corresponds to
module Y, etc.
Mapping which array slot corresponds to which underlying Linux usnic_X
device works as follows:
* The btl_usnic_devices MPI_T state pvar is associated with a
btl_usnic_device MPI_T enum, and can be obtained via
MPI_T_pvar_get_info().
* If all usNIC pvars are of length N, the values [0,N) in the
btl_usnic_device enum are associated with strings of the
corresponding underlying Linux device.
For example, to look up which Linux device is reported in all usNIC
pvars' array slot 1, look up the int value 1 in the btl_usnic_devices
enum. Its corresponding string value is the underlying Linux device name
(e.g., "usnic_1").
cmr=v1.7.4:subject="usnic BTL MPI_T pvars"
This commit was SVN r29545.
2013-10-29 02:23:08 +04:00
|
|
|
/* Setup MPI_T performance variables */
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_setup_mpit_pvars();
|
2013-10-29 02:23:08 +04:00
|
|
|
|
|
|
|
/* All done */
|
2013-07-20 02:13:58 +04:00
|
|
|
*num_btl_modules = mca_btl_usnic_component.num_modules;
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic: returning %d modules", *num_btl_modules);
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
send_modex:
|
2013-07-20 02:13:58 +04:00
|
|
|
usnic_modex_send();
|
|
|
|
return btls;
|
2013-08-01 20:56:15 +04:00
|
|
|
|
|
|
|
error:
|
|
|
|
/* clean up as much allocated memory as possible */
|
|
|
|
free(btls);
|
|
|
|
btls = NULL;
|
|
|
|
free(mca_btl_usnic_component.usnic_all_modules);
|
|
|
|
mca_btl_usnic_component.usnic_all_modules = NULL;
|
|
|
|
free(mca_btl_usnic_component.usnic_active_modules);
|
|
|
|
mca_btl_usnic_component.usnic_active_modules = NULL;
|
2015-03-10 17:08:06 +03:00
|
|
|
|
|
|
|
/* free filter if created */
|
|
|
|
if (filter != NULL) {
|
|
|
|
free_filter(filter);
|
|
|
|
filter = NULL;
|
|
|
|
}
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
goto send_modex;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Component progress
|
2013-09-06 07:18:57 +04:00
|
|
|
* The fast-path of an incoming packet available on the priority
|
|
|
|
 * receive queue is handled directly in this routine; everything else
|
|
|
|
* is deferred to an external call, usnic_component_progress_2()
|
|
|
|
* This helps keep usnic_component_progress() very small and very responsive
|
2014-07-31 00:56:15 +04:00
|
|
|
* to a single incoming packet. We make sure not to always return
|
2013-09-06 07:18:57 +04:00
|
|
|
 * immediately after one packet to avoid starvation; "fastpath_ok" is
|
|
|
|
* used for this.
|
2013-07-20 02:13:58 +04:00
|
|
|
*/
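/* In short: when the fast path consumes a packet it returns immediately
 * and clears fastpath_ok, so the next call skips the fast path, runs the
 * full usnic_component_progress_2() path, and re-enables it. */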
|
2014-07-26 04:47:28 +04:00
|
|
|
static int usnic_handle_completion(opal_btl_usnic_module_t* module,
|
2014-12-03 00:09:46 +03:00
|
|
|
opal_btl_usnic_channel_t *channel, struct fi_cq_entry *completion);
|
2013-09-06 07:18:57 +04:00
|
|
|
static int usnic_component_progress_2(void);
|
2014-12-03 00:09:46 +03:00
|
|
|
static void usnic_handle_cq_error(opal_btl_usnic_module_t* module,
|
|
|
|
opal_btl_usnic_channel_t *channel, int cq_ret);
|
2013-09-06 07:18:57 +04:00
|
|
|
|
2013-07-20 02:13:58 +04:00
|
|
|
static int usnic_component_progress(void)
|
|
|
|
{
|
2014-12-03 00:09:46 +03:00
|
|
|
int i;
|
2013-09-06 07:18:57 +04:00
|
|
|
int count;
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_recv_segment_t* rseg;
|
|
|
|
opal_btl_usnic_module_t* module;
|
2014-12-03 00:09:46 +03:00
|
|
|
struct fi_cq_entry completion;
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_channel_t *channel;
|
2014-12-03 00:09:46 +03:00
|
|
|
static bool fastpath_ok = true;
|
2013-09-06 07:18:57 +04:00
|
|
|
|
|
|
|
/* update our simulated clock */
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_ticks += 5000;
|
2013-09-06 07:18:57 +04:00
|
|
|
|
|
|
|
count = 0;
|
|
|
|
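/* Fast path: poll only the priority channel of each module for a
   single completion and, when it is a receive segment, hand it to
   the streamlined receive routine below. */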
if (fastpath_ok) {
|
|
|
|
for (i = 0; i < mca_btl_usnic_component.num_modules; i++) {
|
|
|
|
module = mca_btl_usnic_component.usnic_active_modules[i];
|
|
|
|
channel = &module->mod_channels[USNIC_PRIORITY_CHANNEL];
|
|
|
|
|
|
|
|
assert(channel->chan_deferred_recv == NULL);
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
int ret = fi_cq_read(channel->cq, &completion, 1);
|
2015-03-31 13:18:54 +03:00
|
|
|
assert(0 != ret);
|
2014-12-03 00:09:46 +03:00
|
|
|
if (OPAL_LIKELY(1 == ret)) {
|
|
|
|
opal_memchecker_base_mem_defined(&completion,
|
|
|
|
sizeof(completion));
|
|
|
|
rseg = (opal_btl_usnic_recv_segment_t*) completion.op_context;
|
|
|
|
if (OPAL_LIKELY(OPAL_BTL_USNIC_SEG_RECV ==
|
|
|
|
rseg->rs_base.us_type)) {
|
|
|
|
opal_btl_usnic_recv_fast(module, rseg, channel);
|
2017-01-10 01:10:54 +03:00
|
|
|
++module->stats.num_seg_total_completions;
|
|
|
|
++module->stats.num_seg_recv_completions;
|
2013-09-06 07:18:57 +04:00
|
|
|
fastpath_ok = false; /* prevent starvation */
|
|
|
|
return 1;
|
|
|
|
} else {
|
2014-12-03 00:09:46 +03:00
|
|
|
count += usnic_handle_completion(module, channel,
|
|
|
|
&completion);
|
2013-09-06 07:18:57 +04:00
|
|
|
}
|
2015-03-31 13:18:54 +03:00
|
|
|
} else if (OPAL_LIKELY(-FI_EAGAIN == ret)) {
|
2014-12-03 00:09:46 +03:00
|
|
|
continue;
|
2015-04-02 22:25:52 +03:00
|
|
|
} else {
|
2014-12-03 00:09:46 +03:00
|
|
|
usnic_handle_cq_error(module, channel, ret);
|
2013-09-06 07:18:57 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
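/* Nothing was handled on the fast path; re-enable it for the next
   progress call and fall through to the full progress routine. */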
fastpath_ok = true;
|
|
|
|
return count + usnic_component_progress_2();
|
|
|
|
}
|
|
|
|
|
|
|
|
static int usnic_handle_completion(
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_module_t* module,
|
|
|
|
opal_btl_usnic_channel_t *channel,
|
2014-12-03 00:09:46 +03:00
|
|
|
struct fi_cq_entry *completion)
|
2013-09-06 07:18:57 +04:00
|
|
|
{
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_segment_t* seg;
|
|
|
|
opal_btl_usnic_recv_segment_t* rseg;
|
2013-09-06 07:18:57 +04:00
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
seg = (opal_btl_usnic_segment_t*)completion->op_context;
|
2014-07-26 04:47:28 +04:00
|
|
|
rseg = (opal_btl_usnic_recv_segment_t*)seg;
|
2013-09-06 07:18:57 +04:00
|
|
|
|
2017-01-10 01:10:54 +03:00
|
|
|
++module->stats.num_seg_total_completions;
|
|
|
|
|
2015-07-03 16:13:01 +03:00
|
|
|
/* Make the completion be Valgrind-defined */
|
|
|
|
opal_memchecker_base_mem_defined(seg, sizeof(*seg));
|
|
|
|
|
2016-06-02 21:52:32 +03:00
|
|
|
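/* Serialize completion handling against other threads using the
   usNIC BTL state. */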
OPAL_THREAD_LOCK(&btl_usnic_lock);
|
|
|
|
|
2013-09-06 07:18:57 +04:00
|
|
|
/* Handle work completions */
|
|
|
|
switch(seg->us_type) {
|
|
|
|
|
|
|
|
/**** Send ACK completions ****/
|
2014-07-26 04:47:28 +04:00
|
|
|
case OPAL_BTL_USNIC_SEG_ACK:
|
2017-01-10 01:10:54 +03:00
|
|
|
++module->stats.num_seg_ack_completions;
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_ack_complete(module,
|
|
|
|
(opal_btl_usnic_ack_segment_t *)seg);
|
2013-09-06 07:18:57 +04:00
|
|
|
break;
|
|
|
|
|
2017-01-05 03:57:06 +03:00
|
|
|
/**** Send of frag segment completion (i.e., the MPI message's
|
|
|
|
one-and-only segment has completed sending) ****/
|
2014-07-26 04:47:28 +04:00
|
|
|
case OPAL_BTL_USNIC_SEG_FRAG:
|
2017-01-10 01:10:54 +03:00
|
|
|
++module->stats.num_seg_frag_completions;
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_frag_send_complete(module,
|
|
|
|
(opal_btl_usnic_frag_segment_t*)seg);
|
2013-09-06 07:18:57 +04:00
|
|
|
break;
|
|
|
|
|
2017-01-05 03:57:06 +03:00
|
|
|
/**** Send of chunk segment completion (i.e., part of a large MPI
|
|
|
|
message is done sending) ****/
|
2014-07-26 04:47:28 +04:00
|
|
|
case OPAL_BTL_USNIC_SEG_CHUNK:
|
2017-01-10 01:10:54 +03:00
|
|
|
++module->stats.num_seg_chunk_completions;
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_chunk_send_complete(module,
|
|
|
|
(opal_btl_usnic_chunk_segment_t*)seg);
|
2013-09-06 07:18:57 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
/**** Receive completions ****/
|
2014-07-26 04:47:28 +04:00
|
|
|
case OPAL_BTL_USNIC_SEG_RECV:
|
2017-01-10 01:10:54 +03:00
|
|
|
++module->stats.num_seg_recv_completions;
|
2014-12-03 00:09:46 +03:00
|
|
|
opal_btl_usnic_recv(module, rseg, channel);
|
2013-09-06 07:18:57 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
2014-12-03 00:09:46 +03:00
|
|
|
BTL_ERROR(("Unhandled completion segment type %d", seg->us_type));
|
2013-09-06 07:18:57 +04:00
|
|
|
break;
|
|
|
|
}
|
2016-06-02 21:52:32 +03:00
|
|
|
|
|
|
|
OPAL_THREAD_UNLOCK(&btl_usnic_lock);
|
2013-09-06 07:18:57 +04:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2014-12-03 00:09:46 +03:00
|
|
|
static void
|
|
|
|
usnic_handle_cq_error(opal_btl_usnic_module_t* module,
|
|
|
|
opal_btl_usnic_channel_t *channel, int cq_ret)
|
|
|
|
{
|
|
|
|
int rc;
|
|
|
|
struct fi_cq_err_entry err_entry;
|
|
|
|
opal_btl_usnic_recv_segment_t* rseg;
|
|
|
|
|
|
|
|
if (cq_ret != -FI_EAVAIL) {
|
|
|
|
BTL_ERROR(("%s: cq_read ret = %d (%s)",
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name, cq_ret,
|
2014-12-03 00:09:46 +03:00
|
|
|
fi_strerror(-cq_ret)));
|
|
|
|
channel->chan_error = true;
|
|
|
|
}
|
|
|
|
|
2014-12-04 01:44:50 +03:00
|
|
|
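/* Drain the error entry from the CQ so it can be classified below. */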
rc = fi_cq_readerr(channel->cq, &err_entry, 0);
|
2016-10-03 21:58:08 +03:00
|
|
|
if (rc == -FI_EAGAIN) {
|
2015-07-03 03:03:32 +03:00
|
|
|
return;
|
2016-10-03 21:58:08 +03:00
|
|
|
} else if (rc != 1) {
|
|
|
|
BTL_ERROR(("%s: cq_readerr ret = %d (expected 1)",
|
|
|
|
module->linux_device_name, rc));
|
2014-12-03 00:09:46 +03:00
|
|
|
channel->chan_error = true;
|
2015-07-03 03:03:32 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Silently count CRC errors. Truncation errors are usually a
|
|
|
|
different symptom of a CRC error. */
|
|
|
|
else if (FI_ECRC == err_entry.prov_errno ||
|
|
|
|
FI_ETRUNC == err_entry.prov_errno) {
|
2014-12-03 00:09:46 +03:00
|
|
|
#if MSGDEBUG1
|
|
|
|
static int once = 0;
|
|
|
|
if (once++ == 0) {
|
2015-07-03 03:03:32 +03:00
|
|
|
BTL_ERROR(("%s: Channel %d, %s",
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name,
|
2015-07-03 03:03:32 +03:00
|
|
|
channel->chan_index,
|
|
|
|
FI_ECRC == err_entry.prov_errno ?
|
|
|
|
"CRC error" : "message truncation"));
|
2014-12-03 00:09:46 +03:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/* silently count this error (CRC or truncation) as a CRC error */
|
|
|
|
++module->stats.num_crc_errors;
|
|
|
|
|
|
|
|
/* repost segment */
|
|
|
|
++module->stats.num_recv_reposts;
|
|
|
|
|
|
|
|
/* Add recv to linked list for reposting */
|
|
|
|
rseg = err_entry.op_context;
|
|
|
|
if (OPAL_BTL_USNIC_SEG_RECV == rseg->rs_base.us_type) {
|
|
|
|
rseg->rs_next = channel->repost_recv_head;
|
|
|
|
channel->repost_recv_head = rseg;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
BTL_ERROR(("%s: CQ[%d] prov_err = %d",
|
2016-08-20 05:07:14 +03:00
|
|
|
module->linux_device_name, channel->chan_index,
|
2015-07-03 03:03:32 +03:00
|
|
|
err_entry.prov_errno));
|
2014-12-03 00:09:46 +03:00
|
|
|
channel->chan_error = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-09-06 07:18:57 +04:00
|
|
|
static int usnic_component_progress_2(void)
|
|
|
|
{
|
2015-03-31 13:18:54 +03:00
|
|
|
int i, j, count = 0, num_events, ret;
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_module_t* module;
|
2014-12-03 00:09:46 +03:00
|
|
|
static struct fi_cq_entry completions[OPAL_BTL_USNIC_NUM_COMPLETIONS];
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_channel_t *channel;
|
2013-09-06 07:18:57 +04:00
|
|
|
int rc;
|
2013-07-20 02:13:58 +04:00
|
|
|
int c;
|
|
|
|
|
|
|
|
/* update our simulated clock */
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_ticks += 5000;
|
2013-07-20 02:13:58 +04:00
|
|
|
|
|
|
|
/* Poll for completions */
|
|
|
|
for (i = 0; i < mca_btl_usnic_component.num_modules; i++) {
|
2013-08-01 20:56:15 +04:00
|
|
|
module = mca_btl_usnic_component.usnic_active_modules[i];
|
2013-07-20 02:13:58 +04:00
|
|
|
|
|
|
|
/* poll each channel */
|
|
|
|
for (c=0; c<USNIC_NUM_CHANNELS; ++c) {
|
|
|
|
channel = &module->mod_channels[c];
|
|
|
|
|
2013-09-06 07:18:57 +04:00
|
|
|
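/* Finish any receive that the fast path deferred before polling
   this channel for new completions. */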
if (channel->chan_deferred_recv != NULL) {
|
2014-07-26 04:47:28 +04:00
|
|
|
(void) opal_btl_usnic_recv_frag_bookkeeping(module,
|
2013-09-06 07:18:57 +04:00
|
|
|
channel->chan_deferred_recv, channel);
|
|
|
|
channel->chan_deferred_recv = NULL;
|
|
|
|
}
|
|
|
|
|
2015-04-02 22:25:52 +03:00
|
|
|
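/* Read a batch of up to OPAL_BTL_USNIC_NUM_COMPLETIONS completions
   with a single CQ read. */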
num_events = ret =
|
|
|
|
fi_cq_read(channel->cq, completions,
|
|
|
|
OPAL_BTL_USNIC_NUM_COMPLETIONS);
|
2015-03-31 13:18:54 +03:00
|
|
|
assert(0 != ret);
|
|
|
|
opal_memchecker_base_mem_defined(&ret, sizeof(ret));
|
|
|
|
if (OPAL_UNLIKELY(ret < 0 && -FI_EAGAIN != ret)) {
|
2014-12-03 00:09:46 +03:00
|
|
|
usnic_handle_cq_error(module, channel, num_events);
|
2015-03-31 13:18:54 +03:00
|
|
|
num_events = 0;
|
2015-04-02 22:25:52 +03:00
|
|
|
} else if (-FI_EAGAIN == ret) {
|
|
|
|
num_events = 0;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
|
2015-03-31 13:18:54 +03:00
|
|
|
opal_memchecker_base_mem_defined(completions,
|
|
|
|
sizeof(completions[0]) *
|
2015-04-02 22:25:52 +03:00
|
|
|
num_events);
|
2013-09-06 07:18:57 +04:00
|
|
|
/* Handle each event */
|
2013-07-20 02:13:58 +04:00
|
|
|
for (j = 0; j < num_events; j++) {
|
2014-12-03 00:09:46 +03:00
|
|
|
count += usnic_handle_completion(module, channel,
|
|
|
|
&completions[j]);
|
2013-09-06 07:18:57 +04:00
|
|
|
}
|
2013-07-20 02:13:58 +04:00
|
|
|
|
2013-09-06 07:18:57 +04:00
|
|
|
/* Return an error if one was detected.  The report may be slightly deferred,
|
|
|
|
* since the fast path avoids checking chan_error on each completion.
|
|
|
|
*/
|
|
|
|
if (channel->chan_error) {
|
|
|
|
channel->chan_error = false;
|
2014-07-26 04:47:28 +04:00
|
|
|
return OPAL_ERROR;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* progress sends */
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_module_progress_sends(module);
|
2013-07-20 02:13:58 +04:00
|
|
|
|
|
|
|
/* Re-post all the remaining receive buffers */
|
2014-12-03 00:09:46 +03:00
|
|
|
if (OPAL_LIKELY(NULL != channel->repost_recv_head)) {
|
|
|
|
rc = opal_btl_usnic_post_recv_list(channel);
|
2013-09-06 07:18:57 +04:00
|
|
|
if (OPAL_UNLIKELY(rc != 0)) {
|
2013-07-20 02:13:58 +04:00
|
|
|
BTL_ERROR(("error posting recv: %s\n", strerror(errno)));
|
2014-07-26 04:47:28 +04:00
|
|
|
return OPAL_ERROR;
|
2013-07-20 02:13:58 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
2013-10-23 19:51:11 +04:00
|
|
|
/* could take indent as a parameter instead of hard-coding it */
|
2014-07-26 04:47:28 +04:00
|
|
|
static void dump_endpoint(opal_btl_usnic_endpoint_t *endpoint)
|
2013-10-23 19:51:11 +04:00
|
|
|
{
|
|
|
|
int i;
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_frag_t *frag;
|
|
|
|
opal_btl_usnic_send_segment_t *sseg;
|
2013-10-23 19:51:11 +04:00
|
|
|
struct in_addr ia;
|
|
|
|
char ep_addr_str[INET_ADDRSTRLEN];
|
|
|
|
char tmp[128], str[2048];
|
|
|
|
|
|
|
|
memset(ep_addr_str, 0x00, sizeof(ep_addr_str));
|
2014-12-03 00:09:46 +03:00
|
|
|
ia.s_addr = endpoint->endpoint_remote_modex.ipv4_addr;
|
2013-10-23 19:51:11 +04:00
|
|
|
inet_ntop(AF_INET, &ia, ep_addr_str, sizeof(ep_addr_str));
|
|
|
|
|
2014-07-31 00:53:41 +04:00
|
|
|
opal_output(0, " endpoint %p, %s job=%u, rank=%u rts=%s s_credits=%"PRIi32"\n",
|
|
|
|
(void *)endpoint, ep_addr_str,
|
2014-11-12 04:00:42 +03:00
|
|
|
endpoint->endpoint_proc->proc_opal->proc_name.jobid,
|
|
|
|
endpoint->endpoint_proc->proc_opal->proc_name.vpid,
|
2014-07-31 00:53:41 +04:00
|
|
|
(endpoint->endpoint_ready_to_send ? "true" : "false"),
|
|
|
|
endpoint->endpoint_send_credits);
|
2013-10-23 19:51:11 +04:00
|
|
|
opal_output(0, " endpoint->frag_send_queue:\n");
|
|
|
|
|
|
|
|
OPAL_LIST_FOREACH(frag, &endpoint->endpoint_frag_send_queue,
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_frag_t) {
|
|
|
|
opal_btl_usnic_small_send_frag_t *ssfrag;
|
|
|
|
opal_btl_usnic_large_send_frag_t *lsfrag;
|
2014-01-10 19:39:16 +04:00
|
|
|
|
2013-10-23 19:51:11 +04:00
|
|
|
snprintf(str, sizeof(str), " --> frag %p, %s", (void *)frag,
|
|
|
|
usnic_frag_type(frag->uf_type));
|
|
|
|
switch (frag->uf_type) {
|
2014-07-26 04:47:28 +04:00
|
|
|
case OPAL_BTL_USNIC_FRAG_LARGE_SEND:
|
|
|
|
lsfrag = (opal_btl_usnic_large_send_frag_t *)frag;
|
2013-10-23 19:51:11 +04:00
|
|
|
snprintf(tmp, sizeof(tmp), " tag=%"PRIu8" id=%"PRIu32" offset=%llu/%llu post_cnt=%"PRIu32" ack_bytes_left=%llu\n",
|
|
|
|
lsfrag->lsf_tag,
|
|
|
|
lsfrag->lsf_frag_id,
|
|
|
|
(unsigned long long)lsfrag->lsf_cur_offset,
|
|
|
|
(unsigned long long)lsfrag->lsf_base.sf_size,
|
|
|
|
lsfrag->lsf_base.sf_seg_post_cnt,
|
|
|
|
(unsigned long long)lsfrag->lsf_base.sf_ack_bytes_left);
|
|
|
|
strncat(str, tmp, sizeof(str) - strlen(str) - 1);
|
|
|
|
opal_output(0, "%s", str);
|
|
|
|
|
|
|
|
OPAL_LIST_FOREACH(sseg, &lsfrag->lsf_seg_chain,
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_send_segment_t) {
|
2013-10-23 19:51:11 +04:00
|
|
|
/* chunk segs are just typedefs to send segs */
|
|
|
|
opal_output(0, " chunk seg %p, chan=%s hotel=%d times_posted=%"PRIu32" pending=%s\n",
|
|
|
|
(void *)sseg,
|
|
|
|
(USNIC_PRIORITY_CHANNEL == sseg->ss_channel ?
|
|
|
|
"prio" : "data"),
|
|
|
|
sseg->ss_hotel_room,
|
|
|
|
sseg->ss_send_posted,
|
|
|
|
(sseg->ss_ack_pending ? "true" : "false"));
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
2014-07-26 04:47:28 +04:00
|
|
|
case OPAL_BTL_USNIC_FRAG_SMALL_SEND:
|
|
|
|
ssfrag = (opal_btl_usnic_small_send_frag_t *)frag;
|
2013-10-23 19:51:11 +04:00
|
|
|
snprintf(tmp, sizeof(tmp), " sf_size=%llu post_cnt=%"PRIu32" ack_bytes_left=%llu\n",
|
|
|
|
(unsigned long long)ssfrag->ssf_base.sf_size,
|
|
|
|
ssfrag->ssf_base.sf_seg_post_cnt,
|
|
|
|
(unsigned long long)ssfrag->ssf_base.sf_ack_bytes_left);
|
|
|
|
strncat(str, tmp, sizeof(str) - strlen(str) - 1);
|
|
|
|
opal_output(0, "%s", str);
|
|
|
|
|
|
|
|
sseg = &ssfrag->ssf_segment;
|
|
|
|
opal_output(0, " small seg %p, chan=%s hotel=%d times_posted=%"PRIu32" pending=%s\n",
|
|
|
|
(void *)sseg,
|
|
|
|
(USNIC_PRIORITY_CHANNEL == sseg->ss_channel ?
|
|
|
|
"prio" : "data"),
|
|
|
|
sseg->ss_hotel_room,
|
|
|
|
sseg->ss_send_posted,
|
|
|
|
(sseg->ss_ack_pending ? "true" : "false"));
|
|
|
|
break;
|
|
|
|
|
2014-07-26 04:47:28 +04:00
|
|
|
case OPAL_BTL_USNIC_FRAG_PUT_DEST:
|
2013-10-23 19:51:11 +04:00
|
|
|
/* put_dest frags are just a typedef to generic frags */
|
2014-07-10 21:18:03 +04:00
|
|
|
snprintf(tmp, sizeof(tmp), " put_addr=%p\n", frag->uf_remote_seg[0].seg_addr.pval);
|
2013-10-23 19:51:11 +04:00
|
|
|
strncat(str, tmp, sizeof(str) - strlen(str) - 1);
|
|
|
|
opal_output(0, "%s", str);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now examine the hotel for this endpoint and dump any segments we find
|
|
|
|
* there. Yes, this peeks at members that are technically "private", so
|
|
|
|
* eventually this should be done through some sort of debug or iteration
|
|
|
|
* interface in the hotel code. */
|
|
|
|
opal_output(0, " endpoint->endpoint_sent_segs (%p):\n",
|
|
|
|
(void *)endpoint->endpoint_sent_segs);
|
|
|
|
for (i = 0; i < WINDOW_SIZE; ++i) {
|
|
|
|
sseg = endpoint->endpoint_sent_segs[i];
|
|
|
|
if (NULL != sseg) {
|
|
|
|
opal_output(0, " [%d] sseg=%p %s chan=%s hotel=%d times_posted=%"PRIu32" pending=%s\n",
|
|
|
|
i,
|
|
|
|
(void *)sseg,
|
2014-12-03 00:09:46 +03:00
|
|
|
usnic_seg_type_str(sseg->ss_base.us_type),
|
2013-10-23 19:51:11 +04:00
|
|
|
(USNIC_PRIORITY_CHANNEL == sseg->ss_channel ?
|
|
|
|
"prio" : "data"),
|
|
|
|
sseg->ss_hotel_room,
|
|
|
|
sseg->ss_send_posted,
|
|
|
|
(sseg->ss_ack_pending ? "true" : "false"));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-02-26 11:40:23 +04:00
|
|
|
opal_output(0, " ack_needed=%s n_t=%"UDSEQ" n_a=%"UDSEQ" n_r=%"UDSEQ" n_s=%"UDSEQ" rfstart=%"PRIu32"\n",
|
2013-10-23 19:51:11 +04:00
|
|
|
(endpoint->endpoint_ack_needed?"true":"false"),
|
|
|
|
endpoint->endpoint_next_seq_to_send,
|
|
|
|
endpoint->endpoint_ack_seq_rcvd,
|
|
|
|
endpoint->endpoint_next_contig_seq_to_recv,
|
|
|
|
endpoint->endpoint_highest_seq_rcvd,
|
|
|
|
endpoint->endpoint_rfstart);
|
|
|
|
|
|
|
|
if (dump_bitvectors) {
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_usnic_snprintf_bool_array(str, sizeof(str),
|
2013-10-23 19:51:11 +04:00
|
|
|
endpoint->endpoint_rcvd_segs,
|
|
|
|
WINDOW_SIZE);
|
|
|
|
opal_output(0, " rcvd_segs 0x%s", str);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-07-26 04:47:28 +04:00
|
|
|
void opal_btl_usnic_component_debug(void)
|
2013-10-23 19:51:11 +04:00
|
|
|
{
    int i;
    opal_btl_usnic_module_t *module;
    opal_btl_usnic_endpoint_t *endpoint;
    opal_btl_usnic_send_segment_t *sseg;
    opal_list_item_t *item;
    const opal_proc_t *proc = opal_proc_local_get();

    opal_output(0, "*** dumping usnic state for MPI_COMM_WORLD rank %u ***\n",
                proc->proc_name.vpid);
    for (i = 0; i < (int)mca_btl_usnic_component.num_modules; ++i) {
        module = mca_btl_usnic_component.usnic_active_modules[i];

        opal_output(0, "active_modules[%d]=%p %s max{frag,chunk,tiny}=%llu,%llu,%llu\n",
                    i, (void *)module, module->linux_device_name,
                    (unsigned long long)module->max_frag_payload,
                    (unsigned long long)module->max_chunk_payload,
                    (unsigned long long)module->max_tiny_payload);

        opal_output(0, " endpoints_with_sends:\n");
        OPAL_LIST_FOREACH(endpoint, &module->endpoints_with_sends,
                          opal_btl_usnic_endpoint_t) {
            dump_endpoint(endpoint);
        }

        opal_output(0, " endpoints_that_need_acks:\n");
        OPAL_LIST_FOREACH(endpoint, &module->endpoints_that_need_acks,
                          opal_btl_usnic_endpoint_t) {
            dump_endpoint(endpoint);
        }

        /* the all_endpoints list uses a different list item member */
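        /* (Each endpoint is linked into all_endpoints through its
           endpoint_endpoint_li field rather than its primary list item,
           so the list is walked by hand with container_of() instead of
           OPAL_LIST_FOREACH.) */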
        opal_output(0, " all_endpoints:\n");
        opal_mutex_lock(&module->all_endpoints_lock);
        item = opal_list_get_first(&module->all_endpoints);
        while (item != opal_list_get_end(&module->all_endpoints)) {
            endpoint = container_of(item, mca_btl_base_endpoint_t,
                                    endpoint_endpoint_li);
            item = opal_list_get_next(item);
            dump_endpoint(endpoint);
        }
        opal_mutex_unlock(&module->all_endpoints_lock);

        opal_output(0, " pending_resend_segs:\n");
        OPAL_LIST_FOREACH(sseg, &module->pending_resend_segs,
                          opal_btl_usnic_send_segment_t) {
            opal_output(0, " sseg %p\n", (void *)sseg);
        }

        opal_btl_usnic_print_stats(module, " manual", /*reset=*/false);
    }
}
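
/*
 * A minimal sketch of one way to trigger this dump: because the function
 * takes no arguments, it can be called on demand from a debugger that is
 * attached to a running process, e.g.
 *
 *     (gdb) call opal_btl_usnic_component_debug()
 *
 * The output goes to opal_output stream 0, which normally ends up on
 * stderr.
 */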
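
/* Presumably pulls the component's embedded unit tests into this
   translation unit (see test/btl_usnic_component_test.h). */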
#include "test/btl_usnic_component_test.h"