2013-07-19 22:13:58 +00:00
|
|
|
/*
|
2016-03-28 09:10:12 -07:00
|
|
|
* Copyright (c) 2013-2016 Cisco Systems, Inc. All rights reserved.
|
2017-06-21 21:10:35 -07:00
|
|
|
* Copyright (c) 2016-2017 Intel, Inc. All rights reserved.
|
2013-07-19 22:13:58 +00:00
|
|
|
* $COPYRIGHT$
|
|
|
|
*
|
|
|
|
* Additional copyrights may follow
|
|
|
|
*
|
|
|
|
* $HEADER$
|
|
|
|
*/
|
|
|
|
|
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communication are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
|
|
|
#include "opal_config.h"
|
2013-07-19 22:13:58 +00:00
|
|
|
|
2016-12-29 07:31:35 -08:00
|
|
|
#include "opal/mca/hwloc/base/base.h"
|
2014-07-26 00:47:28 +00:00
|
|
|
#include "opal/constants.h"
|
2014-12-02 13:09:46 -08:00
|
|
|
|
|
|
|
#if BTL_IN_OPAL
|
2014-07-26 00:47:28 +00:00
|
|
|
#include "opal/mca/btl/base/base.h"
|
2014-12-02 13:09:46 -08:00
|
|
|
#else
|
|
|
|
#include "ompi/mca/btl/base/base.h"
|
|
|
|
#endif
|
2013-07-19 22:13:58 +00:00
|
|
|
|
|
|
|
#include "btl_usnic_hwloc.h"
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Local variables
|
|
|
|
*/
|
|
|
|
static hwloc_obj_t my_numa_node = NULL;
|
|
|
|
static int num_numa_nodes = 0;
|
2017-06-21 21:10:35 -07:00
|
|
|
static struct hwloc_distances_s *matrix = NULL;
|
|
|
|
#if HWLOC_API_VERSION >= 0x20000
|
|
|
|
static unsigned int matrix_nr = 1;
|
|
|
|
#endif
|
2013-07-19 22:13:58 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get the hwloc distance matrix (if we don't already have it).
|
|
|
|
*/
|
|
|
|
static int get_distance_matrix(void)
|
|
|
|
{
|
2017-06-21 21:10:35 -07:00
|
|
|
#if HWLOC_API_VERSION < 0x20000
|
|
|
|
/* Note that the matrix data structure belongs to hwloc; we are not
|
|
|
|
* responsible for freeing it. */
|
|
|
|
|
2013-07-19 22:13:58 +00:00
|
|
|
if (NULL == matrix) {
|
|
|
|
matrix = hwloc_get_whole_distance_matrix_by_type(opal_hwloc_topology,
|
|
|
|
HWLOC_OBJ_NODE);
|
|
|
|
}
|
|
|
|
|
2014-07-26 00:47:28 +00:00
|
|
|
return (NULL == matrix) ? OPAL_ERROR : OPAL_SUCCESS;
|
2017-06-21 21:10:35 -07:00
|
|
|
#else
|
|
|
|
if (0 != hwloc_distances_get_by_type(opal_hwloc_topology, HWLOC_OBJ_NODE,
|
|
|
|
&matrix_nr, &matrix,
|
|
|
|
HWLOC_DISTANCES_KIND_MEANS_LATENCY, 0) || 0 == matrix_nr) {
|
|
|
|
return OPAL_ERROR;
|
|
|
|
}
|
|
|
|
return OPAL_SUCCESS;
|
|
|
|
#endif
|
2013-07-19 22:13:58 +00:00
|
|
|
}
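The distance matrix obtained above is a flat array. As a minimal sketch, assuming a row-major layout of n x n entries (the layout that the indexing later in this file relies on), a single entry could be looked up as follows. `toy_distance` and its parameters are hypothetical names used only for illustration; they are not part of hwloc or Open MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of hwloc or Open MPI): return the entry
 * for the pair (from, to) in an n x n matrix that is stored row-major
 * in a flat array, i.e. row "from", column "to". */
static unsigned long toy_distance(const unsigned long *values, size_t n,
                                  size_t from, size_t to)
{
    assert(NULL != values);
    assert(from < n && to < n);
    return values[from * n + to];
}
```

With a 2 x 2 matrix stored as { 10, 20, 30, 10 }, the call toy_distance(values, 2, 0, 1) reads row 0, column 1, and toy_distance(values, 2, 1, 0) reads row 1, column 0.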
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the NUMA node that covers a given cpuset
|
|
|
|
*/
|
|
|
|
static hwloc_obj_t find_numa_node(hwloc_bitmap_t cpuset)
|
|
|
|
{
|
|
|
|
hwloc_obj_t obj;
|
|
|
|
|
|
|
|
obj =
|
|
|
|
hwloc_get_first_largest_obj_inside_cpuset(opal_hwloc_topology, cpuset);
|
|
|
|
|
|
|
|
/* Go upwards until we hit the NUMA node or run out of parents */
|
|
|
|
while (obj->type > HWLOC_OBJ_NODE &&
|
|
|
|
NULL != obj->parent) {
|
|
|
|
obj = obj->parent;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Make sure we ended up on the NUMA node */
|
|
|
|
if (obj->type != HWLOC_OBJ_NODE) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_numa: could not find NUMA node where this process is bound; filtering by NUMA distance not possible");
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Finally, make sure that our cpuset doesn't span more than 1
|
|
|
|
NUMA node */
|
|
|
|
if (hwloc_get_nbobjs_inside_cpuset_by_type(opal_hwloc_topology,
|
|
|
|
cpuset, HWLOC_OBJ_NODE) > 1) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_numa: this process is bound to more than 1 NUMA node; filtering by NUMA distance not possible");
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return obj;
|
|
|
|
}
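The upward walk in find_numa_node() can be illustrated without hwloc. The sketch below assumes a toy object tree whose type enum is ordered so that types nearer the root have smaller values, which is the property the `obj->type > HWLOC_OBJ_NODE` comparison depends on. All `toy_*` names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for topology object types. The assumption is
 * that smaller values sit closer to the root of the tree. */
typedef enum {
    TOY_MACHINE = 0,
    TOY_NUMA = 1,
    TOY_CORE = 2,
    TOY_PU = 3
} toy_type_t;

typedef struct toy_obj {
    toy_type_t type;
    struct toy_obj *parent;
} toy_obj_t;

/* Climb parent links while the current object is below the NUMA level
 * and a parent exists. Return the NUMA-level object if the walk ends
 * on one, or NULL otherwise. */
static toy_obj_t *toy_find_numa(toy_obj_t *obj)
{
    if (NULL == obj) {
        return NULL;
    }
    while (obj->type > TOY_NUMA && NULL != obj->parent) {
        obj = obj->parent;
    }
    return (TOY_NUMA == obj->type) ? obj : NULL;
}
```

Starting from a leaf, the walk stops at the first NUMA-level ancestor. An object with no NUMA-level ancestor, or one that already sits above the NUMA level, yields NULL, which mirrors the "could not find NUMA node" path above.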
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find my NUMA node in the hwloc topology. This is a Cisco
|
|
|
|
* UCS-specific BTL, so I know that I'll always have a NUMA node
|
|
|
|
* (i.e., not some unknown server type that may not have or report a
|
|
|
|
* NUMA node).
|
|
|
|
*
|
|
|
|
* Note that the my_numa_node value we find is just a handle; we
|
|
|
|
* aren't responsible for freeing it.
|
|
|
|
*/
|
|
|
|
static int find_my_numa_node(void)
|
|
|
|
{
|
|
|
|
hwloc_obj_t obj;
|
|
|
|
hwloc_bitmap_t cpuset;
|
|
|
|
|
|
|
|
if (NULL != my_numa_node) {
|
2014-07-26 00:47:28 +00:00
|
|
|
return OPAL_SUCCESS;
|
2013-07-19 22:13:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Get this process' binding */
|
|
|
|
cpuset = hwloc_bitmap_alloc();
|
|
|
|
if (NULL == cpuset) {
|
2014-07-26 00:47:28 +00:00
|
|
|
return OPAL_ERR_OUT_OF_RESOURCE;
|
2013-07-19 22:13:58 +00:00
|
|
|
}
|
|
|
|
if (0 != hwloc_get_cpubind(opal_hwloc_topology, cpuset, 0)) {
|
|
|
|
hwloc_bitmap_free(cpuset);
|
2014-07-26 00:47:28 +00:00
|
|
|
return OPAL_ERR_NOT_AVAILABLE;
|
2013-07-19 22:13:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Find the NUMA node that covers this process' cpuset */
|
|
|
|
obj = find_numa_node(cpuset);
|
|
|
|
hwloc_bitmap_free(cpuset);
|
|
|
|
if (NULL == obj) {
|
2014-07-26 00:47:28 +00:00
|
|
|
return OPAL_ERR_NOT_AVAILABLE;
|
2013-07-19 22:13:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Happiness */
|
|
|
|
my_numa_node = obj;
|
|
|
|
num_numa_nodes = hwloc_get_nbobjs_by_type(opal_hwloc_topology,
|
|
|
|
HWLOC_OBJ_NODE);
|
2014-07-26 00:47:28 +00:00
|
|
|
return OPAL_SUCCESS;
|
2013-07-19 22:13:58 +00:00
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find a NUMA node covering the device associated with this module
|
|
|
|
*/
|
2014-07-26 00:47:28 +00:00
|
|
|
static hwloc_obj_t find_device_numa(opal_btl_usnic_module_t *module)
|
2013-07-19 22:13:58 +00:00
|
|
|
{
|
2014-12-02 13:09:46 -08:00
|
|
|
struct fi_usnic_info *uip;
|
2013-07-19 22:13:58 +00:00
|
|
|
hwloc_obj_t obj;
|
|
|
|
|
|
|
|
/* Bozo checks */
|
|
|
|
assert(NULL != matrix);
|
|
|
|
assert(NULL != my_numa_node);
|
|
|
|
|
2014-12-02 13:09:46 -08:00
|
|
|
uip = &module->usnic_info;
|
|
|
|
|
|
|
|
/* Look for the IP device name in the hwloc topology (the usnic
|
|
|
|
device is simply an alternate API to reach the same device, so
|
|
|
|
if we find the IP device name, we've found the usNIC device) */
|
|
|
|
obj = NULL;
|
|
|
|
while (NULL != (obj = hwloc_get_next_osdev(opal_hwloc_topology, obj))) {
|
|
|
|
assert(HWLOC_OBJ_OS_DEVICE == obj->type);
|
2015-02-03 10:23:44 -08:00
|
|
|
if (0 == strcmp(obj->name, uip->ui.v1.ui_ifname)) {
|
2014-12-02 13:09:46 -08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Did not find it */
|
|
|
|
if (NULL == obj) {
|
2013-07-19 22:13:58 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
2014-12-02 13:09:46 -08:00
|
|
|
|
|
|
|
/* Search upwards to find the device's NUMA node */
|
|
|
|
/* Go upwards until we hit the NUMA node or run out of parents */
|
|
|
|
while (obj->type > HWLOC_OBJ_NODE &&
|
|
|
|
NULL != obj->parent) {
|
|
|
|
obj = obj->parent;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Make sure we ended up on the NUMA node */
|
|
|
|
if (obj->type != HWLOC_OBJ_NODE) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_numa: could not find NUMA node for %s; filtering by NUMA distance not possible",
|
2016-08-19 19:07:14 -07:00
|
|
|
module->linux_device_name);
|
2013-07-19 22:13:58 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return obj;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Public entry point: find the hwloc NUMA distance from this process
|
|
|
|
* to the usnic device in the specified module.
|
|
|
|
*/
|
2014-07-26 00:47:28 +00:00
|
|
|
int opal_btl_usnic_hwloc_distance(opal_btl_usnic_module_t *module)
|
2013-07-19 22:13:58 +00:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
hwloc_obj_t dev_numa;
|
|
|
|
|
|
|
|
/* Bozo check */
|
|
|
|
assert(NULL != module);
|
|
|
|
|
|
|
|
/* Is this process bound? */
|
2013-09-06 03:21:34 +00:00
|
|
|
if (!proc_bound()) {
|
2013-07-19 22:13:58 +00:00
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_numa: not sorting devices by NUMA distance (process not bound)");
|
2014-07-26 00:47:28 +00:00
|
|
|
return OPAL_SUCCESS;
|
2013-07-19 22:13:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_numa: filtering devices by NUMA distance");
|
|
|
|
|
2016-12-29 07:31:35 -08:00
|
|
|
/* ensure we have the topology */
|
|
|
|
if (OPAL_SUCCESS != opal_hwloc_base_get_topology()) {
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_numa: not sorting devices by NUMA distance (topology not available)");
|
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
|
2013-07-19 22:13:58 +00:00
|
|
|
/* Get the hwloc distance matrix for all NUMA nodes */
|
2014-07-26 00:47:28 +00:00
|
|
|
if (OPAL_SUCCESS != (ret = get_distance_matrix())) {
|
2013-07-19 22:13:58 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Find my NUMA node */
|
2014-07-26 00:47:28 +00:00
|
|
|
if (OPAL_SUCCESS != (ret = find_my_numa_node())) {
|
2013-07-19 22:13:58 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
/* If my_numa_node is still NULL, that means we span more than 1
|
|
|
|
NUMA node. So... no sorting/pruning for you! */
|
|
|
|
if (NULL == my_numa_node) {
|
2014-07-26 00:47:28 +00:00
|
|
|
return OPAL_SUCCESS;
|
2013-07-19 22:13:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Find the NUMA node covering this module's device */
|
|
|
|
dev_numa = find_device_numa(module);
|
|
|
|
|
|
|
|
/* Look up the distance between my NUMA node and the NUMA node of
|
|
|
|
the device */
|
2017-06-21 21:10:35 -07:00
|
|
|
#if HWLOC_API_VERSION < 0x20000
|
2013-07-19 22:13:58 +00:00
|
|
|
if (NULL != dev_numa) {
|
2014-07-30 20:56:15 +00:00
|
|
|
module->numa_distance =
|
2013-07-19 22:13:58 +00:00
|
|
|
matrix->latency[dev_numa->logical_index * num_numa_nodes +
|
|
|
|
my_numa_node->logical_index];
|
|
|
|
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_numa: %s is distance %d from me",
|
2016-08-19 19:07:14 -07:00
|
|
|
module->linux_device_name,
|
2013-07-19 22:13:58 +00:00
|
|
|
module->numa_distance);
|
|
|
|
}
|
2017-06-21 21:10:35 -07:00
|
|
|
#else
|
|
|
|
if (NULL != dev_numa) {
|
|
|
|
int myindex, devindex;
|
|
|
|
unsigned int j;
|
|
|
|
myindex = -1;
|
|
|
|
for (j=0; j < matrix_nr; j++) {
|
|
|
|
if (my_numa_node == matrix->objs[j]) {
|
|
|
|
myindex = j;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (-1 == myindex) {
|
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
devindex = -1;
|
|
|
|
for (j=0; j < matrix_nr; j++) {
|
|
|
|
if (dev_numa == matrix->objs[j]) {
|
|
|
|
devindex = j;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (-1 == devindex) {
|
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
|
|
|
|
module->numa_distance =
|
|
|
|
matrix->values[(devindex * num_numa_nodes) + myindex];
|
|
|
|
|
|
|
|
opal_output_verbose(5, USNIC_OUT,
|
|
|
|
"btl:usnic:filter_numa: %s is distance %d from me",
|
|
|
|
module->linux_device_name,
|
|
|
|
module->numa_distance);
|
|
|
|
}
|
|
|
|
#endif
|
2013-07-19 22:13:58 +00:00
|
|
|
|
2014-07-26 00:47:28 +00:00
|
|
|
return OPAL_SUCCESS;
|
2013-07-19 22:13:58 +00:00
|
|
|
}
|