/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2006 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2011-2012 Cisco Systems, Inc.  All rights reserved.
 * Copyright (c) 2012-2013 Los Alamos National Security, LLC.
 *                         All rights reserved.
 * Copyright (c) 2013      Intel, Inc.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "opal_config.h"

#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif

#include "opal/runtime/opal.h"
#include "opal/constants.h"
#include "opal/util/argv.h"
#include "opal/util/output.h"
#include "opal/util/os_dirpath.h"
#include "opal/util/show_help.h"
#include "opal/threads/tsd.h"

#include "opal/mca/hwloc/hwloc.h"
#include "opal/mca/hwloc/base/base.h"

/*
 * Provide the hwloc object that corresponds to the given
 * LOGICAL processor id.  Remember: "processor" here [usually] means "core" --
 * except that on some platforms, hwloc won't find any cores; it'll
 * only find PUs (!).  On such platforms, then do the same calculation
 * but with PUs instead of COREs.
 */
static hwloc_obj_t get_pu(hwloc_topology_t topo, int lid)
{
    hwloc_obj_type_t obj_type = HWLOC_OBJ_CORE;
    hwloc_obj_t obj;

    /* hwloc isn't able to find cores on all platforms.  Example:
       PPC64 running RHEL 5.4 (linux kernel 2.6.18) only reports NUMA
       nodes and PU's.  Fine.

       However, note that hwloc_get_obj_by_type() will return NULL in
       2 (effectively) different cases:

       - no objects of the requested type were found
       - the Nth object of the requested type was not found

       So first we have to see if we can find *any* cores by looking
       for the 0th core.  If we find it, then try to find the Nth
       core.  Otherwise, try to find the Nth PU. */
    if (NULL == hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0)) {
        obj_type = HWLOC_OBJ_PU;
    }

    /* Now do the actual lookup. */
    obj = hwloc_get_obj_by_type(topo, obj_type, lid);
    if (NULL == obj) {
        opal_show_help("help-opal-hwloc-base.txt",
                       "logical-cpu-not-found", true,
                       opal_hwloc_base_cpu_set);
        return NULL;
    }

    /* Found the right core (or PU). Return the object */
    return obj;
}

/* determine the node-level available cpuset based on
 * online vs allowed vs user-specified cpus
 */
int opal_hwloc_base_filter_cpus(hwloc_topology_t topo)
{
    hwloc_obj_t root, pu;
    hwloc_cpuset_t avail = NULL, pucpus, res;
    opal_hwloc_topo_data_t *sum;
    char **ranges=NULL, **range=NULL;
    int idx, cpu, start, end;

    root = hwloc_get_root_obj(topo);

    if (NULL == root->userdata) {
        root->userdata = (void*)OBJ_NEW(opal_hwloc_topo_data_t);
    }
    sum = (opal_hwloc_topo_data_t*)root->userdata;

    /* should only ever enter here once, but check anyway */
    if (NULL != sum->available) {
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:filter_cpus specified - already done"));
        return OPAL_SUCCESS;
    }

    /* process any specified default cpu set against this topology */
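    /* the cpu set is parsed below as a comma-separated list of
     * logical cpu ids and/or id ranges (e.g., "0,2,5-7"): each entry
     * is either a single id or a "start-end" pair
     */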
    if (NULL == opal_hwloc_base_cpu_set) {
        /* get the root available cpuset */
        avail = hwloc_bitmap_alloc();
        hwloc_bitmap_and(avail, root->online_cpuset, root->allowed_cpuset);
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base: no cpus specified - using root available cpuset"));
    } else {
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base: filtering cpuset"));
        /* find the specified logical cpus */
        ranges = opal_argv_split(opal_hwloc_base_cpu_set, ',');
        avail = hwloc_bitmap_alloc();
        hwloc_bitmap_zero(avail);
        res = hwloc_bitmap_alloc();
        pucpus = hwloc_bitmap_alloc();
        for (idx=0; idx < opal_argv_count(ranges); idx++) {
            range = opal_argv_split(ranges[idx], '-');
            switch (opal_argv_count(range)) {
            case 1:
                /* only one cpu given - get that object */
                cpu = strtoul(range[0], NULL, 10);
                if (NULL == (pu = get_pu(topo, cpu))) {
                    opal_argv_free(ranges);
                    opal_argv_free(range);
                    return OPAL_ERROR;
                }
                hwloc_bitmap_and(pucpus, pu->online_cpuset, pu->allowed_cpuset);
                hwloc_bitmap_or(res, avail, pucpus);
                hwloc_bitmap_copy(avail, res);
                break;
            case 2:
                /* range given */
                start = strtoul(range[0], NULL, 10);
                end = strtoul(range[1], NULL, 10);
                for (cpu=start; cpu <= end; cpu++) {
                    if (NULL == (pu = get_pu(topo, cpu))) {
                        opal_argv_free(ranges);
                        opal_argv_free(range);
                        hwloc_bitmap_free(avail);
                        return OPAL_ERROR;
                    }
                    hwloc_bitmap_and(pucpus, pu->online_cpuset, pu->allowed_cpuset);
                    hwloc_bitmap_or(res, avail, pucpus);
                    hwloc_bitmap_copy(avail, res);
                }
                break;
            default:
                return OPAL_ERR_BAD_PARAM;
            }
            opal_argv_free(range);
        }
        if (NULL != ranges) {
            opal_argv_free(ranges);
        }
        hwloc_bitmap_free(res);
        hwloc_bitmap_free(pucpus);
    }

    /* cache this info */
    sum->available = avail;

    return OPAL_SUCCESS;
}
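
/* search the topology for the smallest cache line size reported by
 * any L2 cache and record it in the opal_cache_line_size global;
 * if no L2 cache is found, the default set in opal_init.c is kept
 */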
static void fill_cache_line_size(void)
{
    int i = 0;
    unsigned size;
    hwloc_obj_t obj;
    bool found = false;

    /* Look for the smallest L2 cache line size */
    size = 4096;
    while (1) {
        obj = opal_hwloc_base_get_obj_by_type(opal_hwloc_topology,
                                              HWLOC_OBJ_CACHE, 2,
                                              i, OPAL_HWLOC_LOGICAL);
        if (NULL == obj) {
            break;
        } else {
            found = true;
            if (NULL != obj->attr &&
                size > obj->attr->cache.linesize) {
                size = obj->attr->cache.linesize;
            }
        }
        ++i;
    }

    /* If we found an L2 cache line size in the hwloc data, save it in
       opal_cache_line_size.  Otherwise, we'll leave whatever default
       was set in opal_init.c */
    if (found) {
        opal_cache_line_size = (int) size;
    }
}
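
/* discover the topology for this node - either by direct query
 * of the local hardware via hwloc, or by loading it from the XML
 * file given in opal_hwloc_base_topo_file
 */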
int opal_hwloc_base_get_topology(void)
{
    int rc;

    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                         "hwloc:base:get_topology"));

    if (NULL == opal_hwloc_base_topo_file) {
        if (0 != hwloc_topology_init(&opal_hwloc_topology) ||
            0 != hwloc_topology_set_flags(opal_hwloc_topology,
                                          (HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM |
                                           HWLOC_TOPOLOGY_FLAG_IO_DEVICES)) ||
            0 != hwloc_topology_load(opal_hwloc_topology)) {
            return OPAL_ERR_NOT_SUPPORTED;
        }

        /* filter the cpus thru any default cpu set */
        rc = opal_hwloc_base_filter_cpus(opal_hwloc_topology);
        if (OPAL_SUCCESS != rc) {
            return rc;
        }
    } else {
        rc = opal_hwloc_base_set_topology(opal_hwloc_base_topo_file);
        if (OPAL_SUCCESS != rc) {
            return rc;
        }
    }

    /* fill opal_cache_line_size global with the smallest L2 cache
       line size */
    fill_cache_line_size();

    return rc;
}
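
/* load the topology from an hwloc XML file instead of querying
 * the local hardware, then scrub the hostname info and mark
 * process binding as supported so the file can serve as a
 * generic description of the node
 */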
int opal_hwloc_base_set_topology(char *topofile)
{
    hwloc_obj_t obj;
    unsigned j, k;
    struct hwloc_topology_support *support;
    int rc;

    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                         "hwloc:base:set_topology %s", topofile));

    if (NULL != opal_hwloc_topology) {
        hwloc_topology_destroy(opal_hwloc_topology);
    }
    if (0 != hwloc_topology_init(&opal_hwloc_topology)) {
        return OPAL_ERR_NOT_SUPPORTED;
    }
    if (0 != hwloc_topology_set_xml(opal_hwloc_topology, topofile)) {
        hwloc_topology_destroy(opal_hwloc_topology);
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:set_topology bad topo file"));
        return OPAL_ERR_NOT_SUPPORTED;
    }
    /* since we are loading this from an external source, we have to
     * explicitly set a flag so hwloc sets things up correctly
     */
    if (0 != hwloc_topology_set_flags(opal_hwloc_topology,
                                      (HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM |
                                       HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM |
                                       HWLOC_TOPOLOGY_FLAG_IO_DEVICES))) {
        hwloc_topology_destroy(opal_hwloc_topology);
        return OPAL_ERR_NOT_SUPPORTED;
    }
    if (0 != hwloc_topology_load(opal_hwloc_topology)) {
        hwloc_topology_destroy(opal_hwloc_topology);
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:set_topology failed to load"));
        return OPAL_ERR_NOT_SUPPORTED;
    }

    /* remove the hostname from the topology.  Unfortunately, hwloc
     * decided to add the source hostname to the "topology", thus
     * rendering it unusable as a pure topological description.  So
     * we remove that information here.
     */
    obj = hwloc_get_root_obj(opal_hwloc_topology);
    for (k=0; k < obj->infos_count; k++) {
        if (NULL == obj->infos[k].name ||
            NULL == obj->infos[k].value) {
            continue;
        }
        if (0 == strncmp(obj->infos[k].name, "HostName", strlen("HostName"))) {
            free(obj->infos[k].name);
            free(obj->infos[k].value);
            /* left justify the array */
            for (j=k; j < obj->infos_count-1; j++) {
                obj->infos[j] = obj->infos[j+1];
            }
            obj->infos[obj->infos_count-1].name = NULL;
            obj->infos[obj->infos_count-1].value = NULL;
            obj->infos_count--;
            break;
        }
    }

    /* unfortunately, hwloc does not include support info in its
     * xml output :-((  We default to assuming it is present as
     * systems that use this option are likely to provide
     * binding support
     */
    support = (struct hwloc_topology_support*)hwloc_topology_get_support(opal_hwloc_topology);
    support->cpubind->set_thisproc_cpubind = true;
    support->membind->set_thisproc_membind = true;

    /* filter the cpus thru any default cpu set */
    rc = opal_hwloc_base_filter_cpus(opal_hwloc_topology);
    if (OPAL_SUCCESS != rc) {
        return rc;
    }

    /* fill opal_cache_line_size global with the smallest L2 cache
       line size */
    fill_cache_line_size();

    /* all done */
    return OPAL_SUCCESS;
}
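
/* recursively release the opal userdata attached to this hwloc
 * object and all of its children
 */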
static void free_object(hwloc_obj_t obj)
{
    opal_hwloc_obj_data_t *data;
    unsigned k;

    /* free any data hanging on this object */
    if (NULL != obj->userdata) {
        data = (opal_hwloc_obj_data_t*)obj->userdata;
        OBJ_RELEASE(data);
        obj->userdata = NULL;
    }

    /* loop thru our children */
    for (k=0; k < obj->arity; k++) {
        free_object(obj->children[k]);
    }
}
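
/* release all opal userdata hanging off the given topology,
 * then destroy the topology itself
 */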
void opal_hwloc_base_free_topology(hwloc_topology_t topo)
{
    hwloc_obj_t obj;
    opal_hwloc_topo_data_t *rdata;
    unsigned k;

    obj = hwloc_get_root_obj(topo);
    /* release the root-level userdata */
    if (NULL != obj->userdata) {
        rdata = (opal_hwloc_topo_data_t*)obj->userdata;
        OBJ_RELEASE(rdata);
        obj->userdata = NULL;
    }
    /* now recursively descend and release userdata
     * in the rest of the objects
     */
    for (k=0; k < obj->arity; k++) {
        free_object(obj->children[k]);
    }
    hwloc_topology_destroy(topo);
}
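
/* fill opal_hwloc_my_cpuset with the cpus this process is bound to;
 * if the process is not bound, fall back to the root object's
 * available cpuset
 */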
void opal_hwloc_base_get_local_cpuset(void)
{
    hwloc_obj_t root;
    hwloc_cpuset_t base_cpus;

    if (NULL != opal_hwloc_topology) {
        if (NULL == opal_hwloc_my_cpuset) {
            opal_hwloc_my_cpuset = hwloc_bitmap_alloc();
        }

        /* get the cpus we are bound to */
        if (hwloc_get_cpubind(opal_hwloc_topology,
                              opal_hwloc_my_cpuset,
                              HWLOC_CPUBIND_PROCESS) < 0) {
            /* we are not bound - use the root's available cpuset */
            root = hwloc_get_root_obj(opal_hwloc_topology);
            base_cpus = opal_hwloc_base_get_available_cpus(opal_hwloc_topology, root);
            hwloc_bitmap_copy(opal_hwloc_my_cpuset, base_cpus);
        }
    }
}
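
/* report a memory binding failure once per process via show_help,
 * unless the user asked for silence; returns the provided rc when
 * a report is issued, otherwise OPAL_SUCCESS
 */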
int opal_hwloc_base_report_bind_failure(const char *file,
                                        int line,
                                        const char *msg, int rc)
{
    static int already_reported = 0;

    if (!already_reported &&
        OPAL_HWLOC_BASE_MBFA_SILENT != opal_hwloc_base_mbfa) {
        char hostname[64];
        gethostname(hostname, sizeof(hostname));

        opal_show_help("help-opal-hwloc-base.txt", "mbind failure", true,
                       hostname, getpid(), file, line, msg,
                       (OPAL_HWLOC_BASE_MBFA_WARN == opal_hwloc_base_mbfa) ?
                       "Warning -- your job will continue, but possibly with degraded performance" :
                       "ERROR -- your job may abort or behave erratically");
        already_reported = 1;
        return rc;
    }

    return OPAL_SUCCESS;
}
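
/* provide the available cpuset (online AND allowed, further filtered
 * against the node-level available set) for the given object, caching
 * the result in the object's userdata
 */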
hwloc_cpuset_t opal_hwloc_base_get_available_cpus(hwloc_topology_t topo,
                                                  hwloc_obj_t obj)
{
    hwloc_obj_t root;
    hwloc_cpuset_t avail, specd=NULL;
    opal_hwloc_topo_data_t *rdata;
    opal_hwloc_obj_data_t *data;

    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                         "hwloc:base: get available cpus"));

    /* get the node-level information */
    root = hwloc_get_root_obj(topo);
    rdata = (opal_hwloc_topo_data_t*)root->userdata;
    /* bozo check */
    if (NULL == rdata) {
        rdata = OBJ_NEW(opal_hwloc_topo_data_t);
        root->userdata = (void*)rdata;
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:get_available_cpus first time - filtering cpus"));
    }
    /* ensure the topo-level cpuset was prepared */
    opal_hwloc_base_filter_cpus(topo);

    /* are we asking about the root object? */
    if (obj == root) {
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:get_available_cpus root object"));
        return rdata->available;
    }

    /* some hwloc object types don't have cpus */
    if (NULL == obj->online_cpuset || NULL == obj->allowed_cpuset) {
        return NULL;
    }

    /* see if we already have this info */
    if (NULL == (data = (opal_hwloc_obj_data_t*)obj->userdata)) {
        /* nope - create the object */
        data = OBJ_NEW(opal_hwloc_obj_data_t);
        obj->userdata = (void*)data;
    }

    /* do we have the cpuset */
    if (NULL != data->available) {
        return data->available;
    }

    /* find the available processors on this object */
    avail = hwloc_bitmap_alloc();
    hwloc_bitmap_and(avail, obj->online_cpuset, obj->allowed_cpuset);

    /* filter this against the node-available processors */
    if (NULL == rdata->available) {
        hwloc_bitmap_free(avail);
        return NULL;
    }
    specd = hwloc_bitmap_alloc();
    hwloc_bitmap_and(specd, avail, rdata->available);

    /* cache the info */
    data->available = specd;

    /* cleanup */
    hwloc_bitmap_free(avail);
    return specd;
}
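
/* depth-first search to count the number of CORE objects
 * at or below the given object
 */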
|
|
|
|
|
static void df_search_cores(hwloc_obj_t obj, unsigned int *cnt)
{
    unsigned k;

    if (HWLOC_OBJ_CORE == obj->type) {
        *cnt += 1;
        return;
    }

    for (k=0; k < obj->arity; k++) {
        df_search_cores(obj->children[k], cnt);
    }
    return;
}

/* determine if there is a single cpu in a bitmap */
bool opal_hwloc_base_single_cpu(hwloc_cpuset_t cpuset)
{
    int i;
    bool one=false;

    /* count the number of bits that are set - there is
     * one bit for each available pu. We could just
     * subtract the first and last indices, but there
     * may be "holes" in the bitmap corresponding to
     * offline or unallowed cpus - so we have to
     * search for them. Return false if we find anything
     * other than one
     */
    for (i=hwloc_bitmap_first(cpuset);
         i <= hwloc_bitmap_last(cpuset);
         i++) {
        if (hwloc_bitmap_isset(cpuset, i)) {
            if (one) {
                return false;
            }
            one = true;
        }
    }

    return one;
}

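/*
 * Illustrative usage sketch (editorial addition): testing whether a binding
 * covers exactly one pu, e.g. when deciding how to report a process binding.
 * "cs" is assumed to be a valid, already-populated hwloc_cpuset_t:
 *
 *     if (opal_hwloc_base_single_cpu(cs)) {
 *         (exactly one pu is set in the bitmap)
 *     }
 */
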
/* get the number of pu's under a given hwloc object */
unsigned int opal_hwloc_base_get_npus(hwloc_topology_t topo,
                                      hwloc_obj_t obj)
{
    opal_hwloc_obj_data_t *data;
    int i;
    unsigned int cnt = 0;
    hwloc_cpuset_t cpuset;

    /* if the object is a hwthread (i.e., HWLOC_OBJ_PU),
     * then the answer is always 1 since there isn't
     * anything underneath it
     */
    if (HWLOC_OBJ_PU == obj->type) {
        return 1;
    }

    /* if the object is a core (i.e., HWLOC_OBJ_CORE) and
     * we are NOT treating hwthreads as independent cpus,
     * then the answer is also 1 since we don't allow
     * you to use the underlying hwthreads as separate
     * entities
     */
    if (HWLOC_OBJ_CORE == obj->type &&
        !opal_hwloc_use_hwthreads_as_cpus) {
        return 1;
    }

    data = (opal_hwloc_obj_data_t*)obj->userdata;

    if (NULL == data || 0 == data->npus) {
        if (!opal_hwloc_use_hwthreads_as_cpus) {
            /* if we are treating cores as cpus, then we really
             * want to know how many cores are in this object.
             * hwloc sets a bit for each "pu", so we can't just
             * count bits in this case as there may be more than
             * one hwthread/core. Instead, find the number of cores
             * under this object
             *
             * NOTE: remember, hwloc can't find "cores" in all
             * environments. So first check to see if it found
             * "core" at all.
             */
            if (NULL != hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0)) {
                /* starting at the incoming obj, do a down-first search
                 * and count the number of cores under it
                 */
                cnt = 0;
                df_search_cores(obj, &cnt);
            }
        } else {
            /* if we are treating hwthreads as independent cpus, then
             * get the available cpuset for this object - this will
             * create and store the data
             */
            if (NULL == (cpuset = opal_hwloc_base_get_available_cpus(topo, obj))) {
                return 0;
            }
            /* count the number of bits that are set - there is
             * one bit for each available pu. We could just
             * subtract the first and last indices, but there
             * may be "holes" in the bitmap corresponding to
             * offline or unallowed cpus - so we have to
             * search for them
             */
            for (i=hwloc_bitmap_first(cpuset), cnt=0;
                 i <= hwloc_bitmap_last(cpuset);
                 i++) {
                if (hwloc_bitmap_isset(cpuset, i)) {
                    cnt++;
                }
            }
        }
        /* cache the info */
        if (NULL == data) {
            data = OBJ_NEW(opal_hwloc_obj_data_t);
            obj->userdata = (void*)data;
        }
        data->npus = cnt;
    }

    return data->npus;
}

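/*
 * Illustrative usage sketch (editorial addition): asking how many cpus are
 * usable under the first socket. Whether this counts cores or hwthreads
 * depends on opal_hwloc_use_hwthreads_as_cpus, as explained above. "topo"
 * is assumed to be an already-loaded topology that reports sockets:
 *
 *     hwloc_obj_t sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, 0);
 *     unsigned int n = (NULL == sock) ? 0 : opal_hwloc_base_get_npus(topo, sock);
 */
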
unsigned int opal_hwloc_base_get_obj_idx(hwloc_topology_t topo,
                                         hwloc_obj_t obj,
                                         opal_hwloc_resource_type_t rtype)
{
    unsigned cache_level=0;
    opal_hwloc_obj_data_t *data;
    hwloc_obj_t ptr;
    unsigned int nobjs, i;

    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                         "hwloc:base:get_idx"));

    /* see if we already have the info */
    data = (opal_hwloc_obj_data_t*)obj->userdata;

    if (NULL == data) {
        data = OBJ_NEW(opal_hwloc_obj_data_t);
        obj->userdata = (void*)data;
    }

    if (data->idx < UINT_MAX) {
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:get_idx already have data: %u",
                             data->idx));
        return data->idx;
    }

    /* determine the number of objects of this type */
    if (HWLOC_OBJ_CACHE == obj->type) {
        cache_level = obj->attr->cache.depth;
    }
    nobjs = opal_hwloc_base_get_nbobjs_by_type(topo, obj->type, cache_level, rtype);

    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                         "hwloc:base:get_idx found %u objects of type %s:%u",
                         nobjs, hwloc_obj_type_string(obj->type), cache_level));

    /* find this object */
    for (i=0; i < nobjs; i++) {
        ptr = opal_hwloc_base_get_obj_by_type(topo, obj->type, cache_level, i, rtype);
        if (ptr == obj) {
            data->idx = i;
            return i;
        }
    }
    /* if we get here, it wasn't found */
    opal_show_help("help-opal-hwloc-base.txt",
                   "obj-idx-failed", true,
                   hwloc_obj_type_string(obj->type), cache_level);
    return UINT_MAX;
}

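/*
 * Illustrative usage sketch (editorial addition): recovering the index of an
 * object among the AVAILABLE objects of its type, e.g. for round-robin
 * bookkeeping. UINT_MAX signals that the object could not be found (a help
 * message has already been emitted in that case):
 *
 *     unsigned int idx = opal_hwloc_base_get_obj_idx(topo, obj, OPAL_HWLOC_AVAILABLE);
 *     if (UINT_MAX != idx) {
 *         (idx is the object's position among available objects of its type)
 *     }
 */
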
/* hwloc treats cache objects as special
 * cases. Instead of having a unique type for each cache level,
 * there is a single cache object type, and the level is encoded
 * in an attribute union. So looking for cache objects involves
 * a multi-step test :-(
 *
 * And, of course, we make things even worse because we don't
 * always care about what is physically or logically present,
 * but rather what is available to us. For example, we don't
 * want to map or bind to a cpu that is offline, or one that
 * we aren't allowed to use by the OS. So we have to also filter
 * the search to avoid those objects that don't have any cpus
 * we can use :-((
 */
static hwloc_obj_t df_search(hwloc_topology_t topo,
                             hwloc_obj_t start,
                             hwloc_obj_type_t target,
                             unsigned cache_level,
                             unsigned int nobj,
                             opal_hwloc_resource_type_t rtype,
                             unsigned int *idx,
                             unsigned int *num_objs)
{
    unsigned k;
    hwloc_obj_t obj;
    opal_hwloc_obj_data_t *data;

    if (target == start->type) {
        if (HWLOC_OBJ_CACHE == start->type && cache_level != start->attr->cache.depth) {
            goto notfound;
        }
        if (OPAL_HWLOC_LOGICAL == rtype) {
            /* the hwloc tree is composed of LOGICAL objects, so the only
             * time we come here is when we are looking for logical caches
             */
            if (NULL != num_objs) {
                /* we are counting the number of caches at this level */
                *num_objs += 1;
            } else if (*idx == nobj) {
                /* found the specific instance of the cache level being sought */
                return start;
            }
            *idx += 1;
            return NULL;
        }
        if (OPAL_HWLOC_PHYSICAL == rtype) {
            /* the PHYSICAL object number is stored as the os_index. When
             * counting physical objects, we can't just count the number
             * that are in the hwloc tree as the only entries in the tree
             * are LOGICAL objects - i.e., any physical gaps won't show. So
             * we instead return the MAX os_index, as this is the best we
             * can do to tell you how many PHYSICAL objects are in the system.
             *
             * NOTE: if the last PHYSICAL object is not present (e.g., the last
             * socket on the node is empty), then the count we return will
             * be wrong!
             */
            if (NULL != num_objs) {
                /* we are counting the number of these objects */
                if (*num_objs < (unsigned int)start->os_index) {
                    *num_objs = (unsigned int)start->os_index;
                }
            } else if (*idx == nobj) {
                /* found the specific instance being sought */
                return start;
            }
            *idx += 1;
            return NULL;
        }
        if (OPAL_HWLOC_AVAILABLE == rtype) {
            /* check - do we already know the index of this object */
            data = (opal_hwloc_obj_data_t*)start->userdata;
            if (NULL == data) {
                data = OBJ_NEW(opal_hwloc_obj_data_t);
                start->userdata = (void*)data;
            }
            /* if we already know our location and it matches,
             * then we are good
             */
            if (UINT_MAX != data->idx && data->idx == nobj) {
                return start;
            }
            /* see if we already know our available cpuset */
            if (NULL == data->available) {
                data->available = opal_hwloc_base_get_available_cpus(topo, start);
            }
            if (NULL != data->available && !hwloc_bitmap_iszero(data->available)) {
                if (NULL != num_objs) {
                    *num_objs += 1;
                } else if (*idx == nobj) {
                    /* cache the location */
                    data->idx = *idx;
                    return start;
                }
                *idx += 1;
            }
            return NULL;
        }
        /* if it wasn't one of the above, then we are lost */
        return NULL;
    }

 notfound:
    for (k=0; k < start->arity; k++) {
        obj = df_search(topo, start->children[k], target, cache_level, nobj, rtype, idx, num_objs);
        if (NULL != obj) {
            return obj;
        }
    }

    return NULL;
}

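/*
 * Editorial note on df_search usage (summarizing the callers below, not new
 * behavior): the routine runs in two modes. When num_objs is non-NULL it
 * only counts matching objects and the return value is ignored (see
 * opal_hwloc_base_get_nbobjs_by_type); when num_objs is NULL it returns the
 * nobj-th matching instance (see opal_hwloc_base_get_obj_by_type). In both
 * modes *idx must be initialized to 0 by the caller.
 */
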
unsigned int opal_hwloc_base_get_nbobjs_by_type(hwloc_topology_t topo,
                                                hwloc_obj_type_t target,
                                                unsigned cache_level,
                                                opal_hwloc_resource_type_t rtype)
{
    unsigned int num_objs, idx;
    hwloc_obj_t obj;
    opal_list_item_t *item;
    opal_hwloc_summary_t *sum;
    opal_hwloc_topo_data_t *data;
    int rc;

    /* bozo check */
    if (NULL == topo) {
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:get_nbobjs NULL topology"));
        return 0;
    }

    /* if we want the number of LOGICAL objects, we can just
     * use the hwloc accessor to get it, unless it is a CACHE
     * as these are treated as special cases
     */
    if (OPAL_HWLOC_LOGICAL == rtype && HWLOC_OBJ_CACHE != target) {
        /* we should not get an error back, but just in case... */
        if (0 > (rc = hwloc_get_nbobjs_by_type(topo, target))) {
            opal_output(0, "UNKNOWN HWLOC ERROR");
            return 0;
        }
        return rc;
    }

    /* for everything else, we have to do some work */
    num_objs = 0;
    idx = 0;
    obj = hwloc_get_root_obj(topo);

    /* first see if the topology already has this summary */
    data = (opal_hwloc_topo_data_t*)obj->userdata;
    if (NULL == data) {
        data = OBJ_NEW(opal_hwloc_topo_data_t);
        obj->userdata = (void*)data;
    } else {
        for (item = opal_list_get_first(&data->summaries);
             item != opal_list_get_end(&data->summaries);
             item = opal_list_get_next(item)) {
            sum = (opal_hwloc_summary_t*)item;
            if (target == sum->type &&
                cache_level == sum->cache_level &&
                rtype == sum->rtype) {
                /* yep - return the value */
                OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                                     "hwloc:base:get_nbobjs pre-existing data %u of %s:%u",
                                     sum->num_objs, hwloc_obj_type_string(target), cache_level));
                return sum->num_objs;
            }
        }
    }

    /* don't already know it - go get it */
    df_search(topo, obj, target, cache_level, 0, rtype, &idx, &num_objs);

    /* cache the results for later */
    sum = OBJ_NEW(opal_hwloc_summary_t);
    sum->type = target;
    sum->cache_level = cache_level;
    sum->num_objs = num_objs;
    sum->rtype = rtype;
    opal_list_append(&data->summaries, &sum->super);

    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                         "hwloc:base:get_nbobjs computed data %u of %s:%u",
                         num_objs, hwloc_obj_type_string(target), cache_level));

    return num_objs;
}

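/*
 * Illustrative usage sketch (editorial addition): counting objects, including
 * the cache special case. The cache_level argument is only meaningful for
 * HWLOC_OBJ_CACHE (0 is conventionally passed for other types), and L2 caches
 * are assumed here to report a depth of 2:
 *
 *     unsigned int nsock = opal_hwloc_base_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET,
 *                                                             0, OPAL_HWLOC_AVAILABLE);
 *     unsigned int nl2 = opal_hwloc_base_get_nbobjs_by_type(topo, HWLOC_OBJ_CACHE,
 *                                                           2, OPAL_HWLOC_AVAILABLE);
 */
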
static hwloc_obj_t df_search_min_bound(hwloc_topology_t topo,
                                       hwloc_obj_t start,
                                       hwloc_obj_type_t target,
                                       unsigned cache_level,
                                       unsigned int *min_bound)
{
    unsigned k;
    hwloc_obj_t obj, save=NULL;
    opal_hwloc_obj_data_t *data;

    if (target == start->type) {
        if (HWLOC_OBJ_CACHE == start->type && cache_level != start->attr->cache.depth) {
            goto notfound;
        }
        /* see how many procs are bound to us */
        data = (opal_hwloc_obj_data_t*)start->userdata;
        if (NULL == data) {
            data = OBJ_NEW(opal_hwloc_obj_data_t);
            start->userdata = data;
        }
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:min_bound_under_obj object %s:%u nbound %u min %u",
                             hwloc_obj_type_string(target), start->logical_index,
                             data->num_bound, *min_bound));
        if (data->num_bound < *min_bound) {
            *min_bound = data->num_bound;
            return start;
        }
        /* if we have more procs bound to us than the min, return NULL */
        return NULL;
    }

 notfound:
    for (k=0; k < start->arity; k++) {
        obj = df_search_min_bound(topo, start->children[k], target, cache_level, min_bound);
        if (NULL != obj) {
            save = obj;
        }
        /* if the target level is HWTHREAD and we are NOT treating
         * hwthreads as separate cpus, then we can only consider
         * the 0th hwthread on a core
         */
        if (HWLOC_OBJ_CORE == start->type && HWLOC_OBJ_PU == target &&
            !opal_hwloc_use_hwthreads_as_cpus) {
            break;
        }
    }

    return save;
}

hwloc_obj_t opal_hwloc_base_find_min_bound_target_under_obj(hwloc_topology_t topo,
                                                            hwloc_obj_t obj,
                                                            hwloc_obj_type_t target,
                                                            unsigned cache_level)
{
    unsigned int min_bound;
    hwloc_obj_t loc;

    /* bozo check */
    if (NULL == topo || NULL == obj) {
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:find_min_bound_under_obj NULL %s",
                             (NULL == topo) ? "topology" : "object"));
        return NULL;
    }

    /* if the object and target are the same type, then there is
     * nothing under it, so just return itself
     */
    if (target == obj->type) {
        /* again, we have to treat caches differently as
         * the levels distinguish them
         */
        if (HWLOC_OBJ_CACHE == target &&
            cache_level < obj->attr->cache.depth) {
            goto moveon;
        }
        return obj;
    }

 moveon:
    /* the hwloc accessors all report at the topo level,
     * so we have to do some work
     */
    min_bound = UINT_MAX;

    loc = df_search_min_bound(topo, obj, target, cache_level, &min_bound);

    if (NULL != loc) {
        if (HWLOC_OBJ_CACHE == target) {
            OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                                 "hwloc:base:min_bound_under_obj found min bound of %u on %s:%u:%u",
                                 min_bound, hwloc_obj_type_string(target),
                                 cache_level, loc->logical_index));
        } else {
            OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                                 "hwloc:base:min_bound_under_obj found min bound of %u on %s:%u",
                                 min_bound, hwloc_obj_type_string(target), loc->logical_index));
        }
    }

    return loc;
}

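/*
 * Illustrative usage sketch (editorial addition): picking the least-loaded
 * core under a given object (e.g. a NUMA node located earlier) when spreading
 * processes. The num_bound counts consulted here are maintained by the
 * mapping code, so this is only meaningful during a mapping pass:
 *
 *     hwloc_obj_t core = opal_hwloc_base_find_min_bound_target_under_obj(topo, numa,
 *                                                                        HWLOC_OBJ_CORE, 0);
 *     if (NULL != core) {
 *         (bind the next process somewhere inside core's cpuset)
 *     }
 */
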
/* as above, only return the Nth instance of the specified object
 * type from inside the topology
 */
hwloc_obj_t opal_hwloc_base_get_obj_by_type(hwloc_topology_t topo,
                                            hwloc_obj_type_t target,
                                            unsigned cache_level,
                                            unsigned int instance,
                                            opal_hwloc_resource_type_t rtype)
{
    unsigned int idx;
    hwloc_obj_t obj;

    /* bozo check */
    if (NULL == topo) {
        return NULL;
    }

    /* if we want the nth LOGICAL object, we can just
     * use the hwloc accessor to get it, unless it is a CACHE
     * as these are treated as special cases
     */
    if (OPAL_HWLOC_LOGICAL == rtype && HWLOC_OBJ_CACHE != target) {
        return hwloc_get_obj_by_type(topo, target, instance);
    }

    /* for everything else, we have to do some work */
    idx = 0;
    obj = hwloc_get_root_obj(topo);
    return df_search(topo, obj, target, cache_level, instance, rtype, &idx, NULL);
}

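/*
 * Illustrative usage sketch (editorial addition): fetching the nth AVAILABLE
 * core, i.e. skipping cores whose pus are all offline or disallowed. "n" is
 * assumed to be less than the count returned by
 * opal_hwloc_base_get_nbobjs_by_type for the same arguments:
 *
 *     hwloc_obj_t core = opal_hwloc_base_get_obj_by_type(topo, HWLOC_OBJ_CORE,
 *                                                        0, n, OPAL_HWLOC_AVAILABLE);
 */
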
static void df_clear(hwloc_topology_t topo,
                     hwloc_obj_t start)
{
    unsigned k;
    opal_hwloc_obj_data_t *data;

    /* reset the count of procs bound to this object */
    data = (opal_hwloc_obj_data_t*)start->userdata;
    if (NULL != data) {
        data->num_bound = 0;
    }

    for (k=0; k < start->arity; k++) {
        df_clear(topo, start->children[k]);
    }
}

void opal_hwloc_base_clear_usage(hwloc_topology_t topo)
{
    hwloc_obj_t root;
    unsigned k;

    /* bozo check */
    if (NULL == topo) {
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:clear_usage: NULL topology"));
        return;
    }

    root = hwloc_get_root_obj(topo);
    /* must not start at root as the root object has
     * a different userdata attached to it
     */
    for (k=0; k < root->arity; k++) {
        df_clear(topo, root->children[k]);
    }
}

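/*
 * Illustrative usage sketch (editorial addition): a mapper would typically
 * reset the per-object usage counts before laying out a new job, assuming
 * "topo" is the hwloc_topology_t being mapped against:
 *
 *     opal_hwloc_base_clear_usage(topo);
 */
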
/* The current slot_list notation only goes to the core level - i.e., the location
|
|
|
|
* is specified as socket:core. Thus, the code below assumes that all locations
|
|
|
|
* are to be parsed under that notation.
|
|
|
|
*/
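/* Illustration (hypothetical examples, not taken from the original comments):
* under this notation a single location might look like
*
*     "0:2"      socket 0, core 2
*     "1:0-3"    socket 1, cores 0 through 3
*     "0:*"      every available core on socket 0
*
* The exact accepted syntax is defined by the parsing routines below.
*/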
|
2011-10-20 00:18:14 +04:00
|
|
|
|
2011-11-15 07:40:11 +04:00
|
|
|
static int socket_to_cpu_set(char *cpus,
|
|
|
|
hwloc_topology_t topo,
|
|
|
|
hwloc_bitmap_t cpumask)
|
|
|
|
{
|
|
|
|
char **range;
|
|
|
|
int range_cnt;
|
|
|
|
int lower_range, upper_range;
|
|
|
|
int socket_id;
|
|
|
|
hwloc_obj_t obj;
|
2013-01-25 22:33:25 +04:00
|
|
|
hwloc_bitmap_t res;
|
2011-10-20 00:18:14 +04:00
|
|
|
|
2011-11-15 07:40:11 +04:00
|
|
|
if ('*' == cpus[0]) {
|
|
|
|
/* requesting cpumask for ALL sockets */
|
|
|
|
obj = hwloc_get_root_obj(topo);
|
|
|
|
/* set to all available logical processors - essentially,
|
|
|
|
* this specification equates to unbound
|
|
|
|
*/
|
|
|
|
res = opal_hwloc_base_get_available_cpus(topo, obj);
|
2013-01-25 22:33:25 +04:00
|
|
|
hwloc_bitmap_or(cpumask, cpumask, res);
|
2011-11-15 07:40:11 +04:00
|
|
|
return OPAL_SUCCESS;
|
2011-10-20 00:18:14 +04:00
|
|
|
}
|
|
|
|
|
2011-11-15 07:40:11 +04:00
|
|
|
range = opal_argv_split(cpus,'-');
|
|
|
|
range_cnt = opal_argv_count(range);
|
|
|
|
switch (range_cnt) {
|
|
|
|
case 1: /* no range was present, so just one socket given */
|
|
|
|
socket_id = atoi(range[0]);
|
|
|
|
obj = opal_hwloc_base_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, 0, socket_id, OPAL_HWLOC_LOGICAL);
if (NULL == obj) {
/* no socket with that logical id - bail out instead of dereferencing NULL */
opal_argv_free(range);
return OPAL_ERR_NOT_FOUND;
}
|
|
|
|
/* get the available logical cpus for this socket */
|
|
|
|
res = opal_hwloc_base_get_available_cpus(topo, obj);
|
2013-01-25 22:33:25 +04:00
|
|
|
hwloc_bitmap_or(cpumask, cpumask, res);
|
2011-11-15 07:40:11 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
case 2: /* range of sockets was given */
|
|
|
|
lower_range = atoi(range[0]);
|
|
|
|
upper_range = atoi(range[1]);
|
|
|
|
/* cycle across the range of sockets */
|
|
|
|
for (socket_id=lower_range; socket_id<=upper_range; socket_id++) {
|
|
|
|
obj = opal_hwloc_base_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, 0, socket_id, OPAL_HWLOC_LOGICAL);
if (NULL == obj) {
/* no socket with that logical id - bail out instead of dereferencing NULL */
opal_argv_free(range);
return OPAL_ERR_NOT_FOUND;
}
|
|
|
|
/* get the available logical cpus for this socket */
|
|
|
|
res = opal_hwloc_base_get_available_cpus(topo, obj);
|
|
|
|
/* set the corresponding bits in the bitmask */
|
2013-01-25 22:33:25 +04:00
|
|
|
hwloc_bitmap_or(cpumask, cpumask, res);
|
2011-11-15 07:40:11 +04:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
opal_argv_free(range);
|
|
|
|
return OPAL_ERROR;
|
2011-10-20 00:18:14 +04:00
|
|
|
}
|
2011-11-15 07:40:11 +04:00
|
|
|
opal_argv_free(range);
|
|
|
|
|
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
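/* Summary of the inputs accepted by socket_to_cpu_set() above (descriptive
* note added for clarity, not in the original comments): the "cpus" string may
* be "*" (all sockets, i.e. effectively unbound), a single logical socket id
* such as "2", or a range such as "0-3"; a string that splits into more than
* two '-'-separated pieces falls through to the default case and yields
* OPAL_ERROR. The matching available cpus are OR'd into the caller's cpumask.
*/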
|
|
|
|
|
|
|
|
static int socket_core_to_cpu_set(char *socket_core_list,
|
|
|
|
hwloc_topology_t topo,
|
|
|
|
hwloc_bitmap_t cpumask)
|
|
|
|
{
|
2013-09-08 06:04:29 +04:00
|
|
|
int rc=OPAL_SUCCESS, i, j;
|
2012-02-01 05:50:05 +04:00
|
|
|
char **socket_core, *corestr;
|
2013-09-08 06:04:29 +04:00
|
|
|
char **range, **list;
|
2011-11-15 07:40:11 +04:00
|
|
|
int range_cnt;
|
|
|
|
int lower_range, upper_range;
|
|
|
|
int socket_id, core_id;
|
|
|
|
hwloc_obj_t socket, core;
|
2013-01-25 22:33:25 +04:00
|
|
|
hwloc_cpuset_t res;
|
2011-11-15 07:40:11 +04:00
|
|
|
unsigned int idx;
|
|
|
|
hwloc_obj_type_t obj_type = HWLOC_OBJ_CORE;
|
|
|
|
|
|
|
|
socket_core = opal_argv_split(socket_core_list, ':');
|
|
|
|
socket_id = atoi(socket_core[0]);
|
|
|
|
|
|
|
|
/* get the object for this socket id */
|
|
|
|
if (NULL == (socket = opal_hwloc_base_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, 0,
|
|
|
|
socket_id, OPAL_HWLOC_LOGICAL))) {
|
|
|
|
opal_argv_free(socket_core);
|
|
|
|
return OPAL_ERR_NOT_FOUND;
|
2011-10-20 00:18:14 +04:00
|
|
|
}
|
2011-11-15 07:40:11 +04:00
|
|
|
|
|
|
|
/* as described in the comment near the top of this file, hwloc isn't able
|
2012-02-01 05:50:05 +04:00
|
|
|
* to find cores on all platforms. Adjust the type here if
|
2011-11-15 07:40:11 +04:00
|
|
|
* required
|
2011-10-20 00:18:14 +04:00
|
|
|
*/
|
2011-11-15 07:40:11 +04:00
|
|
|
if (NULL == hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0)) {
|
|
|
|
obj_type = HWLOC_OBJ_PU;
|
|
|
|
}
|
|
|
|
|
2013-02-13 17:06:03 +04:00
|
|
|
for (i=1; NULL != socket_core[i]; i++) {
|
2011-11-15 07:40:11 +04:00
|
|
|
if ('C' == socket_core[i][0] ||
|
|
|
|
'c' == socket_core[i][0]) {
|
2012-02-01 05:50:05 +04:00
|
|
|
corestr = &socket_core[i][1];
|
|
|
|
} else {
|
|
|
|
corestr = socket_core[i];
|
|
|
|
}
|
|
|
|
if ('*' == corestr[0]) {
|
|
|
|
/* set to all available logical cpus on this socket */
|
|
|
|
res = opal_hwloc_base_get_available_cpus(topo, socket);
|
2013-01-25 22:33:25 +04:00
|
|
|
hwloc_bitmap_or(cpumask, cpumask, res);
|
2012-02-01 05:50:05 +04:00
|
|
|
/* we are done - already assigned all cores! */
|
|
|
|
rc = OPAL_SUCCESS;
|
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
range = opal_argv_split(corestr, '-');
|
|
|
|
range_cnt = opal_argv_count(range);
|
|
|
|
/* see if a range was set or not */
|
|
|
|
switch (range_cnt) {
|
2013-09-08 06:04:29 +04:00
|
|
|
case 1: /* only one core, or a list of cores, specified */
|
|
|
|
list = opal_argv_split(range[0], ',');
|
|
|
|
for (j=0; NULL != list[j]; j++) {
|
|
|
|
core_id = atoi(list[j]);
|
|
|
|
/* get that object */
|
|
|
|
idx = 0;
|
|
|
|
if (NULL == (core = df_search(topo, socket, obj_type, 0,
|
|
|
|
core_id, OPAL_HWLOC_AVAILABLE,
|
|
|
|
&idx, NULL))) {
|
|
|
|
/* free the argv arrays before returning so we don't leak them */
opal_argv_free(list);
opal_argv_free(range);
opal_argv_free(socket_core);
return OPAL_ERR_NOT_FOUND;
|
|
|
|
}
|
|
|
|
/* get the cpus */
|
|
|
|
res = opal_hwloc_base_get_available_cpus(topo, core);
|
|
|
|
hwloc_bitmap_or(cpumask, cpumask, res);
|
2012-02-01 05:50:05 +04:00
|
|
|
}
|
2013-09-08 06:04:29 +04:00
|
|
|
opal_argv_free(list);
|
2011-11-15 07:40:11 +04:00
|
|
|
break;
|
2012-02-01 05:50:05 +04:00
|
|
|
|
|
|
|
case 2: /* range of core ids was given */
|
2013-03-28 01:11:47 +04:00
|
|
|
opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
|
2013-01-25 22:33:25 +04:00
|
|
|
"range of cores given: start %s stop %s",
|
|
|
|
range[0], range[1]);
|
2012-02-01 05:50:05 +04:00
|
|
|
lower_range = atoi(range[0]);
|
|
|
|
upper_range = atoi(range[1]);
|
|
|
|
for (core_id=lower_range; core_id <= upper_range; core_id++) {
|
2011-11-15 07:40:11 +04:00
|
|
|
/* get that object */
|
|
|
|
idx = 0;
|
2012-03-23 18:50:41 +04:00
|
|
|
if (NULL == (core = df_search(topo, socket, obj_type, 0,
|
2011-11-15 07:40:11 +04:00
|
|
|
core_id, OPAL_HWLOC_AVAILABLE,
|
|
|
|
&idx, NULL))) {
|
|
|
|
/* free the argv arrays before returning so we don't leak them */
opal_argv_free(range);
opal_argv_free(socket_core);
return OPAL_ERR_NOT_FOUND;
|
|
|
|
}
|
|
|
|
/* get the cpus */
|
|
|
|
res = opal_hwloc_base_get_available_cpus(topo, core);
|
2012-02-01 05:50:05 +04:00
|
|
|
/* add them into the result */
|
2013-01-25 22:33:25 +04:00
|
|
|
hwloc_bitmap_or(cpumask, cpumask, res);
|
2011-11-15 07:40:11 +04:00
|
|
|
}
|
2012-02-01 05:50:05 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
2011-11-15 07:40:11 +04:00
|
|
|
opal_argv_free(range);
|
2012-02-01 05:50:05 +04:00
|
|
|
opal_argv_free(socket_core);
|
|
|
|
return OPAL_ERROR;
|
2011-11-15 07:40:11 +04:00
|
|
|
}
|
2012-02-01 05:50:05 +04:00
|
|
|
opal_argv_free(range);
|
2011-11-15 07:40:11 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
opal_argv_free(socket_core);
|
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
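/* Summary of the inputs accepted by socket_core_to_cpu_set() above
* (descriptive note added for clarity, not in the original comments): the
* string is a ':'-separated list whose first field is the logical socket id
* and whose remaining fields name cores on that socket - e.g. "0:2", "1:C0-3",
* "0:0,2,4", or "0:*" for every core on the socket. On platforms where hwloc
* cannot detect core objects, the core ids are interpreted as PUs instead.
*/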
|
|
|
|
|
|
|
|
int opal_hwloc_base_slot_list_parse(const char *slot_str,
|
|
|
|
hwloc_topology_t topo,
|
|
|
|
hwloc_cpuset_t cpumask)
|
|
|
|
{
|
|
|
|
char **item;
|
2013-09-08 06:04:29 +04:00
|
|
|
int rc, i, j;
|
2012-01-27 16:21:45 +04:00
|
|
|
hwloc_obj_t pu;
|
|
|
|
hwloc_cpuset_t pucpus;
|
2013-09-08 06:04:29 +04:00
|
|
|
char **range, **list;
|
2012-01-27 16:21:45 +04:00
|
|
|
size_t range_cnt;
|
|
|
|
int core_id, lower_range, upper_range;
|
|
|
|
|
2011-11-15 07:40:11 +04:00
|
|
|
/* bozo checks */
|
|
|
|
if (NULL == opal_hwloc_topology) {
|
|
|
|
return OPAL_ERR_NOT_SUPPORTED;
|
|
|
|
}
|
|
|
|
if (NULL == slot_str || 0 == strlen(slot_str)) {
|
|
|
|
return OPAL_ERR_BAD_PARAM;
|
|
|
|
}
|
|
|
|
|
2013-03-28 01:11:47 +04:00
|
|
|
opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
|
2011-11-15 07:40:11 +04:00
|
|
|
"slot assignment: slot_list == %s",
|
|
|
|
slot_str);
|
|
|
|
|
|
|
|
/* split at ';' */
|
2013-01-25 22:33:25 +04:00
|
|
|
item = opal_argv_split(slot_str, ';');
|
2011-11-15 07:40:11 +04:00
|
|
|
|
|
|
|
/* start with a clean mask */
|
|
|
|
hwloc_bitmap_zero(cpumask);
|
|
|
|
/* loop across the items and accumulate the mask */
|
|
|
|
for (i=0; NULL != item[i]; i++) {
|
2013-03-28 01:11:47 +04:00
|
|
|
opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
|
2013-01-25 22:33:25 +04:00
|
|
|
"working assignment %s",
|
|
|
|
item[i]);
|
2012-02-01 05:50:05 +04:00
|
|
|
/* if they specified "socket" by starting with an S/s,
|
|
|
|
* or if they use socket:core notation, then parse the
|
|
|
|
* socket/core info
|
|
|
|
*/
|
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 07:40:11 +04:00
|
|
|
if ('S' == item[i][0] ||
|
2012-02-01 05:50:05 +04:00
|
|
|
's' == item[i][0] ||
|
|
|
|
NULL != strchr(item[i], ':')) {
|
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 07:40:11 +04:00
|
|
|
/* specified a socket */
|
|
|
|
if (NULL == strchr(item[i], ':')) {
|
|
|
|
/* binding just to the socket level, though
|
|
|
|
* it could specify multiple sockets
|
|
|
|
*/
|
|
|
|
if (OPAL_SUCCESS != (rc = socket_to_cpu_set(&item[i][1], /* skip the 'S' */
|
|
|
|
topo, cpumask))) {
|
|
|
|
opal_argv_free(item);
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/* binding to a socket/whatever specification */
|
2012-02-01 05:50:05 +04:00
|
|
|
if ('S' == item[i][0] ||
|
|
|
|
's' == item[i][0]) {
|
|
|
|
if (OPAL_SUCCESS != (rc = socket_core_to_cpu_set(&item[i][1], /* skip the 'S' */
|
|
|
|
topo, cpumask))) {
|
|
|
|
opal_argv_free(item);
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
if (OPAL_SUCCESS != (rc = socket_core_to_cpu_set(item[i],
|
|
|
|
topo, cpumask))) {
|
|
|
|
opal_argv_free(item);
|
|
|
|
return rc;
|
|
|
|
}
|
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 07:40:11 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
} else {
|
2012-01-27 16:21:45 +04:00
|
|
|
/* just a core specification - see if one or a range was given */
|
|
|
|
range = opal_argv_split(item[i], '-');
|
|
|
|
range_cnt = opal_argv_count(range);
|
|
|
|
/* see if a range was set or not */
|
|
|
|
switch (range_cnt) {
|
2013-09-08 06:04:29 +04:00
|
|
|
case 1: /* only one core, or a list of cores, specified */
|
|
|
|
list = opal_argv_split(range[0], ',');
|
|
|
|
for (j=0; NULL != list[j]; j++) {
|
|
|
|
core_id = atoi(list[j]);
|
|
|
|
/* find the specified logical available cpu */
|
|
|
|
if (NULL == (pu = get_pu(topo, core_id))) {
|
|
|
|
opal_argv_free(range);
|
|
|
|
opal_argv_free(item);
|
|
|
|
return OPAL_ERROR;
|
|
|
|
}
|
|
|
|
/* get the available cpus for that object */
|
|
|
|
pucpus = opal_hwloc_base_get_available_cpus(topo, pu);
|
|
|
|
/* set that in the mask */
|
|
|
|
hwloc_bitmap_or(cpumask, cpumask, pucpus);
|
2012-01-27 16:21:45 +04:00
|
|
|
}
|
2013-09-08 06:04:29 +04:00
|
|
|
opal_argv_free(list);
|
2012-01-27 16:21:45 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
case 2: /* range of core id's was given */
|
|
|
|
lower_range = atoi(range[0]);
|
|
|
|
upper_range = atoi(range[1]);
|
|
|
|
for (core_id=lower_range; core_id <= upper_range; core_id++) {
|
|
|
|
/* find the specified logical available cpu */
|
|
|
|
if (NULL == (pu = get_pu(topo, core_id))) {
|
|
|
|
opal_argv_free(range);
|
|
|
|
opal_argv_free(item);
|
|
|
|
return OPAL_ERROR;
|
|
|
|
}
|
|
|
|
/* get the available cpus for that object */
|
|
|
|
pucpus = opal_hwloc_base_get_available_cpus(topo, pu);
|
|
|
|
/* set that in the mask */
|
|
|
|
hwloc_bitmap_or(cpumask, cpumask, pucpus);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
opal_argv_free(range);
|
|
|
|
opal_argv_free(item);
|
|
|
|
return OPAL_ERROR;
|
|
|
|
}
|
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 07:40:11 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
opal_argv_free(item);
|
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
|
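The slot-list grammar handled above is small: items are separated by ';', an item beginning with 'S'/'s' or containing ':' names a socket (optionally socket:core), and anything else is a single core, a comma-separated list of cores, or a lo-hi range. The standalone sketch below only illustrates that splitting scheme with plain libc calls; it is not the OMPI routine, and the example slot string and printed output are made up for illustration.

/* Illustrative standalone sketch of the slot-list splitting scheme.
 * Plain libc calls stand in for the opal_argv helpers; nothing here
 * is OMPI API.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* print the core ids named by one item: "N", "N,M,...", or "LO-HI" */
static void parse_cores(const char *spec)
{
    const char *dash = strchr(spec, '-');

    if (NULL != dash) {
        /* a lo-hi range of core ids */
        int lo = atoi(spec);
        int hi = atoi(dash + 1);
        for (int id = lo; id <= hi; id++) {
            printf("  core %d\n", id);
        }
        return;
    }
    /* a single core or a comma-separated list of cores */
    while (NULL != spec && '\0' != *spec) {
        printf("  core %d\n", atoi(spec));
        spec = strchr(spec, ',');
        if (NULL != spec) {
            spec++;   /* step past the comma */
        }
    }
}

int main(void)
{
    char slot_str[] = "S0;3,5;7-9";   /* made-up example slot list */
    char *item;

    /* split the list at ';', mirroring the split-on-';' step above */
    for (item = strtok(slot_str, ";"); NULL != item;
         item = strtok(NULL, ";")) {
        if ('S' == item[0] || 's' == item[0] || NULL != strchr(item, ':')) {
            printf("socket/core specification: %s\n", item);
        } else {
            printf("core specification: %s\n", item);
            parse_cores(item);
        }
    }
    return 0;
}
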
opal_hwloc_locality_t opal_hwloc_base_get_relative_locality(hwloc_topology_t topo,
                                                            char *cpuset1, char *cpuset2)
{
    opal_hwloc_locality_t locality;
    hwloc_obj_t obj;
    unsigned depth, d, width, w;
    hwloc_cpuset_t avail;
    bool shared;
    hwloc_obj_type_t type;
    int sect1, sect2;
    hwloc_cpuset_t loc1, loc2;

    /* start with what we know - they share a node on a cluster
     * NOTE: we may alter that latter part as hwloc's ability to
     * sense multi-cu, multi-cluster systems grows
     */
    locality = OPAL_PROC_ON_NODE;

    /* if either cpuset is NULL, then that isn't bound */
    if (NULL == cpuset1 || NULL == cpuset2) {
        return locality;
    }

    /* get the max depth of the topology */
    depth = hwloc_topology_get_depth(topo);

    /* convert the strings to cpusets */
    loc1 = hwloc_bitmap_alloc();
    hwloc_bitmap_list_sscanf(loc1, cpuset1);
    loc2 = hwloc_bitmap_alloc();
    hwloc_bitmap_list_sscanf(loc2, cpuset2);

    /* start at the first depth below the top machine level */
    for (d=1; d < depth; d++) {
        shared = false;
        /* get the object type at this depth */
        type = hwloc_get_depth_type(topo, d);
        /* if it isn't one of interest, then ignore it */
        if (HWLOC_OBJ_NODE != type &&
            HWLOC_OBJ_SOCKET != type &&
            HWLOC_OBJ_CACHE != type &&
            HWLOC_OBJ_CORE != type &&
            HWLOC_OBJ_PU != type) {
            continue;
        }
        /* get the width of the topology at this depth */
        width = hwloc_get_nbobjs_by_depth(topo, d);

        /* scan all objects at this depth to see if
         * our locations overlap with them
         */
        for (w=0; w < width; w++) {
            /* get the object at this depth/index */
            obj = hwloc_get_obj_by_depth(topo, d, w);
            /* get the available cpuset for this obj */
            avail = opal_hwloc_base_get_available_cpus(topo, obj);
            /* see if our locations intersect with it */
            sect1 = hwloc_bitmap_intersects(avail, loc1);
            sect2 = hwloc_bitmap_intersects(avail, loc2);
            /* if both intersect, then we share this level */
            if (sect1 && sect2) {
                shared = true;
                switch(obj->type) {
                case HWLOC_OBJ_NODE:
                    locality = OPAL_PROC_ON_NUMA;
                    break;
                case HWLOC_OBJ_SOCKET:
                    locality = OPAL_PROC_ON_SOCKET;
                    break;
                case HWLOC_OBJ_CACHE:
                    if (3 == obj->attr->cache.depth) {
                        locality = OPAL_PROC_ON_L3CACHE;
                    } else if (2 == obj->attr->cache.depth) {
                        locality = OPAL_PROC_ON_L2CACHE;
                    } else {
                        locality = OPAL_PROC_ON_L1CACHE;
                    }
                    break;
                case HWLOC_OBJ_CORE:
                    locality = OPAL_PROC_ON_CORE;
                    break;
                case HWLOC_OBJ_PU:
                    locality = OPAL_PROC_ON_HWTHREAD;
                    break;
                default:
                    /* just ignore it */
                    break;
                }
                break;
            }
            /* otherwise, we don't share this object - but we still
             * might share another object on this level, so we have
             * to keep searching
             */
        }
        /* if we spanned the entire width without finding
         * a point of intersection, then no need to go deeper
         */
        if (!shared) {
            break;
        }
    }

    opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
                        "locality: %s",
                        opal_hwloc_base_print_locality(locality));

    hwloc_bitmap_free(loc1);
    hwloc_bitmap_free(loc2);

    return locality;
}

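The locality walk above ultimately reduces to parsing hwloc list-format cpuset strings and testing bitmap intersection. Here is a minimal, self-contained sketch of just that mechanism, assuming only the public hwloc bitmap API; the two cpuset strings are invented examples, not values produced by this code.

/* Minimal sketch of the bitmap machinery behind the locality check:
 * parse two list-format cpuset strings and test whether they overlap.
 */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_bitmap_t a = hwloc_bitmap_alloc();
    hwloc_bitmap_t b = hwloc_bitmap_alloc();

    hwloc_bitmap_list_sscanf(a, "0-3");    /* e.g. one proc bound to PUs 0-3 */
    hwloc_bitmap_list_sscanf(b, "2,6-7");  /* e.g. another bound to PUs 2,6,7 */

    if (hwloc_bitmap_intersects(a, b)) {
        printf("the two cpusets overlap\n");
    } else {
        printf("no overlap\n");
    }

    hwloc_bitmap_free(a);
    hwloc_bitmap_free(b);
    return 0;
}
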
****************************************************************
This change contains a non-mandatory modification
of the MPI-RTE interface. Anyone wishing to support
coprocessors such as the Xeon Phi may wish to add
the required definition and underlying support
****************************************************************
Add locality support for coprocessors such as the Intel Xeon Phi.
Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.
So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:
1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board
2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions
3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.
4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.
5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTEs that choose not to provide this support do not have to do anything - the associated code will simply be ignored.
6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.
cmr:v1.7.4:reviewer=hjelmn
This commit was SVN r29435.
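To make item 2 above concrete, here is a minimal sketch of the combined check, assuming the locality flags are bits in a single bitmask (names taken from the list above; the actual bit values and header location are not shown here):

/* sketch only: "same node" now means same host AND same board */
#define OPAL_PROC_ON_LOCAL_NODE(n) \
    (((n) & OPAL_PROC_ON_HOST) && ((n) & OPAL_PROC_ON_NODE))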
/* searches the given topology for coprocessor objects and returns
 * their serial numbers as a comma-delimited string, or NULL
 * if no coprocessors are found
 */
char* opal_hwloc_base_find_coprocessors(hwloc_topology_t topo)
{
    hwloc_obj_t osdev;
    unsigned i;
    char **cps = NULL;
    char *cpstring = NULL;
    int depth;

    /* coprocessors are recorded under OS_DEVICEs, so first
     * see if we have any of those
     */
    if (HWLOC_TYPE_DEPTH_UNKNOWN == (depth = hwloc_get_type_depth(topo, HWLOC_OBJ_OS_DEVICE))) {
        OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                             "hwloc:base:find_coprocessors: NONE FOUND IN TOPO"));
        return NULL;
    }
    /* check the device objects for coprocessors */
    osdev = hwloc_get_obj_by_depth(topo, depth, 0);
    while (NULL != osdev) {
        if (HWLOC_OBJ_OSDEV_COPROC == osdev->attr->osdev.type) {
            /* got one! find and save its serial number */
            for (i=0; i < osdev->infos_count; i++) {
                if (0 == strncmp(osdev->infos[i].name, "MICSerialNumber", strlen("MICSerialNumber"))) {
                    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                                         "hwloc:base:find_coprocessors: coprocessor %s found",
                                         osdev->infos[i].value));
                    opal_argv_append_nosize(&cps, osdev->infos[i].value);
                }
            }
        }
        osdev = osdev->next_cousin;
    }
    if (NULL != cps) {
        cpstring = opal_argv_join(cps, ',');
        opal_argv_free(cps);
    }
    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                         "hwloc:base:find_coprocessors: hosting coprocessors %s",
                         (NULL == cpstring) ? "NONE" : cpstring));
    return cpstring;
}

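As a usage illustration only, here is a standalone-style sketch (not from the commit) that loads a topology with I/O discovery enabled and prints whatever serial numbers the routine above reports; the topology calls are plain hwloc 1.x API, and only opal_hwloc_base_find_coprocessors() comes from this file.

/* Hypothetical caller: enable I/O discovery so that OS device objects are
 * present, then ask for coprocessor serial numbers. */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

static void show_coprocessors(void)
{
    hwloc_topology_t topo;
    char *serials;

    hwloc_topology_init(&topo);
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES); /* hwloc 1.x flag */
    hwloc_topology_load(topo);

    serials = opal_hwloc_base_find_coprocessors(topo);
    printf("coprocessors: %s\n", (NULL == serials) ? "none" : serials);

    free(serials); /* the joined string is malloc'd, so the caller frees it */
    hwloc_topology_destroy(topo);
}
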
#define OPAL_HWLOC_MAX_ELOG_LINE 1024

static char *hwloc_getline(FILE *fp)
{
    char *ret, *buff;
    char input[OPAL_HWLOC_MAX_ELOG_LINE];
    size_t len;

    ret = fgets(input, OPAL_HWLOC_MAX_ELOG_LINE, fp);
    if (NULL != ret) {
        /* remove the trailing newline, if one was read */
        len = strlen(input);
        if (0 < len && '\n' == input[len-1]) {
            input[len-1] = '\0';
        }
        buff = strdup(input);
        return buff;
    }

    return NULL;
}

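A hypothetical use of the helper above: dump a small text file line by line. Each returned line is strdup'd, so the caller owns and frees it.

FILE *fp = fopen("/proc/elog", "r");
char *line;

if (NULL != fp) {
    while (NULL != (line = hwloc_getline(fp))) {
        opal_output(0, "%s", line);
        free(line);
    }
    fclose(fp);
}
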
/* checks local environment to determine if this process
 * is on a coprocessor - if so, it returns the serial number
 * as a string, or NULL if it isn't on a coprocessor
 */
char* opal_hwloc_base_check_on_coprocessor(void)
{
    /* this support currently is limited to Intel Phi processors
     * but will hopefully be extended as we get better, more
     * generalized ways of identifying coprocessors
     */
    FILE *fp;
    char *t, *cptr, *e, *cp=NULL;

    if (OPAL_SUCCESS != opal_os_dirpath_access("/proc/elog", S_IRUSR)) {
        /* if the file isn't there, or we don't have permission
         * to read it, then we are not on a coprocessor so far
         * as we can tell
         */
        return NULL;
    }
    if (NULL == (fp = fopen("/proc/elog", "r"))) {
        /* nothing we can do */
        return NULL;
    }
    /* look for the line containing the serial number of this
     * card - usually the first line in the file
     */
    while (NULL != (cptr = hwloc_getline(fp))) {
        if (NULL != (t = strstr(cptr, "Card"))) {
            /* we want the string right after this - delimited by
             * a colon at the end
             */
            t += 5;  // move past "Card "
            if (NULL == (e = strchr(t, ':'))) {
                /* not what we were expecting */
                free(cptr);
                continue;
            }
            *e = '\0';
            cp = strdup(t);
            free(cptr);
            break;
        }
        free(cptr);
    }
    fclose(fp);
    OPAL_OUTPUT_VERBOSE((5, opal_hwloc_base_framework.framework_output,
                         "hwloc:base:check_coprocessor: on coprocessor %s",
                         (NULL == cp) ? "NONE" : cp));
    return cp;
}

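A hedged sketch of how a daemon might combine the two routines - report its own serial number if it is running on a coprocessor, plus the serial numbers of any coprocessors it hosts. opal_hwloc_topology is the framework's global topology handle; the wiring into the orted callback message is omitted.

static void report_coprocessor_locality(void)
{
    char *my_serial = opal_hwloc_base_check_on_coprocessor();
    char *hosted = opal_hwloc_base_find_coprocessors(opal_hwloc_topology);

    if (NULL != my_serial) {
        opal_output(0, "daemon is on coprocessor %s", my_serial);
    }
    if (NULL != hosted) {
        opal_output(0, "host carries coprocessors %s", hosted);
    }
    free(my_serial);
    free(hosted);
}
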
char* opal_hwloc_base_print_binding(opal_binding_policy_t binding)
{
    char *ret, *bind;
    opal_hwloc_print_buffers_t *ptr;

    switch(OPAL_GET_BINDING_POLICY(binding)) {
    case OPAL_BIND_TO_NONE:
        bind = "NONE";
        break;
    case OPAL_BIND_TO_BOARD:
        bind = "BOARD";
        break;
    case OPAL_BIND_TO_NUMA:
        bind = "NUMA";
        break;
    case OPAL_BIND_TO_SOCKET:
        bind = "SOCKET";
        break;
    case OPAL_BIND_TO_L3CACHE:
        bind = "L3CACHE";
        break;
    case OPAL_BIND_TO_L2CACHE:
        bind = "L2CACHE";
        break;
    case OPAL_BIND_TO_L1CACHE:
        bind = "L1CACHE";
        break;
    case OPAL_BIND_TO_CORE:
        bind = "CORE";
        break;
    case OPAL_BIND_TO_HWTHREAD:
        bind = "HWTHREAD";
        break;
    case OPAL_BIND_TO_CPUSET:
        bind = "CPUSET";
        break;
    default:
        bind = "UNKNOWN";
    }
    ptr = opal_hwloc_get_print_buffer();
    if (NULL == ptr) {
        return opal_hwloc_print_null;
    }
    /* cycle around the ring */
    if (OPAL_HWLOC_PRINT_NUM_BUFS == ptr->cntr) {
        ptr->cntr = 0;
    }
    if (!OPAL_BINDING_REQUIRED(binding) &&
        OPAL_BIND_OVERLOAD_ALLOWED(binding)) {
        snprintf(ptr->buffers[ptr->cntr], OPAL_HWLOC_PRINT_MAX_SIZE,
                 "%s:IF-SUPPORTED:OVERLOAD-ALLOWED", bind);
    } else if (OPAL_BIND_OVERLOAD_ALLOWED(binding)) {
        snprintf(ptr->buffers[ptr->cntr], OPAL_HWLOC_PRINT_MAX_SIZE,
                 "%s:OVERLOAD-ALLOWED", bind);
    } else if (!OPAL_BINDING_REQUIRED(binding)) {
        snprintf(ptr->buffers[ptr->cntr], OPAL_HWLOC_PRINT_MAX_SIZE,
                 "%s:IF-SUPPORTED", bind);
    } else {
        snprintf(ptr->buffers[ptr->cntr], OPAL_HWLOC_PRINT_MAX_SIZE, "%s", bind);
    }
    ret = ptr->buffers[ptr->cntr];
    ptr->cntr++;

    return ret;
}

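For illustration, printing two bare policy constants (the same ones handled by the switch above); any modifier flags set on the value simply change the suffix:

opal_output(0, "policies: %s and %s",
            opal_hwloc_base_print_binding(OPAL_BIND_TO_NONE),
            opal_hwloc_base_print_binding(OPAL_BIND_TO_CORE));

The ring of static print buffers is what lets several calls appear in the same opal_output() format string without the caller freeing anything.
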
/*
 * Turn an int bitmap to a "a-b,c" range kind of string
 */
static char *bitmap2rangestr(int bitmap)
{
    size_t i;
    int range_start, range_end;
    bool first, isset;
    char tmp[BUFSIZ];
    const int stmp = sizeof(tmp) - 1;
    static char ret[BUFSIZ];

    memset(ret, 0, sizeof(ret));

    first = true;
    range_start = -999;
    for (i = 0; i < sizeof(int) * 8; ++i) {
        isset = (bitmap & (1 << i));

        /* Do we have a running range? */
        if (range_start >= 0) {
            if (isset) {
                continue;
            } else {
                /* A range just ended; output it, with a comma
                   separator if it is not the first one */
                if (!first) {
                    strncat(ret, ",", sizeof(ret) - strlen(ret) - 1);
                }
                first = false;

                range_end = i - 1;
                if (range_start == range_end) {
                    snprintf(tmp, stmp, "%d", range_start);
                } else {
                    snprintf(tmp, stmp, "%d-%d", range_start, range_end);
                }
                strncat(ret, tmp, sizeof(ret) - strlen(ret) - 1);

                range_start = -999;
            }
        }

        /* No running range */
        else {
            if (isset) {
                range_start = i;
            }
        }
    }

    /* If we ended the bitmap with a range open, output it */
    if (range_start >= 0) {
        if (!first) {
            strncat(ret, ",", sizeof(ret) - strlen(ret) - 1);
        }
        first = false;

        range_end = i - 1;
        if (range_start == range_end) {
            snprintf(tmp, stmp, "%d", range_start);
        } else {
            snprintf(tmp, stmp, "%d-%d", range_start, range_end);
        }
        strncat(ret, tmp, sizeof(ret) - strlen(ret) - 1);
    }

    return ret;
}

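A worked example with a hypothetical value: 0x5d sets bits 0, 2, 3, 4, and 6, so the helper renders two single bits around one range:

char *s = bitmap2rangestr(0x5d);   /* yields "0,2-4,6" */
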
/*
 * Make a map of socket/core/hwthread tuples
 */
static int build_map(int *num_sockets_arg, int *num_cores_arg,
                     hwloc_cpuset_t cpuset, int ***map)
{
    static int num_sockets = -1, num_cores = -1;
    int socket_index, core_index, pu_index;
    hwloc_obj_t socket, core, pu;
    int **data;

    /* Find out how many sockets we have (cached so that we don't have
       to look this up every time) */
    if (num_sockets < 0) {
        num_sockets = hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_SOCKET);
        /* some systems (like the iMac) only have one
         * socket and so don't report a socket
         */
        if (0 == num_sockets) {
            num_sockets = 1;
        }
        /* Lazy: take the total number of cores that we have in the
           topology; that'll be more than the max number of cores
           under any given socket */
        num_cores = hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_CORE);
    }
    *num_sockets_arg = num_sockets;
    *num_cores_arg = num_cores;

    /* Alloc a 2D array: sockets x cores. */
    data = malloc(num_sockets * sizeof(int *));
    if (NULL == data) {
        return OPAL_ERR_OUT_OF_RESOURCE;
    }
    data[0] = calloc(num_sockets * num_cores, sizeof(int));
    if (NULL == data[0]) {
        free(data);
        return OPAL_ERR_OUT_OF_RESOURCE;
    }
    for (socket_index = 1; socket_index < num_sockets; ++socket_index) {
        data[socket_index] = data[socket_index - 1] + num_cores;
    }

    /* Iterate the PUs in this cpuset; fill in the data[][] array with
       the socket/core/pu triples */
    for (pu_index = 0,
             pu = hwloc_get_obj_inside_cpuset_by_type(opal_hwloc_topology,
                                                      cpuset, HWLOC_OBJ_PU,
                                                      pu_index);
         NULL != pu;
         pu = hwloc_get_obj_inside_cpuset_by_type(opal_hwloc_topology,
                                                  cpuset, HWLOC_OBJ_PU,
                                                  ++pu_index)) {
        /* Go upward and find the core this PU belongs to */
        core = pu;
        while (NULL != core && core->type != HWLOC_OBJ_CORE) {
            core = core->parent;
        }
        core_index = 0;
        if (NULL != core) {
            core_index = core->logical_index;
        }

        /* Go upward and find the socket this PU belongs to */
        socket = pu;
        while (NULL != socket && socket->type != HWLOC_OBJ_SOCKET) {
            socket = socket->parent;
        }
        socket_index = 0;
        if (NULL != socket) {
            socket_index = socket->logical_index;
        }

        /* Save this socket/core/pu combo. LAZY: Assuming that we
           won't have more PU's per core than (sizeof(int)*8). */
        data[socket_index][core_index] |= (1 << pu->sibling_rank);
    }

    *map = data;
    return OPAL_SUCCESS;
}

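The map is one contiguous calloc'd block with per-socket row pointers, which is why callers release it with free(map[0]); free(map);. A hypothetical walk over the result - bit k of map[s][c] is set when hwthread k (sibling rank) of core c on socket s appears in the cpuset:

int s, c;
for (s = 0; s < num_sockets; ++s) {
    for (c = 0; c < num_cores; ++c) {
        if (0 != map[s][c]) {
            opal_output(0, "socket %d core %d -> hwt %s",
                        s, c, bitmap2rangestr(map[s][c]));
        }
    }
}
free(map[0]);
free(map);
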
/*
 * Make a prettyprint string for a hwloc_cpuset_t
 */
int opal_hwloc_base_cset2str(char *str, int len, hwloc_cpuset_t cpuset)
{
    bool first;
    int num_sockets, num_cores;
    int ret, socket_index, core_index;
    char tmp[BUFSIZ];
    const int stmp = sizeof(tmp) - 1;
    int **map;
    hwloc_obj_t root;
    opal_hwloc_topo_data_t *sum;

    str[0] = tmp[stmp] = '\0';

    /* if the cpuset is all zero, then not bound */
    if (hwloc_bitmap_iszero(cpuset)) {
        return OPAL_ERR_NOT_BOUND;
    }

    /* if the cpuset includes all available cpus, then we are unbound */
    root = hwloc_get_root_obj(opal_hwloc_topology);
    if (NULL == root->userdata) {
        opal_hwloc_base_filter_cpus(opal_hwloc_topology);
    }
    sum = (opal_hwloc_topo_data_t*)root->userdata;
    if (NULL == sum->available) {
        return OPAL_ERROR;
    }
    if (0 != hwloc_bitmap_isincluded(sum->available, cpuset)) {
        return OPAL_ERR_NOT_BOUND;
    }

    if (OPAL_SUCCESS != (ret = build_map(&num_sockets, &num_cores, cpuset, &map))) {
        return ret;
    }

    /* Iterate over the data matrix and build up the string */
    first = true;
    for (socket_index = 0; socket_index < num_sockets; ++socket_index) {
        for (core_index = 0; core_index < num_cores; ++core_index) {
            if (map[socket_index][core_index] > 0) {
                if (!first) {
                    strncat(str, ", ", len - strlen(str));
                }
                first = false;

                snprintf(tmp, stmp, "socket %d[core %d[hwt %s]]",
                         socket_index, core_index,
                         bitmap2rangestr(map[socket_index][core_index]));
                strncat(str, tmp, len - strlen(str));
            }
        }
    }
    free(map[0]);
    free(map);

    return OPAL_SUCCESS;
}

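On a hypothetical machine with two 4-core sockets, a process bound to both hwthreads of logical core 0 plus one hwthread of logical core 4 would be rendered as:

/* str: "socket 0[core 0[hwt 0-1]], socket 1[core 4[hwt 0]]" */
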
/*
 * Make a prettyprint string for a cset in a map format.
 * Example: [B./..]
 * Key:  [] - signifies socket
 *        / - divider between cores
 *        . - signifies PU a process is not bound to
 *        B - signifies PU a process is bound to
 */
int opal_hwloc_base_cset2mapstr(char *str, int len, hwloc_cpuset_t cpuset)
{
    char tmp[BUFSIZ];
    int core_index, pu_index;
    const int stmp = sizeof(tmp) - 1;
    hwloc_obj_t socket, core, pu;
    hwloc_obj_t root;
    opal_hwloc_topo_data_t *sum;

    str[0] = tmp[stmp] = '\0';

    /* if the cpuset is all zero, then not bound */
    if (hwloc_bitmap_iszero(cpuset)) {
        return OPAL_ERR_NOT_BOUND;
    }

    /* if the cpuset includes all available cpus, then we are unbound */
    root = hwloc_get_root_obj(opal_hwloc_topology);
    if (NULL == root->userdata) {
        opal_hwloc_base_filter_cpus(opal_hwloc_topology);
    }
    sum = (opal_hwloc_topo_data_t*)root->userdata;
    if (NULL == sum->available) {
        return OPAL_ERROR;
    }
    if (0 != hwloc_bitmap_isincluded(sum->available, cpuset)) {
        return OPAL_ERR_NOT_BOUND;
    }

    /* Iterate over all existing sockets */
    for (socket = hwloc_get_obj_by_type(opal_hwloc_topology,
                                        HWLOC_OBJ_SOCKET, 0);
         NULL != socket;
         socket = socket->next_cousin) {
        strncat(str, "[", len - strlen(str));

        /* Iterate over all existing cores in this socket */
        core_index = 0;
        for (core = hwloc_get_obj_inside_cpuset_by_type(opal_hwloc_topology,
                                                        socket->cpuset,
                                                        HWLOC_OBJ_CORE, core_index);
             NULL != core;
             core = hwloc_get_obj_inside_cpuset_by_type(opal_hwloc_topology,
                                                        socket->cpuset,
                                                        HWLOC_OBJ_CORE, ++core_index)) {
            if (core_index > 0) {
                strncat(str, "/", len - strlen(str));
            }

            /* Iterate over all existing PUs in this core */
            pu_index = 0;
            for (pu = hwloc_get_obj_inside_cpuset_by_type(opal_hwloc_topology,
                                                          core->cpuset,
                                                          HWLOC_OBJ_PU, pu_index);
                 NULL != pu;
                 pu = hwloc_get_obj_inside_cpuset_by_type(opal_hwloc_topology,
                                                          core->cpuset,
                                                          HWLOC_OBJ_PU, ++pu_index)) {
                /* Is this PU in the cpuset? */
                if (hwloc_bitmap_isset(cpuset, pu->os_index)) {
                    strncat(str, "B", len - strlen(str));
                } else {
                    strncat(str, ".", len - strlen(str));
                }
            }
        }
        strncat(str, "]", len - strlen(str));
    }

    return OPAL_SUCCESS;
}

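A sketch of driving both prettyprinters from the current process binding; hwloc_get_cpubind() is the standard hwloc call, and opal_hwloc_topology is the framework's global topology handle:

char rangestr[1024], mapstr[1024];
hwloc_cpuset_t bound = hwloc_bitmap_alloc();

if (0 == hwloc_get_cpubind(opal_hwloc_topology, bound, HWLOC_CPUBIND_PROCESS) &&
    OPAL_SUCCESS == opal_hwloc_base_cset2str(rangestr, sizeof(rangestr), bound) &&
    OPAL_SUCCESS == opal_hwloc_base_cset2mapstr(mapstr, sizeof(mapstr), bound)) {
    opal_output(0, "bound to %s, i.e. %s", rangestr, mapstr);
}
hwloc_bitmap_free(bound);
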
This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
domain, using the PCI locality information reported by the BIOS - from the
NUMA domain closest to the device to the furthest one.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
on the first node and 2 on the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA node closest
to the specified device, and if successful, it will place 4 processes
on that NUMA node and leave the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specify a binding
level "lower" than NUMA (i.e., hwthread, core, or socket), it will
bind to whatever level you specify.
This commit was SVN r28552.
2013-05-22 17:04:40 +04:00
|
|
|
|
|
|
|
static int dist_cmp_fn (opal_list_item_t **a, opal_list_item_t **b)
|
|
|
|
{
|
|
|
|
orte_rmaps_numa_node_t *aitem = *((orte_rmaps_numa_node_t **) a);
|
|
|
|
orte_rmaps_numa_node_t *bitem = *((orte_rmaps_numa_node_t **) b);
|
|
|
|
|
2013-11-13 13:26:40 +04:00
|
|
|
if (aitem->dist_from_closed > bitem->dist_from_closed) {
|
This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
due to PCI locality information as reported by the BIOS - from the
closest to device to furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
to the first node and 2 to the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA closest
to the specified device, and if successful, it will place 4 processes
on that NUMA but leaving the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specified binding
level that was "lower" than NUMA (i.e hwthread, core, socket) it would
bind to whatever level you specify.
This commit was SVN r28552.
2013-05-22 17:04:40 +04:00
|
|
|
return 1;
|
|
|
|
} else if( aitem->dist_from_closed == bitem->dist_from_closed ) {
|
|
|
|
return 0;
|
|
|
|
} else {
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-08-28 20:23:33 +04:00
|
|
|
static void sort_by_dist(hwloc_topology_t topo, char* device_name, opal_list_t *sorted_list)
|
This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
due to PCI locality information as reported by the BIOS - from the
closest to device to furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
to the first node and 2 to the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA closest
to the specified device, and if successful, it will place 4 processes
on that NUMA but leaving the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specified binding
level that was "lower" than NUMA (i.e hwthread, core, socket) it would
bind to whatever level you specify.
This commit was SVN r28552.
2013-05-22 17:04:40 +04:00
|
|
|
{
|
|
|
|
hwloc_obj_t device_obj = NULL;
|
|
|
|
hwloc_obj_t obj = NULL, root = NULL;
|
|
|
|
const struct hwloc_distances_s* distances;
|
|
|
|
orte_rmaps_numa_node_t *numa_node;
|
|
|
|
int close_node_index;
|
|
|
|
float latency;
|
2013-06-12 20:25:25 +04:00
|
|
|
unsigned int j;
|
2013-11-13 19:54:01 +04:00
|
|
|
int depth;
|
This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
due to PCI locality information as reported by the BIOS - from the
closest to device to furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
to the first node and 2 to the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA closest
to the specified device, and if successful, it will place 4 processes
on that NUMA but leaving the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specified binding
level that was "lower" than NUMA (i.e hwthread, core, socket) it would
bind to whatever level you specify.
This commit was SVN r28552.
2013-05-22 17:04:40 +04:00
|
|
|
unsigned i;
|
|
|
|
|
|
|
|
for (device_obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_OS_DEVICE, 0); device_obj; device_obj = hwloc_get_next_osdev(topo, device_obj)) {
|
|
|
|
if (device_obj->attr->osdev.type == HWLOC_OBJ_OSDEV_OPENFABRICS
|
|
|
|
|| device_obj->attr->osdev.type == HWLOC_OBJ_OSDEV_NETWORK) {
|
|
|
|
if (!strcmp(device_obj->name, device_name)) {
|
|
|
|
/* find numa node containing this device */
|
|
|
|
obj = device_obj->parent;
|
|
|
|
while ((obj != NULL) && (obj->type != HWLOC_OBJ_NODE)) {
|
|
|
|
obj = obj->parent;
|
|
|
|
}
|
|
|
|
if (obj == NULL) {
|
2013-08-28 20:23:33 +04:00
|
|
|
opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
|
|
|
|
"hwloc:base:get_sorted_numa_list: NUMA node closest to %s wasn't found.",
|
|
|
|
device_name);
|
This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
due to PCI locality information as reported by the BIOS - from the
closest to device to furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
to the first node and 2 to the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA closest
to the specified device, and if successful, it will place 4 processes
on that NUMA but leaving the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specified binding
level that was "lower" than NUMA (i.e hwthread, core, socket) it would
bind to whatever level you specify.
This commit was SVN r28552.
2013-05-22 17:04:40 +04:00
|
|
|
return;
|
|
|
|
} else {
|
|
|
|
close_node_index = obj->logical_index;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* find distance matrix for all numa nodes */
|
|
|
|
distances = hwloc_get_whole_distance_matrix_by_type(topo, HWLOC_OBJ_NODE);
|
|
|
|
if (NULL == distances) {
|
|
|
|
/* we can try to find distances under group object. This info can be there. */
|
|
|
|
depth = hwloc_get_type_depth(topo, HWLOC_OBJ_NODE);
|
2013-11-13 19:54:01 +04:00
|
|
|
if (HWLOC_TYPE_DEPTH_UNKNOWN == depth) {
|
2013-08-28 20:23:33 +04:00
|
|
|
opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
|
|
|
|
"hwloc:base:get_sorted_numa_list: There is no information about distances on the node.");
|
This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
due to PCI locality information as reported by the BIOS - from the
closest to device to furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
to the first node and 2 to the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA closest
to the specified device, and if successful, it will place 4 processes
on that NUMA but leaving the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specified binding
level that was "lower" than NUMA (i.e hwthread, core, socket) it would
bind to whatever level you specify.
This commit was SVN r28552.
2013-05-22 17:04:40 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
root = hwloc_get_root_obj(topo);
|
|
|
|
for (i = 0; i < root->arity; i++) {
|
|
|
|
obj = root->children[i];
|
|
|
|
if (obj->distances_count > 0) {
|
|
|
|
for(j = 0; j < obj->distances_count; j++) {
|
2013-11-13 19:54:01 +04:00
|
|
|
if (obj->distances[j]->relative_depth + 1 == (unsigned) depth) {
|
This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
due to PCI locality information as reported by the BIOS - from the
closest to device to furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
to the first node and 2 to the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA closest
to the specified device, and if successful, it will place 4 processes
on that NUMA but leaving the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specified binding
level that was "lower" than NUMA (i.e hwthread, core, socket) it would
bind to whatever level you specify.
This commit was SVN r28552.
2013-05-22 17:04:40 +04:00
                                    distances = obj->distances[j];
                                    break;
                                }
                            }
                        }
                    }
                }
                /* find all distances for our close node with logical index = close_node_index as close_node_index + nbobjs*j */
                if ((NULL == distances) || (0 == distances->nbobjs)) {
                    opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
                            "hwloc:base:get_sorted_numa_list: There is no information about distances on the node.");
                    return;
                }
                /* fill list of numa nodes */
                for (j = 0; j < distances->nbobjs; j++) {
                    latency = distances->latency[close_node_index + distances->nbobjs * j];
                    numa_node = OBJ_NEW(orte_rmaps_numa_node_t);
                    numa_node->index = j;
                    numa_node->dist_from_closed = latency;
                    opal_list_append(sorted_list, &numa_node->super);
                }
                /* sort numa nodes by distance from the closest one to PCI */
                opal_list_sort(sorted_list, dist_cmp_fn);
                return;
            }
        }
    }
}
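The list built above is ordered by opal_list_sort() using dist_cmp_fn, which is defined earlier in this file and not shown in this excerpt. As a minimal sketch of the kind of comparator opal_list_sort() expects here (the name dist_cmp_fn_sketch is hypothetical; this is not the actual implementation), it orders orte_rmaps_numa_node_t items by ascending dist_from_closed:
{{{
static int dist_cmp_fn_sketch(opal_list_item_t **a, opal_list_item_t **b)
{
    orte_rmaps_numa_node_t *aitem = (orte_rmaps_numa_node_t*)*a;
    orte_rmaps_numa_node_t *bitem = (orte_rmaps_numa_node_t*)*b;

    /* smaller distance from the closest NUMA node sorts earlier */
    if (aitem->dist_from_closed > bitem->dist_from_closed) {
        return 1;
    }
    if (aitem->dist_from_closed < bitem->dist_from_closed) {
        return -1;
    }
    return 0;
}
}}}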
static int find_devices(hwloc_topology_t topo, char** device_name)
{
    hwloc_obj_t device_obj = NULL;
    int count = 0;

    /* scan the hwloc OS devices, count the OpenFabrics ones, and remember
     * the name of the last one found so that an "auto" request can be
     * replaced with a concrete device name for the caller */
    for (device_obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_OS_DEVICE, 0); device_obj; device_obj = hwloc_get_next_osdev(topo, device_obj)) {
        if (device_obj->attr->osdev.type == HWLOC_OBJ_OSDEV_OPENFABRICS) {
            count++;
            free(*device_name);
            *device_name = strdup(device_obj->name);
        }
    }
    return count;
}
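A short, hypothetical usage sketch of the autodetection path (the caller-side variable names are assumptions): the caller passes a heap-allocated "auto" string, find_devices() replaces it with the name of the last OpenFabrics device found, and a count greater than one signals that the choice is ambiguous.
{{{
char *device_name = strdup("auto");   /* caller-owned; may be replaced by find_devices() */
int count = find_devices(topo, &device_name);
if (1 == count) {
    /* device_name now holds the single OpenFabrics device that was found */
} else if (count > 1) {
    /* ambiguous: more than one OpenFabrics device is present */
}
free(device_name);
}}}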

int opal_hwloc_get_sorted_numa_list(hwloc_topology_t topo, char* device_name, opal_list_t *sorted_list)
{
    hwloc_obj_t obj;
    opal_list_item_t *item;
    opal_hwloc_summary_t *sum;
    opal_hwloc_topo_data_t *data;
    orte_rmaps_numa_node_t *numa, *copy_numa;
    int count;

    obj = hwloc_get_root_obj(topo);

    /* first see if the topology already has this info */
    /* we call opal_hwloc_base_get_nbobjs_by_type() before this to fill the summary object, so it should exist */
    data = (opal_hwloc_topo_data_t*)obj->userdata;
    if (NULL != data) {
        for (item = opal_list_get_first(&data->summaries);
             item != opal_list_get_end(&data->summaries);
             item = opal_list_get_next(item)) {
            sum = (opal_hwloc_summary_t*)item;
            if (HWLOC_OBJ_NODE == sum->type) {
                if (opal_list_get_size(&sum->sorted_by_dist_list) > 0) {
                    OPAL_LIST_FOREACH(numa, &(sum->sorted_by_dist_list), orte_rmaps_numa_node_t) {
                        copy_numa = OBJ_NEW(orte_rmaps_numa_node_t);
                        copy_numa->index = numa->index;
                        copy_numa->dist_from_closed = numa->dist_from_closed;
                        opal_list_append(sorted_list, &copy_numa->super);
                    }
                    return OPAL_SUCCESS;
                } else {
                    /* don't already know it - go get it */
                    /* first check whether we need to autodetect an OpenFabrics device or were given a specific one */
                    if (!strcmp(device_name, "auto")) {
                        count = find_devices(topo, &device_name);
                        if (count > 1) {
                            return count;
                        }
                    }
                    if (!device_name || (strlen(device_name) == 0)) {
                        return OPAL_ERR_NOT_FOUND;
                    }
                    sort_by_dist(topo, device_name, sorted_list);
                    /* store this info in summary object for later usage */
                    OPAL_LIST_FOREACH(numa, sorted_list, orte_rmaps_numa_node_t) {
                        copy_numa = OBJ_NEW(orte_rmaps_numa_node_t);
                        copy_numa->index = numa->index;
                        copy_numa->dist_from_closed = numa->dist_from_closed;
                        opal_list_append(&(sum->sorted_by_dist_list), &copy_numa->super);
                    }
                    return OPAL_SUCCESS;
                }
            }
        }
    }
    return OPAL_ERR_NOT_FOUND;
}
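To show how the sorted list produced by opal_hwloc_get_sorted_numa_list() is meant to be consumed, here is a minimal, hypothetical caller sketch (the variable names, the device name, and the cleanup pattern are assumptions; the real consumer is the rmaps mindist component): it walks the list from the closest NUMA node outward and then releases the copies it owns.
{{{
opal_list_t numa_list;
orte_rmaps_numa_node_t *numa;
int rc;

OBJ_CONSTRUCT(&numa_list, opal_list_t);
rc = opal_hwloc_get_sorted_numa_list(topo, "mlx5_0", &numa_list);
if (OPAL_SUCCESS == rc) {
    OPAL_LIST_FOREACH(numa, &numa_list, orte_rmaps_numa_node_t) {
        /* numa->index is the NUMA node's logical index;
         * numa->dist_from_closed is its distance from the node closest to the device */
    }
}
OPAL_LIST_DESTRUCT(&numa_list);
}}}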