/*
 * Copyright (c) 2011-2014 Cisco Systems, Inc.  All rights reserved.
 * Copyright (c) 2013-2014 Intel, Inc.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "opal_config.h"

#include "opal/constants.h"
#include "opal/dss/dss.h"
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. Both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. VM launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
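The oversubscription default from item 1 of the commit message above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the ORTE implementation: the `sketch_` types and the function name are hypothetical, and the commit message does not say how the override flags are represented.

```c
#include <stdbool.h>

/* Hypothetical: where the node allocation came from. */
typedef enum {
    SKETCH_ALLOC_HOSTFILE,
    SKETCH_ALLOC_DASH_HOST,
    SKETCH_ALLOC_RESOURCE_MANAGER
} sketch_alloc_source_t;

/* Hypothetical tri-state for a user-supplied override flag. */
typedef enum {
    SKETCH_OVERRIDE_UNSET,
    SKETCH_OVERRIDE_ALLOW,
    SKETCH_OVERRIDE_FORBID
} sketch_override_t;

/* Decide whether more processes than allocated slots may run on a node. */
static bool sketch_oversubscribe_allowed(sketch_alloc_source_t src,
                                         sketch_override_t user)
{
    /* An explicit user flag always wins over the default. */
    if (SKETCH_OVERRIDE_ALLOW == user) {
        return true;
    }
    if (SKETCH_OVERRIDE_FORBID == user) {
        return false;
    }
    /* Default: allowed for hostfile/dash-host allocations,
     * not allowed for resource-manager allocations. */
    return SKETCH_ALLOC_RESOURCE_MANAGER != src;
}
```

With no override, a hostfile allocation yields `true` and a resource-manager allocation yields `false`; passing an explicit override reverses either default.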
#include "opal/util/argv.h"
#include "opal/util/output.h"
#include "opal/util/show_help.h"
#include "opal/mca/mca.h"
#include "opal/mca/base/base.h"
#include "opal/threads/tsd.h"

#include "opal/mca/hwloc/hwloc.h"
#include "opal/mca/hwloc/base/base.h"
/*
 * The following file was created by configure.  It contains extern
 * statements and the definition of an array of pointers to each
 * component's public mca_base_component_t struct.
 */
#include "opal/mca/hwloc/base/static-components.h"

/*
 * Globals
 */
bool opal_hwloc_base_inited = false;
#if OPAL_HAVE_HWLOC
hwloc_topology_t opal_hwloc_topology = NULL;
hwloc_cpuset_t opal_hwloc_my_cpuset = NULL;
hwloc_cpuset_t opal_hwloc_base_given_cpus = NULL;
opal_hwloc_base_map_t opal_hwloc_base_map = OPAL_HWLOC_BASE_MAP_NONE;
opal_hwloc_base_mbfa_t opal_hwloc_base_mbfa = OPAL_HWLOC_BASE_MBFA_WARN;
opal_binding_policy_t opal_hwloc_binding_policy = 0;
char *opal_hwloc_base_slot_list = NULL;
char *opal_hwloc_base_cpu_set = NULL;
bool opal_hwloc_report_bindings = false;
hwloc_obj_type_t opal_hwloc_levels[] = {
    HWLOC_OBJ_MACHINE,
    HWLOC_OBJ_NODE,
    HWLOC_OBJ_SOCKET,
    HWLOC_OBJ_CACHE,
    HWLOC_OBJ_CACHE,
    HWLOC_OBJ_CACHE,
    HWLOC_OBJ_CORE,
    HWLOC_OBJ_PU
};
bool opal_hwloc_use_hwthreads_as_cpus = false;
****************************************************************
This change contains a non-mandatory modification
of the MPI-RTE interface. Anyone wishing to support
coprocessors such as the Xeon Phi may wish to add
the required definition and underlying support
****************************************************************
Add locality support for coprocessors such as the Intel Xeon Phi.
Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.
So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:
1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board
2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions
3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.
4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algorithm isn't scalable at the moment - this will hopefully be improved over time.
5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTEs that choose not to provide this support do not have to do anything - the associated code will simply be ignored.
6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.
cmr:v1.7.4:reviewer=hjelmn
This commit was SVN r29435.
2013-10-14 16:52:58 +00:00
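The host/board split described in items 1 and 2 of the commit message above can be sketched with bit flags. This is a minimal illustration, not the definitions from the OPAL headers: the `SKETCH_` names and the bit values are assumptions, chosen only to show why the "local node" test has to check both conditions explicitly.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t sketch_locality_t;

/* Hypothetical bit values; the real OPAL flags may differ. */
#define SKETCH_PROC_ON_CLUSTER 0x0001
#define SKETCH_PROC_ON_CU      0x0002
#define SKETCH_PROC_ON_HOST    0x0004
#define SKETCH_PROC_ON_BOARD   0x0008

/* "On node" means sharing both the host AND the board. */
#define SKETCH_PROC_ON_NODE (SKETCH_PROC_ON_HOST | SKETCH_PROC_ON_BOARD)

/* A plain (n & NODE) test would be true if either bit were set,
 * so compare against the full mask to require both. */
#define SKETCH_PROC_ON_LOCAL_NODE(n) \
    (((n) & SKETCH_PROC_ON_NODE) == SKETCH_PROC_ON_NODE)
#define SKETCH_PROC_ON_LOCAL_HOST(n) \
    (0 != ((n) & SKETCH_PROC_ON_HOST))

/* Build a locality mask for a peer process. same_board is only
 * meaningful when the peer is also on the same host. */
static sketch_locality_t sketch_locality(bool same_host, bool same_board)
{
    sketch_locality_t loc = 0;

    if (same_host) {
        /* Sharing a host implies sharing the cluster and the CU. */
        loc |= SKETCH_PROC_ON_CLUSTER | SKETCH_PROC_ON_CU | SKETCH_PROC_ON_HOST;
        if (same_board) {
            loc |= SKETCH_PROC_ON_BOARD;
        }
    }
    return loc;
}
```

A process on a coprocessor would share the host but not the board with processes on the host CPU, so it passes the host test while failing the local-node test.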
char *opal_hwloc_base_topo_file = NULL;
#endif

#if OPAL_HAVE_HWLOC
static mca_base_var_enum_value_t hwloc_base_map[] = {
    {OPAL_HWLOC_BASE_MAP_NONE, "none"},
    {OPAL_HWLOC_BASE_MAP_LOCAL_ONLY, "local_only"},
    {0, NULL}
};

static mca_base_var_enum_value_t hwloc_failure_action[] = {
    {OPAL_HWLOC_BASE_MBFA_SILENT, "silent"},
    {OPAL_HWLOC_BASE_MBFA_WARN, "warn"},
    {OPAL_HWLOC_BASE_MBFA_ERROR, "error"},
    {0, NULL}
};
#endif

static int opal_hwloc_base_register(mca_base_register_flag_t flags);
static int opal_hwloc_base_open(mca_base_open_flag_t flags);
static int opal_hwloc_base_close(void);

MCA_BASE_FRAMEWORK_DECLARE(opal, hwloc, NULL, opal_hwloc_base_register, opal_hwloc_base_open, opal_hwloc_base_close,
                           mca_hwloc_base_static_components, 0);

#if OPAL_HAVE_HWLOC
static char *opal_hwloc_base_binding_policy = NULL;
static bool opal_hwloc_base_bind_to_core = false;
static bool opal_hwloc_base_bind_to_socket = false;
#endif

static int opal_hwloc_base_register(mca_base_register_flag_t flags)
{
#if OPAL_HAVE_HWLOC
    mca_base_var_enum_t *new_enum;
    int ret;

    /* hwloc_base_mbind_policy */
    opal_hwloc_base_map = OPAL_HWLOC_BASE_MAP_NONE;
    mca_base_var_enum_create("hwloc memory allocation policy", hwloc_base_map, &new_enum);
    ret = mca_base_var_register("opal", "hwloc", "base", "mem_alloc_policy",
                                "General memory allocations placement policy (this is not memory binding). "
                                "\"none\" means that no memory policy is applied. \"local_only\" means that a process' memory allocations will be restricted to its local NUMA node. "
                                "If using direct launch, this policy will not be in effect until after MPI_INIT. "
                                "Note that operating system paging policies are unaffected by this setting. For example, if \"local_only\" is used and local NUMA node memory is exhausted, a new memory allocation may cause paging.",
                                MCA_BASE_VAR_TYPE_INT, new_enum, 0, 0, OPAL_INFO_LVL_9,
                                MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_base_map);
    OBJ_RELEASE(new_enum);
    if (0 > ret) {
        return ret;
    }

    /* hwloc_base_bind_failure_action */
    opal_hwloc_base_mbfa = OPAL_HWLOC_BASE_MBFA_WARN;
    mca_base_var_enum_create("hwloc memory bind failure action", hwloc_failure_action, &new_enum);
    ret = mca_base_var_register("opal", "hwloc", "base", "mem_bind_failure_action",
                                "What Open MPI will do if it explicitly tries to bind memory to a specific NUMA location, and fails. Note that this is a different case than the general allocation policy described by hwloc_base_mem_alloc_policy. A value of \"silent\" means that Open MPI will proceed without comment. A value of \"warn\" means that Open MPI will warn the first time this happens, but allow the job to continue (possibly with degraded performance). A value of \"error\" means that Open MPI will abort the job if this happens.",
                                MCA_BASE_VAR_TYPE_INT, new_enum, 0, 0, OPAL_INFO_LVL_9,
                                MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_base_mbfa);
    OBJ_RELEASE(new_enum);
    if (0 > ret) {
        return ret;
    }

    opal_hwloc_base_binding_policy = NULL;
    (void) mca_base_var_register("opal", "hwloc", "base", "binding_policy",
                                 "Policy for binding processes [none | hwthread | core (default) | l1cache | l2cache | l3cache | socket | numa | board] (supported qualifiers: overload-allowed,if-supported)",
                                 MCA_BASE_VAR_TYPE_STRING, NULL, 0, 0, OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_base_binding_policy);

    /* backward compatibility */
    opal_hwloc_base_bind_to_core = false;
    (void) mca_base_var_register("opal", "hwloc", "base", "bind_to_core", "Bind processes to cores",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0, OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_base_bind_to_core);

    opal_hwloc_base_bind_to_socket = false;
    (void) mca_base_var_register("opal", "hwloc", "base", "bind_to_socket", "Bind processes to sockets",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0, OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_base_bind_to_socket);

    opal_hwloc_report_bindings = false;
    (void) mca_base_var_register("opal", "hwloc", "base", "report_bindings", "Report bindings to stderr",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0, OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_report_bindings);

    opal_hwloc_base_slot_list = NULL;
    (void) mca_base_var_register("opal", "hwloc", "base", "slot_list",
                                 "List of processor IDs to bind processes to [default=NULL]",
                                 MCA_BASE_VAR_TYPE_STRING, NULL, 0, 0, OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_base_slot_list);

    opal_hwloc_base_cpu_set = NULL;
    (void) mca_base_var_register("opal", "hwloc", "base", "cpu_set",
                                 "Comma-separated list of ranges specifying logical cpus allocated to this job [default: none]",
                                 MCA_BASE_VAR_TYPE_STRING, NULL, 0, 0, OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_base_cpu_set);

    /* declare hwthreads as independent cpus */
    opal_hwloc_use_hwthreads_as_cpus = false;
    (void) mca_base_var_register("opal", "hwloc", "base", "use_hwthreads_as_cpus",
                                 "Use hardware threads as independent cpus",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0, OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_use_hwthreads_as_cpus);

    opal_hwloc_base_topo_file = NULL;
    (void) mca_base_var_register("opal", "hwloc", "base", "topo_file",
                                 "Read local topology from file instead of directly sensing it",
                                 MCA_BASE_VAR_TYPE_STRING, NULL, 0, 0, OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY, &opal_hwloc_base_topo_file);
#endif

    /* register parameters */
    return OPAL_SUCCESS;
}

static int opal_hwloc_base_open(mca_base_open_flag_t flags)
{
    if (opal_hwloc_base_inited) {
        return OPAL_SUCCESS;
    }
    opal_hwloc_base_inited = true;

#if OPAL_HAVE_HWLOC
    {
        int rc;
        opal_data_type_t tmp;

        if (OPAL_SUCCESS != (rc = opal_hwloc_base_set_binding_policy(&opal_hwloc_binding_policy,
                                                                     opal_hwloc_base_binding_policy))) {
            return rc;
        }

        if (opal_hwloc_base_bind_to_core) {
            opal_show_help("help-opal-hwloc-base.txt", "deprecated", true,
                           "--bind-to-core", "--bind-to core",
                           "hwloc_base_bind_to_core", "hwloc_base_binding_policy=core");
            /* set binding policy to core - error if something else already set */
            if (OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy) &&
                OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy) != OPAL_BIND_TO_CORE) {
                /* error - cannot redefine the default binding policy */
                opal_show_help("help-opal-hwloc-base.txt", "redefining-policy", true,
                               "core", opal_hwloc_base_print_binding(opal_hwloc_binding_policy));
Refs trac:3275.
We ran into a case where the OMPI SVN trunk grew a new acceptable MCA
parameter value, but this new value was not accepted on the v1.6
branch (hwloc_base_mem_bind_failure_action -- on the trunk it accepts
the value "silent", but on the older v1.6 branch, it doesn't). If you
set "hwloc_base_mem_bind_failure_action=silent" in the default MCA
params file and then accidentally ran with the v1.6 branch, every OMPI
executable (including ompi_info) just failed because hwloc_base_open()
would say "hey, 'silent' is not a valid value for
hwloc_base_mem_bind_failure_action!". Kaboom.
The only problem is that it didn't give you any indication of where
this value was being set. Quite maddening, from a user perspective.
So we changed how ompi_info handles this case. If any framework open
function returns OMPI_ERR_BAD_PARAM (either because its base MCA params
got a bad value or because one of its component register/open
functions returns OMPI_ERR_BAD_PARAM), ompi_info will stop, print out
a warning that it received an error, and then dump out the parameters
that it has received so far in the framework that had a problem.
At a minimum, this will show the user the MCA param that had an error
(it's usually the last one), and ''where it was set from'' (so that
they can go fix it).
We updated ompi_info to check for O???_ERR_BAD_PARAM from each of
the framework opens. Also updated the doxygen docs in mca.h for this
O???_BAD_PARAM behavior. And we noticed that mca.h had MCA_SUCCESS
and MCA_ERR_??? codes. Why? I think we used them in exactly one
place in the code base (mca_base_components_open.c). So we deleted
those and just used the normal OPAL_* codes instead.
While we were doing this, we also cleaned up a little memory
management during ompi_info/orte-info/opal-info finalization.
Valgrind still reports a truckload of memory in use at ompi_info
termination, but most of it looks to be components not freeing
memory/resources properly (which is outside the scope of this fix).
This commit was SVN r27306.
The following Trac tickets were found above:
Ticket 3275 --> https://svn.open-mpi.org/trac/ompi/ticket/3275
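The reporting behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual ompi_info code: the `demo_param_t` record, the `demo_validate` and `demo_open_framework` helpers, the `demo_mem_bind_failure_action` parameter name, and the accepted value set are all assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_SUCCESS        0
#define DEMO_ERR_BAD_PARAM (-1)

/* Hypothetical record of one MCA-style parameter and where it was set. */
typedef struct {
    const char *name;
    const char *value;
    const char *source;   /* e.g. "default", "environment", or a file path */
} demo_param_t;

/* Hypothetical validator: this "older branch" accepts only warn and error. */
static int demo_validate(const demo_param_t *p)
{
    if (0 != strcmp(p->name, "demo_mem_bind_failure_action")) {
        return DEMO_SUCCESS;   /* other parameters are not checked here */
    }
    if (0 == strcmp(p->value, "warn") || 0 == strcmp(p->value, "error")) {
        return DEMO_SUCCESS;
    }
    return DEMO_ERR_BAD_PARAM;
}

/* Validate parameters in registration order. On the first bad value, print
 * every parameter seen so far (the offender is the last one listed) together
 * with its source, and return its index. Return -1 if all values are valid. */
static int demo_open_framework(const demo_param_t *params, int n, FILE *out)
{
    int i, j;
    for (i = 0; i < n; i++) {
        if (DEMO_ERR_BAD_PARAM == demo_validate(&params[i])) {
            fprintf(out, "Warning: framework open returned BAD_PARAM\n");
            for (j = 0; j <= i; j++) {
                fprintf(out, "  %s = \"%s\" (set from: %s)\n",
                        params[j].name, params[j].value, params[j].source);
            }
            return i;
        }
    }
    return -1;
}
```

Printing the source next to each value is the point of the change: the user sees not only which parameter was rejected but also which file or environment setting to go fix.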
2012-09-11 20:47:24 +00:00
return OPAL_ERR_BAD_PARAM ;
2011-11-15 03:40:11 +00:00
}
OPAL_SET_BINDING_POLICY ( opal_hwloc_binding_policy , OPAL_BIND_TO_CORE ) ;
}
2013-03-27 21:09:41 +00:00
if ( opal_hwloc_base_bind_to_socket ) {
2014-01-15 14:48:39 +00:00
opal_show_help ( " help-opal-hwloc-base.txt " , " deprecated " , true ,
2014-01-22 12:17:14 +00:00
" --bind-to-socket " , " --bind-to socket " ,
2014-01-15 14:48:39 +00:00
" hwloc_base_bind_to_socket " , " hwloc_base_binding_policy=socket " ) ;
2011-11-15 03:40:11 +00:00
/* set binding policy to socket - error if something else already set */
if ( OPAL_BINDING_POLICY_IS_SET ( opal_hwloc_binding_policy ) & &
OPAL_GET_BINDING_POLICY ( opal_hwloc_binding_policy ) ! = OPAL_BIND_TO_SOCKET ) {
/* error - cannot redefine the default binding policy */
opal_show_help ( " help-opal-hwloc-base.txt " , " redefining-policy " , true ,
" socket " , opal_hwloc_base_print_binding ( opal_hwloc_binding_policy ) ) ;
return OPAL_ERR_SILENT ;
}
OPAL_SET_BINDING_POLICY ( opal_hwloc_binding_policy , OPAL_BIND_TO_SOCKET ) ;
}
/* did the user provide a slot list? */
if ( NULL ! = opal_hwloc_base_slot_list ) {
/* if we already were given a policy, then this is an error */
if ( OPAL_BINDING_POLICY_IS_SET ( opal_hwloc_binding_policy ) ) {
opal_show_help ( " help-opal-hwloc-base.txt " , " redefining-policy " , true ,
" socket " , opal_hwloc_base_print_binding ( opal_hwloc_binding_policy ) ) ;
return OPAL_ERR_SILENT ;
}
OPAL_SET_BINDING_POLICY ( opal_hwloc_binding_policy , OPAL_BIND_TO_CPUSET ) ;
}
2011-10-29 14:58:58 +00:00
/* cpu allocation specification */
2012-03-23 14:05:52 +00:00
if ( NULL ! = opal_hwloc_base_cpu_set ) {
if ( ! OPAL_BINDING_POLICY_IS_SET ( opal_hwloc_binding_policy ) ) {
/* it is okay if a binding policy was already given - just ensure that
* we do bind to the given cpus if provided , otherwise this would be
* ignored if someone didn ' t also specify a binding policy
*/
OPAL_SET_BINDING_POLICY ( opal_hwloc_binding_policy , OPAL_BIND_TO_CPUSET ) ;
}
}
2011-10-29 14:58:58 +00:00
2014-02-09 16:14:38 +00:00
/* if we are binding to hwthreads, then we must use hwthreads as cpus */
if ( OPAL_GET_BINDING_POLICY ( opal_hwloc_binding_policy ) = = OPAL_BIND_TO_HWTHREAD ) {
opal_hwloc_use_hwthreads_as_cpus = true ;
}
2011-09-11 19:02:24 +00:00
/* to support tools such as ompi_info, add the components
* to a list
*/
if ( OPAL_SUCCESS ! =
2013-03-27 21:11:47 +00:00
mca_base_framework_components_open ( & opal_hwloc_base_framework , flags ) ) {
2011-09-11 19:02:24 +00:00
return OPAL_ERROR ;
}
/* declare the hwloc data types */
tmp = OPAL_HWLOC_TOPO ;
2013-12-20 20:42:39 +00:00
if ( OPAL_SUCCESS ! = ( rc = opal_dss . register_type ( opal_hwloc_pack ,
opal_hwloc_unpack ,
( opal_dss_copy_fn_t ) opal_hwloc_copy ,
( opal_dss_compare_fn_t ) opal_hwloc_compare ,
( opal_dss_print_fn_t ) opal_hwloc_print ,
OPAL_DSS_STRUCTURED ,
" OPAL_HWLOC_TOPO " , & tmp ) ) ) {
return rc ;
2011-09-11 19:02:24 +00:00
}
}
# endif
return OPAL_SUCCESS ;
}
2011-10-29 14:58:58 +00:00
****************************************************************
This change contains a non-mandatory modification
of the MPI-RTE interface. Anyone wishing to support
coprocessors such as the Xeon Phi may wish to add
the required definition and underlying support
****************************************************************
Add locality support for coprocessors such as the Intel Xeon Phi.
Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.
So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:
1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board
2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions
3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.
4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.
5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored.
6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.
cmr:v1.7.4:reviewer=hjelmn
This commit was SVN r29435.
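The host/board split in items 1 and 2 above can be illustrated with a small bitmask sketch. The `DEMO_*` names and bit values below are hypothetical and chosen only for the example; the real OPAL definitions may differ. The point is that "on node" becomes a two-bit condition, so the macro must compare against the full mask instead of testing for any overlap.

```c
#include <stdint.h>

typedef uint16_t demo_locality_t;

/* Hypothetical bit values, for illustration only. */
#define DEMO_PROC_ON_CLUSTER 0x0001
#define DEMO_PROC_ON_CU      0x0002
#define DEMO_PROC_ON_HOST    0x0004
#define DEMO_PROC_ON_BOARD   0x0008
/* "on node" means same host AND same board */
#define DEMO_PROC_ON_NODE    (DEMO_PROC_ON_HOST | DEMO_PROC_ON_BOARD)

#define DEMO_ON_LOCAL_HOST(n) (!!((n) & DEMO_PROC_ON_HOST))
/* Both bits must be set, so compare against the whole mask. */
#define DEMO_ON_LOCAL_NODE(n) (((n) & DEMO_PROC_ON_NODE) == DEMO_PROC_ON_NODE)

/* Locality for two procs on the same node: sharing a node implies sharing
 * the cluster and cu as well, so those bits are set too. */
static demo_locality_t demo_same_node_locality(void)
{
    return (demo_locality_t)(DEMO_PROC_ON_CLUSTER | DEMO_PROC_ON_CU |
                             DEMO_PROC_ON_NODE);
}

/* Locality for a proc on a coprocessor relative to a proc on its host:
 * same cluster, cu, and host, but a different board. */
static demo_locality_t demo_coprocessor_locality(void)
{
    return (demo_locality_t)(DEMO_PROC_ON_CLUSTER | DEMO_PROC_ON_CU |
                             DEMO_PROC_ON_HOST);
}
```

With a plain `(n) & DEMO_PROC_ON_NODE` test, the coprocessor case would wrongly report "on node" because the host bit alone makes the expression nonzero.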
2013-10-14 16:52:58 +00:00
static int opal_hwloc_base_close ( void )
{
if ( ! opal_hwloc_base_inited ) {
return OPAL_SUCCESS ;
}
# if OPAL_HAVE_HWLOC
{
int ret ;
/* no need to close the component as it was statically opened */
/* for support of tools such as ompi_info */
ret = mca_base_framework_components_close ( & opal_hwloc_base_framework , NULL ) ;
if ( OPAL_SUCCESS ! = ret ) {
return ret ;
}
/* free memory */
if ( NULL ! = opal_hwloc_my_cpuset ) {
hwloc_bitmap_free ( opal_hwloc_my_cpuset ) ;
opal_hwloc_my_cpuset = NULL ;
}
}
# endif
/* All done */
opal_hwloc_base_inited = false ;
return OPAL_SUCCESS ;
}
2012-09-26 23:24:27 +00:00
static bool fns_init = false ;
static opal_tsd_key_t print_tsd_key ;
2012-09-27 01:43:54 +00:00
char * opal_hwloc_print_null = " NULL " ;
2012-09-26 23:24:27 +00:00
static void buffer_cleanup ( void * value )
{
int i ;
opal_hwloc_print_buffers_t * ptr ;
if ( NULL ! = value ) {
ptr = ( opal_hwloc_print_buffers_t * ) value ;
for ( i = 0 ; i < OPAL_HWLOC_PRINT_NUM_BUFS ; i + + ) {
free ( ptr - > buffers [ i ] ) ;
}
}
}
opal_hwloc_print_buffers_t * opal_hwloc_get_print_buffer ( void )
{
opal_hwloc_print_buffers_t * ptr ;
int ret , i ;
if ( ! fns_init ) {
/* setup the print_args function */
if ( OPAL_SUCCESS ! = ( ret = opal_tsd_key_create ( & print_tsd_key , buffer_cleanup ) ) ) {
return NULL ;
}
fns_init = true ;
}
ret = opal_tsd_getspecific ( print_tsd_key , ( void * * ) & ptr ) ;
if ( OPAL_SUCCESS ! = ret ) return NULL ;
if ( NULL = = ptr ) {
ptr = ( opal_hwloc_print_buffers_t * ) malloc ( sizeof ( opal_hwloc_print_buffers_t ) ) ;
if ( NULL = = ptr ) {
/* out of memory - callers fall back to opal_hwloc_print_null */
return NULL ;
}
for ( i = 0 ; i < OPAL_HWLOC_PRINT_NUM_BUFS ; i + + ) {
ptr - > buffers [ i ] = ( char * ) malloc ( ( OPAL_HWLOC_PRINT_MAX_SIZE + 1 ) * sizeof ( char ) ) ;
}
ptr - > cntr = 0 ;
ret = opal_tsd_setspecific ( print_tsd_key , ( void * ) ptr ) ;
}
return ( opal_hwloc_print_buffers_t * ) ptr ;
}
char * opal_hwloc_base_print_locality ( opal_hwloc_locality_t locality )
{
opal_hwloc_print_buffers_t * ptr ;
int idx ;
ptr = opal_hwloc_get_print_buffer ( ) ;
if ( NULL = = ptr ) {
return opal_hwloc_print_null ;
}
/* cycle around the ring */
if ( OPAL_HWLOC_PRINT_NUM_BUFS = = ptr - > cntr ) {
ptr - > cntr = 0 ;
}
idx = 0 ;
if ( OPAL_PROC_ON_LOCAL_CLUSTER ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' C ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' L ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_CU ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' C ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' U ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_NODE ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' N ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_BOARD ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' B ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_NUMA ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' N ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' u ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_SOCKET ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' S ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_L3CACHE ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' L ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' 3 ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_L2CACHE ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' L ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' 2 ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_L1CACHE ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' L ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' 1 ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_CORE ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' C ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( OPAL_PROC_ON_LOCAL_HWTHREAD ( locality ) ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' H ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' w ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' t ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' : ' ;
}
if ( 0 < idx ) {
ptr - > buffers [ ptr - > cntr ] [ idx - 1 ] = ' \0 ' ;
} else if ( OPAL_PROC_NON_LOCAL & locality ) {
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' N ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' O ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' N ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' \0 ' ;
} else {
/* must be an unknown locality */
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' U ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' N ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' K ' ;
ptr - > buffers [ ptr - > cntr ] [ idx + + ] = ' \0 ' ;
}
return ptr - > buffers [ ptr - > cntr ] ;
}
2011-10-29 14:58:58 +00:00
# if OPAL_HAVE_HWLOC
static void obj_data_const ( opal_hwloc_obj_data_t * ptr )
{
ptr - > available = NULL ;
ptr - > npus = 0 ;
2011-11-15 03:40:11 +00:00
ptr - > idx = UINT_MAX ;
2012-09-14 22:01:19 +00:00
ptr - > num_bound = 0 ;
2011-10-29 14:58:58 +00:00
}
static void obj_data_dest ( opal_hwloc_obj_data_t * ptr )
{
if ( NULL ! = ptr - > available ) {
hwloc_bitmap_free ( ptr - > available ) ;
}
}
OBJ_CLASS_INSTANCE ( opal_hwloc_obj_data_t ,
opal_object_t ,
obj_data_const , obj_data_dest ) ;
static void sum_const ( opal_hwloc_summary_t * ptr )
{
ptr - > num_objs = 0 ;
ptr - > rtype = 0 ;
This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
node according to PCI locality information as reported by the BIOS,
from the NUMA node closest to the device to the furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
The mode is selected by the optional ''span'' modifier: byslot is the
default, and ''span'' selects bynode. The command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
on the first node and 2 on the second by default. But if you want to
place 5 processes on each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA node closest
to the specified device, and if successful, it will place 4 processes
on that NUMA node and the remaining two on the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It applies
if you don't specify any binding policy. But if you specify a binding
level that is "lower" than NUMA (i.e., hwthread, core, socket), it will
bind to whatever level you specify.
This commit was SVN r28552.
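The placement arithmetic in the examples above can be checked with a short sketch. This is not the mindist component itself; `demo_place_procs` and `demo_numa_capacity` are hypothetical helpers that only reproduce the counting, and they assume the job fits in the available slots (oversubscription is not modeled).

```c
/* Fill counts[] with the number of processes placed on each node.
 * span == 0 (byslot): fill each node up to its slot count before moving on.
 * span != 0 (bynode): deal processes out round-robin so the load is balanced.
 * Assumes nprocs <= nnodes * slots_per_node; in byslot mode any excess
 * processes are simply left unplaced. */
static void demo_place_procs(int nprocs, int nnodes, int slots_per_node,
                             int span, int *counts)
{
    int i;
    for (i = 0; i < nnodes; i++) {
        counts[i] = 0;
    }
    if (span) {
        for (i = 0; i < nprocs; i++) {
            counts[i % nnodes]++;
        }
    } else {
        int left = nprocs;
        for (i = 0; i < nnodes && left > 0; i++) {
            int n = left < slots_per_node ? left : slots_per_node;
            counts[i] = n;
            left -= n;
        }
    }
}

/* Number of processes that fit on one NUMA node when each process needs
 * cpus_per_proc cpus: available cpus divided by cpus per rank. */
static int demo_numa_capacity(int available_cpus, int cpus_per_proc)
{
    return cpus_per_proc > 0 ? available_cpus / cpus_per_proc : 0;
}
```

The same byslot fill applies one level down: treating the two 4-core NUMA nodes as "nodes" ordered by distance to the device gives the 4 + 2 split for 6 processes.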
2013-05-22 13:04:40 +00:00
OBJ_CONSTRUCT ( & ptr - > sorted_by_dist_list , opal_list_t ) ;
}
static void sum_dest ( opal_hwloc_summary_t * ptr )
{
opal_list_item_t * item ;
while ( NULL ! = ( item = opal_list_remove_first ( & ptr - > sorted_by_dist_list ) ) ) {
OBJ_RELEASE ( item ) ;
}
OBJ_DESTRUCT ( & ptr - > sorted_by_dist_list ) ;
2011-10-29 14:58:58 +00:00
}
OBJ_CLASS_INSTANCE ( opal_hwloc_summary_t ,
opal_list_item_t ,
2013-05-22 13:04:40 +00:00
sum_const , sum_dest ) ;
2011-10-29 14:58:58 +00:00
static void topo_data_const ( opal_hwloc_topo_data_t * ptr )
{
ptr - > available = NULL ;
OBJ_CONSTRUCT ( & ptr - > summaries , opal_list_t ) ;
2012-08-02 16:29:44 +00:00
ptr - > userdata = NULL ;
2011-10-29 14:58:58 +00:00
}
static void topo_data_dest ( opal_hwloc_topo_data_t * ptr )
{
opal_list_item_t * item ;
if ( NULL ! = ptr - > available ) {
hwloc_bitmap_free ( ptr - > available ) ;
}
while ( NULL ! = ( item = opal_list_remove_first ( & ptr - > summaries ) ) ) {
OBJ_RELEASE ( item ) ;
}
OBJ_DESTRUCT ( & ptr - > summaries ) ;
2012-08-02 16:29:44 +00:00
ptr - > userdata = NULL ;
2011-10-29 14:58:58 +00:00
}
OBJ_CLASS_INSTANCE ( opal_hwloc_topo_data_t ,
opal_object_t ,
topo_data_const ,
topo_data_dest ) ;
2013-05-22 13:04:40 +00:00
2014-07-31 19:58:47 +00:00
OBJ_CLASS_INSTANCE ( opal_rmaps_numa_node_t ,
2013-05-22 13:04:40 +00:00
opal_list_item_t ,
NULL ,
NULL ) ;
2013-12-20 20:42:39 +00:00
int opal_hwloc_base_set_binding_policy ( opal_binding_policy_t * policy , char * spec )
{
int i ;
opal_binding_policy_t tmp ;
char * * tmpvals , * * quals ;
/* set default */
tmp = 0 ;
/* binding specification */
if ( NULL = = spec ) {
/* default to bind-to core, and record that no binding policy was given */
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_CORE ) ;
tmp & = ~ OPAL_BIND_GIVEN ;
} else if ( 0 = = strncasecmp ( spec , " none " , strlen ( " none " ) ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_NONE ) ;
} else {
tmpvals = opal_argv_split ( spec , ' : ' ) ;
2014-06-14 15:38:32 +00:00
if ( 1 < opal_argv_count ( tmpvals ) | | ' : ' = = spec [ 0 ] ) {
if ( ' : ' = = spec [ 0 ] ) {
quals = opal_argv_split ( & spec [ 1 ] , ' , ' ) ;
} else {
quals = opal_argv_split ( tmpvals [ 1 ] , ' , ' ) ;
}
2013-12-20 20:42:39 +00:00
for ( i = 0 ; NULL ! = quals [ i ] ; i + + ) {
2014-06-08 20:26:59 +00:00
if ( 0 = = strncasecmp ( quals [ i ] , " if-supported " , strlen ( quals [ i ] ) ) ) {
2013-12-20 20:42:39 +00:00
tmp | = OPAL_BIND_IF_SUPPORTED ;
2014-06-08 20:26:59 +00:00
} else if ( 0 = = strncasecmp ( quals [ i ] , " overload-allowed " , strlen ( quals [ i ] ) ) | |
0 = = strncasecmp ( quals [ i ] , " oversubscribe-allowed " , strlen ( quals [ i ] ) ) ) {
2013-12-20 20:42:39 +00:00
tmp | = OPAL_BIND_ALLOW_OVERLOAD ;
} else {
/* unknown option */
2014-06-08 20:26:59 +00:00
opal_output ( 0 , " Unknown qualifier to binding policy: %s " , spec ) ;
2013-12-20 20:42:39 +00:00
opal_argv_free ( quals ) ;
opal_argv_free ( tmpvals ) ;
return OPAL_ERR_BAD_PARAM ;
}
}
opal_argv_free ( quals ) ;
}
2014-06-14 15:38:32 +00:00
if ( NULL = = tmpvals [ 0 ] | | ' : ' = = spec [ 0 ] ) {
2013-12-20 20:42:39 +00:00
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_CORE ) ;
2014-06-08 20:26:59 +00:00
tmp & = ~ OPAL_BIND_GIVEN ;
2013-12-20 20:42:39 +00:00
} else {
2014-06-08 20:26:59 +00:00
if ( 0 = = strcasecmp ( tmpvals [ 0 ] , " hwthread " ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_HWTHREAD ) ;
} else if ( 0 = = strcasecmp ( tmpvals [ 0 ] , " core " ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_CORE ) ;
} else if ( 0 = = strcasecmp ( tmpvals [ 0 ] , " l1cache " ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_L1CACHE ) ;
} else if ( 0 = = strcasecmp ( tmpvals [ 0 ] , " l2cache " ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_L2CACHE ) ;
} else if ( 0 = = strcasecmp ( tmpvals [ 0 ] , " l3cache " ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_L3CACHE ) ;
} else if ( 0 = = strcasecmp ( tmpvals [ 0 ] , " socket " ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_SOCKET ) ;
} else if ( 0 = = strcasecmp ( tmpvals [ 0 ] , " numa " ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_NUMA ) ;
} else if ( 0 = = strcasecmp ( tmpvals [ 0 ] , " board " ) ) {
OPAL_SET_BINDING_POLICY ( tmp , OPAL_BIND_TO_BOARD ) ;
} else {
opal_show_help ( " help-opal-hwloc-base.txt " , " invalid binding_policy " , true , " binding " , spec ) ;
opal_argv_free ( tmpvals ) ;
return OPAL_ERR_BAD_PARAM ;
}
2013-12-20 20:42:39 +00:00
}
opal_argv_free ( tmpvals ) ;
}
* policy = tmp ;
return OPAL_SUCCESS ;
}
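The spec format handled by the function above, `<level>[:<qualifier>[,<qualifier>...]]`, can be exercised in isolation with the sketch below. It is a standalone approximation using only standard C and POSIX `strncasecmp`, not the OPAL argv helpers; the `demo_*` names and flag values are hypothetical. Like the original, it compares only as many characters as the user typed, so an abbreviation such as `if` is accepted for `if-supported`. Unlike the original, it rejects an empty qualifier.

```c
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

#define DEMO_IF_SUPPORTED   0x1   /* hypothetical flag values */
#define DEMO_ALLOW_OVERLOAD 0x2

/* Map one qualifier of length n to its flag bit, or -1 if unknown.
 * Comparing only n characters lets any prefix act as an abbreviation. */
static int demo_qualifier_bit(const char *q, size_t n)
{
    if (0 == n) {
        return -1;
    }
    if (0 == strncasecmp(q, "if-supported", n)) {
        return DEMO_IF_SUPPORTED;
    }
    if (0 == strncasecmp(q, "overload-allowed", n) ||
        0 == strncasecmp(q, "oversubscribe-allowed", n)) {
        return DEMO_ALLOW_OVERLOAD;
    }
    return -1;
}

/* Split spec into the binding level (copied into level[]) and the OR of its
 * qualifier flags. Returns 0 on success, -1 on an unknown qualifier or if
 * the level does not fit. An empty level (spec starts with ':') is returned
 * as "" so the caller can apply its default. */
static int demo_parse_spec(const char *spec, char *level, size_t level_len,
                           int *flags)
{
    const char *colon = strchr(spec, ':');
    size_t n = (NULL != colon) ? (size_t)(colon - spec) : strlen(spec);
    const char *p;

    if (n >= level_len) {
        return -1;
    }
    memcpy(level, spec, n);
    level[n] = '\0';
    *flags = 0;
    if (NULL == colon) {
        return 0;
    }
    p = colon + 1;
    while ('\0' != *p) {
        const char *comma = strchr(p, ',');
        size_t len = (NULL != comma) ? (size_t)(comma - p) : strlen(p);
        int bit = demo_qualifier_bit(p, len);
        if (bit < 0) {
            return -1;
        }
        *flags |= bit;
        p += len;
        if (',' == *p) {
            p++;   /* skip the separator */
        }
    }
    return 0;
}
```

Because the sketch works on the caller's string in place, there is nothing to free on the error path, which sidesteps the cleanup that the argv-based version has to do before returning.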
2011-10-29 14:58:58 +00:00
# endif