2008-02-28 04:57:57 +03:00
|
|
|
/*
|
2010-03-13 02:57:50 +03:00
|
|
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
2008-02-28 04:57:57 +03:00
|
|
|
* University Research and Technology
|
|
|
|
* Corporation. All rights reserved.
|
2011-06-24 00:38:02 +04:00
|
|
|
* Copyright (c) 2004-2011 The University of Tennessee and The University
|
2008-02-28 04:57:57 +03:00
|
|
|
* of Tennessee Research Foundation. All rights
|
|
|
|
* reserved.
|
2015-06-24 06:59:57 +03:00
|
|
|
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
|
2008-02-28 04:57:57 +03:00
|
|
|
* University of Stuttgart. All rights reserved.
|
|
|
|
* Copyright (c) 2004-2005 The Regents of the University of California.
|
|
|
|
* All rights reserved.
|
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps some initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. Both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed, as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. VM launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
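The default described in item 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `demo_*` names and the tri-state override are hypothetical and are not ORTE's actual API; only the policy itself (hostfile/dash-host allows oversubscription by default, RM-allocated does not, and a flag can override either) comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state for a user-supplied override flag (not an ORTE type). */
typedef enum {
    DEMO_OVERSUB_UNSET = 0,  /* no flag given: fall back to the default */
    DEMO_OVERSUB_ALLOW,      /* user explicitly allowed oversubscription */
    DEMO_OVERSUB_FORBID      /* user explicitly forbade oversubscription */
} demo_oversub_flag_t;

/* Return true if a node may run more processes than it has slots. */
bool demo_oversubscribe_allowed(bool managed_allocation,
                                demo_oversub_flag_t user_flag)
{
    if (DEMO_OVERSUB_ALLOW == user_flag) {
        return true;
    }
    if (DEMO_OVERSUB_FORBID == user_flag) {
        return false;
    }
    /* Default: hostfile/dash-host nodes allow it, RM-allocated nodes do not. */
    return !managed_allocation;
}

/* Decide whether one more process may be mapped onto a node. */
bool demo_can_add_proc(int slots, int procs_on_node,
                       bool managed_allocation,
                       demo_oversub_flag_t user_flag)
{
    assert(slots >= 0 && procs_on_node >= 0);
    if (procs_on_node < slots) {
        return true;  /* a free slot remains, so no oversubscription occurs */
    }
    return demo_oversubscribe_allowed(managed_allocation, user_flag);
}
```

With these hypothetical helpers, a node with 2 slots and 2 processes accepts a third process when the allocation came from a hostfile, and rejects it when the allocation came from a resource manager, unless the override is set.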
2011-11-15 07:40:11 +04:00
|
|
|
* Copyright (c) 2007-2011 Cisco Systems, Inc. All rights reserved.
|
2010-08-09 23:28:56 +04:00
|
|
|
* Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved.
|
2013-01-20 04:33:42 +04:00
|
|
|
* Copyright (c) 2011-2013 Los Alamos National Security, LLC.
|
2012-04-06 18:23:13 +04:00
|
|
|
* All rights reserved.
|
2015-01-27 05:15:57 +03:00
|
|
|
* Copyright (c) 2013-2015 Intel, Inc. All rights reserved
|
2015-05-08 03:17:00 +03:00
|
|
|
* Copyright (c) 2014-2015 Research Organization for Information Science
|
2014-11-19 11:21:43 +03:00
|
|
|
* and Technology (RIST). All rights reserved.
|
2008-02-28 04:57:57 +03:00
|
|
|
* $COPYRIGHT$
|
2015-06-24 06:59:57 +03:00
|
|
|
*
|
2008-02-28 04:57:57 +03:00
|
|
|
* Additional copyrights may follow
|
2015-06-24 06:59:57 +03:00
|
|
|
*
|
2008-02-28 04:57:57 +03:00
|
|
|
* $HEADER$
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include "orte_config.h"
|
|
|
|
#include "orte/constants.h"
|
|
|
|
#include "orte/types.h"
|
|
|
|
|
|
|
|
#ifdef HAVE_SYS_TIME_H
|
|
|
|
#include <sys/time.h>
|
|
|
|
#endif
|
|
|
|
|
2011-11-15 07:40:11 +04:00
|
|
|
#include "opal/mca/hwloc/hwloc.h"
|
2015-06-18 19:53:20 +03:00
|
|
|
#include "opal/mca/pmix/pmix.h"
|
2008-09-01 21:49:31 +04:00
|
|
|
#include "opal/util/argv.h"
|
2009-02-14 05:26:12 +03:00
|
|
|
#include "opal/util/output.h"
|
2013-10-15 02:01:48 +04:00
|
|
|
#include "opal/class/opal_hash_table.h"
|
2008-02-28 08:32:23 +03:00
|
|
|
#include "opal/class/opal_pointer_array.h"
|
2009-06-17 06:54:20 +04:00
|
|
|
#include "opal/class/opal_value_array.h"
|
2008-02-28 04:57:57 +03:00
|
|
|
#include "opal/dss/dss.h"
|
2010-04-23 08:44:41 +04:00
|
|
|
#include "opal/threads/threads.h"
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves much like opal_output'ing to
stream 0 did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
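The duplicate suppression described for orte_show_help() can be sketched as below. This is a minimal illustration, not the HNP's implementation: the `demo_*` types, the fixed-size table, and the string key are all assumptions made for the example, and the timer that would trigger the periodic summary (roughly every 5 seconds, per the message above) is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_MAX_TOPICS 16
#define DEMO_KEY_LEN    64

/* One help message already shown, plus duplicates seen since the last summary. */
typedef struct {
    char key[DEMO_KEY_LEN];  /* hypothetical key, e.g. "help-file.txt:topic" */
    int  pending;            /* copies received but not yet reported */
} demo_help_entry_t;

typedef struct {
    demo_help_entry_t entries[DEMO_MAX_TOPICS];
    int count;
} demo_help_table_t;

/* Record one arriving help message.
 * Returns true if it should be displayed in full (first time seen, or the
 * table is full and it cannot be tracked); false if it is a duplicate. */
bool demo_help_arrived(demo_help_table_t *t, const char *key)
{
    assert(t != NULL && key != NULL);
    for (int i = 0; i < t->count; i++) {
        if (0 == strncmp(t->entries[i].key, key, DEMO_KEY_LEN - 1)) {
            t->entries[i].pending++;
            return false;
        }
    }
    if (t->count < DEMO_MAX_TOPICS) {
        strncpy(t->entries[t->count].key, key, DEMO_KEY_LEN - 1);
        t->entries[t->count].key[DEMO_KEY_LEN - 1] = '\0';
        t->entries[t->count].pending = 0;
        t->count++;
    }
    return true;
}

/* Meant to be called from a periodic timer: returns how many new copies of
 * the message arrived since the last call and resets that counter. */
int demo_help_flush(demo_help_table_t *t, const char *key)
{
    assert(t != NULL && key != NULL);
    for (int i = 0; i < t->count; i++) {
        if (0 == strncmp(t->entries[i].key, key, DEMO_KEY_LEN - 1)) {
            int n = t->entries[i].pending;
            t->entries[i].pending = 0;
            return n;
        }
    }
    return 0;
}
```

A caller would print the full text when demo_help_arrived() returns true, and print a short "X new copies" line whenever demo_help_flush() returns a nonzero count.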
2008-05-14 00:00:55 +04:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
#include "orte/mca/errmgr/errmgr.h"
|
2010-04-23 08:44:41 +04:00
|
|
|
#include "orte/mca/rml/rml.h"
|
2009-05-04 15:07:40 +04:00
|
|
|
#include "orte/util/proc_info.h"
|
The current errmgr.register_callback API takes a jobid as one of its arguments. The intent was to have the errmgr check the jobid of the job being reported to it and, if it matches the jobid that was registered, call the specified callback function.
Unfortunately, we assign the jobid during the plm.spawn procedure - which means it happens -after- control of the job has passed out of the range of mpirun (or whatever program is spawning the job), so it is too late for that main program to register a callback function. If the main program registers the callback -after- we return from plm.spawn, then it (a) cannot get a callback for failed-to-start, and (b) will miss the callback if a proc aborts in the time between job launch and the call to errmgr.register_callback.
This commit fixes the problem by adding callback-related fields to the orte_job_t object. Thus, the main program can specify what job states should initiate a callback, what function is to be called, and what data is to be passed back by simply filling in the orte_job_t fields prior to calling plm.spawn.
Also, fully implement the "copy" function for the orte_job_t object.
NOTE: as a result of this change, the errmgr.register_callback API may no longer be of any value.
This commit was SVN r21200.
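The idea of carrying the callback in the job object itself can be sketched like this. The struct layout, field names, and state values below are hypothetical and chosen only for illustration; they are not the real orte_job_t fields. The point shown is that a callback filled in before the spawn call is already in place when a failed-to-start is reported, even though the jobid is only assigned during the spawn.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical job states, usable as a bit mask. */
typedef unsigned int demo_job_state_t;
#define DEMO_JOB_STATE_FAILED_TO_START 0x01u
#define DEMO_JOB_STATE_ABORTED         0x02u
#define DEMO_JOB_STATE_TERMINATED      0x04u

struct demo_job;
typedef void (*demo_job_cbfunc_t)(struct demo_job *job,
                                  demo_job_state_t state, void *cbdata);

/* Hypothetical job object carrying its own callback information. */
typedef struct demo_job {
    int               jobid;      /* assigned inside demo_spawn() */
    demo_job_state_t  cb_states;  /* states that should trigger the callback */
    demo_job_cbfunc_t cbfunc;     /* function to call, or NULL */
    void             *cbdata;     /* opaque data handed back to the caller */
} demo_job_t;

/* Report a state change; fire the callback if the caller asked for it. */
void demo_report_state(demo_job_t *job, demo_job_state_t state)
{
    assert(job != NULL);
    if (NULL != job->cbfunc && 0 != (job->cb_states & state)) {
        job->cbfunc(job, state, job->cbdata);
    }
}

/* Stand-in for a spawn call: assigns the jobid, then launches.
 * launch_ok == 0 simulates a failed-to-start. */
int demo_spawn(demo_job_t *job, int launch_ok)
{
    static int next_jobid = 1;
    assert(job != NULL);
    job->jobid = next_jobid++;
    if (!launch_ok) {
        demo_report_state(job, DEMO_JOB_STATE_FAILED_TO_START);
        return -1;
    }
    return 0;
}

/* Example callback: store the reported state where cbdata points. */
void demo_record_state(demo_job_t *job, demo_job_state_t state, void *cbdata)
{
    (void)job;
    *(demo_job_state_t *)cbdata = state;
}
```

Because the callback fields travel with the job object, no separate registration keyed by jobid is needed, which is why the message above suggests the register_callback API may no longer be of value.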
2009-05-11 07:38:15 +04:00
|
|
|
#include "orte/util/name_fns.h"
|
2008-02-28 04:57:57 +03:00
|
|
|
|
|
|
|
#include "orte/runtime/runtime.h"
|
2008-06-18 07:15:56 +04:00
|
|
|
#include "orte/runtime/runtime_internals.h"
|
2008-02-28 04:57:57 +03:00
|
|
|
#include "orte/runtime/orte_globals.h"
|
|
|
|
|
|
|
|
/* need the data type support functions here */
|
|
|
|
#include "orte/runtime/data_type_support/orte_dt_support.h"
|
|
|
|
|
2012-04-11 01:50:01 +04:00
|
|
|
/* State Machine */
|
2015-05-08 03:17:00 +03:00
|
|
|
opal_list_t orte_job_states = {{0}};
|
|
|
|
opal_list_t orte_proc_states = {{0}};
|
2012-04-11 01:50:01 +04:00
|
|
|
|
|
|
|
/* a clean output channel without prefix */
|
|
|
|
int orte_clean_output = -1;
|
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
/* globals used by RTE */
|
2008-03-28 05:20:37 +03:00
|
|
|
bool orte_debug_daemons_file_flag = false;
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_leave_session_attached = false;
|
2008-04-17 17:50:59 +04:00
|
|
|
bool orte_do_not_launch = false;
|
2008-03-28 05:20:37 +03:00
|
|
|
bool orted_spin_flag = false;
|
2010-08-09 23:28:56 +04:00
|
|
|
char *orte_local_cpu_type = NULL;
|
2009-12-01 02:11:25 +03:00
|
|
|
char *orte_local_cpu_model = NULL;
|
2010-07-18 01:03:27 +04:00
|
|
|
char *orte_basename = NULL;
|
****************************************************************
This change contains a non-mandatory modification
of the MPI-RTE interface. Anyone wishing to support
coprocessors such as the Xeon Phi may wish to add
the required definition and underlying support
****************************************************************
Add locality support for coprocessors such as the Intel Xeon Phi.
Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.
So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:
1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board
2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions
3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.
4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.
5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored.
6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.
cmr:v1.7.4:reviewer=hjelmn
This commit was SVN r29435.
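Items 1 and 2 above (a separate "on host" flag, with "on node" meaning host AND board) can be sketched with bit flags. The `DEMO_*` names and numeric values are assumptions for illustration and are not claimed to match the real OPAL_PROC_ON_* definitions; the sketch also assumes, following item 6, that sharing a host implies sharing the cluster and cu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical locality bits. */
typedef uint16_t demo_locality_t;
#define DEMO_PROC_ON_CLUSTER 0x0001
#define DEMO_PROC_ON_CU      0x0002
#define DEMO_PROC_ON_HOST    0x0004
#define DEMO_PROC_ON_BOARD   0x0008
/* "On node" requires sharing both the host and the board. */
#define DEMO_PROC_ON_NODE    (DEMO_PROC_ON_HOST | DEMO_PROC_ON_BOARD)

/* True if the two procs share a host (possibly on different boards). */
bool demo_on_local_host(demo_locality_t loc)
{
    return 0 != (loc & DEMO_PROC_ON_HOST);
}

/* True only if BOTH the host and board bits are set, which is why the
 * check compares against the full mask instead of testing for any bit. */
bool demo_on_local_node(demo_locality_t loc)
{
    return DEMO_PROC_ON_NODE == (loc & DEMO_PROC_ON_NODE);
}

/* Build a locality value from two observations about a peer. */
demo_locality_t demo_locality(bool same_host, bool same_board)
{
    demo_locality_t loc = 0;
    if (same_host) {
        /* sharing a host implies sharing the enclosing cu and cluster */
        loc |= DEMO_PROC_ON_HOST | DEMO_PROC_ON_CU | DEMO_PROC_ON_CLUSTER;
        if (same_board) {
            loc |= DEMO_PROC_ON_BOARD;
        }
    }
    return loc;
}
```

In this sketch, a proc on a coprocessor and a proc on its host would get demo_locality(true, false): demo_on_local_host() is true while demo_on_local_node() is false.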
2013-10-14 20:52:58 +04:00
|
|
|
bool orte_coprocessors_detected = false;
|
2013-10-24 04:08:47 +04:00
|
|
|
opal_hash_table_t *orte_coprocessors = NULL;
|
2014-12-09 02:33:45 +03:00
|
|
|
char *orte_topo_signature = NULL;
|
2009-08-21 22:03:34 +04:00
|
|
|
|
|
|
|
/* ORTE OOB port flags */
|
2008-03-28 05:20:37 +03:00
|
|
|
bool orte_static_ports = false;
|
2009-08-21 22:03:34 +04:00
|
|
|
char *orte_oob_static_ports = NULL;
|
2009-08-22 06:58:20 +04:00
|
|
|
bool orte_standalone_operation = false;
|
2009-08-21 22:03:34 +04:00
|
|
|
|
2008-04-02 00:32:17 +04:00
|
|
|
bool orte_keep_fqdn_hostnames = false;
|
2011-12-01 18:24:43 +04:00
|
|
|
bool orte_have_fqdn_allocation = false;
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_show_resolved_nodenames = false;
|
|
|
|
bool orte_retain_aliases = false;
|
|
|
|
int orte_use_hostname_alias = -1;
|
2012-11-16 08:04:29 +04:00
|
|
|
|
2015-05-08 03:17:00 +03:00
|
|
|
int orted_debug_failure = -1;
|
|
|
|
int orted_debug_failure_delay = -1;
|
2008-06-24 21:50:56 +04:00
|
|
|
bool orte_hetero_apps = false;
|
2011-11-01 22:43:10 +04:00
|
|
|
bool orte_hetero_nodes = false;
|
2008-08-19 19:19:30 +04:00
|
|
|
bool orte_never_launched = false;
|
2008-09-23 19:46:34 +04:00
|
|
|
bool orte_devel_level_output = false;
|
2011-10-29 19:12:45 +04:00
|
|
|
bool orte_display_topo_with_map = false;
|
2011-11-03 18:22:07 +04:00
|
|
|
bool orte_display_diffable_output = false;
|
2008-04-10 02:10:53 +04:00
|
|
|
|
2015-05-08 03:17:00 +03:00
|
|
|
char **orte_launch_environ = NULL;
|
2008-07-25 21:13:22 +04:00
|
|
|
|
|
|
|
bool orte_hnp_is_allocated = false;
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_allocation_required = false;
|
2012-09-04 20:34:05 +04:00
|
|
|
bool orte_managed_allocation = false;
|
If (and only if) a user requests, set the default number of slots on any node to the number of objects of the specified type. This *only* takes effect in an unmanaged environment - i.e., if an external resource manager assigns us a number of slots, then that is what we use. However, if we are using a hostfile, then the user may or may not have given us a value for the number of slots on each node.
For those nodes (and *only* those nodes) where the user does *not* specify a slot count, we will set the number of slots according to their direction: either to the number of cores, numas, sockets, or hwthreads. Otherwise, the slot count is set to 1.
Note that the default behavior remains unchanged: in the absence of any value for #slots, and in the absence of any directive to set #slots, we will set #slots=1.
This commit was SVN r27236.
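The slot-count rule described above might look roughly like the following. The function, the topology struct, and the directive strings are assumptions made for this sketch (the message names the object types but not how the directive is spelled), and unrecognized directives are assumed to fall back to 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-node object counts, as a topology library might report. */
typedef struct {
    int cores;
    int numas;
    int sockets;
    int hwthreads;
} demo_topo_t;

/* Compute the slot count for one node.
 *   managed     - true if a resource manager assigned the slots
 *   slots_given - true if the user gave an explicit count in the hostfile
 *   slots       - the RM-assigned or user-given count (if any)
 *   set_slots   - user directive ("cores", "numas", "sockets", "hwthreads")
 *                 or NULL when no directive was given
 */
int demo_default_slots(bool managed, bool slots_given, int slots,
                       const char *set_slots, const demo_topo_t *topo)
{
    assert(topo != NULL);
    if (managed || slots_given) {
        return slots;  /* an explicit value always wins */
    }
    if (NULL == set_slots) {
        return 1;      /* unchanged default: no value, no directive */
    }
    if (0 == strcmp(set_slots, "cores")) {
        return topo->cores;
    }
    if (0 == strcmp(set_slots, "numas")) {
        return topo->numas;
    }
    if (0 == strcmp(set_slots, "sockets")) {
        return topo->sockets;
    }
    if (0 == strcmp(set_slots, "hwthreads")) {
        return topo->hwthreads;
    }
    return 1;          /* assumption: unknown directive behaves like none */
}
```

For a hostfile node with 2 sockets and 8 cores and no slot count, this sketch yields 8 slots for "cores", 2 for "sockets", and 1 when no directive is given.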
2012-09-05 00:58:26 +04:00
|
|
|
char *orte_set_slots = NULL;
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_display_allocation = false;
|
|
|
|
bool orte_display_devel_allocation = false;
|
2012-09-05 22:42:09 +04:00
|
|
|
bool orte_soft_locations = false;
|
2014-06-01 08:28:17 +04:00
|
|
|
int orted_pmi_version = 0;
|
2008-07-25 21:13:22 +04:00
|
|
|
|
2011-06-30 07:12:38 +04:00
|
|
|
/* launch agents */
|
2009-02-13 07:14:10 +03:00
|
|
|
char *orte_launch_agent = NULL;
|
2008-02-28 04:57:57 +03:00
|
|
|
char **orted_cmd_line = NULL;
|
2011-06-30 07:12:38 +04:00
|
|
|
char **orte_fork_agent = NULL;
|
2008-08-05 19:09:29 +04:00
|
|
|
|
2010-10-23 00:07:24 +04:00
|
|
|
/* debugger job */
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_debugger_dump_proctable = false;
|
|
|
|
char *orte_debugger_test_daemon = NULL;
|
|
|
|
bool orte_debugger_test_attach = false;
|
|
|
|
int orte_debugger_check_rate = -1;
|
2008-08-13 21:47:24 +04:00
|
|
|
|
2010-07-18 01:03:27 +04:00
|
|
|
/* exit flags */
|
2008-02-28 04:57:57 +03:00
|
|
|
int orte_exit_status = 0;
|
|
|
|
bool orte_abnormal_term_ordered = false;
|
2011-12-15 21:13:52 +04:00
|
|
|
bool orte_routing_is_enabled = true;
|
2009-02-27 13:16:25 +03:00
|
|
|
bool orte_job_term_ordered = false;
|
2010-05-23 06:57:03 +04:00
|
|
|
bool orte_orteds_term_ordered = false;
|
2012-11-10 18:09:12 +04:00
|
|
|
bool orte_allowed_exit_without_sync = false;
|
2008-06-03 01:46:34 +04:00
|
|
|
|
2015-05-08 03:17:00 +03:00
|
|
|
int orte_startup_timeout = -1;
|
|
|
|
int orte_timeout_usec_per_proc = -1;
|
|
|
|
float orte_max_timeout = -1.0;
|
2013-12-07 05:58:32 +04:00
|
|
|
orte_timer_t *orte_mpiexec_timeout = NULL;
|
2008-02-28 04:57:57 +03:00
|
|
|
|
2008-05-01 23:19:34 +04:00
|
|
|
opal_buffer_t *orte_tree_launch_cmd = NULL;
|
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
/* global arrays for data storage */
|
2015-05-08 03:17:00 +03:00
|
|
|
opal_pointer_array_t *orte_job_data = NULL;
|
|
|
|
opal_pointer_array_t *orte_node_pool = NULL;
|
|
|
|
opal_pointer_array_t *orte_node_topologies = NULL;
|
|
|
|
opal_pointer_array_t *orte_local_children = NULL;
|
2013-11-14 21:01:43 +04:00
|
|
|
orte_vpid_t orte_total_procs = 0;
|
2008-02-28 04:57:57 +03:00
|
|
|
|
2009-01-31 01:47:30 +03:00
|
|
|
/* IOF controls */
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_tag_output = false;
|
|
|
|
bool orte_timestamp_output = false;
|
|
|
|
char *orte_output_filename = NULL;
|
2009-01-31 01:47:30 +03:00
|
|
|
/* generate new xterm windows to display output from specified ranks */
|
2015-05-08 03:17:00 +03:00
|
|
|
char *orte_xterm = NULL;
|
2009-01-31 01:47:30 +03:00
|
|
|
|
2009-01-30 21:50:10 +03:00
|
|
|
/* whether or not to forward SIGTSTP and SIGCONT signals */
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_forward_job_control = false;
|
2009-01-30 21:50:10 +03:00
|
|
|
|
2009-06-03 03:52:02 +04:00
|
|
|
/* report launch progress */
|
|
|
|
bool orte_report_launch_progress = false;
|
|
|
|
|
2009-08-11 06:51:27 +04:00
|
|
|
/* allocation specification */
|
2009-08-13 20:08:43 +04:00
|
|
|
char *orte_default_hostfile = NULL;
|
2012-02-15 08:16:05 +04:00
|
|
|
bool orte_default_hostfile_given = false;
|
2011-11-15 07:40:11 +04:00
|
|
|
char *orte_rankfile = NULL;
|
2011-07-07 22:54:30 +04:00
|
|
|
int orte_num_allocated_nodes = 0;
|
|
|
|
char *orte_node_regex = NULL;
|
2009-08-11 06:51:27 +04:00
|
|
|
|
2009-09-09 09:28:45 +04:00
|
|
|
/* tool communication controls */
|
|
|
|
bool orte_report_events = false;
|
|
|
|
char *orte_report_events_uri = NULL;
|
|
|
|
|
2009-09-28 07:17:15 +04:00
|
|
|
/* report bindings */
|
|
|
|
bool orte_report_bindings = false;
|
|
|
|
|
2010-01-14 20:59:42 +03:00
|
|
|
/* barrier control */
|
|
|
|
bool orte_do_not_barrier = false;
|
|
|
|
|
2010-04-28 08:06:57 +04:00
|
|
|
/* process recovery */
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_enable_recovery = false;
|
|
|
|
int32_t orte_max_restarts = 0;
|
2010-04-28 08:06:57 +04:00
|
|
|
|
2010-05-12 22:11:58 +04:00
|
|
|
/* exit status reporting */
|
2015-05-08 03:17:00 +03:00
|
|
|
bool orte_report_child_jobs_separately = false;
|
|
|
|
struct timeval orte_child_time_to_exit = {0};
|
|
|
|
bool orte_abort_non_zero_exit = false;
|
2010-05-12 22:11:58 +04:00
|
|
|
|
2011-06-30 07:12:38 +04:00
|
|
|
/* length of stat history to keep */
|
2015-05-08 03:17:00 +03:00
|
|
|
int orte_stat_history_size = -1;
|
2011-06-30 07:12:38 +04:00
|
|
|
|
2011-12-07 01:31:22 +04:00
|
|
|
/* envars to forward */
|
2013-12-20 18:47:35 +04:00
|
|
|
char **orte_forwarded_envars = NULL;
|
2011-12-07 01:31:22 +04:00
|
|
|
|
2012-05-03 01:00:22 +04:00
|
|
|
/* map-reduce mode */
|
|
|
|
bool orte_map_reduce = false;
|
2012-11-10 18:09:12 +04:00
|
|
|
bool orte_staged_execution = false;
|
2012-05-03 01:00:22 +04:00
|
|
|
|
2012-04-27 18:39:34 +04:00
|
|
|
/* map stddiag output to stderr so it isn't forwarded to mpirun */
|
|
|
|
bool orte_map_stddiag_to_stderr = false;
|
|
|
|
|
2012-05-27 20:48:19 +04:00
|
|
|
/* maximum size of virtual machine - used to subdivide allocation */
|
|
|
|
int orte_max_vm_size = -1;
|
|
|
|
|
2013-03-28 01:09:41 +04:00
|
|
|
/* user debugger */
|
|
|
|
char *orte_base_user_debugger = NULL;
|
|
|
|
|
Per the PMIx RFC:
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “pmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
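The signature-based constraint described above (one active collective per unique combination of procs, while a single proc may take part in several) can be sketched as follows. All `demo_*` types and the fixed-size tracker are hypothetical, and the sketch assumes participants are listed in an agreed order so that two signatures can be compared element by element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical process name: (jobid, vpid). */
typedef struct {
    unsigned jobid;
    unsigned vpid;
} demo_name_t;

/* A collective's signature: the names of the participating procs. */
typedef struct {
    const demo_name_t *procs;
    size_t n;
} demo_sig_t;

#define DEMO_MAX_ACTIVE 8

/* Signatures of the collectives currently in progress. */
typedef struct {
    demo_sig_t active[DEMO_MAX_ACTIVE];
    size_t count;
} demo_tracker_t;

/* Two signatures match when they name the same procs in the same order. */
bool demo_sig_equal(const demo_sig_t *a, const demo_sig_t *b)
{
    assert(a != NULL && b != NULL);
    if (a->n != b->n) {
        return false;
    }
    for (size_t i = 0; i < a->n; i++) {
        if (a->procs[i].jobid != b->procs[i].jobid ||
            a->procs[i].vpid != b->procs[i].vpid) {
            return false;
        }
    }
    return true;
}

/* Start a collective. Fails if the same combination of procs already has
 * one in progress, or if the (demo-sized) tracker is full. */
bool demo_coll_begin(demo_tracker_t *t, demo_sig_t sig)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (demo_sig_equal(&t->active[i], &sig)) {
            return false;
        }
    }
    if (t->count >= DEMO_MAX_ACTIVE) {
        return false;
    }
    t->active[t->count++] = sig;
    return true;
}

/* Mark a collective complete so the same combination can be reused. */
void demo_coll_end(demo_tracker_t *t, demo_sig_t sig)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (demo_sig_equal(&t->active[i], &sig)) {
            t->active[i] = t->active[t->count - 1];
            t->count--;
            return;
        }
    }
}
```

Note that a proc appearing in two different signatures does not block either collective in this sketch; only an identical combination does.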
2014-08-21 22:56:47 +04:00
|
|
|
/* modex cutoff */
|
2014-10-09 08:15:31 +04:00
|
|
|
uint32_t orte_direct_modex_cutoff = UINT32_MAX;
|
Per the PMIx RFC:
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “lmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* Created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives; it is the specific combination of procs that is subject to the constraint.
* Removed the prior OMPI/OPAL modex code.
* Added new macros for executing modex send/recv to simplify use of the new APIs. The send macros let the caller specify whether the BTL supports async modex operations. If it does, and the active PMIx component supports it, the non-blocking “fence” operation is used. Otherwise, the default is the full blocking modex exchange we currently perform.
* Retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand.
This commit was SVN r32570.
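The signature idea in the commit message above can be illustrated with a minimal sketch. All names here (`proc_name_t`, `coll_signature_t`, `coll_tracker_t`, `coll_begin`, `coll_end`) are hypothetical and are not the ORTE grpcomm API; the sketch also assumes a signature is an ordered list of participants, which the message does not state. It only shows the constraint that one collective may be active per unique combination of procs, while the same proc may appear in several different combinations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a process name: (jobid, vpid). */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} proc_name_t;

/* A collective is identified by the list of its participants
 * (assumed here to be compared in order). */
typedef struct {
    const proc_name_t *procs;
    size_t sz;
} coll_signature_t;

/* Two signatures match when they name the same procs in the same order. */
bool sig_equal(const coll_signature_t *a, const coll_signature_t *b)
{
    if (a->sz != b->sz) {
        return false;
    }
    for (size_t i = 0; i < a->sz; i++) {
        if (a->procs[i].jobid != b->procs[i].jobid ||
            a->procs[i].vpid != b->procs[i].vpid) {
            return false;
        }
    }
    return true;
}

#define MAX_ACTIVE 8

/* Tracks the signatures of the collectives currently in progress. */
typedef struct {
    coll_signature_t active[MAX_ACTIVE];
    size_t num_active;
} coll_tracker_t;

/* Start a collective. Fails if one with the same signature is already
 * active, or if the (arbitrary) table limit is reached. */
bool coll_begin(coll_tracker_t *t, const coll_signature_t *sig)
{
    for (size_t i = 0; i < t->num_active; i++) {
        if (sig_equal(&t->active[i], sig)) {
            return false;
        }
    }
    if (t->num_active == MAX_ACTIVE) {
        return false;
    }
    t->active[t->num_active++] = *sig;
    return true;
}

/* Finish a collective, freeing its signature for reuse. */
void coll_end(coll_tracker_t *t, const coll_signature_t *sig)
{
    for (size_t i = 0; i < t->num_active; i++) {
        if (sig_equal(&t->active[i], sig)) {
            t->active[i] = t->active[--t->num_active];
            return;
        }
    }
}
```

With this shape no global counter has to be agreed on in advance: any group of procs can start a collective as long as that exact group is not already running one.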
2014-08-21 22:56:47 +04:00
|
|
|
|
2008-09-02 19:07:48 +04:00
|
|
|
int orte_debug_output = -1;
|
2008-09-01 21:15:01 +04:00
|
|
|
bool orte_debug_daemons_flag = false;
|
|
|
|
bool orte_xml_output = false;
|
2009-09-02 22:03:10 +04:00
|
|
|
FILE *orte_xml_fp = NULL;
|
2009-07-15 23:43:26 +04:00
|
|
|
char *orte_job_ident = NULL;
|
2010-04-02 18:19:38 +04:00
|
|
|
bool orte_execute_quiet = false;
|
2010-10-16 07:29:47 +04:00
|
|
|
bool orte_report_silent_errors = false;
|
2008-09-01 21:15:01 +04:00
|
|
|
|
2008-08-01 02:11:46 +04:00
|
|
|
/* See comment in orte/tools/orterun/debuggers.c about this MCA
|
|
|
|
param */
|
|
|
|
bool orte_in_parallel_debugger = false;
|
|
|
|
|
2014-01-31 03:50:14 +04:00
|
|
|
char *orte_daemon_cores = NULL;
|
2008-09-01 21:15:01 +04:00
|
|
|
|
2008-05-28 17:29:58 +04:00
|
|
|
int orte_dt_init(void)
|
2008-02-28 04:57:57 +03:00
|
|
|
{
|
2008-05-28 17:29:58 +04:00
|
|
|
int rc;
|
|
|
|
opal_data_type_t tmp;
|
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
/* set default output */
|
2008-06-09 18:53:58 +04:00
|
|
|
orte_debug_output = opal_output_open(NULL);
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
/* open up the verbose output for ORTE debugging */
|
|
|
|
if (orte_debug_flag || 0 < orte_debug_verbosity ||
|
2009-05-04 15:07:40 +04:00
|
|
|
(orte_debug_daemons_flag && (ORTE_PROC_IS_DAEMON || ORTE_PROC_IS_HNP))) {
|
2008-02-28 04:57:57 +03:00
|
|
|
if (0 < orte_debug_verbosity) {
|
2008-06-09 18:53:58 +04:00
|
|
|
opal_output_set_verbosity(orte_debug_output, orte_debug_verbosity);
|
2008-02-28 04:57:57 +03:00
|
|
|
} else {
|
2008-06-09 18:53:58 +04:00
|
|
|
opal_output_set_verbosity(orte_debug_output, 1);
|
2008-02-28 04:57:57 +03:00
|
|
|
}
|
|
|
|
}
|
2012-08-29 01:20:17 +04:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
/** register the base system types with the DSS */
|
|
|
|
tmp = ORTE_STD_CNTR;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_std_cntr,
|
|
|
|
orte_dt_unpack_std_cntr,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_std_cntr,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_std_cntr,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_std_print,
|
|
|
|
OPAL_DSS_UNSTRUCTURED,
|
|
|
|
"ORTE_STD_CNTR", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2008-09-01 21:15:01 +04:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_JOB;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_job,
|
|
|
|
orte_dt_unpack_job,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_job,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_job,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_print_job,
|
|
|
|
OPAL_DSS_STRUCTURED,
|
|
|
|
"ORTE_JOB", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_NODE;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_node,
|
|
|
|
orte_dt_unpack_node,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_node,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_node,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_print_node,
|
|
|
|
OPAL_DSS_STRUCTURED,
|
|
|
|
"ORTE_NODE", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_PROC;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_proc,
|
|
|
|
orte_dt_unpack_proc,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_proc,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_proc,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_print_proc,
|
|
|
|
OPAL_DSS_STRUCTURED,
|
|
|
|
"ORTE_PROC", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_APP_CONTEXT;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_app_context,
|
|
|
|
orte_dt_unpack_app_context,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_app_context,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_app_context,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_print_app_context,
|
|
|
|
OPAL_DSS_STRUCTURED,
|
|
|
|
"ORTE_APP_CONTEXT", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_NODE_STATE;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_node_state,
|
|
|
|
orte_dt_unpack_node_state,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_node_state,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_node_state,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_std_print,
|
|
|
|
OPAL_DSS_UNSTRUCTURED,
|
|
|
|
"ORTE_NODE_STATE", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_PROC_STATE;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_proc_state,
|
|
|
|
orte_dt_unpack_proc_state,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_proc_state,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_proc_state,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_std_print,
|
|
|
|
OPAL_DSS_UNSTRUCTURED,
|
|
|
|
"ORTE_PROC_STATE", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_JOB_STATE;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_job_state,
|
|
|
|
orte_dt_unpack_job_state,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_job_state,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_job_state,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_std_print,
|
|
|
|
OPAL_DSS_UNSTRUCTURED,
|
|
|
|
"ORTE_JOB_STATE", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_EXIT_CODE;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_exit_code,
|
|
|
|
orte_dt_unpack_exit_code,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_exit_code,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_exit_code,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_std_print,
|
|
|
|
OPAL_DSS_UNSTRUCTURED,
|
|
|
|
"ORTE_EXIT_CODE", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
tmp = ORTE_JOB_MAP;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_map,
|
|
|
|
orte_dt_unpack_map,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_map,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_map,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_print_map,
|
|
|
|
OPAL_DSS_STRUCTURED,
|
|
|
|
"ORTE_JOB_MAP", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-02-28 04:57:57 +03:00
|
|
|
tmp = ORTE_RML_TAG;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_tag,
|
|
|
|
orte_dt_unpack_tag,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_tag,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_tags,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_std_print,
|
|
|
|
OPAL_DSS_UNSTRUCTURED,
|
|
|
|
"ORTE_RML_TAG", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
tmp = ORTE_DAEMON_CMD;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_daemon_cmd,
|
|
|
|
orte_dt_unpack_daemon_cmd,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_daemon_cmd,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_daemon_cmd,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_std_print,
|
|
|
|
OPAL_DSS_UNSTRUCTURED,
|
|
|
|
"ORTE_DAEMON_CMD", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned but that were too difficult to implement in the old code. Specifically, the new code:
1. completely and cleanly separates responsibilities between the HNP, orted, and tool components.
2. removes all wireup messaging during launch and shutdown.
3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol.
4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0.
5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none".
6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout.
7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output"
This is not intended for the 1.3 release as it is a major change requiring considerable soak time.
This commit was SVN r19767.
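The "tag-output" format in item 7 above can be sketched as a small formatting helper. This is only an illustration of the example string given in the commit message; the function name `tag_output_line` is hypothetical, and reading the bracketed pair as (job, rank) is an assumption.

```c
#include <stdio.h>

/* Format one line of forwarded output as "[job,rank]<stream>text".
 * Returns the number of characters that would have been written
 * (as snprintf does), so callers can detect truncation. */
int tag_output_line(char *buf, size_t len,
                    unsigned job, unsigned rank,
                    const char *stream, const char *text)
{
    return snprintf(buf, len, "[%u,%u]<%s>%s", job, rank, stream, text);
}
```

Calling it with job 1, rank 0, stream "stdout" and the text "this is output" reproduces the example string from the commit message.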
2008-10-18 04:00:49 +04:00
|
|
|
tmp = ORTE_IOF_TAG;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_iof_tag,
|
|
|
|
orte_dt_unpack_iof_tag,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_iof_tag,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_iof_tag,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_std_print,
|
|
|
|
OPAL_DSS_UNSTRUCTURED,
|
|
|
|
"ORTE_IOF_TAG", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
tmp = ORTE_ATTRIBUTE;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_attr,
|
|
|
|
orte_dt_unpack_attr,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_attr,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_attr,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_print_attr,
|
|
|
|
OPAL_DSS_STRUCTURED,
|
|
|
|
"ORTE_ATTRIBUTE", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2014-08-21 22:56:47 +04:00
|
|
|
tmp = ORTE_SIGNATURE;
|
|
|
|
if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_sig,
|
|
|
|
orte_dt_unpack_sig,
|
|
|
|
(opal_dss_copy_fn_t)orte_dt_copy_sig,
|
|
|
|
(opal_dss_compare_fn_t)orte_dt_compare_sig,
|
|
|
|
(opal_dss_print_fn_t)orte_dt_print_sig,
|
|
|
|
OPAL_DSS_STRUCTURED,
|
|
|
|
"ORTE_SIGNATURE", &tmp))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
|
|
|
return ORTE_SUCCESS;
|
2008-02-28 04:57:57 +03:00
|
|
|
}
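The registration sequence in orte_dt_init above repeats one pattern per type: call the register function, and on any failure log and return the error code. A table-driven variant of that pattern might look like the sketch below. This is a different arrangement from the code above, not a description of opal_dss; `type_entry_t`, `register_type`, `register_all` and the error codes are hypothetical names.

```c
#include <stddef.h>

#define REG_SUCCESS   0
#define REG_ERR_FULL -1
#define MAX_TYPES    16

/* Hypothetical pack callback; a real registry would also carry
 * unpack, copy, compare and print functions. */
typedef int (*pack_fn_t)(const void *src, void *buf);

typedef struct {
    const char *name;
    int type_id;
    pack_fn_t pack;
} type_entry_t;

static type_entry_t registry[MAX_TYPES];
static size_t num_registered = 0;

/* Register a single type; fails when the table is full. */
int register_type(const char *name, int type_id, pack_fn_t pack)
{
    if (num_registered >= MAX_TYPES) {
        return REG_ERR_FULL;
    }
    registry[num_registered].name = name;
    registry[num_registered].type_id = type_id;
    registry[num_registered].pack = pack;
    num_registered++;
    return REG_SUCCESS;
}

/* Register every entry in a table, stopping at the first error,
 * which mirrors the "if error, return rc" blocks above. */
int register_all(const type_entry_t *types, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int rc = register_type(types[i].name, types[i].type_id, types[i].pack);
        if (REG_SUCCESS != rc) {
            return rc;
        }
    }
    return REG_SUCCESS;
}
```

The explicit per-type blocks in the original make each type's set of callbacks easy to read; a table trades that for less repetition.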
|
|
|
|
|
|
|
|
orte_job_t* orte_get_job_data_object(orte_jobid_t job)
|
|
|
|
{
|
2009-04-13 23:06:54 +04:00
|
|
|
int32_t ljob;
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2010-03-31 23:20:06 +04:00
|
|
|
/* if the job data wasn't set up, we cannot provide the data */
|
|
|
|
if (NULL == orte_job_data) {
|
2008-02-28 04:57:57 +03:00
|
|
|
return NULL;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2009-04-13 23:06:54 +04:00
|
|
|
/* the job is indexed by its local jobid, so we can
|
|
|
|
* just look it up here. it is not an error for this
|
|
|
|
* to not be found - could just be
|
2009-03-24 21:06:49 +03:00
|
|
|
* a race condition whereby the job has already been
|
2009-04-13 23:06:54 +04:00
|
|
|
* removed from the array. The get_item function
|
|
|
|
* will just return NULL in that case.
|
2009-03-24 21:06:49 +03:00
|
|
|
*/
|
2009-04-13 23:06:54 +04:00
|
|
|
ljob = ORTE_LOCAL_JOBID(job);
|
|
|
|
return (orte_job_t*)opal_pointer_array_get_item(orte_job_data, ljob);
|
2008-02-28 04:57:57 +03:00
|
|
|
}
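The lookup above relies on the job array being indexed by the local part of the jobid. A minimal sketch of that idea follows, assuming the local jobid is the low 16 bits of a 32-bit jobid; the mask, `job_rec_t` and `job_lookup` are illustrative names, not the ORTE definitions. As in the comment above, a missing entry is not treated as an error and simply yields NULL.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: upper 16 bits identify the job family,
 * lower 16 bits are the local jobid used as the array index. */
#define LOCAL_JOBID(j) ((j) & 0x0000ffffu)

typedef struct {
    uint32_t jobid;
} job_rec_t;

/* Return the job record for jobid, or NULL if the table is not set up,
 * the index is out of range, or the slot is empty (for example because
 * the job was already removed). */
job_rec_t *job_lookup(job_rec_t *table[], size_t size, uint32_t jobid)
{
    uint32_t idx = LOCAL_JOBID(jobid);

    if (NULL == table || idx >= size) {
        return NULL;
    }
    return table[idx];
}
```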
|
2008-09-01 21:15:01 +04:00
|
|
|
|
2012-06-27 18:53:55 +04:00
|
|
|
orte_proc_t* orte_get_proc_object(orte_process_name_t *proc)
|
|
|
|
{
|
|
|
|
orte_job_t *jdata;
|
|
|
|
orte_proc_t *proct;
|
|
|
|
|
|
|
|
if (NULL == (jdata = orte_get_job_data_object(proc->jobid))) {
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
proct = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, proc->vpid);
|
|
|
|
return proct;
|
|
|
|
}
|
|
|
|
|
|
|
|
orte_vpid_t orte_get_proc_daemon_vpid(orte_process_name_t *proc)
|
|
|
|
{
|
|
|
|
orte_job_t *jdata;
|
|
|
|
orte_proc_t *proct;
|
|
|
|
|
|
|
|
if (NULL == (jdata = orte_get_job_data_object(proc->jobid))) {
|
|
|
|
return ORTE_VPID_INVALID;
|
|
|
|
}
|
|
|
|
if (NULL == (proct = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, proc->vpid))) {
|
|
|
|
return ORTE_VPID_INVALID;
|
|
|
|
}
|
|
|
|
if (NULL == proct->node || NULL == proct->node->daemon) {
|
|
|
|
return ORTE_VPID_INVALID;
|
|
|
|
}
|
|
|
|
return proct->node->daemon->name.vpid;
|
|
|
|
}
|
|
|
|
|
|
|
|
char* orte_get_proc_hostname(orte_process_name_t *proc)
|
|
|
|
{
|
|
|
|
orte_proc_t *proct;
|
2015-09-30 20:33:53 +03:00
|
|
|
char *hostname = NULL;
|
2015-06-18 19:53:20 +03:00
|
|
|
int rc;
|
2012-06-27 18:53:55 +04:00
|
|
|
|
2014-09-16 20:28:29 +04:00
|
|
|
/* don't bother error logging any not-found situations
|
|
|
|
* as the layer above us will have something to say
|
|
|
|
* about it */
|
2012-06-27 18:53:55 +04:00
|
|
|
if (ORTE_PROC_IS_DAEMON || ORTE_PROC_IS_HNP) {
|
|
|
|
/* look it up on our arrays */
|
|
|
|
if (NULL == (proct = orte_get_proc_object(proc))) {
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
if (NULL == proct->node || NULL == proct->node->name) {
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
return proct->node->name;
|
|
|
|
}
|
|
|
|
|
2014-04-30 01:49:23 +04:00
|
|
|
/* if we are an app, get the data from the modex db */
|
2015-06-18 19:53:20 +03:00
|
|
|
OPAL_MODEX_RECV_VALUE(rc, OPAL_PMIX_HOSTNAME,
|
|
|
|
(opal_process_name_t*)proc,
|
|
|
|
&hostname, OPAL_STRING);
|
|
|
|
|
2014-04-30 01:49:23 +04:00
|
|
|
/* user is responsible for releasing the data */
|
2012-06-27 18:53:55 +04:00
|
|
|
return hostname;
|
|
|
|
}
|
|
|
|
|
|
|
|
orte_node_rank_t orte_get_proc_node_rank(orte_process_name_t *proc)
|
|
|
|
{
|
|
|
|
orte_proc_t *proct;
|
2015-06-18 19:53:20 +03:00
|
|
|
orte_node_rank_t *noderank, nd;
|
2012-06-27 18:53:55 +04:00
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (ORTE_PROC_IS_DAEMON || ORTE_PROC_IS_HNP) {
|
|
|
|
/* look it up on our arrays */
|
|
|
|
if (NULL == (proct = orte_get_proc_object(proc))) {
|
|
|
|
ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
|
|
|
|
return ORTE_NODE_RANK_INVALID;
|
|
|
|
}
|
|
|
|
return proct->node_rank;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* if we are an app, get the value from the modex db */
|
2015-09-30 20:33:53 +03:00
|
|
|
noderank = &nd;
|
2015-06-18 19:53:20 +03:00
|
|
|
OPAL_MODEX_RECV_VALUE(rc, OPAL_PMIX_NODE_RANK,
|
|
|
|
(opal_process_name_t*)proc,
|
|
|
|
&noderank, ORTE_NODE_RANK);
|
2015-09-30 20:33:53 +03:00
|
|
|
if (OPAL_SUCCESS != rc) {
|
|
|
|
nd = ORTE_NODE_RANK_INVALID;
|
|
|
|
}
|
2015-06-18 19:53:20 +03:00
|
|
|
return nd;
|
2012-06-27 18:53:55 +04:00
|
|
|
}
|
|
|
|
|
2011-02-14 22:45:59 +03:00
|
|
|
orte_vpid_t orte_get_lowest_vpid_alive(orte_jobid_t job)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
orte_job_t *jdata;
|
|
|
|
orte_proc_t *proc;
|
|
|
|
|
|
|
|
if (NULL == (jdata = orte_get_job_data_object(job))) {
|
|
|
|
return ORTE_VPID_INVALID;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ORTE_PROC_IS_DAEMON &&
|
|
|
|
ORTE_PROC_MY_NAME->jobid == job &&
|
|
|
|
NULL != orte_process_info.my_hnp_uri) {
|
|
|
|
/* if we were started by an HNP, then the lowest vpid
|
|
|
|
* is always 1
|
|
|
|
*/
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i=0; i < jdata->procs->size; i++) {
|
|
|
|
if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, i))) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (proc->state == ORTE_PROC_STATE_RUNNING) {
|
|
|
|
/* must be lowest one alive */
|
|
|
|
return proc->name.vpid;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
/* only get here if no live proc found */
|
|
|
|
return ORTE_VPID_INVALID;
|
|
|
|
}
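The scan in orte_get_lowest_vpid_alive can be reduced to a small standalone sketch: walk a sparse pointer array in index order, skip empty slots, and return the vpid of the first proc found running. The types and names below (`proc_rec_t`, `lowest_vpid_alive`, `VPID_INVALID`) are hypothetical, and the sketch assumes, as the original appears to, that entries are stored in ascending vpid order.

```c
#include <stddef.h>
#include <stdint.h>

#define VPID_INVALID UINT32_MAX

typedef enum { PROC_INIT, PROC_RUNNING, PROC_TERMINATED } proc_state_t;

typedef struct {
    uint32_t vpid;
    proc_state_t state;
} proc_rec_t;

/* Return the vpid of the first running proc in the array,
 * or VPID_INVALID if no live proc is found. */
uint32_t lowest_vpid_alive(proc_rec_t *procs[], size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if (NULL == procs[i]) {
            continue;               /* hole in the sparse array */
        }
        if (PROC_RUNNING == procs[i]->state) {
            return procs[i]->vpid;  /* must be the lowest one alive */
        }
    }
    return VPID_INVALID;
}
```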
|
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
|
2008-09-01 21:15:01 +04:00
|
|
|
/*
|
|
|
|
* CONSTRUCTORS, DESTRUCTORS, AND CLASS INSTANTIATIONS
|
|
|
|
* FOR ORTE CLASSES
|
|
|
|
*/
|
|
|
|
|
|
|
|
static void orte_app_context_construct(orte_app_context_t* app_context)
|
|
|
|
{
|
|
|
|
app_context->idx=0;
|
|
|
|
app_context->app=NULL;
|
|
|
|
app_context->num_procs=0;
|
2012-08-29 01:20:17 +04:00
|
|
|
OBJ_CONSTRUCT(&app_context->procs, opal_pointer_array_t);
|
|
|
|
opal_pointer_array_init(&app_context->procs,
|
|
|
|
1,
|
|
|
|
ORTE_GLOBAL_ARRAY_MAX_SIZE,
|
|
|
|
16);
|
|
|
|
app_context->state = ORTE_APP_STATE_UNDEF;
|
2012-08-12 05:28:23 +04:00
|
|
|
app_context->first_rank = 0;
|
2008-09-01 21:15:01 +04:00
|
|
|
app_context->argv=NULL;
|
|
|
|
app_context->env=NULL;
|
|
|
|
app_context->cwd=NULL;
|
2014-06-01 20:14:10 +04:00
|
|
|
app_context->flags = 0;
|
|
|
|
OBJ_CONSTRUCT(&app_context->attributes, opal_list_t);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void orte_app_context_destructor(orte_app_context_t* app_context)
|
|
|
|
{
|
2012-08-29 01:20:17 +04:00
|
|
|
int i;
|
|
|
|
orte_proc_t *proc;
|
2011-05-29 02:18:19 +04:00
|
|
|
|
2008-09-01 21:15:01 +04:00
|
|
|
if (NULL != app_context->app) {
|
|
|
|
free (app_context->app);
|
2009-07-14 22:56:49 +04:00
|
|
|
app_context->app = NULL;
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2012-08-29 01:20:17 +04:00
|
|
|
for (i=0; i < app_context->procs.size; i++) {
|
|
|
|
if (NULL != (proc = (orte_proc_t*)opal_pointer_array_get_item(&app_context->procs, i))) {
|
|
|
|
OBJ_RELEASE(proc);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
OBJ_DESTRUCT(&app_context->procs);
|
|
|
|
|
2008-09-01 21:15:01 +04:00
|
|
|
/* argv and env lists created by util/argv copy functions */
|
|
|
|
if (NULL != app_context->argv) {
|
|
|
|
opal_argv_free(app_context->argv);
|
2009-07-14 22:56:49 +04:00
|
|
|
app_context->argv = NULL;
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-09-01 21:15:01 +04:00
|
|
|
if (NULL != app_context->env) {
|
|
|
|
opal_argv_free(app_context->env);
|
2009-07-14 22:56:49 +04:00
|
|
|
app_context->env = NULL;
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-09-01 21:15:01 +04:00
|
|
|
if (NULL != app_context->cwd) {
|
|
|
|
free (app_context->cwd);
|
2009-07-14 22:56:49 +04:00
|
|
|
app_context->cwd = NULL;
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
OPAL_LIST_DESTRUCT(&app_context->attributes);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
OBJ_CLASS_INSTANCE(orte_app_context_t,
|
|
|
|
opal_object_t,
|
|
|
|
orte_app_context_construct,
|
|
|
|
orte_app_context_destructor);
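The constructor/destructor pairing registered above follows a common reference-counted object pattern in C. The sketch below is a simplified illustration of that pattern and is not the OPAL object system; `obj_t`, `obj_class_t`, `obj_new`, `obj_retain`, `obj_release` and the `app_t` example type are all hypothetical.

```c
#include <stdlib.h>

typedef struct obj obj_t;
typedef void (*obj_fn_t)(obj_t *);

/* Per-class descriptor: how to build, tear down and size an instance. */
typedef struct {
    obj_fn_t construct;
    obj_fn_t destruct;
    size_t size;
} obj_class_t;

/* Base object; derived types embed it as their first member. */
struct obj {
    const obj_class_t *cls;
    int refcount;
};

/* Allocate an instance, set refcount to 1 and run the constructor. */
obj_t *obj_new(const obj_class_t *cls)
{
    obj_t *o = calloc(1, cls->size);
    if (NULL == o) {
        return NULL;
    }
    o->cls = cls;
    o->refcount = 1;
    if (NULL != cls->construct) {
        cls->construct(o);
    }
    return o;
}

void obj_retain(obj_t *o)
{
    o->refcount++;
}

/* Drop one reference; run the destructor and free the memory when the
 * last reference goes away. Returns 1 if the object was destroyed. */
int obj_release(obj_t *o)
{
    if (--o->refcount > 0) {
        return 0;
    }
    if (NULL != o->cls->destruct) {
        o->cls->destruct(o);
    }
    free(o);
    return 1;
}

/* Example derived type with one owned string. */
typedef struct {
    obj_t super;
    char *cwd;
} app_t;

static int destruct_calls = 0;

static void app_construct(obj_t *o)
{
    ((app_t *)o)->cwd = NULL;
}

static void app_destruct(obj_t *o)
{
    free(((app_t *)o)->cwd);   /* free(NULL) is a no-op */
    destruct_calls++;
}

const obj_class_t app_class = { app_construct, app_destruct, sizeof(app_t) };
```

The destructor only releases what the object itself owns, which matches the way the destructor above frees `app`, `argv`, `env` and `cwd` and releases the procs it holds.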
|
|
|
|
|
|
|
|
static void orte_job_construct(orte_job_t* job)
|
|
|
|
{
|
2015-01-27 05:15:57 +03:00
|
|
|
job->personality = NULL;
|
2008-09-01 21:15:01 +04:00
|
|
|
job->jobid = ORTE_JOBID_INVALID;
|
2013-11-14 21:01:43 +04:00
|
|
|
job->offset = 0;
|
2008-09-01 21:15:01 +04:00
|
|
|
job->apps = OBJ_NEW(opal_pointer_array_t);
|
|
|
|
opal_pointer_array_init(job->apps,
|
|
|
|
1,
|
|
|
|
ORTE_GLOBAL_ARRAY_MAX_SIZE,
|
|
|
|
2);
|
|
|
|
job->num_apps = 0;
|
2008-10-18 04:00:49 +04:00
|
|
|
job->stdin_target = ORTE_VPID_INVALID;
|
2008-09-01 21:15:01 +04:00
|
|
|
job->total_slots_alloc = 0;
|
|
|
|
job->num_procs = 0;
|
|
|
|
job->procs = OBJ_NEW(opal_pointer_array_t);
|
|
|
|
opal_pointer_array_init(job->procs,
|
|
|
|
ORTE_GLOBAL_ARRAY_BLOCK_SIZE,
|
|
|
|
ORTE_GLOBAL_ARRAY_MAX_SIZE,
|
|
|
|
ORTE_GLOBAL_ARRAY_BLOCK_SIZE);
|
|
|
|
job->map = NULL;
|
|
|
|
job->bookmark = NULL;
|
2015-06-17 19:20:08 +03:00
|
|
|
job->bkmark_obj = 0;
|
2008-09-01 21:15:01 +04:00
|
|
|
job->state = ORTE_JOB_STATE_UNDEF;
|
|
|
|
|
2012-08-30 00:35:52 +04:00
|
|
|
job->num_mapped = 0;
|
2008-09-01 21:15:01 +04:00
|
|
|
job->num_launched = 0;
|
|
|
|
job->num_reported = 0;
|
|
|
|
job->num_terminated = 0;
|
2010-04-23 08:44:41 +04:00
|
|
|
job->num_daemons_reported = 0;
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2012-04-06 18:23:13 +04:00
|
|
|
job->originator.jobid = ORTE_JOBID_INVALID;
|
|
|
|
job->originator.vpid = ORTE_VPID_INVALID;
|
|
|
|
job->num_local_procs = 0;
|
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
job->flags = 0;
|
|
|
|
ORTE_FLAG_SET(job, ORTE_JOB_FLAG_GANG_LAUNCHED);
|
|
|
|
ORTE_FLAG_SET(job, ORTE_JOB_FLAG_FORWARD_OUTPUT);
|
2012-10-30 03:11:30 +04:00
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
OBJ_CONSTRUCT(&job->attributes, opal_list_t);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void orte_job_destruct(orte_job_t* job)
|
|
|
|
{
|
The current errmgr.register_callback API takes a jobid as one of its arguments. The intent was to have the errmgr check the jobid of the job being reported to it and, if it matches the jobid that was registered, call the specified callback function.
Unfortunately, we assign the jobid during the plm.spawn procedure - which means it happens -after- control of the job has passed out of the range of mpirun (or whatever program is spawning the job), so it is too late for that main program to register a callback function. If the main program registers the callback -after- we return from plm.spawn, then it (a) cannot get a callback for failed-to-start, and (b) will miss the callback if a proc aborts in the time between job launch and the call to errmgr.register_callback.
This commit fixes the problem by adding callback-related fields to the orte_job_t object. Thus, the main program can specify what job states should initiate a callback, what function is to be called, and what data is to be passed back by simply filling in the orte_job_t fields prior to calling plm.spawn.
Also, fully implement the "copy" function for the orte_job_t object.
NOTE: as a result of this change, the errmgr.register_callback API may no longer be of any value.
This commit was SVN r21200.
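The fix described in the commit message above, carrying the callback on the job object so it is in place before the jobid even exists, can be sketched as follows. The field and function names (`cb_states`, `cbfunc`, `cbdata`, `job_report_state`, `spawn_that_fails`) are hypothetical and do not reflect the actual orte_job_t layout.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t job_state_t;
#define JOB_STATE_FAILED_TO_START 0x01u
#define JOB_STATE_ABORTED         0x02u

struct job;
typedef void (*job_state_cb_t)(struct job *jdata, job_state_t state, void *cbdata);

typedef struct job {
    uint32_t jobid;          /* assigned during spawn, not before */
    job_state_t cb_states;   /* bitmask of states that trigger the callback */
    job_state_cb_t cbfunc;   /* function to call */
    void *cbdata;            /* data passed back to the caller */
} job_t;

/* Invoke the callback if the reported state is one the caller asked for. */
void job_report_state(job_t *jdata, job_state_t state)
{
    if (NULL != jdata->cbfunc && (jdata->cb_states & state)) {
        jdata->cbfunc(jdata, state, jdata->cbdata);
    }
}

/* Hypothetical spawn: the jobid is only assigned here, and the launch
 * then fails. Because the callback travels with the job object, the
 * failure is still delivered to the caller. */
void spawn_that_fails(job_t *jdata, uint32_t assigned_jobid)
{
    jdata->jobid = assigned_jobid;
    job_report_state(jdata, JOB_STATE_FAILED_TO_START);
}

/* Example callback: count how many times it fired. */
void count_cb(job_t *jdata, job_state_t state, void *cbdata)
{
    (void)jdata;
    (void)state;
    (*(int *)cbdata)++;
}
```

The caller fills in the three callback fields before calling spawn, so there is no window in which a failure can be missed.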
2009-05-11 07:38:15 +04:00
|
|
|
orte_proc_t *proc;
|
|
|
|
orte_app_context_t *app;
|
|
|
|
orte_job_t *jdata;
|
2009-03-03 19:39:13 +03:00
|
|
|
int n;
|
2014-06-01 20:14:10 +04:00
|
|
|
orte_timer_t *evtimer;
|
2012-04-06 18:23:13 +04:00
|
|
|
|
2009-05-11 18:03:07 +04:00
|
|
|
if (NULL == job) {
|
|
|
|
/* probably just a race condition - just return */
|
|
|
|
return;
|
|
|
|
}
|
2015-01-27 05:15:57 +03:00
|
|
|
|
2009-05-11 18:03:07 +04:00
|
|
|
if (orte_debug_flag) {
|
|
|
|
opal_output(0, "%s Releasing job data for %s",
|
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_JOBID_PRINT(job->jobid));
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2015-01-27 05:15:57 +03:00
|
|
|
if (NULL != job->personality) {
|
|
|
|
free(job->personality);
|
|
|
|
}
|
2009-05-11 07:38:15 +04:00
|
|
|
for (n=0; n < job->apps->size; n++) {
|
|
|
|
if (NULL == (app = (orte_app_context_t*)opal_pointer_array_get_item(job->apps, n))) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
OBJ_RELEASE(app);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
OBJ_RELEASE(job->apps);
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
/* release any pointers in the attributes */
|
|
|
|
evtimer = NULL;
|
|
|
|
if (orte_get_attribute(&job->attributes, ORTE_JOB_FAILURE_TIMER_EVENT,
|
2014-06-04 00:36:38 +04:00
|
|
|
(void**)&evtimer, OPAL_PTR)) {
|
2014-06-02 19:59:18 +04:00
|
|
|
orte_remove_attribute(&job->attributes, ORTE_JOB_FAILURE_TIMER_EVENT);
|
2014-06-01 20:14:10 +04:00
|
|
|
/* the timer is a pointer to orte_timer_t */
|
|
|
|
OBJ_RELEASE(evtimer);
|
|
|
|
}
|
|
|
|
proc = NULL;
|
|
|
|
if (orte_get_attribute(&job->attributes, ORTE_JOB_ABORTED_PROC,
|
2014-06-04 00:36:38 +04:00
|
|
|
(void**)&proc, OPAL_PTR)) {
|
2014-06-02 19:59:18 +04:00
|
|
|
orte_remove_attribute(&job->attributes, ORTE_JOB_ABORTED_PROC);
|
2014-06-01 20:14:10 +04:00
|
|
|
/* points to an orte_proc_t */
|
|
|
|
OBJ_RELEASE(proc);
|
2013-11-12 03:50:40 +04:00
|
|
|
}
|
|
|
|
|
2009-07-14 22:56:49 +04:00
|
|
|
if (NULL != job->map) {
|
|
|
|
OBJ_RELEASE(job->map);
|
|
|
|
job->map = NULL;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2009-03-03 19:39:13 +03:00
|
|
|
for (n=0; n < job->procs->size; n++) {
|
2009-05-11 07:38:15 +04:00
|
|
|
if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(job->procs, n))) {
|
|
|
|
continue;
|
2009-03-03 19:39:13 +03:00
|
|
|
}
|
2009-05-11 07:38:15 +04:00
|
|
|
OBJ_RELEASE(proc);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
OBJ_RELEASE(job->procs);
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
/* release the attributes */
|
|
|
|
OPAL_LIST_DESTRUCT(&job->attributes);
|
2012-10-30 03:11:30 +04:00
|
|
|
|
2009-05-11 07:38:15 +04:00
|
|
|
/* find the job in the global array */
|
2014-06-01 20:14:10 +04:00
|
|
|
if (NULL != orte_job_data && ORTE_JOBID_INVALID != job->jobid) {
|
2009-05-11 18:03:07 +04:00
|
|
|
for (n=0; n < orte_job_data->size; n++) {
|
|
|
|
if (NULL == (jdata = (orte_job_t*)opal_pointer_array_get_item(orte_job_data, n))) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (jdata->jobid == job->jobid) {
|
|
|
|
/* set the entry to NULL */
|
|
|
|
opal_pointer_array_set_item(orte_job_data, n, NULL);
|
|
|
|
break;
|
|
|
|
}
|
2009-05-11 07:38:15 +04:00
|
|
|
}
|
|
|
|
}
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
OBJ_CLASS_INSTANCE(orte_job_t,
|
|
|
|
opal_list_item_t,
|
|
|
|
orte_job_construct,
|
|
|
|
orte_job_destruct);
|
|
|
|
|
|
|
|
|
|
|
|
static void orte_node_construct(orte_node_t* node)
|
|
|
|
{
|
****************************************************************
This change contains a non-mandatory modification
of the MPI-RTE interface. Anyone wishing to support
coprocessors such as the Xeon Phi may wish to add
the required definition and underlying support
****************************************************************
Add locality support for coprocessors such as the Intel Xeon Phi.
Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.
So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:
1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board
2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions
3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.
4. modify the orted and the hnp startup so they check for coprocessors and whether they are themselves on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algorithm isn't scalable at the moment - this will hopefully be improved over time.
5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTEs that choose not to provide this support do not have to do anything - the associated code will simply be ignored.
6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.
cmr:v1.7.4:reviewer=hjelmn
This commit was SVN r29435.
2013-10-14 20:52:58 +04:00
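The split between "on the same host" and "on the same board" described in the r29435 message above can be sketched with bit flags. The DEMO_* names and bit values below are assumptions made for illustration and are not the real OPAL_PROC_ON_* definitions; the sketch also assumes that sharing a host implies sharing the cluster and cu levels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical locality bitmask; values are illustrative only. */
typedef uint16_t demo_locality_t;
#define DEMO_PROC_ON_CLUSTER 0x0001u
#define DEMO_PROC_ON_CU      0x0002u
#define DEMO_PROC_ON_HOST    0x0004u  /* same host, maybe a different board */
#define DEMO_PROC_ON_BOARD   0x0008u  /* same board within that host */
/* "On node" means same host AND same board, so it combines both bits. */
#define DEMO_PROC_ON_NODE    (DEMO_PROC_ON_HOST | DEMO_PROC_ON_BOARD)

/* True only if both the host and the board bits are set. */
static bool demo_on_local_node(demo_locality_t loc)
{
    return (loc & DEMO_PROC_ON_NODE) == DEMO_PROC_ON_NODE;
}

/* True if the host bit is set, regardless of the board bit. */
static bool demo_on_local_host(demo_locality_t loc)
{
    return 0 != (loc & DEMO_PROC_ON_HOST);
}

/* Build a locality value from two observations about a peer process. */
static demo_locality_t demo_locality(bool same_host, bool same_board)
{
    demo_locality_t loc = 0;
    /* sharing a board without sharing the host makes no sense */
    assert(same_host || !same_board);
    if (same_host) {
        /* assumption of this sketch: same host implies same cluster and cu */
        loc |= DEMO_PROC_ON_HOST | DEMO_PROC_ON_CLUSTER | DEMO_PROC_ON_CU;
    }
    if (same_board) {
        loc |= DEMO_PROC_ON_BOARD;
    }
    return loc;
}
```

With this layout, a process on a coprocessor and a process on its host would report host locality but not node locality, which is the distinction the commit says the transport needs.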
|
|
|
node->index = -1;
|
2008-09-01 21:15:01 +04:00
|
|
|
node->name = NULL;
|
|
|
|
node->daemon = NULL;
|
|
|
|
|
|
|
|
node->num_procs = 0;
|
|
|
|
node->procs = OBJ_NEW(opal_pointer_array_t);
|
|
|
|
opal_pointer_array_init(node->procs,
|
|
|
|
ORTE_GLOBAL_ARRAY_BLOCK_SIZE,
|
|
|
|
ORTE_GLOBAL_ARRAY_MAX_SIZE,
|
|
|
|
ORTE_GLOBAL_ARRAY_BLOCK_SIZE);
|
|
|
|
node->next_node_rank = 0;
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2008-09-01 21:15:01 +04:00
|
|
|
node->state = ORTE_NODE_STATE_UNKNOWN;
|
|
|
|
node->slots = 0;
|
|
|
|
node->slots_inuse = 0;
|
|
|
|
node->slots_max = 0;
|
2011-09-11 23:02:24 +04:00
|
|
|
node->topology = NULL;
|
2011-04-22 02:55:45 +04:00
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
node->flags = 0;
|
|
|
|
OBJ_CONSTRUCT(&node->attributes, opal_list_t);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void orte_node_destruct(orte_node_t* node)
|
|
|
|
{
|
2009-03-03 19:39:13 +03:00
|
|
|
int i;
|
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps some initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 07:40:11 +04:00
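The oversubscription default described in item 1 of the r25476 message above reduces to a small decision rule. The sketch below is one way to express it; the enum and function names are hypothetical and do not correspond to actual ORTE symbols or command-line flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: where the node allocation came from. */
typedef enum {
    DEMO_ALLOC_HOSTFILE,   /* nodes listed in a hostfile */
    DEMO_ALLOC_DASH_HOST,  /* nodes given directly on the command line */
    DEMO_ALLOC_RM          /* nodes allocated by a resource manager */
} demo_alloc_source_t;

/* Hypothetical: explicit user override, if any. */
typedef enum {
    DEMO_OVERSUB_DEFAULT,  /* no override given */
    DEMO_OVERSUB_ALLOW,
    DEMO_OVERSUB_FORBID
} demo_oversub_flag_t;

/* An explicit override always wins; otherwise oversubscription is allowed
 * for hostfile/dash-host allocations and forbidden for RM allocations. */
static bool demo_oversubscribe_allowed(demo_alloc_source_t src,
                                       demo_oversub_flag_t flag)
{
    if (DEMO_OVERSUB_ALLOW == flag) {
        return true;
    }
    if (DEMO_OVERSUB_FORBID == flag) {
        return false;
    }
    return DEMO_ALLOC_RM != src;
}

/* A node can take one more process if it has a free slot, or if running
 * more processes than allocated slots is permitted. */
static bool demo_node_can_accept(int slots, int slots_inuse, bool oversub_ok)
{
    assert(slots >= 0 && slots_inuse >= 0);
    return oversub_ok || slots_inuse < slots;
}
```

Keeping the policy decision separate from the per-node slot check means the override flags only have to be consulted in one place.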
|
|
|
orte_proc_t *proc;
|
2011-06-30 07:12:38 +04:00
|
|
|
|
2008-09-01 21:15:01 +04:00
|
|
|
if (NULL != node->name) {
|
|
|
|
free(node->name);
|
2009-07-14 22:56:49 +04:00
|
|
|
node->name = NULL;
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
2009-03-03 19:39:13 +03:00
|
|
|
if (NULL != node->daemon) {
|
|
|
|
node->daemon->node = NULL;
|
|
|
|
OBJ_RELEASE(node->daemon);
|
2009-07-14 22:56:49 +04:00
|
|
|
node->daemon = NULL;
|
2009-03-03 19:39:13 +03:00
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2009-03-03 19:39:13 +03:00
|
|
|
for (i=0; i < node->procs->size; i++) {
|
2011-11-15 07:40:11 +04:00
|
|
|
if (NULL != (proc = (orte_proc_t*)opal_pointer_array_get_item(node->procs, i))) {
|
|
|
|
opal_pointer_array_set_item(node->procs, i, NULL);
|
|
|
|
OBJ_RELEASE(proc);
|
2009-03-03 19:39:13 +03:00
|
|
|
}
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
OBJ_RELEASE(node->procs);
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2011-09-11 23:02:24 +04:00
|
|
|
/* do NOT destroy the topology */
|
2011-06-30 07:12:38 +04:00
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
/* release the attributes */
|
|
|
|
OPAL_LIST_DESTRUCT(&node->attributes);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
OBJ_CLASS_INSTANCE(orte_node_t,
|
|
|
|
opal_list_item_t,
|
|
|
|
orte_node_construct,
|
|
|
|
orte_node_destruct);
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
static void orte_proc_construct(orte_proc_t* proc)
|
|
|
|
{
|
|
|
|
proc->name = *ORTE_NAME_INVALID;
|
|
|
|
proc->pid = 0;
|
2009-06-06 05:08:47 +04:00
|
|
|
proc->local_rank = ORTE_LOCAL_RANK_INVALID;
|
|
|
|
proc->node_rank = ORTE_NODE_RANK_INVALID;
|
2011-06-17 00:31:30 +04:00
|
|
|
proc->app_rank = -1;
|
2010-03-24 00:28:02 +03:00
|
|
|
proc->last_errmgr_state = ORTE_PROC_STATE_UNDEF;
|
2008-09-01 21:15:01 +04:00
|
|
|
proc->state = ORTE_PROC_STATE_UNDEF;
|
2010-02-27 20:37:34 +03:00
|
|
|
proc->app_idx = 0;
|
2008-09-01 21:15:01 +04:00
|
|
|
proc->node = NULL;
|
2011-03-03 03:02:21 +03:00
|
|
|
proc->exit_code = 0; /* Assume we won't fail unless otherwise notified */
|
2008-09-01 21:15:01 +04:00
|
|
|
proc->rml_uri = NULL;
|
2014-06-01 20:14:10 +04:00
|
|
|
proc->flags = 0;
|
|
|
|
OBJ_CONSTRUCT(&proc->attributes, opal_list_t);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void orte_proc_destruct(orte_proc_t* proc)
|
|
|
|
{
|
2009-07-14 22:56:49 +04:00
|
|
|
if (NULL != proc->node) {
|
|
|
|
OBJ_RELEASE(proc->node);
|
|
|
|
proc->node = NULL;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2009-07-14 22:56:49 +04:00
|
|
|
if (NULL != proc->rml_uri) {
|
|
|
|
free(proc->rml_uri);
|
|
|
|
proc->rml_uri = NULL;
|
|
|
|
}
|
2011-06-30 07:12:38 +04:00
|
|
|
|
2014-06-01 20:14:10 +04:00
|
|
|
OPAL_LIST_DESTRUCT(&proc->attributes);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
OBJ_CLASS_INSTANCE(orte_proc_t,
|
|
|
|
opal_list_item_t,
|
|
|
|
orte_proc_construct,
|
|
|
|
orte_proc_destruct);
|
|
|
|
|
|
|
|
static void orte_job_map_construct(orte_job_map_t* map)
|
|
|
|
{
|
2011-03-12 08:30:09 +03:00
|
|
|
map->req_mapper = NULL;
|
|
|
|
map->last_mapper = NULL;
|
2011-11-15 07:40:11 +04:00
|
|
|
map->mapping = 0;
|
|
|
|
map->ranking = 0;
|
|
|
|
map->binding = 0;
|
|
|
|
map->ppr = NULL;
|
2009-08-11 06:51:27 +04:00
|
|
|
map->cpus_per_rank = 1;
|
2008-09-01 21:15:01 +04:00
|
|
|
map->display_map = false;
|
|
|
|
map->num_new_daemons = 0;
|
|
|
|
map->daemon_vpid_start = ORTE_VPID_INVALID;
|
|
|
|
map->num_nodes = 0;
|
|
|
|
map->nodes = OBJ_NEW(opal_pointer_array_t);
|
|
|
|
opal_pointer_array_init(map->nodes,
|
|
|
|
ORTE_GLOBAL_ARRAY_BLOCK_SIZE,
|
|
|
|
ORTE_GLOBAL_ARRAY_MAX_SIZE,
|
|
|
|
ORTE_GLOBAL_ARRAY_BLOCK_SIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void orte_job_map_destruct(orte_job_map_t* map)
|
|
|
|
{
|
|
|
|
orte_std_cntr_t i;
|
2011-11-15 07:40:11 +04:00
|
|
|
orte_node_t *node;
|
|
|
|
|
2011-03-12 08:30:09 +03:00
|
|
|
if (NULL != map->req_mapper) {
|
|
|
|
free(map->req_mapper);
|
|
|
|
}
|
|
|
|
if (NULL != map->last_mapper) {
|
|
|
|
free(map->last_mapper);
|
|
|
|
}
|
2011-11-15 07:40:11 +04:00
|
|
|
if (NULL != map->ppr) {
|
|
|
|
free(map->ppr);
|
|
|
|
}
|
2008-09-01 21:15:01 +04:00
|
|
|
for (i=0; i < map->nodes->size; i++) {
|
2011-11-15 07:40:11 +04:00
|
|
|
if (NULL != (node = (orte_node_t*)opal_pointer_array_get_item(map->nodes, i))) {
|
|
|
|
OBJ_RELEASE(node);
|
|
|
|
opal_pointer_array_set_item(map->nodes, i, NULL);
|
2008-09-01 21:15:01 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
OBJ_RELEASE(map->nodes);
|
|
|
|
}
|
|
|
|
|
|
|
|
OBJ_CLASS_INSTANCE(orte_job_map_t,
|
|
|
|
opal_object_t,
|
|
|
|
orte_job_map_construct,
|
|
|
|
orte_job_map_destruct);
|
2014-06-01 20:14:10 +04:00
|
|
|
|
|
|
|
static void orte_attr_cons(orte_attribute_t* p)
|
|
|
|
{
|
|
|
|
p->key = 0;
|
|
|
|
p->local = true; // default to local-only data
|
|
|
|
memset(&p->data, 0, sizeof(p->data));
|
|
|
|
}
|
|
|
|
static void orte_attr_des(orte_attribute_t *p)
|
|
|
|
{
|
|
|
|
if (OPAL_BYTE_OBJECT == p->type) {
|
|
|
|
if (NULL != p->data.bo.bytes) {
|
|
|
|
free(p->data.bo.bytes);
|
|
|
|
}
|
|
|
|
} else if (OPAL_BUFFER == p->type) {
|
|
|
|
OBJ_DESTRUCT(&p->data.buf);
|
2014-11-19 11:21:43 +03:00
|
|
|
} else if (OPAL_STRING == p->type) {
|
|
|
|
free(p->data.string);
|
2014-06-01 20:14:10 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
OBJ_CLASS_INSTANCE(orte_attribute_t,
|
|
|
|
opal_list_item_t,
|
|
|
|
orte_attr_cons, orte_attr_des);
|
2014-12-09 02:33:45 +03:00
|
|
|
|
|
|
|
static void tcon(orte_topology_t *t)
|
|
|
|
{
|
|
|
|
t->topo = NULL;
|
|
|
|
t->sig = NULL;
|
|
|
|
}
|
|
|
|
static void tdes(orte_topology_t *t)
|
|
|
|
{
|
|
|
|
if (NULL != t->topo) {
|
|
|
|
hwloc_topology_destroy(t->topo);
|
|
|
|
}
|
|
|
|
if (NULL != t->sig) {
|
|
|
|
free(t->sig);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
OBJ_CLASS_INSTANCE(orte_topology_t,
|
|
|
|
opal_object_t,
|
|
|
|
tcon, tdes);
|