2005-03-14 23:57:21 +03:00
/* -*- C -*-
*
2010-03-13 02:57:50 +03:00
* Copyright ( c ) 2004 - 2010 The Trustees of Indiana University and Indiana
2005-11-05 22:57:48 +03:00
* University Research and Technology
* Corporation . All rights reserved .
2008-02-28 08:32:23 +03:00
* Copyright ( c ) 2004 - 2008 The University of Tennessee and The University
2005-11-05 22:57:48 +03:00
* of Tennessee Research Foundation . All rights
* reserved .
2005-09-20 21:09:11 +04:00
* Copyright ( c ) 2004 - 2005 High Performance Computing Center Stuttgart ,
2005-03-14 23:57:21 +03:00
* University of Stuttgart . All rights reserved .
2005-03-24 15:43:37 +03:00
* Copyright ( c ) 2004 - 2005 The Regents of the University of California .
* All rights reserved .
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. Both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
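The new oversubscription default described in item 1 can be sketched as a small decision function. This is only an illustration under stated assumptions: the enum names, `oversubscribe_allowed`, and `placement_ok` are hypothetical and are not ORTE's actual types, flags, or functions.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical names for illustration; not ORTE's actual types or flags.
typedef enum { ALLOC_HOSTFILE, ALLOC_RESOURCE_MANAGER } alloc_source_t;
typedef enum { OVERSUB_UNSET, OVERSUB_FORCE_ALLOW, OVERSUB_FORCE_DENY } oversub_override_t;

// Decide whether a node may run more processes than it has slots.
static bool oversubscribe_allowed(alloc_source_t src, oversub_override_t user)
{
    // An explicit user flag always wins over the default.
    if (OVERSUB_FORCE_ALLOW == user) return true;
    if (OVERSUB_FORCE_DENY == user) return false;
    // Defaults: hostfile/dash-host nodes allow it, RM-allocated nodes do not.
    return ALLOC_HOSTFILE == src;
}

// Check a proposed placement of nprocs processes on a node with nslots slots.
static bool placement_ok(int nprocs, int nslots,
                         alloc_source_t src, oversub_override_t user)
{
    assert(nprocs >= 0 && nslots >= 0);
    return nprocs <= nslots || oversubscribe_allowed(src, user);
}
```

With these definitions, placing 8 processes on a 4-slot node passes for a hostfile allocation and fails for an RM allocation unless the user explicitly overrides the default.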
2011-11-15 07:40:11 +04:00
* Copyright ( c ) 2006 - 2011 Cisco Systems , Inc . All rights reserved .
2009-01-30 21:50:10 +03:00
* Copyright ( c ) 2007 - 2009 Sun Microsystems , Inc . All rights reserved .
2012-01-11 19:53:09 +04:00
* Copyright ( c ) 2007 - 2011 Los Alamos National Security , LLC . All rights
2007-06-05 07:03:59 +04:00
* reserved .
2005-03-14 23:57:21 +03:00
* $ COPYRIGHT $
2005-09-20 21:09:11 +04:00
*
2005-03-14 23:57:21 +03:00
* Additional copyrights may follow
2005-09-20 21:09:11 +04:00
*
2005-03-14 23:57:21 +03:00
* $ HEADER $
*/
# include "orte_config.h"
2008-02-28 04:57:57 +03:00
# include "orte/constants.h"
2007-07-19 23:00:06 +04:00
2009-03-13 05:10:32 +03:00
# ifdef HAVE_STRING_H
# include <string.h>
# endif
2005-03-14 23:57:21 +03:00
# include <stdio.h>
2012-01-11 19:53:09 +04:00
# ifdef HAVE_STDLIB_H
# include <stdlib.h>
# endif /* HAVE_STDLIB_H */
# ifdef HAVE_STRINGS_H
# include <strings.h>
# endif /* HAVE_STRINGS_H */
2005-03-14 23:57:21 +03:00
# ifdef HAVE_UNISTD_H
# include <unistd.h>
# endif
# ifdef HAVE_SYS_PARAM_H
# include <sys/param.h>
# endif
# include <errno.h>
# include <signal.h>
# include <ctype.h>
2005-12-18 01:05:10 +03:00
# ifdef HAVE_SYS_TYPES_H
2005-04-01 04:30:37 +04:00
# include <sys/types.h>
2005-12-18 01:05:10 +03:00
# endif /* HAVE_SYS_TYPES_H */
# ifdef HAVE_SYS_WAIT_H
2005-04-01 04:30:37 +04:00
# include <sys/wait.h>
2005-12-18 01:05:10 +03:00
# endif /* HAVE_SYS_WAIT_H */
2007-01-25 17:17:44 +03:00
# ifdef HAVE_SYS_TIME_H
# include <sys/time.h>
2007-04-01 20:16:54 +04:00
# endif /* HAVE_SYS_TIME_H */
2012-01-11 19:53:09 +04:00
# include <fcntl.h>
# ifdef HAVE_SYS_STAT_H
# include <sys/stat.h>
# endif
2005-03-14 23:57:21 +03:00
Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.
Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.
Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.
I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:
1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)
2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.
There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.
This commit was SVN r23925.
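The run-time failure mode in item 2 (zeroing an object with memset instead of constructing it) can be illustrated with a stand-in type. This is a sketch only: `demo_event_t` and its functions are hypothetical and do not reproduce `opal_event_t`, `OBJ_NEW`, or `OBJ_CONSTRUCT`. The stand-in reports an error where the real code would likely segfault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object that needs a constructor. */
typedef struct {
    bool constructed;   /* set only by the constructor */
    int *pending;       /* internal state the constructor must allocate */
} demo_event_t;

/* Constructor: allocates the internal state that memset cannot provide. */
static void demo_event_construct(demo_event_t *ev)
{
    ev->pending = calloc(1, sizeof(*ev->pending));
    ev->constructed = (NULL != ev->pending);
}

/* Destructor: releases the internal state. */
static void demo_event_destruct(demo_event_t *ev)
{
    free(ev->pending);
    ev->pending = NULL;
    ev->constructed = false;
}

/* Using the event touches internal state; refuse rather than crash. */
static int demo_event_add(demo_event_t *ev)
{
    if (!ev->constructed) {
        return -1;      /* zeroed but never constructed */
    }
    ++(*ev->pending);
    return 0;
}
```

An object that was only zeroed is rejected by `demo_event_add`, while the same object works once `demo_event_construct` has run.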
2010-10-24 22:35:54 +04:00
# include "opal/mca/event/event.h"
2007-04-21 04:15:05 +04:00
# include "opal/mca/installdirs/installdirs.h"
2011-12-03 05:10:52 +04:00
# include "opal/mca/paffinity/base/base.h"
2005-09-19 21:20:01 +04:00
# include "opal/mca/base/base.h"
2005-07-04 04:13:44 +04:00
# include "opal/util/argv.h"
2009-02-14 05:26:12 +03:00
# include "opal/util/output.h"
2005-09-19 21:20:01 +04:00
# include "opal/util/basename.h"
2005-07-04 04:13:44 +04:00
# include "opal/util/cmd_line.h"
2005-09-19 21:20:01 +04:00
# include "opal/util/opal_environ.h"
2008-02-28 04:57:57 +03:00
# include "opal/util/opal_getcwd.h"
2008-06-09 18:53:58 +04:00
# include "orte/util/show_help.h"
2008-03-07 00:36:32 +03:00
# include "opal/sys/atomic.h"
2010-03-13 02:57:50 +03:00
# if OPAL_ENABLE_FT_CR == 1
2007-03-17 02:11:45 +03:00
# include "opal/runtime/opal_cr.h"
# endif
2006-06-09 21:21:23 +04:00
# include "opal/version.h"
2007-04-21 04:15:05 +04:00
# include "opal/runtime/opal.h"
2007-07-19 23:00:06 +04:00
# include "opal/util/os_path.h"
2009-01-25 15:39:24 +03:00
# include "opal/util/path.h"
2008-02-28 08:32:23 +03:00
# include "opal/class/opal_pointer_array.h"
2008-02-28 04:57:57 +03:00
# include "opal/dss/dss.h"
2008-02-28 08:32:23 +03:00
2005-09-19 21:20:01 +04:00
# include "orte/util/proc_info.h"
2006-09-14 19:27:17 +04:00
# include "orte/util/pre_condition_transports.h"
2008-02-28 04:57:57 +03:00
# include "orte/util/session_dir.h"
2008-12-10 20:10:39 +03:00
# include "orte/util/hnp_contact.h"
2005-03-14 23:57:21 +03:00
2008-06-09 17:08:54 +04:00
# include "orte/mca/odls/odls.h"
2008-02-28 04:57:57 +03:00
# include "orte/mca/plm/plm.h"
2011-03-13 01:50:53 +03:00
# include "orte/mca/plm/base/plm_private.h"
2011-11-15 07:40:11 +04:00
# include "orte/mca/ras/ras.h"
2007-07-12 23:53:18 +04:00
# include "orte/mca/rml/rml.h"
2009-02-14 05:26:12 +03:00
# include "orte/mca/rml/rml_types.h"
2008-04-16 18:27:42 +04:00
# include "orte/mca/rml/base/rml_contact.h"
2005-09-19 21:20:01 +04:00
# include "orte/mca/errmgr/errmgr.h"
2010-04-27 02:15:57 +04:00
# include "orte/mca/errmgr/base/errmgr_private.h"
2009-05-11 18:11:44 +04:00
# include "orte/mca/grpcomm/grpcomm.h"
2005-03-14 23:57:21 +03:00
2005-09-19 21:20:01 +04:00
# include "orte/runtime/runtime.h"
2008-02-28 04:57:57 +03:00
# include "orte/runtime/orte_globals.h"
2005-09-19 21:20:01 +04:00
# include "orte/runtime/orte_wait.h"
2008-02-28 04:57:57 +03:00
# include "orte/runtime/orte_data_server.h"
Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation.
Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway.
So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked.
Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion.
This commit was SVN r17843.
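The idea of an atomic lock that lets only the first caller proceed to finalize can be sketched with C11 atomics. This is an illustration, not the commit's implementation: `finalize_guard`, `finalize_runs`, and `demo_finalize_once` are hypothetical names, and the actual locks added by the commit are not shown in this file.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical guard for illustration only. */
static atomic_flag finalize_guard = ATOMIC_FLAG_INIT;
static int finalize_runs = 0;

/* Returns true only for the first caller; later callers are turned away. */
static bool demo_finalize_once(void)
{
    /* test_and_set returns the previous value, so exactly one caller sees false. */
    if (atomic_flag_test_and_set(&finalize_guard)) {
        return false;   /* someone already started finalizing */
    }
    ++finalize_runs;    /* the one-time teardown work would go here */
    return true;
}
```

Because the test and the set happen as one atomic step, two paths racing toward finalize cannot both pass the guard.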
2008-03-17 20:58:59 +03:00
# include "orte/runtime/orte_locks.h"
2010-07-18 01:03:27 +04:00
# include "orte/runtime/orte_quit.h"
2005-03-14 23:57:21 +03:00
2007-07-12 23:53:18 +04:00
/* ensure I can behave like a daemon */
# include "orte/orted/orted.h"
2005-08-31 20:15:59 +04:00
# include "orterun.h"
2005-03-14 23:57:21 +03:00
2012-01-11 19:53:09 +04:00
/* instance the standard MPIR interfaces */
# define MPIR_MAX_PATH_LENGTH 512
# define MPIR_MAX_ARG_LENGTH 1024
struct MPIR_PROCDESC * MPIR_proctable = NULL ;
int MPIR_proctable_size = 0 ;
volatile int MPIR_being_debugged = 0 ;
volatile int MPIR_debug_state = 0 ;
int MPIR_i_am_starter = 0 ;
int MPIR_partial_attach_ok = 1 ;
char MPIR_executable_path [ MPIR_MAX_PATH_LENGTH ] ;
char MPIR_server_arguments [ MPIR_MAX_ARG_LENGTH ] ;
volatile int MPIR_forward_output = 0 ;
volatile int MPIR_forward_comm = 0 ;
char MPIR_attach_fifo [ MPIR_MAX_PATH_LENGTH ] ;
int MPIR_force_to_main = 0 ;
static void orte_debugger_dump ( void ) ;
static void orte_debugger_init_before_spawn ( orte_job_t * jdata ) ;
static void orte_debugger_init_after_spawn ( orte_job_t * jdata ) ;
static void attach_debugger ( int fd , short event , void * arg ) ;
static void build_debugger_args ( orte_app_context_t * debugger ) ;
static void open_fifo ( void ) ;
ORTE_DECLSPEC void * MPIR_Breakpoint ( void ) ;
/*
* Breakpoint function for parallel debuggers
*/
void * MPIR_Breakpoint ( void )
{
return NULL ;
}
2005-03-14 23:57:21 +03:00
/*
* Globals
*/
2010-07-15 20:33:57 +04:00
static orte_job_t * jdata = NULL ;
2005-08-08 20:42:28 +04:00
static char * * global_mca_env = NULL ;
2006-08-15 23:54:10 +04:00
static orte_std_cntr_t total_num_apps = 0 ;
2006-09-15 06:52:08 +04:00
static bool want_prefix_by_default = ( bool ) ORTE_WANT_ORTERUN_PREFIX_BY_DEFAULT ;
2008-02-28 04:57:57 +03:00
static char * ompi_server = NULL ;
2008-06-10 21:53:28 +04:00
2005-03-14 23:57:21 +03:00
/*
2007-07-10 16:53:48 +04:00
* Globals
2005-03-14 23:57:21 +03:00
*/
2008-03-06 22:35:57 +03:00
struct orterun_globals_t orterun_globals ;
static bool globals_init = false ;
2005-03-14 23:57:21 +03:00
2008-03-06 22:35:57 +03:00
static opal_cmd_line_init_t cmd_line_init [ ] = {
2005-03-14 23:57:21 +03:00
/* Various "obvious" options */
2005-09-05 00:54:19 +04:00
{ NULL, NULL, NULL, 'h', NULL, "help", 0,
2005-07-04 04:13:44 +04:00
&orterun_globals.help, OPAL_CMD_LINE_TYPE_BOOL,
2005-03-14 23:57:21 +03:00
"This help message" },
2006-06-09 21:21:23 +04:00
{ NULL, NULL, NULL, 'V', NULL, "version", 0,
&orterun_globals.version, OPAL_CMD_LINE_TYPE_BOOL,
"Print version and exit" },
2005-03-14 23:57:21 +03:00
{ NULL, NULL, NULL, 'v', NULL, "verbose", 0,
2005-07-04 04:13:44 +04:00
&orterun_globals.verbose, OPAL_CMD_LINE_TYPE_BOOL,
2005-03-14 23:57:21 +03:00
"Be verbose" },
2010-04-02 18:19:38 +04:00
{ "orte", "execute", "quiet", 'q', NULL, "quiet", 0,
2010-07-18 01:03:27 +04:00
NULL, OPAL_CMD_LINE_TYPE_BOOL,
2006-06-26 22:21:45 +04:00
"Suppress helpful messages" },
2008-12-24 18:27:46 +03:00
{ NULL, NULL, NULL, '\0', "report-pid", "report-pid", 1,
&orterun_globals.report_pid, OPAL_CMD_LINE_TYPE_STRING,
"Printout pid on stdout [-], stderr [+], or a file [anything else]" },
{ NULL, NULL, NULL, '\0', "report-uri", "report-uri", 1,
&orterun_globals.report_uri, OPAL_CMD_LINE_TYPE_STRING,
"Printout URI on stdout [-], stderr [+], or a file [anything else]" },
2010-05-12 22:11:58 +04:00
/* exit status reporting */
{ "orte", "report", "child_jobs_separately", '\0', "report-child-jobs-separately", "report-child-jobs-separately", 0,
NULL, OPAL_CMD_LINE_TYPE_BOOL,
"Return the exit status of the primary job only" },
2008-12-24 18:27:46 +03:00
2008-06-24 21:50:56 +04:00
/* hetero apps */
2011-11-01 22:43:10 +04:00
{ "orte", "hetero", "apps", '\0', NULL, "hetero-apps", 0,
2008-06-24 21:50:56 +04:00
NULL, OPAL_CMD_LINE_TYPE_BOOL,
"Indicates that multiple app_contexts are being provided that are a mix of 32/64 bit binaries" },
2008-05-29 18:11:31 +04:00
/* select XML output */
2008-08-14 22:59:01 +04:00
{ "orte", "xml", "output", '\0', "xml", "xml", 0,
2008-06-05 00:53:12 +04:00
NULL, OPAL_CMD_LINE_TYPE_BOOL,
2008-05-29 18:11:31 +04:00
"Provide all output in XML format" },
2009-09-02 22:03:10 +04:00
{ "orte", "xml", "file", '\0', "xml-file", "xml-file", 1,
NULL, OPAL_CMD_LINE_TYPE_STRING,
"Provide all output in XML format to the specified file" },
Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code:
1. completely and cleanly separates responsibilities between the HNP, orted, and tool components.
2. removes all wireup messaging during launch and shutdown.
3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol.
4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0.
5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none".
6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout.
7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output"
This is not intended for the 1.3 release as it is a major change requiring considerable soak time.
This commit was SVN r19767.
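Two of the behaviors listed above lend themselves to a short sketch: the "tag-output" line format from item 7 and the stdin target values (a rank, "all", or "none") from item 4. The helper names and the integer encoding of the stdin target are hypothetical and are not taken from the IOF code.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: prefix one line of output with "[job,rank]<stream>".
// Returns the number of characters written, or -1 if buf is too small.
static int demo_tag_line(char *buf, size_t len, int job, int rank,
                         const char *stream, const char *line)
{
    int n = snprintf(buf, len, "[%d,%d]<%s>%s", job, rank, stream, line);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

// Hypothetical encoding of the stdin target: a rank >= 0, or one of these.
enum { DEMO_STDIN_ALL = -1, DEMO_STDIN_NONE = -2, DEMO_STDIN_INVALID = -3 };

// Parse the value given to a stdin option; a missing value means rank 0.
static int demo_parse_stdin_target(const char *arg)
{
    char *end = NULL;
    long rank;

    if (NULL == arg) return 0;                       // default: rank 0
    if (0 == strcmp(arg, "all")) return DEMO_STDIN_ALL;
    if (0 == strcmp(arg, "none")) return DEMO_STDIN_NONE;
    rank = strtol(arg, &end, 10);
    if (end == arg || '\0' != *end || rank < 0 || rank > INT_MAX) {
        return DEMO_STDIN_INVALID;                   // not a usable rank
    }
    return (int)rank;
}
```

Calling `demo_tag_line` with job 1, rank 0, stream "stdout", and the text "this is output" reproduces the example string from item 7.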
2008-10-18 04:00:49 +04:00
/* tag output */
{ "orte", "tag", "output", '\0', "tag-output", "tag-output", 0,
NULL, OPAL_CMD_LINE_TYPE_BOOL,
"Tag all output with [job,rank]" },
2009-01-31 01:47:30 +03:00
{ "orte", "timestamp", "output", '\0', "timestamp-output", "timestamp-output", 0,
NULL, OPAL_CMD_LINE_TYPE_BOOL,
"Timestamp all application process output" },
{ "orte", "output", "filename", '\0', "output-filename", "output-filename", 1,
NULL, OPAL_CMD_LINE_TYPE_STRING,
"Redirect output from application processes into filename.rank" },
{ "orte", "xterm", NULL, '\0', "xterm", "xterm", 1,
NULL, OPAL_CMD_LINE_TYPE_STRING,
"Create a new xterm window and display output from the specified ranks there" },
2008-10-18 04:00:49 +04:00
/* select stdin option */
{ NULL, NULL, NULL, '\0', "stdin", "stdin", 1,
&orterun_globals.stdin_target, OPAL_CMD_LINE_TYPE_STRING,
"Specify procs to receive stdin [rank, all, none] (default: 0, indicating rank 0)" },
Per the July technical meeting:
Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways.
Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument.
For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on your cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work.
This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do.
To solve this problem, we did the following:
1. removed all mca params from the individual plms for the launch agent
2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted".
3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values
4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry.
5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices
Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct.
This commit was SVN r19097.
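Steps 3 and 4 above (split a multi-word launch agent into argv entries, then prepend the prefix to the right entry) can be sketched as follows. This is not the PLM base function from the commit: both helpers are hypothetical, and the rule that the prefix belongs on the entry named exactly "orted" is an assumption made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split a launch agent such as "valgrind orted" on
 * spaces into argv (at most max entries). Returns the number of entries.
 * The caller frees each entry. */
static int demo_split_agent(const char *agent, char **argv, int max)
{
    int argc = 0;
    const char *p = agent;

    while ('\0' != *p && argc < max) {
        size_t n;
        while (' ' == *p) ++p;              /* skip separators */
        if ('\0' == *p) break;
        n = strcspn(p, " ");                /* length of this word */
        argv[argc] = malloc(n + 1);
        if (NULL == argv[argc]) break;
        memcpy(argv[argc], p, n);
        argv[argc][n] = '\0';
        ++argc;
        p += n;
    }
    return argc;
}

/* Hypothetical helper: prepend "<prefix>/" to the daemon entry only.
 * Assumption: the daemon is the entry named exactly "orted".
 * Returns the index that was rewritten, or -1 if none was. */
static int demo_apply_prefix(char **argv, int argc, const char *prefix)
{
    for (int i = 0; i < argc; ++i) {
        if (0 == strcmp(argv[i], "orted")) {
            size_t len = strlen(prefix) + 1 + strlen(argv[i]) + 1;
            char *full = malloc(len);
            if (NULL == full) return -1;
            snprintf(full, len, "%s/%s", prefix, argv[i]);
            free(argv[i]);
            argv[i] = full;
            return i;
        }
    }
    return -1;
}
```

For the agent "valgrind orted" and prefix "foo", the entries become "valgrind" and "foo/orted", which avoids the "foo/valgrind orted" mistake described in the message.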
2008-07-30 22:26:24 +04:00
/* Specify the launch agent to be used */
2008-08-14 22:59:01 +04:00
{ "orte", "launch", "agent", '\0', "launch-agent", "launch-agent", 1,
2008-07-30 22:26:24 +04:00
NULL, OPAL_CMD_LINE_TYPE_STRING,
"Command used to start processes on remote nodes (default: orted)" },
2007-03-17 02:11:45 +03:00
/* Preload the binary on the remote machine */
{ NULL, NULL, NULL, 's', NULL, "preload-binary", 0,
&orterun_globals.preload_binary, OPAL_CMD_LINE_TYPE_BOOL,
"Preload the binary on the remote machine before starting the remote process." },
/* Preload files on the remote machine */
{ NULL, NULL, NULL, '\0', NULL, "preload-files", 1,
&orterun_globals.preload_files, OPAL_CMD_LINE_TYPE_STRING,
"Preload the comma separated list of files to the remote machines current working directory before starting the remote process." },
/* Where to Preload files on the remote machine */
{ NULL, NULL, NULL, '\0', NULL, "preload-files-dest-dir", 1,
&orterun_globals.preload_files_dest_dir, OPAL_CMD_LINE_TYPE_STRING,
"The destination directory to use in conjunction with --preload-files. By default the absolute and relative paths provided by --preload-files are used." },
A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php
Documentation:
http://osl.iu.edu/research/ft/
Major Changes:
--------------
* Added C/R-enabled Debugging support.
Enabled with the --enable-crdebug flag. See the following website for more information:
http://osl.iu.edu/research/ft/crdebug/
* Added Stable Storage (SStore) framework for checkpoint storage
* 'central' component does a direct to central storage save
* 'stage' component stages checkpoints to central storage while the application continues execution.
* 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
* 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
* Added Compression (compress) framework to support
* Add two new ErrMgr recovery policies
* {{{crmig}}} C/R Process Migration
* {{{autor}}} C/R Automatic Recovery
* Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
* Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
* {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
* {{{OMPI_CR_Restart}}}
* {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
* {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
* {{{OMPI_CR_Quiesce_start}}}
* {{{OMPI_CR_Quiesce_checkpoint}}}
* {{{OMPI_CR_Quiesce_end}}}
* {{{OMPI_CR_self_register_checkpoint_callback}}}
* {{{OMPI_CR_self_register_restart_callback}}}
* {{{OMPI_CR_self_register_continue_callback}}}
* The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
* Add a progress meter to:
* FileM rsh (filem_rsh_process_meter)
* SnapC full (snapc_full_progress_meter)
* SStore stage (sstore_stage_progress_meter)
* Added 2 new command line options to ompi-restart
* --showme : Display the full command line that would have been exec'ed.
* --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
* Deprecated some MCA params:
* crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
* snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
* snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
* snapc_base_store_in_place deprecated, replaced with different components of SStore
* snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
* snapc_base_establish_global_snapshot_dir deprecated, never well supported
* snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
Minor Changes:
--------------
* Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
* Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
* Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
* Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
* Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
* opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
* Cleanup the CRS framework and components to work with the SStore framework.
* Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
* Add 'quiesce' hook to CRCP for a future enhancement.
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
* Add optional application level INC callbacks (registered through the CR MPI Ext interface).
* Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
* {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
* {{{opal-restart}}} also supports local decompression before restarting
* {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
* {{{orte-restart}}} now uses the SStore framework to work with the metadata
* Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
* Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
* Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
* Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
* odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
* Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
* Improve the checks for 'already checkpointing' error path.
* Add a recovery output timer to show how long it takes to restart a job
* Do a better job of cleaning up the old session directory on restart.
* Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment).
* Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
This commit was SVN r23587.
The following Trac tickets were found above:
Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
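The TMPDIR item in the change list above (trac:2208) can be pictured with a small lookup helper. This is only a minimal sketch: the function name `pick_tmpdir` and the order in which the variables are consulted (`TMPDIR`, `TEMP`, `TMP`) are assumptions for illustration, not the code from this commit.

```c
#include <stdlib.h>

/* Hypothetical helper (not from the commit): return the first non-empty
 * temporary-directory variable found in the environment, falling back to
 * "/tmp" only when none is set. The variable order is an assumption. */
static const char *pick_tmpdir(void)
{
    static const char *names[] = { "TMPDIR", "TEMP", "TMP" };
    size_t i;

    for (i = 0; i < sizeof(names) / sizeof(names[0]); ++i) {
        const char *val = getenv(names[i]);
        if (NULL != val && '\0' != val[0]) {
            return val;   /* honor the user's choice */
        }
    }
    return "/tmp";        /* last resort, no longer forced */
}
```

The point of the design is that `/tmp` becomes the fallback rather than the only choice.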
2010-08-11 00:51:11 +04:00
# if OPAL_ENABLE_FT_CR == 1
/* Tell SStore to preload a snapshot before launch */
{ NULL , NULL , NULL , ' \0 ' , NULL , " sstore-load " , 1 ,
& orterun_globals . sstore_load , OPAL_CMD_LINE_TYPE_STRING ,
" Internal Use Only! Tell SStore to preload a snapshot before launch. " } ,
# endif
2005-03-14 23:57:21 +03:00
/* Use an appfile */
{ NULL , NULL , NULL , ' \0 ' , NULL , " app " , 1 ,
2005-07-04 04:13:44 +04:00
& orterun_globals . appfile , OPAL_CMD_LINE_TYPE_STRING ,
2005-03-14 23:57:21 +03:00
" Provide an appfile; ignore all other command line options " } ,
/* Number of processes; -c, -n, --n, -np, and --np are all
synonyms */
{ NULL , NULL , NULL , ' c ' , " np " , " np " , 1 ,
2006-09-25 23:41:54 +04:00
& orterun_globals . num_procs , OPAL_CMD_LINE_TYPE_INT ,
2005-03-14 23:57:21 +03:00
" Number of processes to run " } ,
{ NULL , NULL , NULL , ' \0 ' , " n " , " n " , 1 ,
2006-09-25 23:41:54 +04:00
& orterun_globals . num_procs , OPAL_CMD_LINE_TYPE_INT ,
2005-03-14 23:57:21 +03:00
" Number of processes to run " } ,
2006-07-11 01:25:33 +04:00
2005-03-14 23:57:21 +03:00
/* Set a hostfile */
2008-02-28 04:57:57 +03:00
{ NULL , NULL , NULL , ' \0 ' , " hostfile " , " hostfile " , 1 ,
2005-07-04 04:13:44 +04:00
NULL , OPAL_CMD_LINE_TYPE_STRING ,
2005-03-19 02:40:08 +03:00
" Provide a hostfile " } ,
2008-02-28 04:57:57 +03:00
{ NULL , NULL , NULL , ' \0 ' , " machinefile " , " machinefile " , 1 ,
2005-07-04 04:13:44 +04:00
NULL , OPAL_CMD_LINE_TYPE_STRING ,
2005-03-14 23:57:21 +03:00
" Provide a hostfile " } ,
2008-04-17 17:50:59 +04:00
{ " orte " , " default " , " hostfile " , ' \0 ' , " default-hostfile " , " default-hostfile " , 1 ,
2008-03-05 07:54:57 +03:00
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Provide a default hostfile " } ,
2008-04-17 17:50:59 +04:00
{ " opal " , " if " , " do_not_resolve " , ' \0 ' , " do-not-resolve " , " do-not-resolve " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Do not attempt to resolve interfaces " } ,
2008-03-05 07:54:57 +03:00
2008-02-28 04:57:57 +03:00
/* uri of Open MPI server, or at least where to get it */
{ NULL , NULL , NULL , ' \0 ' , " ompi-server " , " ompi-server " , 1 ,
& orterun_globals . ompi_server , OPAL_CMD_LINE_TYPE_STRING ,
2008-04-04 23:17:28 +04:00
" Specify the URI of the Open MPI server, or the name of the file (specified as file:filename) that contains that info " } ,
Per the July technical meeting:
During the discussion of MPI-2 functionality, it was pointed out by Aurelien that there was an inherent race condition between startup of ompi-server and mpirun. Specifically, if someone started ompi-server to run in the background as part of a script, and then immediately executed mpirun, it was possible that an MPI proc could attempt to contact the server (or that mpirun could try to read the server's contact file) before the server is running and ready.
At that time, we discussed creating a new tool "ompi-wait-server" that would wait for the server to be running, and/or probe to see if it is running and return true/false. However, rather than create yet another tool, it seemed just as effective to add the functionality to mpirun.
Thus, this commit creates two new mpirun cmd line flags (hey, you can never have too many!):
--wait-for-server : instructs mpirun to ping the server to see if it responds. This causes mpirun to execute an rml.ping to the server's URI with an appropriate timeout interval - if the ping isn't successful, mpirun attempts it again.
--server-wait-time xx : sets the ping timeout interval to xx seconds. Note that mpirun will attempt to ping the server twice with this timeout, so we actually wait for twice this time. Default is 10 seconds, which should be plenty of time.
This has only lightly been tested. It works if the server is present, and outputs a nice error message if it cannot be contacted. I have not tested the race condition case.
This commit was SVN r19152.
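The two-attempt ping described in the commit message above might look roughly like the following. This is a sketch under stated assumptions: `ping_fn`, `wait_for_server`, and the two stub callbacks are hypothetical names introduced here, and the callback merely stands in for the rml.ping call, whose real signature is not shown in this file.

```c
#include <stdbool.h>

/* Hypothetical callback type: returns true if the server answered within
 * timeout_sec seconds. It stands in for the rml.ping call. */
typedef bool (*ping_fn)(const char *uri, int timeout_sec);

/* The commit message says mpirun pings twice with the same timeout. */
#define SERVER_PING_ATTEMPTS 2

/* Try to reach the server; the worst-case wait is
 * SERVER_PING_ATTEMPTS * timeout_sec seconds. */
static bool wait_for_server(const char *uri, int timeout_sec, ping_fn ping)
{
    int attempt;

    for (attempt = 0; attempt < SERVER_PING_ATTEMPTS; ++attempt) {
        if (ping(uri, timeout_sec)) {
            return true;
        }
    }
    return false;
}

/* Stub callbacks used only to exercise the sketch. */
static int ping_calls = 0;

static bool ping_second_try(const char *uri, int timeout_sec)
{
    (void)uri; (void)timeout_sec;
    return ++ping_calls >= 2;   /* succeeds on the second attempt */
}

static bool ping_never(const char *uri, int timeout_sec)
{
    (void)uri; (void)timeout_sec;
    ++ping_calls;
    return false;               /* server never answers */
}
```

With the default of 10 seconds, a server that never answers would cost about 20 seconds before the error message appears, which matches the "twice this time" remark above.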
2008-08-05 00:29:50 +04:00
{ NULL , NULL , NULL , ' \0 ' , " wait-for-server " , " wait-for-server " , 0 ,
& orterun_globals . wait_for_server , OPAL_CMD_LINE_TYPE_BOOL ,
" If ompi-server is not already running, wait until it is detected (default: false) " } ,
{ NULL , NULL , NULL , ' \0 ' , " server-wait-time " , " server-wait-time " , 1 ,
& orterun_globals . server_wait_timeout , OPAL_CMD_LINE_TYPE_INT ,
" Time in seconds to wait for ompi-server (default: 10 sec) " } ,
2008-02-28 04:57:57 +03:00
2008-01-23 12:20:34 +03:00
{ " carto " , " file " , " path " , ' \0 ' , " cf " , " cartofile " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Provide a cartography file " } ,
2008-02-28 04:57:57 +03:00
2009-08-13 20:08:43 +04:00
{ " orte " , " rankfile " , NULL , ' \0 ' , " rf " , " rankfile " , 1 ,
2008-07-07 17:46:22 +04:00
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Provide a rankfile file " } ,
2005-03-14 23:57:21 +03:00
/* Export environment variables; potentially used multiple times,
so it does not make sense to set into a variable */
{ NULL , NULL , NULL , ' x ' , NULL , NULL , 1 ,
2005-07-04 04:13:44 +04:00
NULL , OPAL_CMD_LINE_TYPE_NULL ,
2005-03-14 23:57:21 +03:00
" Export an environment variable, optionally specifying a value (e.g., \" -x foo \" exports the environment variable foo and takes its value from the current environment; \" -x foo=bar \" exports the environment variable name foo and sets its value to \" bar \" in the started processes) " } ,
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. Both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
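The new oversubscription default from item 1 of the commit message above reduces to a small decision function. The sketch below is illustrative only: the enum, the function name, and the idea of a three-state user flag are assumptions; the actual logic lives in the rmaps framework and is not shown in this file.

```c
#include <stdbool.h>

/* Hypothetical three-state flag: what the user asked for on the command
 * line (--oversubscribe, --nooversubscribe, or neither). */
typedef enum {
    OVERSUB_UNSET,
    OVERSUB_ALLOW,
    OVERSUB_FORBID
} oversub_flag_t;

/* An explicit flag always wins; otherwise the default depends on whether
 * the nodes came from a resource manager (RM) or from a hostfile/-host. */
static bool oversubscribe_allowed(bool rm_managed, oversub_flag_t user_flag)
{
    if (OVERSUB_ALLOW == user_flag) {
        return true;
    }
    if (OVERSUB_FORBID == user_flag) {
        return false;
    }
    return !rm_managed;   /* RM-allocated: no; hostfile/dash-host: yes */
}
```

Because an explicit flag always wins, only jobs that pass neither flag see the changed default.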
2011-11-15 07:40:11 +04:00
/* Mapping controls */
2006-12-13 16:49:15 +03:00
{ " rmaps " , " base " , " display_map " , ' \0 ' , " display-map " , " display-map " , 0 ,
2006-12-03 16:59:23 +03:00
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Display the process map just before launch " } ,
2008-09-23 19:46:34 +04:00
{ " rmaps " , " base " , " display_devel_map " , ' \0 ' , " display-devel-map " , " display-devel-map " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Display a detailed process map (mostly intended for developers) just before launch " } ,
2011-10-29 19:12:45 +04:00
{ " rmaps " , " base " , " display_topo_with_map " , ' \0 ' , " display-topo " , " display-topo " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Display the topology as part of the process map (mostly intended for developers) just before launch " } ,
2011-11-03 18:22:07 +04:00
{ " rmaps " , " base " , " display_diffable_map " , ' \0 ' , " display-diffable-map " , " display-diffable-map " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Display a diffable process map (mostly intended for developers) just before launch " } ,
2008-05-29 18:11:31 +04:00
{ NULL , NULL , NULL , ' H ' , " host " , " host " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" List of hosts to invoke processes on " } ,
{ " rmaps " , " base " , " no_schedule_local " , ' \0 ' , " nolocal " , " nolocal " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Do not run any MPI applications on the local node " } ,
2011-11-15 07:40:11 +04:00
{ " rmaps " , " base " , " no_oversubscribe " , ' \0 ' , " nooversubscribe " , " nooversubscribe " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Nodes are not to be oversubscribed, even if the system supports such operation " } ,
{ " rmaps " , " base " , " oversubscribe " , ' \0 ' , " oversubscribe " , " oversubscribe " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Nodes are allowed to be oversubscribed, even on a managed system " } ,
#if 0
2009-08-30 18:30:36 +04:00
{ " rmaps " , " base " , " cpus_per_rank " , ' \0 ' , " cpus-per-proc " , " cpus-per-proc " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_INT ,
" Number of cpus to use for each process [default=1] " } ,
2009-08-11 06:51:27 +04:00
{ " rmaps " , " base " , " cpus_per_rank " , ' \0 ' , " cpus-per-rank " , " cpus-per-rank " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_INT ,
2009-08-30 18:30:36 +04:00
" Synonym for cpus-per-proc " } ,
2011-11-15 07:40:11 +04:00
# endif
/* backward compatibility */
{ " rmaps " , " base " , " bynode " , ' \0 ' , " bynode " , " bynode " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Whether to map and rank processes round-robin by node " } ,
{ " rmaps " , " base " , " byslot " , ' \0 ' , " byslot " , " byslot " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Whether to map and rank processes round-robin by slot " } ,
/* Nperxxx options that do not require topology and are always
* available - included for backwards compatibility
*/
{ " rmaps " , " ppr " , " pernode " , ' \0 ' , " pernode " , " pernode " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Launch one process per available node " } ,
{ " rmaps " , " ppr " , " n_pernode " , ' \0 ' , " npernode " , " npernode " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_INT ,
" Launch n processes per node on all allocated nodes " } ,
# if OPAL_HAVE_HWLOC
/* declare hardware threads as independent cpus */
{ " hwloc " , " base " , " use_hwthreads_as_cpus " , ' \0 ' , " use-hwthread-cpus " , " use-hwthread-cpus " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Use hardware threads as independent cpus " } ,
/* include npersocket for backwards compatibility */
{ " rmaps " , " ppr " , " n_persocket " , ' \0 ' , " npersocket " , " npersocket " , 1 ,
2009-08-11 06:51:27 +04:00
NULL , OPAL_CMD_LINE_TYPE_INT ,
" Launch n processes per socket on all allocated nodes " } ,
2011-11-15 07:40:11 +04:00
/* Mapping options */
{ " rmaps " , " base " , " mapping_policy " , ' \0 ' , NULL , " map-by " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Mapping Policy [slot (default) | hwthread | core | socket | numa | board | node] " } ,
/* Ranking options */
{ " rmaps " , " base " , " ranking_policy " , ' \0 ' , NULL , " rank-by " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Ranking Policy [slot (default) | hwthread | core | socket | numa | board | node] " } ,
/* Binding options */
{ " hwloc " , " base " , " binding_policy " , ' \0 ' , NULL , " bind-to " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Policy for binding processes [none (default) | hwthread | core | socket | numa | board] (supported qualifiers: overload-allowed,if-supported) " } ,
/* backward compatibility */
{ " hwloc " , " base " , " bind_to_core " , ' \0 ' , " bind-to-core " , " bind-to-core " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Bind processes to cores " } ,
{ " hwloc " , " base " , " bind_to_socket " , ' \0 ' , " bind-to-socket " , " bind-to-socket " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Bind processes to sockets " } ,
{ " hwloc " , " base " , " report_bindings " , ' \0 ' , " report-bindings " , " report-bindings " , 0 ,
2009-08-18 21:10:23 +04:00
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
2009-09-28 07:17:15 +04:00
" Whether to report process bindings to stderr " } ,
2009-08-11 06:51:27 +04:00
2011-11-15 07:40:11 +04:00
/* slot list option */
{ " hwloc " , " base " , " slot_list " , ' \0 ' , " slot-list " , " slot-list " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" List of processor IDs to bind processes to [default=NULL] " } ,
/* generalized pattern mapping option */
{ " rmaps " , " ppr " , " pattern " , ' \0 ' , NULL , " ppr " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Comma-separated list of number of processes on a given resource type [default: none] " } ,
# else
/* Mapping options */
{ " rmaps " , " base " , " mapping_policy " , ' \0 ' , NULL , " map-by " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Mapping Policy [slot (default) | node] " } ,
/* Ranking options */
{ " rmaps " , " base " , " ranking_policy " , ' \0 ' , NULL , " rank-by " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Ranking Policy [slot (default) | node] " } ,
# endif
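The `map-by` and `rank-by` help strings above list the accepted policy names, with a shorter list when hwloc is unavailable. A minimal sketch of how such a string could be validated follows; the enum, the table, and `parse_policy` are hypothetical names introduced for illustration, and only the policy names themselves are taken from the help strings.

```c
#include <string.h>

/* Hypothetical policy identifiers; names mirror the help strings. */
typedef enum {
    POLICY_SLOT,
    POLICY_HWTHREAD,
    POLICY_CORE,
    POLICY_SOCKET,
    POLICY_NUMA,
    POLICY_BOARD,
    POLICY_NODE,
    POLICY_INVALID
} policy_t;

/* Map a policy name to its identifier. A NULL name selects the default
 * ("slot"). Topology-based policies are rejected when have_hwloc is 0,
 * matching the shorter "[slot (default) | node]" list in the #else branch. */
static policy_t parse_policy(const char *name, int have_hwloc)
{
    static const struct {
        const char *name;
        policy_t policy;
        int needs_hwloc;
    } table[] = {
        { "slot",     POLICY_SLOT,     0 },
        { "hwthread", POLICY_HWTHREAD, 1 },
        { "core",     POLICY_CORE,     1 },
        { "socket",   POLICY_SOCKET,   1 },
        { "numa",     POLICY_NUMA,     1 },
        { "board",    POLICY_BOARD,    1 },
        { "node",     POLICY_NODE,     0 }
    };
    size_t i;

    if (NULL == name) {
        return POLICY_SLOT;
    }
    for (i = 0; i < sizeof(table) / sizeof(table[0]); ++i) {
        if (0 == strcmp(name, table[i].name)) {
            if (table[i].needs_hwloc && !have_hwloc) {
                return POLICY_INVALID;
            }
            return table[i].policy;
        }
    }
    return POLICY_INVALID;
}
```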
2008-05-29 18:11:31 +04:00
/* Allocation options */
2008-04-20 06:25:45 +04:00
{ " ras " , " base " , " display_alloc " , ' \0 ' , " display-allocation " , " display-allocation " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Display the allocation being used by this job " } ,
2008-09-23 19:46:34 +04:00
{ " ras " , " base " , " display_devel_alloc " , ' \0 ' , " display-devel-allocation " , " display-devel-allocation " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Display a detailed list (mostly intended for developers) of the allocation being used by this job " } ,
2011-11-15 07:40:11 +04:00
# if OPAL_HAVE_HWLOC
{ " hwloc " , " base " , " cpu_set " , ' \0 ' , " cpu-set " , " cpu-set " , 1 ,
2009-08-11 06:51:27 +04:00
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Comma-separated list of ranges specifying logical cpus allocated to this job [default: none] " } ,
2011-11-15 07:40:11 +04:00
# endif
{ NULL , NULL , NULL , ' H ' , " host " , " host " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" List of hosts to invoke processes on " } ,
2008-04-20 06:25:45 +04:00
2005-03-14 23:57:21 +03:00
/* mpiexec-like arguments */
{ NULL , NULL , NULL , ' \0 ' , " wdir " , " wdir " , 1 ,
2005-07-04 04:13:44 +04:00
& orterun_globals . wdir , OPAL_CMD_LINE_TYPE_STRING ,
2005-03-14 23:57:21 +03:00
" Set the working directory of the started processes " } ,
2007-05-08 23:09:32 +04:00
{ NULL , NULL , NULL , ' \0 ' , " wd " , " wd " , 1 ,
& orterun_globals . wdir , OPAL_CMD_LINE_TYPE_STRING ,
" Synonym for --wdir " } ,
2005-03-14 23:57:21 +03:00
{ NULL , NULL , NULL , ' \0 ' , " path " , " path " , 1 ,
2005-07-04 04:13:44 +04:00
& orterun_globals . path , OPAL_CMD_LINE_TYPE_STRING ,
2005-03-14 23:57:21 +03:00
" PATH to be used to look for executables to start processes " } ,
2006-07-05 00:12:35 +04:00
2005-11-20 19:06:53 +03:00
/* User-level debugger arguments */
{ NULL , NULL , NULL , ' \0 ' , " tv " , " tv " , 0 ,
& orterun_globals . debugger , OPAL_CMD_LINE_TYPE_BOOL ,
" Deprecated backwards compatibility flag; synonym for \" --debug \" " } ,
{ NULL , NULL , NULL , ' \0 ' , " debug " , " debug " , 0 ,
& orterun_globals . debugger , OPAL_CMD_LINE_TYPE_BOOL ,
" Invoke the user-level debugger indicated by the orte_base_user_debugger MCA parameter " } ,
{ " orte " , " base " , " user_debugger " , ' \0 ' , " debugger " , " debugger " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Sequence of debuggers to search for when \" --debug \" is used " } ,
2010-02-27 11:32:25 +03:00
{ " orte " , " output " , " debugger_proctable " , ' \0 ' , " output-proctable " , " output-proctable " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Output the debugger proctable after launch " } ,
2005-05-13 01:44:23 +04:00
/* OpenRTE arguments */
2008-08-14 22:59:01 +04:00
{ " orte " , " debug " , NULL , ' d ' , " debug-devel " , " debug-devel " , 0 ,
2005-07-04 04:13:44 +04:00
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
2005-05-13 01:44:23 +04:00
" Enable debugging of OpenRTE " } ,
2006-10-11 19:18:57 +04:00
2008-08-14 22:59:01 +04:00
{ " orte " , " debug " , " daemons " , ' \0 ' , " debug-daemons " , " debug-daemons " , 0 ,
2005-07-04 04:13:44 +04:00
NULL , OPAL_CMD_LINE_TYPE_INT ,
2005-05-13 01:44:23 +04:00
" Enable debugging of any OpenRTE daemons used by this application " } ,
2006-10-11 19:18:57 +04:00
2008-08-14 22:59:01 +04:00
{ " orte " , " debug " , " daemons_file " , ' \0 ' , " debug-daemons-file " , " debug-daemons-file " , 0 ,
2005-07-04 04:13:44 +04:00
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
2005-05-13 01:44:23 +04:00
" Enable debugging of any OpenRTE daemons used by this application, storing output in files " } ,
2006-10-11 19:18:57 +04:00
2008-08-14 22:59:01 +04:00
{ " orte " , " leave " , " session_attached " , ' \0 ' , " leave-session-attached " , " leave-session-attached " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Do not detach OpenRTE daemons used by this application " } ,
{ NULL , NULL , NULL , ' \0 ' , " tmpdir " , " tmpdir " , 1 ,
2009-03-06 00:56:03 +03:00
& orte_process_info . tmpdir_base , OPAL_CMD_LINE_TYPE_STRING ,
2005-05-13 01:44:23 +04:00
" Set the root for the session directory tree for orterun ONLY " } ,
2008-08-14 22:59:01 +04:00
{ " orte " , " do_not " , " launch " , ' \0 ' , " do-not-launch " , " do-not-launch " , 0 ,
2008-04-17 17:50:59 +04:00
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Perform all necessary operations to prepare to launch the application, but do not actually launch it " } ,
2006-12-13 07:51:38 +03:00
2006-02-28 14:52:12 +03:00
{ NULL , NULL , NULL , ' \0 ' , NULL , " prefix " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Prefix where Open MPI is installed on remote nodes " } ,
2006-10-06 17:02:56 +04:00
{ NULL , NULL , NULL , ' \0 ' , NULL , " noprefix " , 0 ,
2006-09-15 06:52:08 +04:00
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Disable automatic --prefix behavior " } ,
2006-03-23 19:53:11 +03:00
2009-06-03 03:52:59 +04:00
{ " orte " , " report " , " launch_progress " , ' \0 ' , " show-progress " , " show-progress " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Output a brief periodic report on launch progress " } ,
2009-06-24 00:25:38 +04:00
{ " orte " , " use " , " regexp " , ' \0 ' , " use-regexp " , " use-regexp " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Use regular expressions for launch " } ,
2009-09-09 09:28:45 +04:00
{ " orte " , " report " , " events " , ' \0 ' , " report-events " , " report-events " , 1 ,
NULL , OPAL_CMD_LINE_TYPE_STRING ,
" Report events to a tool listening at the specified URI " } ,
2010-04-28 08:06:57 +04:00
{ " orte " , " enable " , " recovery " , ' \0 ' , " enable-recovery " , " enable-recovery " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Enable recovery from process failure [Default = disabled] " } ,
2011-02-14 23:49:12 +03:00
{ " orte " , " max " , " restarts " , ' \0 ' , " max-restarts " , " max-restarts " , 1 ,
2010-04-28 08:06:57 +04:00
NULL , OPAL_CMD_LINE_TYPE_INT ,
2011-02-14 23:49:12 +03:00
" Max number of times to restart a failed process " } ,
2010-04-28 08:06:57 +04:00
2011-11-15 07:40:11 +04:00
# if OPAL_HAVE_HWLOC
2011-11-01 22:43:10 +04:00
{ " orte " , " hetero " , " nodes " , ' \0 ' , NULL , " hetero-nodes " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Nodes in cluster may differ in topology, so send the topology back from each node [Default = false] " } ,
2011-11-15 07:40:11 +04:00
# endif
2011-11-01 22:43:10 +04:00
A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php
Documentation:
http://osl.iu.edu/research/ft/
Major Changes:
--------------
* Added C/R-enabled Debugging support.
Enabled with the --enable-crdebug flag. See the following website for more information:
http://osl.iu.edu/research/ft/crdebug/
* Added Stable Storage (SStore) framework for checkpoint storage
* 'central' component does a direct to central storage save
* 'stage' component stages checkpoints to central storage while the application continues execution.
* 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
* 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
* Added Compression (compress) framework to support
* Add two new ErrMgr recovery policies
* {{{crmig}}} C/R Process Migration
* {{{autor}}} C/R Automatic Recovery
* Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
* Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
* {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
* {{{OMPI_CR_Restart}}}
* {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
* {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
* {{{OMPI_CR_Quiesce_start}}}
* {{{OMPI_CR_Quiesce_checkpoint}}}
* {{{OMPI_CR_Quiesce_end}}}
* {{{OMPI_CR_self_register_checkpoint_callback}}}
* {{{OMPI_CR_self_register_restart_callback}}}
* {{{OMPI_CR_self_register_continue_callback}}}
* The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
* Add a progress meter to:
* FileM rsh (filem_rsh_process_meter)
* SnapC full (snapc_full_progress_meter)
* SStore stage (sstore_stage_progress_meter)
* Added 2 new command line options to ompi-restart
* --showme : Display the full command line that would have been exec'ed.
* --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
* Deprecated some MCA params:
* crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
* snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
* snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
* snapc_base_store_in_place deprecated, replaced with different components of SStore
* snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
* snapc_base_establish_global_snapshot_dir deprecated, never well supported
* snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
Minor Changes:
--------------
* Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
* Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
* Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
* Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
* Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
* opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
* Cleanup the CRS framework and components to work with the SStore framework.
* Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
* Add 'quiesce' hook to CRCP for a future enhancement.
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
* Add optional application level INC callbacks (registered through the CR MPI Ext interface).
* Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
* {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
* {{{opal-restart}}} also supports local decompression before restarting
* {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
* {{{orte-restart}}} now uses the SStore framework to work with the metadata
* Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
* Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
* Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
* Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
* odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
* Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
* Improve the checks for 'already checkpointing' error path.
* Add a recovery output timer, to show how long it takes to restart a job
* Do a better job of cleaning up the old session directory on restart.
* Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment)
* Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
This commit was SVN r23587.
The following Trac tickets were found above:
Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
2010-08-11 00:51:11 +04:00
# if OPAL_ENABLE_CRDEBUG == 1
{ " opal " , " cr " , " enable_crdebug " , ' \0 ' , " crdebug " , " crdebug " , 0 ,
NULL , OPAL_CMD_LINE_TYPE_BOOL ,
" Enable C/R Debugging " } ,
# endif
2010-07-01 23:31:11 +04:00
{ NULL , NULL , NULL , ' \0 ' , " disable-recovery " , " disable-recovery " , 0 ,
& orterun_globals . disable_recovery , OPAL_CMD_LINE_TYPE_BOOL ,
" Disable recovery (resets all recovery options to off) " } ,
2005-03-14 23:57:21 +03:00
/* End of list */
{ NULL , NULL , NULL , ' \0 ' , NULL , NULL , 0 ,
2005-07-04 04:13:44 +04:00
NULL , OPAL_CMD_LINE_TYPE_NULL , NULL }
2005-03-14 23:57:21 +03:00
} ;
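The table above is terminated by an all-NULL entry ("End of list") rather than by a stored length. As a rough illustration of how a sentinel-terminated option table can be walked, here is a minimal sketch. The `demo_*` names and the reduced field set are hypothetical and are not the OPAL command-line API; the three sample entries borrow long names and descriptions from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for one option-table entry. */
typedef enum {
    DEMO_TYPE_NULL,
    DEMO_TYPE_BOOL,
    DEMO_TYPE_INT,
    DEMO_TYPE_STRING
} demo_type_t;

typedef struct {
    const char *long_name;    /* NULL only in the sentinel entry */
    int num_params;           /* number of arguments the option takes */
    demo_type_t type;
    const char *description;
} demo_opt_t;

static const demo_opt_t demo_table[] = {
    { "debug", 0, DEMO_TYPE_BOOL, "Invoke the user-level debugger" },
    { "tmpdir", 1, DEMO_TYPE_STRING, "Set the root for the session directory tree" },
    { "max-restarts", 1, DEMO_TYPE_INT, "Max number of times to restart a failed process" },
    /* End of list */
    { NULL, 0, DEMO_TYPE_NULL, NULL }
};

/* Count the entries that precede the sentinel. */
static size_t demo_count(const demo_opt_t *table)
{
    size_t n = 0;

    assert(NULL != table);
    while (DEMO_TYPE_NULL != table[n].type) {
        ++n;
    }
    return n;
}

/* Return the entry whose long name matches, or NULL if there is none. */
static const demo_opt_t *demo_find_long(const demo_opt_t *table, const char *name)
{
    assert(NULL != table && NULL != name);
    for (; DEMO_TYPE_NULL != table->type; ++table) {
        if (NULL != table->long_name && 0 == strcmp(table->long_name, name)) {
            return table;
        }
    }
    return NULL;
}
```

The sentinel keeps the table self-describing: adding an option only requires a new row above the "End of list" entry, with no separate count to update.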
/*
* Local functions
*/
static int create_app ( int argc , char * argv [ ] , orte_app_context_t * * app ,
2005-08-08 20:42:28 +04:00
bool * made_app , char * * * app_env ) ;
2005-03-14 23:57:21 +03:00
static int init_globals ( void ) ;
2007-06-27 05:03:31 +04:00
static int parse_globals ( int argc , char * argv [ ] , opal_cmd_line_t * cmd_line ) ;
2005-03-14 23:57:21 +03:00
static int parse_locals ( int argc , char * argv [ ] ) ;
While waiting for fortran compiles...
Fixes for orterun in handling different MCA params for different
processes (reviewed by Brian):
- By design, if you run the following:
mpirun --mca foo aaa --mca foo bbb a.out
a.out will get a single MCA param for foo with value "aaa,bbb".
- However, if you specify multiple apps with different values for the
same MCA param, you should expect to get the different values for
each app. For example:
mpirun --mca foo aaa a.out : --mca foo bbb b.out
Should yield a.out with a "foo" param with value "aaa" and b.out
with a "foo" param with a value "bbb".
- This did not work -- both a.out and b.out would get a "foo" with
"aaa,bbb".
- This commit fixes this behavior -- now a.out will get aaa and b.out
will get bbb.
- Additionally, if you mix --mca and an app file, you can have
"global" params and per-line-in-the-appfile params. For example:
mpirun --mca foo zzzz --app appfile
where "appfile" contains:
-np 1 --mca bar aaa a.out
-np 1 --mca bar bbb b.out
In this case, a.out will get foo=zzzz and bar=aaa, and b.out will
get foo=zzzz and bar=bbb.
Spiffy.
Ok, fortran build is done... back to Fortran... sigh...
This commit was SVN r5710.
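The per-app behavior described in this commit message can be sketched as follows. This is only an illustration of the stated semantics (repeated values within one app are joined with commas, and apps separated by ":" keep their own values); `demo_mca_for_app` is a hypothetical helper, not the code orterun uses, and it ignores app files and "global" params.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: gather the values given as "--mca <key> <value>"
 * for app number app_idx (0-based), where apps are separated by ":" on
 * the command line. Repeated values within one app are joined with
 * commas. Returns 0 on success, -1 if out is too small. */
static int demo_mca_for_app(int argc, char *argv[], int app_idx,
                            const char *key, char *out, size_t out_len)
{
    int cur = 0;    /* index of the app currently being scanned */
    int i;

    assert(NULL != argv && NULL != key && NULL != out && out_len > 0);
    out[0] = '\0';
    for (i = 1; i < argc; ++i) {
        if (0 == strcmp(argv[i], ":")) {
            ++cur;              /* the next app starts here */
            continue;
        }
        if (0 == strcmp(argv[i], "--mca") && i + 2 < argc) {
            if (cur == app_idx && 0 == strcmp(argv[i + 1], key)) {
                /* room for the value, an optional comma, and the NUL */
                size_t need = strlen(out) + strlen(argv[i + 2]) + 2;
                if (need > out_len) {
                    return -1;
                }
                if ('\0' != out[0]) {
                    strcat(out, ",");
                }
                strcat(out, argv[i + 2]);
            }
            i += 2;             /* skip the key and the value */
        }
    }
    return 0;
}
```

With the two command lines from the message above, app 0 of the first yields "aaa,bbb" for foo, while the second yields "aaa" for app 0 and "bbb" for app 1.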
2005-05-13 18:36:36 +04:00
static int parse_appfile ( char * filename , char * * * env ) ;
2010-07-07 03:35:42 +04:00
static void run_debugger ( char * basename , opal_cmd_line_t * cmd_line ,
2010-08-31 16:21:13 +04:00
int argc , char * argv [ ] , int num_procs ) __opal_attribute_noreturn__ ;
2005-03-14 23:57:21 +03:00
2005-08-31 20:15:59 +04:00
int orterun ( int argc , char * argv [ ] )
2005-03-14 23:57:21 +03:00
{
2008-02-28 04:57:57 +03:00
int rc ;
2007-06-27 05:03:31 +04:00
opal_cmd_line_t cmd_line ;
2008-04-23 04:17:12 +04:00
char * tmp_env_var = NULL ;
2011-03-13 01:50:53 +03:00
orte_job_t * daemons ;
Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order.
Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use.
Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact.
Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code.
This commit was SVN r25551.
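The reordering described in this commit message (nodes kept in daemon-vpid order rather than allocation order) can be pictured with a small sort. This is a sketch under assumptions: `demo_node_t` and its fields are hypothetical stand-ins, and the real mapper works on ORTE's own node and daemon objects rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node record: a name plus the vpid of the daemon that
 * reported in from that node. */
typedef struct {
    const char *name;
    uint32_t daemon_vpid;
} demo_node_t;

/* qsort comparator: ascending daemon vpid, written to avoid the
 * overflow that plain subtraction of unsigned values could cause. */
static int demo_cmp_vpid(const void *a, const void *b)
{
    const demo_node_t *na = (const demo_node_t *) a;
    const demo_node_t *nb = (const demo_node_t *) b;

    return (na->daemon_vpid > nb->daemon_vpid) -
           (na->daemon_vpid < nb->daemon_vpid);
}

/* Reorder the node list so that it follows daemon-vpid order. */
static void demo_order_by_daemon(demo_node_t *nodes, size_t n)
{
    assert(NULL != nodes || 0 == n);
    if (n > 1) {
        qsort(nodes, n, sizeof(nodes[0]), demo_cmp_vpid);
    }
}
```

In an environment that already launches daemons in allocation order, the list is already sorted and the call leaves it unchanged.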
2011-11-30 23:58:24 +04:00
int32_t ljob ;
2011-11-15 07:40:11 +04:00
orte_app_context_t * app , * dapp ;
2005-03-14 23:57:21 +03:00
2007-06-27 05:03:31 +04:00
/* find our basename (the name of the executable) so that we can
   use it in pretty-print error messages */
2010-07-18 01:03:27 +04:00
orte_basename = opal_basename ( argv [ 0 ] ) ;
2007-04-21 04:15:05 +04:00
2007-06-27 05:03:31 +04:00
/* Setup and parse the command line */
init_globals ( ) ;
opal_cmd_line_create ( & cmd_line , cmd_line_init ) ;
mca_base_cmd_line_setup ( & cmd_line ) ;
2008-02-28 04:57:57 +03:00
if ( ORTE_SUCCESS ! = ( rc = opal_cmd_line_parse ( & cmd_line , true ,
2011-03-13 01:50:53 +03:00
argc , argv ) ) ) {
2008-02-28 04:57:57 +03:00
return rc ;
2007-06-27 05:03:31 +04:00
}
2007-04-21 04:15:05 +04:00
2008-12-10 02:49:02 +03:00
/*
* Since this process can now handle MCA / GMCA parameters , make sure to
* process them .
*/
mca_base_cmd_line_process_args ( & cmd_line , & environ , & environ ) ;
/* Ensure that enough of OPAL is setup for us to be able to run */
/*
 * NOTE: (JJH)
 * We need to allow 'mca_base_cmd_line_process_args()' to process command
 * line arguments *before* calling opal_init_util() since the command
 * line could contain MCA parameters that affect the way opal_init_util()
 * functions. AMCA parameters are one such option normally received on the
 * command line that affect the way opal_init_util() behaves.
 * It is "safe" to call mca_base_cmd_line_process_args() before
 * opal_init_util() since mca_base_cmd_line_process_args() does *not*
 * depend upon opal_init_util() functionality.
 */
2007-06-27 05:03:31 +04:00
/* Need to initialize OPAL so that install_dirs are filled in */
2009-12-04 03:51:15 +03:00
if ( OPAL_SUCCESS ! = opal_init_util ( & argc , & argv ) ) {
2008-05-19 15:58:48 +04:00
exit ( 1 ) ;
}
2007-07-13 19:47:57 +04:00
2009-05-04 15:07:40 +04:00
/* flag that I am the HNP - needs to be done prior to
* registering params
*/
orte_process_info . proc_type = ORTE_PROC_HNP ;
2007-04-21 04:15:05 +04:00
/* Setup MCA params */
2008-06-19 06:58:14 +04:00
orte_register_params ( ) ;
2008-06-24 21:50:56 +04:00
2011-12-03 05:10:52 +04:00
/*** NOTIFY IF DEPRECATED OPAL_PAFFINITY_ALONE WAS SET ***/
if ( opal_paffinity_alone ) {
orte_show_help ( " help-opal-runtime.txt " ,
" opal_paffinity_alone:deprecated " ,
true ) ;
}
2005-03-14 23:57:21 +03:00
/* Check for some "global" command line params */
2007-06-27 05:03:31 +04:00
parse_globals ( argc , argv , & cmd_line ) ;
OBJ_DESTRUCT ( & cmd_line ) ;
2005-03-14 23:57:21 +03:00
2008-02-28 04:57:57 +03:00
/* create a new job object to hold the info for this one - the
* jobid field will be filled in by the PLM when the job is
* launched
*/
jdata = OBJ_NEW ( orte_job_t ) ;
if ( NULL = = jdata ) {
2011-11-15 07:40:11 +04:00
/* cannot call ORTE_ERROR_LOG as the errmgr
 * hasn't been loaded yet!
 */
2008-02-28 04:57:57 +03:00
return ORTE_ERR_OUT_OF_RESOURCE ;
2005-07-08 22:48:25 +04:00
}
2009-08-11 06:51:27 +04:00
Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code:
1. completely and cleanly separates responsibilities between the HNP, orted, and tool components.
2. removes all wireup messaging during launch and shutdown.
3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol.
4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0.
5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none".
6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout.
7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output"
This is not intended for the 1.3 release as it is a major change requiring considerable soak time.
This commit was SVN r19767.
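Item 4 above allows a rank, "all", or "none" as the stdin recipient. A minimal sketch of parsing such a value with input validation is shown here. The `DEMO_VPID_*` constants are hypothetical placeholders, since the actual values of ORTE_VPID_WILDCARD and ORTE_VPID_INVALID are not shown in this file, and `demo_parse_stdin_target` is an illustrative helper rather than part of orterun.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical placeholder values, not the real ORTE constants. */
#define DEMO_VPID_WILDCARD (UINT32_MAX - 1)
#define DEMO_VPID_INVALID  UINT32_MAX

/* Parse "all", "none", or a decimal rank into *target.
 * Returns 0 on success, -1 if the string is not a valid target. */
static int demo_parse_stdin_target(const char *s, uint32_t *target)
{
    char *end = NULL;
    unsigned long v;

    assert(NULL != s && NULL != target);
    if (0 == strcmp(s, "all")) {
        *target = DEMO_VPID_WILDCARD;
        return 0;
    }
    if (0 == strcmp(s, "none")) {
        *target = DEMO_VPID_INVALID;
        return 0;
    }
    /* strtoul() accepts leading whitespace and signs; reject them here */
    if (s[0] < '0' || s[0] > '9') {
        return -1;
    }
    errno = 0;
    v = strtoul(s, &end, 10);
    if (0 != errno || '\0' != *end || v >= DEMO_VPID_WILDCARD) {
        return -1;      /* overflow, trailing junk, or a reserved value */
    }
    *target = (uint32_t) v;
    return 0;
}
```

Checking `end` and `errno` matters because a bare `strtoul(s, NULL, 10)` silently turns a mistyped value such as "al" into rank 0.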
2008-10-18 04:00:49 +04:00
/* check what user wants us to do with stdin */
if ( 0 = = strcmp ( orterun_globals . stdin_target , " all " ) ) {
jdata - > stdin_target = ORTE_VPID_WILDCARD ;
} else if ( 0 = = strcmp ( orterun_globals . stdin_target , " none " ) ) {
jdata - > stdin_target = ORTE_VPID_INVALID ;
} else {
jdata - > stdin_target = strtoul ( orterun_globals . stdin_target , NULL , 10 ) ;
}
2008-02-28 04:57:57 +03:00
/* Parse each app, adding it to the job object */
parse_locals ( argc , argv ) ;
if ( 0 = = jdata - > num_apps ) {
2005-07-08 22:48:25 +04:00
/* This should never happen -- this case should be caught in
2011-03-13 01:50:53 +03:00
create_app(), but let's just double check... */
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done so that configure.m4 searches for an appropriate XML
library, links it in, and uses it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
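The duplicate-suppression behavior described for orte_show_help() (show a help message once, then only count later copies and report the count periodically) can be sketched as below. All `demo_*` names are hypothetical; this is not the HNP's implementation, and the roughly 5 second timer is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_TOPICS 16

/* One tracked help message, identified by its file and topic. */
typedef struct {
    const char *file;
    const char *topic;
    int pending;        /* duplicates seen since the last flush */
} demo_help_entry_t;

typedef struct {
    demo_help_entry_t entries[DEMO_MAX_TOPICS];
    size_t count;
} demo_help_agg_t;

static void demo_help_init(demo_help_agg_t *agg)
{
    assert(NULL != agg);
    agg->count = 0;
}

/* Record one incoming help message. Returns 1 if it is new and should
 * be displayed now, 0 if it was only counted as a duplicate, and -1 if
 * the table is full. The strings must outlive the aggregator. */
static int demo_help_record(demo_help_agg_t *agg, const char *file,
                            const char *topic)
{
    size_t i;

    assert(NULL != agg && NULL != file && NULL != topic);
    for (i = 0; i < agg->count; ++i) {
        if (0 == strcmp(agg->entries[i].file, file) &&
            0 == strcmp(agg->entries[i].topic, topic)) {
            agg->entries[i].pending++;
            return 0;
        }
    }
    if (DEMO_MAX_TOPICS == agg->count) {
        return -1;
    }
    agg->entries[agg->count].file = file;
    agg->entries[agg->count].topic = topic;
    agg->entries[agg->count].pending = 0;
    agg->count++;
    return 1;
}

/* Meant to be called from a periodic timer: returns how many new
 * copies of entry idx arrived since the last flush and resets the
 * counter. */
static int demo_help_flush(demo_help_agg_t *agg, size_t idx)
{
    int n;

    assert(NULL != agg && idx < agg->count);
    n = agg->entries[idx].pending;
    agg->entries[idx].pending = 0;
    return n;
}
```

A caller would print the rendered message when `demo_help_record()` returns 1, and print an "I got X new copies" style line whenever `demo_help_flush()` returns a nonzero count.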
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:nothing-to-do " ,
2010-07-18 01:03:27 +04:00
true , orte_basename ) ;
2008-03-05 04:46:30 +03:00
exit ( ORTE_ERROR_DEFAULT_EXIT_CODE ) ;
2005-04-16 01:52:58 +04:00
}
2005-03-14 23:57:21 +03:00
2008-07-09 02:36:39 +04:00
/* save the environment for launch purposes. This MUST be
* done so that we can pass it to any local procs we
* spawn - otherwise, those local procs won't see any
* non-MCA envars that were set in the environment prior to
* calling orterun
*/
orte_launch_environ = opal_argv_copy(environ);
2010-04-15 22:10:50 +04:00
/* purge an ess flag set externally */
opal_unsetenv("OMPI_MCA_ess", &orte_launch_environ);
2010-03-13 02:57:50 +03:00
#if OPAL_ENABLE_FT_CR == 1
2007-03-17 02:11:45 +03:00
/* Disable OPAL CR notifications for this tool */
opal_cr_set_enabled(false);
2008-04-23 04:17:12 +04:00
tmp_env_var = mca_base_param_env_var("opal_cr_is_tool");
opal_setenv(tmp_env_var,
2007-10-17 17:47:36 +04:00
"1",
true, &environ);
2008-04-23 04:17:12 +04:00
free(tmp_env_var);
2010-03-13 02:57:50 +03:00
#endif
2010-03-23 22:55:21 +03:00
tmp_env_var = NULL; /* Silence compiler warning */
2010-03-13 02:57:50 +03:00
2008-02-28 04:57:57 +03:00
/* Initialize our Open RTE environment
* Set the flag telling orte_init that I am NOT a
2005-06-24 20:59:37 +04:00
* singleton, but am "infrastructure" - prevents setting
* up incorrect infrastructure that only a singleton would
* require
*/
2009-12-04 03:51:15 +03:00
if (ORTE_SUCCESS != (rc = orte_init(&argc, &argv, ORTE_PROC_HNP))) {
2011-11-15 07:40:11 +04:00
/* cannot call ORTE_ERROR_LOG as it could be the errmgr
* never got loaded!
*/
2011-11-23 01:24:35 +04:00
fprintf(stderr, "FAILED ORTE INIT\n");
2005-03-14 23:57:21 +03:00
return rc;
2011-03-10 03:42:28 +03:00
}
2011-07-12 21:07:41 +04:00
/* finalize the OPAL utils. As they are opened again from orte_init->opal_init
* we continue to have a reference count on them. So we have to finalize them twice...
*/
opal_finalize_util();
2009-09-02 22:03:10 +04:00
2011-11-15 07:40:11 +04:00
/* get the daemon job object */
daemons = orte_get_job_data_object(ORTE_PROC_MY_NAME->jobid);
2008-12-10 20:10:39 +03:00
/* check for request to report uri */
2008-12-24 18:27:46 +03:00
if (NULL != orterun_globals.report_uri) {
FILE *fp;
char *rml_uri;
rml_uri = orte_rml.get_contact_info();
if (0 == strcmp(orterun_globals.report_uri, "-")) {
/* if '-', then output to stdout */
printf("%s\n", (NULL == rml_uri) ? "NULL" : rml_uri);
} else if (0 == strcmp(orterun_globals.report_uri, "+")) {
/* if '+', output to stderr */
fprintf(stderr, "%s\n", (NULL == rml_uri) ? "NULL" : rml_uri);
} else {
fp = fopen(orterun_globals.report_uri, "w");
if (NULL == fp) {
orte_show_help("help-orterun.txt", "orterun:write_file", false,
2010-07-18 01:03:27 +04:00
orte_basename, "uri", orterun_globals.report_uri);
2008-12-24 18:27:46 +03:00
exit(0);
}
fprintf(fp, "%s\n", (NULL == rml_uri) ? "NULL" : rml_uri);
fclose(fp);
2008-12-10 20:10:39 +03:00
}
2008-12-24 18:27:46 +03:00
if (NULL != rml_uri) {
free(rml_uri);
}
2008-12-10 20:10:39 +03:00
}
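The report-uri handling above picks one of three destinations from a single string. A stand-alone sketch of that selection, outside of ORTE, might look like the following; open_report_stream() and its must_close flag are hypothetical names introduced here, not part of orterun.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: choose the output stream for a report-uri style
 * argument. "-" selects stdout, "+" selects stderr, and anything else is
 * treated as a file name to be opened for writing.
 * *must_close is set to 1 only when a file was actually opened, so the
 * caller knows whether fclose() is required. Returns NULL if the file
 * could not be opened. */
static FILE *open_report_stream(const char *dest, int *must_close)
{
    FILE *fp;

    *must_close = 0;
    if (0 == strcmp(dest, "-")) {
        return stdout;
    }
    if (0 == strcmp(dest, "+")) {
        return stderr;
    }
    fp = fopen(dest, "w");
    *must_close = (NULL != fp);
    return fp;
}
```

Folding the three cases into one helper leaves a single fprintf() call at the use site, at the cost of an extra flag to track ownership of the stream.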
2008-10-24 05:42:58 +04:00
/* Change the default behavior of libevent such that we want to
2011-03-13 01:50:53 +03:00
continually block rather than blocking for the default timeout
and then looping around the progress engine again. There
should be nothing in the orted that cannot block in libevent
until "something" happens (i.e., there's no need to keep
cycling through progress because the only things that should
happen will happen in libevent). This is a minor optimization,
but what the heck... :-) */
2008-10-24 05:42:58 +04:00
opal_progress_set_event_flag(OPAL_EVLOOP_ONCE);
2007-07-19 23:00:06 +04:00
/* If we have a prefix, then modify the PATH and
2011-03-13 01:50:53 +03:00
LD_LIBRARY_PATH environment variables in our copy. This
will ensure that any locally-spawned children will
have our executables and libraries in their path
2007-07-19 23:00:06 +04:00
2011-03-13 01:50:53 +03:00
For now, default to the prefix_dir provided in the first app_context.
Since there always MUST be at least one app_context, we are safe in
doing this.
2008-02-28 04:57:57 +03:00
*/
2011-11-15 07:40:11 +04:00
if (NULL != (app = (orte_app_context_t*)opal_pointer_array_get_item(jdata->apps, 0)) &&
NULL != app->prefix_dir) {
2007-07-19 23:00:06 +04:00
char *oldenv, *newenv, *lib_base, *bin_base;
2011-11-15 07:40:11 +04:00
/* copy the prefix into the daemon job so that any launcher
* can find the orteds when we launch the virtual machine
*/
if (NULL == (dapp = (orte_app_context_t*)opal_pointer_array_get_item(daemons->apps, 0))) {
/* that's an error in the ess */
ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
return ORTE_ERR_NOT_FOUND;
}
dapp->prefix_dir = strdup(app->prefix_dir);
2007-07-19 23:00:06 +04:00
lib_base = opal_basename(opal_install_dirs.libdir);
bin_base = opal_basename(opal_install_dirs.bindir);
/* Reset PATH */
2011-11-15 07:40:11 +04:00
newenv = opal_os_path(false, app->prefix_dir, bin_base, NULL);
2007-07-19 23:00:06 +04:00
oldenv = getenv("PATH");
if (NULL != oldenv) {
char *temp;
asprintf(&temp, "%s:%s", newenv, oldenv);
free(newenv);
newenv = temp;
}
opal_setenv("PATH", newenv, true, &orte_launch_environ);
if (orte_debug_flag) {
2010-07-18 01:03:27 +04:00
opal_output(0, "%s: reset PATH: %s", orte_basename, newenv);
2007-07-19 23:00:06 +04:00
}
free(newenv);
free(bin_base);
/* Reset LD_LIBRARY_PATH */
2011-11-15 07:40:11 +04:00
newenv = opal_os_path(false, app->prefix_dir, lib_base, NULL);
2007-07-19 23:00:06 +04:00
oldenv = getenv("LD_LIBRARY_PATH");
if (NULL != oldenv) {
char *temp;
asprintf(&temp, "%s:%s", newenv, oldenv);
free(newenv);
newenv = temp;
}
opal_setenv("LD_LIBRARY_PATH", newenv, true, &orte_launch_environ);
if (orte_debug_flag) {
2008-06-09 18:53:58 +04:00
opal_output(0, "%s: reset LD_LIBRARY_PATH: %s",
2010-07-18 01:03:27 +04:00
orte_basename, newenv);
2007-07-19 23:00:06 +04:00
}
free(newenv);
free(lib_base);
}
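The PATH and LD_LIBRARY_PATH resets above follow the same pattern: build "prefix/subdir", then prepend it to the old value if there is one. A minimal portable sketch of that pattern, assuming plain libc instead of opal_os_path() and the non-standard asprintf(), could be written as follows. The name prepend_prefix_dir() is hypothetical and does not exist in OPAL.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build "<prefix>/<subdir>" and, when oldval is not
 * NULL, append ":<oldval>" so the prefix directory is searched first.
 * Returns a malloc'ed string that the caller must free, or NULL if the
 * allocation fails. */
static char *prepend_prefix_dir(const char *prefix, const char *subdir,
                                const char *oldval)
{
    /* prefix + '/' + subdir + terminating NUL */
    size_t len = strlen(prefix) + 1 + strlen(subdir) + 1;
    char *result;

    if (NULL != oldval) {
        len += 1 + strlen(oldval);   /* ':' + old value */
    }
    result = malloc(len);
    if (NULL == result) {
        return NULL;
    }
    if (NULL != oldval) {
        snprintf(result, len, "%s/%s:%s", prefix, subdir, oldval);
    } else {
        snprintf(result, len, "%s/%s", prefix, subdir);
    }
    return result;
}
```

Unlike the two near-identical blocks in the source, a single helper also makes the allocation failure case explicit, which the unchecked asprintf() calls do not.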
2006-09-14 19:27:17 +04:00
/* pre-condition any network transports that require it */
2008-02-28 04:57:57 +03:00
if (ORTE_SUCCESS != (rc = orte_pre_condition_transports(jdata))) {
2006-09-14 19:27:17 +04:00
ORTE_ERROR_LOG(rc);
2008-05-14 00:00:55 +04:00
orte_show_help("help-orterun.txt", "orterun:precondition", false,
2010-07-18 01:03:27 +04:00
orte_basename, NULL, NULL, rc);
2009-02-25 06:10:21 +03:00
ORTE_UPDATE_EXIT_STATUS(ORTE_ERROR_DEFAULT_EXIT_CODE);
goto DONE;
2006-09-14 19:27:17 +04:00
}
2007-07-12 23:53:18 +04:00
/* setup to listen for commands sent specifically to me, even though I would probably
* be the one sending them! Unfortunately, since I am a participating daemon,
* there are times I need to send a command to "all daemons", and that means *I* have
* to receive it too
*/
2008-02-28 04:57:57 +03:00
rc = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, ORTE_RML_TAG_DAEMON,
ORTE_RML_NON_PERSISTENT, orte_daemon_recv, NULL);
2011-11-23 01:24:35 +04:00
if (rc != ORTE_SUCCESS && rc != ORTE_ERR_NOT_IMPLEMENTED) {
2007-07-12 23:53:18 +04:00
ORTE_ERROR_LOG(rc);
2009-02-25 06:10:21 +03:00
ORTE_UPDATE_EXIT_STATUS(ORTE_ERROR_DEFAULT_EXIT_CODE);
goto DONE;
2007-07-12 23:53:18 +04:00
}
2008-02-28 04:57:57 +03:00
/* setup the data server */
if (ORTE_SUCCESS != (rc = orte_data_server_init())) {
ORTE_ERROR_LOG(rc);
2009-02-25 06:10:21 +03:00
ORTE_UPDATE_EXIT_STATUS(ORTE_ERROR_DEFAULT_EXIT_CODE);
goto DONE;
2006-12-13 07:51:38 +03:00
}
2008-02-28 04:57:57 +03:00
2008-04-16 18:27:42 +04:00
/* if an uri for the ompi-server was provided, set the route */
if (NULL != ompi_server) {
opal_buffer_t buf;
/* setup our route to the server */
OBJ_CONSTRUCT(&buf, opal_buffer_t);
opal_dss.pack(&buf, &ompi_server, 1, OPAL_STRING);
2009-05-14 04:25:02 +04:00
if (ORTE_SUCCESS != (rc = orte_rml_base_update_contact_info(&buf))) {
ORTE_ERROR_LOG(rc);
ORTE_UPDATE_EXIT_STATUS(ORTE_ERROR_DEFAULT_EXIT_CODE);
goto DONE;
}
2008-04-16 18:27:42 +04:00
OBJ_DESTRUCT(&buf);
Per the July technical meeting:
During the discussion of MPI-2 functionality, it was pointed out by Aurelien that there was an inherent race condition between startup of ompi-server and mpirun. Specifically, if someone started ompi-server to run in the background as part of a script, and then immediately executed mpirun, it was possible that an MPI proc could attempt to contact the server (or that mpirun could try to read the server's contact file) before the server is running and ready.
At that time, we discussed creating a new tool "ompi-wait-server" that would wait for the server to be running, and/or probe to see if it is running and return true/false. However, rather than create yet another tool, it seemed just as effective to add the functionality to mpirun.
Thus, this commit creates two new mpirun cmd line flags (hey, you can never have too many!):
--wait-for-server : instructs mpirun to ping the server to see if it responds. This causes mpirun to execute an rml.ping to the server's URI with an appropriate timeout interval - if the ping isn't successful, mpirun attempts it again.
--server-wait-time xx : sets the ping timeout interval to xx seconds. Note that mpirun will attempt to ping the server twice with this timeout, so we actually wait for up to twice this time. Default is 10 seconds, which should be plenty of time.
This has only been lightly tested. It works if the server is present, and outputs a nice error message if it cannot be contacted. I have not tested the race condition case.
This commit was SVN r19152.
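The two-attempt ping described in this message can be sketched independently of the RML. In the sketch below, ping_fn_t, wait_for_server(), and the flaky_ping() stub are all hypothetical names used only for illustration; the real code calls orte_rml.ping() with a struct timeval. The point it shows is that the worst-case wait is twice the configured timeout.

```c
#include <stdio.h>

/* Hypothetical ping callback: returns 0 on success, non-zero on failure
 * or timeout. */
typedef int (*ping_fn_t)(const char *uri, long timeout_sec);

/* Ping the server, retrying exactly once on failure. With a timeout of
 * N seconds per attempt, the caller may block for up to 2 * N seconds. */
static int wait_for_server(ping_fn_t ping, const char *uri, long timeout_sec)
{
    int rc = ping(uri, timeout_sec);
    if (0 != rc) {
        rc = ping(uri, timeout_sec);   /* try it one more time */
    }
    return rc;
}

/* Stub used for demonstration: fails on the first call and succeeds on
 * the second, imitating a server that is still starting up. */
static int attempts = 0;

static int flaky_ping(const char *uri, long timeout_sec)
{
    (void)uri;
    (void)timeout_sec;
    ++attempts;
    return (attempts < 2) ? -1 : 0;
}
```

Calling wait_for_server(flaky_ping, uri, 10) with the stub above succeeds on the second attempt, which is the startup race the flags are meant to cover.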
2008-08-05 00:29:50 +04:00
/* check if we are to wait for the server to start - resolves
* a race condition that can occur when the server is run
* as a background job - e.g., in scripts
*/
if (orterun_globals.wait_for_server) {
/* ping the server */
struct timeval timeout;
timeout.tv_sec = orterun_globals.server_wait_timeout;
timeout.tv_usec = 0;
if (ORTE_SUCCESS != (rc = orte_rml.ping(ompi_server, &timeout))) {
/* try it one more time */
if (ORTE_SUCCESS != (rc = orte_rml.ping(ompi_server, &timeout))) {
/* okay give up */
orte_show_help("help-orterun.txt", "orterun:server-not-found", true,
2010-07-18 01:03:27 +04:00
orte_basename, ompi_server,
2008-08-05 00:29:50 +04:00
(long)orterun_globals.server_wait_timeout,
ORTE_ERROR_NAME(rc));
2009-02-25 06:10:21 +03:00
ORTE_UPDATE_EXIT_STATUS(ORTE_ERROR_DEFAULT_EXIT_CODE);
Per the July technical meeting:
During the discussion of MPI-2 functionality, it was pointed out by Aurelien that there was an inherent race condition between startup of ompi-server and mpirun. Specifically, if someone started ompi-server to run in the background as part of a script, and then immediately executed mpirun, it was possible that an MPI proc could attempt to contact the server (or that mpirun could try to read the server's contact file before the server is running and ready.
At that time, we discussed createing a new tool "ompi-wait-server" that would wait for the server to be running, and/or probe to see if it is running and return true/false. However, rather than create yet another tool, it seemed just as effective to add the functionality to mpirun.
Thus, this commit creates two new mpirun cmd line flags (hey, you can never have too many!):
--wait-for-server : instructs mpirun to ping the server to see if it responds. This causes mpirun to execute an rml.ping to the server's URI with an appropriate timeout interval - if the ping isn't successful, mpirun attempts it again.
--server-wait-time xx : sets the ping timeout interval to xx seconds. Note that mpirun will attempt to ping the server twice with this timeout, so we actually wait for twice this time. Default is 10 seconds, which should be plenty of time.
This has only lightly been tested. It works if the server is present, and outputs a nice error message if it cannot be contacted. I have not tested the race condition case.
This commit was SVN r19152.
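The retry behavior described above (ping once with the timeout, then try again on failure) can be sketched in C as follows. This is an illustration only: `ping_fn_t`, `wait_for_server`, `worst_case_wait_secs`, and `stub_ping` are hypothetical names, not ORTE APIs, and the callback merely stands in for the rml.ping call that mpirun performs.

```c
#include <stdbool.h>

/* Hypothetical callback: returns true if the server answered within
 * timeout_secs. It stands in for the real rml.ping call. */
typedef bool (*ping_fn_t)(const char *uri, int timeout_secs, void *ctx);

/* Ping the server up to max_attempts times.
 * Returns the 1-based attempt that succeeded, or 0 if none did. */
int wait_for_server(const char *uri, int timeout_secs,
                    int max_attempts, ping_fn_t ping, void *ctx)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (ping(uri, timeout_secs, ctx)) {
            return attempt;
        }
    }
    return 0;
}

/* Worst-case time spent waiting: every attempt runs to its timeout. */
int worst_case_wait_secs(int timeout_secs, int max_attempts)
{
    return timeout_secs * max_attempts;
}

/* Demo stub (assumption): fails until the counter behind ctx hits zero. */
bool stub_ping(const char *uri, int timeout_secs, void *ctx)
{
    int *failures_left = ctx;
    (void)uri;
    (void)timeout_secs;
    if (*failures_left > 0) {
        --*failures_left;
        return false;
    }
    return true;
}
```

With the default 10 second timeout and two attempts, the worst-case wait in this sketch is 20 seconds, which matches the "twice this time" note in the commit message.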
2008-08-05 00:29:50 +04:00
goto DONE ;
}
}
}
2008-04-16 18:27:42 +04:00
}
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. Both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
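The new oversubscription default from item 1 can be expressed as a small decision function. The sketch below is not the ORTE mapper code; the enum names and both functions are assumptions made for illustration, and the "user flag" simply models the override flags the commit message mentions without naming them.

```c
#include <stdbool.h>

/* Where the node allocation came from (hypothetical names). */
typedef enum {
    ALLOC_HOSTFILE,
    ALLOC_DASH_HOST,
    ALLOC_RESOURCE_MANAGER
} alloc_source_t;

/* What the user asked for on the command line, if anything. */
typedef enum {
    OVERSUB_UNSET,
    OVERSUB_ALLOW,
    OVERSUB_FORBID
} oversub_flag_t;

/* An explicit user flag always wins; otherwise the default depends on
 * the allocation source: allowed for hostfile/dash-host, forbidden for
 * nodes handed out by a resource manager. */
bool oversubscribe_allowed(alloc_source_t source, oversub_flag_t user_flag)
{
    if (OVERSUB_ALLOW == user_flag) {
        return true;
    }
    if (OVERSUB_FORBID == user_flag) {
        return false;
    }
    return ALLOC_RESOURCE_MANAGER != source;
}

/* A node can take one more process if oversubscription is permitted
 * or a slot is still free. */
bool can_place(int procs_on_node, int slots, bool oversub_ok)
{
    return oversub_ok || procs_on_node < slots;
}
```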
2011-11-15 07:40:11 +04:00
/* we may need to look at the apps for the user's job
* to get our full list of nodes , so prep the job for
* launch - start by getting a jobid for it */
if ( ORTE_SUCCESS ! = ( rc = orte_plm_base_create_jobid ( jdata ) ) ) {
ORTE_ERROR_LOG ( rc ) ;
goto DONE ;
}
/* store it on the global job data pool - this is the key
* step required before we launch the daemons . It allows
* the orte_rmaps_base_setup_virtual_machine routine to
* search all apps for any hosts to be used by the vm
*/
ljob = ORTE_LOCAL_JOBID ( jdata - > jobid ) ;
opal_pointer_array_set_item ( orte_job_data , ljob , jdata ) ;
2011-12-15 00:01:15 +04:00
2012-01-11 19:53:09 +04:00
/* setup for debugging */
orte_debugger_init_before_spawn ( jdata ) ;
2011-12-15 00:01:15 +04:00
/* spawn the job and its daemons */
2008-02-28 04:57:57 +03:00
rc = orte_plm . spawn ( jdata ) ;
2007-05-18 17:29:11 +04:00
2012-01-11 19:53:09 +04:00
/* complete debugger interface */
orte_debugger_init_after_spawn ( jdata ) ;
2008-02-28 04:57:57 +03:00
/* now wait until the termination event fires */
2010-10-28 19:22:46 +04:00
opal_event_dispatch ( opal_event_base ) ;
2008-02-28 04:57:57 +03:00
/* we only reach this point by jumping there due
* to an error - so just cleanup and leave
*/
2011-03-13 01:50:53 +03:00
DONE :
2009-04-30 19:08:02 +04:00
ORTE_UPDATE_EXIT_STATUS ( orte_exit_status ) ;
2010-07-18 01:03:27 +04:00
orte_quit ( ) ;
2011-03-10 03:42:28 +03:00
2008-02-28 04:57:57 +03:00
return orte_exit_status ;
}
2005-09-05 00:54:19 +04:00
static int init_globals ( void )
2005-03-14 23:57:21 +03:00
{
2005-03-19 02:58:36 +03:00
/* Only CONSTRUCT things once */
if ( ! globals_init ) {
2005-07-04 02:45:48 +04:00
OBJ_CONSTRUCT ( & orterun_globals . lock , opal_mutex_t ) ;
2006-10-23 07:34:08 +04:00
orterun_globals . env_val = NULL ;
orterun_globals . appfile = NULL ;
orterun_globals . wdir = NULL ;
orterun_globals . path = NULL ;
2008-02-28 04:57:57 +03:00
orterun_globals . ompi_server = NULL ;
2008-08-05 00:29:50 +04:00
orterun_globals . wait_for_server = false ;
orterun_globals . server_wait_timeout = 10 ;
Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code:
1. completely and cleanly separates responsibilities between the HNP, orted, and tool components.
2. removes all wireup messaging during launch and shutdown.
3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol.
4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0.
5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none".
6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout.
7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output"
This is not intended for the 1.3 release as it is a major change requiring considerable soak time.
This commit was SVN r19767.
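The "tag-output" format from item 7 can be reproduced with a single snprintf call. This is a sketch, not the IOF implementation: `tag_output_line` is a hypothetical helper, and it assumes the two numbers in the process name are the job and the rank, as the example "[1,0]" suggests.

```c
#include <stdio.h>

/* Prefix one line of output with "[job,rank]<stream>".
 * Returns the number of characters that the full tagged line needs
 * (excluding the terminating NUL), as snprintf does; the result is
 * truncated if buflen is too small. */
int tag_output_line(char *buf, size_t buflen,
                    unsigned job, unsigned rank,
                    const char *stream, const char *line)
{
    return snprintf(buf, buflen, "[%u,%u]<%s>%s", job, rank, stream, line);
}
```

Calling `tag_output_line(buf, sizeof buf, 1, 0, "stdout", "this is output")` yields the string shown in the commit message.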
2008-10-18 04:00:49 +04:00
orterun_globals . stdin_target = " 0 " ;
2008-12-24 18:27:46 +03:00
orterun_globals . report_pid = NULL ;
orterun_globals . report_uri = NULL ;
2010-07-01 23:31:11 +04:00
orterun_globals . disable_recovery = false ;
2005-03-19 02:58:36 +03:00
}
2006-07-11 01:25:33 +04:00
/* Reset the other fields every time */
2005-03-19 02:58:36 +03:00
2006-10-23 07:34:08 +04:00
orterun_globals . help = false ;
orterun_globals . version = false ;
orterun_globals . verbose = false ;
orterun_globals . debugger = false ;
2006-12-12 03:54:05 +03:00
orterun_globals . num_procs = 0 ;
2006-11-16 01:59:01 +03:00
if ( NULL ! = orterun_globals . env_val )
2006-10-23 07:34:08 +04:00
free ( orterun_globals . env_val ) ;
orterun_globals . env_val = NULL ;
2006-11-16 01:59:01 +03:00
if ( NULL ! = orterun_globals . appfile )
2006-10-23 07:34:08 +04:00
free ( orterun_globals . appfile ) ;
orterun_globals . appfile = NULL ;
2006-11-16 01:59:01 +03:00
if ( NULL ! = orterun_globals . wdir )
2006-10-23 07:34:08 +04:00
free ( orterun_globals . wdir ) ;
orterun_globals . wdir = NULL ;
if ( NULL ! = orterun_globals . path )
free ( orterun_globals . path ) ;
orterun_globals . path = NULL ;
2005-03-19 02:58:36 +03:00
2007-03-17 02:11:45 +03:00
orterun_globals . preload_binary = false ;
orterun_globals . preload_files = NULL ;
orterun_globals . preload_files_dest_dir = NULL ;
A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php
Documentation:
http://osl.iu.edu/research/ft/
Major Changes:
--------------
* Added C/R-enabled Debugging support.
Enabled with the --enable-crdebug flag. See the following website for more information:
http://osl.iu.edu/research/ft/crdebug/
* Added Stable Storage (SStore) framework for checkpoint storage
* 'central' component does a direct to central storage save
* 'stage' component stages checkpoints to central storage while the application continues execution.
* 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
* 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
* Added Compression (compress) framework to support
* Add two new ErrMgr recovery policies
* {{{crmig}}} C/R Process Migration
* {{{autor}}} C/R Automatic Recovery
* Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
* Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
* {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
* {{{OMPI_CR_Restart}}}
* {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
* {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
* {{{OMPI_CR_Quiesce_start}}}
* {{{OMPI_CR_Quiesce_checkpoint}}}
* {{{OMPI_CR_Quiesce_end}}}
* {{{OMPI_CR_self_register_checkpoint_callback}}}
* {{{OMPI_CR_self_register_restart_callback}}}
* {{{OMPI_CR_self_register_continue_callback}}}
* The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
* Add a progress meter to:
* FileM rsh (filem_rsh_process_meter)
* SnapC full (snapc_full_progress_meter)
* SStore stage (sstore_stage_progress_meter)
* Added 2 new command line options to ompi-restart
* --showme : Display the full command line that would have been exec'ed.
* --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
* Deprecated some MCA params:
* crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
* snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
* snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
* snapc_base_store_in_place deprecated, replaced with different components of SStore
* snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
* snapc_base_establish_global_snapshot_dir deprecated, never well supported
* snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
Minor Changes:
--------------
* Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
* Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
* Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
* Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
* Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
* opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
* Cleanup the CRS framework and components to work with the SStore framework.
* Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
* Add 'quiesce' hook to CRCP for a future enhancement.
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
* Add optional application level INC callbacks (registered through the CR MPI Ext interface).
* Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
* {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
* {{{opal-restart}}} also supports local decompression before restarting
* {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
* {{{orte-restart}}} now uses the SStore framework to work with the metadata
* Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
* Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
* Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
* Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
* odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
* Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
* Improve the checks for 'already checkpointing' error path.
* Add a recovery output timer, to show how long it takes to restart a job
* Do a better job of cleaning up the old session directory on restart.
* Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment)
* Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
This commit was SVN r23587.
The following Trac tickets were found above:
Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
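The deprecated MCA parameters listed above map naturally onto a lookup table. The sketch below is an illustration under stated assumptions: `mca_rename_t` and `mca_replacement` are hypothetical names, not part of the MCA parameter system, and only the parameters for which the commit message names a direct replacement are included.

```c
#include <stddef.h>
#include <string.h>

/* One deprecated parameter and the name that replaces it. */
typedef struct {
    const char *old_name;
    const char *new_name;
} mca_rename_t;

/* Pairs taken from the "Deprecated some MCA params" list. Entries
 * without a one-to-one replacement (snapc_base_store_in_place,
 * snapc_base_establish_global_snapshot_dir) are left out on purpose. */
static const mca_rename_t mca_renames[] = {
    { "crs_base_snapshot_dir",          "sstore_stage_local_snapshot_dir" },
    { "snapc_base_global_snapshot_dir", "sstore_base_global_snapshot_dir" },
    { "snapc_base_global_shared",       "sstore_stage_global_is_shared"   },
    { "snapc_base_global_snapshot_ref", "sstore_base_global_snapshot_ref" },
    { "snapc_full_skip_filem",          "sstore_stage_skip_filem"         },
};

/* Return the replacement for a deprecated parameter name,
 * or NULL if the name has no direct replacement in the table. */
const char *mca_replacement(const char *name)
{
    size_t n = sizeof mca_renames / sizeof mca_renames[0];
    for (size_t i = 0; i < n; ++i) {
        if (0 == strcmp(name, mca_renames[i].old_name)) {
            return mca_renames[i].new_name;
        }
    }
    return NULL;
}
```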
2010-08-11 00:51:11 +04:00
# if OPAL_ENABLE_FT_CR == 1
orterun_globals . sstore_load = NULL ;
# endif
2005-03-19 02:58:36 +03:00
/* All done */
globals_init = true ;
2005-03-14 23:57:21 +03:00
return ORTE_SUCCESS ;
}
2007-06-27 05:03:31 +04:00
static int parse_globals ( int argc , char * argv [ ] , opal_cmd_line_t * cmd_line )
2005-03-14 23:57:21 +03:00
{
2006-06-09 21:21:23 +04:00
/* print version if requested. Do this before check for help so
that - - version - - help works as one might expect . */
2006-06-22 23:48:27 +04:00
if ( orterun_globals . version & &
! ( 1 = = argc | | orterun_globals . help ) ) {
2006-06-09 21:21:23 +04:00
char * project_name = NULL ;
2010-07-18 01:03:27 +04:00
if ( 0 = = strcmp ( orte_basename , " mpirun " ) ) {
2006-06-09 21:21:23 +04:00
project_name = " Open MPI " ;
} else {
project_name = " OpenRTE " ;
}
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to make configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
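The help-message aggregation described above (display the first copy, count later duplicates, report the count periodically) can be sketched with a small fixed-size table. This is not the orte_show_help implementation; the types, the "file:topic" key, and both functions are assumptions made for illustration, and the ~5 second timer that would drive the periodic report is left to the caller.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define HELP_MAX_TOPICS 32
#define HELP_KEY_LEN    128

typedef struct {
    char key[HELP_KEY_LEN];  /* "file:topic" */
    int  pending;            /* duplicates seen since the last report */
} help_entry_t;

typedef struct {
    help_entry_t entries[HELP_MAX_TOPICS];
    int count;
} help_table_t;

static int help_find(const help_table_t *t, const char *key)
{
    for (int i = 0; i < t->count; ++i) {
        if (0 == strcmp(t->entries[i].key, key)) {
            return i;
        }
    }
    return -1;
}

/* Returns true if the message should be displayed now (first time it
 * is seen), false if it is a duplicate that was only counted. If the
 * table is full, unknown messages are always displayed. */
bool help_should_display(help_table_t *t, const char *file, const char *topic)
{
    char key[HELP_KEY_LEN];
    snprintf(key, sizeof key, "%s:%s", file, topic);
    int i = help_find(t, key);
    if (i >= 0) {
        t->entries[i].pending++;
        return false;
    }
    if (t->count < HELP_MAX_TOPICS) {
        strcpy(t->entries[t->count].key, key);
        t->entries[t->count].pending = 0;
        t->count++;
    }
    return true;
}

/* Called from a periodic timer: returns how many new copies arrived
 * since the last call and resets the counter. */
int help_take_pending(help_table_t *t, const char *file, const char *topic)
{
    char key[HELP_KEY_LEN];
    snprintf(key, sizeof key, "%s:%s", file, topic);
    int i = help_find(t, key);
    if (i < 0) {
        return 0;
    }
    int n = t->entries[i].pending;
    t->entries[i].pending = 0;
    return n;
}
```

A caller would start from a zero-initialized `help_table_t`, print the message whenever `help_should_display` returns true, and print an "X new copies" line whenever `help_take_pending` returns a nonzero count.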
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:version " , false ,
2010-07-18 01:03:27 +04:00
orte_basename , project_name , OPAL_VERSION ,
2006-06-22 23:48:27 +04:00
PACKAGE_BUGREPORT ) ;
2006-06-09 21:21:23 +04:00
/* if we were the only argument, exit */
if ( 2 = = argc ) exit ( 0 ) ;
}
2005-07-29 01:17:48 +04:00
/* Check for help request */
2005-04-12 20:01:30 +04:00
if ( 1 = = argc | | orterun_globals . help ) {
2005-03-14 23:57:21 +03:00
char * args = NULL ;
2006-06-22 23:48:27 +04:00
char * project_name = NULL ;
2010-07-18 01:03:27 +04:00
if ( 0 = = strcmp ( orte_basename , " mpirun " ) ) {
2006-06-22 23:48:27 +04:00
project_name = " Open MPI " ;
} else {
project_name = " OpenRTE " ;
}
2007-06-27 05:03:31 +04:00
args = opal_cmd_line_get_usage_msg ( cmd_line ) ;
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:usage " , false ,
2010-07-18 01:03:27 +04:00
orte_basename , project_name , OPAL_VERSION ,
orte_basename , args ,
2006-06-22 23:48:27 +04:00
PACKAGE_BUGREPORT ) ;
2005-03-14 23:57:21 +03:00
free ( args ) ;
2005-09-05 00:54:19 +04:00
2005-03-14 23:57:21 +03:00
/* If someone asks for help, that should be all we do */
exit ( 0 ) ;
}
2008-10-18 04:00:49 +04:00
/* check for request to report pid */
2008-12-24 18:27:46 +03:00
if ( NULL ! = orterun_globals . report_pid ) {
FILE * fp ;
if ( 0 = = strcmp ( orterun_globals . report_pid , " - " ) ) {
/* if '-', then output to stdout */
printf ( " %d \n " , ( int ) getpid ( ) ) ;
} else if ( 0 = = strcmp ( orterun_globals . report_pid , " + " ) ) {
/* if '+', output to stderr */
fprintf ( stderr , " %d \n " , ( int ) getpid ( ) ) ;
} else {
fp = fopen ( orterun_globals . report_pid , " w " ) ;
if ( NULL = = fp ) {
orte_show_help ( " help-orterun.txt " , " orterun:write_file " , false ,
2010-07-18 01:03:27 +04:00
orte_basename , " pid " , orterun_globals . report_pid ) ;
2008-12-24 18:27:46 +03:00
exit ( 0 ) ;
}
fprintf ( fp , " %d \n " , ( int ) getpid ( ) ) ;
fclose ( fp ) ;
}
2008-10-18 04:00:49 +04:00
}
2005-11-20 19:06:53 +03:00
/* Do we want a user-level debugger? */
2005-10-05 14:24:34 +04:00
2005-11-20 19:06:53 +03:00
if ( orterun_globals . debugger ) {
2010-07-18 01:03:27 +04:00
run_debugger ( orte_basename , cmd_line , argc , argv , orterun_globals . num_procs ) ;
2005-11-20 19:06:53 +03:00
}
2005-10-05 14:24:34 +04:00
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
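The oversubscription default described in item 1 amounts to a small decision rule. The following is a minimal sketch of that rule, not the mapper code from this commit: the enum names and both functions are hypothetical and exist only to make the policy concrete.

```c
#include <stdbool.h>

/* Hypothetical: where the node allocation came from. */
typedef enum { ALLOC_HOSTFILE, ALLOC_RESOURCE_MANAGER } alloc_source_t;

/* Hypothetical: an explicit user override, if any. */
typedef enum { OVERSUB_DEFAULT, OVERSUB_FORCE_ALLOW, OVERSUB_FORCE_DENY } oversub_flag_t;

/* An explicit flag always wins; otherwise hostfile/dash-host allocations
 * allow oversubscription and RM allocations do not. */
bool oversubscribe_allowed(alloc_source_t source, oversub_flag_t flag)
{
    if (OVERSUB_FORCE_ALLOW == flag) {
        return true;
    }
    if (OVERSUB_FORCE_DENY == flag) {
        return false;
    }
    return ALLOC_HOSTFILE == source;
}

/* A node can take one more process if it has a free slot, or if
 * oversubscription is permitted. */
bool can_place_proc(int slots, int procs_on_node, bool oversub_ok)
{
    return oversub_ok || procs_on_node < slots;
}
```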
2011-11-15 07:40:11 +04:00
/* if recovery was disabled on the cmd line, do so */
2010-07-01 23:31:11 +04:00
if ( orterun_globals . disable_recovery ) {
orte_enable_recovery = false ;
2011-02-14 23:49:12 +03:00
orte_max_restarts = 0 ;
2010-07-01 23:31:11 +04:00
}
2005-03-14 23:57:21 +03:00
return ORTE_SUCCESS ;
}
static int parse_locals ( int argc , char * argv [ ] )
{
int i , rc , app_num ;
int temp_argc ;
2005-08-08 20:42:28 +04:00
char * * temp_argv , * * env ;
2005-03-14 23:57:21 +03:00
orte_app_context_t * app ;
bool made_app ;
2006-08-15 23:54:10 +04:00
orte_std_cntr_t j , size1 ;
2005-03-14 23:57:21 +03:00
2008-02-28 04:57:57 +03:00
/* if the ompi-server was given, then set it up here */
if ( NULL ! = orterun_globals . ompi_server ) {
/* someone could have passed us a file instead of a uri, so
* we need to first check to see what we have - if it starts
* with "file", then we know it is a file. Otherwise, we assume
* it is a uri as provided by the ompi-server's output
* of an ORTE-standard string. Note that this is NOT a standard
* uri as it starts with the process name!
*/
Repair the MPI-2 dynamic operations. This includes:
1. repair of the linear and direct routed modules
2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI
3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature
4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.
Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB
5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.
Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.
6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch
7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch
8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies
9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.
10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon
There is still more cleanup to be done for efficiency, but this at least works.
Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.
Fixes ticket #1256
This commit was SVN r18804.
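Item 3 above (accepting both "file" and "FILE" as the prefix) can be sketched as a standalone classifier. This is an illustration under assumptions, not the orterun implementation: `classify_server_spec` and `spec_kind_t` are hypothetical names, and the sketch reports errors through a return value instead of calling orte_show_help() and exit().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result of inspecting an ompi-server argument. */
typedef enum { SPEC_URI, SPEC_FILE, SPEC_BAD } spec_kind_t;

/* Classify spec. For SPEC_FILE, *value points at the filename that
 * follows the ':'; for SPEC_URI, *value points at spec itself. */
spec_kind_t classify_server_spec(const char *spec, const char **value)
{
    if (0 == strncmp(spec, "file", strlen("file")) ||
        0 == strncmp(spec, "FILE", strlen("FILE"))) {
        const char *colon = strchr(spec, ':');
        if (NULL == colon || '\0' == colon[1]) {
            return SPEC_BAD;   /* missing ':' or empty filename */
        }
        *value = colon + 1;    /* step past the ':' */
        return SPEC_FILE;
    }
    *value = spec;             /* anything else is treated as a URI */
    return SPEC_URI;
}
```

Like the original check, the prefix test is a plain strncmp(), so mixed-case spellings such as "File" fall through to the URI case.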
2008-07-03 21:53:37 +04:00
if ( 0 = = strncmp ( orterun_globals . ompi_server , " file " , strlen ( " file " ) ) | |
0 = = strncmp ( orterun_globals . ompi_server , " FILE " , strlen ( " FILE " ) ) ) {
2008-02-28 04:57:57 +03:00
char input [ 1024 ] , * filename ;
FILE * fp ;
/* it is a file - get the filename */
filename = strchr ( orterun_globals . ompi_server , ' : ' ) ;
if ( NULL = = filename ) {
/* filename is not correctly formatted */
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done for configure.m4 to search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
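The duplicate-suppression behavior described for orte_show_help() can be sketched with a small lookup table. This is a simplified illustration, not the HNP code: the `help_table_t` type and both functions are hypothetical, the table has a fixed capacity, and the "every ~5 seconds" timer is replaced by an explicit drain call that such a timer would make.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 64

/* Hypothetical record of one help message that was already displayed. */
typedef struct {
    char key[128];     /* "file:topic" */
    int suppressed;    /* duplicates counted since the last report */
} help_seen_t;

typedef struct {
    help_seen_t seen[MAX_TOPICS];
    int count;
} help_table_t;

/* Returns true if the message should be displayed now (first sighting),
 * false if it is a duplicate that was only counted. If the table is
 * full, unknown messages are always displayed. */
bool help_should_display(help_table_t *t, const char *file, const char *topic)
{
    char key[128];
    snprintf(key, sizeof(key), "%s:%s", file, topic);
    for (int i = 0; i < t->count; ++i) {
        if (0 == strcmp(t->seen[i].key, key)) {
            t->seen[i].suppressed++;
            return false;
        }
    }
    if (t->count < MAX_TOPICS) {
        strcpy(t->seen[t->count].key, key);  /* key fits: same buffer size */
        t->seen[t->count].suppressed = 0;
        t->count++;
    }
    return true;
}

/* Returns how many duplicates arrived since the last drain and resets
 * the counter; a periodic timer would call this to print the
 * "I got X new copies" style summary. */
int help_drain_duplicates(help_table_t *t, const char *file, const char *topic)
{
    char key[128];
    snprintf(key, sizeof(key), "%s:%s", file, topic);
    for (int i = 0; i < t->count; ++i) {
        if (0 == strcmp(t->seen[i].key, key)) {
            int n = t->seen[i].suppressed;
            t->seen[i].suppressed = 0;
            return n;
        }
    }
    return 0;
}
```

Setting orte_base_help_aggregate to 0 corresponds to bypassing this table and displaying every message.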
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-filename-bad " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orterun_globals . ompi_server ) ;
2008-02-28 04:57:57 +03:00
exit ( 1 ) ;
}
+ + filename ; /* space past the : */
if ( 0 > = strlen ( filename ) ) {
/* they forgot to give us the name! */
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-filename-missing " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orterun_globals . ompi_server ) ;
2008-02-28 04:57:57 +03:00
exit ( 1 ) ;
}
/* open the file and extract the uri */
fp = fopen ( filename , " r " ) ;
if ( NULL = = fp ) { /* can't find or read file! */
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-filename-access " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orterun_globals . ompi_server ) ;
2008-02-28 04:57:57 +03:00
exit ( 1 ) ;
}
if ( NULL = = fgets ( input , 1024 , fp ) ) {
/* something malformed about file */
fclose ( fp ) ;
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-file-bad " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orterun_globals . ompi_server ,
orte_basename ) ;
2008-02-28 04:57:57 +03:00
exit ( 1 ) ;
}
fclose ( fp ) ;
input [ strlen ( input ) - 1 ] = ' \0 ' ; /* remove newline */
ompi_server = strdup ( input ) ;
2008-12-10 20:10:39 +03:00
} else if ( 0 = = strncmp ( orterun_globals . ompi_server , " pid " , strlen ( " pid " ) ) | |
0 = = strncmp ( orterun_globals . ompi_server , " PID " , strlen ( " PID " ) ) ) {
opal_list_t hnp_list ;
opal_list_item_t * item ;
orte_hnp_contact_t * hnp ;
char * ptr ;
pid_t pid ;
ptr = strchr ( orterun_globals . ompi_server , ' : ' ) ;
if ( NULL = = ptr ) {
/* pid is not correctly formatted */
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-pid-bad " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orte_basename ,
orterun_globals . ompi_server , orte_basename ) ;
2008-12-10 20:10:39 +03:00
exit ( 1 ) ;
}
+ + ptr ; /* space past the : */
if ( 0 > = strlen ( ptr ) ) {
/* they forgot to give us the pid! */
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-pid-bad " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orte_basename ,
orterun_globals . ompi_server , orte_basename ) ;
2008-12-10 20:10:39 +03:00
exit ( 1 ) ;
}
pid = strtoul ( ptr , NULL , 10 ) ;
/* to search the local mpirun's, we have to partially initialize the
2009-03-06 00:56:03 +03:00
* orte_process_info structure . This won ' t fully be setup until orte_init ,
2008-12-10 20:10:39 +03:00
* but we finagle a little bit of it here
*/
2009-03-06 00:56:03 +03:00
if ( ORTE_SUCCESS ! = ( rc = orte_session_dir_get_name ( NULL , & orte_process_info . tmpdir_base ,
& orte_process_info . top_session_dir ,
2008-12-10 20:10:39 +03:00
NULL , NULL , NULL ) ) ) {
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-could-not-get-hnp-list " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orte_basename ) ;
2008-12-10 20:10:39 +03:00
exit ( 1 ) ;
}
OBJ_CONSTRUCT ( & hnp_list , opal_list_t ) ;
/* get the list of HNPs, but do -not- setup contact info to them in the RML */
if ( ORTE_SUCCESS ! = ( rc = orte_list_local_hnps ( & hnp_list , false ) ) ) {
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-could-not-get-hnp-list " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orte_basename ) ;
2008-12-10 20:10:39 +03:00
exit ( 1 ) ;
}
/* search the list for the desired pid */
while ( NULL ! = ( item = opal_list_remove_first ( & hnp_list ) ) ) {
hnp = ( orte_hnp_contact_t * ) item ;
if ( pid = = hnp - > pid ) {
ompi_server = strdup ( hnp - > rml_uri ) ;
goto hnp_found ;
}
OBJ_RELEASE ( item ) ;
}
/* if we got here, it wasn't found */
orte_show_help ( " help-orterun.txt " , " orterun:ompi-server-pid-not-found " , true ,
2010-07-18 01:03:27 +04:00
orte_basename , orte_basename , pid , orterun_globals . ompi_server ,
orte_basename ) ;
2008-12-10 20:10:39 +03:00
OBJ_DESTRUCT ( & hnp_list ) ;
exit ( 1 ) ;
hnp_found :
/* cleanup rest of list */
while ( NULL ! = ( item = opal_list_remove_first ( & hnp_list ) ) ) {
OBJ_RELEASE ( item ) ;
}
OBJ_DESTRUCT ( & hnp_list ) ;
2008-02-28 04:57:57 +03:00
} else {
ompi_server = strdup ( orterun_globals . ompi_server ) ;
}
}
2005-03-14 23:57:21 +03:00
/* Make the apps */
temp_argc = 0 ;
temp_argv = NULL ;
2005-07-04 04:13:44 +04:00
opal_argv_append ( & temp_argc , & temp_argv , argv [ 0 ] ) ;
While waiting for fortran compiles...
Fixes for orterun in handling different MCA params for different
processes (reviewed by Brian):
- By design, if you run the following:
mpirun --mca foo aaa --mca foo bbb a.out
a.out will get a single MCA param for foo with value "aaa,bbb".
- However, if you specify multiple apps with different values for the
same MCA param, you should expect to get the different values for
each app. For example:
mpirun --mca foo aaa a.out : --mca foo bbb b.out
Should yield a.out with a "foo" param with value "aaa" and b.out
with a "foo" param with a value "bbb".
- This did not work -- both a.out and b.out would get a "foo" with
"aaa,bbb".
- This commit fixes this behavior -- now a.out will get aaa and b.out
will get bbb.
- Additionally, if you mix --mca and an app file, you can have
"global" params and per-line-in-the-appfile params. For example:
mpirun --mca foo zzzz --app appfile
where "appfile" contains:
-np 1 --mca bar aaa a.out
-np 1 --mca bar bbb b.out
In this case, a.out will get foo=zzzz and bar=aaa, and b.out will
get foo=zzzz and bar=bbb.
Spiffy.
Ok, fortran build is done... back to Fortran... sigh...
This commit was SVN r5710.
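The per-app behavior described above (each app context separated by ":" keeps its own --mca values) can be sketched as follows. This is a simplified illustration and not the parse_locals()/create_app() code: `split_apps`, `app_t`, and `mca_param_t` are hypothetical, only the --mca option is recognized, and repeated values for the same param within one app are not merged into "aaa,bbb".

```c
#include <stddef.h>
#include <string.h>

#define MAX_APPS 8
#define MAX_PARAMS 8

/* Hypothetical name/value pair taken from "--mca <name> <value>". */
typedef struct {
    const char *name;
    const char *value;
} mca_param_t;

/* Hypothetical per-app record. */
typedef struct {
    const char *exe;                  /* first non-option token of the app */
    mca_param_t params[MAX_PARAMS];   /* params given inside this app only */
    int num_params;
} app_t;

/* Split argv[1..argc-1] on ":" and record each app's own --mca params.
 * Returns the number of apps found. */
int split_apps(int argc, char *argv[], app_t apps[], int max_apps)
{
    int n = 0;
    int i = 1;
    while (i < argc && n < max_apps) {
        app_t *app = &apps[n];
        app->exe = NULL;
        app->num_params = 0;
        for (; i < argc && 0 != strcmp(argv[i], ":"); ++i) {
            if (0 == strcmp(argv[i], "--mca") && i + 2 < argc) {
                if (app->num_params < MAX_PARAMS) {
                    app->params[app->num_params].name = argv[i + 1];
                    app->params[app->num_params].value = argv[i + 2];
                    app->num_params++;
                }
                i += 2;               /* skip the name and the value */
            } else if (NULL == app->exe && '-' != argv[i][0]) {
                app->exe = argv[i];
            }
        }
        if (i < argc) {
            ++i;                      /* step past the ":" separator */
        }
        if (NULL != app->exe) {
            ++n;                      /* keep only groups that name an executable */
        }
    }
    return n;
}
```

For the command line "mpirun --mca foo aaa a.out : --mca foo bbb b.out", this yields two apps, the first with foo=aaa and the second with foo=bbb. Global params from an outer command line would be copied into every app before the per-app ones are added.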
2005-05-13 18:36:36 +04:00
2005-08-08 20:42:28 +04:00
/* NOTE: This bogus env variable is necessary in the calls to
create_app ( ) , below . See comment immediately before the
create_app ( ) function for an explanation . */
While waiting for fortran compiles...
Fixes for orterun in handling different MCA params for different
processes (reviewed by Brian):
- By design, if you run the following:
mpirun --mca foo aaa --mca foo bbb a.out
a.out will get a single MCA param for foo with value "aaa,bbb".
- However, if you specify multiple apps with different values for the
same MCA param, you should expect to get the different values for
each app. For example:
mpirun --mca foo aaa a.out : --mca foo bbb b.out
Should yield a.out with a "foo" param with value "aaa" and b.out
with a "foo" param with a value "bbb".
- This did not work -- both a.out and b.out would get a "foo" with
"aaa,bbb".
- This commit fixes this behavior -- now a.out will get aaa and b.out
will get bbb.
- Additionally, if you mix --mca and and app file, you can have
"global" params and per-line-in-the-appfile params. For example:
mpirun --mca foo zzzz --app appfile
where "appfile" contains:
-np 1 --mca bar aaa a.out
-np 1 --mca bar bbb b.out
In this case, a.out will get foo=zzzz and bar=aaa, and b.out will
get foo=zzzz and bar=bbb.
Spiffy.
Ok, fortran build is done... back to Fortran... sigh...
This commit was SVN r5710.
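
The comma-joining rule described in this message can be sketched as follows. This is a minimal illustration, not the orterun implementation: `append_value` is a hypothetical helper, and each app is assumed to keep its own slot per MCA param, so repeated values inside one app are joined while values given to different apps stay separate.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): append `value` to the
 * string held in `*slot`, separating entries with a comma.  `*slot`
 * must be NULL or a heap string owned by the caller.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int append_value(char **slot, const char *value)
{
    const char *old = (NULL != *slot) ? *slot : "";
    const char *sep = (NULL != *slot) ? "," : "";
    size_t len = strlen(old) + strlen(sep) + strlen(value) + 1;
    char *joined = malloc(len);

    if (NULL == joined) {
        return -1;
    }
    snprintf(joined, len, "%s%s%s", old, sep, value);
    free(*slot);        /* free(NULL) is a no-op */
    *slot = joined;
    return 0;
}
```

Calling the helper twice on the same slot with "aaa" and then "bbb" models `--mca foo aaa --mca foo bbb a.out`; calling it once on each of two separate slots models the `a.out : b.out` case, where each app only sees its own value.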
2005-05-13 18:36:36 +04:00
env = NULL ;
2005-03-14 23:57:21 +03:00
for ( app_num = 0 , i = 1 ; i < argc ; + + i ) {
if ( 0 = = strcmp ( argv [ i ] , " : " ) ) {
/* Make an app with this argv */
2005-07-04 04:13:44 +04:00
if ( opal_argv_count ( temp_argv ) > 1 ) {
2005-05-13 18:36:36 +04:00
if ( NULL ! = env ) {
2005-07-04 04:13:44 +04:00
opal_argv_free ( env ) ;
2005-05-13 18:36:36 +04:00
env = NULL ;
}
2006-03-24 18:28:42 +03:00
app = NULL ;
2005-05-13 18:36:36 +04:00
rc = create_app ( temp_argc , temp_argv , & app , & made_app , & env ) ;
2006-03-23 19:53:11 +03:00
/** keep track of the number of apps - point this app_context to that index */
2005-03-14 23:57:21 +03:00
if ( ORTE_SUCCESS ! = rc ) {
/* Assume that the error message has already been
printed ; no need to cleanup - - we can just
exit */
exit ( 1 ) ;
}
if ( made_app ) {
2006-03-24 18:28:42 +03:00
app - > idx = app_num ;
+ + app_num ;
2008-02-28 08:32:23 +03:00
opal_pointer_array_add ( jdata - > apps , app ) ;
2008-02-28 04:57:57 +03:00
+ + jdata - > num_apps ;
2005-03-14 23:57:21 +03:00
}
2005-09-05 00:54:19 +04:00
2005-03-14 23:57:21 +03:00
/* Reset the temps */
2005-09-05 00:54:19 +04:00
2005-03-14 23:57:21 +03:00
temp_argc = 0 ;
temp_argv = NULL ;
2005-07-04 04:13:44 +04:00
opal_argv_append ( & temp_argc , & temp_argv , argv [ 0 ] ) ;
2005-03-14 23:57:21 +03:00
}
} else {
2005-07-04 04:13:44 +04:00
opal_argv_append ( & temp_argc , & temp_argv , argv [ i ] ) ;
2005-03-14 23:57:21 +03:00
}
}
2005-07-04 04:13:44 +04:00
if ( opal_argv_count ( temp_argv ) > 1 ) {
2006-03-24 18:28:42 +03:00
app = NULL ;
2005-05-13 18:36:36 +04:00
rc = create_app ( temp_argc , temp_argv , & app , & made_app , & env ) ;
2005-03-14 23:57:21 +03:00
if ( ORTE_SUCCESS ! = rc ) {
/* Assume that the error message has already been printed;
no need to cleanup - - we can just exit */
exit ( 1 ) ;
}
if ( made_app ) {
2006-03-24 18:28:42 +03:00
app - > idx = app_num ;
+ + app_num ;
2008-02-28 08:32:23 +03:00
opal_pointer_array_add ( jdata - > apps , app ) ;
2008-02-28 04:57:57 +03:00
+ + jdata - > num_apps ;
2005-03-14 23:57:21 +03:00
}
}
2005-05-13 18:36:36 +04:00
if ( NULL ! = env ) {
2005-07-04 04:13:44 +04:00
opal_argv_free ( env ) ;
2005-05-13 18:36:36 +04:00
}
2005-07-04 04:13:44 +04:00
opal_argv_free ( temp_argv ) ;
2005-03-14 23:57:21 +03:00
2005-08-08 20:42:28 +04:00
/* Once we've created all the apps, add the global MCA params to
each app ' s environment ( checking for duplicates , of
course - - yay opal_environ_merge ( ) ) . */
if ( NULL ! = global_mca_env ) {
2008-02-28 08:32:23 +03:00
size1 = ( size_t ) opal_pointer_array_get_size ( jdata - > apps ) ;
2005-08-08 20:42:28 +04:00
/* Iterate through all the apps */
for ( j = 0 ; j < size1 ; + + j ) {
2005-09-05 00:54:19 +04:00
app = ( orte_app_context_t * )
2008-02-28 08:32:23 +03:00
opal_pointer_array_get_item ( jdata - > apps , j ) ;
2005-08-08 20:42:28 +04:00
if ( NULL ! = app ) {
/* Use handy utility function */
env = opal_environ_merge ( global_mca_env , app - > env ) ;
opal_argv_free ( app - > env ) ;
app - > env = env ;
}
}
}
/* Now take a subset of the MCA params and set them as MCA
overrides here in orterun ( so that when we orte_init ( ) later ,
all the components see these MCA params ) . Here ' s how we decide
which subset of the MCA params we set here in orterun :
1. If any global MCA params were set , use those
2. If no global MCA params were set and there was only one app ,
then use its app MCA params
3. Otherwise , don ' t set any
*/
env = NULL ;
if ( NULL ! = global_mca_env ) {
env = global_mca_env ;
} else {
2008-02-28 08:32:23 +03:00
if ( opal_pointer_array_get_size ( jdata - > apps ) > = 1 ) {
2005-08-08 20:42:28 +04:00
/* Remember that pointer_array's can be padded with NULL
entries ; so only use the app ' s env if there is exactly
1 non - NULL entry */
2005-09-05 00:54:19 +04:00
app = ( orte_app_context_t * )
2008-02-28 08:32:23 +03:00
opal_pointer_array_get_item ( jdata - > apps , 0 ) ;
2005-08-08 20:42:28 +04:00
if ( NULL ! = app ) {
env = app - > env ;
2008-02-28 08:32:23 +03:00
for ( j = 1 ; j < opal_pointer_array_get_size ( jdata - > apps ) ; + + j ) {
if ( NULL ! = opal_pointer_array_get_item ( jdata - > apps , j ) ) {
2005-08-08 20:42:28 +04:00
env = NULL ;
break ;
}
}
}
}
}
2005-09-05 00:54:19 +04:00
2005-08-08 20:42:28 +04:00
if ( NULL ! = env ) {
size1 = opal_argv_count ( env ) ;
for ( j = 0 ; j < size1 ; + + j ) {
2011-05-18 20:25:35 +04:00
/* Use-after-Free error possible here. putenv does not copy
* the string passed to it , and instead stores only the pointer .
* env [ j ] may be freed later , in which case the pointer
* in environ will now be left dangling into a deallocated
* region .
* So we make a copy of the variable .
*/
char * s = strdup ( env [ j ] ) ;
if ( NULL = = s ) {
return OPAL_ERR_OUT_OF_RESOURCE ;
}
putenv ( s ) ;
2005-08-08 20:42:28 +04:00
}
}
2005-03-14 23:57:21 +03:00
/* All done */
return ORTE_SUCCESS ;
}
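
The use-after-free note in the loop above comes down to one rule: hand putenv() a string that nobody else will free. A minimal standalone sketch of that pattern follows; `export_copies` is a hypothetical name, and the error handling is an assumption rather than what orterun does.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for putenv() and strdup() under strict modes */
#endif
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: export every "NAME=value" entry of a
 * NULL-terminated array.  putenv() keeps the pointer it is given, so
 * each entry is duplicated first; the copies are deliberately never
 * freed because the environment now refers to them.
 * Returns 0 on success, -1 on allocation or putenv() failure. */
static int export_copies(char **env)
{
    size_t j;

    for (j = 0; NULL != env && NULL != env[j]; ++j) {
        char *s = strdup(env[j]);
        if (NULL == s) {
            return -1;
        }
        if (0 != putenv(s)) {
            free(s);    /* not stored in environ, safe to release */
            return -1;
        }
    }
    return 0;
}
```

After this call the caller is free to release the original array, which is the situation the comment warns about.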
2008-07-09 02:36:39 +04:00
static int capture_cmd_line_params ( int argc , int start , char * * argv )
{
int i , j , k ;
bool ignore ;
char * no_dups [ ] = {
" grpcomm " ,
" odls " ,
" rml " ,
" routed " ,
NULL
} ;
for ( i = 0 ; i < ( argc - start ) ; + + i ) {
if ( 0 = = strcmp ( " -mca " , argv [ i ] ) | |
0 = = strcmp ( " --mca " , argv [ i ] ) ) {
/* It would be nice to avoid increasing the length
* of the orted cmd line by removing any non - ORTE
* params . However , this raises a problem since
* there could be OPAL directives that we really
* - do - want the orted to see - it ' s only the OMPI
* related directives we could ignore . This becomes
* a very complicated procedure , however , since
* the OMPI mca params are not cleanly separated - so
* filtering them out is nearly impossible .
*
* see if this is already present so we at least can
* avoid growing the cmd line with duplicates
*/
ignore = false ;
if ( NULL ! = orted_cmd_line ) {
for ( j = 0 ; NULL ! = orted_cmd_line [ j ] ; j + + ) {
if ( 0 = = strcmp ( argv [ i + 1 ] , orted_cmd_line [ j ] ) ) {
/* already here - if the value is the same,
* we can quietly ignore the fact that they
* provide it more than once . However , some
* frameworks are known to have problems if the
* value is different . We don ' t have a good way
* to know this , but we at least make a crude
* attempt here to protect ourselves .
*/
if ( 0 = = strcmp ( argv [ i + 2 ] , orted_cmd_line [ j + 1 ] ) ) {
/* values are the same */
ignore = true ;
break ;
} else {
/* values are different - see if this is a problem */
for ( k = 0 ; NULL ! = no_dups [ k ] ; k + + ) {
if ( 0 = = strcmp ( no_dups [ k ] , argv [ i + 1 ] ) ) {
/* print help message
* and abort as we cannot know which one is correct
*/
orte_show_help ( " help-orterun.txt " , " orterun:conflicting-params " ,
2010-07-18 01:03:27 +04:00
true , orte_basename , argv [ i + 1 ] ,
2008-07-09 02:36:39 +04:00
argv [ i + 2 ] , orted_cmd_line [ j + 1 ] ) ;
return ORTE_ERR_BAD_PARAM ;
}
}
/* this passed muster - just ignore it */
ignore = true ;
break ;
}
}
}
}
if ( ! ignore ) {
opal_argv_append_nosize ( & orted_cmd_line , argv [ i ] ) ;
opal_argv_append_nosize ( & orted_cmd_line , argv [ i + 1 ] ) ;
opal_argv_append_nosize ( & orted_cmd_line , argv [ i + 2 ] ) ;
}
i + = 2 ;
}
}
return ORTE_SUCCESS ;
}
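
The duplicate handling in capture_cmd_line_params() can be read as a four-way decision. The sketch below restates it under one simplifying assumption: the captured list is walked as ("-mca", name, value) triples, whereas the function above compares against every element. The names `dup_result` and `check_dup` are hypothetical; only the `no_dups` list is taken from the code.

```c
#include <stddef.h>
#include <string.h>

enum dup_result {
    DUP_NEW,       /* not captured yet: append it */
    DUP_SAME,      /* same name and value: ignore silently */
    DUP_IGNORED,   /* different value, framework tolerates it: ignore */
    DUP_CONFLICT   /* different value for a no_dups framework: abort */
};

/* Frameworks that must not receive conflicting values (from the code above). */
static const char *const no_dups[] = { "grpcomm", "odls", "rml", "routed", NULL };

/* `captured` is a NULL-terminated array of "-mca", name, value triples. */
static enum dup_result check_dup(char *const *captured,
                                 const char *name, const char *value)
{
    size_t j, k;

    if (NULL == captured) {
        return DUP_NEW;
    }
    for (j = 0; NULL != captured[j] && NULL != captured[j + 1] &&
                NULL != captured[j + 2]; j += 3) {
        if (0 != strcmp(name, captured[j + 1])) {
            continue;
        }
        if (0 == strcmp(value, captured[j + 2])) {
            return DUP_SAME;
        }
        for (k = 0; NULL != no_dups[k]; ++k) {
            if (0 == strcmp(no_dups[k], name)) {
                return DUP_CONFLICT;
            }
        }
        return DUP_IGNORED;
    }
    return DUP_NEW;
}
```

Only DUP_NEW leads to the three opal_argv_append_nosize() calls; DUP_CONFLICT corresponds to the orte_show_help() call followed by ORTE_ERR_BAD_PARAM.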
2005-08-08 20:42:28 +04:00
/*
* This function takes a " char ***app_env " parameter to handle the
* specific case :
*
* orterun - - mca foo bar - app appfile
*
* That is , we ' ll need to keep foo = bar , but the presence of the app
* file will cause an invocation of parse_appfile ( ) , which will cause
* one or more recursive calls back to create_app ( ) . Since the
* foo = bar value applies globally to all apps in the appfile , we need
* to pass in the " base " environment ( that contains the foo = bar value )
* when we parse each line in the appfile .
*
* This is really just a special case - - when we have a simple case like :
*
* orterun - - mca foo bar - np 4 hostname
*
* Then the upper - level function ( parse_locals ( ) ) calls create_app ( )
* with a NULL value for app_env , meaning that there is no " base "
* environment that the app needs to be created from .
*/
2005-03-14 23:57:21 +03:00
static int create_app ( int argc , char * argv [ ] , orte_app_context_t * * app_ptr ,
2005-08-08 20:42:28 +04:00
bool * made_app , char * * * app_env )
2005-03-14 23:57:21 +03:00
{
2005-07-04 04:13:44 +04:00
opal_cmd_line_t cmd_line ;
2009-05-07 00:11:28 +04:00
char cwd [ OPAL_PATH_MAX ] ;
2006-02-07 06:32:36 +03:00
int i , j , count , rc ;
2005-03-14 23:57:21 +03:00
char * param , * value , * value2 ;
orte_app_context_t * app = NULL ;
2008-03-06 01:12:27 +03:00
bool cmd_line_made = false ;
2005-03-14 23:57:21 +03:00
* made_app = false ;
2008-03-06 01:12:27 +03:00
/* Pre-process the command line if we are going to parse an appfile later.
* save any mca command line args so they can be passed
* separately to the daemons .
* Use Case :
* $ cat launch . appfile
* - np 1 - mca aaa bbb . / my - app - mca ccc ddd
* - np 1 - mca aaa bbb . / my - app - mca eee fff
* $ mpirun - np 2 - mca foo bar - - app launch . appfile
* Only pick up ' - mca foo bar ' on this pass .
2005-11-03 21:15:47 +03:00
*/
2008-03-06 01:12:27 +03:00
if ( NULL ! = orterun_globals . appfile ) {
2008-07-09 02:36:39 +04:00
if ( ORTE_SUCCESS ! = ( rc = capture_cmd_line_params ( argc , 0 , argv ) ) ) {
goto cleanup ;
These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
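
The 51GBytes estimate in item 3 above is a plain product, which can be checked with a small sketch. The function name `all_to_all_bytes` is hypothetical, and "32k" is taken as 32,000 to match the rounded figures in the message; with 32,768 processes the same formula gives roughly 53.7 GB.

```c
#include <stdint.h>

/* Bytes moved when each of `nprocs` processes receives
 * `bytes_per_proc` bytes describing every one of the `nprocs`
 * processes (the shape of the old STG1 startup message). */
static uint64_t all_to_all_bytes(uint64_t nprocs, uint64_t bytes_per_proc)
{
    uint64_t per_receiver = nprocs * bytes_per_proc;  /* 32,000 * 50 = 1.6 MB */
    return per_receiver * nprocs;                     /* sent to all nprocs   */
}
```

The quadratic growth in `nprocs` is why trimming even a few bytes per process matters at this scale.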
2007-10-05 23:48:23 +04:00
}
2005-03-14 23:57:21 +03:00
}
2008-07-09 02:36:39 +04:00
2008-03-06 01:12:27 +03:00
/* Parse application command line options. */
2005-03-14 23:57:21 +03:00
init_globals ( ) ;
2005-07-04 04:13:44 +04:00
opal_cmd_line_create ( & cmd_line , cmd_line_init ) ;
2005-03-18 06:43:59 +03:00
mca_base_cmd_line_setup ( & cmd_line ) ;
2005-03-14 23:57:21 +03:00
cmd_line_made = true ;
2008-03-06 01:12:27 +03:00
rc = opal_cmd_line_parse ( & cmd_line , true , argc , argv ) ;
2006-02-12 04:33:29 +03:00
if ( ORTE_SUCCESS ! = rc ) {
2005-03-14 23:57:21 +03:00
goto cleanup ;
}
2005-08-08 20:42:28 +04:00
mca_base_cmd_line_process_args ( & cmd_line , app_env , & global_mca_env ) ;
2005-03-14 23:57:21 +03:00
/* Is there an appfile in here? */
if ( NULL ! = orterun_globals . appfile ) {
OBJ_DESTRUCT ( & cmd_line ) ;
2005-08-08 20:42:28 +04:00
return parse_appfile ( strdup ( orterun_globals . appfile ) , app_env ) ;
2005-03-14 23:57:21 +03:00
}
/* Setup application context */
app = OBJ_NEW ( orte_app_context_t ) ;
2006-02-07 06:32:36 +03:00
opal_cmd_line_get_tail ( & cmd_line , & count , & app - > argv ) ;
2005-03-14 23:57:21 +03:00
/* See if we have anything left */
2006-02-07 06:32:36 +03:00
if ( 0 = = count ) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
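
The duplicate suppression described for orte_show_help() can be sketched as a small table keyed by help file and topic. This is an illustration only: the types and functions below (`help_table`, `help_record`, `help_flush`) are hypothetical, the fixed table size is an assumption, and the periodic timer (the message mentions roughly 5 seconds) is left to the caller.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TOPICS 16   /* assumed capacity, for the sketch only */

struct help_entry {
    const char *file;    /* e.g. "help-orterun.txt" */
    const char *topic;   /* e.g. "orterun:conflicting-params" */
    unsigned pending;    /* duplicates seen since the last report */
};

struct help_table {
    struct help_entry entries[MAX_TOPICS];
    size_t count;
};

/* Returns 1 if the message is new and should be displayed now,
 * 0 if it is a duplicate that was only counted, -1 if the table is full. */
static int help_record(struct help_table *t, const char *file, const char *topic)
{
    size_t i;

    for (i = 0; i < t->count; ++i) {
        if (0 == strcmp(t->entries[i].file, file) &&
            0 == strcmp(t->entries[i].topic, topic)) {
            ++t->entries[i].pending;
            return 0;
        }
    }
    if (MAX_TOPICS == t->count) {
        return -1;
    }
    t->entries[t->count].file = file;
    t->entries[t->count].topic = topic;
    t->entries[t->count].pending = 0;
    ++t->count;
    return 1;
}

/* Called periodically: returns how many new copies of entry `i`
 * arrived since the last call and resets the counter. */
static unsigned help_flush(struct help_table *t, size_t i)
{
    unsigned n = t->entries[i].pending;
    t->entries[i].pending = 0;
    return n;
}
```

A non-zero result from `help_flush` is what would drive an "I got X new copies of the help message..." line.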
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:executable-not-specified " ,
2010-07-18 01:03:27 +04:00
true , orte_basename , orte_basename ) ;
2005-03-14 23:57:21 +03:00
rc = ORTE_ERR_NOT_FOUND ;
goto cleanup ;
}
2007-10-11 22:33:40 +04:00
/*
2008-03-06 01:12:27 +03:00
* Get mca parameters so we can pass them to the daemons .
* Use the count determined above to make sure we do not go past
* the executable name . Example :
2007-10-11 22:33:40 +04:00
* mpirun - np 2 - mca foo bar . / my - app - mca bip bop
* We want to pick up ' - mca foo bar ' but not ' - mca bip bop '
*/
2008-07-09 02:36:39 +04:00
if ( ORTE_SUCCESS ! = ( rc = capture_cmd_line_params ( argc , count , argv ) ) ) {
goto cleanup ;
2007-10-11 22:33:40 +04:00
}
2008-07-09 02:36:39 +04:00
2005-04-09 05:26:17 +04:00
/* Grab all OMPI_* environment variables */
2005-03-14 23:57:21 +03:00
2005-08-08 20:42:28 +04:00
app - > env = opal_argv_copy ( * app_env ) ;
2005-03-14 23:57:21 +03:00
for ( i = 0 ; NULL ! = environ [ i ] ; + + i ) {
2005-04-06 05:58:30 +04:00
if ( 0 = = strncmp ( " OMPI_ " , environ [ i ] , 5 ) ) {
2008-07-08 17:48:47 +04:00
/* check for duplicate in app->env - this
* would have been placed there by the
* cmd line processor . By convention , we
* always let the cmd line override the
* environment
*/
param = strdup ( environ [ i ] ) ;
value = strchr ( param , ' = ' ) ;
* value = ' \0 ' ;
value + + ;
opal_setenv ( param , value , false , & app - > env ) ;
free ( param ) ;
2005-03-14 23:57:21 +03:00
}
}
2008-12-10 02:49:02 +03:00
2008-02-28 04:57:57 +03:00
/* add the ompi-server, if provided */
if ( NULL ! = ompi_server ) {
2008-12-10 02:49:02 +03:00
opal_setenv ( " OMPI_MCA_pubsub_orte_server " , ompi_server , true , & app - > env ) ;
2008-02-28 04:57:57 +03:00
}
2011-12-07 01:31:22 +04:00
/* Did the user request to export any environment variables on the cmd line? */
2005-07-04 04:13:44 +04:00
if ( opal_cmd_line_is_taken ( & cmd_line , " x " ) ) {
j = opal_cmd_line_get_ninsts ( & cmd_line , " x " ) ;
2005-03-14 23:57:21 +03:00
for ( i = 0 ; i < j ; + + i ) {
2005-07-04 04:13:44 +04:00
param = opal_cmd_line_get_param ( & cmd_line , " x " , i , 0 ) ;
2005-03-14 23:57:21 +03:00
if ( NULL ! = strchr ( param , ' = ' ) ) {
2005-07-04 04:13:44 +04:00
opal_argv_append_nosize ( & app - > env , param ) ;
2005-03-14 23:57:21 +03:00
} else {
value = getenv ( param ) ;
if ( NULL ! = value ) {
if ( NULL ! = strchr ( value , ' = ' ) ) {
2005-07-04 04:13:44 +04:00
opal_argv_append_nosize ( & app - > env , value ) ;
2005-03-14 23:57:21 +03:00
} else {
asprintf ( & value2 , " %s=%s " , param , value ) ;
2005-07-04 04:13:44 +04:00
opal_argv_append_nosize ( & app - > env , value2 ) ;
2005-05-13 01:44:23 +04:00
free ( value2 ) ;
2005-03-14 23:57:21 +03:00
}
} else {
2008-06-09 18:53:58 +04:00
opal_output ( 0 , " Warning: could not find environment variable \" %s \" \n " , param ) ;
2005-03-14 23:57:21 +03:00
}
}
}
}
2011-12-07 01:31:22 +04:00
/* Did the user request to export any environment variables via MCA param? */
if ( NULL ! = orte_forward_envars ) {
char * * vars ;
vars = opal_argv_split ( orte_forward_envars , ' , ' ) ;
for ( i = 0 ; NULL ! = vars [ i ] ; i + + ) {
if ( NULL ! = strchr ( vars [ i ] , ' = ' ) ) {
/* user supplied a value */
opal_argv_append_nosize ( & app - > env , vars [ i ] ) ;
} else {
/* get the value from the environ */
value = getenv ( vars [ i ] ) ;
if ( NULL ! = value ) {
if ( NULL ! = strchr ( value , ' = ' ) ) {
opal_argv_append_nosize ( & app - > env , value ) ;
} else {
asprintf ( & value2 , " %s=%s " , vars [ i ] , value ) ;
opal_argv_append_nosize ( & app - > env , value2 ) ;
free ( value2 ) ;
}
} else {
opal_output ( 0 , " Warning: could not find environment variable \" %s \" \n " , vars [ i ] ) ;
}
}
}
opal_argv_free ( vars ) ;
}
2008-03-06 00:07:43 +03:00
/* If the user specified --path, store it in the user's app
environment via the OMPI_exec_path variable . */
2005-03-14 23:57:21 +03:00
if ( NULL ! = orterun_globals . path ) {
2008-03-06 00:07:43 +03:00
asprintf ( & value , " OMPI_exec_path=%s " , orterun_globals . path ) ;
2005-07-04 04:13:44 +04:00
opal_argv_append_nosize ( & app - > env , value ) ;
2005-03-14 23:57:21 +03:00
free ( value ) ;
}
/* Did the user request a specific wdir? */
if ( NULL ! = orterun_globals . wdir ) {
2009-01-25 15:39:24 +03:00
/* if this is a relative path, convert it to an absolute path */
if ( opal_path_is_absolute ( orterun_globals . wdir ) ) {
app - > cwd = strdup ( orterun_globals . wdir ) ;
} else {
/* get the cwd */
if ( OPAL_SUCCESS ! = ( rc = opal_getcwd ( cwd , sizeof ( cwd ) ) ) ) {
orte_show_help ( " help-orterun.txt " , " orterun:init-failure " ,
true , " get the cwd " , rc ) ;
goto cleanup ;
}
/* construct the absolute path */
app - > cwd = opal_os_path ( false , cwd , orterun_globals . wdir , NULL ) ;
}
2006-02-16 23:40:23 +03:00
app - > user_specified_cwd = true ;
2005-03-14 23:57:21 +03:00
} else {
2008-02-28 04:57:57 +03:00
if ( OPAL_SUCCESS ! = ( rc = opal_getcwd ( cwd , sizeof ( cwd ) ) ) ) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already made the search-and-replace
on the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
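The duplicate detection described above can be sketched roughly as follows. This is a minimal illustration only, not the code in the HNP: the names help_table_t, help_record() and help_flush() are hypothetical, and the sketch assumes that a help message is identified by its (help file, topic) pair. The periodic timer that would call help_flush() every ~5 seconds is left out.

```c
#include <stdio.h>
#include <string.h>

#define HELP_MAX_TOPICS 64
#define HELP_KEY_LEN    128

/* Hypothetical record: one entry per distinct (file, topic) pair. */
typedef struct {
    char key[HELP_KEY_LEN];  /* "file:topic" */
    int  pending;            /* duplicates seen since the last flush */
} help_entry_t;

typedef struct {
    help_entry_t entries[HELP_MAX_TOPICS];
    int          count;
} help_table_t;

/* Record one incoming help message.
 * Returns 1 if it is new and should be displayed now,
 *         0 if it is a duplicate (it is only counted),
 *        -1 if the key does not fit or the table is full. */
int help_record(help_table_t *t, const char *file, const char *topic)
{
    char key[HELP_KEY_LEN];
    int i;
    int n = snprintf(key, sizeof(key), "%s:%s", file, topic);

    if (n < 0 || (size_t) n >= sizeof(key)) {
        return -1;
    }
    for (i = 0; i < t->count; ++i) {
        if (0 == strcmp(t->entries[i].key, key)) {
            t->entries[i].pending++;
            return 0;
        }
    }
    if (t->count >= HELP_MAX_TOPICS) {
        return -1;
    }
    strcpy(t->entries[t->count].key, key);
    t->entries[t->count].pending = 0;
    t->count++;
    return 1;
}

/* Meant to be called periodically: returns how many duplicate copies
 * arrived since the previous call and resets the counters. */
int help_flush(help_table_t *t)
{
    int i, total = 0;

    for (i = 0; i < t->count; ++i) {
        total += t->entries[i].pending;
        t->entries[i].pending = 0;
    }
    return total;
}
```

With a zero-initialized help_table_t, the first call for a given file and topic returns 1 (display it), later calls for the same pair return 0, and help_flush() yields the number to report in an "I got X new copies" style message.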
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to make configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:init-failure " ,
2008-02-28 04:57:57 +03:00
true , " get the cwd " , rc ) ;
goto cleanup ;
}
2005-03-14 23:57:21 +03:00
app - > cwd = strdup ( cwd ) ;
2006-02-16 23:40:23 +03:00
app - > user_specified_cwd = false ;
2005-03-14 23:57:21 +03:00
}
2006-09-15 06:52:08 +04:00
/* Check to see if the user explicitly wanted to disable automatic
- - prefix behavior */
if ( opal_cmd_line_is_taken ( & cmd_line , " noprefix " ) ) {
want_prefix_by_default = false ;
}
2006-02-28 14:52:12 +03:00
/* Did the user specify a specific prefix for this app_context_t
or provide an absolute path name to argv [ 0 ] ? */
if ( opal_cmd_line_is_taken ( & cmd_line , " prefix " ) | |
2006-09-15 06:52:08 +04:00
' / ' = = argv [ 0 ] [ 0 ] | | want_prefix_by_default ) {
2005-09-06 20:10:05 +04:00
size_t param_len ;
2011-04-29 02:20:55 +04:00
char * path_to_mpirun = NULL ;
2005-09-06 20:10:05 +04:00
2011-04-29 02:20:55 +04:00
if ( ' / ' = = argv [ 0 ] [ 0 ] ) {
char * tmp_basename = NULL ;
/* If they specified an absolute path, strip off the
/ bin / < exec_name > and leave just the prefix */
path_to_mpirun = opal_dirname ( argv [ 0 ] ) ;
/* Quick sanity check to ensure we got
something / bin / < exec_name > and that the installation
tree is at least more or less what we expect it to
be */
tmp_basename = opal_basename ( path_to_mpirun ) ;
if ( 0 = = strcmp ( " bin " , tmp_basename ) ) {
char * tmp = path_to_mpirun ;
path_to_mpirun = opal_dirname ( tmp ) ;
free ( tmp ) ;
} else {
free ( path_to_mpirun ) ;
path_to_mpirun = NULL ;
}
free ( tmp_basename ) ;
}
2011-04-29 02:12:41 +04:00
/* if both are given, check to see if they match */
2011-04-29 02:20:55 +04:00
if ( opal_cmd_line_is_taken ( & cmd_line , " prefix " ) & & NULL ! = path_to_mpirun ) {
2011-04-29 02:12:41 +04:00
/* if they don't match, then that merits a warning */
2011-04-29 02:20:55 +04:00
param = strdup ( opal_cmd_line_get_param ( & cmd_line , " prefix " , 0 , 0 ) ) ;
2011-04-29 02:12:41 +04:00
if ( 0 ! = strcmp ( param , path_to_mpirun ) ) {
orte_show_help ( " help-orterun.txt " , " orterun:double-prefix " ,
true , orte_basename , orte_basename , param , path_to_mpirun ) ;
/* let the path-to-mpirun take precedence since we
* know that one is being used
*/
2011-04-29 02:20:55 +04:00
free ( param ) ;
2011-04-29 02:12:41 +04:00
param = path_to_mpirun ;
} else {
/* since they match, just use param */
free ( path_to_mpirun ) ;
}
2011-04-29 02:20:55 +04:00
} else if ( NULL ! = path_to_mpirun ) {
param = path_to_mpirun ;
2011-04-29 02:12:41 +04:00
} else if ( opal_cmd_line_is_taken ( & cmd_line , " prefix " ) ) {
/* must be --prefix alone */
param = strdup ( opal_cmd_line_get_param ( & cmd_line , " prefix " , 0 , 0 ) ) ;
} else {
/* --enable-orterun-prefix-default was given to orterun */
param = strdup ( opal_install_dirs . prefix ) ;
2006-09-15 06:52:08 +04:00
}
2005-09-06 20:10:05 +04:00
2006-02-28 14:52:12 +03:00
if ( NULL ! = param ) {
2006-08-24 20:18:42 +04:00
/* "Parse" the param, aka remove superfluous path_sep. */
2006-02-28 14:52:12 +03:00
param_len = strlen ( param ) ;
2006-08-22 01:55:41 +04:00
while ( 0 = = strcmp ( OPAL_PATH_SEP , & ( param [ param_len - 1 ] ) ) ) {
2006-02-28 14:52:12 +03:00
param [ param_len - 1 ] = ' \0 ' ;
param_len - - ;
if ( 0 = = param_len ) {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:empty-prefix " ,
2010-07-18 01:03:27 +04:00
true , orte_basename , orte_basename ) ;
2006-02-28 14:52:12 +03:00
return ORTE_ERR_FATAL ;
}
}
app - > prefix_dir = strdup ( param ) ;
2011-04-29 02:12:41 +04:00
free ( param ) ;
2006-02-28 14:52:12 +03:00
}
2005-09-06 20:10:05 +04:00
}
2008-03-06 01:12:27 +03:00
/* Did the user specify a hostfile? Need to check for both
* hostfile and machine file .
* We can only deal with one hostfile per app context , otherwise give an error .
2008-02-28 04:57:57 +03:00
*/
2008-03-06 01:12:27 +03:00
if ( 0 < ( j = opal_cmd_line_get_ninsts ( & cmd_line , " hostfile " ) ) ) {
if ( 1 < j ) {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:multiple-hostfiles " ,
2010-07-18 01:03:27 +04:00
true , orte_basename , NULL ) ;
2008-03-06 01:12:27 +03:00
return ORTE_ERR_FATAL ;
} else {
value = opal_cmd_line_get_param ( & cmd_line , " hostfile " , 0 , 0 ) ;
app - > hostfile = strdup ( value ) ;
2005-03-14 23:57:21 +03:00
}
2008-03-06 01:12:27 +03:00
}
if ( 0 < ( j = opal_cmd_line_get_ninsts ( & cmd_line , " machinefile " ) ) ) {
if ( 1 < j | | NULL ! = app - > hostfile ) {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:multiple-hostfiles " ,
2010-07-18 01:03:27 +04:00
true , orte_basename , NULL ) ;
2008-03-06 01:12:27 +03:00
return ORTE_ERR_FATAL ;
} else {
value = opal_cmd_line_get_param ( & cmd_line , " machinefile " , 0 , 0 ) ;
app - > hostfile = strdup ( value ) ;
2005-03-14 23:57:21 +03:00
}
2008-03-06 01:12:27 +03:00
}
/* Did the user specify any hosts? */
if ( 0 < ( j = opal_cmd_line_get_ninsts ( & cmd_line , " host " ) ) ) {
2005-03-14 23:57:21 +03:00
for ( i = 0 ; i < j ; + + i ) {
2008-03-06 01:12:27 +03:00
value = opal_cmd_line_get_param ( & cmd_line , " host " , i , 0 ) ;
opal_argv_append_nosize ( & app - > dash_host , value ) ;
2005-03-14 23:57:21 +03:00
}
}
/* Get the numprocs */
2006-09-25 23:41:54 +04:00
app - > num_procs = ( orte_std_cntr_t ) orterun_globals . num_procs ;
2005-04-09 05:26:17 +04:00
2006-07-11 01:25:33 +04:00
total_num_apps + + ;
2007-03-17 02:11:45 +03:00
/* Preserve if we are to preload the binary */
app - > preload_binary = orterun_globals . preload_binary ;
if ( NULL ! = orterun_globals . preload_files )
app - > preload_files = strdup ( orterun_globals . preload_files ) ;
else
app - > preload_files = NULL ;
if ( NULL ! = orterun_globals . preload_files_dest_dir )
app - > preload_files_dest_dir = strdup ( orterun_globals . preload_files_dest_dir ) ;
else
app - > preload_files_dest_dir = NULL ;
A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php
Documentation:
http://osl.iu.edu/research/ft/
Major Changes:
--------------
* Added C/R-enabled Debugging support.
Enabled with the --enable-crdebug flag. See the following website for more information:
http://osl.iu.edu/research/ft/crdebug/
* Added Stable Storage (SStore) framework for checkpoint storage
* 'central' component does a direct to central storage save
* 'stage' component stages checkpoints to central storage while the application continues execution.
* 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
* 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
* Added Compression (compress) framework to support
* Add two new ErrMgr recovery policies
* {{{crmig}}} C/R Process Migration
* {{{autor}}} C/R Automatic Recovery
* Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
* Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
* {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
* {{{OMPI_CR_Restart}}}
* {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
* {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
* {{{OMPI_CR_Quiesce_start}}}
* {{{OMPI_CR_Quiesce_checkpoint}}}
* {{{OMPI_CR_Quiesce_end}}}
* {{{OMPI_CR_self_register_checkpoint_callback}}}
* {{{OMPI_CR_self_register_restart_callback}}}
* {{{OMPI_CR_self_register_continue_callback}}}
* The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
* Add a progress meter to:
* FileM rsh (filem_rsh_process_meter)
* SnapC full (snapc_full_progress_meter)
* SStore stage (sstore_stage_progress_meter)
* Added 2 new command line options to ompi-restart
* --showme : Display the full command line that would have been exec'ed.
* --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
* Deprecated some MCA params:
* crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
* snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
* snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
* snapc_base_store_in_place deprecated, replaced with different components of SStore
* snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
* snapc_base_establish_global_snapshot_dir deprecated, never well supported
* snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
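The deprecation list above amounts to a small rename table. The following is a hedged sketch of how such a table could be consulted. The type mca_rename_t and the function mca_lookup_deprecated() are hypothetical names for illustration and are not part of the MCA parameter API. Entries with no single replacement parameter carry NULL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pairing of a deprecated MCA parameter with its replacement. */
typedef struct {
    const char *old_name;
    const char *new_name;  /* NULL when there is no one-to-one replacement */
} mca_rename_t;

/* Contents taken from the list of deprecated parameters above. */
static const mca_rename_t deprecated_params[] = {
    { "crs_base_snapshot_dir",                    "sstore_stage_local_snapshot_dir" },
    { "snapc_base_global_snapshot_dir",           "sstore_base_global_snapshot_dir" },
    { "snapc_base_global_shared",                 "sstore_stage_global_is_shared"   },
    { "snapc_base_store_in_place",                NULL                              },
    { "snapc_base_global_snapshot_ref",           "sstore_base_global_snapshot_ref" },
    { "snapc_base_establish_global_snapshot_dir", NULL                              },
    { "snapc_full_skip_filem",                    "sstore_stage_skip_filem"         }
};

/* Returns 1 if 'name' is in the deprecated list, 0 otherwise.
 * On a hit, *replacement is set to the new parameter name, or to NULL
 * when the list names no direct replacement. */
int mca_lookup_deprecated(const char *name, const char **replacement)
{
    size_t i;
    size_t n = sizeof(deprecated_params) / sizeof(deprecated_params[0]);

    for (i = 0; i < n; ++i) {
        if (0 == strcmp(deprecated_params[i].old_name, name)) {
            *replacement = deprecated_params[i].new_name;
            return 1;
        }
    }
    return 0;
}
```

A caller could use the return value to decide whether to warn, and the replacement name (when not NULL) to tell the user which parameter to set instead.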
Minor Changes:
--------------
* Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
* Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
* Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
* Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
* Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
* opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
* Cleanup the CRS framework and components to work with the SStore framework.
* Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
* Add 'quiesce' hook to CRCP for a future enhancement.
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
* Add optional application level INC callbacks (registered through the CR MPI Ext interface).
* Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
* {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
* {{{opal-restart}}} also supports local decompression before restarting
* {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
* {{{orte-restart}}} now uses the SStore framework to work with the metadata
* Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
* Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
* Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
* Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
* odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
* Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
* Improve the checks for 'already checkpointing' error path.
* Add a recovery output timer, to show how long it takes to restart a job
* Do a better job of cleaning up the old session directory on restart.
* Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment).
* Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
This commit was SVN r23587.
The following Trac tickets were found above:
Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
2010-08-11 00:51:11 +04:00
# if OPAL_ENABLE_FT_CR == 1
if ( NULL ! = orterun_globals . sstore_load ) {
app - > sstore_load = strdup ( orterun_globals . sstore_load ) ;
} else {
app - > sstore_load = NULL ;
}
# endif
2007-03-17 02:11:45 +03:00
2006-02-16 23:40:23 +03:00
/* Do not try to find argv[0] here -- the starter is responsible
for that because it may not be relevant to try to find it on
the node where orterun is executing . So just strdup ( ) argv [ 0 ]
into app . */
2005-03-14 23:57:21 +03:00
2006-02-16 23:40:23 +03:00
app - > app = strdup ( app - > argv [ 0 ] ) ;
2005-03-14 23:57:21 +03:00
if ( NULL = = app - > app ) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
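The duplicate suppression described above can be pictured as a small table keyed on the help file and topic. This is a simplified sketch, not the HNP's implementation: the `help_tracker_t` type, the fixed table size, and the function names are all assumptions, and the ~5 second timer is replaced by an explicit flush call.

```c
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 32   /* arbitrary limit for this sketch */

typedef struct {
    char key[128];      /* "filename:topic" */
    int  pending;       /* duplicates seen since the last flush */
} help_entry_t;

typedef struct {
    help_entry_t entries[MAX_TOPICS];
    int count;
} help_tracker_t;

/* Returns 1 if the message should be displayed now (first time seen),
   0 if it is a duplicate that was only counted, -1 if the table is full. */
static int help_tracker_record(help_tracker_t *t, const char *filename,
                               const char *topic)
{
    char key[128];
    int i;

    snprintf(key, sizeof(key), "%s:%s", filename, topic);
    for (i = 0; i < t->count; ++i) {
        if (0 == strcmp(t->entries[i].key, key)) {
            ++t->entries[i].pending;
            return 0;
        }
    }
    if (t->count >= MAX_TOPICS) {
        return -1;
    }
    strcpy(t->entries[t->count].key, key);
    t->entries[t->count].pending = 0;
    ++t->count;
    return 1;
}

/* Reports and resets the pending duplicate counts; returns the total reported. */
static int help_tracker_flush(help_tracker_t *t, FILE *out)
{
    int i, total = 0;

    for (i = 0; i < t->count; ++i) {
        if (t->entries[i].pending > 0) {
            fprintf(out, "%d more copies of help message %s\n",
                    t->entries[i].pending, t->entries[i].key);
            total += t->entries[i].pending;
            t->entries[i].pending = 0;
        }
    }
    return total;
}
```

In this sketch a periodic timer would call `help_tracker_flush()`; setting something like orte_base_help_aggregate to 0 would correspond to bypassing the tracker and always displaying the message.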
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
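As a rough illustration of what "segregating messages into different XML tags" could look like, here is a small C sketch. The tag names, the `msg_kind_t` enum, and `xml_wrap()` are hypothetical; they do not describe the actual output of the xml filter component, which the known issues below note is still preliminary.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical message kinds; the real filter framework's tags may differ. */
typedef enum { MSG_HELP, MSG_INFRASTRUCTURE, MSG_STDOUT, MSG_STDERR } msg_kind_t;

static const char *tag_for(msg_kind_t kind)
{
    switch (kind) {
    case MSG_HELP:           return "help";
    case MSG_INFRASTRUCTURE: return "ompi";
    case MSG_STDOUT:         return "stdout";
    default:                 return "stderr";
    }
}

/* Writes <tag>escaped text</tag> into buf.
   Returns 0 on success, -1 if buf is too small. */
static int xml_wrap(msg_kind_t kind, const char *text, char *buf, size_t buflen)
{
    const char *tag = tag_for(kind);
    size_t used;
    int n;

    n = snprintf(buf, buflen, "<%s>", tag);
    if (n < 0 || (size_t) n >= buflen) {
        return -1;
    }
    used = (size_t) n;

    for (; '\0' != *text; ++text) {
        char single[2] = { *text, '\0' };
        const char *rep;

        /* Escape the characters that would break the markup */
        switch (*text) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  rep = single;  break;
        }
        if (used + strlen(rep) >= buflen) {
            return -1;
        }
        strcpy(buf + used, rep);
        used += strlen(rep);
    }

    n = snprintf(buf + used, buflen - used, "</%s>", tag);
    if (n < 0 || (size_t) n >= buflen - used) {
        return -1;
    }
    return 0;
}
```

A third-party tool reading the HNP's output could then tell a help message from user stdout by the enclosing tag alone.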
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:call-failed " ,
2010-07-18 01:03:27 +04:00
true , orte_basename , " library " , " strdup returned NULL " , errno ) ;
2005-03-14 23:57:21 +03:00
rc = ORTE_ERR_NOT_FOUND ;
goto cleanup ;
}
* app_ptr = app ;
app = NULL ;
* made_app = true ;
/* All done */
cleanup :
if ( NULL ! = app ) {
OBJ_RELEASE ( app ) ;
}
if ( cmd_line_made ) {
OBJ_DESTRUCT ( & cmd_line ) ;
}
return rc ;
}
While waiting for fortran compiles...
Fixes for orterun in handling different MCA params for different
processes (reviewed by Brian):
- By design, if you run the following:
mpirun --mca foo aaa --mca foo bbb a.out
a.out will get a single MCA param for foo with value "aaa,bbb".
- However, if you specify multiple apps with different values for the
same MCA param, you should expect to get the different values for
each app. For example:
mpirun --mca foo aaa a.out : --mca foo bbb b.out
Should yield a.out with a "foo" param with value "aaa" and b.out
with a "foo" param with a value "bbb".
- This did not work -- both a.out and b.out would get a "foo" with
"aaa,bbb".
- This commit fixes this behavior -- now a.out will get aaa and b.out
will get bbb.
- Additionally, if you mix --mca and an app file, you can have
"global" params and per-line-in-the-appfile params. For example:
mpirun --mca foo zzzz --app appfile
where "appfile" contains:
-np 1 --mca bar aaa a.out
-np 1 --mca bar bbb b.out
In this case, a.out will get foo=zzzz and bar=aaa, and b.out will
get foo=zzzz and bar=bbb.
Spiffy.
Ok, fortran build is done... back to Fortran... sigh...
This commit was SVN r5710.
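The per-app behavior described above can be sketched as: start each app from a copy of the global parameter set, then layer that app's own parameters on top, joining repeated names with commas. The `param_set_t` type and its helpers below are hypothetical stand-ins for the environment arrays orterun actually manipulates.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PARAMS 16   /* arbitrary limits for this sketch */
#define MAX_LEN    64

typedef struct {
    char name[MAX_LEN];
    char value[MAX_LEN];
} param_t;

typedef struct {
    param_t params[MAX_PARAMS];
    int count;
} param_set_t;

/* Sets name=value; if name already exists in this set, appends ",value".
   Returns 0 on success, -1 if the set or a buffer is full. */
static int param_set_add(param_set_t *set, const char *name, const char *value)
{
    int i;

    for (i = 0; i < set->count; ++i) {
        if (0 == strcmp(set->params[i].name, name)) {
            size_t used = strlen(set->params[i].value);
            if (used + 1 + strlen(value) >= MAX_LEN) {
                return -1;
            }
            set->params[i].value[used] = ',';
            strcpy(set->params[i].value + used + 1, value);
            return 0;
        }
    }
    if (set->count >= MAX_PARAMS ||
        strlen(name) >= MAX_LEN || strlen(value) >= MAX_LEN) {
        return -1;
    }
    strcpy(set->params[set->count].name, name);
    strcpy(set->params[set->count].value, value);
    ++set->count;
    return 0;
}

/* Returns the value for name, or NULL if it is not set. */
static const char *param_set_get(const param_set_t *set, const char *name)
{
    int i;

    for (i = 0; i < set->count; ++i) {
        if (0 == strcmp(set->params[i].name, name)) {
            return set->params[i].value;
        }
    }
    return NULL;
}
```

Copying the global set by value before adding per-app entries is what keeps one app's parameters from leaking into the next, which is the bug this commit describes.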
2005-05-13 18:36:36 +04:00
static int parse_appfile ( char * filename , char * * * env )
2005-03-14 23:57:21 +03:00
{
size_t i , len ;
FILE * fp ;
char line [ BUFSIZ ] ;
2006-03-23 20:55:25 +03:00
int rc , argc , app_num ;
2005-03-14 23:57:21 +03:00
char * * argv ;
orte_app_context_t * app ;
bool blank , made_app ;
char bogus [ ] = " bogus " ;
2005-05-13 18:36:36 +04:00
char * * tmp_env ;
2005-03-14 23:57:21 +03:00
2007-10-11 22:33:40 +04:00
/*
* Make sure to clear out this variable so we don ' t do anything odd in
* app_create ( )
*/
if ( NULL ! = orterun_globals . appfile ) {
free ( orterun_globals . appfile ) ;
orterun_globals . appfile = NULL ;
}
2005-03-14 23:57:21 +03:00
/* Try to open the file */
fp = fopen ( filename , " r " ) ;
if ( NULL = = fp ) {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-orterun.txt " , " orterun:appfile-not-found " , true ,
2005-03-14 23:57:21 +03:00
filename ) ;
return ORTE_ERR_NOT_FOUND ;
}
/* Read in line by line */
line [ sizeof ( line ) - 1 ] = ' \0 ' ;
2006-03-23 20:55:25 +03:00
app_num = 0 ;
2005-03-14 23:57:21 +03:00
do {
/* We need a bogus argv[0] (because when argv comes in from
the command line , argv [ 0 ] is " orterun " , so the parsing
logic ignores it ) . So create one here rather than making
an argv and then pre - pending a new argv [ 0 ] ( which would be
rather inefficient ) . */
line [ 0 ] = ' \0 ' ;
strcat ( line , bogus ) ;
2005-09-05 00:54:19 +04:00
if ( NULL = = fgets ( line + sizeof ( bogus ) - 1 ,
2005-03-14 23:57:21 +03:00
sizeof ( line ) - sizeof ( bogus ) - 1 , fp ) ) {
break ;
}
2005-04-12 22:42:34 +04:00
/* Remove a trailing newline */
2005-03-14 23:57:21 +03:00
len = strlen ( line ) ;
2005-04-12 22:42:34 +04:00
    if (len > 0 && '\n' == line[len - 1]) {
        line[len - 1] = '\0';
        --len;
    }
/* Remove comments */
2005-03-14 23:57:21 +03:00
for ( i = 0 ; i < len ; + + i ) {
if ( ' # ' = = line [ i ] ) {
line [ i ] = ' \0 ' ;
break ;
} else if ( i + 1 < len & & ' / ' = = line [ i ] & & ' / ' = = line [ i + 1 ] ) {
line [ i ] = ' \0 ' ;
break ;
}
}
/* Is this a blank line? */
len = strlen ( line ) ;
for ( blank = true , i = sizeof ( bogus ) ; i < len ; + + i ) {
if ( ! isspace ( line [ i ] ) ) {
blank = false ;
break ;
}
}
if ( blank ) {
continue ;
}
/* We got a line with *something* on it. So process it */
2005-07-04 04:13:44 +04:00
argv = opal_argv_split ( line , ' ' ) ;
argc = opal_argv_count ( argv ) ;
2005-03-14 23:57:21 +03:00
if ( argc > 0 ) {
2005-05-13 18:36:36 +04:00
2005-08-08 20:42:28 +04:00
/* Create a temporary env to use in the recursive call --
that is : don ' t disturb the original env so that we can
have a consistent global env . This allows for the
case :
2005-09-05 00:54:19 +04:00
orterun - - mca foo bar - - appfile file
2005-08-08 20:42:28 +04:00
where the " file " contains multiple apps . In this case ,
each app in " file " will get * only * foo = bar as the base
environment from which its specific environment is
constructed . */
2005-05-13 18:36:36 +04:00
if ( NULL ! = * env ) {
2005-07-04 04:13:44 +04:00
tmp_env = opal_argv_copy ( * env ) ;
2005-05-13 18:36:36 +04:00
    if (NULL == tmp_env) {
        /* Release argv and the file handle before bailing out */
        opal_argv_free(argv);
        fclose(fp);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }
} else {
tmp_env = NULL ;
}
rc = create_app ( argc , argv , & app , & made_app , & tmp_env ) ;
2005-03-14 23:57:21 +03:00
if ( ORTE_SUCCESS ! = rc ) {
/* Assume that the error message has already been
printed ; no need to cleanup - - we can just exit */
exit ( 1 ) ;
}
2005-05-13 18:36:36 +04:00
if ( NULL ! = tmp_env ) {
2005-07-04 04:13:44 +04:00
opal_argv_free ( tmp_env ) ;
2005-05-13 18:36:36 +04:00
}
2005-03-14 23:57:21 +03:00
if ( made_app ) {
2006-03-24 18:28:42 +03:00
app - > idx = app_num ;
+ + app_num ;
2008-02-28 08:32:23 +03:00
opal_pointer_array_add ( jdata - > apps , app ) ;
2008-02-28 04:57:57 +03:00
+ + jdata - > num_apps ;
2005-03-14 23:57:21 +03:00
}
}
} while ( ! feof ( fp ) ) ;
fclose ( fp ) ;
/* All done */
free ( filename ) ;
return ORTE_SUCCESS ;
}
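The per-line cleanup that parse_appfile() performs (drop the trailing newline, cut `#` and `//` comments, skip blank lines) can be isolated into a small helper. This is a sketch for illustration only: `clean_appfile_line()` is a hypothetical name, and it leaves out the "bogus" argv[0] prefix that the real function prepends.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Strips a trailing newline and any '#' or '//' comment from 'line' in place.
   Returns true if anything other than whitespace remains. */
static bool clean_appfile_line(char *line)
{
    size_t i, len = strlen(line);

    /* Remove a trailing newline */
    if (len > 0 && '\n' == line[len - 1]) {
        line[--len] = '\0';
    }

    /* Remove comments: everything from the first '#' or '//' onward */
    for (i = 0; i < len; ++i) {
        if ('#' == line[i] ||
            (i + 1 < len && '/' == line[i] && '/' == line[i + 1])) {
            line[i] = '\0';
            break;
        }
    }

    /* Is anything left besides whitespace? */
    for (i = 0; '\0' != line[i]; ++i) {
        if (!isspace((unsigned char) line[i])) {
            return true;
        }
    }
    return false;
}
```

One consequence of treating `//` as a comment marker, in this sketch as in the code above, is that an argument containing a doubled slash (for example a path such as `dir//a.out`) is cut at that point.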
2010-07-07 03:35:42 +04:00
/*
* Process one line from the orte_base_user_debugger MCA param and
* look for that debugger in the path . If we find it , fill in
* new_argv .
*/
static int process ( char * orig_line , char * basename , opal_cmd_line_t * cmd_line ,
int argc , char * * argv , char * * * new_argv , int num_procs )
{
2011-01-28 16:01:06 +03:00
int ret = ORTE_SUCCESS ;
int i , j , count ;
char * line = NULL , * tmp = NULL , * full_line = strdup ( orig_line ) ;
char * * orterun_argv = NULL , * * executable_argv = NULL , * * line_argv = NULL ;
2010-07-07 03:35:42 +04:00
char cwd [ OPAL_PATH_MAX ] ;
bool used_num_procs = false ;
bool single_app = false ;
bool fail_needed_executable = false ;
line = full_line ;
if ( NULL = = line ) {
2011-01-28 16:01:06 +03:00
ret = ORTE_ERR_OUT_OF_RESOURCE ;
goto out ;
2010-07-07 03:35:42 +04:00
}
    /* Trim off whitespace at the beginning and ending of line */
    for (i = 0; '\0' != line[i] && isspace(line[i]); ++line) {
        continue;
    }
    /* Walk backwards from the last character, not forwards */
    for (i = (int) strlen(line) - 1; i > 0 && isspace(line[i]); --i) {
        line[i] = '\0';
    }
if ( strlen ( line ) < = 0 ) {
2011-01-28 16:01:06 +03:00
ret = ORTE_ERROR ;
goto out ;
2010-07-07 03:35:42 +04:00
}
/* Get the tail of the command line (i.e., the user executable /
argv ) */
2011-01-28 16:01:06 +03:00
opal_cmd_line_get_tail ( cmd_line , & i , & executable_argv ) ;
/* Make a new copy of the orterun command line args, without the
orterun token itself , and without the - - debug , - - debugger , and
- tv flags . */
orterun_argv = opal_argv_copy ( argv ) ;
count = opal_argv_count ( orterun_argv ) ;
opal_argv_delete ( & count , & orterun_argv , 0 , 1 ) ;
    for (i = 0; NULL != orterun_argv[i]; ++i) {
        count = opal_argv_count(orterun_argv);
        if (0 == strcmp(orterun_argv[i], "-debug") ||
            0 == strcmp(orterun_argv[i], "--debug")) {
            opal_argv_delete(&count, &orterun_argv, i, 1);
            --i; /* re-examine the element that shifted into slot i */
        } else if (0 == strcmp(orterun_argv[i], "-tv") ||
                   0 == strcmp(orterun_argv[i], "--tv")) {
            opal_argv_delete(&count, &orterun_argv, i, 1);
            --i; /* re-examine the element that shifted into slot i */
        } else if (0 == strcmp(orterun_argv[i], "--debugger") ||
                   0 == strcmp(orterun_argv[i], "-debugger")) {
            opal_argv_delete(&count, &orterun_argv, i, 2);
            --i; /* re-examine the element that shifted into slot i */
2010-07-07 03:35:42 +04:00
}
}
/* Replace @@ tokens - line should never realistically be bigger
than MAX_INT , so just cast to int to remove compiler warning */
2011-01-28 16:01:06 +03:00
* new_argv = NULL ;
line_argv = opal_argv_split ( line , ' ' ) ;
if ( NULL = = line_argv ) {
ret = ORTE_ERR_NOT_FOUND ;
goto out ;
}
for ( i = 0 ; NULL ! = line_argv [ i ] ; + + i ) {
if ( 0 = = strcmp ( line_argv [ i ] , " @mpirun@ " ) | |
0 = = strcmp ( line_argv [ i ] , " @orterun@ " ) ) {
opal_argv_append_nosize ( new_argv , argv [ 0 ] ) ;
} else if ( 0 = = strcmp ( line_argv [ i ] , " @mpirun_args@ " ) | |
0 = = strcmp ( line_argv [ i ] , " @orterun_args@ " ) ) {
for ( j = 0 ; NULL ! = orterun_argv & & NULL ! = orterun_argv [ j ] ; + + j ) {
opal_argv_append_nosize ( new_argv , orterun_argv [ j ] ) ;
}
        } else if (0 == strcmp(line_argv[i], "@np@")) {
            /* Record that this debugger needs -np so the check below can fire */
            used_num_procs = true;
            asprintf(&tmp, "%d", num_procs);
            opal_argv_append_nosize(new_argv, tmp);
            free(tmp);
            tmp = NULL;
} else if ( 0 = = strcmp ( line_argv [ i ] , " @single_app@ " ) ) {
2010-07-07 03:35:42 +04:00
/* This token is only a flag; it is not replaced with any
alternate text */
single_app = true ;
2011-01-28 16:01:06 +03:00
} else if ( 0 = = strcmp ( line_argv [ i ] , " @executable@ " ) ) {
2010-07-07 03:35:42 +04:00
/* If we found the executable, paste it in. Otherwise,
this is a possible error . */
2011-01-28 16:01:06 +03:00
if ( NULL ! = executable_argv ) {
opal_argv_append_nosize ( new_argv , executable_argv [ 0 ] ) ;
2010-07-07 03:35:42 +04:00
} else {
fail_needed_executable = true ;
}
2011-01-28 16:01:06 +03:00
} else if ( 0 = = strcmp ( line_argv [ i ] , " @executable_argv@ " ) ) {
2010-07-07 03:35:42 +04:00
/* If we found the tail, paste in the argv. Otherwise,
this is a possible error . */
2011-01-28 16:01:06 +03:00
if ( NULL ! = executable_argv ) {
for ( j = 1 ; NULL ! = executable_argv [ j ] ; + + j ) {
opal_argv_append_nosize ( new_argv , executable_argv [ j ] ) ;
2010-07-07 03:35:42 +04:00
}
} else {
fail_needed_executable = true ;
}
2011-01-28 16:01:06 +03:00
} else {
/* It wasn't a special token, so just copy it over */
opal_argv_append_nosize ( new_argv , line_argv [ i ] ) ;
2010-07-07 03:35:42 +04:00
}
}
    /* Can we find argv[0] in the path? */
    getcwd(cwd, OPAL_PATH_MAX);
    tmp = opal_path_findv((*new_argv)[0], X_OK, environ, cwd);
    if (NULL != tmp) {
        free(tmp);

        /* Ok, we found a good debugger.  Check for some error
           conditions. */
        tmp = opal_argv_join(argv, ' ');

        /* We do not support launching a debugger that requires the
           -np value if the user did not specify -np on the command
           line. */
        if (used_num_procs && 0 == num_procs) {
            free(tmp);
            tmp = opal_argv_join(orterun_argv, ' ');
            orte_show_help("help-orterun.txt", "debugger requires -np",
                           true, (*new_argv)[0], argv[0], tmp,
                           (*new_argv)[0]);
            /* Fall through to free / fail, below */
        }
        /* Some debuggers do not support launching MPMD */
        else if (single_app && NULL != strchr(tmp, ':')) {
            orte_show_help("help-orterun.txt",
                           "debugger only accepts single app", true,
                           (*new_argv)[0], (*new_argv)[0]);
            /* Fall through to free / fail, below */
        }
        /* Some debuggers do not use orterun/mpirun, and therefore
           must have an executable to run (e.g., cannot use mpirun's
           app context file feature). */
        else if (fail_needed_executable) {
            orte_show_help("help-orterun.txt",
                           "debugger requires executable", true,
                           (*new_argv)[0], argv[0], (*new_argv)[0], argv[0],
                           (*new_argv)[0]);
            /* Fall through to free / fail, below */
        }
        /* Otherwise, we succeeded.  Return happiness. */
        else {
            goto out;
        }
        /* Release tmp and reset it so that the cleanup code at "out"
           does not free it a second time */
        free(tmp);
        tmp = NULL;
    }

    /* All done -- didn't find it */
    opal_argv_free(*new_argv);
    *new_argv = NULL;
    ret = ORTE_ERR_NOT_FOUND;

 out:
    if (NULL != orterun_argv) {
        opal_argv_free(orterun_argv);
    }
    if (NULL != executable_argv) {
        opal_argv_free(executable_argv);
    }
    if (NULL != line_argv) {
        opal_argv_free(line_argv);
    }
    if (NULL != tmp) {
        free(tmp);
    }
    if (NULL != full_line) {
        free(full_line);
    }
    return ret;
}
/**
 * Run a user-level debugger
 */
static void run_debugger(char *basename, opal_cmd_line_t *cmd_line,
                         int argc, char *argv[], int num_procs)
{
    int i, id;
    char **new_argv = NULL;
    char *value, **lines, *env_name;

    /* Get the orte_base_user_debugger MCA parameter and search for a
       debugger that can run */
    id = mca_base_param_find("orte", NULL, "base_user_debugger");
    if (id < 0) {
        orte_show_help("help-orterun.txt", "debugger-mca-param-not-found",
                       true);
        exit(1);
    }
    value = NULL;
    mca_base_param_lookup_string(id, &value);
    if (NULL == value) {
        orte_show_help("help-orterun.txt",
                       "debugger-orte_base_user_debugger-empty", true);
        exit(1);
    }

    /* Look through all the values in the MCA param */
    lines = opal_argv_split(value, ':');
    free(value);
    for (i = 0; NULL != lines[i]; ++i) {
        if (ORTE_SUCCESS == process(lines[i], basename, cmd_line, argc, argv,
                                    &new_argv, num_procs)) {
            break;
        }
    }

    /* If we didn't find one, abort */
    if (NULL == lines[i]) {
        orte_show_help("help-orterun.txt", "debugger-not-found", true);
        exit(1);
    }
    opal_argv_free(lines);

    /* We found one */

    /* cleanup the MPIR arrays in case the debugger doesn't set them */
    memset((char *) MPIR_executable_path, 0, MPIR_MAX_PATH_LENGTH);
    memset((char *) MPIR_server_arguments, 0, MPIR_MAX_ARG_LENGTH);

    /* Set an MCA param so that everyone knows that they are being
       launched under a debugger; not all debuggers are consistent
       about setting MPIR_being_debugged in both the launcher and the
       MPI processes */
    env_name = mca_base_param_environ_variable("orte",
                                               "in_parallel_debugger", NULL);
    if (NULL != env_name) {
        opal_setenv(env_name, "1", true, &environ);
        free(env_name);
    }

    /* Launch the debugger.  execvp() only returns on failure, so
       everything below is the error path. */
    execvp(new_argv[0], new_argv);
    value = opal_argv_join(new_argv, ' ');
    orte_show_help("help-orterun.txt", "debugger-exec-failed",
                   true, basename, value, new_argv[0]);
    free(value);
    opal_argv_free(new_argv);
    exit(1);
}
/**** DEBUGGER CODE ****/
/*
 * Debugger support for orterun
 *
 * We interpret the MPICH debugger interface as follows:
 *
 * a) The launcher
 *      - spawns the other processes,
 *      - fills in the table MPIR_proctable, and sets MPIR_proctable_size
 *      - sets MPIR_debug_state to MPIR_DEBUG_SPAWNED (= 1)
 *      - calls MPIR_Breakpoint() which the debugger will have a
 *        breakpoint on.
 *
 * b) Applications start and then spin until MPIR_debug_gate is set
 *    non-zero by the debugger.
 *
 * This file implements (a).
 *
 **************************************************************************
 *
 * Note that we have presently tested both TotalView and DDT parallel
 * debuggers.  They both nominally subscribe to the Etnus attaching
 * interface, but there are differences between the two.
 *
 * TotalView: user launches "totalview mpirun -a ...<mpirun args>...".
 * TV launches mpirun.  mpirun launches the application and then calls
 * MPIR_Breakpoint().  This is the signal to TV that it's a parallel
 * MPI job.  TV then reads the proctable in mpirun and attaches itself
 * to all the processes (it takes care of launching itself on the
 * remote nodes).  Upon attaching to all the MPI processes, the
 * variable MPIR_being_debugged is set to 1.  When it has finished
 * attaching itself to all the MPI processes that it wants to,
 * MPIR_Breakpoint() returns.
 *
 * DDT: user launches "ddt bin -np X <mpi app name>".  DDT fork/exec's
 * mpirun to launch ddt-debugger on the back-end nodes via "mpirun -np
 * X ddt-debugger" (note the lack of other arguments -- we can't pass
 * anything to mpirun).  This app will eventually fork/exec the MPI
 * app.  DDT does not currently set MPIR_being_debugged in the MPI app.
 *
 **************************************************************************
 *
 * We support two ways of waiting for attaching debuggers.  The
 * implementation spans this file and ompi/debuggers/ompi_debuggers.c.
 *
 * 1. If using orterun: MPI processes will have the
 * orte_in_parallel_debugger MCA param set to true (because not all
 * debuggers consistently set MPIR_being_debugged in both the launcher
 * and in the MPI procs).  The HNP will call MPIR_Breakpoint() and
 * then send an RML message to VPID 0 (MCW rank 0) when it returns
 * (MPIR_Breakpoint() doesn't return until the debugger has attached
 * to all relevant processes).  Meanwhile, VPID 0 blocks waiting for
 * the RML message.  All other VPIDs immediately call the grpcomm
 * barrier (and therefore block until the debugger attaches).  Once
 * VPID 0 receives the RML message, we know that the debugger has
 * attached to all processes that it cares about, and VPID 0 then
 * joins the grpcomm barrier, allowing the job to continue.  This
 * scheme has the side effect of nicely supporting partial attaches by
 * parallel debuggers (i.e., attaching to only some of the MPI
 * processes; not necessarily all of them).
 *
 * 2. If not using orterun: in this case, ORTE_DISABLE_FULL_SUPPORT
 * will be true, and we know that there will not be an RML message
 * sent to VPID 0.  So we have to look for a magic environment
 * variable from the launcher to know if the jobs will be attached by
 * a debugger (e.g., set by yod, srun, ...etc.), and if so, spin on
 * MPIR_debug_gate.  These environment variable names must be
 * hard-coded in the OMPI layer (see ompi/debuggers/ompi_debuggers.c).
 */
/* local globals and functions */
static void attach_debugger(int fd, short event, void *arg);
static void build_debugger_args(orte_app_context_t *debugger);
static void open_fifo(void);
static opal_event_t attach;
static int attach_fd = -1;
static bool fifo_active = false;

#define DUMP_INT(X) fprintf(stderr, " %s = %d\n", # X, X);
#define FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

struct MPIR_PROCDESC {
    char *host_name;        /* something that can be passed to inet_addr */
    char *executable_name;  /* name of binary */
    int pid;                /* process pid */
};
static void orte_debugger_dump(void)
{
    int i;

    DUMP_INT(MPIR_being_debugged);
    DUMP_INT(MPIR_debug_state);
    DUMP_INT(MPIR_partial_attach_ok);
    DUMP_INT(MPIR_i_am_starter);
    DUMP_INT(MPIR_forward_output);
    DUMP_INT(MPIR_proctable_size);
    fprintf(stderr, "MPIR_proctable:\n");
    for (i = 0; i < MPIR_proctable_size; i++) {
        fprintf(stderr,
                "(i, host, exe, pid) = (%d, %s, %s, %d)\n",
                i,
                MPIR_proctable[i].host_name,
                MPIR_proctable[i].executable_name,
                MPIR_proctable[i].pid);
    }
    fprintf(stderr, "MPIR_executable_path: %s\n",
            ('\0' == MPIR_executable_path[0]) ?
            "NULL" : (char *) MPIR_executable_path);
    fprintf(stderr, "MPIR_server_arguments: %s\n",
            ('\0' == MPIR_server_arguments[0]) ?
            "NULL" : (char *) MPIR_server_arguments);
}
/**
 * Initialization of data structures for running under a debugger
 * using the MPICH/TotalView parallel debugger interface.  Before the
 * spawn we need to check if we are being run under a TotalView-like
 * debugger; if so then inform applications via an MCA parameter.
 */
static void orte_debugger_init_before_spawn(orte_job_t *jdata)
{
    char *env_name;
    orte_app_context_t *app;
    int i;
    int32_t ljob;
    char *attach_fifo;

    if (!MPIR_being_debugged && !orte_in_parallel_debugger) {
        /* if we were given a test debugger, then we still want to
         * colaunch it
         */
        if (NULL != orte_debugger_test_daemon) {
            opal_output_verbose(2, orte_debug_output,
                                "%s Debugger test daemon specified",
                                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
            goto launchit;
        }
        /* if we were given an auto-detect rate, then we want to setup
         * an event so we periodically do the check
         */
        if (0 < orte_debugger_check_rate) {
            opal_output_verbose(2, orte_debug_output,
                                "%s Setting debugger attach check rate for %d seconds",
                                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                                orte_debugger_check_rate);
            ORTE_TIMER_EVENT(orte_debugger_check_rate, 0, attach_debugger);
        } else {
            /* create the attachment FIFO and setup readevent */
            /* create a FIFO name in the session dir */
            attach_fifo = opal_os_path(false, orte_process_info.job_session_dir,
                                       "debugger_attach_fifo", NULL);
            if ((mkfifo(attach_fifo, FILE_MODE) < 0) && errno != EEXIST) {
                opal_output(0, "CANNOT CREATE FIFO %s: errno %d",
                            attach_fifo, errno);
                free(attach_fifo);
                return;
            }
            strncpy(MPIR_attach_fifo, attach_fifo, MPIR_MAX_PATH_LENGTH - 1);
            free(attach_fifo);
            open_fifo();
        }
        return;
    }

 launchit:
    opal_output_verbose(1, orte_debug_output, "Info: Spawned by a debugger");

    /* tell the procs they are being debugged */
    env_name = mca_base_param_environ_variable("orte",
                                               "in_parallel_debugger", NULL);
    for (i = 0; i < jdata->apps->size; i++) {
        if (NULL == (app = (orte_app_context_t *)
                     opal_pointer_array_get_item(jdata->apps, i))) {
            continue;
        }
        opal_setenv(env_name, "1", true, &app->env);
    }
    free(env_name);

    /* check if we need to co-spawn the debugger daemons */
    if ('\0' != MPIR_executable_path[0] || NULL != orte_debugger_test_daemon) {
        /* can only have one debugger */
        if (NULL != orte_debugger_daemon) {
            opal_output(0, "-------------------------------------------\n"
                        "Only one debugger can be used on a job.\n"
                        "-------------------------------------------\n");
            ORTE_UPDATE_EXIT_STATUS(ORTE_ERROR_DEFAULT_EXIT_CODE);
            return;
        }
        opal_output_verbose(2, orte_debug_output,
                            "%s Cospawning debugger daemons %s",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                            (NULL == orte_debugger_test_daemon) ?
                            MPIR_executable_path : orte_debugger_test_daemon);
        /* add debugger info to launch message */
        orte_debugger_daemon = OBJ_NEW(orte_job_t);
        /* create a jobid for these daemons - this is done solely
         * to avoid confusing the rest of the system's bookkeeping
         */
        orte_plm_base_create_jobid(orte_debugger_daemon);
        /* flag the job as being debugger daemons */
        orte_debugger_daemon->controls |= ORTE_JOB_CONTROL_DEBUGGER_DAEMON;
        /* unless directed, we do not forward output */
        if (!MPIR_forward_output) {
            orte_debugger_daemon->controls &= ~ORTE_JOB_CONTROL_FORWARD_OUTPUT;
        }
        /* add it to the global job pool */
        ljob = ORTE_LOCAL_JOBID(orte_debugger_daemon->jobid);
        opal_pointer_array_set_item(orte_job_data, ljob, orte_debugger_daemon);
        /* create an app_context for the debugger daemon */
        app = OBJ_NEW(orte_app_context_t);
        if (NULL != orte_debugger_test_daemon) {
            app->app = strdup(orte_debugger_test_daemon);
        } else {
            app->app = strdup((char *) MPIR_executable_path);
        }
        opal_argv_append_nosize(&app->argv, app->app);
        build_debugger_args(app);
        opal_pointer_array_add(orte_debugger_daemon->apps, app);
        orte_debugger_daemon->num_apps = 1;
    }
}
/*
 * Initialization of data structures for running under a debugger
 * using the MPICH/TotalView parallel debugger interface.  This stage
 * of initialization must occur after spawn
 *
 * NOTE: We -always- perform this step to ensure that any debugger
 * that attaches to us post-launch of the application can get a
 * completed proctable
 */
static void orte_debugger_init_after_spawn(orte_job_t *jdata)
{
    orte_proc_t *proc;
    orte_app_context_t *appctx;
    orte_vpid_t i, j;
    opal_buffer_t buf;
    orte_process_name_t rank0;
    int rc;

    /* if we couldn't get through the mapper stage, we might
     * enter here with no procs.  Avoid the "zero byte malloc"
     * message by checking here
     */
    if (MPIR_proctable || 0 == jdata->num_procs) {
        /* already initialized */
        opal_output_verbose(5, orte_debug_output,
                            "%s: debugger already initialized or zero procs",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
        return;
    }

    /* fill in the proc table for the application processes */
    opal_output_verbose(5, orte_debug_output,
                        "%s: Setting up debugger process table for applications",
                        ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));

    MPIR_debug_state = 1;

    /* set the total number of processes in the job */
    MPIR_proctable_size = jdata->num_procs;

    /* allocate MPIR_proctable */
    MPIR_proctable = (struct MPIR_PROCDESC *) malloc(sizeof(struct MPIR_PROCDESC) *
                                                     MPIR_proctable_size);
    if (MPIR_proctable == NULL) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        return;
    }

    if (orte_debugger_dump_proctable) {
        opal_output(orte_clean_output, "MPIR Proctable for job %s",
                    ORTE_JOBID_PRINT(jdata->jobid));
    }

    /* initialize MPIR_proctable */
    for (j = 0; j < jdata->num_procs; j++) {
        if (NULL == (proc = (orte_proc_t *)
                     opal_pointer_array_get_item(jdata->procs, j))) {
            continue;
        }
        /* store this data in the location whose index
         * corresponds to the proc's rank
         */
        i = proc->name.vpid;
        if (NULL == (appctx = (orte_app_context_t *)
                     opal_pointer_array_get_item(jdata->apps, proc->app_idx))) {
            continue;
        }

        MPIR_proctable[i].host_name = strdup(proc->node->name);
        /* use the app path as-is if it is absolute; otherwise,
         * prefix it with the app's working directory
         */
        if (0 == strncmp(appctx->app, OPAL_PATH_SEP, 1)) {
            MPIR_proctable[i].executable_name =
                opal_os_path(false, appctx->app, NULL);
        } else {
            MPIR_proctable[i].executable_name =
                opal_os_path(false, appctx->cwd, appctx->app, NULL);
        }
        MPIR_proctable[i].pid = proc->pid;
        if (orte_debugger_dump_proctable) {
            opal_output(orte_clean_output, "%s: Host %s Exe %s Pid %d",
                        ORTE_VPID_PRINT(i), MPIR_proctable[i].host_name,
                        MPIR_proctable[i].executable_name, MPIR_proctable[i].pid);
        }
    }

    if (0 < opal_output_get_verbosity(orte_debug_output)) {
        orte_debugger_dump();
    }

    /* if we are being launched under a debugger, then we must wait
     * for it to be ready to go and do some things to start the job
     */
    if (MPIR_being_debugged) {
        /* wait for all procs to have reported their contact info - this
         * ensures that (a) they are all into mpi_init, and (b) the system
         * has the contact info to successfully send a message to rank=0
         */
        ORTE_PROGRESSED_WAIT(false, jdata->num_reported, jdata->num_procs);

        MPIR_Breakpoint();

        /* send a message to rank=0 to release it */
        OBJ_CONSTRUCT(&buf, opal_buffer_t);  /* don't need anything in this */
        rank0.jobid = jdata->jobid;
        rank0.vpid = 0;
        if (0 > (rc = orte_rml.send_buffer(&rank0, &buf,
                                           ORTE_RML_TAG_DEBUGGER_RELEASE, 0))) {
            opal_output(0, "Error: could not send debugger release to MPI procs - error %s",
                        ORTE_ERROR_NAME(rc));
        }
        OBJ_DESTRUCT(&buf);
    }
}
static void open_fifo(void)
{
    if (attach_fd > 0) {
        close(attach_fd);
    }

    attach_fd = open(MPIR_attach_fifo, O_RDONLY | O_NONBLOCK, 0);
    if (attach_fd < 0) {
        opal_output(0, "%s unable to open debugger attach fifo",
                    ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
        return;
    }
    opal_output_verbose(2, orte_debug_output,
                        "%s Monitoring debugger attach fifo %s",
                        ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                        MPIR_attach_fifo);
    opal_event_set(opal_event_base, &attach, attach_fd, OPAL_EV_READ,
                   attach_debugger, NULL);
    fifo_active = true;
    opal_event_add(&attach, 0);
}
static void attach_debugger(int fd, short event, void *arg)
{
    orte_app_context_t *app;
    unsigned char fifo_cmd;
    int rc;
    int32_t ljob;
    orte_job_t *jdata;

    /* read the file descriptor to clear that event, if necessary */
    if (fifo_active) {
        opal_event_del(&attach);
        fifo_active = false;

        rc = read(attach_fd, &fifo_cmd, sizeof(fifo_cmd));
        if (!rc) {
            /* reopen device to clear hangup */
            open_fifo();
            return;
        }
        if (1 != fifo_cmd) {
            /* ignore the cmd */
            goto RELEASE;
        }
    }

    if (!MPIR_being_debugged && !orte_debugger_test_attach) {
        /* false alarm */
        goto RELEASE;
    }

    opal_output_verbose(1, orte_debug_output,
                        "%s Attaching debugger %s",
                        ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                        (NULL == orte_debugger_test_daemon) ?
                        MPIR_executable_path : orte_debugger_test_daemon);

    /* a debugger has attached! All the MPIR_Proctable
     * data is already available, so we only need to
     * check to see if we should spawn any daemons
     */
    if ('\0' != MPIR_executable_path[0] || NULL != orte_debugger_test_daemon) {
        /* can only have one debugger */
        if (NULL != orte_debugger_daemon) {
            opal_output(0, "-------------------------------------------\n"
                        "Only one debugger can be used on a job.\n"
                        "-------------------------------------------\n");
            goto RELEASE;
        }
        opal_output_verbose(2, orte_debug_output,
                            "%s Spawning debugger daemons %s",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                            (NULL == orte_debugger_test_daemon) ?
                            MPIR_executable_path : orte_debugger_test_daemon);
        /* this will be launched just like a regular job,
         * so we do not use the global orte_debugger_daemon
         * as this is reserved for co-location upon startup
         */
        jdata = OBJ_NEW(orte_job_t);
        /* create a jobid for these daemons - this is done solely
         * to avoid confusing the rest of the system's bookkeeping
         */
        orte_plm_base_create_jobid(jdata);
        /* flag the job as being debugger daemons */
        jdata->controls |= ORTE_JOB_CONTROL_DEBUGGER_DAEMON;
        /* unless directed, we do not forward output */
        if (!MPIR_forward_output) {
            jdata->controls &= ~ORTE_JOB_CONTROL_FORWARD_OUTPUT;
        }
        /* add it to the global job pool */
        ljob = ORTE_LOCAL_JOBID(jdata->jobid);
        opal_pointer_array_set_item(orte_job_data, ljob, jdata);
        /* create an app_context for the debugger daemon */
        app = OBJ_NEW(orte_app_context_t);
        if (NULL != orte_debugger_test_daemon) {
            app->app = strdup(orte_debugger_test_daemon);
        } else {
            app->app = strdup((char *) MPIR_executable_path);
        }

        jdata->state = ORTE_JOB_STATE_INIT;

        opal_argv_append_nosize(&app->argv, app->app);
        build_debugger_args(app);
        opal_pointer_array_add(jdata->apps, app);
        jdata->num_apps = 1;
        /* setup the mapping policy to pernode so we get one
         * daemon on each node
         */
        jdata->map = OBJ_NEW(orte_job_map_t);
        jdata->map->mapping = ORTE_MAPPING_PPR;
        jdata->map->ppr = strdup("1:n");

        /* now go ahead and spawn this job */
        if (ORTE_SUCCESS != (rc = orte_plm.spawn(jdata))) {
            ORTE_ERROR_LOG(rc);
        }
    }

 RELEASE:
    /* reset the read or timer event */
    if (0 == orte_debugger_check_rate) {
        fifo_active = true;
        opal_event_add(&attach, 0);
    } else if (!MPIR_being_debugged) {
        ORTE_TIMER_EVENT(orte_debugger_check_rate, 0, attach_debugger);
    }

    /* notify the debugger that all is ready */
    MPIR_Breakpoint();
}
static void build_debugger_args(orte_app_context_t *debugger)
{
    int i, j;
    char mpir_arg[MPIR_MAX_ARG_LENGTH];

    if ('\0' != MPIR_server_arguments[0]) {
        j = 0;
        memset(mpir_arg, 0, MPIR_MAX_ARG_LENGTH);
        /* MPIR_server_arguments holds NUL-separated arguments; scan
         * the whole buffer and append each non-empty run of
         * characters to the debugger's argv
         */
        for (i = 0; i < MPIR_MAX_ARG_LENGTH; i++) {
            if (MPIR_server_arguments[i] == '\0') {
                if (0 < j) {
                    opal_argv_append_nosize(&debugger->argv, mpir_arg);
                    memset(mpir_arg, 0, MPIR_MAX_ARG_LENGTH);
                    j = 0;
                }
            } else {
                mpir_arg[j] = MPIR_server_arguments[i];
                j++;
            }
        }
    }
}