1
1
Граф коммитов

2880 Коммитов

Автор SHA1 Сообщение Дата
Josh Hursey
fabd5cc153 Simplification of the ErrMgr framework by removing the 'stack'/composite functionality.
The composite functionality was becoming difficult to maintain, so we removed it for now which simplifies the framework design considerably.

Since the 'crmig' and 'autor' components were -very- similar to the 'hnp' component, this commit also merges them together. By moving the 'crmig' and 'autor' to a separate file under the 'hnp' component we are able to isolate the C/R logic to a large extent, thus being only minimally hooked into the previous 'hnp' component.

So other than some name changes, the functionality is all still in place. I will update the C/R documentation later this morning.

This commit was SVN r23628.
2010-08-19 13:09:20 +00:00
Josh Hursey
77792c937d When we checkpoint with the --stop option, be sure to write out all the metadata before clearing the storage handle.
Here we tried to write out the session directory marker after we sync the directory, which happens early in the case of --stop.

Thanks to Ananda Mudar for noticing the bug.

This commit was SVN r23627.
2010-08-18 20:44:03 +00:00
Brian Barrett
13c827dda8 Make trunk compile on Red Storm again
This commit was SVN r23622.
2010-08-17 21:51:38 +00:00
Ralph Castain
bbf84fd92b Refine the protection from cross-dvm communications
This commit was SVN r23615.
2010-08-16 16:33:39 +00:00
Ralph Castain
930f7adb0f Check the return status and report any error
This commit was SVN r23611.
2010-08-13 15:04:59 +00:00
Ralph Castain
4491a0e5dc Add a channel for reporting errors, fix a bug in the tcp module
This commit was SVN r23610.
2010-08-13 15:04:22 +00:00
Ralph Castain
ace1f60429 Rename an mca param to something more intuitive and set its default to 0 so the module only runs if a non-zero value is provided
This commit was SVN r23609.
2010-08-13 15:03:45 +00:00
Rainer Keller
fc4cb0c0c1 - Allow changing ALPS run command
- Fix misnomer

This commit was SVN r23601.
2010-08-12 14:41:35 +00:00
Ralph Castain
5715a5b421 Let VM-based mappings include the updated nidmap
This commit was SVN r23596.
2010-08-11 21:04:28 +00:00
Josh Hursey
e12ca48cd9 A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php

Documentation:
  http://osl.iu.edu/research/ft/

Major Changes: 
-------------- 
 * Added C/R-enabled Debugging support. 
   Enabled with the --enable-crdebug flag. See the following website for more information: 
   http://osl.iu.edu/research/ft/crdebug/ 
 * Added Stable Storage (SStore) framework for checkpoint storage 
   * 'central' component does a direct to central storage save 
   * 'stage' component stages checkpoints to central storage while the application continues execution. 
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) 
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) 
 * Added Compression (compress) framework to support 
 * Add two new ErrMgr recovery policies 
   * {{{crmig}}} C/R Process Migration 
   * {{{autor}}} C/R Automatic Recovery 
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component 
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) 
   * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) 
   * {{{OMPI_CR_Restart}}} 
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) 
   * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) 
   * {{{OMPI_CR_Quiesce_start}}} 
   * {{{OMPI_CR_Quiesce_checkpoint}}} 
   * {{{OMPI_CR_Quiesce_end}}} 
   * {{{OMPI_CR_self_register_checkpoint_callback}}} 
   * {{{OMPI_CR_self_register_restart_callback}}} 
   * {{{OMPI_CR_self_register_continue_callback}}} 
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
 * Add a progress meter to: 
   * FileM rsh (filem_rsh_process_meter) 
   * SnapC full (snapc_full_progress_meter) 
   * SStore stage (sstore_stage_progress_meter) 
 * Added 2 new command line options to ompi-restart 
   * --showme : Display the full command line that would have been exec'ed. 
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) 
 * Deprecated some MCA params: 
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir 
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir 
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared 
   * snapc_base_store_in_place deprecated, replaced with different components of SStore 
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref 
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported 
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem 

Minor Changes: 
-------------- 
 * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. 
 * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components 
 * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. 
 * Fixes trac:2208 : Honor various TMPDIR varaibles instead of forcing {{{/tmp}}} 
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. 
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. 
 * Cleanup the CRS framework and components to work with the SStore framework. 
 * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). 
 * Add 'quiesce' hook to CRCP for a future enhancement. 
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. 
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface). 
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. 
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. 
 * {{{opal-restart}}} also support local decompression before restarting 
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata 
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata 
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. 
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. 
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. 
 * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. 
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. 
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. 
 * Improve the checks for 'already checkpointing' error path. 
 * A a recovery output timer, to show how long it takes to restart a job 
 * Do a better job of cleaning up the old session directory on restart. 
 * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) 
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. 

This commit was SVN r23587.

The following Trac tickets were found above:
  Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
  Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
  Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
  Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
  Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
  Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
  Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
2010-08-10 20:51:11 +00:00
Terry Dontje
b74ef351b7 Added new solaris sysinfo module. Also added code to assign
orte_local_chip_type and orte_local_chip_model in MPI processes it the
appropriate sysinfo module found the values on the machine.

This commit was SVN r23581.
2010-08-09 19:28:56 +00:00
Eugene Loh
5e1f40ba12 Clean up English and clarify a point regarding process binding
in the mpirun man page.

This commit was SVN r23558.
2010-08-05 23:10:44 +00:00
Eugene Loh
1c12192c93 Fix grammatical typo in orterun man page.
This commit was SVN r23555.
2010-08-04 20:37:28 +00:00
Ralph Castain
9fbd7c1949 Fix a bug in tcp multicast
This commit was SVN r23547.
2010-08-04 01:37:54 +00:00
Ralph Castain
94f046e7f3 Fix an "oops" where we missed some instances of a name change. Thanks to Greg Koenig for spotting it!
This commit was SVN r23545.
2010-08-03 16:36:08 +00:00
Ralph Castain
8ccd14d508 Correct the -h output to reflect that --bind-to-none is the default behavior
This commit was SVN r23537.
2010-08-01 22:31:34 +00:00
Josh Hursey
01fbb729e3 Need to get the nidmap so that the apps have something to look at for wireup.
Without this we get errors like the following since the nidmap is empty in the apps:
[[13094,1],1] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 234
[[13094,1],1] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 281

This commit was SVN r23536.
2010-07-31 16:25:13 +00:00
Josh Hursey
ba7e94dd89 Some relatively minor C/R related cleanup
* Fix a configure warning for checking --enable-ft-thread
 * In hnp and orted ErrMgr components check to see if other components have already recovered this process before trying to recover it again.
 * Fix 'npernode' for restarting using the resilient rmaps component
 * export ompi_info_set, so that internal functionality can use it.

This commit was SVN r23535.
2010-07-30 18:59:34 +00:00
Ralph Castain
d8ec83f939 Remove an unneeded tag
This commit was SVN r23533.
2010-07-29 02:13:06 +00:00
Ralph Castain
94d4887452 Cleanup some race conditions at job start
This commit was SVN r23515.
2010-07-27 18:24:11 +00:00
Ralph Castain
97cfbdfc8c Add debug
This commit was SVN r23514.
2010-07-27 16:55:16 +00:00
Ralph Castain
cfbfbb75a2 Don't abort if a race condition causes a nid not to be found
This commit was SVN r23513.
2010-07-27 16:20:58 +00:00
Ralph Castain
06fe2c4c20 Add debug
This commit was SVN r23512.
2010-07-27 16:20:29 +00:00
Ralph Castain
e16ec40283 Check the correct envar. Thanks to Philippe for pointing out the error!
This commit was SVN r23499.
2010-07-27 03:49:03 +00:00
Ralph Castain
0c201486e4 Allow for multiple channels for the same channel "name" - could be both input and output channels
This commit was SVN r23498.
2010-07-27 01:38:39 +00:00
Ralph Castain
6158cab2e1 Add missing header
This commit was SVN r23497.
2010-07-27 01:37:41 +00:00
Ralph Castain
1ba8bbe1a9 Don't require the existence of a multicast interface if --enable-multicast wasn't specified.
This commit was SVN r23494.
2010-07-26 15:09:57 +00:00
Ralph Castain
e858b029aa Re-establish support for IBM's POE environment by resurrecting the poe launcher from the v1.2 series, updated for the current trunk. Modify the hnp errmgr module to support batch-operated jobs. Update the new ess generic module to support a wider range of mapping options.
This commit was SVN r23493.
2010-07-24 23:35:20 +00:00
Ralph Castain
ff2d573f7e Adjust some default values, and ensure we don't start sending too soon
This commit was SVN r23492.
2010-07-23 19:37:16 +00:00
Ralph Castain
140e427a79 Ensure that wildcard recvs go to the end of the matching list so that recvs for specific tags take precedence.
Ensure we don't try to send tcp mcast messsages to procs that haven't reported back yet

This commit was SVN r23491.
2010-07-23 19:31:34 +00:00
Ralph Castain
95a908cb27 Provide run-time control over heartbeat
This commit was SVN r23490.
2010-07-23 17:11:12 +00:00
Ralph Castain
dfdb1eca0c Remove windows dist file requirement
This commit was SVN r23480.
2010-07-23 01:23:15 +00:00
Ralph Castain
67027cfce4 Add a new "generic" ess module that supports direct launched MPI jobs so long as certain minimum envars are provided, regardless of launch environment.
This commit was SVN r23478.
2010-07-22 21:51:43 +00:00
Ralph Castain
e7719f0aa4 Update platform files, adjust sensor heartbeat module selection rules
This commit was SVN r23477.
2010-07-22 21:50:46 +00:00
Jeff Squyres
5027915ead Hostname is not used in this function.
This commit was SVN r23454.
2010-07-21 11:07:28 +00:00
Jeff Squyres
64cb8f5d7f Another round of man page cleanups from Debian mantainer Manuel
Prinz.  Many thanks!

This commit was SVN r23445.
2010-07-20 14:07:18 +00:00
Ralph Castain
f3e13b9766 Improve the efficiency by making the check for uniqueness of incoming hnp contact info much faster by including the hnp_uri in the job_family tracker object. Replace the global buffer storage with a quick routine to build the buffer from the jobfams array
This commit was SVN r23443.
2010-07-20 08:30:47 +00:00
Ralph Castain
f85e69b64b Continue enabling connect_accept across large numbers of independent jobs by replacing the hash tables with pointer_arrays to store routes to remote hnps to gain flexibility we'll need in the future.
This commit was SVN r23439.
2010-07-20 04:47:31 +00:00
Ralph Castain
248320b91a Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started.
Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them.

Add a test provided by Philippe that tests this ability.

This commit was SVN r23438.
2010-07-20 04:22:45 +00:00
Ralph Castain
d9f7947a42 Arg - the other half of the prior commit: have the orted properly parse the resulting data to hand it to orte_odls.
Also, reindent per emacs (sigh).

This commit was SVN r23437.
2010-07-20 04:06:46 +00:00
Ralph Castain
72525a5850 Allow the passing of NULL to orte_plm.kill_local_procs to match what we allow for the equivalent orte_odls call
This commit was SVN r23436.
2010-07-20 04:05:41 +00:00
Ralph Castain
ad5eaee4c6 Protect against NULL and provide additional resource check/error report
This commit was SVN r23432.
2010-07-19 18:33:32 +00:00
Ralph Castain
0355da6335 Cleanup pointer array addressing and protect against NULL
This commit was SVN r23431.
2010-07-19 18:30:04 +00:00
Ralph Castain
3f7f8df40f Fix singleton comm_spawn by having the host orted create a map object for the singleton job. Even though this isn't filled in, the pidmap function will ignore any job that doesn't have a map object. Thus, this ensures that the singleton's location is included in the pidmap, thereby allowing messages to be routed to/from it.
This commit was SVN r23430.
2010-07-18 02:48:17 +00:00
Ralph Castain
12cd07c9a9 Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases.
The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang.

This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is:

* pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place

* modified the errmgr to directly call the new routines when termination is detected

* removed the grpcomm.onesided_barrier and its associated RML tag

* add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree

* use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero

Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial.

This commit was SVN r23429.
2010-07-17 21:03:27 +00:00
Ralph Castain
23b166cbd4 Initialize pointers to NULL
This commit was SVN r23422.
2010-07-15 16:33:57 +00:00
Ralph Castain
a99e8cf132 Re-issue the read event so that multiple debugger attachments can occur
This commit was SVN r23421.
2010-07-15 15:35:26 +00:00
Ralph Castain
4af771a681 Update the job object pack-unpack fns
This commit was SVN r23392.
2010-07-13 21:23:14 +00:00
Jeff Squyres
c98729694a Fix a compile error in the xgrid plm. This does nothing to make it
work; it just now compiles on os x. 

This commit was SVN r23387.
2010-07-13 11:56:20 +00:00
Ralph Castain
570d19106b Allow singletons to use ompi-server for rendezvous via pubsub as well as comm_spawn without starting their own local daemons
This commit was SVN r23384.
2010-07-13 06:33:07 +00:00