649 Commits

Author SHA1 Message Date
Josh Hursey
fabd5cc153 Simplification of the ErrMgr framework by removing the 'stack'/composite functionality.
The composite functionality was becoming difficult to maintain, so we removed it for now which simplifies the framework design considerably.

Since the 'crmig' and 'autor' components were -very- similar to the 'hnp' component, this commit also merges them together. By moving the 'crmig' and 'autor' to a separate file under the 'hnp' component we are able to isolate the C/R logic to a large extent, thus being only minimally hooked into the previous 'hnp' component.

So other than some name changes, the functionality is all still in place. I will update the C/R documentation later this morning.

This commit was SVN r23628.
2010-08-19 13:09:20 +00:00
Brian Barrett
13c827dda8 Make trunk compile on Red Storm again
This commit was SVN r23622.
2010-08-17 21:51:38 +00:00
Shiqing Fan
330999e36c Some fixes for C/R enhancement on Windows. Add the option and fix some type casts, just enough to let it compile.
This commit was SVN r23599.
2010-08-12 13:31:37 +00:00
Ralph Castain
18f7b919d1 Update platform files to no-build new components and frameworks
This commit was SVN r23595.
2010-08-11 21:04:02 +00:00
Josh Hursey
e12ca48cd9 A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php

Documentation:
  http://osl.iu.edu/research/ft/

Major Changes: 
-------------- 
 * Added C/R-enabled Debugging support. 
   Enabled with the --enable-crdebug flag. See the following website for more information: 
   http://osl.iu.edu/research/ft/crdebug/ 
 * Added Stable Storage (SStore) framework for checkpoint storage 
   * 'central' component does a direct to central storage save 
   * 'stage' component stages checkpoints to central storage while the application continues execution. 
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) 
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) 
 * Added Compression (compress) framework to support 
 * Add two new ErrMgr recovery policies 
   * {{{crmig}}} C/R Process Migration 
   * {{{autor}}} C/R Automatic Recovery 
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component 
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) 
   * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) 
   * {{{OMPI_CR_Restart}}} 
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) 
   * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) 
   * {{{OMPI_CR_Quiesce_start}}} 
   * {{{OMPI_CR_Quiesce_checkpoint}}} 
   * {{{OMPI_CR_Quiesce_end}}} 
   * {{{OMPI_CR_self_register_checkpoint_callback}}} 
   * {{{OMPI_CR_self_register_restart_callback}}} 
   * {{{OMPI_CR_self_register_continue_callback}}} 
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
 * Add a progress meter to: 
   * FileM rsh (filem_rsh_process_meter) 
   * SnapC full (snapc_full_progress_meter) 
   * SStore stage (sstore_stage_progress_meter) 
 * Added 2 new command line options to ompi-restart 
   * --showme : Display the full command line that would have been exec'ed. 
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) 
 * Deprecated some MCA params: 
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir 
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir 
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared 
   * snapc_base_store_in_place deprecated, replaced with different components of SStore 
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref 
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported 
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem 
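
The deprecations listed above amount to a rename table. A minimal sketch of how a site could rewrite an old parameter set, assuming the settings are already available as a plain dictionary: the helper name `migrate_mca_params` and the dictionary layout are hypothetical and not part of Open MPI; only the parameter names come from this commit message.

```python
# Deprecated MCA parameter -> replacement, copied from the list above.
# None marks entries for which the commit names no one-to-one replacement.
DEPRECATED_MCA_PARAMS = {
    "crs_base_snapshot_dir": "sstore_stage_local_snapshot_dir",
    "snapc_base_global_snapshot_dir": "sstore_base_global_snapshot_dir",
    "snapc_base_global_shared": "sstore_stage_global_is_shared",
    "snapc_base_store_in_place": None,  # replaced by choosing an SStore component
    "snapc_base_global_snapshot_ref": "sstore_base_global_snapshot_ref",
    "snapc_base_establish_global_snapshot_dir": None,  # never well supported
    "snapc_full_skip_filem": "sstore_stage_skip_filem",
}


def migrate_mca_params(params):
    """Hypothetical helper: rename deprecated keys in a {name: value} dict.

    Returns (migrated, dropped). `dropped` lists deprecated names that have
    no direct replacement and need a manual decision.
    """
    migrated = {}
    dropped = []
    for name, value in params.items():
        if name not in DEPRECATED_MCA_PARAMS:
            # Not deprecated: an explicitly set current name always wins.
            migrated[name] = value
            continue
        replacement = DEPRECATED_MCA_PARAMS[name]
        if replacement is None:
            dropped.append(name)
        else:
            # Do not overwrite a value already given under the new name.
            migrated.setdefault(replacement, value)
    return migrated, dropped


if __name__ == "__main__":
    old = {"crs_base_snapshot_dir": "/tmp/ckpt", "snapc_base_store_in_place": "1"}
    print(migrate_mca_params(old))
```

Letting an explicitly set new name take precedence over its deprecated alias is a design choice of this sketch, not documented Open MPI behavior.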

Minor Changes: 
-------------- 
 * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. 
 * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components 
 * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. 
 * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}} 
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. 
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. 
 * Cleanup the CRS framework and components to work with the SStore framework. 
 * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). 
 * Add 'quiesce' hook to CRCP for a future enhancement. 
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. 
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface). 
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. 
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. 
 * {{{opal-restart}}} also supports local decompression before restarting 
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata 
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata 
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. 
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. 
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. 
 * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. 
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. 
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. 
 * Improve the checks for 'already checkpointing' error path. 
 * Add a recovery output timer, to show how long it takes to restart a job 
 * Do a better job of cleaning up the old session directory on restart. 
 * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment). 
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. 

This commit was SVN r23587.

The following Trac tickets were found above:
  Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
  Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
  Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
  Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
  Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
  Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
  Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
2010-08-10 20:51:11 +00:00
Rainer Keller
c2d1002e50 - The directory $(MPT_DIR)/lib/snos64 containing libalpslli
does not exist anymore on JaguarPF... Fun.

This commit was SVN r23579.
2010-08-09 16:15:25 +00:00
Shiqing Fan
33719634da Use a different variable for option definitions, otherwise CMake gets confused somehow.
This commit was SVN r23553.
2010-08-04 19:11:27 +00:00
Shiqing Fan
6893021f7c Get rid of the day-of-week string in the date format; it might contain different or unusual characters depending on the Windows language setting.
This commit was SVN r23551.
2010-08-04 09:13:20 +00:00
Shiqing Fan
3ef2be67b9 Add search paths for VS 2010.
This commit was SVN r23538.
2010-08-02 10:09:23 +00:00
Shiqing Fan
ea7bf2bd9e Correctly check the data type alignment for VS 2010 environment, and set the event include paths to global level, in order to make the clever VS load them.
This commit was SVN r23534.
2010-07-30 14:25:15 +00:00
Jeff Squyres
9ca4a4a154 Make it safe to call this script inside of an Open MPI tarball (not
just an SVN or HG checkout).

This commit was SVN r23528.
2010-07-28 17:22:04 +00:00
Shiqing Fan
d589f8289a need to set the result value.
This commit was SVN r23487.
2010-07-23 14:25:13 +00:00
Ralph Castain
e7719f0aa4 Update platform files, adjust sensor heartbeat module selection rules
This commit was SVN r23477.
2010-07-22 21:50:46 +00:00
George Bosilca
3d3677fa7d Update the suppression rules for valgrind to hide the uninitialized byte
in the TCP BTL header.

This commit was SVN r23466.
2010-07-21 17:30:13 +00:00
Ralph Castain
acd990ffe5 Add static configuration for IU and clarify its param files. Update cisco platform file
This commit was SVN r23428.
2010-07-17 20:21:23 +00:00
Shiqing Fan
5789c96525 Add the help and btl ini file into install list.
This commit was SVN r23423.
2010-07-15 18:52:29 +00:00
Shiqing Fan
13b26095cc Add a missing definition for memory implementation header.
This commit was SVN r23410.
2010-07-14 09:12:10 +00:00
Shiqing Fan
332be56b4c Turn the main configure script into macros.
Add checks for a few IBverbs functions and symbols.

This commit was SVN r23389.
2010-07-13 13:41:57 +00:00
Shiqing Fan
8de5654bf9 Add new files into the tarball.
This commit was SVN r23377.
2010-07-12 16:21:46 +00:00
Shiqing Fan
74120b46c1 Need to check another ofed library.
This commit was SVN r23375.
2010-07-12 16:15:22 +00:00
Shiqing Fan
e3be90ff22 Update CMake modules, adding initial support for openib.
This commit was SVN r23373.
2010-07-12 15:28:37 +00:00
Ralph Castain
09acea1ccc Update platform file
This commit was SVN r23326.
2010-07-01 19:30:15 +00:00
Ralph Castain
1102f0c171 Replace old platform file with newer ones
This commit was SVN r23322.
2010-06-29 15:00:10 +00:00
Ralph Castain
73eabc83d6 Add new platform files
This commit was SVN r23321.
2010-06-29 14:58:40 +00:00
Jeff Squyres
ad95e00b42 Remove an extraneous/misleading comment.
This commit was SVN r23320.
2010-06-29 14:42:03 +00:00
Jeff Squyres
9ac56c8674 Add "-j4" into the flags passed when we "make distcheck" (these flags
don't help when just running "make dist").  On my (somewhat older)
machines, it cut the wall clock time of make_dist_tarball down from
~55 minutes to ~40 minutes.

This commit was SVN r23318.
2010-06-29 14:32:20 +00:00
Shiqing Fan
681df0089b Add a few new files into the tarball.
This commit was SVN r23297.
2010-06-22 16:45:56 +00:00
Shiqing Fan
e32159d118 Updates and fixes for Fortran bindings on Windows, including two missing feature tests and CMake script improvements.
This commit was SVN r23279.
2010-06-18 13:03:16 +00:00
Ralph Castain
9ba3459135 Use the correct command to revert VERSION when making tarballs
This commit was SVN r23276.
2010-06-17 04:19:42 +00:00
Shiqing Fan
d391c57b0f A more proper fix for the HANDLE definition.
This commit was SVN r23269.
2010-06-14 14:17:07 +00:00
Ralph Castain
fdf9e5f92d Update cisco platform files
This commit was SVN r23268.
2010-06-12 16:05:39 +00:00
Ralph Castain
bb602694e6 Add a new example program, update cisco platform file
This commit was SVN r23262.
2010-06-09 18:21:06 +00:00
Samuel Gutierrez
2fb7c344fc Added a new System V (sysv) shared memory component for Open MPI.
Configure Option:
--enable-sysv

MCA Parameter:
mpi_common_sm

mpi_common_sm accepts a comma delimited list of: [sysv],mmap (order
dependent).  The first component that is successfully selected is used. For
example, -mca mpi_common_sm sysv,mmap will first try sysv. If sysv is not
successfully selected, then mmap will be used.  mmap will be used if 
mpi_common_sm is not provided.

Notes:
Please make certain that your system's shmmax limit, or equivalent, is larger
than mpool_sm_min_size.  Otherwise, shmget may fail.

This commit was SVN r23260.
2010-06-09 16:58:52 +00:00
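
The selection rule described in commit 2fb7c344fc (an ordered, comma-delimited list, first successfully selected component wins, mmap when the parameter is absent) can be sketched as below. This is an illustration of the rule as the message states it, not the Open MPI implementation: the function name `select_sm_component` and the `can_select` probe are hypothetical, and returning `None` when nothing can be selected is an assumption, since the commit message does not say what happens in that case.

```python
# Per the commit message, mmap is used when mpi_common_sm is not provided.
DEFAULT_ORDER = ["mmap"]


def select_sm_component(param_value, can_select):
    """Pick the first selectable component from an ordered list.

    param_value: the mpi_common_sm string (e.g. "sysv,mmap") or None.
    can_select:  hypothetical probe, callable(name) -> bool, standing in
                 for whatever check decides that a component is usable.
    """
    if param_value:
        # Order matters: keep the user's order, ignore empty entries.
        candidates = [n.strip() for n in param_value.split(",") if n.strip()]
    else:
        candidates = DEFAULT_ORDER
    for name in candidates:
        if can_select(name):
            return name
    # Assumption: signal "nothing selected" with None.
    return None


if __name__ == "__main__":
    # sysv is unavailable here, so the list falls through to mmap.
    print(select_sm_component("sysv,mmap", lambda name: name == "mmap"))
```

With `-mca mpi_common_sm sysv,mmap`, the probe is tried on sysv first and on mmap only if sysv cannot be selected, matching the example given in the commit message.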
Ralph Castain
17fd8b3607 Update cisco platform files
This commit was SVN r23243.
2010-06-07 14:14:41 +00:00
Shiqing Fan
8adea20297 Fix a variable name.
This commit was SVN r23231.
2010-06-01 17:37:58 +00:00
Ralph Castain
a1bc589f23 Include new cisco platform files in tarball
This commit was SVN r23209.
2010-05-25 22:39:10 +00:00
Ralph Castain
dc240f323a Update cisco platform files
This commit was SVN r23208.
2010-05-25 22:37:49 +00:00
Ralph Castain
ab6e06f5b3 Reorganize the rmcast code to capture common code elements. Increase max msg size for spread and udp transports. Cleanup the spread configuration doc.
This commit was SVN r23207.
2010-05-25 22:36:57 +00:00
Shiqing Fan
12775c6b9a Add corresponding option for notifier on Windows.
This commit was SVN r23195.
2010-05-21 15:23:44 +00:00
Ralph Castain
12fae43969 Correct the makefile
This commit was SVN r23103.
2010-05-05 01:46:11 +00:00
Ralph Castain
99f223210d Add some contributed examples of how to start and configure the spread library. Do a little more cleanup on the spread module, and ensure that it isn't selected if spread isn't running.
This commit was SVN r23101.
2010-05-04 23:44:00 +00:00
Ralph Castain
10e410f454 Update cisco platform files
This commit was SVN r23087.
2010-05-04 02:39:00 +00:00
Ralph Castain
d3fda5d3b9 Update cisco platform files
This commit was SVN r23081.
2010-05-03 04:08:43 +00:00
Ralph Castain
b1577c4fcd Update cisco platform files to include sensor support
This commit was SVN r23044.
2010-04-26 22:16:48 +00:00
Ralph Castain
f711c4713f Add threading support to odin platform file
This commit was SVN r23022.
2010-04-23 04:31:04 +00:00
Shiqing Fan
077f6e6398 Type casts for building dynamic Fortran libraries.
And export correct function names.

This commit was SVN r23020.
2010-04-22 15:48:27 +00:00
Shiqing Fan
d1e66bdd01 Use variables instead of hard-coded compiler flags, in order to support various C/C++ compilers on Windows.
This commit was SVN r23016.
2010-04-21 12:45:00 +00:00
Shiqing Fan
e539322807 Move definitions to the main config file.
This commit was SVN r23015.
2010-04-21 09:17:10 +00:00
Ralph Castain
a8586767a9 Update platform files
This commit was SVN r22983.
2010-04-16 18:52:22 +00:00
Ralph Castain
ccc0a076df Don't build the iof-tool module either
This commit was SVN r22974.
2010-04-14 01:20:06 +00:00